#Chemical Prioritization System - ChemPSy

The aim of **ChemPSy** (Chemical Prioritization System) is to develop an innovative approach based on several bioinformatics and biostatistics methodologies to analyze and integrate massive toxicogenomics datasets. Specific objectives include: (1) classification of chemicals based on transcriptional signatures, i.e. the set of genes whose expression is known to be positively or negatively altered after exposure to these compounds; (2) association of the resulting classes with human pathologies or deleterious phenotypes, e.g. classes containing toxicants with well-known effects; (3) prediction of novel reprotoxicants and/or endocrine disruptors based on transcriptional signature similarities with known chemicals affecting testis development and function.
##Data format
Each dataset is organized as follows:
+-- [Species]
    +-- [Tissue]
        +--GSEXXX
            +--Experimental_conditions
            ¦   +--Condition 1
            ¦   +--Condition 2
            +--Individual_experiments
            ¦   +--GSMXXXX.CEL.gz
            ¦   +--GSMXXXX.CEL.gz
            +--GSEXXX.txt
[Species]: Binomial nomenclature for selected species (e.g. Homo sapiens for Human)
[Tissue]: Tested tissue in upper case (e.g. LIVER, KIDNEY or MCF-7, HK-2…)
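As an illustration of the layout above, the following sketch creates the part of the tree that you prepare by hand for one GEO series. The function name `make_dataset_skeleton` and its argument order are assumptions made for this example, not part of ChemPSy. The Experimental_conditions directory is deliberately not created here, since it is generated later by the ChemPSy scripts.

```shell
# Hypothetical helper (not shipped with ChemPSy): create the skeleton
# for one GEO series.
# Arguments: root directory, species, tissue, GSE accession.
make_dataset_skeleton() {
    root=$1 species=$2 tissue=$3 gse=$4
    base="$root/$species/$tissue/$gse"
    # Folder that will receive the compressed CEL files
    mkdir -p "$base/Individual_experiments" || return 1
    # Create an empty description file unless one already exists
    [ -e "$base/$gse.txt" ] || : > "$base/$gse.txt"
    # Print the series directory so the caller can reuse it
    printf '%s\n' "$base"
}
```

For example, `make_dataset_skeleton /data Homo_sapiens LIVER GSEXXX` would print the path of the new series directory. Whether the species directory uses a space or an underscore is not specified here, so adapt it to your installation.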
###Step 1 - Describe your dataset
To describe your dataset, use a tab-delimited .txt file with the following fields (keep the order):
Fields | Description |
---|---|
Files | Full name of the CEL file (e.g. GSM1223.CEL.gz) |
Species | Binomial name of the species the results come from |
Strain | Species strain (e.g. Sprague-Dawley). May be left unspecified * |
Gender | Animal gender (male/female). May be left unspecified * |
Experiment | Experiment (e.g. in vitro, in vivo, …). May be left unspecified * |
Tissues/Cells | Name of the tissue or cells on which the experiment was performed |
Age | Animal age. May be left unspecified * |
Generation | Animal generation (for trans-generational studies). If not specified, use ‘F0’. |
ChemicalName | Usual name or synonym of the chemical (only one name) |
CAS | Chemical CAS number |
MESH | Chemical MeSH ID |
Dose | Chemical exposure dose |
Duration | Chemical exposure duration |
Route | Route of exposure. May be left unspecified * |
Vehicule | Chemical vehicle. May be left unspecified * |
PMID | PubMed ID of the associated publication. May be left unspecified * |
GSE | GEO series accession |
GSM | GEO sample accession |
GPL | GEO platform (GPL) used. May be left unspecified * |
Corresponding dataset author mail. May be left unspecified * | |
Paired | Paired data (Yes/No) |
Replicates | Replicate number |
Experiment type | Experiment type details (e.g. ‘Expression profiling by array’). May be left unspecified * |
Design | Experimental design. May be left unspecified * |
Treatment protocol | Treatment protocol description. May be left unspecified * |
Characteristics | Tissue/cell characteristics. May be left unspecified * |
Extraction protocol | Extraction protocol description. May be left unspecified * |
Link(s) | Cross-link(s) (e.g. GEO, database, personal website …). May be left unspecified * |
Data processing | Data processing description. May be left unspecified * |
Sample Treated | Treated or Control sample. May be left unspecified |
Associated Ctrl | Give each control a unique number, then list for each treated sample the numbers of all controls paired with it (e.g. control1 = 1, control2 = 2, …, treated_sample1 = 1,2 [this sample is paired with controls 1 and 2]). |
Don’t leave fields empty: use ‘NA’ when a value is not specified.
‘*’: This field is required for TOXsIgN integration.
Each line must correspond to exactly one sample.
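The rules above (no empty fields, one line per sample) can be checked before running any ChemPSy script. The sketch below is not part of ChemPSy: the function name `check_metadata` is hypothetical, and it assumes that the file starts with a header line and that the first column (Files) identifies the sample.

```shell
# Hypothetical checker for a tab-delimited description file.
# Assumptions: line 1 is a header, column 1 (Files) identifies the sample.
check_metadata() {
    awk -F '\t' '
        # Remember the number of columns declared by the header
        NR == 1 { ncol = NF; next }
        # Every data line must have the same number of columns
        NF != ncol {
            printf "line %d: expected %d fields, found %d\n", NR, ncol, NF
            bad = 1
        }
        # Empty fields are not allowed: NA must be used instead
        {
            for (i = 1; i <= NF; i++)
                if ($i == "") {
                    printf "line %d: field %d is empty (use NA)\n", NR, i
                    bad = 1
                }
        }
        # A sample may appear on one line only
        seen[$1]++ {
            printf "line %d: sample %s is listed more than once\n", NR, $1
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Running `check_metadata GSEXXX.txt` prints one message per problem and returns a non-zero status if any is found, so it can be chained with `&&` before the next step.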
###Step 2 - Organize your data
In your GSEXXX directory, save your tab-delimited .txt file under the same name as the directory (GSEXXX.txt) and create a new folder called Individual_experiments.
Put all expression files associated with your study in this folder. Please make sure that all your .CEL files are compressed. If they are not, run the following command in that folder:
gzip *.CEL
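If the folder mixes compressed and uncompressed files, a small wrapper can make the compression step safe to rerun. This is only a sketch: `compress_cel_files` is a hypothetical name, and it assumes that the files use the upper-case `.CEL` extension.

```shell
# Hypothetical helper: compress every uncompressed .CEL file in a directory.
# Files that already end in .CEL.gz are left untouched.
compress_cel_files() {
    dir=$1
    for f in "$dir"/*.CEL; do
        # When nothing matches, the pattern is kept literally: skip it
        [ -e "$f" ] || continue
        gzip -- "$f" || return 1
    done
    return 0
}
```

For example, `compress_cel_files Individual_experiments` can be run several times without error, because it does nothing once every file is compressed.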
###Step 3 - Create conditions and treatment.info
To create the Experimental_conditions directory and all condition sub-directories, use the CreateTreatmentInfo.sh script. This script takes no arguments but loads a configuration file: ChemPSy.ini. Either modify this file or change the configuration file loaded by the CreateTreatmentInfo.sh script:
source /home/genouest/irset/tdarde/projects/ChemPSy/20160321/script/ChemPSy_Human.ini
Next, adapt the loops to your datasets:
#! /bin/bash
source /home/genouest/irset/tdarde/projects/ChemPSy/20160321/script/ChemPSy_Human.ini
echo "Reading config...." >&2
#echo "Create treatment.info files for HEPATOCYTES"
for i in $HepatoList
do
    python $scriptTreatment -p $i -t HEPATOCYTES -e $fileRemove -s True
done
echo "Create treatment.info files for HK-2"
for i in $HK2List
do
    python $scriptTreatment -p $i -t HK-2 -e $fileRemove -s True
done
echo "Create treatment.info files for ISHIKAWA_CELLS"
for i in $IshikawaList
do
    python $scriptTreatment -p $i -t ISHIKAWA_CELLS -e $fileRemove -s True
done
echo "Create treatment.info files for JURKAT_CELLS"
for i in $JurkatList
do
    python $scriptTreatment -p $i -t JURKAT_CELLS -e $fileRemove -s True
done
echo "Create treatment.info files for MCF-7"
for i in $MCFList
do
    python $scriptTreatment -p $i -t MCF-7 -e $fileRemove -s True
done
echo "Create treatment.info files for liver"
for i in $LiverList
do
    python $scriptTreatment -p $i -t LIVER -e $fileRemove -s False
done
echo "Create treatment.info files for tg"
for i in $TgList
do
    python $scriptTreatment -p $i -t THIGH-MUSCLE -e $fileRemove
done
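The loops in the script differ only in the list variable, the tissue label and the value passed to `-s`, so they could be expressed through one helper. The following is a sketch under assumptions, not the script's actual code: `create_treatment_info` and the `RUN` dry-run switch are hypothetical, while `$scriptTreatment` and `$fileRemove` are expected to come from the configuration file as in the script. The meaning of `-s` is not documented here, so the helper simply forwards the value it is given.

```shell
# Hypothetical helper replacing the repeated loops.
# $1 = tissue label, $2 = value for -s (empty string to omit the option),
# remaining arguments = datasets to process.
# Set RUN=echo to print the commands instead of running them (dry run).
create_treatment_info() {
    tissue=$1 split=$2
    shift 2
    echo "Create treatment.info files for $tissue" >&2
    for i in "$@"; do
        if [ -n "$split" ]; then
            ${RUN:-} python "$scriptTreatment" -p "$i" -t "$tissue" -e "$fileRemove" -s "$split"
        else
            ${RUN:-} python "$scriptTreatment" -p "$i" -t "$tissue" -e "$fileRemove"
        fi
    done
}
```

With this helper, a tissue block becomes a single line such as `create_treatment_info HK-2 True $HK2List`, and `RUN=echo` lets you review the generated commands before launching them.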
If no error occurs, you should obtain the following directory layout:
+-- [Species]
    +-- [Tissue]
        +--GSEXXX
            +--Experimental_conditions
            ¦   +--Condition 1
            ¦   ¦   +--treatment.info
            ¦   +--Condition 2
            ¦   ¦   +--treatment.info
            +--Individual_experiments
            ¦   +--GSMXXXX.CEL.gz
            ¦   +--GSMXXXX.CEL.gz
            +--GSEXXX.txt
Each treatment.info file associates a treated sample (first column) with a control sample (second column), one pair per line:
003016029014.CEL.gz 003016029008.CEL.gz 0
003016029014.CEL.gz 003016029009.CEL.gz 0
003016029015.CEL.gz 003016029008.CEL.gz 0
003016029015.CEL.gz 003016029009.CEL.gz 0
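To review the pairing quickly, the pairs can be grouped by treated sample. The sketch below assumes whitespace-separated columns and ignores the third column, whose meaning is not described in this document; `list_pairs` is a hypothetical name.

```shell
# Hypothetical helper: print each treated sample followed by the
# comma-separated list of its controls, in order of first appearance.
list_pairs() {
    awk '
        NF >= 2 {
            if ($1 in ctrl) {
                # Treated sample already seen: append this control
                ctrl[$1] = (ctrl[$1] "," $2)
            } else {
                # First occurrence: remember the order and start the list
                order[++n] = $1
                ctrl[$1] = $2
            }
        }
        END {
            for (k = 1; k <= n; k++)
                print order[k] "\t" ctrl[order[k]]
        }
    ' "$1"
}
```

Applied to the example above, `list_pairs treatment.info` would print two lines, one per treated sample, each listing controls 003016029008.CEL.gz and 003016029009.CEL.gz.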
##Run ChemPSy
Before running ChemPSy, make sure that your data follow the directory layout described above and that every condition has its associated treatment.info file.
Next, run ChemPSy_data_prep.sh.
Like the previous script, this script loads the configuration file, so edit that file and/or change the path on the source line.
#!/bin/bash
#################################
# Source .ini file #
#################################
echo "############################## ChemPSy ##############################"
echo "--1-- Checking config file"
source /home/genouest/irset/tdarde/projects/ChemPSy/20160321/script/ChemPsy_processing/ChemPSy_Human.ini
echo "Reading config...." >&2
echo "Reading scriptPath: $scriptPath" >&2
echo "Reading Rscript: $Rscript " >&2
echo "Reading dataPath: $dataPath" >&2
echo "Reading processedPath: $processedPath " >&2
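The echo lines above display the configuration but do not stop the run when a value is missing. A minimal guard could look like the sketch below. `require_vars` is a hypothetical helper, and the variable names passed to it are the ones echoed by the script; the names must be trusted, since they are expanded with `eval`.

```shell
# Hypothetical helper: report every listed variable that is unset or empty
# and return non-zero if at least one is missing.
require_vars() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "Missing config variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Placed right after the source line, a call such as `require_vars scriptPath Rscript dataPath processedPath || exit 1` would abort early instead of creating output in an unexpected location.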
###Step 1 - Quality control
The first step of ChemPSy_data_prep.sh is a quality control. Several output files are generated for each condition, including an image of each microarray.
function step_1 {
    echo "--2-- STEP_1 process_data"
    for tissue in $tissues
    do
        echo "$tissue"
        outputT=$processedPath$tissue"/"
        mkdir -p "$outputT"
        for gse in $gsePath
        do
            path=$dataPath$tissue"/"$gse"/"
            # Only process series that exist for this tissue
            if [ -d "$path" ]
            then
                output=$processedPath$tissue"/"$gse"/Experimental_conditions/"
                mkdir -p "$output"
                scriptA=$scriptPath'Rlauncher.sh'
                "$scriptA" -p "$path" -t "$tissue" -o "$output" -c "$cdfpath"
            fi
        done
    done
    # Wait until no job matching "ChemPSy_" remains in the qstat listing
    while [ $(qstat | grep "ChemPSy_" | wc -l) -ne 0 ]
    do
        echo "Running --2-- STEP_1 process_data"
        sleep 7
    done
    echo "--2-- STEP_1 process_data finish"
}
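The polling loop at the end of `step_1` waits indefinitely while matching jobs remain in the queue. As a variation, the sketch below adds a timeout. It is an illustration rather than the script's behaviour: `wait_for_jobs` and the `JOB_LIST_CMD` override are hypothetical, and the default interval of 7 seconds is taken from the loop above.

```shell
# Hypothetical helper: poll the job list until no line matches a pattern.
# $1 = pattern, $2 = polling interval in seconds (default 7),
# $3 = maximum waiting time in seconds (default 3600).
# JOB_LIST_CMD overrides the listing command (default: qstat).
# Note: if the listing command fails, its output is empty and the
# queue is treated as finished.
wait_for_jobs() {
    pattern=$1 interval=${2:-7} max_wait=${3:-3600}
    waited=0
    while ${JOB_LIST_CMD:-qstat} 2>/dev/null | grep -q "$pattern"; do
        # Give up once the time budget is spent
        [ "$waited" -ge "$max_wait" ] && return 1
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

A call such as `wait_for_jobs "ChemPSy_" 7 7200 || echo "jobs still running" >&2` keeps the original behaviour for normal runs while making a stuck job visible.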
To perform the quality control, check each image one by one and remove every microarray showing hybridization errors on 20% or more of its surface.
List all samples to remove in removeCelFile.txt, then rerun Step 3 of the Data format section.
Each condition directory then contains files such as:
+-- [Species]
    +-- [Tissue]
        +--GSEXXX
            +--Experimental_conditions
                +--Condition 1
                ¦   +--contrastmatrix.txt
                ¦   +--normdata.txt
                ¦   +--designmatrix.txt
                ¦   +--qc_boxplot_afternormalization.pdf
                ¦   +--qc_boxplot_beforenormalization.pdf
                ¦   +--filtration.txt
                ¦   +--log2fcchangedata.txt
                ¦   +--qc_corrmatrix_afternormalization.pdf
                ¦   +--mednormdata.txt
                ¦   +--qc_image_003016029009.CEL.gz.png
                ¦   +--qc_image_003016029015.CEL.gz.png
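Before reviewing the images, it can help to confirm that the quality control step produced its text outputs for a condition. The sketch below checks the six .txt files named in the listing above; `check_condition_outputs` is a hypothetical helper, and treating an empty file as an error is an assumption.

```shell
# Hypothetical helper: verify that a condition directory contains the
# text outputs of the quality control step, each non-empty.
check_condition_outputs() {
    dir=$1 status=0
    for f in contrastmatrix.txt designmatrix.txt normdata.txt \
             mednormdata.txt filtration.txt log2fcchangedata.txt; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}
```

It returns a non-zero status when at least one file is absent, so it can be run in a loop over all condition directories to spot failed jobs.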
###Step 2 - List all conditions
This step lists all the conditions