ChemPSy


BioMAs

Chemical Prioritization System - ChemPSy

The aim of ChemPSy (Chemical Prioritization System) is to develop an innovative approach based on several bioinformatics and biostatistics methodologies to analyze and integrate massive toxicogenomics datasets.

Specific objectives include: (1) classification of chemicals based on transcriptional signatures, e.g. the set of genes whose expression is known to be positively or negatively altered after an exposure to these compounds; (2) the association of classes with human pathologies or deleterious phenotypes, e.g. classes containing toxicants with well-known effects; (3) the prediction of novel reprotoxicants and/or endocrine disruptors based on transcriptional signature similarities with known chemicals affecting testis development and function.

##Data format
Each dataset is organized as follows:

+-- [Species]  
   +-- [Tissue]
		+--GSEXXX
			+--Experimental_conditions
			¦	+--Condition 1
			¦	+--Condition 2
			+--Individual_experiments
			¦	+--GSMXXXX.CEL.gz
			¦	+--GSMXXXX.CEL.gz  
			+--GSEXXX.txt

[Species]: Binomial nomenclature for selected species (e.g. Homo sapiens for Human)

[Tissue]: Tissue or cell line tested, in upper case (e.g. LIVER, KIDNEY, MCF-7, HK-2…)
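
Before moving on, the layout above can be checked automatically. The following is a minimal sketch, not part of ChemPSy: the function name `check_dataset_layout` is hypothetical, and it only verifies what this section states (a `GSEXXX.txt` file named after its directory, an `Individual_experiments` folder holding `.CEL.gz` files, and an upper-case tissue directory).

```python
from pathlib import Path


def check_dataset_layout(gse_dir):
    """Return a list of human-readable problems found in one GSE directory."""
    gse_dir = Path(gse_dir)
    problems = []

    # The description file must carry the same name as the GSE directory.
    if not (gse_dir / f"{gse_dir.name}.txt").is_file():
        problems.append(f"missing {gse_dir.name}.txt")

    experiments = gse_dir / "Individual_experiments"
    if not experiments.is_dir():
        problems.append("missing Individual_experiments/")
    else:
        # Every expression file is expected to be a gzip-compressed CEL file.
        for entry in sorted(experiments.iterdir()):
            if not entry.name.endswith(".CEL.gz"):
                problems.append(f"unexpected file: {entry.name}")

    # The parent directory is the tissue, which should be written in upper case.
    tissue = gse_dir.parent.name
    if tissue != tissue.upper():
        problems.append(f"tissue directory not upper case: {tissue}")

    return problems
```

An empty list means the directory matches the expected structure; `Experimental_conditions` is not checked here because it is only created in Step 3.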

###Step 1 - Describe your dataset
To describe your dataset, use a tab-delimited .txt file with the following fields (keep this order):

| Field | Description |
| --- | --- |
| Files | Full CEL file name (e.g. GSM1223.CEL.gz) |
| Species | Binomial nomenclature of the species the results come from |
| Strain | Species strain (e.g. Sprague-Dawley). May be left unspecified * |
| Gender | Animal gender (male/female). May be left unspecified * |
| Experiment | Experiment (e.g. in vitro, in vivo, …). May be left unspecified * |
| Tissues/Cells | Name of the tissue or cells on which the experiment is performed |
| Age | Animal age. May be left unspecified * |
| Generation | Animal generation (for trans-generational studies). If not specified, use 'F0'. |
| ChemicalName | Usual name or synonym of the chemical (only one name) |
| CAS | Chemical CAS number |
| MESH | Chemical MESH ID |
| Dose | Chemical exposure dose |
| Duration | Chemical exposure duration |
| Route | Chemical exposure route. May be left unspecified * |
| Vehicule | Chemical vehicle. May be left unspecified * |
| PMID | PubMed ID of the associated publication. May be left unspecified * |
| GSE | GEO dataset ID |
| GSM | GEO profile ID |
| GPL | GPL used. May be left unspecified * |
| Mail | E-mail address of the corresponding dataset author. May be left unspecified * |
| Paired | Paired data (Yes/No) |
| Replicates | Replicate number |
| Experiment type | Experiment type details (e.g. 'Expression profiling by array'). May be left unspecified * |
| Design | Experimental design. May be left unspecified * |
| Treatment protocol | Treatment protocol description. May be left unspecified * |
| Characteristics | Tissue/cell characteristics. May be left unspecified * |
| Extraction protocol | Extraction protocol description. May be left unspecified * |
| Link(s) | Cross-link(s) (e.g. GEO, database, personal website, …). May be left unspecified * |
| Data processing | Data processing description. May be left unspecified * |
| Sample Treated | Treated or Control sample. May be left unspecified |
| Associated Ctrl | Assign a unique number to each control and, for each treated sample, list the numbers of all its paired controls (e.g. control1 = 1, control2 = 2, …; treated_sample1 = 1,2, meaning this sample is paired with controls 1 and 2). |

Do not leave any field empty: use 'NA' when a field is not specified.
'*': This field may be left unspecified for ChemPSy but is required for TOXsIgN integration.
Each line must correspond to exactly one sample.

###Step 2 - Organize your data
In your GSEXXX directory, save your tab-delimited .txt file under the same name as the directory (GSEXXX.txt) and create a new folder called Individual_experiments.
Drop all the expression files associated with your study into this folder. Please make sure that all your .CEL files are compressed. If they are not, use the following command:

gzip *.CEL
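
If the compression has to be scripted together with other preparation steps, the same operation can be done from Python. This is a minimal sketch using only the standard library; `compress_cel_files` is a hypothetical helper name and is not part of ChemPSy.

```python
import gzip
import shutil
from pathlib import Path


def compress_cel_files(directory):
    """Gzip every uncompressed .CEL file in `directory`; return the new file names."""
    created = []
    for cel in sorted(Path(directory).glob("*.CEL")):
        target = cel.with_name(cel.name + ".gz")
        with open(cel, "rb") as source, gzip.open(target, "wb") as destination:
            shutil.copyfileobj(source, destination)
        # Like the gzip command, remove the original once it is compressed.
        cel.unlink()
        created.append(target.name)
    return created
```

Files that already end in `.CEL.gz` are not matched by the `*.CEL` pattern and are left untouched.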

###Step 3 - Create conditions and treatment.info
To create the Experimental_conditions directory and all condition sub-directories, use the CreateTreatmentInfo.sh script. This script takes no arguments but loads a configuration file: ChemPSy.ini. Either modify this file or change the configuration file loaded in the CreateTreatmentInfo.sh script:

source /home/genouest/irset/tdarde/projects/ChemPSy/20160321/script/ChemPSy_Human.ini

Next, adapt the loops to your datasets:

#! /bin/bash
source /home/genouest/irset/tdarde/projects/ChemPSy/20160321/script/ChemPSy_Human.ini
echo "Reading config...." >&2
#echo "Create treatment.info files for HEPATOCYTES"
for i in $HepatoList
    do
        python $scriptTreatment -p $i -t HEPATOCYTES -e $fileRemove -s True
done

echo "Create treatment.info files for HK-2"
for i in $HK2List
    do
        python $scriptTreatment -p $i -t HK-2 -e $fileRemove -s True
done

echo "Create treatment.info files for ISHIKAWA_CELLS"
for i in $IshikawaList
    do
        python $scriptTreatment -p $i -t ISHIKAWA_CELLS -e $fileRemove -s True
done

echo "Create treatment.info files for JURKAT_CELLS"
for i in $JurkatList
    do
        python $scriptTreatment -p $i -t JURKAT_CELLS -e $fileRemove -s True
done

echo "Create treatment.info files for MCF-7"
for i in $MCFList
    do
        python $scriptTreatment -p $i -t MCF-7 -e $fileRemove -s True
done

echo "Create treatment.info files for liver"
for i in $LiverList
    do
        python $scriptTreatment -p $i -t LIVER -e $fileRemove -s False
done

echo "Create treatment.info files for tg"
for i in $TgList
    do
        python $scriptTreatment -p $i -t THIGH-MUSCLE -e $fileRemove
done

If no error occurs, you should obtain the following directory structure:

+-- [Species]  
   +-- [Tissue]
		+--GSEXXX
			+--Experimental_conditions
			¦	+--Condition 1
			¦	¦	+--treatment.info
			¦	+--Condition 2
			¦	¦	+--treatment.info
			+--Individual_experiments
			¦	+--GSMXXXX.CEL.gz
			¦	+--GSMXXXX.CEL.gz  
			+--GSEXXX.txt

Each treatment.info file lists the associations between a treated sample (first column) and a control sample (second column):

003016029014.CEL.gz	003016029008.CEL.gz	0
003016029014.CEL.gz	003016029009.CEL.gz	0
003016029015.CEL.gz	003016029008.CEL.gz	0
003016029015.CEL.gz	003016029009.CEL.gz	0
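
To inspect these associations programmatically, the file can be read into a mapping from each treated sample to its controls. The sketch below is an illustration, not ChemPSy code: `parse_treatment_info` is a hypothetical name, columns are assumed to be whitespace-separated as in the example above, and the third column is ignored because this README does not document its meaning.

```python
from collections import defaultdict


def parse_treatment_info(text):
    """Map each treated CEL file to the list of control CEL files paired with it."""
    controls_by_treated = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # First column: treated sample; second column: control sample.
        # Any further column is left out because its meaning is undocumented.
        treated, control = line.split()[:2]
        controls_by_treated[treated].append(control)
    return dict(controls_by_treated)
```

Applied to the four lines shown above, this groups the two controls under each of the two treated samples.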

##Run ChemPSy
Before running ChemPSy, make sure that your directories follow the structure described above and that every condition has its associated treatment.info file.

Next, run ChemPSy_data_prep.sh.
Like the previous script, this script uses the same configuration file, so edit that file and/or change the path on the source line.

#!/bin/bash

#################################
#     Source .ini file          #
#################################

echo "##############################       ChemPSy       ##############################"


echo "--1-- Checking config file"
source /home/genouest/irset/tdarde/projects/ChemPSy/20160321/script/ChemPsy_processing/ChemPSy_Human.ini
echo "Reading config...." >&2
echo "Reading scriptPath: $scriptPath" >&2
echo "Reading Rscript: $Rscript" >&2
echo "Reading dataPath: $dataPath" >&2
echo "Reading processedPath: $processedPath" >&2

###Step 1 - Quality control
The first step of ChemPSy_data_prep.sh is a quality control. Several files are created for each condition, including a picture of each microarray.

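function step_1 {
	echo "--2-- STEP_1 process_data"
	# $tissues and $gsePath are iterated as whitespace-separated lists,
	# so they are deliberately left unquoted in the for loops.
	for tissue in $tissues
	do
		echo "$tissue"
		outputT="${processedPath}${tissue}/"
		mkdir -p "$outputT"

		for gse in $gsePath
		do
			path="${dataPath}${tissue}/${gse}/"
			# Only process GSE directories that exist for this tissue.
			if [ -d "$path" ]
			then
				output="${processedPath}${tissue}/${gse}/Experimental_conditions/"
				mkdir -p "$output"
				scriptA="${scriptPath}Rlauncher.sh"
				"$scriptA" -p "$path" -t "$tissue" -o "$output" -c "$cdfpath"
			fi
		done
	done
	# Wait until qstat no longer lists any ChemPSy_ job.
	while [ "$(qstat | grep "ChemPSy_" | wc -l)" -ne 0 ]
	do
		echo "Running --2-- STEP_1 process_data"
		sleep 7
	done
	echo "--2-- STEP_1 process_data finished"
}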

To perform the quality control, check each picture one by one and remove every microarray with 20% or more hybridization error.
List all the samples to remove in removeCelFile.txt, then rerun Step 3 of the Data format section.
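
Once each picture has been reviewed, the 20% rule can be applied mechanically to the estimates you noted down. The following is only a sketch under stated assumptions: the helper names are hypothetical, the error fractions are values you estimated yourself from the pictures, and removeCelFile.txt is assumed to contain one CEL file name per line, which this README does not specify.

```python
def samples_to_remove(error_fraction, threshold=0.20):
    """Return the sorted CEL file names whose error fraction reaches the threshold."""
    return sorted(
        name for name, fraction in error_fraction.items() if fraction >= threshold
    )


def write_remove_list(names, path="removeCelFile.txt"):
    """Write one CEL file name per line (assumed format of removeCelFile.txt)."""
    with open(path, "w") as handle:
        for name in names:
            handle.write(name + "\n")
```

For example, `write_remove_list(samples_to_remove({"GSM1.CEL.gz": 0.05, "GSM2.CEL.gz": 0.35}))` would list only GSM2.CEL.gz.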

The directory of each processed condition looks like this:

+-- [Species]  
   +-- [Tissue]
		+--GSEXXX
			+--Experimental_conditions
				+--Condition 1
				¦	+--contrastmatrix.txt 
				¦	+--normdata.txt 
				¦	+--designmatrix.txt
				¦	+--qc_boxplot_afternormalization.pdf
				¦	+--qc_boxplot_beforenormalization.pdf
				¦	+--filtration.txt 
				¦	+--log2fcchangedata.txt
				¦	+--qc_corrmatrix_afternormalization.pdf
				¦	+--mednormdata.txt
				¦	+--qc_image_003016029009.CEL.gz.png
				¦	+--qc_image_003016029015.CEL.gz.png
				¦	+--qc_image_003016029009.CEL.gz.png

###Step 2 - List all conditions
This step lists all the conditions