Analyzing gene expression time-courses

Alexander Schliep, Ivan G. Costa, Christine Steinhoff, Alexander Schönhuth


Software

All experiments involving HMM estimation were run with GQL. The GQL software, licensed under the GPL, is available at http://ghmm.org/gql. It requires several other packages, most importantly Python and the GHMM. The Splines and CAGED implementations were obtained from their authors, and an implementation of k-means in Python was obtained from Open Source clustering. The Python scripts used for generating all experiments are also available for download (see readme.txt for file descriptions).


Data

All data sets used in the experiments can be downloaded from https://schlieplab.org/Static/Supplements/ExpAna/data/ (see http://ghmm.org/gql for a description of the format).

Simulated Data 1

The simulated data set SIM1 consists of a total of 3500 time-courses of length 30 (equal step-width in $[2, 2\pi]$) in six classes. The time-courses were obtained by sampling from the respective class models. The normal distribution is denoted as $N(\mu,\sigma)$.

Class Description Size Function
C1 up-regulation 500 $0.15 \cdot x - 0.7 + N(1,0.3)$
C2 noise 1000 $0 + N(1,0.6)$
C3 down-regulation 500 $-0.3 \cdot x - 0.3 + N(1,0.3)$
C4 cyclic 1 500 $N(1,0.1)\cdot \sin\left(1.2\cdot N(1,0.05)\cdot x +0.8 \cdot 2\pi\right) + N(0,0.4)$
C5 cyclic 1 100 $N(1,0.0075)\cdot \sin\left(1.2 \cdot N(1,0.1)\cdot x +0.6 \cdot 2\pi\right) + N(0,0.5)$
C6 cyclic 1 900 $N(1,0.9)\cdot \sin\left(1.5 \cdot N(1,0.025)\cdot x +0.5\cdot 2\pi\right) + N(0,0.5)$
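The class models in the table can be sampled with a few lines of Python. The following is a minimal sketch, not the script used for the experiments. It assumes that the second parameter of $N(\mu,\sigma)$ is a standard deviation, that the additive noise term is drawn independently at every time point, and that the multiplicative amplitude and frequency factors of the cyclic classes are drawn once per time-course. The function names are hypothetical. Only C1 and C4 are shown; the other classes follow the same pattern.

```python
import math
import random


def time_grid(n_points=30, start=2.0, stop=2 * math.pi):
    """Equally spaced time points on [start, stop], both endpoints included."""
    step = (stop - start) / (n_points - 1)
    return [start + i * step for i in range(n_points)]


def sample_c1(xs, rng):
    """Class C1 (up-regulation): 0.15*x - 0.7 + N(1, 0.3), noise per time point."""
    return [0.15 * x - 0.7 + rng.gauss(1.0, 0.3) for x in xs]


def sample_c4(xs, rng):
    """Class C4 (cyclic): N(1,0.1) * sin(1.2*N(1,0.05)*x + 0.8*2*pi) + N(0,0.4).

    Assumption: amplitude and frequency factors are drawn once per time-course,
    the additive noise once per time point.
    """
    amplitude = rng.gauss(1.0, 0.1)
    frequency = rng.gauss(1.0, 0.05)
    return [
        amplitude * math.sin(1.2 * frequency * x + 0.8 * 2 * math.pi)
        + rng.gauss(0.0, 0.4)
        for x in xs
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
xs = time_grid()
c1 = [sample_c1(xs, rng) for _ in range(500)]  # class sizes taken from the table
c4 = [sample_c4(xs, rng) for _ in range(500)]
```

Drawing the factors once per time-course keeps each cyclic curve smooth up to the additive noise; drawing them per time point would be an equally valid reading of the table and only requires moving the two `rng.gauss` calls inside the list comprehension.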


Simulated Data 2

We selected eight HMMs encoding the possible three-segment regulation behaviors (e.g., down-down-down, up-down-down) and used a Monte-Carlo algorithm (variances of emission probabilities were set to 0.05) to generate 100 time-courses from each of the eight HMMs.
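The generation procedure can be illustrated with a small Monte-Carlo sampler. This is a sketch under stated assumptions, not the GHMM-based code used for the experiments: it assumes a left-to-right HMM with one state per segment and Gaussian emissions, and reads the eight behaviors as all combinations of "up" and "down" over three segments. The emission means (`SEGMENT_MEAN`), the self-transition probability (0.75) and the time-course length (12) are hypothetical values chosen for illustration; only the emission variance of 0.05 and the 100 time-courses per model come from the description above.

```python
import itertools
import math
import random

# Hypothetical encoding: mean level emitted while in an "up" or "down" segment.
SEGMENT_MEAN = {"up": 1.0, "down": -1.0}


def sample_linear_hmm(means, variance, stay_prob, length, rng):
    """Draw one time-course from a left-to-right HMM with Gaussian emissions."""
    sd = math.sqrt(variance)
    state = 0
    series = []
    for _ in range(length):
        series.append(rng.gauss(means[state], sd))
        # Move to the next state with probability 1 - stay_prob;
        # the last state can only loop on itself.
        if state < len(means) - 1 and rng.random() > stay_prob:
            state += 1
    return series


rng = random.Random(0)
patterns = list(itertools.product(("up", "down"), repeat=3))  # 2^3 = 8 behaviors
sim2 = {
    pattern: [
        sample_linear_hmm(
            [SEGMENT_MEAN[s] for s in pattern],
            variance=0.05,   # from the text
            stay_prob=0.75,  # hypothetical
            length=12,       # hypothetical
            rng=rng,
        )
        for _ in range(100)  # 100 time-courses per HMM, from the text
    ]
    for pattern in patterns
}
```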

Yeast 5 Data

The Y5 data set was originally downloaded from Yeung Sup. Material. The five classes correspond to genes listed in Cho et al. 1998 as belonging to the phases Early G1, Late G1, S, G2 and M.


Results

Simulated Data 2

Results on SIM2.
Method CR Specificity Sensitivity
CAGED 0.077 0.160 0.834
Splines 0.166 0.308 0.255
k-means 0.326 0.383 0.460
HMM Mix.& RMC 0.500 0.518 0.631
HMM Clu.& RMC 0.520 0.553 0.617
HMM Mix.& KMI 0.563 0.595 0.648
HMM Clu.& KMI 0.579 0.620 0.646
HMM Mix.& BMC 0.586 0.634 0.641
HMM Clu.& BMC 0.587 0.635 0.641
HMM Mix. 2.0% labeled 0.589 0.704 0.593


Again, partially supervised learning obtained the best result while using only 2% of the labels (CR of 0.589). HMM estimation with BMC initialization also obtained good results (CR of 0.586 and 0.587), while k-means, Splines and CAGED performed poorly (CR below 0.4). No substantial differences were observed between mixture and clustering estimation. The main reason is the low variance (0.05) used in the generation of SIM2, which raised no robustness issues.
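For readers who want to recompute the three columns of the table, the following sketch derives them from pair counts. It assumes that CR denotes the corrected (adjusted) Rand index and that sensitivity and specificity are defined on pairs of time-courses as TP/(TP+FN) and TP/(TP+FP), where a "positive" pair is one placed in the same cluster; these definitions are assumptions not stated on this page. The function names are hypothetical, and degenerate partitions (for example a single cluster in both labelings) are not handled.

```python
from itertools import combinations


def pair_counts(truth, pred):
    """Count object pairs by whether they share a class and/or a cluster."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = pred[i] == pred[j]
        if same_class and same_cluster:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def clustering_scores(truth, pred):
    """Return (corrected Rand, specificity, sensitivity) from pair counts."""
    tp, fp, fn, tn = pair_counts(truth, pred)
    n_pairs = tp + fp + fn + tn
    # Expected number of co-clustered same-class pairs under random labeling.
    expected = (tp + fp) * (tp + fn) / n_pairs
    maximum = ((tp + fp) + (tp + fn)) / 2
    corrected_rand = (tp - expected) / (maximum - expected)
    specificity = tp / (tp + fp)  # assumed pairwise definition
    sensitivity = tp / (tp + fn)  # assumed pairwise definition
    return corrected_rand, specificity, sensitivity


# Toy example: six objects, two true classes, one object misassigned.
truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
cr, spec, sens = clustering_scores(truth, pred)
```

The pairwise loop is quadratic in the number of time-courses, which is adequate for data sets of the size used here; a contingency-table formulation gives the same values in linear time.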

GO Enrichment

We list here the GO enrichment of the clusters from HeLa and YSPOR that are cited in the paper but were not included in the manuscript. For the complete tables at all thresholds, see the files at http://algorithmics.molgen.mpg.de/ExpAna/go/ (see the documentation of GOStat for file descriptions).
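As background for reading the "Counts" and "p-value" columns below, an entry such as "24 of 259" can be tested for over-representation with a hypergeometric tail probability. The sketch below is a generic illustration, not GOStat's implementation. The cluster size and the number of annotated genes in the background are not given on this page, so the values used in the example call are hypothetical. The p-values listed in the tables are presumably corrected for multiple testing, so uncorrected values from this sketch should not be expected to reproduce them.

```python
from math import comb


def hypergeom_tail(k, term_total, cluster_size, population):
    """P(X >= k) when drawing cluster_size genes without replacement
    from population genes, term_total of which carry the GO term."""
    upper = min(term_total, cluster_size)
    favourable = sum(
        comb(term_total, i) * comb(population - term_total, cluster_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, cluster_size)


# 24 of the 259 genes annotated with the term fall into the cluster.
# cluster_size=150 and population=6000 are hypothetical illustration values.
p_raw = hypergeom_tail(24, 259, cluster_size=150, population=6000)
```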


GO Enrichment for cluster three from HeLa for threshold 1.69
GO number Counts p-value GO Term
No enrichment



GO Enrichment for cluster three from HeLa for threshold 0.31
GO number Counts p-value GO Term
GO:0043169 7 of 3442 0.0387 cation binding
GO:0005509 4 of 914 0.0387 calcium ion binding
GO:0046872 7 of 3681 0.0403 metal ion binding
GO:0043167 7 of 3681 0.0403 ion binding


GO Enrichment for cluster six from HeLa for threshold 1.69
GO number Counts p-value GO Term
No enrichment


GO Enrichment for cluster six from HeLa for threshold 0.31
GO number Counts p-value GO Term
GO:0007243 4 of 268 0.00276 protein kinase cascade
GO:0007242 5 of 1268 0.0332 intracellular signaling cascade
GO:0050896 6 of 2354 0.0332 response to stimulus
GO:0006468 4 of 923 0.0332 protein amino acid phosphorylation
GO:0006950 4 of 935 0.0332 response to stress
GO:0000185 1 of 3 0.0332 activation of MAPKKK
GO:0006915 3 of 456 0.0332 apoptosis
GO:0016773 4 of 1007 0.0332 phosphotransferase activity, alcohol group as acceptor
GO:0012501 3 of 459 0.0332 programmed cell death
GO:0007165 7 of 3681 0.0332 signal transduction


GO Enrichment for cluster one from YSPOR for threshold 1.3
GO number Counts p-value GO Term
GO:0000279 24 of 259 2.47e-10 M phase
GO:0000280 21 of 243 6.84e-08 nuclear division
GO:0007017 14 of 100 1.43e-05 microtubule-based process
GO:0000072 11 of 58 1.43e-05 M phase specific microtubule process
GO:0000070 8 of 33 8.4e-05 mitotic sister chromatid segregation
GO:0000819 8 of 33 8.4e-05 sister chromatid segregation
GO:0000226 12 of 88 8.96e-05 microtubule cytoskeleton organization and biogenesis
GO:0007067 14 of 133 0.000229 mitosis
GO:0000087 14 of 135 0.000244 M phase of mitotic cell cycle
GO:0000090 8 of 41 0.000296 mitotic anaphase


GO Enrichment for cluster one from YSPOR for threshold 0.41
GO number Counts p-value GO Term
GO:0000279 18 of 259 5.64e-05 M phase
GO:0000280 16 of 243 0.000333 nuclear division
GO:0000072 8 of 58 0.000632 M phase specific microtubule process
GO:0007067 11 of 133 0.00101 mitosis
GO:0000087 11 of 135 0.00101 M phase of mitotic cell cycle
GO:0007017 9 of 100 0.00267 microtubule-based process
GO:0007049 20 of 516 0.00269 cell cycle
GO:0000226 8 of 88 0.00538 microtubule cytoskeleton organization and biogenesis
GO:0008283 21 of 589 0.00739 cell proliferation
GO:0000070 5 of 33 0.00875 mitotic sister chromatid segregation


GO Enrichment for cluster four from YSPOR for threshold 1.3
GO number Counts p-value GO Term
No enrichment


GO Enrichment for cluster four from YSPOR for threshold 0.41
GO number Counts p-value GO Term
GO:0008186 3 of 25 0.0689 RNA-dependent ATPase activity
GO:0004004 3 of 25 0.0689 ATP-dependent RNA helicase activity
GO:0005730 7 of 266 0.0907 nucleolus