Mixture Models for the Analysis of Gene Expression: Integration of Multiple Experiments and Cluster Validation

I.G. Costa

Ph.D. Thesis, Freie Universität Berlin, May 2008.

The main focus of this thesis is the problem of finding groups of co-expressed genes from data obtained in DNA microarray experiments. Because we assume that co-expressed genes (1) perform related functional tasks and (2) are regulated by the same transcriptional regulation program, such an analysis helps to identify the biological function and the regulatory roles of genes. One traditional approach for finding co-expressed genes is the use of clustering methods. In this thesis, we use mixture models as a statistical formalism for clustering gene expression data. Mixture models are robust to noise, can model uncertainty about cluster assignments, allow the inclusion of prior knowledge, such as intrinsic dependencies of the experimental design, and offer a flexible framework for the integration of additional biological data. In Chapter 2, we introduce the mixture model formalism. Then, in Chapter 3, we describe how mixture models can be used to solve the clustering problem, and how questions such as choosing the number of clusters and validating clusterings can be answered in the context of mixture models. Additionally, in Chapter 3 we propose a novel external index for validating clusterings computed by mixtures. With a proper choice of component models, mixture models make it possible to state explicit assumptions about the data. We propose here two novel types of component models for analyzing gene expression. The use of hidden Markov models (HMMs) with linear topologies to analyze gene expression time courses is the focus of Chapter 4. With a benchmark data set, we show that mixtures of HMMs achieve better class recovery than other methods proposed for time course analysis. In Chapter 5, we propose a new type of probabilistic model, dependence trees, to model gene expression profiles during a developmental process. We also explore the benefits of using priors on model parameters to obtain maximum-a-posteriori point estimates, and show how this improves the robustness of the method.
For data collected in lymphoid development, mixtures of dependence trees compare favorably to other methods used for finding groups of co-expressed genes. Furthermore, by incorporating microRNA binding data, we identify promising novel regulatory roles of genes and their functional assignments. In Chapter 6, we propose an extension of mixture model estimation. This semi-supervised learning approach can integrate additional biological data and improve clusterings of gene expression time courses. We propose a novel method, which combines gene expression time courses with spatial patterns of gene expression in Drosophila embryos, for finding groups of syn-expressed genes. Our results demonstrate that clusterings obtained after integrating the additional data recover syn-expressed genes better than clusterings obtained from the gene expression data alone.
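
As a rough illustration of the mixture-model clustering idea summarized above, the following minimal sketch fits a mixture by expectation-maximization and returns soft cluster assignments, which is how a mixture expresses uncertainty about cluster membership. It is not the thesis's method: it assumes diagonal-covariance Gaussian components rather than the HMM or dependence tree components proposed in the thesis, and the function name `fit_gaussian_mixture`, its parameters, and the toy data are hypothetical.

```python
import numpy as np

def fit_gaussian_mixture(X, k, n_iter=50, seed=0):
    """EM for a mixture of k diagonal-covariance Gaussians (illustrative only).

    X is an (n_genes, n_conditions) expression matrix. Returns the mixture
    weights, component means, component variances, and the (n_genes, k)
    matrix of posterior assignment probabilities from the final E-step.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialise means at k distinct data points, variances at the global variance.
    means = X[rng.choice(n, size=k, replace=False)]
    variances = np.tile(X.var(axis=0) + 1e-6, (k, 1))
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each gene,
        # computed in log space for numerical stability.
        diff = X[:, None, :] - means[None, :, :]
        log_pdf = -0.5 * (np.log(2 * np.pi * variances)[None]
                          + diff ** 2 / variances[None]).sum(axis=2)
        log_joint = np.log(weights)[None] + log_pdf
        log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
        resp = np.exp(log_joint - log_norm)

        # M-step: re-estimate parameters from the soft assignments.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        diff = X[:, None, :] - means[None, :, :]
        variances = (resp[:, :, None] * diff ** 2).sum(axis=0) / nk[:, None] + 1e-6

    return weights, means, variances, resp

# Toy data (assumed for illustration): 60 "genes" measured in 4 conditions,
# drawn from two groups with different mean expression levels.
data_rng = np.random.default_rng(1)
expr = np.vstack([data_rng.normal(0.0, 0.5, (30, 4)),
                  data_rng.normal(3.0, 0.5, (30, 4))])
weights, means, variances, resp = fit_gaussian_mixture(expr, k=2)
hard_labels = resp.argmax(axis=1)  # most probable cluster per gene
```

Each row of `resp` sums to one and gives the probability that a gene belongs to each cluster; taking the row-wise maximum yields a hard clustering when one is needed.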

A reprint is available as PDF.

The publication includes results from the following projects or software tools: GQL, GenExpTimecourses, GHMM, CellDiff.

Further publications by Ivan G Costa.