Mixture models for heterogeneous biological data

Invited talk presented on April 6, 2007 by Alexander Schliep at the Symposium on Bioinformatics and Biomathematics, Centrum voor Wiskunde en Informatica, Amsterdam.

Abstract: Recent years have seen many efforts to generate biological data on a very large scale, from sequences to transcripts and from genotypes to phenotypes. The integration of these heterogeneous data sources to arrive at conclusive information about a biological process is typically performed manually. We revisit classical mixture models, that is, convex combinations of density functions, and show how they can effectively model high-dimensional data when the component models are sufficiently constrained. Context-specific independence (CSI) provides a framework for learning relevant variables while avoiding over-fitting. The analysis of heterogeneous data then becomes possible either with a naive Bayes approach or with semi-supervised learning. In semi-supervised learning, primary mass data is augmented with labels from possibly sparse secondary data. We will show several case studies, for example the detection of groups of syn-expressed genes from in-situ images and gene expression time-courses during embryogenesis.
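
To make the ideas in the abstract concrete, the following is a minimal Python sketch; it is an illustration under stated assumptions, not the method presented in the talk. It assumes diagonal Gaussian components (features independent within each component, as in naive Bayes) and encodes the sparse secondary labels as an integer array with -1 for unlabeled samples. The function names, the NumPy-based EM loop, and all parameters are hypothetical choices made for this example, and CSI structure learning is not shown.

```python
import numpy as np


def log_component_densities(X, means, variances):
    """Log density of every sample under every diagonal-Gaussian component.

    X: (n, d); means, variances: (k, d). Returns an (n, k) array.
    """
    diff = X[:, None, :] - means[None, :, :]
    per_feature = np.log(2.0 * np.pi * variances)[None, :, :] + diff**2 / variances[None, :, :]
    # Summing over features is the naive Bayes (independence) assumption.
    return -0.5 * per_feature.sum(axis=2)


def responsibilities(X, weights, means, variances, labels=None):
    """Posterior component memberships; labeled samples are clamped."""
    log_joint = np.log(weights)[None, :] + log_component_densities(X, means, variances)
    log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_joint)
    post /= post.sum(axis=1, keepdims=True)
    if labels is not None:
        idx = np.flatnonzero(labels >= 0)  # samples with a known label
        post[idx] = 0.0
        post[idx, labels[idx]] = 1.0
    return post


def fit_mixture(X, k, labels=None, n_iter=50, seed=0, min_var=1e-3):
    """EM for a mixture sum_c weights[c] * N(x | means[c], diag(variances[c]))."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    weights = np.full(k, 1.0 / k)  # convex combination: non-negative, sums to 1
    means = X[rng.choice(n, size=k, replace=False)].astype(float)
    if labels is not None:
        # Assumption: start each labeled component at the mean of its labeled samples.
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:
                means[c] = members.mean(axis=0)
    variances = np.tile(X.var(axis=0) + min_var, (k, 1))

    for _ in range(n_iter):
        post = responsibilities(X, weights, means, variances, labels)  # E-step
        mass = post.sum(axis=0) + 1e-12                                # M-step
        weights = mass / mass.sum()
        means = (post.T @ X) / mass[:, None]
        second_moment = (post.T @ X**2) / mass[:, None]
        variances = np.maximum(second_moment - means**2, min_var)
    return weights, means, variances
```

Calling fit_mixture(X, k, labels) with a float array X of shape (n, d) returns the mixture weights, component means, and per-feature variances; calling responsibilities without labels afterwards gives soft assignments for all samples. Clamping the posteriors of labeled samples is one simple way to let a few labels steer an otherwise unsupervised fit.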