On the feasibility of Heterogeneous Analysis of Large Scale Biological Data

I.G. Costa and A. Schliep

In Proceedings of ECML/PKDD 2006 Workshop on Data and Text Mining for Integrative Biology, 55–60, 2006.

Secondary information such as Gene Ontology (GO) annotations or location analysis of transcription factor binding is often used to assess the validity of clusters, by considering whether individual terms or factors are significantly enriched in clusters. If such an enrichment indeed supports validity, it should also be helpful in finding biologically meaningful clusters in the first place. One simple framework that allows this, and that does not rely on strong assumptions about the data, is semi-supervised learning. A primary data source, gene expression time-courses, is clustered, and the secondary data, GO annotation or transcription factor binding information, is used to define pairwise constraints between genes that guide the computation of clusters. We show that this approach improves performance when high-quality labels are present, but that naive use of the heterogeneous data routinely used for cluster validation actually decreases clustering performance.
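As a rough illustration of how secondary data could be turned into pairwise constraints, the Python sketch below marks two genes as a positive (must-link) pair when they share at least one annotation term. The function name `must_link_pairs`, the toy annotations, and the "share at least one term" rule are assumptions made for illustration only; the constraint scheme used in the paper may differ.

```python
from itertools import combinations

def must_link_pairs(annotations):
    """Return the set of gene pairs that share at least one annotation term.

    `annotations` maps a gene name to a set of term identifiers.
    The sharing rule is an illustrative assumption, not the paper's method.
    """
    pairs = set()
    # Sort gene names so each pair is emitted in a stable (a, b) order.
    for a, b in combinations(sorted(annotations), 2):
        if annotations[a] & annotations[b]:  # non-empty intersection of terms
            pairs.add((a, b))
    return pairs

# Hypothetical toy annotations (gene -> set of term IDs), not taken from the paper.
toy = {
    "geneA": {"GO:0006412"},
    "geneB": {"GO:0006412", "GO:0008152"},
    "geneC": {"GO:0007049"},
}
constraints = must_link_pairs(toy)
```

Here only `geneA` and `geneB` share a term, so they form the single constrained pair. A constrained clustering method would then favor solutions that place such pairs in the same cluster.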

A reprint is available as PDF.

The publication includes results from the following projects or software tools: GenExpTimecourses.

The following presentation(s) are based on this publication: Sept. 18, 2006 by Ivan Costa at ECML Workshop on Data and Text Mining for Integrative Biology (Contributed Talk).

Further publications by Alexander Schliep, Ivan G Costa.