pGQL: A Probabilistic Graphical Query Language for Gene Expression Time Courses

R.B. Schilling, I.G. Costa and A. Schliep

BioData Mining 2011, 4:9.

BACKGROUND: Timeboxes are graphical user interface widgets that were proposed for specifying queries on time course data. Because queries can be defined very easily, exploratory analysis of time course data is greatly facilitated. While timeboxes are effective, they have no provisions for dealing with noisy data or with fluctuations along the time axis, both of which are very common in many applications. This is particularly true for the analysis of gene expression time courses, which are mostly derived from noisy microarray measurements at a few unevenly sampled time points. From a data mining point of view, robust handling of such data through a sound statistical model is of great importance.

RESULTS: We propose probabilistic timeboxes, which correspond to a specific class of Hidden Markov Models (HMMs), an established method in data mining. Since HMMs are a particular class of probabilistic graphical models, we call our method Probabilistic Graphical Query Language. It is implemented in the free software package pGQL. We evaluate its effectiveness in exploratory analysis on a yeast sporulation data set.

CONCLUSIONS: We introduce a new approach to defining dynamic, statistical queries on time course data. It supports interactive exploration of reasonably large amounts of data and enables users without expert knowledge to specify fairly complex statistical models with ease. Owing to its statistical nature, a query is more expressive, and more robust to amplitude and frequency fluctuations, than prior deterministic approaches.
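To illustrate the idea of a probabilistic timebox, the following is a minimal sketch that assumes each box is one state of a left-to-right HMM with a Gaussian emission. It is not the pGQL API. The function names, the state parameters (mean, standard deviation, probability of staying in the state) and the example values are hypothetical and chosen only for illustration. Each time course is scored by its log-likelihood under the query model, computed with the forward algorithm.

```python
import math

def log_gauss(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_likelihood(series, states):
    """Forward algorithm for a left-to-right HMM.

    states: list of (mean, std, stay_prob), one entry per hypothetical timebox,
    with 0 < stay_prob < 1. The model starts in the first state; the last
    state is absorbing.
    """
    n = len(states)
    alpha = [-math.inf] * n
    alpha[0] = log_gauss(series[0], states[0][0], states[0][1])
    for x in series[1:]:
        new_alpha = [-math.inf] * n
        for j, (mean, std, stay) in enumerate(states):
            stay_p = 1.0 if j == n - 1 else stay
            total = alpha[j] + math.log(stay_p)        # remain in state j
            if j > 0:
                leave_prev = 1.0 - states[j - 1][2]    # advance from state j-1
                total = log_add(total, alpha[j - 1] + math.log(leave_prev))
            new_alpha[j] = total + log_gauss(x, mean, std)
        alpha = new_alpha
    result = -math.inf
    for a in alpha:
        result = log_add(result, a)
    return result

# Hypothetical query: "low expression first, then high expression".
query = [(-1.0, 0.5, 0.6), (1.0, 0.5, 0.6)]

matching = [-1.1, -0.8, -0.9, 0.9, 1.2, 1.0]     # switches at the fourth point
shifted = [-1.0, -0.9, 1.1, 0.8, 1.0, 1.2]       # same shape, earlier switch
mismatching = [1.0, 0.9, 1.1, -1.0, -0.9, -1.2]  # opposite pattern

scores = {
    "matching": log_likelihood(matching, query),
    "shifted": log_likelihood(shifted, query),
    "mismatching": log_likelihood(mismatching, query),
}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because the state durations are not fixed, the time course that switches earlier still scores far above the opposite pattern. This is the kind of tolerance to fluctuations along the time axis that a deterministic timebox with hard bounds does not provide.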

A reprint is available as PDF.

PubMed ID: 21501515. DOI: 10.1186/1756-0381-4-9.

The publication includes results from the following projects or software tools: GenExpTimecourses.
