Reduced representations for efficient analysis of genomic data; from microarray to high throughput sequencing

M. Mahmud

Ph.D. Thesis, Oct 2014.

Since the genomics era began in the 1970s, microarray technologies have been extensively used for biological applications such as gene expression profiling and the detection of copy number variation (CNV) or Single Nucleotide Polymorphisms (SNPs). Numerous statistical and algorithmic techniques have been developed over the last two decades to analyze microarray data; in particular, Hidden Markov Models (HMMs) have been successfully used for detecting CNV from array comparative genomic hybridization (arrayCGH) data. Still, for computational reasons, the benefits of Bayesian HMMs have been overlooked, and their use in practice has been minimal at best. The large demand for computational resources has also affected the analysis of high throughput sequencing (HTS) data, which over the last few years has started to revolutionize the field of computational biology. For example, the most sensitive tools for mapping HTS data to reference genomes are generally ignored in favor of fast, less accurate ones.

In this dissertation, we strive for reduced representations of biological data that enable efficient computations on large datasets. Since biological datasets often contain repetitive, sometimes redundant, elements, it is a natural idea to identify groups of similar elements and perform computations directly on these groups. Usually, the relevant type of similarity is specific to the type of data and the application at hand. Specifically, we make the following four contributions in this thesis. First, we show that, by exploiting repetition in discrete sequences, Markov Chain Monte Carlo (MCMC) simulations of Bayesian HMMs can be accelerated, and we apply this to the DNA segmentation problem. Second, for Gaussian observations representing copy number ratio data, we show that, by pre-computing blocks of similar, contiguous observations, MCMC for Bayesian HMMs can be well approximated. Third, by representing sequences as multi-dimensional vectors, we introduce a novel nearest-neighbor-based technique for mapping HTS data to a reference genome. Finally, we present a highly efficient clustering approach for HTS data, which allows us to speed up computationally demanding, sensitive tools for mapping HTS data.
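As a rough illustration of the second contribution, the following minimal Python sketch groups contiguous, similar observations into blocks and stores per-block sufficient statistics, so that the Gaussian log-likelihood of a whole block under one state can be evaluated without revisiting each observation. The greedy tolerance rule and the names `compress_into_blocks` and `block_log_likelihood` are assumptions made for this example only; the abstract does not specify how the thesis constructs its blocks.

```python
import math


def compress_into_blocks(values, tolerance):
    """Greedily group contiguous observations into blocks.

    Hypothetical rule (an assumption, not the thesis's method): an observation
    joins the current block if it lies within `tolerance` of the block's first
    value. Each block keeps the count, sum, and sum of squares of its members.
    """
    blocks = []
    for x in values:
        if blocks and abs(x - blocks[-1]["anchor"]) <= tolerance:
            block = blocks[-1]
            block["n"] += 1
            block["sum"] += x
            block["sum_sq"] += x * x
        else:
            blocks.append({"anchor": x, "n": 1, "sum": x, "sum_sq": x * x})
    return blocks


def block_log_likelihood(block, mean, std):
    """Gaussian log-likelihood of all observations in a block under one state,
    computed from the sufficient statistics only."""
    n, s, ss = block["n"], block["sum"], block["sum_sq"]
    # Sum over the block of (x - mean)^2, expanded in terms of n, sum, sum_sq.
    squared_error = ss - 2.0 * mean * s + n * mean * mean
    variance = std * std
    return -0.5 * n * math.log(2.0 * math.pi * variance) - squared_error / (2.0 * variance)


def pointwise_log_likelihood(values, mean, std):
    """Reference computation that visits every observation individually."""
    variance = std * std
    return sum(
        -0.5 * math.log(2.0 * math.pi * variance) - (x - mean) ** 2 / (2.0 * variance)
        for x in values
    )


# Toy log-ratio-like data with two plateaus.
observations = [0.0, 0.1, -0.05, 1.0, 1.1, 0.95]
blocks = compress_into_blocks(observations, tolerance=0.3)
```

When every observation in a block is assigned to the same hidden state, the block-level likelihood equals the sum of the pointwise likelihoods exactly. Any approximation would come from forcing a block to share a single state, which is a property of this sketch and not a statement about the thesis's algorithm.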

A reprint is available as PDF.

The publication includes results from the following projects or software tools: BayesianHMM, TreQ, TreqCG, AlgoEngineering.

Further publications by Md P. Mahmud.