Statistical Bioinformatics for the Post-Genomic Age
Invited Talk presented on Jan. 27, 2014 by Alexander Schliep at Institut de Génétique et de Biologie Moléculaire et Cellulaire.
Abstract: High-throughput sequencing (HTS), a technology for determining gene sequences on a large scale, is pervasive in clinical and biological applications ranging from cancer research to basic science, and is expected to gain enormous momentum in future personalized-medicine applications. This deluge of data can be addressed with new methods which operate directly on reduced representations of the data and enable the use of advanced statistics even on very large data sets. For identifying Copy Number Variants (CNV), our approach accelerated full Bayesian methods to the point of matching the speed of maximum-likelihood methods. Typical data sets consist of up to 2 billion sequencing reads, and large studies may provide hundreds of such data sets. A core part of the analysis is read error correction, mapping to the reference genome, and identifying genetic variations. We arrive at a reduced representation of HTS data sets through a clustering method able to cluster billions of reads. Adaptations of downstream algorithms operate directly on the clustered representations, enabling compressive genomics and increasing the fidelity of the analysis at constant or lower cost. We will illustrate the general principles behind this approach and present as case studies both ongoing projects (the "noise" level of structural variants in breast cancer as a potential marker of cancer subtypes, and the assembly of a eukaryotic microbe from single-cell genomics data) and future projects, such as assessing CNV progression in pedigrees or, more generally, across multiple conditions. Complementing this genome-level research is an ongoing project on using molecular imaging to understand dysregulation of lipid metabolism in TB-infected macrophages.
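To illustrate the compressive-genomics idea mentioned in the abstract, the following is a minimal sketch, not the speaker's actual pipeline: reads are greedily clustered by a cheap k-mer similarity, a downstream analysis touches only one representative per cluster, and the result is weighted by cluster size. The clustering criterion, thresholds, and the toy "analysis" (GC content) are hypothetical stand-ins chosen only to make the sketch self-contained and runnable.

```python
# Sketch of compressive genomics on sequencing reads: cluster near-duplicate
# reads, then run downstream analysis on one representative per cluster.
# All thresholds and the GC-content "analysis" are illustrative assumptions.

def kmer_profile(read, k=4):
    """Return the set of k-mers in a read, used as a cheap similarity signature."""
    return frozenset(read[i:i + k] for i in range(len(read) - k + 1))

def greedy_cluster(reads, k=4, min_jaccard=0.6):
    """Assign each read to the first cluster whose representative shares enough
    k-mers (Jaccard similarity); otherwise start a new cluster."""
    clusters = []  # list of (representative_profile, member_reads)
    for read in reads:
        profile = kmer_profile(read, k)
        for rep_profile, members in clusters:
            union = len(profile | rep_profile)
            if union and len(profile & rep_profile) / union >= min_jaccard:
                members.append(read)
                break
        else:
            clusters.append((profile, [read]))
    return clusters

def gc_content(read):
    """Toy stand-in for a per-read downstream analysis."""
    return (read.count("G") + read.count("C")) / len(read)

if __name__ == "__main__":
    reads = [
        "ACGTACGTACGTAA", "ACGTACGTACGTAC", "ACGTACGTACGTAG",  # near-duplicates
        "TTTTGGGGCCCCAA", "TTTTGGGGCCCCAT",
    ]
    clusters = greedy_cluster(reads)
    # Downstream analysis runs once per cluster and is weighted by cluster size,
    # instead of running once per read.
    total = sum(gc_content(members[0]) * len(members) for _, members in clusters)
    print(f"{len(clusters)} clusters for {len(reads)} reads;"
          f" weighted mean GC = {total / len(reads):.3f}")
```

In a real HTS setting the greedy pass would be replaced by a clustering method that scales to billions of reads, but the principle is the same: downstream algorithms see the compressed, clustered representation rather than every individual read.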