Statistical Bioinformatics using Reduced Representations
Invited Talk presented on Jan. 22, 2014 by Alexander Schliep at the Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Luxembourg.
Abstract: High-throughput sequencing (HTS), a technology for determining gene sequences on a large scale, is pervasive in clinical and biological applications such as cancer research, and is expected to gain enormous momentum in future personalized-medicine applications. This deluge of data can be addressed with new methods that operate directly on reduced representations of the data, enabling the use of advanced statistics even on very large data sets. Typical data sets consist of up to 2 billion sequencing reads, and large studies may provide hundreds of such data sets. Core steps of the analysis are read error correction, mapping reads to a reference genome, and identifying genetic variants. We arrive at a reduced representation of HTS data sets through a clustering method able to cluster billions of reads. Adaptations of downstream algorithms operate directly on the clustered representation, thus enabling compressive genomics: increasing the fidelity of the analysis at constant or lower cost.
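
To make the compressive-genomics idea concrete, the sketch below shows it in miniature: reads are grouped into clusters, and a downstream step (here, alignment) is run once per cluster rather than once per read, so its cost scales with the number of clusters. This is a hypothetical Python illustration, not the clustering method from the talk; the prefix-based grouping, the k parameter, and the align_fn callback are simplifying assumptions.

    from collections import defaultdict

    def cluster_reads(reads, k=16):
        # Hypothetical stand-in for large-scale read clustering:
        # group reads that share an identical k-length prefix.
        clusters = defaultdict(list)
        for read in reads:
            clusters[read[:k]].append(read)
        return clusters

    def map_clusters(clusters, align_fn):
        # Run the expensive downstream step (alignment) once per
        # cluster and reuse the result for every member read.
        results = {}
        for members in clusters.values():
            hit = align_fn(members[0])  # align a single representative
            for read in members:
                results[read] = hit
        return results

    # Usage: two near-identical reads share a cluster, so the
    # (mock) aligner is invoked twice instead of three times.
    reads = ["ACGTACGTACGTACGTAA", "ACGTACGTACGTACGTTT", "TTTTGGGGCCCCAAAATT"]
    hits = map_clusters(cluster_reads(reads), align_fn=lambda r: ("chr1", 12345))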