Compressive Genomics for High-Throughput Sequencing Data
Invited talk presented on Sept. 13, 2013 by Alexander Schliep at the Institute for Advanced Simulation (IAS), Jülich Supercomputing Centre (JSC), Germany.
Abstract: High-throughput sequencing (HTS), a technology to unravel gene sequences on a large scale, is pervasive in clinical and biological applications such as cancer research, and is expected to gain enormous momentum in future personalized medicine applications. To facilitate computations, new algorithms have to be developed that work directly on a compressed version of the data (compressive genomics), thus simplifying and accelerating analysis and advancing further into the era of personal genomics. Typical data sets consist of up to 2 billion sequencing reads, and large studies might provide hundreds of such data sets. Core parts of the analysis are read error correction, mapping to the reference genome, and identifying genetic variations. Our compressive genomics approach is based on a greedy clustering method able to cluster billions of reads, and on adaptations of downstream algorithms that work directly on the clustered representations.
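
To make the idea of greedy clustering plus downstream work on the clustered representation concrete, here is a minimal sketch. It is not the method presented in the talk: the Hamming-distance criterion, the `max_mismatches` threshold, the first-fit assignment rule, and the function names `greedy_cluster` and `weighted_kmer_counts` are all assumptions chosen for illustration, and the linear scan over representatives would not scale to billions of reads.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(reads, max_mismatches=1):
    """Greedy first-fit clustering (illustrative assumption, not the talk's method).

    Each read joins the first cluster whose representative has the same length
    and differs in at most max_mismatches positions; otherwise the read opens a
    new cluster and becomes its representative.
    """
    clusters = []  # list of (representative, members) pairs
    for read in reads:
        for rep, members in clusters:
            if len(rep) == len(read) and hamming(rep, read) <= max_mismatches:
                members.append(read)
                break
        else:
            clusters.append((read, [read]))
    return clusters


def weighted_kmer_counts(clusters, k):
    """Approximate k-mer counts computed on representatives only.

    Each representative's k-mers are weighted by its cluster size, so the
    downstream step touches one sequence per cluster instead of every read.
    """
    counts = {}
    for rep, members in clusters:
        for i in range(len(rep) - k + 1):
            kmer = rep[i:i + k]
            counts[kmer] = counts.get(kmer, 0) + len(members)
    return counts


# Toy input: five short reads that fall into two groups.
reads = ["ACGTAC", "ACGTAA", "TTGCAA", "ACGTAC", "TTGCAT"]
clusters = greedy_cluster(reads, max_mismatches=1)
kmer_counts = weighted_kmer_counts(clusters, k=3)
```

In this toy setting the k-mer counts are approximate, because members that differ from their representative contribute the representative's k-mers rather than their own. That trade of exactness for fewer sequences to process is the kind of compromise a compressed representation implies.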