TreQ-CG: Clustering Accelerates High-Throughput Sequencing Read Mapping

As high-throughput sequencers become standard equipment outside of sequencing centers, there is an increasing need for efficient methods for pre-processing and primary analysis. While a vast literature proposes methods for NGS data analysis, we argue that significant improvements can still be gained from expensive pre-processing steps whose cost is amortized by savings in later stages.

We propose a method to accelerate and improve read mapping based on an initial clustering of up to billions of high-throughput sequencing reads yielding clusters of high stringency and a high degree of overlap. This clustering improves on the state-of-the-art in running time for small datasets and, for the first time, makes clustering high-coverage human libraries feasible. Given the efficiently computed clusters, only one representative read from each cluster needs to be mapped using a traditional readmapper such as BWA, instead of individually mapping all reads.
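The idea of mapping only one representative per cluster can be illustrated with a minimal sketch. This is not the TreQ-CG algorithm: the greedy Hamming-distance clustering below is a simple stand-in that would not scale to billions of reads, and the names `greedy_cluster`, `map_via_representatives`, and `toy_mapper` are hypothetical. The exact-match `toy_mapper` merely takes the place of a real readmapper such as BWA.

```python
from typing import Callable, List, Optional


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(reads: List[str], max_mismatches: int = 1) -> List[List[int]]:
    """Assign each read to the first cluster whose representative is close.

    Each cluster is a list of read indices; the first index is the
    representative. Quadratic in the worst case, so illustration only.
    """
    clusters: List[List[int]] = []
    for i, read in enumerate(reads):
        for cluster in clusters:
            rep = reads[cluster[0]]
            if len(rep) == len(read) and hamming(rep, read) <= max_mismatches:
                cluster.append(i)
                break
        else:
            # No existing representative is close enough: start a new cluster.
            clusters.append([i])
    return clusters


def map_via_representatives(
    reads: List[str],
    clusters: List[List[int]],
    map_read: Callable[[str], Optional[int]],
) -> List[Optional[int]]:
    """Call the mapper once per cluster and copy the result to all members."""
    positions: List[Optional[int]] = [None] * len(reads)
    for cluster in clusters:
        pos = map_read(reads[cluster[0]])
        for idx in cluster:
            positions[idx] = pos
    return positions


def toy_mapper(reference: str) -> Callable[[str], Optional[int]]:
    """Stand-in for a real readmapper: exact substring search only."""
    def map_read(read: str) -> Optional[int]:
        pos = reference.find(read)
        return pos if pos >= 0 else None
    return map_read


reference = "AAACCCGGGTTTACGT"
reads = ["AAACCC", "AAACCG", "GGGTTT", "GGGTTT"]
clusters = greedy_cluster(reads, max_mismatches=1)
positions = map_via_representatives(reads, clusters, toy_mapper(reference))
```

In this toy run the four reads fall into two clusters, so the mapper is called twice instead of four times. The read `AAACCG` inherits the position of its representative even though it differs by one base, which is the kind of trade-off between speed and mapping quality that the clustering stringency controls.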

On human reads, all processing steps combined, including clustering and mapping, require only 11%–71% of the time needed to map all reads individually (a speed-up of roughly 1.4x to 9x). Speed-ups are achieved for all readmappers, with minimal effect on mapping quality. As a result, a highly sensitive readmapper such as Stampy, run on clustered reads, becomes competitive with a fast readmapper such as BWA run on unclustered reads.

TreQ-CG can be downloaded from here.

For further information contact Md P. Mahmud (pavelm@cs.rutgers.edu). This software is a result of or used in the following projects: TreQ, HTSMethods, AlgoEngineering.

Team

Members: Md P. Mahmud, Alexander Schliep.

Publications

Mahmud. Reduced representations for efficient analysis of genomic data; from microarray to high throughput sequencing. Ph.D. Thesis, Oct 2014.

Mahmud et al. TreQ-CG: Clustering Accelerates High-Throughput Sequencing Read Mapping. Technical report, 2014.