Improving genome assembly by identifying reliable sequencing data

R.S. Roy

Ph.D. Thesis, Oct 2014.

De novo genome assembly and k-mer frequency counting are two of the classical problems of bioinformatics. k-mer counting helps to identify genomic k-mers from sequenced reads, which may then inform read correction or genome assembly. Genome assembly has two major subproblems: contig construction and scaffolding. A contig is a contiguous sub-sequence of the genome assembled from sequencing reads. Scaffolding attempts to construct a linear sequence of contigs (with possible gaps in between) using paired reads (two reads whose distance on the genome is approximately known). In this thesis I present a new computationally efficient tool for identifying frequent k-mers, which are more likely to be genomic, and a set of linear inequalities that can improve scaffolding (a problem known to be NP-hard) by identifying reliable paired reads. Identifying reliable k-mers from single cell sequencing data is more challenging than from multi-cell data because of the coverage variation introduced by the amplification step (MDA, MALBAC, etc.), which makes a simple k-mer frequency cutoff unreasonable. We observed that, with sufficient coverage, using partial reads (read prefixes of a fixed length) of at most approximately twice the k-mer length recovers a large proportion of genomic k-mers while keeping the proportion of erroneous k-mers low. We show that using partial reads for assembly and gene prediction recovers a significant proportion of genes, and we propose to use this approach for rapid pathogen detection in combination with single cell genomics (SCG). Thanks to single cell genomics, it is now possible to isolate a single cell from an environmental sample, extract its DNA and sequence it without any need to culture the cell in the lab. We show that current bioinformatic tools are capable of characterizing a novel organism by producing a draft genome assembly and gene annotation from single cell data of a MAST-4 stramenopile. This demonstrates the potential of SCG for the genetic study of the vast majority of environmental organisms, which have so far eluded scientists because they cannot be brought into culture, typically a prerequisite for further study.
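The partial-read idea described in the abstract can be illustrated with a minimal sketch. This is not the tool presented in the thesis: the function name `count_kmers_in_prefixes`, the parameters `k` and `prefix_len`, and the toy reads are all assumptions made for illustration. It counts k-mers only within the first `prefix_len` bases of each read, and it ignores reverse complements, which a real k-mer counter would normally handle.

```python
from collections import Counter


def count_kmers_in_prefixes(reads, k, prefix_len):
    """Count k-mers found in the first `prefix_len` bases of each read.

    Hypothetical helper for illustration only; strands are not merged
    (no canonical k-mers) and k-mers containing 'N' are skipped.
    """
    counts = Counter()
    for read in reads:
        prefix = read[:prefix_len].upper()  # the "partial read"
        # Slide a window of width k over the prefix; the range is empty
        # when the prefix is shorter than k.
        for i in range(len(prefix) - k + 1):
            kmer = prefix[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


# Toy reads (made up); prefix length is twice the k-mer length.
reads = ["ACGTACGTAC", "ACGTTTGACC", "ACGTACGGGA"]
counts = count_kmers_in_prefixes(reads, k=4, prefix_len=8)
```

With `prefix_len` set to about twice `k`, each read contributes at most `prefix_len - k + 1` k-mers, all taken from the start of the read.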


The publication includes results from the following projects or software tools: SCG.
