SCG: Bioinformatics for Single-Cell Genomics
Genome assembly is one of the fundamental problems in bioinformatics. Assembly can be either reference-guided, when a reference genome similar to the genome we want to assemble is available, or de novo, when the genome is reconstructed only from the reads produced by sequencing machines. With sequencing getting cheaper by the day, researchers are interested in assembling the genomes of more and more organisms. The main bottleneck is the lack of reliable de novo assembly tools for Next Generation Sequencing data, whose reads are cheaper but shorter. We wish to investigate various aspects of the de novo assembly problem, such as read filtering and correction, contig building, and scaffolding.
In collaboration with Prof. Debashish Bhattacharya of SEBS, Rutgers, we explored the effectiveness of single-cell assembly tools for producing a draft genome assembly of an unknown wild-caught marine diatom. We showed that if the genomic material is largely free of contaminants, we can reliably perform phylogenetic and evolutionary analysis, protein prediction and annotation, and metabolic pathway analysis for the organism. Currently, we are exploring the possibility of performing a similar evolutionary analysis of Picobiliphytes (a recently discovered group of algae).
Whole-genome-amplified (WGA) single-cell (SC) sequencing data is notorious for large coverage variation and errors. The frequent k-mer observation problem can be viewed as a generalized "coupon collecting" problem in which coupons appear with probabilities following a certain distribution. Since reads contain more errors towards the end, the rate of false frequent k-mers increases with increasing read length. Our goal is to predict the False Discovery Rate (FDR) of the observed frequent k-mers for a particular prefix of the read (a partial read) and thereby suggest a prefix length, for a given value of k, for k-mer-based downstream analysis such as assembly. Working with partial reads also makes it possible to perform preliminary analysis before sequencing is complete. This would facilitate rapid pathogen detection for diseases where the ability to quickly administer the correct antimicrobial drug has a profound effect on patient outcome.
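The effect of prefix length on false frequent k-mers can be illustrated with a small Python sketch. It is not the prediction model described above: it simply counts k-mers in read prefixes and, on toy data with a known reference, measures the empirical FDR. The function names, the `min_count` threshold, and the toy reads are assumptions made for illustration, and reverse complements are ignored.

```python
from collections import Counter


def frequent_kmers(reads, k, prefix_len, min_count=2):
    """Return the k-mers seen at least min_count times in the read prefixes."""
    counts = Counter()
    for read in reads:
        prefix = read[:prefix_len]  # keep only the first prefix_len bases
        for i in range(len(prefix) - k + 1):
            counts[prefix[i:i + k]] += 1
    return {kmer for kmer, c in counts.items() if c >= min_count}


def empirical_fdr(frequent, reference, k):
    """Fraction of frequent k-mers that do not occur in the known reference."""
    if not frequent:
        return 0.0
    true_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    false_hits = sum(1 for kmer in frequent if kmer not in true_kmers)
    return false_hits / len(frequent)


if __name__ == "__main__":
    # Toy data (hypothetical): the last two reads carry errors in their final bases.
    reference = "ACGTTGCA"
    reads = ["ACGTTGCA", "ACGTTGCA", "ACGTTGTT", "ACGTTGTT"]
    for prefix_len in (6, 8):
        freq = frequent_kmers(reads, k=3, prefix_len=prefix_len)
        print(prefix_len, len(freq), empirical_fdr(freq, reference, k=3))
```

In this toy setting the shorter prefix avoids the error-prone read ends, so every frequent k-mer is genuine, while the full-length reads admit a false frequent k-mer. With real data no reference is available, which is why the FDR has to be predicted rather than measured.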
Members: Alexander Schliep, Md P. Mahmud, Rajat S Roy.
Collaborators: Debashish Bhattacharya (Department of Ecology, Evolution, and Natural Resources, Rutgers), Kevin Chen (Department of Genetics, Rutgers), Anirvan Sengupta (Department of Physics, Rutgers), Jeff M Boyd (Department of Microbiology and Biochemistry, Rutgers), Tom Kirn (Clinical Pathology, Robert Wood Johnson Medical School, Rutgers).
Publications
Roy et al. Single cell genome analysis of an uncultured heterotrophic stramenopile. Sci Rep 2014, 4:4780.
Roy. Improving genome assembly by identifying reliable sequencing data. Ph.D. Thesis, Oct 2014.
Roy et al. Turtle: Identifying frequent k-mers with cache-efficient algorithms. Bioinformatics 2014, 14:30, 1950–7.
Roy et al. Turtle: Identifying frequent k-mers with cache-efficient algorithms. Technical report, May 2013. arXiv.
Roy et al. SLIQ: Simple Linear Inequalities for Efficient Contig Scaffolding. Journal of Computational Biology 2012, 19, 1162–75.
Roy et al. SLIQ: Simple Linear Inequalities for Efficient Contig Scaffolding. Technical report, Nov 2011. arXiv.