GenomicSignatures: Alignment-free genome analysis
In an age of global pandemics, studying how viruses and their genomes evolve is of great importance. The genomes of many eukaryotes and prokaryotes have previously been found to prefer particular nucleotides, dinucleotides, and codons. Such preferences are shaped by the selective pressure acting on the genomes and are referred to as genomic signatures. However, it is not clear to what extent viruses have genomic signatures, or to what extent these signatures are shaped by the host of the virus or by other biological factors.
We have determined that many viruses have genomic signatures, which indicates the presence of mechanisms that shape viral genomes to prefer specific oligonucleotides over others. Specifically, when a viral genome is divided into two parts and the parts are compared, the two parts are, in many cases, remarkably similar.
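As an illustration of the two-part comparison, the following minimal sketch splits a sequence at its midpoint, computes the relative frequency of every k-mer in each half, and compares the two profiles. The function names, the choice of k, and the use of cosine similarity are assumptions made for this example; the text above does not state which composition measure or distance was used.

```python
from collections import Counter
from itertools import product
import math


def kmer_profile(seq, k):
    """Relative frequency of each of the 4^k DNA k-mers in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only k-mers made of A, C, G, T are counted; anything with N etc. is ignored.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[m] / total for m in kmers]


def cosine_similarity(p, q):
    """Cosine similarity of two frequency vectors (assumed comparison measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm


def half_similarity(genome, k=2):
    """Split the genome at its midpoint and compare the halves' k-mer profiles."""
    mid = len(genome) // 2
    return cosine_similarity(kmer_profile(genome[:mid], k),
                             kmer_profile(genome[mid:], k))
```

A value close to 1 means the two halves have nearly the same k-mer composition, which is the kind of within-genome consistency a genomic signature would produce.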
To study these genomic signatures, we use alignment-free methods. Alignment-free methods are a popular approach for comparing biological sequences, including complete genomes. The methods range from probability distributions of sequence composition to first- and higher-order Markov chains, where a k-th order Markov chain over DNA has 4^k formal parameters. To circumvent this exponential growth in parameters, variable-length Markov chains (VLMCs) have gained popularity for applications in molecular biology and other areas. VLMCs adapt the context depth to the local sequence context and thus avoid an excessive number of parameters.
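The exponential growth can be made concrete with a small sketch of a fixed-order Markov chain. The code below is an illustrative assumption, not the project's implementation: `max_contexts` counts the 4^k possible contexts of a k-th order chain over DNA, and `fit_markov_chain` estimates a next-nucleotide distribution for every context that actually occurs in a sequence.

```python
from collections import defaultdict

ALPHABET = "ACGT"


def max_contexts(k):
    """Number of possible length-k contexts over the DNA alphabet."""
    return len(ALPHABET) ** k


def fit_markov_chain(seq, k):
    """Estimate P(next nucleotide | previous k nucleotides) from seq."""
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    for i in range(k, len(seq)):
        context, nxt = seq[i - k:i], seq[i]
        # Skip windows containing ambiguous symbols such as N.
        if set(context + nxt) <= set(ALPHABET):
            counts[context][nxt] += 1
    model = {}
    for context, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        model[context] = {n: c / total for n, c in nxt_counts.items()}
    return model


if __name__ == "__main__":
    for k in range(1, 7):
        print(k, max_contexts(k))
```

Every context carries its own distribution over four nucleotides, so a fixed order of k = 10 already requires more than a million contexts, most of which are seen rarely or never in a short viral genome. A VLMC keeps a long context only where the data support it.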
To handle modern datasets, we have developed new algorithms to build and compare VLMCs. Specifically, we have developed one implementation based on lazy suffix trees and a significantly faster one based on k-mer counts. The latter can run in external memory, which enables VLMCs to be applied to arbitrarily large genomes as well as sequencing datasets, even on typical laptops.
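To show how a VLMC can be derived from k-mer counts alone, here is a simplified in-memory sketch. It is not the implementation described above: the pruning rule (a count-weighted Kullback-Leibler divergence between a context and its one-symbol-shorter suffix), the pseudocount, the threshold value, and all function names are assumptions chosen for illustration, and a plain Python `Counter` stands in for an external-memory k-mer counter.

```python
from collections import Counter
import math

ALPHABET = "ACGT"


def count_kmers(seq, max_len):
    """Count every k-mer of length 1..max_len (stand-in for a k-mer counter)."""
    counts = Counter()
    for k in range(1, max_len + 1):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set(ALPHABET):
                counts[kmer] += 1
    return counts


def next_distribution(counts, context, pseudo=1.0):
    """P(next symbol | context), from counts of context + symbol with a pseudocount."""
    raw = [counts[context + a] + pseudo for a in ALPHABET]
    total = sum(raw)
    return [c / total for c in raw]


def build_vlmc(seq, max_depth=3, threshold=1.0):
    """Keep a context only if it predicts differently enough from its suffix."""
    counts = count_kmers(seq, max_depth + 1)
    kept = {""}  # the empty (root) context is always kept
    # Visit long contexts first, so keeping one also keeps all of its suffixes.
    contexts = sorted((c for c in counts if len(c) <= max_depth),
                      key=len, reverse=True)
    for ctx in contexts:
        if ctx in kept:
            continue
        p = next_distribution(counts, ctx)
        q = next_distribution(counts, ctx[1:])  # parent drops the oldest symbol
        n = sum(counts[ctx + a] for a in ALPHABET)
        delta = n * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
        if delta >= threshold:
            while ctx not in kept:
                kept.add(ctx)
                ctx = ctx[1:]
    return {c: next_distribution(counts, c) for c in kept}
```

Because the sketch only ever looks up counts of a context followed by one symbol, everything it needs is available from a table of k-mer counts up to length `max_depth + 1`, which is what makes a counting-based construction possible without building a suffix tree.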