DNA Sequence Classification Using Variable Length Markov Models

S. Norlin

Master's Thesis, Chalmers University of Technology, Jun 2019.

Pathogens such as viruses and bacteria are a major health concern today. To effectively treat these it is important to identify known pathogens and potential new ones from DNA samples. Modern methods are however not good enough at classifying rare, previously undocumented pathogens. This thesis explores nearest neighbor classification using variable length Markov chains (VLMC) as a possible solution. A vantage point tree is used to store the database of VLMC being queried against. This gives promising results when classifying VLMC from complete genomes or chromosomes. Multiple techniques, both greedy approximations and new lower bounds are explored. This results in order of magnitude faster classification than previous research. However the technique ultimately fails at classifying shorter DNA sequences of lengths typically found when sequencing DNA. Multiple reasons for this are given with a possible way forward if further research is deemed relevant.

Further publications by Sebastian Norlin.