Computer Science and Engineering
 Gothenburg University | Chalmers

Home page

Home page  Contact us  Site map 

 

 

DNA Sequence Classification Using Variable Length Markov Models

S. Norlin

Master's Thesis, Chalmers University of Technology, Jun 2019.

Pathogens such as viruses and bacteria are a major health concern today. To effec- tively treat these it is important to identify known pathogens and potential new ones from DNA samples. Modern methods are however not good enough at classifying rare, previously undocumented pathogens. This thesis explores nearest neighbor classification using variable length Markov chains (VLMC) as a possible solution. A vantage point tree is used to store the database of VLMC being queried against. This gives promising results when clas- sifying VLMC from complete genomes or chromosomes. Multiple techniques, both greedy approximations and new lower bounds are explored. This results in order of magnitude faster classification than previous research. However the technique ulti- mately fails at classifying shorter DNA sequences of lengths typically found when sequencing DNA. Multiple reasons for this are given with a possible way forward if further research is deemed relevant.