Scalable methods for k-mer based biological sequence analysis
Scalable analysis of biological sequences often starts by breaking long strings into their constituent k-mers, where a k-mer is a substring of a short, fixed length k. Designing compact data structures and efficient algorithms for storing and analyzing k-mer datasets has therefore become one of the bottlenecks for biological discovery. In this talk, I will present several techniques we have developed to push the boundaries of what is possible with such datasets. I will cover the spectrum-preserving string set representation (RECOMB 2020, best paper award) as well as space-efficient data structures for querying large sequence archives (RECOMB 2017). Time permitting, I will also discuss our work on using sketching algorithms to estimate sequence similarity from k-mer sets (RECOMB 2021 and ISMB 2022).
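
For readers new to the terminology, the following minimal Python sketch shows what decomposing a sequence into k-mers looks like, and how two k-mer sets can be compared. The function names kmers and jaccard and the example sequences are illustrative assumptions, not taken from the work presented in the talk. The similarity here is the exact Jaccard index computed on the full sets; the sketching algorithms mentioned above aim to estimate quantities of this kind without storing every k-mer, and that is not shown here.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n has n - k + 1 k-mer positions (none if n < k).
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy sequences that differ in a single position.
x = kmers("ACGTACGT", 3)
y = kmers("ACGTTCGT", 3)

print(sorted(x))      # ['ACG', 'CGT', 'GTA', 'TAC']
print(jaccard(x, y))  # 2 shared k-mers out of 7 distinct ones
```

Storing k-mers as a Python set discards multiplicities and positions, which is adequate for this illustration but far from the compact representations the talk is about.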