PhD Defense: Parallel and External-Memory Algorithms for Large-Scale Genomic Sequence-Graph and -Index Construction
IRB IRB-3137
https://umd.zoom.us/j/7464632053
Rapid developments in the throughput and affordability of modern genome-sequencing technologies have made the generation of billions of sequencing reads from biological samples highly time- and cost-efficient. Concurrent development of efficient algorithms and tools for genome assembly have also vastly expanded the number of available assembled genome sequences. This fast-paced and sustained increase in available genomic data has resulted in situations where, in many cases, computational data analysis has become more resource-expensive than the experiments generating these data. This is particularly true since genomic data are generated once, but are typically stored indefinitely and subsequently reanalyzed many times. As such, the development of algorithmic approaches to process these data in a fundamentally scalable manner has been a key focus of modern computational genomics.This dissertation focuses on developing scalable, parallel algorithms and data structures for compressing, organizing, and indexing high-throughput and large-scale genomic data using external-memory, in ways so as to increase the scale at which existing analyses can be performed. Toward these objectives, a mathematical object called the de Bruijn graph—and the associated data structure and variants—have become a compact data representation of increasing utility across computational genomics. Genomic data are expensive in terms of space, and building a compact representation of sequences of interest while retaining relevant features is a first step in many computational genomics pipelines. The majority of this dissertation focuses on construction algorithms for such a representation—a compact lossless variant of the de Bruijn graph, used to represent sets of genomic sequences.This dissertation first presents a SotA parallel algorithm, Cuttlefish, to construct the contracted de Bruijn graph from genome references with superior performance to pre-existing methods—where we propose a novel modeling approach for each graph vertex as an individual finite-state automaton, and treat the graph as a collection of automata. Cuttlefish is applicable for a specific class of input data, namely assembled genome sequences. The dissertation then presents our subsequent work, where we extend this method to be more general and enable it take any form of input. We developed a new SotA algorithm, Cuttlefish 2—through generalization of the earlier algorithm—to also facilitate construction from sequencing read sets as input, which is fundamentally a tougher problem due to structural properties of raw sequencing data.Both Cuttlefish and Cuttlefish 2 construct the de Bruijn graph without source-annotations for the vertices. The dissertation then presents Cuttlefish 3, an even faster, more memory-frugal, and further scalable algorithm to construct the contracted de Bruijn graph, while also incorporating source-annotation information (i.e. inverted-indexing) into the graph. Orthogonal to the strategy of working with the whole graph at all times, Cuttlefish 3 adopts an established paradigm of partitioning the original graph into subgraphs, contracting the subgraphs separately, and then merging the local solutions appropriately—with novel algorithmic contributions in key steps. It speeds up the construction process of the local contractions non-trivially, proposes a novel algorithm to join the local solutions based on parallel tree contraction, and presents a graph sparsification strategy to vastly reduce the amount of vertices to be tracked when inverted-indexing the graph.Finally, the dissertation presents a simple and highly parallel algorithm, CaPS-SA, to construct the classic Suffix Array data structure to index genomic sequences, advancing the SotA parallel solutions for modern architectures.