RawAlign: Accurate, Fast, and Scalable Raw Nanopore Signal Mapping via Combining Seeding and Alignment

8 Oct 2023 · Joël Lindegger, Can Firtina, Nika Mansouri Ghiasi, Mohammad Sadrosadati, Mohammed Alser, Onur Mutlu ·

Nanopore-based sequencers generate a series of raw electrical signal measurements that represent the contents of a biological sequence molecule passing through the sequencer's nanopore. If the raw signal is analyzed in real-time, an irrelevant molecule can be ejected from the nanopore before it is completely sequenced, reducing sequencing time. To meet the low-latency and high-throughput requirements of the real-time analysis, a number of recent works propose the direct analysis of raw nanopore signals instead of traditional basecalling-based analysis approaches. We observe that while existing proposals for raw signal read mapping typically do well in all metrics for small reference databases (e.g., viral genomes), they all fail to scale to large reference databases (e.g., the human genome) in some aspect. Our goal is to analyze raw nanopore signals with high accuracy, high throughput, low latency, low memory usage, and needing few bases to be sequenced for a wide range of reference database sizes. To this end, we propose RawAlign, the first Seed-Filter-Align mapper for raw nanopore signals. Our evaluation shows that RawAlign is the only tool that can map raw nanopore signals to large reference databases $\geq$3117Mbp with high accuracy. Our evaluation shows that RawAlign generalizes well to a wide range of reference database sizes. In particular, RawAlign has a similar throughput to the overall prior state-of-the-art RawHash (between 0.80$\times$-1.08$\times$) while improving accuracy on all datasets (between 1.02$\times$-1.64$\times$ F-1 score). RawAlign provides a 2.83$\times$ (2.06$\times$) speedup over Uncalled (Sigmap) on average (geo. mean) while improving accuracy by 1.35$\times$ (1.34$\times$) in terms of F-1 score on average (geo. mean). Availability: https://github.com/cmu-safari/RawAlign

PDF Abstract