VMware Research | Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index (journal version)

Abstract

Sequence-level searches on large collections of RNA sequencing experiments, such as the NCBI Sequence Read Archive (SRA), would enable one to ask many questions about the expression or variation of a given transcript in a population. Existing approaches, such as the sequence Bloom tree, suffer from fundamental limitations of the Bloom filter, resulting in slow build and query times, less-than-optimal space usage, and potentially large numbers of false-positives. This paper introduces Mantis, a space-efficient system that uses new data structures to index thousands of raw-read experiments and facilitates large-scale sequence searches. In our evaluation, index construction with Mantis is 6× faster and yields a 20% smaller index than the state-of-the-art split sequence Bloom tree (SSBT). For queries, Mantis is 6–108× faster than SSBT and has no false-positives or -negatives. For example, Mantis was able to search for all 200,400 known human transcripts in an index of 2,652 RNA sequencing experiments in 82 min; SSBT took close to 4 days.
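
As a rough sanity check on the example above, the two reported query times can be compared directly. The sketch below assumes that "close to 4 days" is treated as exactly 4 days, so the resulting ratio is an approximate upper bound rather than a figure reported in the paper; the variable names are illustrative only.

```python
# Figures taken from the abstract.
MANTIS_MINUTES = 82          # Mantis: all transcripts searched in 82 min
NUM_TRANSCRIPTS = 200_400    # known human transcripts queried

# Assumption: "close to 4 days" is rounded to exactly 4 days.
SSBT_MINUTES = 4 * 24 * 60

# Ratio of total query times for this one workload.
speedup = SSBT_MINUTES / MANTIS_MINUTES

# Average Mantis time per transcript, in milliseconds.
ms_per_transcript = MANTIS_MINUTES * 60 * 1000 / NUM_TRANSCRIPTS

# Does the ratio fall inside the 6-108x range quoted for queries?
in_reported_range = 6 <= speedup <= 108

print(f"approximate speedup: {speedup:.1f}x")
print(f"average Mantis time per transcript: {ms_per_transcript:.1f} ms")
print(f"within reported 6-108x range: {in_reported_range}")
```

Under that assumption the ratio comes out at roughly 70×, which sits inside the 6–108× range quoted for queries.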

Date

June, 2018

Authors

Prashant Pandey
Fatemah Almodaresi
Michael A. Bender
Michael Ferdman
Rob Johnson
Rob Patro

Related projects

Scalable Tools for Computational Biology

Research Areas

Computational Biology

Type

Article

Journal

Cell Systems