VMware Research | Scalable Tools for Computational Biology

Introduction

Tools for indexing and searching terabytes of genomic and transcriptomic data

Summary

Scientists need tools for quickly indexing, searching, and analyzing terabytes, or even petabytes, of raw genomic and transcriptomic data. This project is developing multiple tools for efficiently representing biological data so that huge data sets can be processed in RAM, providing orders-of-magnitude increases in scalability and performance. The project has produced several open-source tools:

Squeakr, a fast and compact k-mer counter for computational biology applications, built on the counting quotient filter (CQF).
deBGR, a compact, near-exact representation of de Bruijn graphs of k-mers.
Mantis, an indexing system for searching for sequences in large-scale databases of genomic and transcriptomic data.
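
For readers new to the terminology, the following minimal Python sketch illustrates what "k-mer counting" and "a de Bruijn graph of k-mers" mean. It is not how Squeakr or deBGR work internally: it uses an exact in-memory dictionary rather than a CQF, and the function names count_kmers and de_bruijn_edges are hypothetical, chosen only for this illustration. It also skips canonicalization (treating a k-mer and its reverse complement as the same item), which tools in this area commonly apply.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Exactly count every length-k substring (k-mer) of a sequence.

    Illustration only: an exact dictionary-based count, not a CQF.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # Slide a window of width k across the sequence, one base at a time.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def de_bruijn_edges(kmer_counts: Counter) -> dict:
    """View each k-mer as a weighted edge between (k-1)-mer nodes.

    The edge runs from the k-mer's (k-1)-length prefix to its
    (k-1)-length suffix, weighted by the k-mer's count.
    """
    return {(kmer[:-1], kmer[1:]): n for kmer, n in kmer_counts.items()}


# Toy input; real data sets contain billions of k-mers.
counts = count_kmers("ACGTACGT", 3)
edges = de_bruijn_edges(counts)
```

Here counts maps each 3-mer of the toy sequence to the number of times it occurs, and edges holds the same information as a weighted graph. An exact table like this grows with the number of distinct k-mers, which is why compact representations matter at terabyte scale.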

News

Our 2021 bioRxiv preprint describes how to combine Mantis with LSM trees to create a scalable and incrementally updatable index on raw sequencing data.