Abstract
The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure used
in numerous areas in genomics, such as population-level variation detection in metagenomic samples, large
scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the
cdbg, the color information comes to dominate the space required to represent this data structure.
In this paper, we show how to represent the color information efficiently by adopting a hierarchical encoding
that exploits correlations among color classes — patterns of color occurrence — present in the de Bruijn
graph (dbg). A major challenge in deriving an efficient encoding of the color information that takes advantage
of such correlations is determining which color classes are close to each other in the high-dimensional space of
possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search
for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color
information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential
as the number of potential colors (i.e. samples or references) grows to thousands of experiments.
We apply this encoding in the context of two different applications; the implicit cdbg used for a large-scale
sequence search index, Mantis, as well as the encoding of color information used in population-level variation
detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size
and scalability of representation of the color information. In our experiment on 10,000 samples, we achieved
more than 11× better compression compared to RRR.