Abstract
We engineer a GPU implementation of a dynamic B-Tree that
supports concurrent queries (point, range, and successor)
and updates (insertions and deletions). Our B-tree outperforms the state of the art, a GPU log-structured merge tree
(LSM) and a GPU sorted array. In particular, point and range
queries are significantly faster than in a GPU LSM (the GPU
LSM does not implement successor queries). Furthermore,
B-Tree insertions are also faster than LSM and sorted array
insertions unless insertions come in batches of more than
roughly 100k insertions. Because we cache the upper levels of the tree, we achieve lookup throughput that exceeds
the DRAM bandwidth of the GPU. We demonstrate that the
key limiter of performance on a GPU is contention and describe the design choices that allow us to achieve this high
performance.