Data structures for real-time processing of huge streams


Classic streaming problems include tasks such as, "Find the items that occur most frequently in this data stream," or "Raise an alert whenever a particular event occurs in a given data stream." Theoretical models of streaming problems typically assume that the problem is much, much larger than available storage. As a result, they compromise on accuracy by developing approximate solutions that use only a small amount of space, so that the algorithm can work in RAM.

This project is developing new algorithms that perform stream processing using disks (NVMes, SSDs, or HDDs), without compromising on speed or timeliness of event reporting.

We have new data structures that enable us to do real-time event detection in huge streams with perfect accuracy using SSDs. Our implementation has performance that is comparable to a state-of-the-art in-memory solution.

This work also has applications to standing queries in databases, i.e. queries that continually update a result as the database is modified by other transactions. This problem is particularly challenging in write-optimized databases, such as LSM-trees and Bε-trees, since they can insert new data far faster than they can query old data, meaning that a standing query cannot be implemented by simply querying the database after every insert. Our work shows that, nonetheless, it is possible to implement efficient standing queries in write-optimized databases.


  • Our SIGMOD 2020 paper presents new data structures for finding heavy hitters precisely in huge streams by using disks efficiently.


External Researchers

  • Cynthia A. Phillips
  • Jon Berry
  • Martin Farach-Colton
  • Michael A. Bender
  • Prashant Pandey
  • Shikha Singh
  • Tom Kroeger