Introduction
Stable and consistent membership for large-scale distributed systems
Summary
This project focuses on membership management and failure detection for large-scale distributed systems.
We observe that datacenter failures are not always crash failures; they commonly involve misconfigured firewalls, one-way connectivity loss, flip-flops in reachability, and partial packet loss, where some but not all packets are dropped. Existing membership solutions detect crash faults cleanly but struggle with these common scenarios. In particular, existing tools take a long time to converge, or never converge, to a stable state in which the faulty processes have been removed.
To address this challenge, we present Rapid, a scalable, distributed membership system that remains stable across a diverse range of failure scenarios and provides participating processes with a strongly consistent view of the system's membership.
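The instability described above can be illustrated with a small sketch. This is not Rapid's algorithm; it is a simplified illustration under stated assumptions, and the function names (`naive_view_changes`, `quorum_removed`) and the `observers_needed` parameter are hypothetical. The first function shows how a detector that reacts to every individual probe result keeps changing its view when reachability flip-flops. The second shows one generic way to dampen that: remove a process only once several distinct observers have reported it.

```python
from collections import defaultdict


def naive_view_changes(probe_results):
    """Count membership changes when a single observer flips its view
    on every probe result that disagrees with its current belief."""
    changes = 0
    alive = True  # assumption: the process starts out in the membership
    for reachable in probe_results:
        if reachable != alive:
            alive = reachable
            changes += 1
    return changes


def quorum_removed(reports, observers_needed):
    """Return the processes reported faulty by at least `observers_needed`
    distinct observers. `reports` is a list of (observer, suspect) pairs."""
    accusers = defaultdict(set)
    for observer, suspect in reports:
        accusers[suspect].add(observer)
    return {s for s, obs in accusers.items() if len(obs) >= observers_needed}


# A flip-flopping link: the naive detector changes its view on every flip.
flapping = [True, False, True, False, True, False]
print(naive_view_changes(flapping))  # 5

# Three observers report "x"; only one reports "y", so only "x" is removed.
reports = [("a", "x"), ("b", "x"), ("c", "x"), ("a", "y")]
print(quorum_removed(reports, observers_needed=3))  # {'x'}
```

In this toy setting, a single noisy observer cannot change the membership on its own, which is the kind of stability the summary refers to. How Rapid itself achieves stability and consistency is described in the project's source code and publications.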
Details
The source code is available on GitHub.
Researchers
External Researchers
- Ivan Porto Carreiro
- Zeeshan Lokhandwala