Introduction

Stable and consistent membership for large-scale distributed systems

Summary

This project focuses on membership management and failure detection for large-scale distributed systems.

We observe that failures in datacenters are not always crash failures; they commonly involve misconfigured firewalls, one-way connectivity loss, flip-flops in reachability, and partial packet loss, where some but not all packets are dropped. Existing membership solutions detect crash faults cleanly but struggle with these common scenarios. In particular, existing tools take a long time to converge, or never converge, to a stable state in which the faulty processes have been removed.
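To make the difference between a crash and one-way connectivity loss concrete, here is a minimal Python sketch. It is a toy model for illustration only, not Rapid's algorithm: the names `local_views` and `survivors_agree` are hypothetical, and the detector is assumed to be the simplest possible one, where a process considers a peer alive only if that peer's heartbeats reach it.

```python
from typing import Dict, Iterable, Set, Tuple

Link = Tuple[str, str]  # (sender, receiver): a directed network link


def local_views(processes: Iterable[str], dropped: Set[Link]) -> Dict[str, Set[str]]:
    """Return each process's local membership view.

    Assumption (toy model): an observer keeps a peer in its view only if
    heartbeats sent from that peer to the observer are not dropped.
    """
    procs = list(processes)
    views: Dict[str, Set[str]] = {}
    for observer in procs:
        alive = {observer}
        for peer in procs:
            if peer != observer and (peer, observer) not in dropped:
                alive.add(peer)
        views[observer] = alive
    return views


def survivors_agree(views: Dict[str, Set[str]], survivors: Iterable[str]) -> bool:
    """True if every listed process holds exactly the same view."""
    return len({frozenset(views[p]) for p in survivors}) == 1


processes = ["a", "b", "c"]

# Crash of b: nothing b sends arrives anywhere, so a and c agree that b is gone.
crash_views = local_views(processes, {("b", "a"), ("b", "c")})

# One-way loss: only packets from b to a are dropped; every other link works.
one_way_views = local_views(processes, {("b", "a")})

print(survivors_agree(crash_views, ["a", "c"]))         # a and c hold the same view
print(survivors_agree(one_way_views, ["a", "b", "c"]))  # views diverge
```

In the crash case the surviving processes reach the same view without coordination. In the one-way case, `a` drops `b` while `b` and `c` still consider all three processes members, so purely local detection leaves the processes with inconsistent views of the membership.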

To address this challenge, we present Rapid, a scalable, distributed membership system that remains stable across a diverse range of failure scenarios and provides participating processes with a strongly consistent view of the system's membership.

Details

The source code is available on GitHub.

Researchers

External Researchers

  • Ivan Porto Carreiro
  • Zeeshan Lokhandwala

Category

  • Graduated Research Projects