VMware Research

Introduction

Stable and consistent membership for large-scale distributed systems

Summary

This project focuses on membership management and failure detection for large-scale distributed systems.

We observe that failures in datacenters are not always crash failures; they commonly involve misconfigured firewalls, one-way connectivity loss, flip-flops in reachability, and partial packet loss. Existing membership solutions detect crash faults cleanly but struggle with these common scenarios. In particular, existing tools take a long time to converge, or never converge, to a stable state in which the faulty processes are removed.

To address this challenge, we present Rapid, a scalable, distributed membership system that remains stable across a diverse range of failure scenarios and provides participating processes with a strongly consistent view of the system's membership.

Details

The source code is available on GitHub.

External Researchers

Ivan Porto Carreiro
Zeeshan Lokhandwala

Related Publications

Stable and Consistent Membership at Scale with Rapid, July 2018

Researchers

Dahlia Malkhi

Lalith Suresh
