VMware Research

Introduction

Automated reliability testing for Kubernetes controllers

Summary

The Kubernetes ecosystem has thousands of controller implementations for different applications and platform capabilities. A controller's correctness is critical because it manages an application's deployment, scaling, and configuration. However, that correctness can be compromised by many factors, such as asynchrony, unexpected failures, networking issues, and controller restarts, which in turn can lead to severe safety and liveness violations. Sieve is a tool that helps developers test their controllers by deterministically injecting faults and detecting dormant bugs at development time. Sieve does not require developers to modify the controller. Crucially, it can reliably reproduce the bugs it finds.

Details

To use Sieve, developers need to port their controllers and provide end-to-end test cases (see Getting Started for more information). Sieve automatically instruments the controller by intercepting the event handlers in client-go and controller-runtime. Sieve runs in two stages. In the learning stage, Sieve runs a test case and identifies promising points in the execution at which to inject faults. It does so by analyzing the sequence of events traced by the instrumented controller. The learning stage produces test plans, which are then executed in the testing stage. Each test plan specifies the type of fault to inject and the point in the execution at which to inject it. These steps are detailed in https://github.com/sieve-project/sieve/blob/main/docs/port.md.
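
The two-stage flow described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Sieve's actual implementation or API: the names Event, FaultPlan, learn, and execute, the "watch"/"write" event kinds, and the "crash-controller" fault label are all hypothetical, and the rule "inject after every write" is a stand-in for Sieve's real analysis of the event trace.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    """One entry in a traced execution (hypothetical shape)."""
    index: int      # position of the event in the trace
    kind: str       # assumed kinds: "watch" (input seen) or "write" (state change issued)
    resource: str   # the object the event concerns


@dataclass(frozen=True)
class FaultPlan:
    """A test plan: which fault to inject, and where in the execution."""
    fault: str
    inject_after: int  # index of the event after which the fault fires


def learn(trace: List[Event]) -> List[FaultPlan]:
    """Toy learning stage: propose one plan per 'write' event in the trace.

    The selection rule here is an illustrative assumption only.
    """
    return [
        FaultPlan(fault="crash-controller", inject_after=e.index)
        for e in trace
        if e.kind == "write"
    ]


def execute(trace: List[Event], plan: FaultPlan) -> List[str]:
    """Toy testing stage: replay the trace and inject the planned fault."""
    log = []
    for e in trace:
        log.append(f"{e.kind}:{e.resource}")
        if e.index == plan.inject_after:
            log.append(f"inject:{plan.fault}")
    return log


if __name__ == "__main__":
    trace = [
        Event(0, "watch", "pod/a"),
        Event(1, "write", "statefulset/a"),
        Event(2, "write", "pvc/a"),
    ]
    for plan in learn(trace):
        print(plan, execute(trace, plan))
```

Because each plan pins the fault to a fixed point in the execution, rerunning the same plan injects the fault at the same place, which is the property that makes a discovered bug reproducible.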

Key results

So far, Sieve has been applied to ten different Kubernetes controllers and has found more than 46 bugs with severe consequences, including security vulnerabilities, data loss, and resource leaks. For a list of all the controllers we have tested and the bugs found in them, see https://github.com/sieve-project/sieve/blob/main/docs/bugs.md.

More information