How etcd Solved Its Knowledge Drain with Deterministic Testing

Key Concepts

etcd: A distributed key-value store, serving as the source of truth for infrastructure, notably used by Kubernetes.
Linearizability: A correctness guarantee under which every operation appears to take effect atomically at some point between its request and its response, so the system behaves as if it were a single node.
Robustness Testing: A testing methodology focused on validating the correctness of distributed systems, especially under failure conditions.
Deterministic Simulation Testing: A testing approach that ensures reproducible execution paths, allowing for the identification of complex bugs.
Chaos Engineering: A discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production.
Witness Node: A component in a distributed system that observes the state of other nodes without participating in consensus, enabling reduced cluster size for high availability.
Property Testing: A testing paradigm in which the program is driven down many different execution paths while assertions check the properties the system must always maintain.

etcd: Enhancing Robustness Through Advanced Testing

This episode of The New Stack Makers features Mark Shakovich, a Senior Software Engineer at Google and lead maintainer of etcd, discussing recent developments in etcd's testing methodologies and their impact on the project's robustness. The conversation highlights the challenges of testing complex distributed systems and the benefits of adopting advanced testing frameworks.

Understanding etcd and its Role

etcd is a distributed key-value store that has existed for over 12 years, predating the Cloud Native Computing Foundation (CNCF) and Kubernetes. Its core principle is to act as a single source of truth for all infrastructure operations. Kubernetes, for instance, relies on etcd as its "brain" to manage upgrades and downgrades and to ensure that changes are coordinated and tracked, preventing catastrophic failures during rolling updates. The intended state of the infrastructure lives in etcd.

Shakovich's Journey with etcd

Mark Shakovich's involvement with etcd began four years ago, after seven years maintaining Metrics Server for Kubernetes. Drawn to the inherent complexity and challenge of distributed systems, he moved to the etcd project at Google, where he focuses on maintaining etcd within Google Kubernetes Engine (GKE). He notes that managing state in distributed systems, especially when combined with distributed consensus algorithms like Raft, presents significant technical hurdles.

Recent Advancements in etcd's Robustness Testing

The discussion centers on "robustness testing," a concept that emerged from the reliability and correctness challenges etcd faced roughly two to three years before the interview. A significant knowledge drain occurred when several maintainers left the project, taking with them implicit knowledge about testing and correctness guarantees. The project subsequently shipped a release with critical issues: inconsistencies could arise that caused one member of a three-member etcd cluster to behave erratically and disrupt higher-level systems like Kubernetes.

To address this, the etcd team invested heavily in preventing a recurrence. They developed their own framework, inspired by tools like Jepsen, to validate not just the behavior of a single node but the correctness of the cluster as a distributed system, aiming for the strongest guarantee: linearizability. Linearizability means that a distributed system behaves as if it were a single node, a "holy grail" of distributed systems that is notoriously difficult to validate. The work involved creating failure injection mechanisms and educating the community on debugging these complex scenarios.
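To make the linearizability check concrete, here is a minimal sketch of a brute-force checker for a single register. It is an illustration only, not code from etcd's robustness framework: the `Op` record, its timestamp fields, and the `is_linearizable` function are hypothetical names chosen for this example. A history is accepted if its operations can be placed in some order that respects real time (an operation that finished before another started must come first) and in which every read returns the most recent write.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Op:
    """One client operation as observed from outside the system (hypothetical record)."""
    kind: str             # "write" or "read"
    value: Optional[int]  # value written, or value the read returned
    call: int             # time the client sent the request
    ret: int              # time the client received the response


def is_linearizable(history: Sequence[Op], initial: Optional[int] = None) -> bool:
    """Brute-force search for a valid linearization of a single-register history."""

    def search(remaining, state):
        if not remaining:
            return True
        # An operation may be ordered next only if no other pending
        # operation had already returned before it was called.
        earliest_ret = min(op.ret for op in remaining)
        for i, op in enumerate(remaining):
            if op.call > earliest_ret:
                continue  # would violate real-time order
            if op.kind == "read" and op.value != state:
                continue  # read would not see the latest write
            next_state = op.value if op.kind == "write" else state
            if search(remaining[:i] + remaining[i + 1:], next_state):
                return True
        return False

    return search(tuple(history), initial)


# A read that overlaps a write may legally observe the new value.
concurrent = [Op("write", 1, 0, 10), Op("read", 1, 2, 5)]

# A read that starts after write(2) completed must not return the old value 1.
stale = [Op("write", 1, 0, 2), Op("write", 2, 3, 5), Op("read", 1, 6, 8)]

print(is_linearizable(concurrent))  # True
print(is_linearizable(stale))       # False
```

The search is exponential in the worst case, which hints at why validating linearizability on real histories is hard and why purpose-built checkers exist.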

Collaboration with Antithesis

The immense challenges in knowledge sharing and reproducing issues led the etcd team to collaborate with Antithesis, a company specializing in deterministic simulation testing. This approach allows for the creation of a single, reproducible execution path for even seemingly chaotic distributed systems.

Key benefits of this collaboration include:

Reproducibility: Antithesis's platform serializes execution into a single deterministic path, making every bug reproducible. This contrasts with traditional methods, where finding bugs often relies on luck in hitting a specific race condition or lock contention.
Knowledge Codification: Properties previously documented or held only in maintainers' minds were translated into assertions within the testing framework. While etcd had assertions before, they rarely triggered because the specific failure combinations required to activate them were difficult to engineer manually.
Failure Injection: Antithesis provides a robust failure injection mechanism, saving the etcd team from building complex, PhD-level integrations with custom file systems for simulating disk failures, for example.
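The reproducibility and knowledge-codification points above can be sketched with a toy simulation. This is a minimal illustration under stated assumptions, not how Antithesis or etcd implement it: `run` and `find_failing_seed` are hypothetical helpers, and the "system" is just two clients performing a non-atomic increment. Every scheduling decision is drawn from a seeded random generator, so a seed that violates the property can be replayed to get exactly the same trace.

```python
import random


def run(seed, clients=2):
    """Run one simulated execution; every scheduling decision comes from the seed."""
    rng = random.Random(seed)
    counter = 0
    # Each client performs a non-atomic increment: read the counter, then write back +1.
    pending = {c: ["read", "write"] for c in range(clients)}
    local = {}
    trace = []
    while pending:
        c = rng.choice(sorted(pending))  # the "scheduler" picks who runs next
        step = pending[c].pop(0)
        if step == "read":
            local[c] = counter
        else:
            counter = local[c] + 1
        trace.append((c, step))
        if not pending[c]:
            del pending[c]
    return counter, trace


def find_failing_seed(max_seeds=1000, clients=2):
    """Search seeds until the property 'no increment is lost' is violated."""
    for seed in range(max_seeds):
        counter, _ = run(seed, clients)
        if counter != clients:
            return seed
    return None


seed = find_failing_seed()
first = run(seed)
second = run(seed)
print(first == second)  # True: the same seed replays the same interleaving
```

The property check (`counter != clients`) plays the role of a codified assertion: it only fires on particular interleavings, and the seed turns a lucky find into something that can be replayed on demand.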

Shakovich clarifies that this is akin to chaos engineering, but within a single, reproducible virtual machine environment. Unlike traditional chaos engineering, which often injects one failure at a time, deterministic simulation allows for stacking multiple failures, uncovering new execution paths and bugs that might occur with extremely low probability (e.g., once in a million hours). This significantly contributes to achieving higher levels of availability ("more nines").
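A rough back-of-the-envelope calculation shows why stacked failures are so rare in production. The sketch below assumes the faults are independent and uses made-up per-hour probabilities; `expected_hours_between` is a hypothetical helper, not part of any etcd or Antithesis tooling.

```python
from fractions import Fraction


def expected_hours_between(per_hour_probabilities):
    """Expected hours between coincidences of independent faults.

    Assumes each fault occurs independently with the given probability
    in any one hour, so the joint probability is the product.
    """
    joint = Fraction(1)
    for p in per_hour_probabilities:
        joint *= p
    return 1 / joint


# Two hypothetical faults, each seen about once per thousand hours.
disk_fault = Fraction(1, 1000)
network_fault = Fraction(1, 1000)

hours = expected_hours_between([disk_fault, network_fault])
print(hours)         # 1000000
print(hours / 8760)  # 12500/219, roughly 114 years of continuous runtime
```

Under these assumed numbers a single cluster would wait on the order of a century to hit the combination by chance, whereas a simulator can inject both faults together deliberately.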

Impact on Enterprise and Production

For enterprises using etcd, scalability and reliability are paramount. Shakovich recounts experiencing one corruption event per year in GKE, requiring expert intervention. The shift to advanced testing, like that provided by Antithesis, allows for "shifting left" – simulating potential future production issues in a testing environment rather than discovering them after hours or years of production runtime.

The results of this testing have been integrated into the latest etcd releases. The team began by reproducing all historical issues, confirming the effectiveness of the new testing approach. This built trust that the codebase was adequately covered and that regressions introduced by new contributors would be caught.

Challenges in Testing Large Open-Source Projects

Shakovich identifies the primary obstacle for most large, complex open-source projects as the difficulty of sharing knowledge about advanced testing methodologies. Concepts like robustness testing go beyond typical university curricula and common industry practice. Robustness testing involves testing for rare events, such as a single node returning outdated information due to a disk failure.

The hope is that tools and dashboards provided by platforms like Antithesis can democratize this knowledge. For open-source projects, maintaining the core components is crucial, as they are often sensitive and interconnected. Advanced testing allows maintainers to write rules that act as safeguards, catching mistakes made by contributors, even if those mistakes are subtle or not fully understood by the reviewer.

The Potential Role of AI in Testing

The conversation touches upon the potential of AI in robustness testing. AI could enhance the exploratory nature of property testing by making the exploration smarter. Instead of just covering trivial cases, AI could identify correlations between events, code lines, and bugs. This could lead to AI not only pinpointing the exact lines of code that trigger a bug but potentially even fixing them, given the wealth of data available.

etcd's Roadmap and Future Developments

The roadmap for etcd focuses on leveraging the newfound trust in the project to enhance its core functionality. A significant upcoming feature is support for two-node etcd clusters. Currently, Kubernetes control planes need three nodes for high availability because consensus requires a majority: a two-member cluster has a majority of two, so losing either member leaves the cluster without quorum. By introducing a "witness node," a simple observer that doesn't participate in consensus (e.g., an event in a GCS bucket), etcd can operate with just two nodes, potentially cutting node costs by about a third (33%). This requires substantial changes across consensus, state machines, and all layers of the system. The advanced testing framework is crucial for validating these changes without relying on the myth of 100% unit test coverage, which doesn't guarantee resilience against failures like network disruptions.
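The majority arithmetic behind the three-node requirement, and the roughly one-third saving, can be checked in a few lines. This sketch covers only the standard quorum formula and the node count; how the witness is actually consulted is not described in the interview, so it is deliberately left out. The function names are illustrative.

```python
from fractions import Fraction


def quorum(voters):
    """Smallest majority of a voting membership."""
    return voters // 2 + 1


def tolerated_failures(voters):
    """How many voters can fail while a majority remains."""
    return voters - quorum(voters)


def node_savings(before, after):
    """Fraction of nodes saved when shrinking the cluster."""
    return Fraction(before - after, before)


for n in (2, 3):
    print(n, quorum(n), tolerated_failures(n))
# 2 2 0  -> two plain voters cannot survive the loss of either one
# 3 2 1  -> three voters survive one failure

print(node_savings(3, 2))  # 1/3, i.e. about 33% fewer nodes
```

Two ordinary voting members tolerate no failures, which is why a two-node design needs something extra, such as the witness described above, to stay available when one node is lost.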

Getting Involved with etcd

Interested individuals can get involved with etcd by visiting its GitHub page, which provides links to its Slack channel (on the Kubernetes Slack) and mailing list. Joining the mailing list grants access to bi-weekly community meetings, triage meetings, and robustness testing sessions, offering a direct view into debugging live distributed system issues.