Consensus Systems in Production: etcd, Consul, and ZooKeeper
The core idea: etcd, Consul, and ZooKeeper turn expensive consensus into practical control-plane primitives. The main trade-off is that every write pays quorum cost, so they should hold small authoritative metadata rather than serve as general-purpose databases.
Core Insight
A platform team needs one place to record who owns a leader lease, which services are healthy, which configuration version is active, and which controllers should react to a change. That state is small compared with user data, but it is high value: if two nodes disagree about it, the platform can split brain.
This is the job of production coordination systems. They are not attractive because they are convenient key-value stores. They are attractive because they give engineers an API for small pieces of state that must be ordered, watched, leased, or tied to session membership under failure.
etcd, Consul, and ZooKeeper all sit in that family, but they do not invite the same design. etcd feels like a replicated control-plane key-value store with watches, leases, and compare-and-swap style transactions. Consul combines coordinated server-side state with service discovery, health checks, sessions, and operational datacenter workflows. ZooKeeper offers a hierarchical coordination namespace with sessions, watches, ephemeral nodes, and a long history in distributed infrastructure.
The mistake is to choose one because "it stores data." The better question is: what coordination shape does this platform need, and is the state important enough to pay consensus cost?
The Control-Plane Workload
Imagine a scheduler managing a fleet of workers. It needs to know which scheduler instance is leader, which workers are alive, which jobs are assigned, and which configuration revision is active. Those decisions affect many services, so the system needs one authoritative story.
That does not mean the coordination store should carry every request, metric, or business event. It should hold metadata whose disagreement is dangerous:
- leader-election records
- leases and ownership claims
- service registration and health metadata
- cluster membership and naming state
- small configuration values on which every node must agree
- controller progress markers and coordination locks
The workload is usually read-heavy, watch-heavy, and small-object oriented. Writes are meaningful and relatively scarce. If a team routes hot application traffic or large payloads through the same store, it turns a carefully protected control plane into a bottleneck.
What These Systems Package
All three systems provide a strongly coordinated core, but the surface primitives matter as much as the consensus protocol beneath them.
| System | Natural Mental Model | Coordination Style |
|---|---|---|
| etcd | Replicated control-plane KV | Watches, leases, revisions, transactions |
| Consul | Discovery and health-aware coordination | Service catalog, health checks, KV, sessions |
| ZooKeeper | Coordination tree | Znodes, watches, sessions, ephemeral nodes |
etcd is strongly associated with Kubernetes-style control loops. Controllers watch a keyspace, observe revisions, compare current state with desired state, and write updates through a Raft-backed API. This fits systems that need a compact, strongly consistent metadata store for controllers.
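The compare-and-write step of such a control loop can be sketched without any real client. The snippet below is a minimal illustration, assuming desired and actual state are plain dictionaries; `reconcile` and `apply_actions` are hypothetical helper names, not part of etcd or Kubernetes.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Return the writes needed to move `actual` toward `desired`."""
    actions = []
    for key, value in sorted(desired.items()):
        if actual.get(key) != value:
            actions.append(("put", key, value))
    for key in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", key, None))
    return actions


def apply_actions(actions: list, actual: dict) -> None:
    """Apply planned writes to the (in-memory) actual state."""
    for op, key, value in actions:
        if op == "put":
            actual[key] = value
        else:
            actual.pop(key, None)


desired = {"jobs/a": "worker-1", "jobs/b": "worker-2"}
actual = {"jobs/a": "worker-1", "jobs/b": "worker-9", "jobs/c": "worker-3"}

plan = reconcile(desired, actual)
apply_actions(plan, actual)
second_plan = reconcile(desired, actual)  # nothing left to do
```

Because the loop compares whole states rather than replaying individual events, running it again after a missed notification is harmless: the second pass produces an empty plan.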
Consul is often a natural fit when service discovery and health are central to the workload. Its coordination story includes KV and sessions, but its operational value often comes from combining service registration, health checks, DNS/API discovery, and datacenter-aware workflows.
ZooKeeper uses a tree of znodes and session-oriented primitives. Ephemeral nodes disappear when a session ends, making them useful for membership, presence, and leader-election patterns. Watches let clients react to changes, but their semantics need care: classic ZooKeeper watches are one-time triggers that must be re-registered after they fire, so they should not be treated as a continuous event stream.
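The ephemeral-node idea can be illustrated with a toy in-memory tree. This is not the ZooKeeper client API: `ToyTree`, `create`, `children`, and `close_session` are hypothetical names chosen for the sketch, and session expiry is triggered by hand rather than by a timeout.

```python
class ToyTree:
    """In-memory stand-in for a coordination tree (not a real ZooKeeper client)."""

    def __init__(self):
        self._nodes = {}  # path -> (data, owning session id or None)

    def create(self, path, data, session=None):
        """Create a node; passing a session id makes it ephemeral."""
        if path in self._nodes:
            raise KeyError(f"node exists: {path}")
        self._nodes[path] = (data, session)

    def children(self, prefix):
        """Return sorted paths directly under `prefix`."""
        base = prefix.rstrip("/") + "/"
        return sorted(p for p in self._nodes
                      if p.startswith(base) and "/" not in p[len(base):])

    def close_session(self, session):
        """Ending a session removes every ephemeral node it owned."""
        for path in [p for p, (_, s) in self._nodes.items() if s == session]:
            del self._nodes[path]


tree = ToyTree()
tree.create("/workers", b"")                       # persistent parent
tree.create("/workers/w1", b"host-a", session=1)   # ephemeral member
tree.create("/workers/w2", b"host-b", session=2)   # ephemeral member

members_before = tree.children("/workers")
tree.close_session(1)                              # worker 1 crashes or times out
members_after = tree.children("/workers")
```

Membership here is derived from which ephemeral children currently exist, so a crashed worker leaves the group without anyone having to delete its entry.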
Primitives Shape System Design
The coordination API changes how applications express ownership and change.
    controller / service / client
                  |
                  v
    [coordinated metadata system]
        |         |         |
    watches    leases    sessions
    revisions  health    ephemeral nodes
A lease says, "this ownership claim is valid only while renewal succeeds." That is useful for leader election and lock-like behavior, but it is not a timeless global mutex. A paused process can believe it still owns something after another node has legitimately taken over, so downstream systems may still need fencing tokens or revision checks.
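The paused-owner problem and the fencing-token remedy can be sketched with a logical clock. Everything below is a hypothetical in-memory model (`LeaseTable`, `FencedResource`, and the integer `now` values are assumptions for illustration), not the lease API of any of the three systems.

```python
class LeaseTable:
    """Toy lease manager driven by an explicit logical clock."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0
        self.token = 0  # fencing token: increases on every new grant

    def acquire(self, client, now):
        """Grant the lease if free or expired; return a fencing token or None."""
        if self.holder is None or now >= self.expires_at:
            self.holder = client
            self.expires_at = now + self.ttl
            self.token += 1
            return self.token
        return None

    def renew(self, client, now):
        """Extend the lease only for the current, unexpired holder."""
        if self.holder == client and now < self.expires_at:
            self.expires_at = now + self.ttl
            return True
        return False


class FencedResource:
    """Downstream system that rejects writes carrying an older token."""

    def __init__(self):
        self.highest_token = 0
        self.log = []

    def write(self, token, value):
        if token < self.highest_token:
            return False  # stale owner: fenced off
        self.highest_token = token
        self.log.append(value)
        return True


leases = LeaseTable(ttl=10)
resource = FencedResource()

token_a = leases.acquire("a", now=0)    # a becomes owner
# a pauses and misses renewal; its lease expires at logical time 10
token_b = leases.acquire("b", now=12)   # b legitimately takes over
ok_b = resource.write(token_b, "from b")
ok_a = resource.write(token_a, "from a")  # a wakes up still believing it owns the lease
```

The lease alone cannot stop the paused client from acting. The check happens at the resource, which remembers the highest token it has seen and refuses anything older.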
A watch says, "tell me when this coordinated state changes." That is useful for controllers and service discovery, but watches are not a substitute for durable event processing. Clients must handle reconnects, missed ranges, compaction, and resync from current state.
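The resync requirement can be made concrete with a small revisioned store whose history is bounded by compaction. This is a sketch under simplifying assumptions: `ToyStore`, `CompactedError`, and `catch_up` are hypothetical names, deletes are not modeled, and the snapshot read is trivially atomic because everything lives in one process.

```python
class CompactedError(Exception):
    """Raised when a watcher asks for history that has been compacted away."""


class ToyStore:
    """Revisioned in-memory KV with a bounded change history."""

    def __init__(self):
        self.revision = 0
        self.data = {}
        self.history = []         # list of (revision, key, value)
        self.compacted_below = 1  # oldest revision still available

    def put(self, key, value):
        self.revision += 1
        self.data[key] = value
        self.history.append((self.revision, key, value))

    def compact(self, revision):
        """Discard history entries older than `revision`."""
        self.history = [e for e in self.history if e[0] >= revision]
        self.compacted_below = max(self.compacted_below, revision)

    def events_since(self, revision):
        """Return events with revision >= `revision`, or fail if compacted."""
        if revision < self.compacted_below:
            raise CompactedError(revision)
        return [e for e in self.history if e[0] >= revision]


def catch_up(store, cache, next_revision):
    """Replay missed events, or rebuild the cache if history is gone."""
    try:
        for rev, key, value in store.events_since(next_revision):
            cache[key] = value
            next_revision = rev + 1
    except CompactedError:
        cache.clear()
        cache.update(store.data)            # resync from current state
        next_revision = store.revision + 1
    return next_revision


store = ToyStore()
store.put("cfg", "v1")                        # revision 1
cache = {}
next_rev = catch_up(store, cache, 1)          # normal path: replay events

store.put("cfg", "v2")                        # revision 2, client is disconnected
store.put("leader", "n1")                     # revision 3
store.compact(3)                              # revision 2 is no longer replayable
next_rev = catch_up(store, cache, next_rev)   # falls back to a full resync
```

The client never assumes it saw every intermediate event. It only guarantees that its cache converges to current state and that it knows which revision to watch from next.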
A compare-and-swap transaction says, "write this only if the state still matches what I observed." That is powerful for safe updates, but it assumes the state being guarded is small and worth coordinating.
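A revision-guarded write can be sketched as follows. The model is hypothetical: `ToyKV` and `put_if_unchanged` are illustrative names, and the per-key modification revision is only loosely modeled on the revisioned transactions described above, not copied from a real client library.

```python
class ToyKV:
    """In-memory KV that tracks a modification revision per key."""

    def __init__(self):
        self.revision = 0
        self._items = {}  # key -> (value, mod_revision)

    def get(self, key):
        """Return (value, mod_revision); a missing key reads as (None, 0)."""
        return self._items.get(key, (None, 0))

    def put_if_unchanged(self, key, value, expected_mod_revision):
        """Write only if the key's revision still matches what the caller saw."""
        _, current = self.get(key)
        if current != expected_mod_revision:
            return False
        self.revision += 1
        self._items[key] = (value, self.revision)
        return True


kv = ToyKV()
created = kv.put_if_unchanged("config/active", "v1", 0)     # create only if absent
_, seen = kv.get("config/active")                           # two writers read the same revision
first = kv.put_if_unchanged("config/active", "v2", seen)    # first writer wins
second = kv.put_if_unchanged("config/active", "v3", seen)   # second writer must re-read
final = kv.get("config/active")
```

The losing writer gets an explicit failure instead of silently overwriting, which is what lets it re-read, re-decide, and retry.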
Worked Example: A Control Plane Choice
Suppose a platform needs three things:
- Kubernetes-style controllers reconciling desired state
- health-aware service discovery for application traffic
- classic session-based leader election for older distributed jobs
If the central design is a Kubernetes-like control plane, etcd is the natural mental model: controllers read revisions, watch changes, and write small metadata updates through a replicated key-value API.
If the central design is a service catalog with health checks, discovery, and operational integration across services, Consul may fit better. The service registry and health model are part of the product shape, not add-ons.
If the platform already relies on tree-structured coordination, ephemeral membership nodes, and session semantics, ZooKeeper may be the clearer fit, especially in ecosystems that already speak its patterns.
The choice is not a universal ranking. It is a fit between primitives and workload.
The Production Trade-Off
Consensus-backed coordination buys one authoritative answer for critical metadata. The price is real:
- writes pay quorum communication and durable log cost
- slow disks or slow peers can hurt the whole cluster
- quorum loss stops writes until a majority of members is reachable again
- compaction, snapshots, and watch history need operational care
- client behavior under timeouts and session loss matters
That trade-off is worth paying for a leader lease, cluster membership, or control-plane configuration that must not split brain. It is usually not worth paying for user profiles, telemetry, large documents, queue payloads, or high-volume business events.
The rule of thumb is blunt: if the data is large, high-churn, user-facing, or merely convenient to store, it probably does not belong in the consensus store. If disagreement about it can break the control plane, it may belong there.
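The quorum cost listed earlier in this section follows from simple majority arithmetic. A short sketch, assuming a cluster where every member votes (the function names are illustrative):

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many voting members can fail while a majority remains."""
    return members - quorum_size(members)


# cluster size -> (quorum size, failures tolerated)
table = {n: (quorum_size(n), tolerated_failures(n)) for n in (3, 4, 5)}
```

A four-member cluster tolerates the same single failure as a three-member one while requiring a larger quorum for each write, which is why odd cluster sizes are the usual choice.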
Choosing Among Them
| Need | Likely Fit | Reason |
|---|---|---|
| Kubernetes-style controller metadata | etcd | Revisioned KV, watches, leases, and Raft-backed control-plane state |
| Integrated service discovery and health catalog | Consul | Service registration, health checks, discovery APIs, KV, and sessions |
| Session and ephemeral-node coordination tree | ZooKeeper | Znodes, watches, sessions, ephemeral presence, and classic coordination recipes |
This table is not a replacement for operational evaluation. It is a way to start from the coordination shape rather than from branding. The real decision should also include operator familiarity, ecosystem integration, client library behavior, backup/restore procedures, and failure-mode testing.
Common Misreadings
These systems are not general-purpose application databases. They expose storage APIs, but they are optimized for small, critical metadata that benefits from strong coordination.
A distributed lock is not automatically safe. Most lock-like APIs are lease or session based, so long pauses, timeouts, and delayed clients still require fencing, ownership checks, or idempotent downstream operations.
Watches are not durable queues. They are coordination notifications that need resync logic, especially after reconnects, compaction, or long client pauses.
Connections
The previous lesson on exactly-once, idempotency, and deduplication matters here because control-plane primitives often create the identities and ownership boundaries that make retries safe. A lease or compare-and-swap can coordinate a step, but the side effect still needs a safe boundary.
The next lesson on Jepsen-style verification follows naturally. Systems that hold leader elections, leases, watches, and critical metadata need evidence that their observable behavior still satisfies the contract under partitions, pauses, failover, and client retries.
Resources
- [DOC] etcd Documentation
  - Focus: Watch, lease, transaction, revision, snapshot, and operational behavior for Raft-backed control-plane state.
- [DOC] Consul Documentation
  - Focus: Service discovery, health checks, sessions, KV, and datacenter-oriented operations.
- [DOC] Apache ZooKeeper Documentation
  - Focus: Znodes, sessions, watches, ephemeral nodes, and coordination recipes.
- [PAPER] ZooKeeper: Wait-free Coordination for Internet-scale Systems
  - Focus: The coordination-service model and why ZooKeeper exposes primitives instead of a generic database.
Key Takeaways
- etcd, Consul, and ZooKeeper are coordination systems first; storage is the surface through which they expose strongly ordered metadata.
- The right choice depends on primitives and workload shape: controller KV, discovery and health, or session-oriented coordination tree.
- Consensus cost is worth paying for small critical control-plane state, not for bulk application data or high-throughput hot paths.