Final Capstone: Consensus-Backed Control Plane Architecture
The core idea: A consensus-backed control plane is ready when its designers can defend exactly which state needs one authoritative history, and can show that they accept what strong coordination guarantees cost: added latency, operational discipline, recovery complexity, and failure testing.
Core Insight
Imagine you are designing the control plane for a regional compute platform. It stores desired deployments, elects controllers, publishes service endpoints, recovers from member failure, and exposes watches to agents across the fleet. Some state needs one authoritative story. Some state is important but should stay out of the consensus path.
The design task is to defend that boundary. Consensus is valuable when disagreement about authority would make the platform unsafe. It is harmful when teams push high-volume, low-authority data through it because they have confused "important" with "must be serialized."
The misconception is that a consensus-backed design is correct once it names Raft, Paxos, etcd, or ZooKeeper. The actual design must specify the committed state, the API semantics, the read and lease guarantees, the recovery path, and the evidence that proves the system survives failure.
Strong coordination buys clarity of authority, but it spends latency, operational discipline, and a tighter failure envelope. A good architecture makes that exchange visible enough to review.
Scenario and Constraints
The platform needs these capabilities:
- store desired deployment state,
- elect one active scheduler per shard,
- expose watches so controllers can reconcile,
- assign monotonically increasing revisions,
- recover from one zone failure,
- avoid stale controllers mutating external resources,
- keep telemetry and bulk logs outside the consensus store.
The system has three zones. The first target is a three-member consensus cluster, one member per zone. Writes must commit through quorum. Controllers use watches from known revisions. Lease grants include fencing tokens that downstream resources check.
That architecture is intentionally narrow. It uses consensus for control decisions, not for every event the platform observes.
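The quorum arithmetic behind the three-member target can be checked with a short sketch. The function names `quorum_size` and `tolerated_failures` are illustrative and not part of any particular consensus library; the sketch assumes a simple majority quorum over voting members.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Members that can be lost while a majority is still reachable."""
    return members - quorum_size(members)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} members: quorum {quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

With one member per zone, three members give a quorum of two, so losing one zone leaves a majority. A fourth member would raise the quorum to three without tolerating a second failure, which is one reason odd cluster sizes are the usual choice.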
Authority Boundary
Start with the boundary:
Inside consensus:
- desired deployment specs
- scheduler lease ownership
- rollout phase and gates
- service endpoint authority
- membership metadata
- monotonically increasing revisions
Outside consensus:
- metrics
- logs
- traces
- large artifacts
- image layers
- derived caches
- high-volume status samples
The inside set is small because it answers authority questions. Who owns the scheduler shard? Which deployment spec is current? Which rollout gate is open? Which member set can decide history?
The outside set can be large, useful, and operationally critical without needing consensus serialization. Logs, metrics, and artifacts need durability and retrieval, but they usually do not need one global order that every controller must agree on before acting.
The review question is:
Would two different answers make the platform unsafe?
If yes, the state may belong in consensus. If no, a cheaper replicated store, event pipeline, object store, or cache is probably the better fit.
API Walkthrough
The control plane exposes a small API surface:
put-if-revision(key, expected_revision, value)
grant-lease(role, ttl) -> token
watch(prefix, from_revision)
snapshot-status()
member-status()
Desired state changes use conditional writes so operators and automation do not overwrite stale state silently. Scheduler leadership uses leases with fencing tokens. Watchers resume from revisions and rebuild local caches after disconnect. Snapshot, compaction, and membership status are first-class because recovery time matters.
The API should force clients to handle stale assumptions. A failed put-if-revision means the client must re-read and recompute. A lost lease means the controller must stop acting until it obtains a fresh token. A watch that falls behind compaction must return a clear resync signal instead of silently skipping history.
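The conditional-write contract can be illustrated with a minimal single-process sketch. `ConditionalStore`, `RevisionMismatch`, and `update_with_retry` are hypothetical names invented for this example. A real control plane would commit each write through quorum before assigning a revision, and this sketch skips that step entirely.

```python
from __future__ import annotations


class RevisionMismatch(Exception):
    """Raised when the caller's expected revision is stale."""


class ConditionalStore:
    """In-memory stand-in for a consensus-backed key-value store (no replication)."""

    def __init__(self) -> None:
        self.revision = 0  # global, monotonically increasing
        self._data: dict[str, tuple[int, str]] = {}  # key -> (mod_revision, value)

    def get(self, key: str) -> tuple[int, str | None]:
        """Return (mod_revision, value); revision 0 means the key is absent."""
        return self._data.get(key, (0, None))

    def put_if_revision(self, key: str, expected_revision: int, value: str) -> int:
        """Write only if the key is still at the revision the caller last read."""
        current, _ = self.get(key)
        if current != expected_revision:
            raise RevisionMismatch(
                f"{key}: expected revision {expected_revision}, found {current}"
            )
        self.revision += 1
        self._data[key] = (self.revision, value)
        return self.revision


def update_with_retry(store: ConditionalStore, key: str, compute, attempts: int = 3) -> int:
    """Re-read and recompute on each conflict instead of overwriting blindly."""
    for _ in range(attempts):
        revision, old_value = store.get(key)
        try:
            return store.put_if_revision(key, revision, compute(old_value))
        except RevisionMismatch:
            continue  # someone else wrote first; read again and recompute
    raise RevisionMismatch(f"{key}: gave up after {attempts} attempts")
```

The retry helper recomputes the new value from a fresh read on every attempt. Retrying the original value unchanged would reintroduce the silent overwrite that the conditional write exists to prevent.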
Worked Flow: Deployment Update
operator submits desired deployment v12
client reads current revision 481
client writes put-if-revision(/deploy/app, 481, v12)
consensus commits revision 482
watchers receive revision 482
active scheduler reconciles desired state to actual resources
downstream writes include scheduler fencing token
This flow ties each action to evidence. The desired state update is guarded by a revision. Watchers know which committed revision they are processing. The scheduler acts only while it has current lease authority. Downstream resources can reject stale scheduler writes.
The same flow also shows what should not be in consensus. Per-pod logs, trace spans, image blobs, and high-frequency health samples may be related to the deployment, but they do not define the authoritative desired state. They should not share the critical write path.
Failure Review
A credible design names the failures it expects and the evidence that preserves safety.
Partition between zones:
- a majority side continues,
- minority-side leaders cannot commit new authority,
- stale actors are fenced by newer tokens,
- clients on the minority side receive errors or reads that are explicitly allowed to be stale.
Slow disk on the leader:
- commit latency rises,
- alerts fire on fsync and proposal latency,
- leadership movement is investigated, not treated as ordinary noise,
- controllers back off instead of creating retry storms.
Watcher falls behind compaction:
- client receives a clear compaction or revision-too-old error,
- client reloads from snapshot or current state,
- reconciliation remains idempotent.
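The compaction case above can be sketched as a watcher that either applies every event from its next revision or rebuilds its cache from a snapshot. `EventLog`, `Watcher`, and `CompactedError` are hypothetical names for this illustration. The sketch assumes a single in-memory log and a polling watcher, which simplifies away the streaming and replication a real store would have.

```python
class CompactedError(Exception):
    """The requested revision has been compacted away; the client must resync."""


class EventLog:
    """In-memory stand-in for a revisioned, compactable event history."""

    def __init__(self):
        self.revision = 0
        self.compacted_through = 0
        self.events = []   # (revision, key, value) tuples not yet compacted
        self.state = {}    # current value of every key

    def append(self, key, value):
        self.revision += 1
        self.events.append((self.revision, key, value))
        self.state[key] = value
        return self.revision

    def compact(self, revision):
        """Discard history at or below `revision`."""
        self.events = [e for e in self.events if e[0] > revision]
        self.compacted_through = max(self.compacted_through, revision)

    def watch(self, from_revision):
        """Return all events at or after `from_revision`, or fail loudly."""
        if from_revision <= self.compacted_through:
            raise CompactedError(f"revision {from_revision} was compacted")
        return [e for e in self.events if e[0] >= from_revision]

    def snapshot(self):
        return self.revision, dict(self.state)


class Watcher:
    def __init__(self, log):
        self.log = log
        self.cache = {}
        self.next_revision = 1
        self.resyncs = 0

    def poll(self):
        try:
            events = self.log.watch(self.next_revision)
        except CompactedError:
            # History is gone: rebuild from current state rather than skip events.
            revision, state = self.log.snapshot()
            self.cache = state
            self.next_revision = revision + 1
            self.resyncs += 1
            return
        for revision, key, value in events:
            self.cache[key] = value  # idempotent: applying twice gives the same cache
            self.next_revision = revision + 1
```

Because applying an event only overwrites a key with its value, replaying events after a resync leaves the cache unchanged, which is the idempotence the reconciliation bullet relies on.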
Controller pauses after receiving a lease:
- lease expires,
- a newer controller receives a newer token,
- downstream systems reject the old token when the paused controller resumes.
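The paused-controller case can be sketched with a downstream resource that remembers the highest fencing token it has seen. `LeaseService`, `FencedResource`, and `StaleToken` are hypothetical names for this illustration. The sketch assumes tokens are integers that increase with every grant, and it leaves out TTL handling so the fencing check stays in focus.

```python
class StaleToken(Exception):
    """A write carried a fencing token older than one already accepted."""


class LeaseService:
    """Stand-in for the consensus store's lease grants (TTL omitted)."""

    def __init__(self):
        self._last_token = 0

    def grant(self, role):
        # Every grant returns a strictly larger token than any earlier grant.
        self._last_token += 1
        return self._last_token


class FencedResource:
    """Downstream resource that enforces fencing outside the consensus store."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        if token < self.highest_token:
            raise StaleToken(f"token {token} is older than {self.highest_token}")
        self.highest_token = token
        self.value = value


if __name__ == "__main__":
    leases = LeaseService()
    resource = FencedResource()

    old_token = leases.grant("scheduler/shard-0")  # controller A, which then pauses
    new_token = leases.grant("scheduler/shard-0")  # controller B, after A's lease expires
    resource.write(new_token, "placed by B")

    try:
        resource.write(old_token, "placed by A")   # A resumes with its old token
    except StaleToken as exc:
        print("rejected:", exc)
```

The check lives in the resource, not in the controller, because a paused controller cannot be trusted to notice that its own lease has expired.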
Permanent quorum loss:
- ordinary progress stops,
- the disaster recovery runbook distinguishes member replacement from forced recovery,
- operators state which snapshot or survivor is being trusted,
- downstream teams know whether acknowledged writes may be lost.
The design is not trying to hide these trade-offs. It is trying to make them reviewable.
Invariants and Tests
The architecture is not ready until its guarantees are testable.
Core invariants include:
- at most one active scheduler per shard can successfully mutate downstream resources,
- every applied deployment revision came from a committed put-if-revision,
- watchers either process every relevant revision or perform a full resync,
- stale lease tokens are rejected outside the consensus store,
- member replacement preserves committed history unless the runbook explicitly enters forced recovery,
- telemetry and artifacts cannot block consensus commits by sharing the critical path.
Failure tests should exercise partitions, process pauses, slow disks, compaction gaps, client retries, leader changes, member replacement, snapshot restore, and forced recovery drills. Jepsen-style history checking is useful because it tests the claims the architecture makes, not just the happy-path implementation.
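As a much smaller cousin of that kind of history checking, the watcher invariant can be tested against a recorded history. The function `check_watch_history` and the `("apply", revision)` / `("resync", revision)` record format are assumptions made for this sketch. It also assumes the watcher observes every revision, so a gap between applied revisions counts as a skipped event; a prefix watch would need the checker to know which revisions were relevant.

```python
def check_watch_history(history):
    """Return a list of violations; an empty list means the history is acceptable.

    `history` is a sequence of (kind, revision) pairs recorded by one watcher,
    where kind is "apply" (one event processed) or "resync" (cache rebuilt
    from a snapshot taken at that revision).
    """
    violations = []
    last = None  # last revision the watcher's cache reflects
    for step, (kind, revision) in enumerate(history):
        if kind == "resync":
            if last is not None and revision < last:
                violations.append(
                    f"step {step}: resync moved backwards from {last} to {revision}"
                )
        elif kind == "apply":
            if last is not None and revision != last + 1:
                violations.append(
                    f"step {step}: applied {revision} after {last} without a resync"
                )
        else:
            raise ValueError(f"unknown record kind: {kind!r}")
        last = revision
    return violations


if __name__ == "__main__":
    ok = [("apply", 1), ("apply", 2), ("resync", 7), ("apply", 8)]
    bad = [("apply", 1), ("apply", 3)]
    print(check_watch_history(ok))
    print(check_watch_history(bad))
```

A checker like this is only useful if the history is recorded during fault injection, for example while compaction and disconnects are being forced on the watcher under test.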
Crash-Fault or Byzantine?
For this regional compute platform, crash-fault consensus is probably the right default if all consensus members live inside one administrative trust boundary. The dominant risks are slow disks, partitions, bad placement, stale controllers, operator mistakes, and recovery ambiguity.
Byzantine consensus becomes relevant if the control plane spans organizations, untrusted operators, public validators, or adversarial infrastructure. Then the design must add stable identities, key management, authenticated votes, quorum certificates, and a stronger threat model.
The capstone decision is not "which protocol is more advanced?" It is "which fault model matches the trust boundary we actually have?"
Readiness Check
Before this control plane is ready, the team should be able to answer:
- Which state is authoritative and why?
- Which operations require linearizable reads?
- Which leases need fencing tokens?
- What metrics warn that consensus latency is leaving the safe envelope?
- How does a watcher resume after disconnect or compaction?
- What is the exact recovery path after quorum loss?
- Which invariants will be tested under partitions, pauses, retries, and slow disks?
- What state is deliberately outside consensus, and what stores it instead?
- What is the trust boundary: crash-fault only or Byzantine?
If any answer is vague, the design is not finished.
Resources
- [DOC] etcd Documentation
- Focus: Use the API and operations docs as a concrete reference for revisions, watches, leases, and cluster operation.
- [DOC] Kubernetes API Concepts
- Focus: Study resource versions, watches, and reconciliation patterns.
- [PAPER] The Chubby Lock Service for Loosely-Coupled Distributed Systems
- Focus: Compare lock service API semantics with the capstone boundary.
- [DOC] Jepsen Analyses
- Focus: Use failure analyses as prompts for invariant and history testing.
Key Takeaways
- A consensus-backed control plane should protect authority, not absorb every important byte.
- API semantics, leases, watches, fencing, membership, and recovery are part of the consensus design.
- The architecture needs explicit boundaries for authoritative state, high-volume non-authoritative data, and trust assumptions.
- The design is ready only when its invariants and recovery claims can be tested under realistic failure.