LESSON
Day 005: Consensus Algorithms: Raft and Paxos
Consensus is what you pay for when “close enough” is no longer safe.
Today's "Aha!" Moment
Picture a five-node control-plane cluster managing which machines belong to the platform and which node is currently the leader. One node crashes. Messages are delayed. Two other nodes stop hearing from each other. The cluster is now under pressure to answer one question that cannot tolerate ambiguity: what is the official decision?
This is the real shape of consensus. It is not generic distributed voting, and it is not “replication but stricter.” It is the mechanism you need when the system must preserve one authoritative decision or one authoritative history even while nodes fail and the network becomes untrustworthy.
That is why consensus shows up in metadata stores, control planes, and replicated logs. Those systems are not trying to spread information eventually. They are trying to prevent the existence of two incompatible truths. If the system can tolerate temporary divergence and later repair, you usually do not want consensus. If the system cannot tolerate split authority, you probably do.
Signals that consensus is actually the right topic:
- the system must elect one leader, not several
- the system must commit one ordered log, not multiple competing histories
- a minority of nodes must not be able to declare truth by themselves
- the safe response to some failures is to stop and wait rather than guess
The common mistake is to treat consensus as a fancy default for distributed systems. It is not. It is expensive on purpose. The cost buys one precious thing: the ability to say "this is the official system state" without lying to yourself during failure.
Why This Matters
Many distributed systems only need propagation, caching, or eventual convergence. Consensus is overkill for those cases. But some subsystems are different: cluster membership, leader election, lock ownership, metadata changes, and ordered commit decisions all break badly if two conflicting answers become valid at the same time.
That is why consensus matters so much in practice. It is often not on the hot path for user data, but it is very often on the hot path for control. Kubernetes, etcd, ZooKeeper-style systems, and replicated log architectures all depend on the idea that there is one official history that later decisions must respect.
If you do not understand consensus, two kinds of design errors become likely. The first is underusing it and accidentally allowing split-brain behavior. The second is overusing it and forcing expensive quorum coordination into places where looser guarantees would have been enough. Good engineering depends on knowing the difference.
Learning Objectives
By the end of this session, you will be able to:
- State the consensus problem precisely - Explain why some distributed decisions require one agreed answer or one agreed order.
- Recognize the safety backbone shared by Raft and Paxos - Explain quorum overlap, leadership rounds, and why later decisions must respect earlier chosen state.
- Choose consensus intentionally - Distinguish problems that need consensus from problems that only need dissemination or eventual repair.
Core Concepts Explained
Concept 1: Consensus Protects Invariants That Cannot Split
Suppose a cluster scheduler stores all placement decisions in a replicated metadata log. If two leaders both believe they are authoritative and both accept conflicting updates, the cluster does not get "eventual consistency." It gets incompatible commands and undefined ownership.
That is the kind of invariant consensus exists to protect. Some truths may be allowed to drift temporarily. Leader identity, committed log position, and authoritative metadata often are not.
The key idea is that the system would rather delay progress than accept two incompatible official outcomes. That sounds conservative, but it is exactly what safety demands. Consensus is the discipline of preferring temporary unavailability over contradictory authority.
A useful mental split is:
replication = copy state to multiple nodes
consensus = decide when one state or one log entry is officially chosen
Replication alone does not answer "which version is the official one?" Consensus does.
The trade-off is direct. You gain a defensible definition of official system truth. You pay with coordination cost, quorum waits, and the possibility that the system must pause some operations when it cannot form a safe majority.
Concept 2: Quorum Overlap Is the Core Safety Lever
The deepest safety idea in both Raft and Paxos is simpler than the surrounding protocol details: two majorities in the same cluster must overlap. That overlap is what lets the system carry forward previous decisions instead of allowing two incompatible committed outcomes.
In a five-node cluster:
majority A = {1, 2, 3}
majority B = {3, 4, 5}
overlap = {3}
That shared node is not magic by itself, but the protocol rules force later rounds to account for what previous majorities may already have accepted. This is why quorums matter so much more than raw message exchange.
The beginner temptation is to think consensus works because one leader tells everyone what to do. That is only part of the story. The real safety story is that any new leader or proposer must respect what earlier quorums may already have made official.
This is also why minority partitions cannot safely keep committing decisions. They lack the overlap guarantee that keeps history coherent.
The trade-off is that quorum rules give strong safety with simple mathematical structure, but they also make some failures visible as halted progress. If a majority cannot communicate, safety wins and availability loses for consensus-governed writes.
Concept 3: Raft and Paxos Package the Same Demand Differently
Raft is usually taught operationally. You think in terms of leader election, follower heartbeats, replicated logs, commit indexes, and terms. It tells a concrete story engineers can narrate step by step.
Paxos is usually taught more abstractly. You think in terms of proposers, acceptors, proposal numbers, and the rule that later rounds must not violate what an earlier quorum may already have chosen. That abstraction is why Paxos often feels harder at first.
The important point is not to memorize them as rival tribes. Both are solving the same safety problem:
- no two conflicting values should both become chosen
- later rounds must preserve earlier chosen state
- progress should continue when enough healthy nodes remain
Raft gives you a clean leader-centered operational model:
followers wait
leader fails
new election starts
new leader replicates log entries
entry commits after majority acknowledgment
Paxos gives you a quorum-centered proof model:
proposer asks acceptors to promise
quorum responds
proposer must respect any previously accepted value
new value can be proposed only under those constraints
If you keep the shared safety demand in view, the family resemblance becomes obvious. Raft is often easier to operate and teach. Paxos is often more compact and more general as a reasoning framework. The choice is not about which one is “real consensus.” Both are.
The trade-off is pedagogical and operational. Raft gives clearer narrative and implementation structure. Paxos gives a more minimal and general safety formulation. Either way, consensus remains expensive because the invariant is expensive.
Troubleshooting
Issue: "Consensus is just replication with a leader."
Why it happens / is confusing: Raft is often introduced through replicated logs, so the extra safety rules can disappear behind the implementation story.
Clarification / Fix: Replication copies data. Consensus decides when one value or one log entry is officially chosen and prevents conflicting official histories.
Issue: "If a node is fast and trusted, why not let it decide alone?"
Why it happens / is confusing: A single fast authority seems simpler in normal conditions.
Clarification / Fix: Consensus is designed for failures. A single node cannot provide the overlap guarantees needed when the system loses messages, leaders, or connectivity.
Issue: "Consensus should be used everywhere because stronger correctness is always better."
Why it happens / is confusing: The word "correctness" makes looser coordination sound irresponsible.
Clarification / Fix: Consensus is the right cost only for invariants that truly require one official answer. Many problems only need dissemination, caching, or eventual convergence.
Advanced Connections
Connection 1: Consensus <-> Control Plane Design
The parallel: Consensus is most often used to protect coordination state rather than every piece of user data.
Real-world case: Systems like etcd use Raft for cluster metadata because split authority there would destabilize the whole platform.
Connection 2: Quorums <-> Fault Models
The parallel: Quorum mathematics only makes sense relative to the kinds of failures the system is assuming it can survive.
Real-world case: A five-node majority system can survive some crash failures, but it cannot keep safely committing writes if the network removes majority communication.
Resources
Optional Deepening Resources
- [PAPER] In Search of an Understandable Consensus Algorithm (Raft)
- Link: https://raft.github.io/raft.pdf
- Focus: Read the introduction plus leader election and log replication sections to see consensus from an operational angle.
- [INTERACTIVE] The Secret Lives of Data: Raft
- Link: https://thesecretlivesofdata.com/raft/
- Focus: Step through elections and log replication to build intuition before worrying about proofs.
- [PAPER] Paxos Made Simple
- Link: https://lamport.azurewebsites.net/pubs/paxos-simple.pdf
- Focus: Read it after Raft and look for the shared quorum-safety idea rather than the notation differences.
- [DOC] etcd: Why etcd Uses Raft
- Link: https://etcd.io/docs/v3.5/learning/why/
- Focus: Connect consensus theory to a production control-plane system.
Key Insights
- Consensus is for invariants that cannot split - It exists when the system must defend one leader, one decision, or one committed history.
- Quorum overlap is the safety backbone - The protocol details differ, but both Raft and Paxos rely on later majorities respecting what earlier majorities may already have chosen.
- Consensus is intentionally expensive - The latency and availability costs are the price of preserving one official truth during failure.