Consensus Problem and Fault Models


Day 487: Consensus Problem and Fault Models

The core idea: Consensus lets a cluster make one durable decision under failure, but the guarantee is only as strong as the fault model the protocol assumes.


Today's "Aha!" Moment

In 055.md, Harbor Point learned why 2PC blocks when the coordinator disappears after participants have prepared. The next design move is not "replicate the coordinator somehow" in the abstract. It is to build a small cluster that can answer one concrete question safely: who is allowed to decide the next irreversible booking action for suite S12 when one node crashes or a cross-region link starts dropping packets?

The non-obvious insight is that consensus is not the same thing as "all replicas eventually contain the same bytes." Harbor Point already knew how to replicate state. The hard part is making sure the cluster never decides two different outcomes for the same slot in the decision log. If harbor-1 decides "assign suite S12 to booking #8841" while harbor-2 decides the same suite belongs to booking #8849, replication has not helped; it has simply spread the inconsistency.

That is why fault models are part of the problem statement, not background detail. A protocol that works when nodes only crash and recover with intact disks is not automatically valid when nodes can equivocate, corrupt state, or send conflicting messages. "Consensus" without an explicit failure model is as incomplete as "transaction" without an isolation level.


Why This Matters

For Harbor Point, the production consequence is immediate. The cluster that owns booking authority must never double-sell the last suite on a ship, and it must not allow a rebooted follower to resurrect stale leadership after a partition heals. Those are safety requirements, not mere performance goals.

Consensus is the mechanism that turns "one of these replicas should probably coordinate" into a defensible contract: one ordered log, one chosen value per slot, and one current leadership term that stale nodes cannot overrule. The trade-off is equally concrete. To get that safety, Harbor Point gives up the ability for any isolated minority replica to keep taking writes, and every committed decision now depends on quorum health, durable metadata, and careful timeout handling.

This is why teams use consensus sparingly. It is the right tool for cluster metadata, shard ownership, leader election, lock services, and any workflow where "there must be exactly one authoritative decision path" matters more than squeezing every write through the cheapest route. It is the wrong tool when the system can tolerate divergent local updates and reconcile later.

Core Walkthrough

Part 1: What the consensus problem actually is

Harbor Point replaces the single 2PC coordinator with a three-node booking authority cluster, so no single machine's crash can strand a prepared booking.

Every booking decision becomes a proposal for the next log slot. For slot 812, two frontends might race:

proposal A: reserve suite S12 for booking #8841
proposal B: reserve suite S12 for booking #8849

Consensus is the problem of ensuring the cluster chooses at most one of those values for slot 812, and that every healthy replica eventually learns the same chosen value. That sounds simple until you remove the assumptions a single-process program quietly relies on: a single thread of control making the decision, state that cannot silently diverge between copies, and the ability to know immediately whether a component has failed rather than merely gone quiet.

From that environment come the core properties engineers care about: agreement (no two replicas decide different values for the same slot), validity (a decided value must actually have been proposed), and termination (every healthy replica eventually learns the decision). The first two are safety properties; termination is liveness.

The safety properties are the reason Harbor Point can trust the cluster with suite inventory. Liveness is what keeps the booking desks moving. The important discipline is that safety is non-negotiable, while liveness is always conditional on the fault model and the current network.
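To make the safety side concrete, here is a minimal sketch of what "at most one chosen value per slot" means from a learner's point of view. It is plain Python with illustrative names (DecisionLog is not anything from Harbor Point's actual stack); consensus exists to make this invariant hold across all replicas, not just inside one process.

```python
# Minimal sketch of the learner-side safety contract: at most one chosen
# value per decision slot. Class and value names are illustrative.

class ConflictingDecision(Exception):
    """Raised if a second, different value is ever chosen for a slot."""


class DecisionLog:
    def __init__(self):
        self._chosen = {}  # slot number -> chosen value

    def learn(self, slot, value):
        """Record a chosen value; re-learning the same value is harmless."""
        existing = self._chosen.get(slot)
        if existing is None:
            self._chosen[slot] = value
        elif existing != value:
            # Consensus must make this branch unreachable cluster-wide.
            raise ConflictingDecision(f"slot {slot}: {existing!r} vs {value!r}")
        return self._chosen[slot]


log = DecisionLog()
log.learn(812, "reserve suite S12 for booking #8841")   # first decision sticks
log.learn(812, "reserve suite S12 for booking #8841")   # idempotent re-learning
# log.learn(812, "reserve suite S12 for booking #8849") # would raise
```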

Part 2: Why the fault model changes the protocol

Consensus algorithms are built around explicit assumptions about what can go wrong. Harbor Point's design review needs to name those assumptions before it argues about Raft, Paxos, or replica counts.

| Fault model | What it means in Harbor Point's cluster | Design consequence |
| --- | --- | --- |
| Crash-stop | A node halts or becomes unreachable, but it does not fabricate messages | Majority quorum protocols can tolerate f crashes with 2f + 1 replicas |
| Crash-recovery | A node reboots later and must not forget accepted terms or log entries | Durable metadata on disk becomes part of correctness, not an optimization |
| Network delay / partition | A healthy leader may look dead to some peers for a while | Timeouts can trigger elections, but they cannot be treated as proof of failure |
| Byzantine | A node may lie, equivocate, or send different messages to different peers | Standard Raft/Paxos assumptions no longer hold; you need BFT protocols and more replicas |

The first three rows describe the world most production consensus systems target: crash faults in an asynchronous network that is usually stable enough for timeouts and leader elections to work. In that world, the mechanism is built around terms or ballots, durable voting state, and quorum intersection.
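The 2f + 1 arithmetic from the table is easy to sanity-check. The helper below is only an illustrative calculation, not a library API: with n = 2f + 1 replicas, any two majorities must share at least one replica, which is the overlap the next walkthrough relies on.

```python
# Illustrative quorum arithmetic for crash-fault consensus (not a real API).

def majority(n):
    """Smallest quorum size such that any two quorums must overlap."""
    return n // 2 + 1

def crashes_tolerated(n):
    """How many replicas can crash while a majority is still reachable."""
    return (n - 1) // 2

for n in (3, 5, 7):
    q = majority(n)
    # Two quorums of size q drawn from n replicas overlap in >= 2q - n nodes.
    assert 2 * q - n >= 1
    print(f"n={n}: quorum={q}, tolerates {crashes_tolerated(n)} crash(es)")
```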

Harbor Point can reason about one log slot like this:

1. A leader for term 41 proposes "reserve S12 for #8841" in slot 812.
2. A majority persists the proposal with term 41.
3. Because any future majority overlaps with this one, a later leader cannot safely choose a conflicting value for slot 812.
4. Once the entry is known to be chosen, replicas apply it in order to the booking state machine.

That overlap rule is the heart of crash-fault consensus. It is what lets the cluster survive leader changes without losing the one-decision property. A future leader may be slower, newer, or located in another region, but it still has to respect what an intersecting quorum already made durable.
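A simplified, Paxos-flavored sketch of that per-replica discipline is below, reduced to a single slot. The class and method names are illustrative, and real protocols layer leader election, log matching, and commit rules on top; the point is only that a replica refuses stale terms and that its promised/accepted state must be durable.

```python
# Simplified single-slot sketch of the rules behind quorum intersection.
# Illustrative names; real Raft/Paxos implementations differ in detail.

from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SlotState:
    promised_term: int = 0              # highest term this replica has promised
    accepted_term: int = 0              # term under which a value was accepted
    accepted_value: Optional[str] = None


class Replica:
    def __init__(self):
        # In production this state must survive reboots (crash-recovery model).
        self.slot = SlotState()

    def prepare(self, term: int) -> Optional[Tuple[int, Optional[str]]]:
        """A would-be leader for `term` asks what was already accepted."""
        if term <= self.slot.promised_term:
            return None                  # stale leader: refuse
        self.slot.promised_term = term
        return (self.slot.accepted_term, self.slot.accepted_value)

    def accept(self, term: int, value: str) -> bool:
        """Accept a proposal unless a higher term has since been promised."""
        if term < self.slot.promised_term:
            return False
        self.slot.promised_term = term
        self.slot.accepted_term = term
        self.slot.accepted_value = value
        return True
```

A new leader first gathers prepare responses from a majority; if any response reports an already accepted value, the leader must propose that value rather than its own, which is how the intersecting quorum forces later terms to respect what earlier terms made durable.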

Timing assumptions matter just as much. In a fully asynchronous system, a timeout cannot distinguish "peer crashed" from "packet delayed for 30 seconds." That is why practical consensus systems promise safety always, but liveness only when the network eventually behaves well enough for one leader to gather a quorum. Timeouts are for suspicion and recovery, not for proving facts about the world.
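The sketch below (again illustrative, not Raft itself) separates the two roles a timeout actually plays: a randomized timeout only raises suspicion by starting a candidacy in a higher term, and leadership changes only if a majority grants votes.

```python
# Illustrative sketch: a timeout expresses suspicion; only a quorum of votes
# in a higher term actually changes who may decide. Not Raft itself.

import random


def election_timeout_ms(base_ms=150, spread_ms=150):
    """Randomized so that two suspicious followers rarely collide."""
    return base_ms + random.uniform(0, spread_ms)


def suspects_leader(ms_since_heartbeat, timeout_ms):
    """Silence past the timeout is grounds for suspicion, not proof of a crash."""
    return ms_since_heartbeat > timeout_ms


def wins_election(votes_granted, cluster_size):
    """Leadership for the new term requires a majority; a slow-but-alive old
    leader is simply outvoted and later discovers the higher term."""
    return votes_granted >= cluster_size // 2 + 1
```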

Part 3: Production implications and trade-offs

Once Harbor Point sees consensus as "one chosen log under a defined failure model," design decisions become less mystical.

The cluster can safely own the control-plane decisions: current leadership for the booking authority, shard and suite-inventory ownership, and the ordered log of authoritative booking commands.

The cluster should not become the path for every analytics event or every low-value write. Consensus adds a real trade-off: every committed write pays a quorum round trip, and replicas isolated in a minority partition cannot accept writes at all.

This trade-off is why production systems often separate the consensus-backed control plane from the high-throughput data plane. Harbor Point uses consensus to decide who owns the shard and which booking command is authoritative. It does not run every downstream read model or customer activity stream through the same replicated decision path.
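One way to picture that boundary, under assumed interfaces (consensus_log.propose and event_bus.publish are placeholders, not a real framework or Harbor Point's actual code):

```python
# Assumed interfaces, for illustration only: consensus_log.propose(...) and
# event_bus.publish(...) stand in for whatever the real system exposes.

AUTHORITATIVE_COMMANDS = {"reserve_suite", "release_suite", "transfer_shard"}


def route(command_type, payload, consensus_log, event_bus):
    """Control-plane commands pay quorum latency; data-plane events do not."""
    if command_type in AUTHORITATIVE_COMMANDS:
        # One ordered, quorum-replicated decision path for booking authority.
        return consensus_log.propose({"type": command_type, **payload})
    # Analytics and activity streams tolerate divergence and later reconciliation.
    return event_bus.publish(command_type, payload)
```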

That boundary also explains why consensus is the answer to the 2PC limitation from the previous lesson, but not a universal replacement for every distributed coordination problem. Consensus protects the decision log itself. The application still has to design retries, idempotency, and state-machine commands carefully on top of that log.


Failure Modes and Misconceptions

Connections

Connection 1: 055.md showed why a single durable coordinator decision can block

Consensus generalizes that lesson by replicating the decision path itself. Instead of one coordinator log that can strand prepared work, the cluster uses quorum intersection so a later leader can recover the already chosen history.

Connection 2: 057.md turns this abstract problem into the engineer's Raft/Paxos mental model

This lesson defined the problem and the failure assumptions. The next lesson maps those assumptions to concrete protocol pieces such as leader election, ballots or terms, log matching, and commit rules.

Connection 3: ../consensus-and-coordination/001.md gives the formal safety/liveness vocabulary

If you want the proof-oriented framing, revisit that lesson after this one. The definitions of safety, liveness, and fault tolerance there are exactly the constraints Harbor Point is making operational here.


Resources


Key Takeaways

  1. Consensus is about choosing one authoritative value per decision slot, not merely keeping replicas eventually similar.
  2. The fault model is part of the protocol contract; crash faults, partitions, and Byzantine behavior lead to materially different designs.
  3. Safety and liveness are not symmetric goals in production consensus systems: safety must hold even when the network is ugly, while liveness depends on eventual quorum communication.
  4. The trade-off for a single authoritative decision path is quorum latency and reduced write availability during minority partitions, which is why consensus belongs in the control plane more often than the data plane.

