Consensus Foundations: Safety, Liveness, and Fault Models
The core idea: Consensus buys one authoritative decision under uncertainty, and the central trade-off is preserving safety even when progress has to wait for better conditions.
Core Insight
Imagine a small control plane that stores cluster membership. Three replicas normally agree that node-a, node-b, and node-c are the current voters. Then a network partition splits the cluster. One side can still talk to node-a; another side can still talk to node-b. Both sides receive an operator request to promote a replacement node.
If this were soft status information, temporary disagreement might be tolerable. A monitoring dashboard can show slightly stale health data and correct itself later. Membership is different. If both sides commit different configurations, the system now has two incompatible stories about who is allowed to make future decisions. That is not ordinary staleness. It is an authority failure.
Consensus exists for that boundary. It is the machinery a distributed system uses when "we will probably converge later" is not enough, because the system must avoid confirming two incompatible truths. The important misconception to correct early is that consensus is just a slower form of replication. Replication copies data. Consensus decides which value, command, or history is allowed to become authoritative.
The Decision Consensus Protects
The simplest useful consensus question is:
Which value is chosen for this decision point?
In a real system, the value might be a log command:
index 57 = add node-d as a voter
Several replicas may hear proposals. Messages may arrive late. A node may crash after sending one reply but before sending another. The protocol has to ensure that the cluster never treats two conflicting values as chosen for the same decision point.
That is why consensus often sits under:
- leader election
- replicated logs
- configuration changes
- lock and lease services
- metadata stores
- control planes
All of those use cases need one protected answer more than they need cheap information spread. Gossip can be excellent for disseminating observations. Consensus is for decisions where authority must not split.
Safety Comes Before Liveness
Two words carry much of the track:
- Safety: the bad thing does not happen.
- Liveness: the good thing eventually happens.
For consensus, a safety promise usually sounds like this:
The system never chooses two conflicting values for the same slot.
A liveness promise sounds different:
The system eventually chooses a value when conditions are good enough.
Those are not the same kind of guarantee. If a cluster refuses to commit while it is uncertain, users may see an outage or a stalled control plane, but the system has not corrupted authority. If the cluster commits conflicting values just to stay responsive, it may look healthy briefly and then leave operators with two histories that cannot both be true.
This is the core trade-off in consensus design:
preserve safety under uncertainty
make progress when assumptions allow it
That priority explains a lot of behavior that otherwise feels frustrating. A safe protocol may stop making progress during a bad partition. A timeout may trigger a retry or election, but it does not prove that a remote node is dead. The system is using suspicion to recover liveness, while the safety rules still prevent stale or conflicting authority from being accepted.
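One common way to keep stale authority from being accepted is to number leadership periods and reject anything older than what a replica has already seen. The sketch below borrows the idea of Raft-style terms as an assumption; the lesson does not name a specific mechanism, and `Replica`, `current_term`, and `accept_from_leader` are hypothetical names chosen for illustration.

```python
class Replica:
    """Tracks the highest leadership term this replica has seen (hypothetical)."""

    def __init__(self):
        self.current_term = 0

    def accept_from_leader(self, leader_term):
        """Accept a leader's request only if its term is not older than ours."""
        if leader_term < self.current_term:
            return False  # stale authority: a newer leader has already been seen
        self.current_term = leader_term
        return True


replica = Replica()
print(replica.accept_from_leader(1))  # True: first leader
print(replica.accept_from_leader(2))  # True: an election produced a newer leader
print(replica.accept_from_leader(1))  # False: the old leader is now stale
```

The timeout that triggered the election in term 2 may have been a false suspicion. The term check is what keeps that mistake from turning into two accepted leaders.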
Fault Models Shape the Whole Protocol
A fault model names what kind of bad behavior the protocol is designed to survive. This choice is not cosmetic.
Crash-fault consensus assumes nodes may stop, restart, or fail to respond. It does not assume a faulty node is intentionally lying. Many systems in one administrative domain use this model.
Omission and network faults cover delayed, dropped, duplicated, or reordered messages. They are why timeout behavior must be treated carefully: a missing response can mean crash, delay, pause, packet loss, or partition.
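As a minimal sketch of that caution, a timeout check can report suspicion instead of death. The names `PeerStatus` and `classify_peer` are hypothetical, and the timestamps are plain numbers passed in by the caller so the example stays deterministic.

```python
from enum import Enum


class PeerStatus(Enum):
    RESPONSIVE = "responsive"
    SUSPECTED = "suspected"  # deliberately no DEAD state: silence is ambiguous


def classify_peer(last_heard_at, now, timeout):
    """Classify a peer from the time since its last message (all in seconds)."""
    if now - last_heard_at > timeout:
        # Could be a crash, delay, pause, packet loss, or partition.
        return PeerStatus.SUSPECTED
    return PeerStatus.RESPONSIVE


print(classify_peer(last_heard_at=10.0, now=10.5, timeout=1.0))  # PeerStatus.RESPONSIVE
print(classify_peer(last_heard_at=10.0, now=12.0, timeout=1.0))  # PeerStatus.SUSPECTED
```

A suspected peer can justify a retry or an election attempt. It should not, on its own, justify treating that peer's earlier promises as void.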
Byzantine faults are a stronger adversary model. A Byzantine node may lie, equivocate, or send different claims to different peers. That requires a different family of protocols, stronger evidence, and usually more replicas.
At a high level:
crash-fault consensus:
often uses quorum intersection with 2f + 1 replicas to tolerate f crashed replicas
Byzantine-fault consensus:
often needs stronger evidence and 3f + 1 replicas to tolerate f Byzantine replicas
The numbers are not trivia. They show that "works under failure" only means something after the failure model is explicit. A crash-tolerant protocol is not automatically Byzantine-tolerant. A timeout-based election is not proof of failure. A majority quorum protects a different world than a quorum certificate with signed votes.
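The arithmetic behind those replica counts can be sketched in a few lines. The quorum sizes used here (f + 1 of 2f + 1 for crash faults, 2f + 1 of 3f + 1 for Byzantine faults) are the common textbook choices and are an assumption of this sketch, since the lesson only states the replica counts. The function names are hypothetical.

```python
def crash_fault_sizes(f):
    """Replica count and majority quorum commonly used to tolerate f crash faults."""
    n = 2 * f + 1
    quorum = f + 1
    return n, quorum


def byzantine_fault_sizes(f):
    """Replica count and quorum commonly used to tolerate f Byzantine faults."""
    n = 3 * f + 1
    quorum = 2 * f + 1
    return n, quorum


def min_overlap(n, quorum):
    """Smallest possible intersection of any two quorums drawn from n replicas."""
    return 2 * quorum - n


for f in (1, 2):
    n, q = crash_fault_sizes(f)
    print(f"crash     f={f}: n={n}, quorum={q}, min overlap={min_overlap(n, q)}")
    n, q = byzantine_fault_sizes(f)
    print(f"byzantine f={f}: n={n}, quorum={q}, min overlap={min_overlap(n, q)}")
```

In the crash model any two quorums share at least one replica, which is enough when replicas do not lie. In the Byzantine model the overlap is at least f + 1, so at least one replica in the overlap is not faulty.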
Worked Example: The Split-Brain Leader
Suppose a three-replica service uses a leader to serialize writes:
A, B, C
The network splits:
A | B, C
Replica A has not crashed. It is just isolated. If A keeps committing writes as leader while B and C elect a new leader and also commit writes, the service now has two authoritative histories.
A consensus system prevents this by tying commitment to evidence from a quorum. With three replicas, a majority requires two. The isolated A can still be alive, but it cannot gather a majority. The B, C side can make progress because it has quorum. If A later returns, it must learn the committed history rather than continue its isolated one.
The trade-off is visible:
- the minority side loses availability for protected writes
- the system avoids two committed histories
- recovery has a clear direction because quorum-backed history wins
That is the basic shape behind many later details in Paxos, Raft, ZAB, leases, and reconfiguration.
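The quorum check in the worked example can be sketched as a small function. This is an illustration of the majority rule only, not of any particular protocol, and `can_commit` is a hypothetical name.

```python
def can_commit(reachable, voters):
    """A side may commit only if it can reach a strict majority of the voters."""
    reachable_voters = set(reachable) & set(voters)
    return len(reachable_voters) > len(voters) // 2


voters = {"A", "B", "C"}
partition = [{"A"}, {"B", "C"}]  # the split from the example: A | B, C

for side in partition:
    print(sorted(side), can_commit(side, voters))
# ['A'] False
# ['B', 'C'] True
```

Because two disjoint sides cannot both hold a strict majority, at most one side of any partition passes this check.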
Common Misreadings
"The cluster eventually agrees, so safety held" is not a reliable inference. A system can commit conflicting decisions and later hide one during repair. Consensus safety cares about whether incompatible decisions were ever both chosen.
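One way to see the difference is to check the history of chosen values instead of the final state. The sketch below assumes a hypothetical audit trail of `(slot, value)` pairs, one per moment any replica treated a value as chosen. The second value for slot 57 is invented for illustration.

```python
from collections import defaultdict


def find_conflicting_slots(chosen_events):
    """Return {slot: values} for every slot that ever had more than one chosen value."""
    values_by_slot = defaultdict(set)
    for slot, value in chosen_events:
        values_by_slot[slot].add(value)
    return {slot: values for slot, values in values_by_slot.items() if len(values) > 1}


# Hypothetical history: during a partition, each side chose a value for slot 57.
history = [
    (57, "add node-d as a voter"),
    (57, "add node-e as a voter"),
]
# After repair every replica reports one value, so the final state looks consistent.
state_after_repair = {57: "add node-d as a voter"}

violations = find_conflicting_slots(history)
print(sorted(violations))  # [57]
```

The repaired state alone would pass a naive agreement check. The history shows that safety was violated at slot 57.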
"A timeout proves failure" is also wrong. A timeout proves only that the observer waited longer than its policy allowed. The remote node might be crashed, paused, partitioned, slow, or healthy but unreachable.
"One consensus protocol handles every fault model" is the third trap. The protocol, replica count, evidence rules, and trust assumptions all change when the fault model changes.
Connections
This lesson connects directly to the previous gossip track. Gossip is useful when temporary disagreement about observations is acceptable. Consensus is useful when disagreement about authority would be unsafe.
It also sets up the next lesson on FLP and failure detectors. Once safety and liveness are separate, the next question is whether a protocol can always guarantee both in a fully asynchronous system with failures. FLP explains why the answer is no for deterministic protocols in a fully asynchronous model, even with a single crash fault, and failure detectors explain how real systems recover practical progress.
Resources
- [PAPER] Paxos Made Simple
  - Focus: Watch how the protocol protects safety before worrying about common-case performance.
- [PAPER] In Search of an Understandable Consensus Algorithm
  - Focus: Use Raft to connect safety, liveness, leadership, and replicated logs.
- [PAPER] The Byzantine Generals Problem
  - Focus: Read for how changing the fault model changes the agreement problem.
Key Takeaways
- Consensus protects authoritative decisions, not ordinary information spread.
- Safety and liveness are different promises; consensus usually preserves safety even when progress stalls.
- The fault model determines what evidence, replica count, and protocol family are appropriate.