Day 209: Consensus Foundations: Safety, Liveness, and Fault Models

Consensus starts where "eventual agreement is probably fine" stops being acceptable. It exists for the moments when a distributed system must act as if there were only one authoritative history.


Today's "Aha!" Moment

Last month, gossip taught us how systems can spread knowledge cheaply while tolerating temporary disagreement. Consensus lives on the other side of that boundary. It appears when some disagreement is no longer harmless.

If several replicas are merely a little out of date about soft state, the system may still be okay. But if those replicas must decide who the leader is, which configuration is current, or which command is the next committed step in a log, then "we will probably converge later" is not good enough. The system needs one answer that everyone can eventually treat as authoritative.

That is the aha for consensus foundations. Before we study Paxos, Raft, or FLP, we need to separate three ideas that beginners often blur together:

  1. Safety - what the protocol must never allow.
  2. Liveness - what the protocol should eventually achieve.
  3. The fault model - the kinds of failure the protocol is assumed to survive.
Once those are separated, consensus protocols become much easier to read. They stop looking like ceremonial voting rituals and start looking like careful attempts to preserve safety under uncertainty while still making progress when conditions are good enough.

Why This Matters

Imagine a replicated control plane that stores cluster membership and leader information. Two nodes briefly lose contact with each other. If both can independently decide they are the new leader and commit conflicting changes, the system has not suffered a minor inconsistency. It has suffered a split-brain authority failure.

This is exactly where consensus matters. It is not just about "getting agreement eventually." It is about making sure the system does not confirm two incompatible truths at once.

That is why the foundations matter so much:

  1. Safety tells us which guarantees must hold even while things are failing.
  2. Liveness tells us when stalling is acceptable and when it is a bug.
  3. The fault model tells us what "tolerating failure" was ever supposed to mean.

In production, these confusions show up in very concrete ways:

  1. Teams read eventual convergence as proof that safety was preserved.
  2. Operators treat a fired timeout as proof that a node is dead.
  3. Architects deploy crash-tolerant protocols in environments where Byzantine behavior is possible.

This lesson gives us the vocabulary to avoid those mistakes before the month gets more technical.

Learning Objectives

By the end of this session, you will be able to:

  1. State the core problem consensus solves - Explain why some distributed decisions require one authoritative outcome rather than eventual soft convergence.
  2. Differentiate safety, liveness, and fault models - Describe what each property means and why protocols trade them differently.
  3. Read later consensus protocols more intelligently - Recognize quorum rules, timeout behavior, and replica counts as consequences of the chosen fault model.

Core Concepts Explained

Concept 1: Consensus Exists to Choose One Authoritative Decision Under Uncertainty

Concrete example / mini-scenario: A replicated log backs a metadata service. Clients submit commands like "add node X" or "rotate leader to Y." Several replicas receive proposals, messages may be delayed, and some nodes may fail partway through the process.

The system is not trying to learn every fact in the world. It is trying to answer a sharper question:

Which single value may every replica safely treat as the chosen one?
That is the consensus problem in its simplest useful form. Given several participants, unreliable timing, and possible failures, can the system end up with one agreed decision rather than several incompatible ones?

This is why consensus is different from broadcast or replication in general. Replication copies state. Broadcast moves messages. Consensus establishes authority.

You can picture the pressure like this:

many proposals
    -> uncertainty and delay
    -> one value becomes chosen
    -> replicas learn and extend the same history

The important mental shift is that consensus is fundamentally about committed history, not just communication. A system can communicate plenty and still fail consensus if different nodes are allowed to commit different next truths.

That is also why consensus often sits under:

  1. Leader election - deciding who may currently act with authority.
  2. Membership and configuration changes - deciding which cluster layout is current.
  3. Replicated logs - deciding which command is the next committed entry.
These are all places where the system needs one answer more than it needs cheap eventual spread.
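The quorum idea behind "one value becomes chosen" can be sketched in a few lines. This is a hypothetical illustration, not a real protocol: it only shows that if "chosen" means "accepted by a majority," then two conflicting values can never both be chosen, because any two majorities of the same replica set overlap.

```python
# Minimal sketch (hypothetical, not a real protocol): a value counts as
# "chosen" once a majority quorum of replicas has accepted it.
from itertools import combinations

REPLICAS = {"a", "b", "c", "d", "e"}
MAJORITY = len(REPLICAS) // 2 + 1  # 3 of 5

def is_chosen(accepting_replicas: set) -> bool:
    """A value is chosen when a majority of replicas has accepted it."""
    return len(accepting_replicas & REPLICAS) >= MAJORITY

# Any two majority quorums intersect, so two different values cannot
# each be chosen without at least one replica seeing both proposals.
for q1 in combinations(sorted(REPLICAS), MAJORITY):
    for q2 in combinations(sorted(REPLICAS), MAJORITY):
        assert set(q1) & set(q2), "majority quorums always overlap"

print(is_chosen({"a", "b", "c"}))  # True: 3 of 5 is a majority
print(is_chosen({"a", "b"}))       # False: 2 of 5 is not
```

The overlap check is the whole point: real protocols add proposal numbering and recovery rules on top, but quorum intersection is the mechanism that makes "one authoritative decision" possible at all.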

Concept 2: Safety and Liveness Are Different Promises, and Safety Usually Wins First

Concrete example / mini-scenario: A three-node cluster loses one node and experiences a long delay between the remaining two. Should it keep trying to make progress, or should it avoid committing anything uncertain?

This is where safety and liveness must be separated.

Safety means something bad never happens. In consensus, that usually means the system never chooses two conflicting values for the same decision point.

Liveness means something good eventually happens. In consensus, that usually means a value can eventually be chosen if conditions are favorable enough.

Those are not the same kind of promise.

If a system preserves safety but temporarily stalls, operators may complain about availability, but the system has not corrupted authority. If a system violates safety to keep moving, it may look available right until it creates irreversible inconsistency.

That is why many consensus protocols are designed with this priority:

first: do not choose conflicting truths
then: make progress when assumptions permit

This also explains why timeouts should be interpreted carefully. In many protocols, timeouts are not proofs of failure. They are liveness tools that help the cluster retry, re-elect, or move leadership when communication seems broken or too slow.

So when students hear:

"this protocol is always safe, but it is not guaranteed to keep making progress"
that is not a contradiction. It is often the intended design.

Concept 3: The Fault Model Quietly Determines the Whole Shape of the Protocol

Concrete example / mini-scenario: Two teams both say they need consensus. One assumes nodes may crash and recover. The other assumes compromised nodes may lie, equivocate, or send different messages to different peers. Those teams do not actually need the same protocol family.

This is where fault models enter.

A fault model defines what kinds of failure we assume:

  1. Crash faults - nodes may stop or stall (and possibly recover), but they never lie.
  2. Byzantine faults - compromised nodes may lie, equivocate, or send different messages to different peers.

That choice changes almost everything:

  1. How many replicas the system needs.
  2. What quorum sizes guarantee safe intersection.
  3. How much evidence nodes must exchange before trusting a message.

At a high level:

crash-fault consensus:
    usually built around quorum intersection with 2f + 1 replicas

Byzantine-fault consensus:
    usually needs stronger rules and often 3f + 1 replicas

The specific protocols will come later. For now, the important lesson is that protocols are answers to assumptions. If we silently change the fault model, we silently change the meaning of "works."
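The replica-count arithmetic above can be made concrete. This is a sketch of the standard sizing rules, not a protocol: crash-fault consensus typically uses n = 2f + 1 so that any two majority quorums intersect, while Byzantine consensus typically needs n = 3f + 1 so that any two quorums of size 2f + 1 intersect in at least f + 1 replicas, guaranteeing at least one honest node in common.

```python
# Sketch of standard replica-count arithmetic for tolerating f failures.
def crash_fault_replicas(f: int) -> int:
    """Minimum replicas to tolerate f crash faults."""
    return 2 * f + 1

def byzantine_replicas(f: int) -> int:
    """Minimum replicas to tolerate f Byzantine faults."""
    return 3 * f + 1

def byzantine_quorum(f: int) -> int:
    """Quorum size used with n = 3f + 1 replicas."""
    return 2 * f + 1

f = 1
n = byzantine_replicas(f)   # 4 replicas
q = byzantine_quorum(f)     # quorums of 3
overlap = 2 * q - n         # worst-case intersection of two quorums
# The overlap must exceed f so the shared replicas include an honest node.
assert overlap >= f + 1

print(crash_fault_replicas(1))  # 3
print(byzantine_replicas(1))    # 4
```

The extra replicas in the Byzantine case are not tuning; they are the price of assuming nodes may actively lie, which is why silently changing the fault model silently changes what "works" means.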

This is also the bridge to the next lesson. Once we ask for both safety and liveness in an asynchronous world with failures, theory pushes back. FLP is the formal expression of that pushback, and failure detectors are one of the practical ways systems cope with it.

Troubleshooting

Issue: "If the cluster eventually agrees, doesn't that mean safety was preserved?"

Why it happens / is confusing: Eventual convergence sounds like proof that all is well.

Clarification / Fix: No. A system may temporarily choose conflicting truths and only later heal or overwrite one side. Consensus safety cares about whether incompatible decisions were ever both committed, not whether the cluster later looked tidy again.
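A toy trace makes this concrete. This is a hypothetical commit log, not a real system: two partitioned nodes each commit a different value for the same decision slot, and later "healing" overwrites one side. The final state looks tidy, but safety was violated the moment both values were committed.

```python
# Hypothetical illustration: later convergence does not undo a safety
# violation that already happened.
history = []  # (node, slot, value) commit records, in order

def commit(node: str, slot: int, value: str) -> None:
    history.append((node, slot, value))

# During a partition, both sides commit slot 0 independently:
commit("node_a", 0, "leader=A")
commit("node_b", 0, "leader=B")

# Later "healing" overwrites node_b's side...
commit("node_b", 0, "leader=A")

# ...but the record shows two incompatible values were both committed
# for the same slot, which is exactly what consensus safety forbids.
values_during_partition = {v for (_, s, v) in history[:2] if s == 0}
print(len(values_during_partition) > 1)  # True: safety was violated
```

Consensus safety is a statement about the entire history of commits, not about the snapshot you see after the cluster settles down.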

Issue: "If a timeout fires, doesn't that prove the other node failed?"

Why it happens / is confusing: In everyday engineering, timeouts are often treated like evidence of death.

Clarification / Fix: In distributed systems, a timeout usually proves only that you waited too long. The node may be slow, partitioned, paused, or healthy but unreachable from your point of view.

Issue: "Why can't one protocol just handle all fault models?"

Why it happens / is confusing: "Consensus" sounds like one universal category.

Clarification / Fix: Different fault models require different assumptions, quorum sizes, and trust boundaries. Crash-tolerant consensus is not automatically Byzantine-tolerant consensus.

Advanced Connections

Connection 1: Consensus Foundations <-> Gossip

The parallel: Gossip tolerates temporary disagreement about soft state; consensus exists when disagreement about authority becomes unacceptable.

Real-world case: A system may use gossip to spread observations cheaply, but rely on consensus for leader election or committed configuration changes.

Connection 2: Consensus Foundations <-> FLP and Failure Detectors

The parallel: This lesson defines the promises we want. The next lesson explains why some combinations of those promises cannot be guaranteed in a fully asynchronous world with failures, and why practical systems lean on timing assumptions or failure detectors.

Real-world case: Election timeouts in Raft-like systems are not magic proofs of failure; they are practical liveness tools built on incomplete information.

Resources

Optional Deepening Resources

Key Insights

  1. Consensus is about authority, not just communication - The real question is which value or command may be treated as chosen.
  2. Safety and liveness are different promises - Protocols often prioritize "never commit conflicting truths" before "always keep moving."
  3. Fault models shape protocol families - Crash-tolerant and Byzantine-tolerant consensus solve related but not identical problems.

Knowledge Check (Test Questions)

  1. What is the clearest reason a system needs consensus?

    • A) To spread information to all nodes cheaply.
    • B) To choose one authoritative decision under failures and uncertainty.
    • C) To guarantee that every network packet arrives in order.
  2. Which statement best captures the difference between safety and liveness?

    • A) Safety is about something bad never happening; liveness is about something good eventually happening.
    • B) Safety and liveness are just two words for availability.
    • C) Liveness is always more important than safety in control planes.
  3. Why does the fault model matter so much?

    • A) Because it determines what kinds of failure the protocol must survive and therefore what evidence and quorum structure are needed.
    • B) Because it only affects performance tuning, not correctness.
    • C) Because all consensus protocols assume the same kinds of failures.

Answers

1. B: Consensus is needed when the system must pick one authoritative outcome instead of merely exchanging information.

2. A: Safety forbids conflicting committed truths; liveness concerns eventual progress when conditions allow it.

3. A: The fault model determines what "survive failure" actually means, and that changes quorum sizes, assumptions, and protocol structure.


