Distributed Testing, Simulation, and Deterministic Replay: Testing Consensus, Replication, and Membership Protocols
LESSON
Distributed Testing, Simulation, and Deterministic Replay: Testing Consensus, Replication, and Membership Protocols
Core Insight
In ConfigStore, three replicas hold the cluster's control-plane state: routing rules, shard owners, and membership. A test that starts one leader, writes one value, and checks that followers eventually receive it is useful, but it barely touches the dangerous part. The hard question is whether the protocol still preserves its safety claims while leaders change, messages reorder, disks lose unflushed state, and membership changes overlap with partitions.
Consensus, replication, and membership protocols are not tested mainly by checking final values. They are tested by checking invariants over histories: which value could be committed, which quorum proved it, which term or ballot authorized it, which nodes were members at that point, and what persisted across crashes. A green end state can hide a split-brain path that briefly committed two incompatible decisions.
The trade-off is exploration depth versus protocol fidelity. A small deterministic harness can explore many schedules and shrink failures, but it must model the details that define the protocol's guarantees: quorum membership, log positions, term or ballot monotonicity, durable votes, commit rules, and reconfiguration boundaries. If those are simplified away, the test may become fast and stable while proving the wrong property.
Protocol Tests Need History Oracles
For ordinary application tests, it is often enough to assert the final response:
write x
read x
assert x is visible
Protocol tests need stronger oracles. They ask whether every observed transition was legal under the protocol.
leader election:
at most one leader can be valid for a term
log replication:
committed entries at the same index cannot conflict
quorum commit:
every committed value has evidence from a valid quorum
durability:
a node does not forget a vote or committed log entry after restart
membership:
configuration changes preserve quorum intersection
The history matters because protocol bugs often recover into a normal-looking state. A cluster can elect a later leader, converge, and pass a final read while still having committed an unsafe value earlier.
That is why the harness should record decisions, not only states:
- election attempts
- votes granted and rejected
- terms, epochs, or ballots
- append entries and acknowledgements
- commit-index movement
- durable writes and fsync boundaries
- snapshots and truncation
- membership proposals and activations
- node crashes and restarts
- network partitions and heals
The oracle then checks the recorded evidence against the protocol claim.
Consensus Invariants
Consensus tests usually focus on safety before liveness. Liveness matters, but a test that sometimes fails to make progress is different from a test that commits two incompatible values.
Useful consensus invariants include:
- a node's term or ballot never moves backward
- a node grants at most one vote per term unless the protocol explicitly allows more
- a leader must have valid election evidence for its term
- two leaders for the same term cannot both be authoritative
- a committed log entry is never replaced by a conflicting entry
- a later leader contains all entries that were committed under earlier leaders
- a node's durable vote and log prefix survive crash and restart
The harness needs to model persistence carefully. Many consensus bugs live between "sent to disk" and "actually durable."
1 B grants vote to A for term 7
2 B buffers vote record
3 B crashes before durable write
4 B restarts
5 B grants vote to C for term 7
If the production protocol relies on durable votes, the simulator must distinguish buffered state from durable state. If it treats every write as instantly durable, it cannot test that boundary.
Replication Invariants
Replication tests ask whether state moves without violating the consistency claim.
For primary-backup replication, important checks include:
- followers apply entries in log order
- acknowledgements mean what the commit rule says they mean
- failover does not lose acknowledged committed writes
- stale leaders cannot keep accepting writes after losing authority
- reads are served only from replicas that satisfy the read guarantee
For multi-leader or eventually consistent replication, the checks are different:
- conflict detection or resolution is deterministic
- convergence happens after partitions heal
- causal dependencies are not applied out of order when the model requires them
- duplicate messages do not duplicate non-idempotent effects
- tombstones, versions, or vector metadata are retained long enough to resolve conflicts
The lesson from simulation fidelity is direct: the invariant must match the real claim. "All replicas eventually hold the same value" is too weak if the product promises linearizable writes. "No committed write is lost across failover" is too weak if clients also require monotonic reads.
Membership Invariants
Membership is the set of nodes allowed to participate in a protocol decision. It sounds administrative, but it is part of the safety mechanism.
A common mistake is treating membership as a side table that can change instantly:
old config: A, B, C
new config: B, C, D
If one side of the cluster thinks the old config is active while another side thinks the new config is active, the protocol may accidentally create quorums that do not intersect in the way the safety proof needs.
Membership tests should check:
- who is allowed to vote
- which configuration authorized each commit
- when a new configuration becomes active
- whether old and new quorums overlap during transition
- whether removed nodes can still lead, vote, or serve reads
- whether joining nodes receive enough state before becoming voters
- whether suspected nodes are merely suspected or actually removed by a committed decision
Suspicion is not membership. A failure detector can suspect a node, but a consensus-backed membership protocol usually needs a committed configuration change before that node stops counting for quorum logic.
Worked Example
ConfigStore starts with three voting replicas:
old config: A, B, C
quorum size: 2
leader: A
The system wants to add D and remove A:
target config: B, C, D
A safe protocol usually uses a transition that preserves quorum intersection, such as a joint configuration or another explicit reconfiguration rule. The bug is that the implementation activates the new configuration locally before the transition is committed everywhere.
The deterministic harness explores this schedule:
1 A proposes config change old(A,B,C) -> new(B,C,D)
2 A appends config entry at log index 40
3 A sends append(index 40) to B and C
4 B receives index 40 and marks new config active too early
5 network partitions A from C and D
6 A and B commit command X under old config evidence
7 C hears from B that new config is active
8 C and D elect C under new config evidence
9 C and D commit command Y at the same logical index
10 partition heals
11 invariant fails: conflicting committed entries
The final state may be confusing. After healing, the cluster might pick one leader and overwrite local state until most nodes agree. A final read could show only X or only Y. The protocol bug is still real because two incompatible commands were committed under incompatible evidence.
The oracle should not wait for the final read. It should check the history:
for every committed entry:
record index, term, value, config, quorum evidence
for any two committed entries at same index:
assert value is identical
assert quorum evidence is compatible with the active config transition
The replay record should preserve the exact schedule that made the bug possible:
message deliveries:
append index 40 delivered to B
append index 40 delayed to C
vote request from C delivered to D
state transitions:
B activates new config before commit
A commits X with A,B
C commits Y with C,D
faults:
partition A from C,D
That record gives the shrinker a real target. It can remove unrelated client operations, reduce delays, and keep only the reconfiguration race that violates the invariant.
Harness Design for Protocols
A protocol harness needs explicit control over four boundaries.
The network boundary controls delivery, delay, duplication, loss, and reordering. Protocols should be tested under partitions and delayed heals, not only dropped messages.
The clock boundary controls election timeouts, lease expiry, heartbeat intervals, suspicion timers, and retry backoff. If leases or failure detectors are involved, host-clock time is usually the wrong test clock.
The disk boundary controls durable state. The harness should distinguish memory, buffered write, durable write, snapshot, truncation, and restart recovery whenever the protocol relies on persistence.
The membership boundary controls which nodes count as voters, learners, observers, removed members, or joining members. Tests should not let a node participate just because it exists in the process list.
These boundaries turn protocol claims into testable events. Without them, the harness can still run the code, but it cannot reliably say what evidence made a decision legal.
Liveness Without Hiding Safety
Protocol tests also need liveness checks:
- a healthy cluster eventually elects a leader
- committed entries eventually replicate to available followers
- a joining member eventually catches up
- a healed partition eventually converges
- membership changes eventually complete
Liveness tests are useful, but they should not blur safety failures. A common anti-pattern is retrying operations until the cluster becomes healthy and then asserting the final state. That can hide a temporary split brain, lost commit, or illegal leader.
Separate the checks:
safety:
no conflicting commits ever occur
no stale leader serves authoritative writes
liveness:
after faults stop, the cluster eventually elects one leader
after faults stop, committed entries eventually replicate
The safety check watches the whole history. The liveness check applies after the fault schedule reaches a stable recovery phase.
Common Failure Modes
One mistake is testing only the happy leader. Most protocol bugs involve leader changes, failed persistence, delayed messages, or reconfiguration.
Another mistake is making membership a test fixture instead of a protocol state. If membership cannot change inside the harness, reconfiguration bugs are invisible.
A third mistake is treating durable state as ordinary memory. Crashes should reveal which facts were truly persisted.
A fourth mistake is asserting only eventual convergence. Convergence after a split-brain commit does not make the split brain safe.
A fifth mistake is using one generic invariant for every protocol. Consensus, eventual replication, primary-backup replication, and membership gossip have different claims.
Practice
Design a protocol test for one replicated control-plane service.
- What are the safety claims?
- What are the liveness claims?
- Which facts must be recorded in the history?
- Which network events must the harness control?
- Which clock events must the harness control?
- Which state is volatile, buffered, or durable?
- Which nodes are voters, learners, joining members, or removed members?
- Which final states could hide an unsafe history?
Then write one invariant that checks evidence, not just outcome. For example: "Every committed entry at index i has the same value and was committed by a quorum valid for the configuration active at i."
Connections
- Builds on Simulation Fidelity, Model Drift, and False Confidence, because protocol tests must preserve the fidelity details that define the protocol guarantee.
- Prepares for Testing Client Semantics, Idempotency, and Exactly-Once Claims, where the focus moves from internal protocol evidence to what clients can safely assume.
- Connects to
consistency-and-replication, because the same quorum, log, and membership concepts become test oracles here.
Resources
- [PAPER] In Search of an Understandable Consensus Algorithm
- [PAPER] Paxos Made Simple
- [DOC] Jepsen Analyses
- [PAPER] FoundationDB: A Distributed Unbundled Transactional Key Value Store
Key Takeaways
- Protocol tests need history oracles that check decisions, evidence, and legal transitions, not only final values.
- Consensus, replication, and membership each require different invariants, even when they run in the same service.
- A useful harness controls network, clock, disk, and membership boundaries so protocol claims become observable.
- Safety checks should watch the whole execution history; liveness checks should run after the fault schedule reaches recovery.
← Back to Distributed Testing, Simulation, and Deterministic Replay