Distributed Testing, Simulation, and Deterministic Replay: Testing Consensus, Replication, and Membership Protocols

Core Insight

In ConfigStore, three replicas hold the cluster's control-plane state: routing rules, shard owners, and membership. A test that starts one leader, writes one value, and checks that followers eventually receive it is useful, but it barely touches the dangerous part. The hard question is whether the protocol still preserves its safety claims while leaders change, messages reorder, disks lose unflushed state, and membership changes overlap with partitions.

Consensus, replication, and membership protocols are not tested mainly by checking final values. They are tested by checking invariants over histories: which value could be committed, which quorum proved it, which term or ballot authorized it, which nodes were members at that point, and what persisted across crashes. A green end state can hide a split-brain path that briefly committed two incompatible decisions.

The trade-off is exploration depth versus protocol fidelity. A small deterministic harness can explore many schedules and shrink failures, but it must model the details that define the protocol's guarantees: quorum membership, log positions, term or ballot monotonicity, durable votes, commit rules, and reconfiguration boundaries. If those are simplified away, the test may become fast and stable while proving the wrong property.

Protocol Tests Need History Oracles

For ordinary application tests, it is often enough to assert the final response:

write x
read x
assert x is visible

Protocol tests need stronger oracles. They ask whether every observed transition was legal under the protocol.

leader election:
  at most one leader can be valid for a term

log replication:
  committed entries at the same index cannot conflict

quorum commit:
  every committed value has evidence from a valid quorum

durability:
  a node does not forget a vote or committed log entry after restart

membership:
  configuration changes preserve quorum intersection

The history matters because protocol bugs often recover into a normal-looking state. A cluster can elect a later leader, converge, and pass a final read while still having committed an unsafe value earlier.

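That is why the harness should record decisions, not only states: for each commit, the index, the term or ballot, the value, the configuration that was active, and the quorum that acknowledged it.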

The oracle then checks the recorded evidence against the protocol claim.
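A minimal sketch of what such a decision record could look like in Python. The names CommitRecord and History, the field layout, and the sample values are illustrative assumptions, not ConfigStore's actual types.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class CommitRecord:
    """One commit decision plus the evidence that justified it (hypothetical layout)."""
    index: int               # log position of the committed entry
    term: int                # term or ballot that authorized the commit
    value: str               # the committed command
    config: FrozenSet[str]   # voters the deciding node believed were active
    acks: FrozenSet[str]     # nodes whose acknowledgements formed the quorum


class History:
    """Append-only list of decisions, kept by the harness outside the nodes under test."""

    def __init__(self) -> None:
        self.commits: List[CommitRecord] = []

    def record_commit(self, rec: CommitRecord) -> None:
        self.commits.append(rec)


history = History()
history.record_commit(CommitRecord(
    index=12, term=3, value="set route r1",
    config=frozenset({"A", "B", "C"}),
    acks=frozenset({"A", "B"}),
))
```

Keeping the history outside the nodes means a crash or an overwrite inside a replica cannot erase the evidence the oracle needs.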

Consensus Invariants

Consensus tests usually focus on safety before liveness. Liveness matters, but a run that sometimes fails to make progress is a different kind of failure from a run that commits two incompatible values.

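Useful consensus invariants include the ones listed above: at most one valid leader per term, no conflicting committed entries at the same index, and a valid quorum behind every committed value.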

The harness needs to model persistence carefully. Many consensus bugs live between "sent to disk" and "actually durable."

1  B grants vote to A for term 7
2  B buffers vote record
3  B crashes before durable write
4  B restarts
5  B grants vote to C for term 7

If the production protocol relies on durable votes, the simulator must distinguish buffered state from durable state. If it treats every write as instantly durable, it cannot test that boundary.
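The five-step schedule above can be reproduced with a simulated disk that keeps buffered and durable state apart. This is a sketch under stated assumptions: SimDisk, Voter, and the sync_before_reply flag are hypothetical names invented for illustration, and the vote rule is reduced to "one vote per term."

```python
from typing import Dict, Optional


class SimDisk:
    """Simulated disk that separates buffered writes from durable ones."""

    def __init__(self) -> None:
        self.buffered: Dict[str, str] = {}
        self.durable: Dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        self.buffered[key] = value          # visible to the node, not crash-safe yet

    def fsync(self) -> None:
        self.durable.update(self.buffered)  # only now does the write survive a crash
        self.buffered.clear()

    def read(self, key: str) -> Optional[str]:
        if key in self.buffered:
            return self.buffered[key]
        return self.durable.get(key)

    def crash(self) -> None:
        self.buffered.clear()               # unflushed state is lost


class Voter:
    """Grants at most one vote per term, based on what the disk remembers."""

    def __init__(self, disk: SimDisk, sync_before_reply: bool) -> None:
        self.disk = disk
        self.sync_before_reply = sync_before_reply

    def request_vote(self, term: int, candidate: str) -> bool:
        voted_for = self.disk.read(f"vote:{term}")
        if voted_for is not None and voted_for != candidate:
            return False
        self.disk.write(f"vote:{term}", candidate)
        if self.sync_before_reply:
            self.disk.fsync()
        return True


def double_vote_possible(sync_before_reply: bool) -> bool:
    disk = SimDisk()
    b = Voter(disk, sync_before_reply)
    granted_a = b.request_vote(7, "A")   # steps 1-2: grant and buffer the vote
    disk.crash()                         # step 3: crash before the durable write
    b = Voter(disk, sync_before_reply)   # step 4: restart on the same disk
    granted_c = b.request_vote(7, "C")   # step 5: second vote in term 7
    return granted_a and granted_c
```

With sync_before_reply=False the function returns True, which is the double vote in the schedule. With sync_before_reply=True the restarted node finds its earlier vote and refuses C. A simulator whose write is instantly durable could only ever exercise the second case.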

Replication Invariants

Replication tests ask whether state moves without violating the consistency claim.

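For primary-backup replication, important checks include whether any committed write is lost across failover.

For multi-leader or eventually consistent replication, the checks are different: the claim is usually that all replicas eventually hold the same value, not that every read sees the latest write.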

The lesson from simulation fidelity is direct: the invariant must match the real claim. "All replicas eventually hold the same value" is too weak if the product promises linearizable writes. "No committed write is lost across failover" is too weak if clients also require monotonic reads.
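As an illustration of an invariant that is too weak for the claim, the sketch below runs two checks over the same run: an end-state convergence check and a per-client monotonic-reads check. The version numbers, client names, and both function names are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def converged(final_versions: Dict[str, int]) -> bool:
    """End-state check: every replica holds the same version."""
    return len(set(final_versions.values())) == 1


def monotonic_reads_ok(reads: List[Tuple[str, int]]) -> bool:
    """History check: no client ever observes an older version than it saw before.

    `reads` lists (client, version) pairs in the order they were observed.
    """
    last_seen: Dict[str, int] = {}
    for client, version in reads:
        if version < last_seen.get(client, version):
            return False
        last_seen[client] = version
    return True


# Hypothetical run: the replicas end up identical, but client c1 read
# version 3 and later version 2 while a stale replica was serving reads.
final_versions = {"A": 3, "B": 3, "C": 3}
reads = [("c1", 3), ("c2", 1), ("c1", 2), ("c1", 3)]

end_state_green = converged(final_versions)
history_green = monotonic_reads_ok(reads)
```

Here end_state_green is True and history_green is False: the convergence check passes on a run that broke a client-visible guarantee.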

Membership Invariants

Membership is the set of nodes allowed to participate in a protocol decision. It sounds administrative, but it is part of the safety mechanism.

A common mistake is treating membership as a side table that can change instantly:

old config: A, B, C
new config: B, C, D

If one side of the cluster thinks the old config is active while another side thinks the new config is active, the protocol may accidentally create quorums that do not intersect in the way the safety proof needs.
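The non-intersecting quorums can be found by brute force for configurations this small. The sketch assumes simple majority quorums; the function names are illustrative.

```python
from itertools import combinations
from typing import FrozenSet, List, Set, Tuple


def majorities(config: Set[str]) -> List[FrozenSet[str]]:
    """All subsets of `config` that form a simple majority."""
    members = sorted(config)
    need = len(members) // 2 + 1
    return [
        frozenset(combo)
        for size in range(need, len(members) + 1)
        for combo in combinations(members, size)
    ]


def disjoint_quorum_pairs(
    old: Set[str], new: Set[str]
) -> List[Tuple[FrozenSet[str], FrozenSet[str]]]:
    """Pairs (old quorum, new quorum) that share no node."""
    return [
        (q_old, q_new)
        for q_old in majorities(old)
        for q_new in majorities(new)
        if not (q_old & q_new)
    ]


old_config = {"A", "B", "C"}
new_config = {"B", "C", "D"}
pairs = disjoint_quorum_pairs(old_config, new_config)
```

For these two configurations the result contains two pairs: {A, B} with {C, D}, and {A, C} with {B, D}. Either pair lets both sides of a partition believe they hold a quorum at the same time.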

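Membership tests should check that configuration changes preserve quorum intersection, that a new configuration does not count for quorum logic before the protocol's transition rule allows it, and that nodes outside the active configuration cannot vote.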

Suspicion is not membership. A failure detector can suspect a node, but a consensus-backed membership protocol usually needs a committed configuration change before that node stops counting for quorum logic.

Worked Example

ConfigStore starts with three voting replicas:

old config: A, B, C
quorum size: 2
leader: A

The system wants to add D and remove A:

target config: B, C, D

A safe protocol usually uses a transition that preserves quorum intersection, such as a joint configuration or another explicit reconfiguration rule. The bug is that the implementation activates the new configuration locally before the transition is committed everywhere.

The deterministic harness explores this schedule:

1   A proposes config change old(A,B,C) -> new(B,C,D)
2   A appends config entry at log index 40
3   A sends append(index 40) to B and C
4   B receives index 40 and marks new config active too early
5   network partitions A from C and D
6   A and B commit command X under old config evidence
7   C hears from B that new config is active
8   C and D elect C under new config evidence
9   C and D commit command Y at the same logical index
10  partition heals
11  invariant fails: conflicting committed entries

The final state can be misleading. After healing, the cluster might pick one leader and overwrite divergent local state until most nodes agree, so a final read could show only X or only Y. The protocol bug is still real because two incompatible commands were committed under incompatible evidence.

The oracle should not wait for the final read. It should check the history:

for every committed entry:
  record index, term, value, config, quorum evidence

for any two committed entries at same index:
  assert value is identical
  assert quorum evidence is compatible with the active config transition
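The pseudocode above might translate into a history oracle like the following. It is a simplified sketch: "compatible quorum evidence" is approximated as "the two quorums share at least one node," and the log index 41 and terms 5 and 6 are invented for the example, since the schedule only says that X and Y land on the same index.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Commit:
    index: int
    term: int
    value: str
    config: FrozenSet[str]   # configuration the committing node believed active
    acks: FrozenSet[str]     # nodes that acknowledged the entry


def has_majority(acks: FrozenSet[str], config: FrozenSet[str]) -> bool:
    return len(acks & config) > len(config) // 2


def check_commits(commits: List[Commit]) -> List[str]:
    """Return every violation found in the commit history, in order."""
    violations: List[str] = []
    seen: Dict[int, List[Commit]] = {}
    for c in commits:
        if not has_majority(c.acks, c.config):
            violations.append(f"index {c.index}: acks are not a majority of the config")
        for prev in seen.get(c.index, []):
            if prev.value != c.value:
                violations.append(
                    f"index {c.index}: conflicting values {prev.value!r} and {c.value!r}")
            if not (prev.acks & c.acks):
                violations.append(
                    f"index {c.index}: quorums {sorted(prev.acks)} and "
                    f"{sorted(c.acks)} do not intersect")
        seen.setdefault(c.index, []).append(c)
    return violations


old = frozenset({"A", "B", "C"})
new = frozenset({"B", "C", "D"})
history = [
    Commit(index=41, term=5, value="X", config=old, acks=frozenset({"A", "B"})),
    Commit(index=41, term=6, value="Y", config=new, acks=frozenset({"C", "D"})),
]
violations = check_commits(history)
```

Each commit looks legal in isolation, because both have a majority of the configuration they were recorded under. The two violations only appear when the oracle compares the commits against each other.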

The replay record should preserve the exact schedule that made the bug possible:

message deliveries:
  append index 40 delivered to B
  append index 40 delayed to C
  vote request from C delivered to D

state transitions:
  B activates new config before commit
  A commits X with A,B
  C commits Y with C,D

faults:
  partition A from C,D

That record gives the shrinker a real target. It can remove unrelated client operations, reduce delays, and keep only the reconfiguration race that violates the invariant.
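A greedy shrinker over a recorded schedule can be sketched in a few lines. The replay_fails function below is a toy stand-in for re-running the harness: it only checks that four events from the replay record still occur in order. A real shrinker would replay the candidate schedule against the implementation and re-check the invariant.

```python
from typing import Callable, List

# Events from the replay record that, by assumption, are needed to reproduce the bug.
ESSENTIAL = [
    "B activates new config before commit",
    "partition A from C,D",
    "A commits X with A,B",
    "C commits Y with C,D",
]


def replay_fails(schedule: List[str]) -> bool:
    """Toy stand-in for replay: True if ESSENTIAL occurs as an ordered subsequence."""
    remaining = iter(schedule)
    return all(event in remaining for event in ESSENTIAL)


def shrink(schedule: List[str], still_fails: Callable[[List[str]], bool]) -> List[str]:
    """Drop one event at a time for as long as the failure still reproduces."""
    current = list(schedule)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if still_fails(candidate):
                current = candidate
                changed = True
                break
    return current


recorded = [
    "client put k1",
    "B activates new config before commit",
    "heartbeat A to B",
    "partition A from C,D",
    "A commits X with A,B",
    "client put k2",
    "C commits Y with C,D",
    "partition heals",
]
minimal = shrink(recorded, replay_fails)
```

On this input, minimal equals ESSENTIAL: the client operations, the heartbeat, and the heal are removed. One-at-a-time removal is the simplest strategy; it needs many replays on long schedules, which is one reason deterministic replay has to be cheap.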

Harness Design for Protocols

A protocol harness needs explicit control over four boundaries.

The network boundary controls delivery, delay, duplication, loss, and reordering. Protocols should be tested under partitions and delayed heals, not only dropped messages.

The clock boundary controls election timeouts, lease expiry, heartbeat intervals, suspicion timers, and retry backoff. If leases or failure detectors are involved, host-clock time is usually the wrong test clock.

The disk boundary controls durable state. The harness should distinguish memory, buffered write, durable write, snapshot, truncation, and restart recovery whenever the protocol relies on persistence.

The membership boundary controls which nodes count as voters, learners, observers, removed members, or joining members. Tests should not let a node participate just because it exists in the process list.

These boundaries turn protocol claims into testable events. Without them, the harness can still run the code, but it cannot reliably say what evidence made a decision legal.
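As a sketch of the network and clock boundaries only, the class below delivers messages in an order that depends on nothing but a seed, and advances a virtual clock instead of reading the host clock. SimNetwork and its methods are hypothetical names. The choice to let messages already in flight survive a later partition is a modelling assumption that a real harness would make configurable.

```python
import heapq
import random
from typing import List, Optional, Set, Tuple

Event = Tuple[int, str, str, str, str]  # (virtual time, kind, src, dst, message)


class SimNetwork:
    """Message delivery whose order is fixed entirely by the seed."""

    def __init__(self, seed: int) -> None:
        self.rng = random.Random(seed)
        self.now = 0                     # virtual clock, advanced only by step()
        self.queue: List[Tuple[int, int, str, str, str]] = []
        self.seq = 0                     # tie-breaker for equal delivery times
        self.blocked: Set[frozenset] = set()
        self.trace: List[Event] = []     # replayable record of what happened

    def partition(self, a: str, b: str) -> None:
        self.blocked.add(frozenset((a, b)))

    def heal(self) -> None:
        self.blocked.clear()

    def send(self, src: str, dst: str, msg: str) -> None:
        if frozenset((src, dst)) in self.blocked:
            self.trace.append((self.now, "drop", src, dst, msg))
            return
        delay = self.rng.randint(1, 10)  # seeded delay produces reordering
        heapq.heappush(self.queue, (self.now + delay, self.seq, src, dst, msg))
        self.seq += 1

    def step(self) -> Optional[Tuple[str, str, str]]:
        """Deliver the next message and advance the virtual clock to its time."""
        if not self.queue:
            return None
        at, _, src, dst, msg = heapq.heappop(self.queue)
        self.now = at
        self.trace.append((at, "deliver", src, dst, msg))
        return src, dst, msg


def run(seed: int) -> List[Event]:
    net = SimNetwork(seed)
    net.send("A", "B", "append 40")
    net.send("A", "C", "append 40")
    net.partition("A", "C")
    net.send("A", "C", "heartbeat")      # dropped: sent after the partition
    while net.step() is not None:
        pass
    return net.trace
```

Calling run with the same seed twice yields the same trace, which is what makes a failing schedule replayable. The disk and membership boundaries would need their own explicit models.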

Liveness Without Hiding Safety

Protocol tests also need liveness checks. They are useful, but they should not blur safety failures. A common anti-pattern is retrying operations until the cluster becomes healthy and then asserting the final state. That can hide a temporary split brain, lost commit, or illegal leader.

Separate the checks:

safety:
  no conflicting commits ever occur
  no stale leader serves authoritative writes

liveness:
  after faults stop, the cluster eventually elects one leader
  after faults stop, committed entries eventually replicate

The safety check watches the whole history. The liveness check applies after the fault schedule reaches a stable recovery phase.
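One way to keep the two checks apart is to run them over different slices of the same event log. The event shapes below are assumptions: "fault" marks any fault-schedule action, including the heal, and the liveness check is bounded, so it only asks whether a single leader emerged within the recorded stable phase.

```python
from typing import Dict, List

Event = Dict[str, object]


def check_safety(events: List[Event]) -> bool:
    """Whole history: no index is ever committed with two different values."""
    committed: Dict[object, object] = {}
    for ev in events:
        if ev["type"] == "commit":
            first = committed.setdefault(ev["index"], ev["value"])
            if first != ev["value"]:
                return False
    return True


def check_liveness(events: List[Event]) -> bool:
    """Stable phase only: after the last fault event, exactly one node leads the top term."""
    fault_positions = [i for i, ev in enumerate(events) if ev["type"] == "fault"]
    stable = events[fault_positions[-1] + 1:] if fault_positions else events
    elections = [ev for ev in stable if ev["type"] == "leader"]
    if not elections:
        return False
    top_term = max(ev["term"] for ev in elections)
    return len({ev["node"] for ev in elections if ev["term"] == top_term}) == 1


# Hypothetical log of the worked example: the cluster recovers, but only
# after committing X and Y at the same index.
events = [
    {"type": "fault", "what": "partition A from C,D"},
    {"type": "commit", "index": 41, "value": "X"},
    {"type": "commit", "index": 41, "value": "Y"},
    {"type": "fault", "what": "heal"},
    {"type": "leader", "node": "C", "term": 8},
]
safe = check_safety(events)
live = check_liveness(events)
```

For this log, live is True and safe is False. A test that asserted only on the recovered cluster would report success.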

Common Failure Modes

One mistake is testing only the happy leader. Most protocol bugs involve leader changes, failed persistence, delayed messages, or reconfiguration.

Another mistake is making membership a test fixture instead of a protocol state. If membership cannot change inside the harness, reconfiguration bugs are invisible.

A third mistake is treating durable state as ordinary memory. Crashes should reveal which facts were truly persisted.

A fourth mistake is asserting only eventual convergence. Convergence after a split-brain commit does not make the split brain safe.

A fifth mistake is using one generic invariant for every protocol. Consensus, eventual replication, primary-backup replication, and membership gossip have different claims.

Practice

Design a protocol test for one replicated control-plane service.

  1. What are the safety claims?
  2. What are the liveness claims?
  3. Which facts must be recorded in the history?
  4. Which network events must the harness control?
  5. Which clock events must the harness control?
  6. Which state is volatile, buffered, or durable?
  7. Which nodes are voters, learners, joining members, or removed members?
  8. Which final states could hide an unsafe history?

Then write one invariant that checks evidence, not just outcome. For example: "Every committed entry at index i has the same value and was committed by a quorum valid for the configuration active at i."
