Distributed Testing, Simulation, and Deterministic Replay: Failure Models for Distributed Test Harnesses
Core Insight
A distributed test harness cannot test "failure" in the abstract. It needs a failure model: a precise statement of which bad events are allowed to happen, which components can suffer them, and what the system may still assume while recovering.
Return to the CheckoutService from the previous lessons. A simulation that drops one payment reply is exercising a different world from a simulation that corrupts a payment record, pauses a node for ninety seconds, duplicates an idempotent request, or partitions two replicas while clients continue writing. Each world creates different obligations for the oracle. If the service only promises correctness under crash-recovery and message loss, a test that injects arbitrary Byzantine corruption is not a stronger version of the same test; it is testing a different contract.
The non-obvious insight is that failure models are part of the design of the test, not just part of the system being tested. The harness must make failures controllable enough to replay while still realistic enough to expose bugs that production could actually produce. The trade-off is coverage versus interpretability: a broad model explores more unpleasant behavior, but failures become harder to explain, minimize, and tie back to a promised guarantee.
What a Failure Model Names
A failure model is a vocabulary for bad behavior. It says what can go wrong without turning every run into an unconstrained disaster.
Useful distributed test models usually specify several dimensions:
- Process failures: a node crashes, pauses, restarts, loses volatile memory, or recovers from persisted state.
- Network failures: messages are delayed, dropped, duplicated, reordered, or blocked across a partition.
- Storage failures: writes are delayed, reads see old data, durable records survive restart, or a disk reports an error.
- Clock failures: time jumps, stalls, drifts, or differs between nodes.
- Client failures: clients retry, time out, disconnect, or submit the same operation more than once.
- Dependency failures: an external payment, queue, lock service, or metadata store becomes slow or unavailable.
The model also names exclusions. A crash-stop model may say nodes fail by stopping and never returning. A crash-recovery model allows restart from persistent state. A non-Byzantine model assumes components do not lie, forge messages, or intentionally violate the protocol. A Byzantine model removes that assumption and requires much stronger machinery.
These distinctions are not academic. If a harness claims to validate crash-recovery behavior, it must decide what survives a crash. Does the node lose in-memory locks? Does it keep a write-ahead log? Can an outbound message be sent just before the crash while the local commit is lost? The answer changes which histories are legal and which invariants the oracle should check.
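One way to keep these decisions from staying implicit is to write the model down as data the harness can query. The sketch below is illustrative only: `Fault`, `FailureModel`, `survives_crash`, and `CRASH_RECOVERY` are hypothetical names, and the choice that only a write-ahead log survives a crash is an assumption made for the example, not a property of any particular system.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Fault(Enum):
    """Hypothetical vocabulary of faults a harness might inject."""
    CRASH = auto()
    RESTART = auto()
    MSG_DROP = auto()
    MSG_DELAY = auto()
    MSG_DUPLICATE = auto()
    MSG_REORDER = auto()
    CLOCK_JUMP = auto()
    DISK_CORRUPTION = auto()
    FORGED_MESSAGE = auto()


@dataclass(frozen=True)
class FailureModel:
    name: str
    allowed: frozenset
    # State a node keeps across a crash; everything else is volatile.
    survives_crash: frozenset
    max_concurrent_crashes: int = 1

    def permits(self, fault: Fault) -> bool:
        return fault in self.allowed

    def excluded(self) -> frozenset:
        # Every named fault that the model does not allow.
        return frozenset(Fault) - self.allowed


# Assumed example: crash-recovery over an unreliable, non-Byzantine network.
CRASH_RECOVERY = FailureModel(
    name="crash-recovery, unreliable network",
    allowed=frozenset({
        Fault.CRASH, Fault.RESTART, Fault.MSG_DROP,
        Fault.MSG_DELAY, Fault.MSG_DUPLICATE, Fault.MSG_REORDER,
    }),
    survives_crash=frozenset({"write_ahead_log"}),
)
```

With the model in this form, "does the node lose in-memory locks?" has a checkable answer: anything not listed in `survives_crash` is treated as lost.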
From Model to Harness Controls
The model becomes useful when the harness can turn each failure type into a controlled event.
For a deterministic simulator, failures are often represented as scheduled actions:
step 17: deliver reserve(item-17) to inventory-primary
step 18: drop replicate(reservation-77) to inventory-replica
step 19: crash inventory-primary after durable commit
step 20: retry reserve(item-17) through inventory-replica
step 21: restart inventory-primary from disk snapshot
This schedule is valuable because replay can reproduce it. The test harness can run the same actions again, the oracle can re-check the same observable outcomes, and the engineer can ask whether the failure is a real bug or an invalid assumption.
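A minimal sketch of that replay property follows, assuming a toy world in which delivering a message to a live node commits it durably at once. `Step`, `Node`, `replay`, and `SCHEDULE` are hypothetical names; a real simulator would also model in-flight messages and the protocol's own logic.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Step:
    action: str        # "deliver", "drop", "crash", or "restart"
    target: str        # node the action applies to
    payload: str = ""  # message label, if any


@dataclass
class Node:
    up: bool = True
    volatile: list = field(default_factory=list)  # lost on crash
    durable: list = field(default_factory=list)   # survives crash


def replay(schedule, first_step=1):
    """Apply the schedule to a fresh world and return the observable trace."""
    nodes = {}
    trace = []
    for i, step in enumerate(schedule, start=first_step):
        node = nodes.setdefault(step.target, Node())
        if step.action == "deliver" and node.up:
            node.volatile.append(step.payload)
            node.durable.append(step.payload)
            trace.append((i, step.target, "applied", step.payload))
        elif step.action == "deliver":
            trace.append((i, step.target, "lost-node-down", step.payload))
        elif step.action == "drop":
            trace.append((i, step.target, "dropped", step.payload))
        elif step.action == "crash":
            node.up = False
            node.volatile.clear()
            trace.append((i, step.target, "crashed", ""))
        elif step.action == "restart":
            node.up = True
            node.volatile = list(node.durable)  # recover from disk only
            trace.append((i, step.target, "restarted", ""))
        else:
            raise ValueError(f"unknown action: {step.action}")
    return trace


# The five steps from the schedule above, numbered from 17.
SCHEDULE = [
    Step("deliver", "inventory-primary", "reserve(item-17)"),
    Step("drop", "inventory-replica", "replicate(reservation-77)"),
    Step("crash", "inventory-primary"),
    Step("deliver", "inventory-replica", "reserve(item-17)"),
    Step("restart", "inventory-primary"),
]
```

Because `replay` builds its world from scratch and consults no clock, random source, or thread scheduler, two calls with the same schedule return identical traces. That equality is what lets an engineer re-run a failing schedule and trust that it is the same failure.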
The harness should keep the controls close to the model:
- A message-drop model needs message identities and delivery decisions.
- A crash-recovery model needs explicit persistence boundaries.
- A clock-skew model needs controllable reads of time rather than direct calls to the host clock.
- A client-timeout model needs client-visible uncertainty, not just server-side exceptions.
- A dependency-failure model needs controllable responses from the dependency boundary.
When the controls are too coarse, failures become hard to interpret. "Kill a random container" may reveal a bug, but the replay record might not explain whether the bug came from lost volatile state, a half-sent message, a lease timeout, or a delayed retry. Deterministic testing works best when the failure event names the mechanism it is disturbing.
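The clock-skew control in the list above can be sketched as a seam: code under test asks an injected clock for the time instead of calling the host clock. `Clock`, `HostClock`, `SimClock`, and `Lease` are hypothetical names for illustration; only `time.monotonic` and `typing.Protocol` are standard-library calls.

```python
import time
from typing import Protocol


class Clock(Protocol):
    def now(self) -> float: ...


class HostClock:
    """Production clock: reads the real monotonic clock."""
    def now(self) -> float:
        return time.monotonic()


class SimClock:
    """Test clock: time moves only when the harness says so."""
    def __init__(self, start: float = 0.0) -> None:
        self._now = start

    def now(self) -> float:
        return self._now

    def advance(self, seconds: float) -> None:
        self._now += seconds

    def jump_to(self, t: float) -> None:
        self._now = t


class Lease:
    """Toy lease that expires ttl seconds after creation."""
    def __init__(self, clock: Clock, ttl: float) -> None:
        self._clock = clock
        self._expires_at = clock.now() + ttl

    def is_valid(self) -> bool:
        return self._clock.now() < self._expires_at
```

A test can then create `Lease(SimClock(), ttl=10.0)`, call `advance(10.0)` on the clock, and observe the expiry at an exact, replayable point in the schedule rather than waiting on wall time.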
Worked Example
Suppose CheckoutService stores order state on three replicas. The team wants to test that a successful confirmation remains recoverable after failover. The first candidate failure model is:
Model: crash-recovery with unreliable network
Allowed:
- messages may be delayed, dropped, duplicated, and reordered
- one replica may crash and later restart from durable storage
- clients may time out and retry with the same idempotency key
Excluded (stated as assumptions the harness never violates):
- replicas do not corrupt durable records
- replicas do not forge messages
- two replicas do not crash at the same time
- external payment capture is either accepted, rejected, or unavailable
Now the harness can produce a targeted run:
t1 client A submits confirm(order-101, key=k1)
t2 order-primary writes committed order-101 to disk
t3 harness drops replication message to replica-2
t4 order-primary crashes before replying to client A
t5 client A times out and retries confirm(order-101, key=k1)
t6 replica-2 becomes leader and does not see order-101
t7 replica-2 accepts the retry as a new confirmation
t8 order-primary restarts with order-101 committed
The failure model makes the bug understandable. If the service promised idempotent confirmation across crash-recovery, the retry must converge on the same order outcome. The oracle from lesson 2 can now check the observable history for duplicate confirmations, missing durable success, or inconsistent reads after recovery.
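A hedged sketch of one such check is shown below. It assumes the harness records, for every confirmation, whether a node committed it as new or returned an existing result. The event names `committed-as-new` and `returned-existing`, the tuple layout, and the function name are all assumptions made for this example.

```python
from collections import defaultdict

# Observable history of the buggy run, one tuple per event:
# (time, node, event, order_id, idempotency_key)
BUGGY_HISTORY = [
    ("t2", "order-primary", "committed-as-new", "order-101", "k1"),
    ("t7", "replica-2", "committed-as-new", "order-101", "k1"),
]

# What a correct run might look like: the retry finds the earlier result.
FIXED_HISTORY = [
    ("t2", "order-primary", "committed-as-new", "order-101", "k1"),
    ("t7", "replica-2", "returned-existing", "order-101", "k1"),
]


def duplicate_confirmations(history):
    """Return idempotency keys that were committed as new more than once."""
    commits = defaultdict(list)
    for at, node, event, _order_id, key in history:
        if event == "committed-as-new":
            commits[key].append((at, node))
    return {key: where for key, where in commits.items() if len(where) > 1}
```

For `BUGGY_HISTORY` the function reports key `k1` with the two commit sites at t2 and t7, which points the engineer at the dropped replication message rather than at the crash alone. For `FIXED_HISTORY` it returns an empty dictionary.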
The same run would mean something different under a different model. If the service explicitly requires quorum replication before acknowledging success, crashing after a local-only write may not violate the contract because the client never received success. If the service claims exactly-once external payment capture, the dependency boundary must also be modeled so the harness can observe whether one or two captures occurred.
Boundaries and Blind Spots
Every failure model creates blind spots. A crash-recovery model will not find bugs that require disk corruption. A non-Byzantine model will not find forged leader messages. A single-fault model can miss bugs that only occur when a restart overlaps with a partition and a client retry.
Those blind spots are acceptable when they are explicit. They become dangerous when the team treats a passing test suite as proof against failures the harness never generated.
The model should therefore be written next to the test strategy:
We test crash-recovery, message loss, delay, duplication, reordering,
client timeout, and idempotent retry.
We do not test Byzantine behavior, durable-storage corruption,
malicious clients, or simultaneous loss of quorum.
This statement is not an apology. It is an engineering boundary. It lets reviewers decide whether the model matches the production risk, whether additional test layers are needed, and whether the system's public promises are too broad for the evidence available.
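The written boundary can also be made checkable. The sketch below assumes the harness labels each injected fault with a short string; the label names, `IN_SCOPE`, `OUT_OF_SCOPE`, and `audit` are hypothetical and mirror the statement above only loosely.

```python
# Faults the strategy statement says are tested.
IN_SCOPE = {
    "crash", "restart", "msg_drop", "msg_delay", "msg_duplicate",
    "msg_reorder", "client_timeout", "client_retry",
}

# Faults the strategy statement explicitly declines to test.
OUT_OF_SCOPE = {
    "byzantine", "storage_corruption", "malicious_client", "quorum_loss",
}


def audit(faults):
    """Split injected faults that fall outside the tested model.

    Returns (excluded, unnamed): faults the statement rules out, and
    faults the statement never mentions at all.
    """
    excluded = [f for f in faults if f in OUT_OF_SCOPE]
    unnamed = [f for f in faults if f not in IN_SCOPE and f not in OUT_OF_SCOPE]
    return excluded, unnamed
```

The second list is the more interesting one for reviewers. A fault such as `clock_jump` that appears in neither set is not a deliberate exclusion but a gap in the written model, and the statement should be extended to place it on one side of the boundary.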
Common Failure Model Mistakes
One mistake is treating failures as a checklist of dramatic events. A harness that can kill nodes, drop packets, and skew clocks is not automatically strong. The useful question is whether those events match the system's promises and can be tied to observable outcomes.
Another mistake is injecting failures below a boundary the system cannot control. If an application calls the host clock directly, a clock-skew test may be unreplayable. If the database driver hides retries, the harness may not know which operation committed. Deterministic testing often requires dependency seams where time, network, storage, and clients can be controlled.
A third mistake is mixing failure models inside one oracle. If a test allows stale reads under eventual consistency but also expects every successful write to be immediately visible everywhere, the result is confusion rather than signal. The model and oracle must agree about what histories are legal.
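To make the agreement concrete, an oracle can take the model's consistency assumption as an explicit parameter instead of encoding two contradictory expectations. This is a deliberately simplified sketch: it assumes a single key, operations that never overlap in time, and a hypothetical `illegal_reads` function. Checking real concurrent histories for linearizability is considerably harder.

```python
def illegal_reads(history, allow_stale):
    """Flag reads that the chosen model does not permit.

    history: list of ("write", value) or ("read", value) tuples in
    real-time order for one key. The key starts out holding None.
    allow_stale: True for an eventual-consistency style model where a
    read may return any value written so far; False when every read
    must return the most recent write.
    """
    written = [None]
    bad = []
    for index, (op, value) in enumerate(history):
        if op == "write":
            written.append(value)
        elif op == "read":
            legal = set(written) if allow_stale else {written[-1]}
            if value not in legal:
                bad.append((index, value))
        else:
            raise ValueError(f"unknown operation: {op}")
    return bad
```

The history `[("write", "v1"), ("write", "v2"), ("read", "v1")]` is legal with `allow_stale=True` and a violation at index 2 with `allow_stale=False`. The same run passes or fails depending on the model, so the oracle has to be told which one applies.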
The last common mistake is making the model so broad that every failure is plausible and no result is actionable. "Anything can happen" is rarely a useful test model. Useful harnesses start from the failure modes the system claims to tolerate, then expand carefully when production evidence shows a wider risk.
Practice
Pick a distributed component such as a replicated cache, a queue worker, a lease service, or an order workflow. Write a failure model with four parts:
- Which process failures are allowed?
- Which network behaviors are allowed?
- What survives restart?
- Which failures are explicitly out of scope?
Then choose one invariant from lesson 2 and ask whether your harness records enough events to judge it under that model. If the answer is no, either the model is underspecified or the harness is missing a control point.
Connections
- Builds on Test Oracles, Invariants, and Observable Outcomes by defining the failure worlds those oracles are allowed to judge.
- Prepares for Logical Time, Clocks, and Event Ordering, because clock behavior is one of the most important assumptions a failure model must make explicit.
- Connects to consensus and replication work because crash-stop, crash-recovery, partial synchrony, and Byzantine assumptions change which protocols are valid.
Resources
- [PAPER] FoundationDB: A Distributed Unbundled Transactional Key Value Store
- [PAPER] Lineage-driven Fault Injection
- [BOOK] Designing Data-Intensive Applications
- [DOC] Jepsen
Key Takeaways
- A failure model defines which bad events a distributed test harness is allowed to generate.
- The oracle, replay log, and harness controls must all agree on the same model.
- Explicit exclusions are useful because they prevent a passing suite from claiming more evidence than it has.
- Broad failure coverage is valuable only when failures remain reproducible, interpretable, and tied to system promises.