Distributed Testing, Simulation, and Deterministic Replay: Capstone: Build a Deterministic Distributed Test Lab
Core Insight
CheckoutService has reached the point where isolated tests no longer tell the whole truth. A provider contract test can prove that retries are accepted. A deterministic simulator can prove that one schedule is safe. A production trace can show how an incident happened. None of those artifacts, by itself, proves that the team can find, reproduce, reduce, and permanently guard against the next distributed failure.
A deterministic distributed test lab is the system that connects those pieces. It defines what the service promises, which parts of the world the harness controls, which faults are worth exploring, which evidence gets recorded, and how a failure becomes a stable regression. The non-obvious insight is that the lab is not mainly a bigger test suite. It is a controlled evidence pipeline.
The hard trade-off is breadth versus determinism. The more realistic the lab becomes, the more dependency behavior, process boundaries, and storage details it can observe. The more controlled it becomes, the easier it is to replay and shrink failures. A strong lab makes that boundary explicit instead of hiding it inside flaky CI runs.
Capstone Scenario
Build the design for a deterministic test lab around a three-replica CheckoutService.
The service accepts confirm_order(order_id, idempotency_key), writes an outcome record, and calls a payment provider. The team wants this claim to be true:
For a scoped idempotency key, CheckoutService creates at most one external provider capture,
even when clients retry, replicas fail, messages are delayed, and provider responses are ambiguous.
The lab must also support a second claim:
When a known failure has been reduced to a replay packet, any engineer can rerun it locally
and CI can reject a change that reintroduces the same bug.
Your deliverable is a reviewable lab design, not a complete implementation. It should be precise enough that another engineer can see what would be controlled, what would be observed, and what evidence would make the claims believable.
Lab Architecture
The lab needs one controlled boundary around the service and one evidence path out of it.
clients
   |
   v
workload generator
   |
   v
deterministic harness
   |-- logical clocks and timers
   |-- scheduler and thread/task interleavings
   |-- network delay, drop, duplicate, reorder
   |-- crash and restart points
   |-- durable-state flush and recovery boundaries
   |-- scripted payment-provider responses
   |
   v
CheckoutService replicas A, B, C
   |
   v
event log + effect log + invariant results + replay packet
The harness should own every source of nondeterminism that matters to the claims:
- client request timing
- retry timing
- replica scheduling
- message delivery order
- leader or coordinator changes
- local clock reads
- timeout firing
- process crash points
- fsync or durable-write boundaries
- dependency responses
It does not have to own everything in production. It does have to say what is outside the model. For example, it might model provider responses but not provider fraud rules, or model local disk durability points but not the exact production storage engine. Those exclusions are acceptable only when the design explains how production evidence will calibrate them.
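As a minimal sketch of what "owning" nondeterminism can mean, the harness below holds the logical clock and draws every message delay from a single seeded generator, so the delivery order is a pure function of the seed. The names `Harness`, `send`, and `run_once` are hypothetical and exist only for illustration; a real harness would also have to control crash points, fsync boundaries, and provider responses.

```python
import heapq
import random
from dataclasses import dataclass, field


@dataclass
class Harness:
    """Hypothetical harness that owns logical time and delivery order."""
    seed: int
    now: int = 0                                # logical clock, never the wall clock
    log: list = field(default_factory=list)     # (logical_time, label) deliveries

    def __post_init__(self):
        self.rng = random.Random(self.seed)     # the only source of randomness
        self._queue = []                        # heap of (fire_time, sequence, label)
        self._sequence = 0                      # tie-breaker for equal fire times

    def send(self, label, min_delay=1, max_delay=5):
        """Schedule a message; its delay comes from the seeded generator."""
        delay = self.rng.randint(min_delay, max_delay)
        heapq.heappush(self._queue, (self.now + delay, self._sequence, label))
        self._sequence += 1

    def run(self):
        """Deliver messages in logical-time order and record each delivery."""
        while self._queue:
            self.now, _, label = heapq.heappop(self._queue)
            self.log.append((self.now, label))
        return self.log


def run_once(seed):
    """Build one small scenario and return its delivery log."""
    harness = Harness(seed)
    for label in ("A->provider capture", "A->B replicate outcome", "client retry->B"):
        harness.send(label)
    return harness.run()


# The same seed always yields the same delivery log.
print(run_once(482991) == run_once(482991))
```

The point of the design is that nothing inside the scenario reads the wall clock or an unseeded generator. Any input the harness does not route through its own state is, by definition, outside the model.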
Required Claims And Oracles
A lab without oracles only produces interesting logs. Start by writing the claims in a form the harness can falsify.
Use at least these oracles:
safety:
    provider_captures(scope, idempotency_key).count <= 1
request integrity:
    same idempotency key with a different request hash is rejected or isolated
outcome durability:
    after a confirmed outcome is returned, recovery preserves enough state to avoid a second capture
replay fidelity:
    replay(seed, schedule, fault_plan, dependency_script) produces the same invariant result
Notice what the first oracle counts: external captures, not just local database rows. A common exactly-once mistake is to assert internal state while the duplicate side effect has already escaped. The effect log is part of the truth source.
Each oracle should include its evidence fields:
oracle_name
scope
history_window
events_read
effects_read
pass_or_fail
counterexample_pointer
That shape lets a failing run become inspectable without asking someone to manually interpret raw logs.
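The safety oracle and its evidence fields could be sketched as follows. This is an illustration under assumptions, not the lab's actual code: the `Capture` record, the `check_capture_count` function, and the format of `counterexample_pointer` are all hypothetical. The oracle reads only the effect log, which is the surface the claim is about.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Capture:
    """One external provider capture as recorded in the effect log (assumed shape)."""
    capture_id: str
    scope: str
    idempotency_key: str
    replica: str


@dataclass(frozen=True)
class OracleResult:
    """Evidence fields listed in the lesson, one attribute per field."""
    oracle_name: str
    scope: str
    history_window: tuple
    events_read: int
    effects_read: int
    pass_or_fail: str
    counterexample_pointer: Optional[str]


def check_capture_count(effect_log, scope, history_window, events_read=0):
    """Fail if any idempotency key in the scope has more than one capture."""
    in_scope = [c for c in effect_log if c.scope == scope]
    counts = Counter(c.idempotency_key for c in in_scope)
    offenders = sorted(key for key, n in counts.items() if n > 1)
    return OracleResult(
        oracle_name="provider_captures_per_scoped_key_le_1",
        scope=scope,
        history_window=history_window,
        events_read=events_read,
        effects_read=len(in_scope),
        pass_or_fail="fail" if offenders else "pass",
        # Hypothetical pointer format: the first offending key.
        counterexample_pointer=f"key={offenders[0]}" if offenders else None,
    )


effects = [
    Capture("cap_183", "checkout", "k_91", "A"),
    Capture("cap_184", "checkout", "k_91", "B"),
]
result = check_capture_count(effects, scope="checkout", history_window=(0, 40))
print(result.pass_or_fail, result.counterexample_pointer)
```

Returning a structured result instead of a bare boolean is what lets CI attach the counterexample pointer to a failure report.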
Fault And Schedule Plan
The fault plan should cover the causal shapes most likely to break the claim. It should not spray random chaos at the system and hope confidence appears.
Start with fixed schedules for known dangerous boundaries:
schedule 1:
    client sends confirm_order
    replica A writes pending outcome
    A calls provider
    provider captures payment
    A crashes before durable confirmed outcome
    client retries against B
schedule 2:
    client sends confirm_order
    provider returns unknown response
    retry races with replication of the outcome record
    delayed message from A arrives after B starts recovery
schedule 3:
    same idempotency key is reused with a different request hash
    one replica has stale idempotency metadata
    network heals after both replicas accepted work
Then add bounded exploration around those schedules:
- vary which replica receives the retry
- vary crash point before and after durable write
- vary provider response among success, timeout, unknown, and duplicate-key replay
- vary message delay around outcome replication
- vary timeout order relative to client retry
- vary whether recovery reads local state before or after membership change
The schedule plan should identify which combinations run in pull-request CI and which belong in nightly search. The pull-request profile should be small, deterministic, and mandatory. The nightly profile can explore more interleavings, but it should upload only minimized failures, and those minimized failures are the ones promoted into required regressions.
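Bounded exploration can be as simple as enumerating a small cross product of named dimensions and then carving out the pull-request subset. The sketch below covers three of the dimensions listed above. The dimension values, the `bounded_matrix` and `pull_request_subset` helpers, and the selection rule itself are assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical dimension values, named after the variations listed above.
RETRY_TARGETS = ["A", "B", "C"]
CRASH_POINTS = ["before_durable_write", "after_durable_write"]
PROVIDER_RESPONSES = ["success", "timeout", "unknown", "duplicate_key_replay"]


def bounded_matrix():
    """Enumerate every combination in a fixed, reproducible order."""
    return [
        {"retry_target": r, "crash_point": c, "provider_response": p}
        for r, c, p in product(RETRY_TARGETS, CRASH_POINTS, PROVIDER_RESPONSES)
    ]


def pull_request_subset(matrix):
    """Assumed selection rule: keep the cases nearest to schedules 1 and 2."""
    return [
        case for case in matrix
        if case["crash_point"] == "before_durable_write"
        and case["provider_response"] in ("success", "unknown")
    ]


matrix = bounded_matrix()
required = pull_request_subset(matrix)
print(len(matrix), "cases for nightly search;", len(required), "required in pull requests")
```

Because the enumeration order is fixed, a case can be identified by its index in the matrix, which keeps nightly failure reports stable from run to run.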
Replay And Reduction Deliverables
Every failing run must produce a replay packet.
{
  "test": "checkout_confirm_order_idempotency",
  "seed": 482991,
  "workload": "retry_confirm_order_v3",
  "schedule": "logical-event-log",
  "fault_plan": "crash_after_provider_capture_before_outcome_fsync",
  "dependency_script": "provider_success_then_unknown",
  "model_version": "checkout-lab-2026-06",
  "invariant": "provider_captures_per_scoped_key_le_1",
  "rerun": "lab replay artifacts/482991/replay.json"
}
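The packet should be enough to reproduce the failure without the original CI worker. Include the harness version or model version, because a replay is only trustworthy against the code that produced it: once the harness or model changes, the same packet can pass, or fail for a different reason, without anyone noticing.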
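One way to enforce both requirements is to validate a packet before replaying it. The loader below is a sketch under assumptions: `load_packet` and `REQUIRED_FIELDS` are hypothetical names, and refusing to replay on a version mismatch is one possible policy, not the only reasonable one.

```python
import json

# Controlled inputs that a rerun cannot do without (taken from the packet above).
REQUIRED_FIELDS = {
    "test", "seed", "workload", "schedule", "fault_plan",
    "dependency_script", "model_version", "invariant",
}


def load_packet(text, current_model_version):
    """Parse a replay packet and reject it if it cannot be replayed faithfully."""
    packet = json.loads(text)
    missing = sorted(REQUIRED_FIELDS - set(packet))
    if missing:
        raise ValueError(f"replay packet is missing controlled inputs: {missing}")
    if packet["model_version"] != current_model_version:
        # Assumed policy: fail loudly instead of replaying against a different model.
        raise ValueError(
            f"packet was recorded with {packet['model_version']}, "
            f"but the lab is running {current_model_version}"
        )
    return packet


raw = json.dumps({
    "test": "checkout_confirm_order_idempotency",
    "seed": 482991,
    "workload": "retry_confirm_order_v3",
    "schedule": "logical-event-log",
    "fault_plan": "crash_after_provider_capture_before_outcome_fsync",
    "dependency_script": "provider_success_then_unknown",
    "model_version": "checkout-lab-2026-06",
    "invariant": "provider_captures_per_scoped_key_le_1",
})
packet = load_packet(raw, current_model_version="checkout-lab-2026-06")
print(packet["seed"], packet["fault_plan"])
```

A stricter lab might keep old model versions runnable so that historical packets stay replayable. That is a cost trade-off this sketch does not address.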
The lab should also include reduction rules. A useful reducer tries to remove:
- unrelated client operations
- faults that do not affect the invariant
- duplicate network perturbations
- delays that preserve the same causal order
- dependency responses not read by the system under test
The reduced counterexample should keep the failing oracle and the causal order that made it fail. A smaller replay is easier to debug, easier to explain in a runbook, and more likely to become a permanent regression instead of a one-off artifact.
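A very small reducer can illustrate the idea. The sketch below uses greedy one-at-a-time removal, a simpler cousin of delta debugging rather than the full algorithm. The event labels and the `still_fails` predicate are toy assumptions: in a real lab, `still_fails` would replay the candidate schedule and rerun the same oracle.

```python
def reduce_events(events, still_fails):
    """Drop one event at a time for as long as the failure is preserved."""
    current = list(events)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if still_fails(candidate):
                current = candidate     # keep the smaller failing schedule
                changed = True
                break                   # restart the scan on the shorter list
    return current


# Toy stand-in for "replay and check the oracle": the failure needs these
# three steps to occur in this relative order.
CAUSAL_CORE = [
    "A captures payment",
    "A crashes before outcome fsync",
    "client retries against B",
]


def still_fails(events):
    """True if CAUSAL_CORE appears in events as an ordered subsequence."""
    remaining = iter(events)
    return all(step in remaining for step in CAUSAL_CORE)


original = [
    "client sends confirm_order",
    "unrelated client reads cart",
    "A captures payment",
    "network delay on C heartbeat",
    "A crashes before outcome fsync",
    "client retries against B",
]
print(reduce_events(original, still_fails))
```

Because the predicate checks the causal order and not just "some failure occurred", the reducer cannot drift to a different bug. That is the property the reduction rules are meant to protect.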
CI And Operations Deliverables
The lab should have three execution profiles.
pull_request:
    fixed replays for known bugs
    small deterministic smoke schedules
    strict runtime budget
    required artifact upload on failure
nightly:
    randomized schedule exploration
    broader fault matrix
    reducer enabled
    new failures opened with replay packets
incident_regression:
    replay packets derived from production incidents
    dependency scripts checked in
    runbook linked to expected symptoms and fix
Each profile needs a clear owner. A suite that fails without ownership turns into background noise. The runbook should say how to rerun a failure, how to inspect the effect log, how to decide whether the replay is still valid, and how to promote a reduced failure into required CI.
Good lab output looks like this:
failed invariant:
    provider_captures_per_scoped_key_le_1
counterexample:
    artifacts/checkout/482991/minimized-replay.json
rerun:
    lab replay artifacts/checkout/482991/minimized-replay.json
effect summary:
    capture_id=cap_183 order=o_77 key=k_91 replica=A
    capture_id=cap_184 order=o_77 key=k_91 replica=B
suspected boundary:
    crash after provider capture before durable outcome record
That output turns a distributed bug from an argument about timing into a concrete object the team can rerun.
Review Checklist
A reviewable lab design answers these questions:
- Which exact user-visible or external-effect claims are being tested?
- Which clocks, timers, schedulers, network events, durable-state boundaries, and dependency responses are controlled?
- Which claims are covered by fixed replay, deterministic simulation, randomized exploration, integration testing, and production-derived replay?
- What does each oracle read, and what does it intentionally ignore?
- What artifacts are uploaded on failure?
- What command reruns the minimized case locally?
- What evidence shows that the model has not drifted too far from production?
- Which tests are required in pull-request CI, and which are exploratory?
- Who owns failures from each profile?
The strongest submissions are explicit about boundaries. They do not claim that the lab proves the whole production system correct. They show that specific claims receive specific evidence under controlled sources of nondeterminism.
Worked Submission Outline
A compact submission can use this structure:
1. Claim
   At most one provider capture per scoped idempotency key.
2. System model
   Three replicas, replicated outcome table, scripted provider, retrying clients.
3. Controlled boundaries
   Scheduler, logical time, network, crash/restart, fsync points, provider responses.
4. Oracles
   External capture count, request hash isolation, outcome durability, replay fidelity.
5. Fault matrix
   Crash after external effect, delayed replication, provider unknown response, stale retry target.
6. Replay packet
   Seed, schedule, workload, fault plan, dependency script, model version, rerun command.
7. Reduction
   Remove irrelevant requests, faults, delays, and dependency responses while preserving failure.
8. CI placement
   Required fixed replays in pull requests, randomized exploration nightly, incident replays after production bugs.
9. Runbook
   Inspect effect log, rerun minimized packet, check model version, promote regression.
This outline is small enough to review but specific enough to expose shallow designs.
Common Failure Modes
- Oracle watches the wrong surface: The test asserts local rows but ignores provider captures.
- Replay packet omits a controlled input: The rerun command uses the same seed but not the same dependency script or scheduler choices.
- Faults are too broad: The lab injects random crashes without naming the claim each crash can falsify.
- Reduction changes the bug: The minimized run still fails, but through a different causal path than the original incident.
- CI profile is unrealistic: Pull-request tests are so expensive that teams skip them, or so tiny that they do not protect known regressions.
- Model drift is invisible: Production incidents keep involving behavior the lab intentionally excluded, but no calibration path updates the model.
Practice
Design the deterministic test lab for the CheckoutService scenario.
Your answer should include:
- one primary safety claim and one replay claim
- the system model and controlled boundaries
- at least three fixed dangerous schedules
- the oracle definitions and evidence fields
- the replay packet format
- the reduction strategy
- CI placement for fixed replay, randomized exploration, and incident regression
- a short runbook for a failing replay
Then critique your own design. Name one behavior that is intentionally outside the model, why that exclusion is acceptable for the claim, and what production evidence would force you to add it.
Connections
- 011.md gives the history and workload vocabulary needed to make the capstone oracles precise.
- 013.md explains the replay boundary for inputs, time, and scheduling.
- 023.md provides the strategy-selection review that the lab design turns into a concrete system.
Resources
- [BOOK] Designing Data-Intensive Applications
  - Focus: Use the chapters on replication, transactions, and consistency to keep the service claims honest.
- [DOC] Jepsen Analyses
  - Focus: Study how histories, faults, and externally visible effects are turned into falsifiable claims.
- [PAPER] FoundationDB: A Distributed Unbundled Transactional Key Value Store
  - Focus: Look for how deterministic simulation changes the cost of finding and reproducing distributed bugs.
- [DOC] OpenTelemetry Traces
  - Focus: Use trace structure as inspiration for incident packets that preserve causal context.
Key Takeaways
- A deterministic distributed test lab is an evidence pipeline that connects claims, controlled nondeterminism, oracles, replay packets, and CI ownership.
- The most important boundary is not the biggest test environment; it is the boundary where the harness controls the timing, faults, effects, and dependency behavior needed to falsify the claim.
- Replay packets must include every controlled input needed to reproduce the failure, including schedule choices, dependency scripts, fault plans, and model version.
- A useful capstone design is explicit about what it proves, what it excludes, and what production evidence would force the model to change.