Distributed Testing, Simulation, and Deterministic Replay: Capstone: Build a Deterministic Distributed Test Lab

024 30 min intermediate CAPSTONE

Core Insight

CheckoutService has reached the point where isolated tests no longer tell the whole truth. A provider contract test can prove that retries are accepted. A deterministic simulator can prove that one schedule is safe. A production trace can show how an incident happened. None of those artifacts, by itself, proves that the team can find, reproduce, reduce, and permanently guard against the next distributed failure.

A deterministic distributed test lab is the system that connects those pieces. It defines what the service promises, which parts of the world the harness controls, which faults are worth exploring, which evidence gets recorded, and how a failure becomes a stable regression. The non-obvious insight is that the lab is not mainly a bigger test suite. It is a controlled evidence pipeline.

The hard trade-off is breadth versus determinism. The more realistic the lab becomes, the more dependency behavior, process boundaries, and storage details it can observe, and the harder each run is to reproduce exactly. The more controlled it becomes, the easier it is to replay and shrink failures, and the more it depends on a model that can drift from production. A strong lab makes that boundary explicit instead of hiding it inside flaky CI runs.

Capstone Scenario

Build the design for a deterministic test lab around a three-replica CheckoutService.

The service accepts confirm_order(order_id, idempotency_key), writes an outcome record, and calls a payment provider. The team wants this claim to be true:

For a scoped idempotency key, CheckoutService creates at most one external provider capture,
even when clients retry, replicas fail, messages are delayed, and provider responses are ambiguous.

The lab must also support a second claim:

When a known failure has been reduced to a replay packet, any engineer can rerun it locally
and CI can reject a change that reintroduces the same bug.

Your deliverable is a reviewable lab design, not a complete implementation. It should be precise enough that another engineer can see what would be controlled, what would be observed, and what evidence would make the claims believable.

Lab Architecture

The lab needs one controlled boundary around the service and one evidence path out of it.

clients
  |
  v
workload generator
  |
  v
deterministic harness
  |-- logical clocks and timers
  |-- scheduler and thread/task interleavings
  |-- network delay, drop, duplicate, reorder
  |-- crash and restart points
  |-- durable-state flush and recovery boundaries
  |-- scripted payment-provider responses
  |
  v
CheckoutService replicas A, B, C
  |
  v
event log + effect log + invariant results + replay packet

The harness should own every source of nondeterminism that matters to the claims:

client request timing
retry timing
replica scheduling
message delivery order
leader or coordinator changes
local clock reads
timeout firing
process crash points
fsync or durable-write boundaries
dependency responses

It does not have to own everything in production. It does have to say what is outside the model. For example, it might model provider responses but not provider fraud rules, or model local disk durability points but not the exact production storage engine. Those exclusions are acceptable only when the design explains how production evidence will calibrate them.
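As a minimal sketch of what "owning" nondeterminism can look like, the snippet below routes time, event ordering, and network delay through one seeded object. The `Harness` class, its method names, and the event names are hypothetical illustrations, not part of any real CheckoutService harness.

```python
import heapq
import random
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Harness:
    """Hypothetical harness: owns time, ordering, and randomness for one run."""
    seed: int
    now: int = 0  # logical time; wall-clock time is never read
    _queue: List[Tuple[int, int, str, Callable[[], None]]] = field(default_factory=list)
    _seq: int = 0  # unique tie-breaker so simultaneous events have a total order
    event_log: List[Tuple[int, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The only source of randomness in the run, derived from the seed.
        self.rng = random.Random(self.seed)

    def schedule(self, delay: int, name: str, action: Callable[[], None]) -> None:
        """Fire `action` at a fixed logical delay (used for timers)."""
        heapq.heappush(self._queue, (self.now + delay, self._seq, name, action))
        self._seq += 1

    def send(self, name: str, action: Callable[[], None], max_delay: int = 5) -> None:
        """Deliver a message after a delay drawn from the seeded RNG."""
        self.schedule(self.rng.randint(1, max_delay), name, action)

    def run(self) -> List[Tuple[int, str]]:
        """Drain the queue in (time, sequence) order and record every event."""
        while self._queue:
            when, _, name, action = heapq.heappop(self._queue)
            self.now = when
            self.event_log.append((when, name))
            action()
        return self.event_log


def run_once(seed: int) -> List[Tuple[int, str]]:
    h = Harness(seed)
    # A request reaches replica A, which then replicates to B.
    h.send("confirm_order->A", lambda: h.send("replicate A->B", lambda: None))
    # The client retry timer fires at logical time 3 and retries against B.
    h.schedule(3, "client_retry_timeout", lambda: h.send("confirm_order->B", lambda: None))
    return h.run()


# Same seed, same event log: the property replay depends on.
assert run_once(42) == run_once(42)
```

The design choice worth noting is that nothing in the run reads a real clock or a global random generator. Any component that did so would sit outside the controlled boundary and would need to be listed as an exclusion.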

Required Claims And Oracles

A lab without oracles only produces interesting logs. Start by writing the claims in a form the harness can falsify.

Use at least these oracles:

safety:
  provider_captures(scope, idempotency_key).count <= 1

request integrity:
  same idempotency key with a different request hash is rejected or isolated

outcome durability:
  after a confirmed outcome is returned, recovery preserves enough state to avoid a second capture

replay fidelity:
  replay(seed, workload, schedule, fault_plan, dependency_script) under the same model_version produces the same invariant result

Notice what the first oracle counts: external captures, not just local database rows. A common exactly-once mistake is to assert internal state while the duplicate side effect has already escaped. The effect log is part of the truth source.

Each oracle should include its evidence fields:

oracle_name
scope
history_window
events_read
effects_read
pass_or_fail
counterexample_pointer

That shape lets a failing run become inspectable without asking someone to manually interpret raw logs.
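The safety oracle and its evidence fields could be sketched roughly as follows. The `Capture` and `OracleResult` types, the `merchant_1` scope value, and the pointer format are assumptions made for illustration; only the field names and the invariant come from the text above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Capture:
    """One external provider capture as recorded in the effect log (hypothetical shape)."""
    capture_id: str
    scope: str
    idempotency_key: str
    replica: str


@dataclass(frozen=True)
class OracleResult:
    oracle_name: str
    scope: str
    history_window: Tuple[int, int]  # logical start and end time that were checked
    events_read: int
    effects_read: int
    pass_or_fail: str
    counterexample_pointer: Optional[str]


def check_capture_count(
    effects: List[Capture],
    scope: str,
    history_window: Tuple[int, int],
    events_read: int,
) -> OracleResult:
    """Fail if any idempotency key in `scope` has more than one external capture."""
    by_key: Dict[str, List[str]] = defaultdict(list)
    for capture in effects:
        if capture.scope == scope:
            by_key[capture.idempotency_key].append(capture.capture_id)

    duplicates = {k: ids for k, ids in sorted(by_key.items()) if len(ids) > 1}
    pointer = None
    if duplicates:
        # Point at the first offending key and its capture ids.
        key, ids = next(iter(duplicates.items()))
        pointer = f"effect_log:{key}:{','.join(ids)}"

    return OracleResult(
        oracle_name="provider_captures_per_scoped_key_le_1",
        scope=scope,
        history_window=history_window,
        events_read=events_read,
        effects_read=len(effects),
        pass_or_fail="fail" if duplicates else "pass",
        counterexample_pointer=pointer,
    )


effects = [
    Capture("cap_183", "merchant_1", "k_91", "A"),
    Capture("cap_184", "merchant_1", "k_91", "B"),
]
result = check_capture_count(effects, "merchant_1", (0, 40), events_read=12)
print(result.pass_or_fail, result.counterexample_pointer)
```

The oracle reads only the effect log, which is what keeps it honest about side effects that escaped before local state was written.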

Fault And Schedule Plan

The fault plan should cover the causal shapes most likely to break the claim. It should not spray random chaos at the system and hope confidence appears.

Start with fixed schedules for known dangerous boundaries:

schedule 1:
  client sends confirm_order
  replica A writes pending outcome
  A calls provider
  provider captures payment
  A crashes before durable confirmed outcome
  client retries against B

schedule 2:
  client sends confirm_order
  provider returns unknown response
  retry races with replication of the outcome record
  delayed message from A arrives after B starts recovery

schedule 3:
  same idempotency key is reused with a different request hash
  one replica has stale idempotency metadata
  network heals after both replicas accepted work

Then add bounded exploration around those schedules:

vary which replica receives the retry
vary crash point before and after durable write
vary provider response as success, timeout, unknown, duplicate-key replay
vary message delay around outcome replication
vary timeout order relative to client retry
vary whether recovery reads local state before or after membership change

The schedule plan should identify which combinations run in pull-request CI and which belong in nightly search. The pull-request profile should be small, deterministic, and mandatory. The nightly profile can explore more interleavings, but only failures that have been minimized should be promoted into required regressions.
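One way to keep the exploration bounded is to enumerate the variation axes explicitly and sample from them with a recorded seed. The sketch below assumes hypothetical axis names and values derived from the list above; a real lab would map each value to a concrete harness action.

```python
import itertools
import random
from typing import Dict, Iterator, List

# Hypothetical encoding of the variation axes named in the plan.
DIMENSIONS: Dict[str, List[str]] = {
    "retry_target": ["A", "B", "C"],
    "crash_point": ["before_durable_write", "after_durable_write"],
    "provider_response": ["success", "timeout", "unknown", "duplicate_key_replay"],
    "replication_delay": ["none", "delayed"],
    "timeout_vs_retry": ["timeout_first", "retry_first"],
    "recovery_read": ["before_membership_change", "after_membership_change"],
}


def all_fault_plans() -> Iterator[Dict[str, str]]:
    """Yield every combination of the axes in a stable order."""
    names = list(DIMENSIONS)
    for values in itertools.product(*(DIMENSIONS[n] for n in names)):
        yield dict(zip(names, values))


def nightly_sample(seed: int, k: int) -> List[Dict[str, str]]:
    """Pick k plans with a seeded RNG so the nightly selection is itself replayable."""
    plans = list(all_fault_plans())
    return random.Random(seed).sample(plans, k)


total = sum(1 for _ in all_fault_plans())
print(total)  # 3 * 2 * 4 * 2 * 2 * 2 = 192 combinations
assert nightly_sample(seed=7, k=10) == nightly_sample(seed=7, k=10)
```

With 192 combinations the full matrix is small enough to run nightly, while the pull-request profile would pin a handful of named plans rather than sample at all.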

Replay And Reduction Deliverables

Every failing run must produce a replay packet.

{
  "test": "checkout_confirm_order_idempotency",
  "seed": 482991,
  "workload": "retry_confirm_order_v3",
  "schedule": "logical-event-log",
  "fault_plan": "crash_after_provider_capture_before_outcome_fsync",
  "dependency_script": "provider_success_then_unknown",
  "model_version": "checkout-lab-2026-06",
  "invariant": "provider_captures_per_scoped_key_le_1",
  "rerun": "lab replay artifacts/482991/replay.json"
}

The packet should be enough to reproduce the failure without the original CI worker. Include the harness version or model version, because a packet replayed against changed harness or model code can pass or fail for reasons unrelated to the original bug.
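A replay entry point can refuse to run a packet that is incomplete or was recorded against a different model. The following is a sketch under assumptions: `load_replay_packet`, `ReplayPacketError`, and the required-field set are hypothetical names, with the fields taken from the example packet above.

```python
import json

# Controlled inputs a packet must carry before a replay is trusted (assumed set).
REQUIRED_FIELDS = frozenset({
    "test", "seed", "workload", "schedule", "fault_plan",
    "dependency_script", "model_version", "invariant",
})


class ReplayPacketError(ValueError):
    """Raised when a packet cannot be replayed faithfully."""


def load_replay_packet(text: str, harness_model_version: str) -> dict:
    """Parse a packet and reject it if a controlled input or the model version is off."""
    packet = json.loads(text)
    missing = sorted(REQUIRED_FIELDS - set(packet))
    if missing:
        raise ReplayPacketError(f"packet omits controlled inputs: {missing}")
    if packet["model_version"] != harness_model_version:
        raise ReplayPacketError(
            f"packet recorded with {packet['model_version']}, "
            f"harness is {harness_model_version}"
        )
    return packet


packet_text = json.dumps({
    "test": "checkout_confirm_order_idempotency",
    "seed": 482991,
    "workload": "retry_confirm_order_v3",
    "schedule": "logical-event-log",
    "fault_plan": "crash_after_provider_capture_before_outcome_fsync",
    "dependency_script": "provider_success_then_unknown",
    "model_version": "checkout-lab-2026-06",
    "invariant": "provider_captures_per_scoped_key_le_1",
})
packet = load_replay_packet(packet_text, "checkout-lab-2026-06")
print(packet["seed"], packet["fault_plan"])
```

Failing loudly on a version mismatch is a deliberate choice here: a replay that silently passes under a newer model says nothing about whether the original bug is fixed.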

The lab should also include reduction rules. A useful reducer tries to remove:

unrelated client operations
faults that do not affect the invariant
duplicate network perturbations
delays that preserve the same causal order
dependency responses not read by the system under test

The reduced counterexample should keep the failing oracle and the causal order that made it fail. A smaller replay is easier to debug, easier to explain in a runbook, and more likely to become a permanent regression instead of a one-off artifact.
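The reduction rules can be sketched with a greedy reducer that drops one step at a time and keeps the removal only if the same invariant still fails. This is a simplification in the spirit of delta debugging, not a full implementation of it, and the toy service model and step names below are hypothetical.

```python
from typing import Callable, List


def violates_capture_invariant(steps: List[str]) -> bool:
    """Toy model: a replica captures payment unless a durable outcome already exists."""
    captures = 0
    outcome_durable = False
    crash_armed = False
    for step in steps:
        if step == "fault:crash_A_before_outcome_fsync":
            crash_armed = True
        elif step in ("client:confirm->A", "client:retry->B"):
            replica = step[-1]
            if outcome_durable:
                continue  # the replica returns the stored outcome, no new capture
            captures += 1
            if replica == "A" and crash_armed:
                continue  # A crashes before the outcome record becomes durable
            outcome_durable = True
        # Any other step is a no-op for this invariant in the toy model.
    return captures > 1


def reduce_steps(steps: List[str], still_fails: Callable[[List[str]], bool]) -> List[str]:
    """Greedily remove single steps while the failure is preserved."""
    if not still_fails(steps):
        raise ValueError("the starting schedule must fail the oracle")
    current = list(steps)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if still_fails(candidate):
                current = candidate
                changed = True
                break  # restart the scan on the smaller schedule
    return current


original = [
    "client:read_cart",
    "fault:crash_A_before_outcome_fsync",
    "client:confirm->A",
    "net:delay_heartbeat",
    "net:duplicate_heartbeat",
    "client:retry->B",
]
minimized = reduce_steps(original, violates_capture_invariant)
print(minimized)
```

In this toy run the reducer keeps the crash fault, the original request, and the retry, and drops the unrelated client operation and network perturbations. Reusing the original oracle as the `still_fails` predicate guards against ending up with a different failure, although it does not by itself prove the causal path is unchanged.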

CI And Operations Deliverables

The lab should have three execution profiles.

pull_request:
  fixed replays for known bugs
  small deterministic smoke schedules
  strict runtime budget
  required artifact upload on failure

nightly:
  randomized schedule exploration
  broader fault matrix
  reducer enabled
  new failures opened with replay packets

incident_regression:
  replay packets derived from production incidents
  dependency scripts checked in
  runbook linked to expected symptoms and fix

Each profile needs a clear owner. A suite that fails without ownership turns into background noise. The runbook should say how to rerun a failure, how to inspect the effect log, how to decide whether the replay is still valid, and how to promote a reduced failure into required CI.

Good lab output looks like this:

failed invariant:
  provider_captures_per_scoped_key_le_1

counterexample:
  artifacts/checkout/482991/minimized-replay.json

rerun:
  lab replay artifacts/checkout/482991/minimized-replay.json

effect summary:
  capture_id=cap_183 order=o_77 key=k_91 replica=A
  capture_id=cap_184 order=o_77 key=k_91 replica=B

suspected boundary:
  crash after provider capture before durable outcome record

That output turns a distributed bug from an argument about timing into a concrete object the team can rerun.

Review Checklist

A reviewable lab design answers these questions:

Which exact user-visible or external-effect claims are being tested?
Which clocks, timers, schedulers, network events, durable-state boundaries, and dependency responses are controlled?
Which claims are covered by fixed replay, deterministic simulation, randomized exploration, integration testing, and production-derived replay?
What does each oracle read, and what does it intentionally ignore?
What artifacts are uploaded on failure?
What command reruns the minimized case locally?
What evidence shows that the model has not drifted too far from production?
Which tests are required in pull-request CI, and which are exploratory?
Who owns failures from each profile?

The strongest submissions are explicit about boundaries. They do not claim that the lab proves the whole production system correct. They show that specific claims receive specific evidence under controlled sources of nondeterminism.

Worked Submission Outline

A compact submission can use this structure:

1. Claim
   At most one provider capture per scoped idempotency key.

2. System model
   Three replicas, replicated outcome table, scripted provider, retrying clients.

3. Controlled boundaries
   Scheduler, logical time, network, crash/restart, fsync points, provider responses.

4. Oracles
   External capture count, request hash isolation, outcome durability, replay fidelity.

5. Fault matrix
   Crash after external effect, delayed replication, provider unknown response, stale retry target.

6. Replay packet
   Seed, schedule, workload, fault plan, dependency script, model version, rerun command.

7. Reduction
   Remove irrelevant requests, faults, delays, and dependency responses while preserving failure.

8. CI placement
   Required fixed replays in pull requests, randomized exploration nightly, incident replays after production bugs.

9. Runbook
   Inspect effect log, rerun minimized packet, check model version, promote regression.

This outline is small enough to review but specific enough to expose shallow designs.

Common Failure Modes

Oracle watches the wrong surface: The test asserts local rows but ignores provider captures.
Replay packet omits a controlled input: The rerun command uses the same seed but not the same dependency script or scheduler choices.
Faults are too broad: The lab injects random crashes without naming the claim each crash can falsify.
Reduction changes the bug: The minimized run still fails, but through a different causal path than the original incident.
CI profile is unrealistic: Pull-request tests are so expensive that teams skip them, or so tiny that they do not protect known regressions.
Model drift is invisible: Production incidents keep involving behavior the lab intentionally excluded, but no calibration path updates the model.

Practice

Design the deterministic test lab for the CheckoutService scenario.

Your answer should include:

one primary safety claim and one replay claim
the system model and controlled boundaries
at least three fixed dangerous schedules
the oracle definitions and evidence fields
the replay packet format
the reduction strategy
CI placement for fixed replay, randomized exploration, and incident regression
a short runbook for a failing replay

Then critique your own design. Name one behavior that is intentionally outside the model, why that exclusion is acceptable for the claim, and what production evidence would force you to add it.

Connections

011.md gives the history and workload vocabulary needed to make the capstone oracles precise.
013.md explains the replay boundary for inputs, time, and scheduling.
023.md provides the strategy-selection review that the lab design turns into a concrete system.

Resources

[BOOK] Designing Data-Intensive Applications
- Focus: Use the chapters on replication, transactions, and consistency to keep the service claims honest.
[DOC] Jepsen Analyses
- Focus: Study how histories, faults, and externally visible effects are turned into falsifiable claims.
[PAPER] FoundationDB: A Distributed Unbundled Transactional Key Value Store
- Focus: Look for how deterministic simulation changes the cost of finding and reproducing distributed bugs.
[DOC] OpenTelemetry Traces
- Focus: Use trace structure as inspiration for incident packets that preserve causal context.

Key Takeaways

A deterministic distributed test lab is an evidence pipeline that connects claims, controlled nondeterminism, oracles, replay packets, and CI ownership.
The most important boundary is not the biggest test environment; it is the boundary where the harness controls the timing, faults, effects, and dependency behavior needed to falsify the claim.
Replay packets must include every controlled input needed to reproduce the failure, including schedule choices, dependency scripts, fault plans, and model version.
A useful capstone design is explicit about what it proves, what it excludes, and what production evidence would force the model to change.