Distributed Testing, Simulation, and Deterministic Replay: Capstone: Build a Deterministic Distributed Test Lab

024 30 min intermediate CAPSTONE

Core Insight

CheckoutService has reached the point where isolated tests no longer tell the whole truth. A provider contract test can prove that retries are accepted. A deterministic simulator can prove that one schedule is safe. A production trace can show how an incident happened. None of those artifacts, by itself, proves that the team can find, reproduce, reduce, and permanently guard against the next distributed failure.

A deterministic distributed test lab is the system that connects those pieces. It defines what the service promises, which parts of the world the harness controls, which faults are worth exploring, which evidence gets recorded, and how a failure becomes a stable regression. The non-obvious insight is that the lab is not mainly a bigger test suite. It is a controlled evidence pipeline.

The hard trade-off is breadth versus determinism. The more realistic the lab becomes, the more dependency behavior, process boundaries, and storage details it can observe. The more controlled it becomes, the easier it is to replay and shrink failures. A strong lab makes that boundary explicit instead of hiding it inside flaky CI runs.

Capstone Scenario

Build the design for a deterministic test lab around a three-replica CheckoutService.

The service accepts confirm_order(order_id, idempotency_key), writes an outcome record, and calls a payment provider. The team wants this claim to be true:

For a scoped idempotency key, CheckoutService creates at most one external provider capture,
even when clients retry, replicas fail, messages are delayed, and provider responses are ambiguous.

The lab must also support a second claim:

When a known failure has been reduced to a replay packet, any engineer can rerun it locally
and CI can reject a change that reintroduces the same bug.

Your deliverable is a reviewable lab design, not a complete implementation. It should be precise enough that another engineer can see what would be controlled, what would be observed, and what evidence would make the claims believable.

Lab Architecture

The lab needs one controlled boundary around the service and one evidence path out of it.

clients
  |
  v
workload generator
  |
  v
deterministic harness
  |-- logical clocks and timers
  |-- scheduler and thread/task interleavings
  |-- network delay, drop, duplicate, reorder
  |-- crash and restart points
  |-- durable-state flush and recovery boundaries
  |-- scripted payment-provider responses
  |
  v
CheckoutService replicas A, B, C
  |
  v
event log + effect log + invariant results + replay packet

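The harness should own every source of nondeterminism that matters to the claims: logical time and timers, task scheduling, network delivery, crash and restart points, durable-state flush and recovery boundaries, and payment-provider responses.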

It does not have to own everything in production. It does have to say what is outside the model. For example, it might model provider responses but not provider fraud rules, or model local disk durability points but not the exact production storage engine. Those exclusions are acceptable only when the design explains how production evidence will calibrate them.
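As a minimal sketch of what "owning nondeterminism" can mean, the toy harness below keeps a logical clock, a single seeded random source, and an ordered event queue. The names Harness, send, and run_once are hypothetical and only illustrate the idea; a real lab would also intercept timers, storage flushes, and provider calls.

```python
import heapq
import random
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Harness:
    """Toy harness: owns time, ordering, and randomness for one run."""
    seed: int
    now: int = 0  # logical time, never the wall clock
    _queue: List[Tuple[int, int, str, Callable[[], None]]] = field(default_factory=list)
    _counter: int = 0
    event_log: List[Tuple[int, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The only randomness source in the run, fully determined by the seed.
        self.rng = random.Random(self.seed)

    def schedule(self, delay: int, label: str, action: Callable[[], None]) -> None:
        # The counter breaks ties, so equal timestamps pop in insertion order.
        heapq.heappush(self._queue, (self.now + delay, self._counter, label, action))
        self._counter += 1

    def send(self, label: str, action: Callable[[], None]) -> None:
        # Network delay comes from the seeded RNG, so it is replayable.
        self.schedule(self.rng.randint(1, 10), label, action)

    def run(self) -> List[Tuple[int, str]]:
        while self._queue:
            self.now, _, label, action = heapq.heappop(self._queue)
            self.event_log.append((self.now, label))
            action()
        return self.event_log


def run_once(seed: int) -> List[Tuple[int, str]]:
    """Deliver one request to each replica and return the ordered event log."""
    harness = Harness(seed)
    for replica in ("A", "B", "C"):
        harness.send(f"deliver confirm_order to {replica}", lambda: None)
    return harness.run()


# Same seed, same event log: the property replay depends on.
print(run_once(42) == run_once(42))
```

Because delivery order depends only on the seed and the insertion order, two runs with the same seed produce identical event logs. Anything that reads the wall clock or an unseeded random source would break that property, which is why such sources must either be routed through the harness or declared outside the model.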

Required Claims And Oracles

A lab without oracles only produces interesting logs. Start by writing the claims in a form the harness can falsify.

Use at least these oracles:

safety:
  provider_captures(scope, idempotency_key).count <= 1

request integrity:
  same idempotency key with a different request hash is rejected or isolated

outcome durability:
  after a confirmed outcome is returned, recovery preserves enough state to avoid a second capture

replay fidelity:
  replay(seed, schedule, fault_plan, dependency_script) produces the same invariant result

Notice what the first oracle counts: external captures, not just local database rows. A common exactly-once mistake is to assert internal state while the duplicate side effect has already escaped. The effect log is part of the truth source.

Each oracle should include its evidence fields:

oracle_name
scope
history_window
events_read
effects_read
pass_or_fail
counterexample_pointer

That shape lets a failing run become inspectable without asking someone to manually interpret raw logs.
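The safety oracle and its evidence fields could be sketched as follows. This is an illustration under assumptions, not the lab's actual API: the Capture record, the at_most_one_capture function, the scope value merchant_1, and the sample window and event count are all hypothetical. The capture ids and key reuse the example values that appear later in this lesson.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Capture:
    """One external provider capture, as recorded in the effect log."""
    capture_id: str
    scope: str
    idempotency_key: str
    replica: str


@dataclass
class OracleResult:
    """Evidence fields carried by every oracle verdict."""
    oracle_name: str
    scope: str
    history_window: Tuple[int, int]
    events_read: int
    effects_read: int
    pass_or_fail: str
    counterexample_pointer: Optional[str]


def at_most_one_capture(
    effects: List[Capture],
    scope: str,
    history_window: Tuple[int, int],
    events_read: int,
) -> OracleResult:
    """Safety oracle: count external captures per idempotency key in one scope."""
    by_key: Dict[str, List[str]] = defaultdict(list)
    for effect in effects:
        if effect.scope == scope:
            by_key[effect.idempotency_key].append(effect.capture_id)
    duplicated = sorted(k for k, ids in by_key.items() if len(ids) > 1)
    pointer = None
    if duplicated:
        key = duplicated[0]
        pointer = f"key={key} captures={','.join(by_key[key])}"
    return OracleResult(
        oracle_name="provider_captures_per_scoped_key_le_1",
        scope=scope,
        history_window=history_window,
        events_read=events_read,
        effects_read=len(effects),
        pass_or_fail="fail" if duplicated else "pass",
        counterexample_pointer=pointer,
    )


effect_log = [
    Capture("cap_183", "merchant_1", "k_91", "A"),
    Capture("cap_184", "merchant_1", "k_91", "B"),
]
result = at_most_one_capture(effect_log, "merchant_1", (0, 120), events_read=14)
print(result.pass_or_fail, result.counterexample_pointer)
```

The oracle reads the effect log rather than service state, so a duplicate capture is detected even when every replica's local records look consistent.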

Fault And Schedule Plan

The fault plan should cover the causal shapes most likely to break the claim. It should not spray random chaos at the system and hope confidence appears.

Start with fixed schedules for known dangerous boundaries:

schedule 1:
  client sends confirm_order
  replica A writes pending outcome
  A calls provider
  provider captures payment
  A crashes before durable confirmed outcome
  client retries against B

schedule 2:
  client sends confirm_order
  provider returns unknown response
  retry races with replication of the outcome record
  delayed message from A arrives after B starts recovery

schedule 3:
  same idempotency key is reused with a different request hash
  one replica has stale idempotency metadata
  network heals after both replicas accepted work
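A fixed schedule is easiest to review when it is plain data that a model can execute. The sketch below runs a simplified version of schedule 1 against a deliberately naive toy model of the service. Everything here is an assumption made for illustration: the step names, the run_schedule function, the shared durable table, and the check_pending guard are hypothetical and do not describe the real CheckoutService.

```python
from typing import Dict, List, Tuple

Step = Tuple[str, str]  # (action, replica)

# Schedule 1: A captures, crashes before the confirmed outcome is durable,
# and the client retries against B.
SCHEDULE_1: List[Step] = [
    ("confirm_order", "A"),
    ("crash", "A"),
    ("confirm_order", "B"),
]

# Control schedule: A flushes the confirmed outcome before the retry arrives.
CONTROL: List[Step] = [
    ("confirm_order", "A"),
    ("flush", "A"),
    ("confirm_order", "B"),
]


def run_schedule(
    steps: List[Step], key: str = "k_91", check_pending: bool = False
) -> List[Tuple[str, str]]:
    """Run steps against a toy model and return the effect log of captures."""
    durable: Dict[str, str] = {}   # replicated outcome records: key -> state
    volatile: Dict[str, str] = {}  # replica -> key with an unflushed confirmation
    captures: List[Tuple[str, str]] = []
    for action, replica in steps:
        if action == "confirm_order":
            state = durable.get(key)
            if state == "confirmed":
                continue  # outcome already known, no new capture
            if state == "pending" and check_pending:
                continue  # a real service would query the provider here
            durable[key] = "pending"
            captures.append((replica, key))  # the external side effect escapes
            volatile[replica] = key
        elif action == "flush":
            if replica in volatile:
                durable[volatile.pop(replica)] = "confirmed"
        elif action == "crash":
            volatile.pop(replica, None)  # unflushed state is lost
        else:
            raise ValueError(f"unknown action: {action}")
    return captures


print(len(run_schedule(SCHEDULE_1)))  # naive model under schedule 1
print(len(run_schedule(CONTROL)))     # same model, no crash
```

In this toy model the crash schedule yields two captures and the control schedule yields one, which is exactly the difference the safety oracle is meant to catch. Passing check_pending=True shows one possible guard, again only as an assumption about how a fix might look.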

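Then add bounded exploration around those schedules, varying the dimensions the harness controls (message delay, drop, duplication, and reordering; crash and restart points; provider responses) within fixed limits.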

The schedule plan should identify which combinations run in pull-request CI and which belong in nightly search. The pull-request profile should be small, deterministic, and mandatory. The nightly profile can explore more interleavings and upload only minimized failures as required regressions.
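One way to keep exploration bounded is to enumerate a small, explicit fault matrix and split it between the two profiles. The sketch below assumes hypothetical dimension values and case lists; the actual dimensions would come from the lab's fault plan.

```python
import itertools
import random
from typing import List, Tuple

Case = Tuple[str, str, str]  # (crash_point, retry_target, provider_response)

# Hypothetical dimensions, chosen to mirror the fixed schedules.
CRASH_POINTS = [
    "none",
    "after_pending_write",
    "after_provider_capture",
    "after_confirmed_write",
]
RETRY_TARGETS = ["A", "B", "C"]
PROVIDER_RESPONSES = ["success", "unknown"]

# Small, fixed, mandatory set for pull-request CI.
PULL_REQUEST_CASES: List[Case] = [
    ("after_provider_capture", "B", "success"),
    ("none", "B", "unknown"),
]


def fault_matrix() -> List[Case]:
    """Full cross product of the bounded dimensions, in a stable order."""
    return list(itertools.product(CRASH_POINTS, RETRY_TARGETS, PROVIDER_RESPONSES))


def nightly_sample(seed: int, budget: int) -> List[Case]:
    """Seeded sample of the matrix, so a nightly run can be repeated exactly."""
    cases = fault_matrix()
    rng = random.Random(seed)
    return rng.sample(cases, min(budget, len(cases)))


print(len(fault_matrix()))
print(nightly_sample(seed=482991, budget=5) == nightly_sample(seed=482991, budget=5))
```

The matrix here has 4 x 3 x 2 = 24 cases. Recording the nightly seed next to any failure keeps the randomized profile as reproducible as the fixed one.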

Replay And Reduction Deliverables

Every failing run must produce a replay packet.

{
  "test": "checkout_confirm_order_idempotency",
  "seed": 482991,
  "workload": "retry_confirm_order_v3",
  "schedule": "logical-event-log",
  "fault_plan": "crash_after_provider_capture_before_outcome_fsync",
  "dependency_script": "provider_success_then_unknown",
  "model_version": "checkout-lab-2026-06",
  "invariant": "provider_captures_per_scoped_key_le_1",
  "rerun": "lab replay artifacts/482991/replay.json"
}

The packet should be enough to reproduce the failure without the original CI worker. Include the harness version or model version, because deterministic replay can lie when the replay code changes.
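A replay command can guard against that failure mode by validating the packet before running it. The following sketch assumes a hypothetical loader, a hypothetical StaleReplayError, and a run callable supplied by the harness; only the field names come from the packet above.

```python
import json
from typing import Any, Callable, Dict

REQUIRED_FIELDS = (
    "test", "seed", "workload", "schedule", "fault_plan",
    "dependency_script", "model_version", "invariant",
)


class StaleReplayError(Exception):
    """The packet was recorded against a different model version."""


def load_packet(text: str, current_model_version: str) -> Dict[str, Any]:
    """Parse a replay packet and refuse to replay it against the wrong model."""
    packet = json.loads(text)
    missing = [name for name in REQUIRED_FIELDS if name not in packet]
    if missing:
        raise ValueError(f"replay packet is missing fields: {missing}")
    if packet["model_version"] != current_model_version:
        raise StaleReplayError(
            f"packet recorded with {packet['model_version']}, "
            f"harness is {current_model_version}"
        )
    return packet


def replay_is_deterministic(
    run: Callable[[Dict[str, Any]], Any], packet: Dict[str, Any], attempts: int = 2
) -> bool:
    """Replay fidelity check: every attempt must give the same invariant result."""
    results = [run(packet) for _ in range(attempts)]
    return all(result == results[0] for result in results)


PACKET_TEXT = """
{
  "test": "checkout_confirm_order_idempotency",
  "seed": 482991,
  "workload": "retry_confirm_order_v3",
  "schedule": "logical-event-log",
  "fault_plan": "crash_after_provider_capture_before_outcome_fsync",
  "dependency_script": "provider_success_then_unknown",
  "model_version": "checkout-lab-2026-06",
  "invariant": "provider_captures_per_scoped_key_le_1",
  "rerun": "lab replay artifacts/482991/replay.json"
}
"""

packet = load_packet(PACKET_TEXT, current_model_version="checkout-lab-2026-06")
print(packet["invariant"])
```

Failing loudly on a version mismatch is a design choice: a stale packet that silently passes is worse than one that forces someone to re-record or re-validate it.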

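The lab should also include reduction rules. A useful reducer tries to remove irrelevant requests, faults, delays, and dependency responses, keeping each removal only if the failure still reproduces.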

The reduced counterexample should keep the failing oracle and the causal order that made it fail. A smaller replay is easier to debug, easier to explain in a runbook, and more likely to become a permanent regression instead of a one-off artifact.
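A reducer can be as simple as greedy one-step-at-a-time removal, a simplified relative of delta debugging. The sketch below is an assumption about how such a reducer might look: reduce_steps, the step labels, and the toy fails oracle are hypothetical. In a real lab, the still_fails callback would replay the candidate schedule and re-run the failing oracle.

```python
from typing import Callable, List


def reduce_steps(
    steps: List[str], still_fails: Callable[[List[str]], bool]
) -> List[str]:
    """Greedily drop steps while the failure still reproduces.

    The relative order of the surviving steps is never changed.
    """
    if not still_fails(steps):
        raise ValueError("the starting schedule does not fail")
    current = list(steps)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if still_fails(candidate):
                current = candidate
                changed = True
                break  # restart the scan on the smaller schedule
    return current


def fails(steps: List[str]) -> bool:
    """Toy oracle: fails when A captures, then crashes, then B captures."""
    remaining = iter(steps)
    core = ("A captures", "A crashes", "B captures")
    # Each `in` consumes the iterator, so this checks an ordered subsequence.
    return all(step in remaining for step in core)


noisy = [
    "C heartbeat",
    "A captures",
    "delay 5ms",
    "A crashes",
    "C heartbeat",
    "B captures",
    "network heals",
]
print(reduce_steps(noisy, fails))
```

With this toy oracle the seven-step schedule shrinks to the three steps that cause the failure, in their original order. Greedy removal finds a locally minimal schedule rather than a globally smallest one, which is usually enough for a readable regression.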

CI And Operations Deliverables

The lab should have three execution profiles.

pull_request:
  fixed replays for known bugs
  small deterministic smoke schedules
  strict runtime budget
  required artifact upload on failure

nightly:
  randomized schedule exploration
  broader fault matrix
  reducer enabled
  new failures opened with replay packets

incident_regression:
  replay packets derived from production incidents
  dependency scripts checked in
  runbook linked to expected symptoms and fix

Each profile needs a clear owner. A suite that fails without ownership turns into background noise. The runbook should say how to rerun a failure, how to inspect the effect log, how to decide whether the replay is still valid, and how to promote a reduced failure into required CI.

Good lab output looks like this:

failed invariant:
  provider_captures_per_scoped_key_le_1

counterexample:
  artifacts/checkout/482991/minimized-replay.json

rerun:
  lab replay artifacts/checkout/482991/minimized-replay.json

effect summary:
  capture_id=cap_183 order=o_77 key=k_91 replica=A
  capture_id=cap_184 order=o_77 key=k_91 replica=B

suspected boundary:
  crash after provider capture before durable outcome record

That output turns a distributed bug from an argument about timing into a concrete object the team can rerun.

Review Checklist

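A reviewable lab design states what the service promises, which parts of the world the harness controls, what sits outside the model, which oracles judge each run, which evidence gets recorded, and how a failure becomes a stable regression.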

The strongest submissions are explicit about boundaries. They do not claim that the lab proves the whole production system correct. They show that specific claims receive specific evidence under controlled sources of nondeterminism.

Worked Submission Outline

A compact submission can use this structure:

1. Claim
   At most one provider capture per scoped idempotency key.

2. System model
   Three replicas, replicated outcome table, scripted provider, retrying clients.

3. Controlled boundaries
   Scheduler, logical time, network, crash/restart, fsync points, provider responses.

4. Oracles
   External capture count, request hash isolation, outcome durability, replay fidelity.

5. Fault matrix
   Crash after external effect, delayed replication, provider unknown response, stale retry target.

6. Replay packet
   Seed, schedule, workload, fault plan, dependency script, model version, rerun command.

7. Reduction
   Remove irrelevant requests, faults, delays, and dependency responses while preserving failure.

8. CI placement
   Required fixed replays in pull requests, randomized exploration nightly, incident replays after production bugs.

9. Runbook
   Inspect effect log, rerun minimized packet, check model version, promote regression.

This outline is small enough to review but specific enough to expose shallow designs.

Practice

Design the deterministic test lab for the CheckoutService scenario.

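Your answer should include each part of the worked submission outline: claim, system model, controlled boundaries, oracles, fault matrix, replay packet, reduction, CI placement, and runbook.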

Then critique your own design. Name one behavior that is intentionally outside the model, why that exclusion is acceptable for the claim, and what production evidence would force you to add it.
