Distributed Testing, Simulation, and Deterministic Replay: CI Integration, Runtime Budgets, and Failure Triage

021 30 min intermediate

Core Insight

In CheckoutService, the deterministic lab can find rare duplicate-capture schedules, shrink them, and replay production incidents from observability packets. That is powerful, but it creates a new operational question: which of those tests should run on every pull request, which should run nightly, and what should happen when a randomized exploration finds a new counterexample at 3 a.m.?

Distributed test suites lose their value when they are either too weak or too expensive for CI. If every commit runs only shallow unit tests, dangerous interleavings escape. If every commit runs hours of randomized simulation, engineers ignore the signal or bypass the suite. CI integration is the discipline of placing each test mode where its cost, signal, and reproducibility fit.

The trade-off is speed versus search depth. Pull-request checks need fast, deterministic, high-signal failures. Longer jobs can explore more schedules, workloads, and fault combinations, but they must produce replay artifacts that a developer can run locally. A CI failure without a replay command is an alert; a CI failure with a seed, minimized history, and invariant report is a debugging entry point.

Test Modes In The Pipeline

Distributed testing usually needs several CI modes, not one giant job.

Fast deterministic checks run on every pull request:

scope: fixed replays, core invariants, small topologies
budget: seconds to a few minutes
failure: blocks merge
artifact: exact replay command

These checks should be boring in the best sense. They replay known incidents, known minimized counterexamples, and a small set of high-value schedules.

Bounded randomized exploration runs on pull requests when it is cheap enough:

scope: small seed set, limited schedules, short logical time
budget: a few minutes
failure: blocks merge only if replay is deterministic
artifact: seed plus generated scenario

The key is determinism after discovery. If a randomized job cannot reproduce its own failure with the same seed and schedule, it should be triaged as a harness problem before it becomes a product blocker.
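A minimal sketch of what "determinism after discovery" requires, assuming the schedule is derived only from the seed. The replica names, event names, and both function names are hypothetical and chosen for illustration.

```python
import random


def generate_schedule(seed: int, steps: int) -> list[tuple[str, str]]:
    """Derive a schedule of (replica, event) pairs from the seed alone."""
    # A private Random instance avoids hidden dependence on global RNG state.
    rng = random.Random(seed)
    replicas = ["A", "B", "C"]
    events = ["deliver", "delay", "drop", "crash"]
    return [(rng.choice(replicas), rng.choice(events)) for _ in range(steps)]


def is_replayable(seed: int, steps: int) -> bool:
    """A seed is a useful artifact only if it regenerates the same schedule."""
    return generate_schedule(seed, steps) == generate_schedule(seed, steps)
```

If any input to the run comes from outside the seeded generator (wall-clock time, a shared RNG, an unrecorded stub), the same check on a real harness can fail, which points at the harness rather than the product.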

Deep exploration runs outside the critical merge path:

scope: larger topologies, longer histories, fault combinations
budget: nightly or continuous background
failure: opens issue or pages owning team by policy
artifact: replay packet, minimized history, invariant result

Production incident replays should run as regression tests once they are distilled:

scope: one incident shape
budget: short and deterministic
failure: blocks merge
artifact: link to incident replay record

The pipeline is stronger when it separates these modes explicitly. Otherwise a slow exploratory job and a deterministic regression test look like the same kind of failure, even though they need different responses.
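The separation can be made explicit in code. The following is a sketch under the policies listed above; the Mode enum and blocks_merge function are illustrative names, not part of any existing tool.

```python
from enum import Enum


class Mode(Enum):
    FIXED_REPLAY = "fixed_replay"
    BOUNDED_EXPLORATION = "bounded_exploration"
    DEEP_EXPLORATION = "deep_exploration"
    INCIDENT_REPLAY = "incident_replay"


def blocks_merge(mode: Mode, replay_confirmed: bool) -> bool:
    """Decide whether a failure in the given mode should block the merge."""
    if mode in (Mode.FIXED_REPLAY, Mode.INCIDENT_REPLAY):
        # Fixed and incident replays are deterministic by construction.
        return True
    if mode is Mode.BOUNDED_EXPLORATION:
        # Randomized PR checks block only once the replay reproduces.
        return replay_confirmed
    # Deep exploration runs off the merge path: open an issue or page by policy.
    return False
```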

Runtime Budgets

A runtime budget is not just a timeout. It is a promise about how much search the job can perform and how useful the result must be.

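Useful budget dimensions include the number of seeds, the topology size (replicas and clients), the logical steps per run, the fault types injected, the maximum wall-clock runtime, and the time allowed for shrinking.

For example: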

pull request:
  seeds: 20
  replicas: 3
  clients: 2
  steps: 200
  faults: crash, delay, drop
  max runtime: 4 minutes
  shrink: 30 seconds

nightly:
  seeds: 2000
  replicas: 3 to 7
  clients: 1 to 20
  steps: 2000
  faults: crash, delay, drop, partition, restart
  max runtime: 3 hours
  shrink: 10 minutes per new failure

The small job is not a weaker version of the big job. It is a different contract. The small job protects the merge path from known and common failures. The big job searches for new rare failures and must feed its discoveries back into the fast suite.
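One way to treat the budget as a promise rather than a bare timeout is to have the job report how much search it actually completed. The sketch below assumes a hypothetical run_seed callback that returns True when every invariant held; the explore function and its injectable clock are illustrative, not an existing harness API.

```python
import time
from typing import Callable, Iterable


def explore(
    seeds: Iterable[int],
    run_seed: Callable[[int], bool],
    max_runtime_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Run seeds until the runtime budget is spent and report the search done."""
    deadline = clock() + max_runtime_s
    completed = 0
    failing_seeds = []
    for seed in seeds:
        if clock() >= deadline:
            break  # budget exhausted: stop before starting another seed
        if not run_seed(seed):
            failing_seeds.append(seed)
        completed += 1
    return {"seeds_completed": completed, "failing_seeds": failing_seeds}
```

Reporting seeds_completed alongside failures lets a reader see when a job passed only because it ran out of time before covering its seed set.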

Budget decisions should be visible in code or configuration:

ci_profile: pull_request
exploration_profile: nightly
replay_profile: regression

Hidden budgets create confusion. A developer should know whether a failure came from a fixed replay, bounded exploration, or deep search.
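Budgets kept in code might look like the following sketch. The field values mirror the example numbers above; the Budget class, the PROFILES table, and the per_seed_budget_s helper are illustrative names, not an existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    seeds: int
    replicas: tuple[int, int]  # (min, max) topology size
    clients: tuple[int, int]   # (min, max) concurrent clients
    steps: int
    faults: tuple[str, ...]
    max_runtime_s: int
    shrink_s: int              # shrink time allowed per new failure


PROFILES = {
    "pull_request": Budget(
        seeds=20, replicas=(3, 3), clients=(2, 2), steps=200,
        faults=("crash", "delay", "drop"),
        max_runtime_s=4 * 60, shrink_s=30,
    ),
    "nightly": Budget(
        seeds=2000, replicas=(3, 7), clients=(1, 20), steps=2000,
        faults=("crash", "delay", "drop", "partition", "restart"),
        max_runtime_s=3 * 60 * 60, shrink_s=10 * 60,
    ),
}


def per_seed_budget_s(profile: str) -> float:
    """Average wall-clock seconds per seed, ignoring shrinking and parallelism."""
    budget = PROFILES[profile]
    return budget.max_runtime_s / budget.seeds
```

With these numbers the pull-request profile allows 12 seconds per seed on average. A failure report that names its profile lets the reader look up exactly what search it promised.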

Failure Artifacts

Every CI failure from a distributed harness should produce a small, durable artifact set.

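At minimum, the set should identify the violated invariant, the seed, the profile that ran, the minimized replay file, and the exact rerun command.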

For CheckoutService, a useful failure artifact might look like:

failure:
  invariant: at_most_one_provider_capture_per_key
  seed: 91827
  profile: nightly
  replicas: A,B
  key: m1/k1
  external_effects: p778,p779
  minimized_replay: artifacts/replays/91827-min.json
  rerun:
    ./testlab replay artifacts/replays/91827-min.json

The rerun command matters. Developers should not have to reconstruct the lab setup from CI logs.
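A small sketch of building such an artifact so that the rerun command is always derived from the replay path rather than typed by hand. The function name and dictionary layout are assumptions; only the ./testlab replay command shape comes from the example above.

```python
import json


def build_failure_artifact(invariant: str, seed: int, profile: str,
                           replay_path: str) -> dict:
    """Assemble a failure record whose rerun command matches its replay file."""
    return {
        "invariant": invariant,
        "seed": seed,
        "profile": profile,
        "minimized_replay": replay_path,
        # Derived, never hand-written, so it cannot drift from the replay path.
        "rerun": f"./testlab replay {replay_path}",
    }


artifact = build_failure_artifact(
    "at_most_one_provider_capture_per_key", 91827, "nightly",
    "artifacts/replays/91827-min.json",
)
serialized = json.dumps(artifact, indent=2)  # ready to upload as a CI artifact
```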

Artifacts should also distinguish discovery from confirmation:

discovery run:
  randomized search found failure

confirmation run:
  replay of recorded schedule reproduced failure

shrink run:
  minimized replay still reproduced same invariant

If the discovery run fails but the confirmation run does not reproduce the failure, CI should report a harness nondeterminism problem, not a product regression.
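The three run types can be folded into a single reported status. This is a sketch with hypothetical status strings; None means the corresponding run has not happened yet.

```python
from typing import Optional


def classify_run(
    discovery_failed: bool,
    confirmation_failed: Optional[bool] = None,
    shrink_same_invariant: Optional[bool] = None,
) -> str:
    """Map discovery, confirmation, and shrink outcomes to one status."""
    if not discovery_failed:
        return "passed"
    if confirmation_failed is None:
        return "unconfirmed"  # a failure was found but replay has not run
    if not confirmation_failed:
        return "harness_nondeterminism"  # replay passed: suspect the harness
    if shrink_same_invariant:
        return "confirmed_and_minimized"
    return "confirmed"
```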

Triage Categories

Failure triage starts by naming what failed.

A product bug means the harness found a real invariant violation and replay confirms it.

action:
  block merge or open incident-level issue
  attach replay
  add minimized replay to regression suite after fix

A harness bug means the failure cannot be replayed because the test did not control or record an input.

action:
  fix clock, scheduler, dependency stub, cleanup, or replay record
  do not weaken the product invariant

A model bug means the simulator behavior is wrong or stale relative to the claim.

action:
  update model or claim boundary
  calibrate against integration, contract, or production evidence

An environment bug means CI infrastructure changed the result without representing the system under test.

action:
  isolate resources
  pin dependencies
  remove shared state
  record environment metadata

A known failure means CI rediscovered an open bug.

action:
  link to known issue
  avoid creating duplicate noise
  keep replay in regression tracking

Triage should be mechanical enough that two engineers reading the same artifact reach the same first classification.
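One way to make the first classification mechanical is a fixed decision order over facts recorded in the artifact. The order used here (environment, then harness, then model, then known issue, then product) is an assumption for illustration, as are the parameter names.

```python
from typing import Optional


def triage(
    replay_confirmed: bool,
    environment_suspect: bool = False,
    model_stale: bool = False,
    known_issue_id: Optional[str] = None,
) -> str:
    """Return the first triage category that matches the recorded facts."""
    if environment_suspect:
        return "environment_bug"  # CI infrastructure changed the result
    if not replay_confirmed:
        return "harness_bug"      # an input was not controlled or recorded
    if model_stale:
        return "model_bug"        # simulator is wrong or stale for the claim
    if known_issue_id is not None:
        return "known_failure"    # link to the open issue, avoid duplicates
    return "product_bug"          # confirmed invariant violation
```

Because the function takes only recorded facts, two engineers feeding it the same artifact get the same answer, and disagreements move to whether a fact was recorded correctly.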

Worked Example

A pull request changes retry backoff in CheckoutService. The PR job runs a small profile:

profile: pull_request
seeds: 20
replicas: 2
steps: 200
faults: delayed replication and crash

It fails:

invariant: at_most_one_provider_capture_per_key
seed: 1044
failure: duplicate captures p1,p2 for m1/k1
confirmation replay: passed

That is not yet a product blocker. The discovery run failed, but confirmation did not reproduce. The right triage is harness nondeterminism.

The artifact shows the missing input:

recorded:
  seed
  generated operations
  network schedule

missing:
  provider stub retry response order

The fix is to record dependency responses. After that, the same PR job fails again:

discovery: failed
confirmation replay: failed
shrink: reduced to 11 events
rerun: ./testlab replay artifacts/replays/1044-min.json

Now CI has a real product signal. The team can block the PR, attach the replay, and later add the minimized case to the regression profile.

The important point is that CI did not merely say "random test failed." It separated discovery, confirmation, shrinking, and classification.
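The shrink step in that flow can be as simple as greedy event removal. The sketch below is one generic approach (one-at-a-time deletion, a simplified relative of delta debugging) and not necessarily how the lab shrinks histories. The fails predicate is a stand-in for replaying a candidate schedule and checking the invariant.

```python
from typing import Callable


def shrink(events: list[str], fails: Callable[[list[str]], bool]) -> list[str]:
    """Drop events one at a time, keeping only those needed for the failure."""
    if not fails(events):
        raise ValueError("only shrink a schedule whose failure is confirmed")
    current = list(events)
    i = 0
    while i < len(current):
        candidate = current[:i] + current[i + 1:]
        if fails(candidate):
            current = candidate  # event i was not needed; retry the same index
        else:
            i += 1               # event i is needed; keep it and move on
    return current


def has_duplicate_capture(events: list[str]) -> bool:
    """Toy invariant check: two captures in one history is a violation."""
    return events.count("capture") >= 2


minimal = shrink(["retry", "capture", "delay", "capture", "crash"],
                 has_duplicate_capture)
```

Here minimal contains only the two capture events. A real shrinker also has to re-run confirmation on the result, since removing events can change which invariant fails.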

Keeping CI Actionable

CI should make the next action obvious.

Good failure output says:

FAILED invariant at_most_one_provider_capture_per_key
profile pull_request
confirmed deterministic replay
minimal replay artifacts/replays/1044-min.json
rerun ./testlab replay artifacts/replays/1044-min.json
suspected owner checkout-idempotency

Weak failure output says:

test failed after timeout
see logs

For distributed tests, "see logs" is rarely enough. The failure needs a replay path, a classification, and enough metadata to route ownership.
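A sketch of a formatter that refuses to emit a failure without a replay path, so weak output cannot be produced by accident. The function name and the wording of the unconfirmed status line are assumptions; the other lines follow the good output shown above.

```python
def format_failure(invariant: str, profile: str, replay_confirmed: bool,
                   replay_path: str, owner: str) -> str:
    """Render a CI failure message that always carries a rerun command."""
    if not replay_path:
        raise ValueError("refusing to report a failure without a replay path")
    status = (
        "confirmed deterministic replay"
        if replay_confirmed
        else "replay NOT confirmed (suspect harness)"
    )
    return "\n".join([
        f"FAILED invariant {invariant}",
        f"profile {profile}",
        status,
        f"minimal replay {replay_path}",
        f"rerun ./testlab replay {replay_path}",
        f"suspected owner {owner}",
    ])
```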

CI also needs aging rules. A failure discovered in nightly should become one of four things: a tracked issue, a regression replay, a harness fix, or a deliberate model-boundary update. Leaving it as repeated nightly noise teaches teams to ignore the suite.
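An aging rule can be checked mechanically. In this sketch the seven-day threshold, the NightlyFailure record, and the resolution labels are all assumptions chosen for illustration; a team would substitute its own policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

RESOLUTIONS = {
    "tracked_issue",
    "regression_replay",
    "harness_fix",
    "model_boundary_update",
}


@dataclass
class NightlyFailure:
    seed: int
    first_seen: date
    resolution: Optional[str] = None  # None until someone classifies it


def overdue(failures: list[NightlyFailure], today: date,
            max_age_days: int = 7) -> list[NightlyFailure]:
    """Return failures still unresolved after the allowed age."""
    limit = timedelta(days=max_age_days)
    return [
        f for f in failures
        if f.resolution not in RESOLUTIONS and today - f.first_seen > limit
    ]
```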

Common Failure Modes

One mistake is running deep randomized exploration directly in the required merge path. The result is slow CI, noisy failures, and pressure to delete the tests.

Another mistake is treating every random discovery as a blocker before confirming replay. If replay cannot reproduce it, the first bug is in the harness or artifact capture.

A third mistake is failing to promote found bugs into fixed regression replays. Deep exploration should feed the fast suite.

A fourth mistake is hiding budgets in job scripts. Engineers need to know which profile failed and what level of evidence it promises.

A fifth mistake is keeping known failures unclassified. Repeated rediscovery without ownership creates noise and weakens trust.

Practice

Design CI placement for one deterministic distributed test lab.

  1. Which fixed incident replays should run on every pull request?
  2. How many seeds and logical steps fit the pull-request budget?
  3. Which deeper search should move to nightly or background runs?
  4. What artifacts must every failure upload?
  5. How does CI distinguish discovery failure from confirmed replay failure?
  6. What triage categories will the team use?
  7. Which failures block merge immediately?
  8. Which discoveries become regression tests after a fix?

Then write one CI failure message that includes a rerun command. If the failure message does not tell a developer how to reproduce the bug, the CI integration is not finished.
