Distributed Testing, Simulation, and Deterministic Replay: CI Integration, Runtime Budgets, and Failure Triage
Core Insight
In CheckoutService, the deterministic lab can find rare duplicate-capture schedules, shrink them, and replay production incidents from observability packets. That is powerful, but it creates a new operational question: which of those tests should run on every pull request, which should run nightly, and what should happen when a randomized exploration finds a new counterexample at 3 a.m.?
Distributed test suites fail when they are either too weak for CI or too expensive for CI. If every commit runs only shallow unit tests, dangerous interleavings escape. If every commit runs hours of randomized simulation, engineers ignore the signal or bypass the suite. CI integration is the discipline of placing each test mode where its cost, signal, and reproducibility fit.
The trade-off is speed versus search depth. Pull-request checks need fast, deterministic, high-signal failures. Longer jobs can explore more schedules, workloads, and fault combinations, but they must produce replay artifacts that a developer can run locally. A CI failure without a replay command is an alert; a CI failure with a seed, minimized history, and invariant report is a debugging entry point.
Test Modes In The Pipeline
Distributed testing usually needs several CI modes, not one giant job.
Fast deterministic checks run on every pull request:
scope: fixed replays, core invariants, small topologies
budget: seconds to a few minutes
failure: blocks merge
artifact: exact replay command
These checks should be boring in the best sense. They replay known incidents, known minimized counterexamples, and a small set of high-value schedules.
Bounded randomized exploration runs on pull requests when it is cheap enough:
scope: small seed set, limited schedules, short logical time
budget: a few minutes
failure: blocks merge only if replay is deterministic
artifact: seed plus generated scenario
The key is determinism after discovery. If a randomized job cannot reproduce its own failure with the same seed and schedule, it should be triaged as a harness problem before it becomes a product blocker.
Deep exploration runs outside the critical merge path:
scope: larger topologies, longer histories, fault combinations
budget: nightly or continuous background
failure: opens issue or pages owning team by policy
artifact: replay packet, minimized history, invariant result
Production incident replays should run as regression tests once they are distilled:
scope: one incident shape
budget: short and deterministic
failure: blocks merge
artifact: link to incident replay record
The pipeline is stronger when it separates these modes explicitly. Otherwise a slow exploratory job and a deterministic regression test look like the same kind of failure, even though they need different responses.
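The merge policy for the four modes above can be written down as a small function. This is a minimal sketch in Python; the `Mode` enum and `blocks_merge` function are hypothetical names used only for illustration, not part of any real harness.

```python
from enum import Enum


class Mode(Enum):
    """CI test modes from this lesson (the enum itself is illustrative)."""
    FIXED_REPLAY = "fixed_replay"
    BOUNDED_EXPLORATION = "bounded_exploration"
    DEEP_EXPLORATION = "deep_exploration"
    INCIDENT_REGRESSION = "incident_regression"


def blocks_merge(mode: Mode, replay_confirmed: bool) -> bool:
    """Return True when a failure in this mode should block the merge."""
    if mode in (Mode.FIXED_REPLAY, Mode.INCIDENT_REGRESSION):
        # Deterministic by construction, so any failure blocks.
        return True
    if mode is Mode.BOUNDED_EXPLORATION:
        # Randomized: block only once the replay reproduces the failure.
        return replay_confirmed
    # Deep exploration runs outside the merge path; it opens an issue instead.
    return False
```

Keeping this decision in one place means a slow exploratory job and a deterministic regression test cannot be mistaken for the same kind of failure.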
Runtime Budgets
A runtime budget is not just a timeout. It is a promise about how much search the job can perform and how useful the result must be.
Useful budget dimensions include:
- wall-clock runtime
- logical simulation steps
- number of seeds
- number of generated operations
- number of nodes
- number of faults
- maximum shrink attempts
- maximum replay artifact size
- retained logs and traces
For example:
pull request:
  seeds: 20
  replicas: 3
  clients: 2
  steps: 200
  faults: crash, delay, drop
  max runtime: 4 minutes
  shrink: 30 seconds
nightly:
  seeds: 2000
  replicas: 3 to 7
  clients: 1 to 20
  steps: 2000
  faults: crash, delay, drop, partition, restart
  max runtime: 3 hours
  shrink: 10 minutes per new failure
The small job is not a weaker version of the big job. It is a different contract. The small job protects the merge path from known and common failures. The big job searches for new rare failures and must feed its discoveries back into the fast suite.
Budget decisions should be visible in code or configuration:
ci_profile: pull_request
exploration_profile: nightly
replay_profile: regression
Hidden budgets create confusion. A developer should know whether a failure came from a fixed replay, bounded exploration, or deep search.
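One way to keep budgets visible is to declare each profile as data. The sketch below is in Python and mirrors the example numbers from this section; the `Budget` class, its field names, and the two helper functions are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """Search budget for one CI profile (field names are illustrative)."""
    name: str
    seeds: int
    max_replicas: int
    max_clients: int
    steps: int
    faults: tuple
    max_runtime_s: int
    shrink_s: int


# Values mirror the example profiles in this section.
PULL_REQUEST = Budget("pull_request", 20, 3, 2, 200,
                      ("crash", "delay", "drop"), 4 * 60, 30)
NIGHTLY = Budget("nightly", 2000, 7, 20, 2000,
                 ("crash", "delay", "drop", "partition", "restart"),
                 3 * 60 * 60, 10 * 60)


def total_steps(budget: Budget) -> int:
    """Upper bound on logical steps the profile may execute across all seeds."""
    return budget.seeds * budget.steps


def per_seed_runtime_s(budget: Budget) -> float:
    """Average wall-clock seconds available per seed, ignoring shrink time."""
    return budget.max_runtime_s / budget.seeds
```

With these numbers the pull-request profile covers at most 4,000 logical steps and the nightly profile at most 4,000,000, which makes the difference in search depth explicit rather than implied by a timeout.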
Failure Artifacts
Every CI failure from a distributed harness should produce a small, durable artifact set.
At minimum:
- failing invariant name
- seed
- generated scenario
- schedule or replay log
- minimized counterexample if shrinking succeeded
- command to rerun locally
- relevant trace or event packet
- product version and harness version
- CI profile that found the failure
For CheckoutService, a useful failure artifact might look like:
failure:
  invariant: at_most_one_provider_capture_per_key
  seed: 91827
  profile: nightly
  replicas: A,B
  key: m1/k1
  external_effects: p778,p779
  minimized_replay: artifacts/replays/91827-min.json
rerun:
  ./testlab replay artifacts/replays/91827-min.json
The rerun command matters. Developers should not have to reconstruct the lab setup from CI logs.
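A harness can check the minimum artifact set before it uploads anything. The following Python sketch assumes the artifact is a plain dictionary; the key names in `REQUIRED_FIELDS` are hypothetical stand-ins for the checklist above, and the minimized counterexample is left out because it exists only when shrinking succeeded.

```python
# Hypothetical key names for the minimum artifact set listed above.
REQUIRED_FIELDS = (
    "invariant", "seed", "scenario", "replay_log", "rerun",
    "trace", "product_version", "harness_version", "profile",
)


def missing_fields(artifact: dict) -> list:
    """Return required fields that are absent or empty, in checklist order."""
    # Compare against None and "" explicitly so a legitimate seed of 0 is not flagged.
    return [f for f in REQUIRED_FIELDS if artifact.get(f) in (None, "")]


# A partial artifact built from the CheckoutService example values.
artifact = {
    "invariant": "at_most_one_provider_capture_per_key",
    "seed": 91827,
    "profile": "nightly",
    "rerun": "./testlab replay artifacts/replays/91827-min.json",
}
gaps = missing_fields(artifact)
```

Here `gaps` lists the scenario, replay log, trace, and both version fields, so the job could refuse to report the failure as complete until they are attached.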
Artifacts should also distinguish discovery from confirmation:
discovery run:
  randomized search found failure
confirmation run:
  replay of recorded schedule reproduced failure
shrink run:
  minimized replay still reproduced same invariant
If the discovery run fails but the confirmation run does not reproduce the failure, CI should report a harness nondeterminism problem, not a product regression.
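The three runs can be kept separate in code as well. The sketch below is a toy in Python, not the CheckoutService lab: the event alphabet, the `violates` check, and the greedy `shrink` loop are all assumptions chosen to keep the example small. Because the toy scenario depends only on the seed, its confirmation step always reproduces; in a real harness that is the step where uncontrolled inputs show up.

```python
import random
from typing import Callable, List

Events = List[str]


def generate(seed: int, steps: int) -> Events:
    """Generate a scenario from the seed alone, so the same seed gives the same events."""
    rng = random.Random(seed)
    alphabet = ["capture@A", "capture@B", "replicate", "crash@A", "delay"]
    return [rng.choice(alphabet) for _ in range(steps)]


def violates(events: Events) -> bool:
    """Toy invariant: a capture on A followed by a capture on B with no replicate between."""
    pending = False
    for event in events:
        if event == "capture@A":
            pending = True
        elif event == "replicate":
            pending = False
        elif event == "capture@B" and pending:
            return True
    return False


def shrink(events: Events, fails: Callable[[Events], bool]) -> Events:
    """Greedy shrinking: drop one event at a time while the failure still reproduces."""
    current = list(events)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if fails(candidate):
                current = candidate
                changed = True
                break
    return current


def run_pipeline(seed: int, steps: int) -> dict:
    """Report discovery, confirmation, and shrink as separate results."""
    events = generate(seed, steps)
    if not violates(events):
        return {"discovery": "passed"}
    # Confirmation: regenerate from the recorded seed and check again.
    if not violates(generate(seed, steps)):
        return {"discovery": "failed", "confirmation": "not reproduced",
                "classification": "harness nondeterminism"}
    return {"discovery": "failed", "confirmation": "reproduced",
            "shrink": shrink(events, violates), "classification": "product bug"}
```

For this toy invariant, every confirmed failure shrinks to the two events `capture@A` and `capture@B`, which is the kind of minimized history a developer can read at a glance.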
Triage Categories
Failure triage starts by naming what failed.
A product bug means the harness found a real invariant violation and replay confirms it.
action:
  block merge or open incident-level issue
  attach replay
  add minimized replay to regression suite after fix
A harness bug means the failure cannot be replayed because the test did not control or record an input.
action:
  fix clock, scheduler, dependency stub, cleanup, or replay record
  do not weaken the product invariant
A model bug means the simulator behavior is wrong or stale relative to the claim.
action:
  update model or claim boundary
  calibrate against integration, contract, or production evidence
An environment bug means CI infrastructure changed the result without representing the system under test.
action:
  isolate resources
  pin dependencies
  remove shared state
  record environment metadata
A known failure means CI rediscovered an open bug.
action:
  link to known issue
  avoid creating duplicate noise
  keep replay in regression tracking
Triage should be mechanical enough that two engineers reading the same artifact reach the same first classification.
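A first classification can be made mechanical by reading a few facts off the artifact and checking them in a fixed order. This Python sketch is one possible rule set: the `Evidence` fields are hypothetical, and the order of the checks is an assumption of the sketch rather than something the lesson prescribes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Facts read off a failure artifact (field names are hypothetical)."""
    replay_reproduced: bool
    known_issue: Optional[str] = None     # tracker id if the replay matches an open bug
    environment_changed: bool = False     # e.g. unpinned dependency or shared CI state
    model_matches_system: bool = True     # simulator agrees with integration evidence


def classify(evidence: Evidence) -> str:
    """Return the first triage category; the check order is an assumption of this sketch."""
    if evidence.environment_changed:
        return "environment bug"
    if not evidence.replay_reproduced:
        return "harness bug"
    if not evidence.model_matches_system:
        return "model bug"
    if evidence.known_issue is not None:
        return "known failure"
    return "product bug"
```

Because the function is deterministic, two engineers feeding it the same artifact get the same first label, and any disagreement moves to the evidence fields, where it belongs.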
Worked Example
A pull request changes retry backoff in CheckoutService. The PR job runs a small profile:
profile: pull_request
seeds: 20
replicas: 2
steps: 200
faults: delayed replication and crash
It fails:
invariant: at_most_one_provider_capture_per_key
seed: 1044
failure: duplicate captures p1,p2 for m1/k1
confirmation replay: passed
That is not yet a product blocker. The discovery run failed, but confirmation did not reproduce. The right triage is harness nondeterminism.
The artifact shows the missing input:
recorded:
  seed
  generated operations
  network schedule
missing:
  provider stub retry response order
The fix is to record dependency responses. After that, the same PR job fails again:
discovery: failed
confirmation replay: failed
shrink: reduced to 11 events
rerun: ./testlab replay artifacts/replays/1044-min.json
Now CI has a real product signal. The team can block the PR, attach the replay, and later add the minimized case to the regression profile.
The important point is that CI did not merely say "random test failed." It separated discovery, confirmation, shrinking, and classification.
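The missing input in this example can be reproduced with a toy. In the Python sketch below, `ProviderStub`, `run_checkout`, and the `timeout_after_capture` response are hypothetical names invented for illustration. The stub keeps advancing its response sequence across runs, so a replay that calls the stub again sees different responses unless the discovery run recorded them.

```python
from typing import Iterator, List, Optional


class ProviderStub:
    """Hypothetical provider stub whose response order is state the harness does not control."""

    def __init__(self, responses: List[str]):
        self._responses: Iterator[str] = iter(responses)

    def respond(self) -> str:
        # Once the scripted responses run out, the stub answers "ok".
        return next(self._responses, "ok")


def run_checkout(provider_responses: List[str]) -> int:
    """Toy retry loop: returns how many provider captures happened."""
    captures = 0
    for response in provider_responses:
        captures += 1                      # the provider captured on this attempt
        if response != "timeout_after_capture":
            break                          # client saw success and stops retrying
    return captures


def discover(stub: ProviderStub, attempts: int, record: Optional[list]) -> int:
    """Run once against the live stub, optionally recording its responses."""
    responses = [stub.respond() for _ in range(attempts)]
    if record is not None:
        record.extend(responses)
    return run_checkout(responses)


# Without recording: the replay calls the stub again and sees different responses.
stub = ProviderStub(["timeout_after_capture", "ok"])
first = discover(stub, attempts=2, record=None)
replay_unrecorded = discover(stub, attempts=2, record=None)

# With recording: the replay consumes the recorded responses instead of the stub.
stub = ProviderStub(["timeout_after_capture", "ok"])
log: list = []
second = discover(stub, attempts=2, record=log)
replay_recorded = run_checkout(log)
```

In the unrecorded case the discovery run sees two captures and the replay sees one, which matches the "confirmation replay: passed" result above. Once the responses are recorded, discovery and replay both see two captures, and the failure becomes a product signal.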
Keeping CI Actionable
CI should make the next action obvious.
Good failure output says:
FAILED invariant at_most_one_provider_capture_per_key
profile pull_request
confirmed deterministic replay
minimal replay artifacts/replays/1044-min.json
rerun ./testlab replay artifacts/replays/1044-min.json
suspected owner checkout-idempotency
Weak failure output says:
test failed after timeout
see logs
For distributed tests, "see logs" is rarely enough. The failure needs a replay path, a classification, and enough metadata to route ownership.
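A small formatter can enforce that shape. This is a Python sketch under stated assumptions: `render_failure` is a hypothetical helper, and the only command it emits is the `./testlab replay` form already used in this lesson.

```python
def render_failure(invariant: str, profile: str, confirmed: bool,
                   replay_path: str, owner: str) -> str:
    """Render an actionable CI failure message; refuse to render without a replay path."""
    if not replay_path:
        raise ValueError("a failure message needs a replay artifact to be actionable")
    status = ("confirmed deterministic replay" if confirmed
              else "unconfirmed: triage as harness nondeterminism")
    lines = [
        f"FAILED invariant {invariant}",
        f"profile {profile}",
        status,
        f"minimal replay {replay_path}",
        f"rerun ./testlab replay {replay_path}",
        f"suspected owner {owner}",
    ]
    return "\n".join(lines)
```

Raising an error when the replay path is empty is a deliberate choice in this sketch: it turns "see logs" from a habit into a build failure in the harness itself.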
CI also needs aging rules. A failure discovered in nightly should either become a tracked issue, a regression replay, a harness fix, or a deliberate model-boundary update. Leaving it as repeated nightly noise teaches teams to ignore the suite.
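An aging rule can be as simple as a periodic check for failures that have no resolution yet. The Python sketch below assumes each failure is a dictionary with hypothetical `id`, `first_seen`, and `resolution` keys, and the seven-day default is an arbitrary placeholder rather than a recommended threshold.

```python
from datetime import date, timedelta

# The four acceptable outcomes for a nightly discovery, as named in this section.
RESOLUTIONS = {"tracked_issue", "regression_replay", "harness_fix", "model_boundary_update"}


def stale_failures(failures: list, today: date, max_age_days: int = 7) -> list:
    """Return ids of unresolved failures that are at least max_age_days old."""
    cutoff = today - timedelta(days=max_age_days)
    return [f["id"] for f in failures
            if f.get("resolution") not in RESOLUTIONS and f["first_seen"] <= cutoff]
```

A nightly job could post the returned ids to the owning team, so repeated rediscovery becomes a visible backlog item instead of background noise.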
Common Failure Modes
One mistake is running deep randomized exploration directly in the required merge path. The result is slow CI, noisy failures, and pressure to delete the tests.
Another mistake is treating every random discovery as a blocker before confirming replay. If replay cannot reproduce it, the first bug is in the harness or artifact capture.
A third mistake is failing to promote found bugs into fixed regression replays. Deep exploration should feed the fast suite.
A fourth mistake is hiding budgets in job scripts. Engineers need to know which profile failed and what level of evidence it promises.
A fifth mistake is keeping known failures unclassified. Repeated rediscovery without ownership creates noise and weakens trust.
Practice
Design CI placement for one deterministic distributed test lab.
- Which fixed incident replays should run on every pull request?
- How many seeds and logical steps fit the pull-request budget?
- Which deeper search should move to nightly or background runs?
- What artifacts must every failure upload?
- How does CI distinguish discovery failure from confirmed replay failure?
- What triage categories will the team use?
- Which failures block merge immediately?
- Which discoveries become regression tests after a fix?
Then write one CI failure message that includes a rerun command. If the failure message does not tell a developer how to reproduce the bug, the CI integration is not finished.
Connections
- Builds on Observability for Reproducible Distributed Bugs, because CI failures need replay packets, traces, and invariant evidence to stay actionable.
- Prepares for Debugging Loops, Runbooks, and Regression Suites, where confirmed failures become repeatable repair workflows.
- Connects to release engineering because test value depends on placing the right signal in the right pipeline stage.
Resources
- [BOOK] Site Reliability Engineering: Monitoring Distributed Systems
- [DOC] Jepsen Analyses
- [DOC] OpenTelemetry Traces
- [BOOK] Designing Data-Intensive Applications
Key Takeaways
- CI integration for distributed tests needs separate modes for fixed replays, bounded exploration, deep search, and incident regressions.
- Runtime budgets should name seeds, steps, topology, faults, shrink time, and artifact expectations, not only wall-clock timeout.
- A randomized CI failure should be confirmed by deterministic replay before it is treated as a product blocker.
- Deep exploration earns its cost when every new failure feeds a reproducible artifact, a triage decision, and eventually a regression test.