Distributed Testing, Simulation, and Deterministic Replay: Debugging Loops, Runbooks, and Regression Suites
Core Insight
In CheckoutService, CI has confirmed a deterministic replay: a retry reaches replica B before B receives the idempotency record from replica A, and two provider captures then appear for the same scoped key. The team has a seed, a minimized replay, traces, and a rerun command. That is useful evidence, but it is not the end of the work. The bug is not closed until the team can explain it, fix the right boundary, and keep the minimized case as a regression.
A distributed debugging loop turns a failing history into a durable engineering change. The loop is: confirm the replay, identify the causal mechanism, narrow the state and schedule, change the system or model, prove the same replay no longer fails, and add the replay to the right regression suite. Each step protects against a common failure: fixing the symptom, weakening the invariant, losing the reproduction, or letting the same bug return.
The trade-off is debugging speed versus repair confidence. It is tempting to patch the first local symptom that makes CI green. A stronger loop takes a little longer, but it preserves the causal explanation and creates a reusable test asset. In distributed systems, that extra discipline is often the difference between a one-off fix and a class of bugs that stays fixed.
The Debugging Loop
A useful debugging loop is explicit enough to run under pressure.
1. reproduce the failure from the artifact
2. confirm the same invariant fails
3. inspect the minimized history
4. name the causal mechanism
5. identify the broken boundary
6. apply the smallest valid fix
7. replay the original and minimized histories
8. broaden with adjacent schedules
9. add a regression case
10. update the runbook or model boundary
The loop starts with reproduction. If the failure cannot be replayed, the first task is not product debugging. The first task is fixing the harness, artifact capture, or environment.
The loop then checks identity. The same replay should fail the same invariant for the same causal reason. If the failure changes from duplicate capture to timeout, or from unsafe retry to stale provider model, the team may be chasing a different bug.
Only after that should the team modify product code. A distributed replay is strong because it can test the same schedule before and after the fix.
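The identity check can be made mechanical. The following is a minimal sketch, assuming a hypothetical failure summary with an invariant name and a short causal tag; `FailureIdentity`, `identity_drift`, and the tag values are illustrative and are not part of `testlab`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureIdentity:
    """Hypothetical summary of one failing replay run."""
    invariant: str   # name of the invariant that failed
    mechanism: str   # short causal tag assigned during triage


def identity_drift(expected: FailureIdentity, observed: FailureIdentity) -> list:
    """Return a description of every field that differs; empty means same bug."""
    drift = []
    if expected.invariant != observed.invariant:
        drift.append(f"invariant: {expected.invariant} -> {observed.invariant}")
    if expected.mechanism != observed.mechanism:
        drift.append(f"mechanism: {expected.mechanism} -> {observed.mechanism}")
    return drift


baseline = FailureIdentity("at_most_one_provider_capture_per_key", "retry_before_replication")
same = FailureIdentity("at_most_one_provider_capture_per_key", "retry_before_replication")
drifted = FailureIdentity("at_most_one_provider_capture_per_key", "stale_provider_model")
```

An empty result from `identity_drift(baseline, same)` means the rerun is still the same bug. A non-empty result, as with `drifted`, is a signal to stop and re-triage before touching product code.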
From Counterexample To Explanation
A minimized replay is not automatically an explanation. It is evidence that still needs interpretation.
For the duplicate capture case, a minimized replay might be:
1 C sends confirm(order-1, k1) to A
2 A durably records in-flight(k1,h1)
3 A sends replication m44 -> B
4 network holds m44
5 A sends provider capture q1
6 provider records p778
7 A crashes before outcome fsync
8 retry timer fires
9 C retries confirm(order-1, k1) to B
10 B has not applied m44
11 B sends provider capture q2
12 provider records p779
13 invariant fails
The explanation is not simply "replication was slow." Slow replication is allowed. The causal mechanism is narrower:
retry reached a replica without durable in-flight or outcome evidence
before the external effect boundary was protected by a shared idempotency decision
That explanation points to the repair boundary. Possible fixes include:
- route same-key retries to the owner of the durable idempotency record
- make the idempotency record strongly consistent before external capture
- use a provider idempotency key that makes duplicate captures impossible
- recover unknown provider outcomes before issuing a new capture
- make retry responses return a safe in-progress state until outcome evidence exists
Different fixes have different costs. The replay helps evaluate them because the same failure schedule can be rerun against each design.
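As an illustration of rerunning one schedule against more than one design, the sketch below reduces the minimized history to four events and replays it twice: once with per-replica idempotency evidence, and once with a shared decision made before the external capture. The event encoding, `run_schedule`, and the `shared_decision` flag are assumptions made for this example, not the real harness.

```python
from collections import Counter

# Reduced form of the minimized history. Event names are illustrative.
SCHEDULE = [
    ("confirm", "A", "k1"),   # steps 1-6: A records in-flight, m44 is held, provider captures
    ("crash", "A"),           # step 7: A crashes before the outcome fsync
    ("confirm", "B", "k1"),   # steps 8-12: the retry reaches B before m44
    ("deliver", "B", "k1"),   # m44 finally arrives, too late to help
]


def run_schedule(schedule, shared_decision):
    """Replay the schedule and return (key, capture_id) pairs seen by the provider."""
    local = {"A": set(), "B": set()}  # keys each replica has evidence for
    shared = set()                    # authoritative store, used only if shared_decision
    captures = []
    for event in schedule:
        kind, replica = event[0], event[1]
        if kind == "confirm":
            key = event[2]
            evidence = shared if shared_decision else local[replica]
            if key in evidence:
                continue              # safe in-progress response, no new capture
            evidence.add(key)
            captures.append((key, f"p{778 + len(captures)}"))
        elif kind == "deliver":
            local[replica].add(event[2])
        # "crash" changes nothing here: durable records are assumed to survive it
    return captures


def at_most_one_provider_capture_per_key(captures):
    counts = Counter(key for key, _ in captures)
    return all(n <= 1 for n in counts.values())


naive = run_schedule(SCHEDULE, shared_decision=False)
fixed = run_schedule(SCHEDULE, shared_decision=True)
```

The schedule is identical in both runs; only the design changes. That is what makes the comparison meaningful: `naive` violates the invariant and `fixed` does not, without any change to timing.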
Runbooks For Reproducible Bugs
A runbook is not only an incident document. For deterministic replay, it is a repeatable debugging procedure.
A good runbook entry includes:
symptom:
  duplicate provider captures for scoped idempotency key
invariant:
  at_most_one_provider_capture_per_key
artifact:
  artifacts/replays/1044-min.json
rerun:
  ./testlab replay artifacts/replays/1044-min.json
expected failure before fix:
  p778 and p779 observed for m1/k1
debug views:
  idempotency table events
  provider request log
  retry timer events
  replication message m44
known repair boundary:
  retry may reach a replica without durable outcome evidence
This saves time because the next engineer does not need to rediscover which logs, traces, or state dumps matter.
The runbook should also say what not to do:
do not fix by increasing retry timeout only
do not weaken invariant to "eventually reconciled"
do not delete provider-effect oracle
do not mark replay flaky unless confirmation fails
Those warnings preserve the test's purpose.
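A runbook entry is easier to keep honest when it is structured data that can be linted. The sketch below assumes the entry is stored as a dictionary; the field names and the `runbook_problems` helper are hypothetical and only mirror the entry shown above.

```python
REQUIRED_FIELDS = (
    "symptom", "invariant", "artifact", "rerun",
    "debug_views", "repair_boundary", "do_not",
)


def runbook_problems(entry):
    """Return a list of problems; an empty list means the entry is usable."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not entry.get(name)]
    rerun, artifact = entry.get("rerun", ""), entry.get("artifact", "")
    if rerun and artifact and artifact not in rerun:
        problems.append("rerun command does not reference the artifact")
    return problems


entry = {
    "symptom": "duplicate provider captures for scoped idempotency key",
    "invariant": "at_most_one_provider_capture_per_key",
    "artifact": "artifacts/replays/1044-min.json",
    "rerun": "./testlab replay artifacts/replays/1044-min.json",
    "debug_views": ["idempotency table events", "provider request log",
                    "retry timer events", "replication message m44"],
    "repair_boundary": "retry may reach a replica without durable outcome evidence",
    "do_not": ["fix by increasing retry timeout only",
               "weaken invariant to 'eventually reconciled'"],
}
```

Running `runbook_problems(entry)` in CI is one way to catch a runbook whose rerun command has drifted away from its artifact.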
Regression Suites
Not every failing exploration should become a permanent required test in its original form. Regression suites need structure.
A fixed replay suite contains minimized histories that should always pass:
regression/fixed-replays/
  duplicate-capture-retry-before-replication.json
  config-change-split-brain-quorum.json
  stale-read-after-lease-expiry.json
These belong in fast CI because they are deterministic and each one explains a known bug.
A property regression suite contains focused generated scenarios around a repaired class of bugs:
same-key retry routes to stale replica
provider unknown response before local outcome fsync
same key with different request hash
idempotency record expires near retry boundary
These explore nearby cases so the fix does not overfit the exact minimized history.
An incident replay suite contains sanitized production shapes. It preserves customer-visible failures without importing production data.
incident-replays/
  2026-04-duplicate-capture-shape.json
  2026-05-membership-change-split-brain-shape.json
A quarantine or investigation suite can hold confirmed-but-unfixed failures. It should have owners and expiration rules. Otherwise it becomes a graveyard that everyone ignores.
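Owner and expiration rules for a quarantine suite can be enforced with a small check. This is a sketch under assumptions: the `QuarantinedReplay` record, the `quarantine/` paths, the team name, and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class QuarantinedReplay:
    path: str
    owner: str      # empty string means nobody owns the investigation
    expires: date   # date by which the failure must be fixed or re-reviewed


def needs_attention(entries, today):
    """Return paths of quarantined replays that are unowned or past expiry."""
    return [e.path for e in entries if not e.owner or e.expires <= today]


entries = [
    QuarantinedReplay("quarantine/replay-a.json", "payments-team", date(2026, 6, 30)),
    QuarantinedReplay("quarantine/replay-b.json", "", date(2026, 6, 30)),
    QuarantinedReplay("quarantine/replay-c.json", "payments-team", date(2026, 5, 1)),
]
flagged = needs_attention(entries, today=date(2026, 5, 15))
```

Failing a scheduled job when `flagged` is non-empty keeps the suite from turning into an ignored backlog.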
Worked Example
The team receives this CI artifact:
profile: nightly
invariant: at_most_one_provider_capture_per_key
seed: 1044
minimal replay: artifacts/replays/1044-min.json
rerun: ./testlab replay artifacts/replays/1044-min.json
The debugging loop starts:
./testlab replay artifacts/replays/1044-min.json
The replay fails the same invariant. The team inspects the events and writes the causal statement:
B issued a provider capture because it had neither replicated in-flight evidence
nor durable outcome evidence for k1 when retry a2 arrived.
A tempting patch is:
increase client retry timeout from 50 ms to 500 ms
The replay passes after that change, but the fix is weak: it only moves the timing window. The nightly search can still find schedules where replication takes longer than the new 500 ms retry timeout.
A stronger fix is:
before provider capture:
  acquire scoped idempotency decision from authoritative store
  if key is in-flight elsewhere, return in_progress
  if outcome exists, return stored outcome
  if request hash conflicts, return conflict
  only one decision holder may call provider
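The decision step above could look roughly like the following. This is a single-process sketch, not the team's implementation: `DecisionStore` is an in-memory stand-in for the authoritative store, the return values are illustrative strings, and a real store would need an atomic compare-and-set so that two replicas cannot both become the decision holder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    request_hash: str
    holder: str                    # replica that won the decision
    outcome: Optional[str] = None  # None while the capture is still in flight


class DecisionStore:
    """In-memory stand-in for an authoritative idempotency store (assumption)."""

    def __init__(self):
        self._records = {}

    def decide(self, key, request_hash, replica):
        record = self._records.get(key)
        if record is None:
            # First caller becomes the only holder allowed to call the provider.
            self._records[key] = Record(request_hash, replica)
            return "call_provider"
        if record.request_hash != request_hash:
            return "conflict"
        if record.outcome is not None:
            return record.outcome
        return "in_progress"

    def record_outcome(self, key, outcome):
        self._records[key].outcome = outcome


store = DecisionStore()
first = store.decide("k1", "h1", "A")        # A wins the decision
retry = store.decide("k1", "h1", "B")        # retry at B sees in-flight evidence
mismatch = store.decide("k1", "h2", "B")     # same key, different request hash
store.record_outcome("k1", "captured:p778")
after = store.decide("k1", "h1", "B")        # retry now gets the stored outcome
```

The hash check runs before the outcome check in this sketch so that a conflicting request never receives another request's stored outcome.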
Now the team reruns:
./testlab replay artifacts/replays/1044-min.json
./testlab explore --profile idempotency-nearby --seeds 1000
./testlab replay incident-replays/2026-04-duplicate-capture-shape.json
The fixed replay passes. Nearby exploration passes. The incident shape passes. The minimized replay becomes:
regression/fixed-replays/duplicate-capture-retry-before-replication.json
The runbook is updated with the causal mechanism and the repair boundary. The bug is now closed in a way that future engineers can verify.
Repair Confidence Checks
After a fix, do more than rerun the exact minimized case.
Check the original unshrunk failure. Shrinking can remove context that still matters for a realistic workload.
Check neighboring schedules:
- retry just before replication delivery
- retry just after replication delivery
- provider returns unknown outcome
- crash before in-flight fsync
- crash after provider effect but before outcome fsync
- same key with different request hash
- idempotency record near retention expiry
Check the oracle. A fix that passes by muting provider-effect recording is not a fix.
Check the model boundary. If the fix depends on provider idempotency semantics, confirm that the provider model still matches contract evidence.
Check regression placement. A slow exploratory reproduction should be reduced into a fast fixed replay when possible.
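Neighboring schedules can be generated instead of written by hand. The sketch below enumerates every position at which the retry could land relative to the other events, which covers both "retry just before replication delivery" and "retry just after". The event labels and the `retry_placements` helper are illustrative assumptions.

```python
def retry_placements(base, retry):
    """Yield copies of `base` with `retry` inserted at each position after the first event."""
    for i in range(1, len(base) + 1):
        yield base[:i] + [retry] + base[i:]


BASE = [
    "A records in-flight",
    "A sends m44",
    "A calls provider",
    "B applies m44",
]
RETRY = "retry reaches B"

schedules = list(retry_placements(BASE, RETRY))

# Split the variants by whether the retry lands before replication is applied.
before_delivery = [s for s in schedules if s.index(RETRY) < s.index("B applies m44")]
after_delivery = [s for s in schedules if s.index(RETRY) > s.index("B applies m44")]
```

Each generated schedule would then be fed to the replay harness with the same invariant. The fix should hold for every placement, not only the one in the minimized history.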
Common Failure Modes
One mistake is fixing the exact schedule by adding a sleep or timeout. That may pass the replay while leaving the unsafe boundary intact.
Another mistake is losing the replay artifact after the fix. If the artifact is not added to a regression suite, the bug can return silently.
A third mistake is weakening the invariant to match the implementation. The invariant should change only when the product contract changes deliberately.
A fourth mistake is treating every minimized replay as a complete explanation. The replay shows what happened; engineers still need to name why it was illegal.
A fifth mistake is keeping runbooks separate from executable artifacts. A runbook without a rerun command is only documentation; a runbook with a replay is an operational tool.
Practice
Take one confirmed distributed failure and write the debugging loop.
- What command reproduces the failure?
- Which invariant fails?
- What is the minimized history?
- What is the causal mechanism in one sentence?
- Which boundary is broken: clock, network, durability, membership, dependency, or client semantics?
- Which fix preserves the product contract?
- Which nearby schedules should be tested after the fix?
- Which regression suite should keep the replay?
- What runbook entry would help the next engineer?
Then check whether the regression is fast enough for pull-request CI. If it is not, shrink it further or place it in the correct profile with a clear owner.
Connections
- Builds on CI Integration, Runtime Budgets, and Failure Triage, because confirmed CI failures need a repeatable repair path.
- Prepares for Design Review for Testing Strategy Selection, where teams choose which testing strategy fits a claim before incidents occur.
- Connects to incident response because a good postmortem should leave behind executable evidence, not only narrative.
Resources
- [BOOK] Site Reliability Engineering: Postmortem Culture
- [DOC] Jepsen Analyses
- [BOOK] Designing Data-Intensive Applications
- [DOC] OpenTelemetry Traces
Key Takeaways
- A distributed bug is not closed until its replay is understood, fixed, rerun, and placed in the right regression suite.
- Runbooks should connect symptoms, invariants, replay commands, debug views, and repair boundaries.
- Regression suites should separate fixed replays, property regressions, incident shapes, and quarantined confirmed-but-unfixed failures.
- Repair confidence comes from rerunning the minimized case, the original case, nearby schedules, and the oracle that made the bug visible.