Distributed Testing, Simulation, and Deterministic Replay: Replaying Production Incidents Without Recreating Production
Core Insight
In CheckoutService, production saw duplicate payment captures during a regional slowdown. The first instinct is to "replay production," but that phrase is dangerous. Production has private data, live dependencies, real payment credentials, enormous traffic volume, partial observability, and environmental details that should not be copied into a test lab.
The useful goal is not to recreate production. The useful goal is to preserve the incident shape: the causal ingredients that made the bug possible. That might be one idempotency key, two replicas, a delayed replication message, one retry timer, and a dependency response that arrived after the retry.
The trade-off is fidelity versus safety. Too little fidelity produces a clean toy that cannot fail the same way. Too much fidelity imports production risk, cost, noise, and privacy exposure. A good incident replay strips production down to the smallest controlled model that still contains the causality of the failure.
Incident Shape, Not Incident Copy
An incident copy tries to reproduce everything:
same traffic volume
same customers
same regions
same databases
same payment provider
same background jobs
same dashboards
same operational timeline
That is rarely safe or useful. It brings in data governance problems, cost, nondeterminism, and dependencies the test cannot control.
An incident shape keeps the mechanism:
same ordering pattern
same timeout relationship
same retry behavior
same partial replication
same acknowledgement boundary
same invariant violation
For duplicate payment capture, the shape might be:
- client sends confirm with idempotency key k1
- replica A starts capture and tries to replicate "in flight"
- replication message is delayed behind congestion
- client retry timer fires before replica B sees k1
- replica B starts a second capture
- payment adapter accepts both captures
- invariant fails: one idempotency key produced two external effects
That shape can be replayed with fake customers, fake payment responses, two replicas instead of a whole fleet, and a deterministic network model instead of a real regional outage.
The Extraction Workflow
Turning a production incident into a deterministic replay usually follows a sequence.
First, name the violated claim. Examples:
one idempotency key creates at most one external capture
committed ledger entries are not lost during leader failover
clients never observe balance totals below the conserved amount
all live replicas converge after a healed partition
Second, identify the minimum causal ingredients. Look for the operation, key, node roles, delayed messages, timers, dependency responses, and state boundaries that matter.
Third, replace production data with synthetic equivalents. The replay needs an idempotency key, not a real customer id. It needs a payment approval response, not a live payment provider. It needs a log prefix, not an entire database.
Fourth, model the unsafe dependencies. External services become controlled stubs. Regional networks become simulated links. Production clocks become simulated clocks. Operator actions become recorded fault events.
Fifth, validate the replay against the original evidence. The replay should fail the same invariant and produce a causal path that matches the incident timeline at the level that matters.
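The fifth step can be made mechanical. The sketch below assumes that both the incident timeline and the replay trace have been reduced to short event labels; the label strings and the `follows_evidence` helper are hypothetical, not part of any existing tool. It checks that the evidence appears in the replay in the same relative order, while allowing the replay to contain extra internal steps.

```python
def follows_evidence(replay_trace, evidence):
    """True if every evidence label occurs in replay_trace in the same relative order."""
    remaining = iter(replay_trace)
    # `label in remaining` consumes the iterator up to the first match, so each
    # evidence label has to be found after the previous one.
    return all(label in remaining for label in evidence)


# Hypothetical labels distilled from an incident timeline.
evidence = ["confirm(k1)", "capture(k1)", "retry(k1)", "capture(k1)"]

# Hypothetical labels emitted by one replay run.
replay_trace = [
    "confirm(k1)", "record_in_flight(k1)", "replicate_delayed(k1)",
    "capture(k1)", "retry(k1)", "capture(k1)", "invariant_failed(k1)",
]

print(follows_evidence(replay_trace, evidence))
```

An order-preserving match is deliberately weaker than an exact match. It accepts a replay that adds bookkeeping events but rejects one that reaches the same invariant failure through a different ordering.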
Worked Example
The production report says:
13:01: client confirms order-77 with idempotency key k-prod-9
13:02: payment provider receives capture p1
13:02: replication lag grows between us-east and us-west
13:03: client retry routes to us-west
13:03: payment provider receives capture p2
13:05: reconciliation detects duplicate external captures
The raw facts are too production-specific. The replay distills them:
entities:
- client C
- order replica A
- order replica B
- payment adapter stub P
state:
- idempotency table empty on A and B
- replication link A -> B delayed
events:
1. C sends confirm(order-1, key k1) to A
2. A records in-flight k1 locally
3. A sends replication message m1 to B
4. network delays m1
5. A sends capture(k1) to P
6. P is scheduled to return approved at simulated step 80
7. C retry timer fires at simulated step 50, before P's response arrives
8. C sends confirm(order-1, key k1) to B
9. B has not seen m1
10. B sends capture(k1) to P
11. invariant fails: P saw two captures for k1
This replay does not need the real order id, the real provider, the real region, the real latency histogram, or the real customer. It needs the causal relationship: retry outran replication and the external effect was not guarded by a shared authoritative idempotency decision.
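The distilled event list can be run as a small discrete-event simulation. The following is a minimal sketch under stated assumptions, not CheckoutService's real logic: `run_replay`, `duplicate_keys`, and the parameter names are hypothetical, replication is modeled as one delayed message, and the payment stub accepts every capture it receives.

```python
import heapq
from itertools import count


def run_replay(replication_delay, retry_at=50, provider_latency=80):
    """Run the two-replica replay and return the captures seen by stub P."""
    seen = {"A": set(), "B": set()}  # per-replica idempotency tables, initially empty
    captures = []                    # (step, replica, key) recorded by payment stub P
    client_done = False
    events, order = [], count()

    def schedule(step, kind, *args):
        # The counter breaks ties, so equal-step events run in scheduling order.
        heapq.heappush(events, (step, next(order), kind, args))

    schedule(0, "confirm", "A", "k1")        # C sends confirm to A
    schedule(retry_at, "retry", "B", "k1")   # C's retry timer, routed to B

    while events:
        step, _, kind, args = heapq.heappop(events)
        if kind == "retry":
            if not client_done:              # retry only if no approval arrived yet
                schedule(step, "confirm", *args)
        elif kind == "confirm":
            replica, key = args
            if key in seen[replica]:
                continue                     # local table already knows the key
            seen[replica].add(key)           # record in-flight locally
            peer = "B" if replica == "A" else "A"
            schedule(step + replication_delay, "replicate", peer, key)
            captures.append((step, replica, key))          # capture reaches P
            schedule(step + provider_latency, "approved")  # P's delayed response
        elif kind == "replicate":
            replica, key = args
            seen[replica].add(key)
        elif kind == "approved":
            client_done = True
    return captures


def duplicate_keys(captures):
    """Keys that violate 'one idempotency key creates at most one external capture'."""
    totals = {}
    for _, _, key in captures:
        totals[key] = totals.get(key, 0) + 1
    return sorted(key for key, n in totals.items() if n > 1)


print(duplicate_keys(run_replay(replication_delay=200)))  # retry outruns replication
print(duplicate_keys(run_replay(replication_delay=10)))   # replication arrives first
```

Running the same model with a short replication delay acts as a control. If the invariant fails only when the retry outruns replication, the replay has isolated the relationship rather than a coincidence of the harness.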
The replay should also record what was intentionally excluded:
excluded:
- real payment credentials
- full customer profile
- production database dump
- unrelated background workers
- unrelated traffic from other tenants
- exact host-level thread schedule
Documenting exclusions is not bureaucracy. It makes the model reviewable. If a later engineer believes background workers mattered, they can add that hypothesis deliberately instead of assuming the replay captured everything.
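One way to keep exclusions reviewable is to store them next to the modeled entities and ask which proposed components have no recorded decision. This is only a sketch: `ReplayScope` and `unreviewed` are hypothetical names, and the component strings are taken from the lists above plus one invented example, "reconciliation job".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayScope:
    modeled: frozenset
    excluded: frozenset

    def unreviewed(self, components):
        """Components that were neither modeled nor deliberately excluded."""
        return sorted(set(components) - self.modeled - self.excluded)


scope = ReplayScope(
    modeled=frozenset({"client C", "order replica A", "order replica B",
                       "payment adapter stub P"}),
    excluded=frozenset({
        "real payment credentials", "full customer profile",
        "production database dump", "unrelated background workers",
        "unrelated traffic from other tenants", "exact host-level thread schedule",
    }),
)

# A reviewer raises two components; only the first has a decision on record.
print(scope.unreviewed(["unrelated background workers", "reconciliation job"]))
```

Anything the check returns is a hypothesis nobody has ruled in or out yet, which is the case the written exclusion list is meant to surface.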
Fidelity Choices
Every incident replay makes fidelity choices. The important question is whether each choice preserves the failure mechanism.
Use exact event ordering when order caused the bug. If a retry happened before replication, that ordering belongs in the replay.
Use synthetic payloads when payload content did not matter. A fake order can stand in for a real order if the bug depends on idempotency, not on product catalog details.
Use controlled dependency responses when the dependency behavior matters. If the payment provider accepted duplicate requests, the stub should do that.
Use approximate latency only when exact latency is not the point. If the critical fact is "retry fires before replication arrives," the replay can express that as a relative ordering instead of copying a production latency distribution.
Use a small topology when topology did not create the bug. Two replicas may be enough to reproduce a stale-retry bug. A full fleet may hide the mechanism under noise.
Safety Boundaries
Production incident replay needs explicit safety boundaries.
Do not replay real customer data unless the organization has a clear approved path and the test genuinely needs it. Most distributed replay tests need shape, not identity.
Do not call live external systems. Payment providers, email systems, identity services, DNS control planes, object stores, and queues should be stubbed or simulated unless the test is specifically a controlled integration test.
Do not assume production configuration is safe in a lab. Secrets, endpoints, region names, feature flags, and retention settings can cause accidental side effects.
Do not let the replay mutate shared state. The harness should run in an isolated environment with disposable state, fake credentials, and explicit egress controls.
Do not preserve more data than the test needs. Trace minimization is part of both privacy and debuggability.
Common Failure Modes
One mistake is treating production replay as a database restore. Restoring data may reproduce volume, but not the timing, scheduling, and failure relationship that caused the incident.
Another mistake is over-sanitizing until the bug disappears. If the real incident involved a timeout before replication, the replay must keep that relative timing even if all identities are synthetic.
A third mistake is replaying only the happy path after the incident. The replay should preserve the unsafe overlap, the failed invariant, and the recovery path if recovery behavior is part of the claim.
A fourth mistake is letting the replay depend on live services. A live dependency can make the test flaky, unsafe, expensive, or impossible to debug.
A fifth mistake is failing to compare the replay against incident evidence. A replay that fails a different invariant or uses a different causal path may still be useful, but it is not the same incident.
Practice
Take one incident report and extract a replay plan:
- What system claim was violated?
- Which operation, key, message, timer, dependency response, and failure action mattered?
- Which production details can become synthetic values?
- Which dependencies must be stubbed or simulated?
- Which relative ordering must be preserved?
- Which invariant should fail in the replay?
- Which exclusions should be documented?
Then ask whether the replay would still fail if you removed production data, reduced the topology, and replaced exact timestamps with deterministic ordering. If yes, you probably captured the incident shape.
Connections
- Builds on Deterministic Replay for Inputs, Time, and Scheduling, because incident replay needs the same controls over inputs, simulated time, and selected actions.
- Prepares for Shrinking, Delta Debugging, and Minimal Counterexamples, where a replayed incident is reduced to the smallest history that still fails.
- Connects to reliability practice because a useful regression test preserves the incident's causal mechanism without importing production risk.
Resources
- [BOOK] Designing Data-Intensive Applications
- [DOC] Jepsen Analyses
- [BOOK] Site Reliability Engineering: Postmortem Culture
- [PAPER] FoundationDB: A Distributed Unbundled Transactional Key Value Store
Key Takeaways
- Incident replay should preserve the causal shape of a production failure, not clone production itself.
- Useful replays replace identities, payloads, and live dependencies with synthetic or controlled equivalents while keeping the ordering that caused the bug.
- Fidelity choices should be justified by the claim being tested and the invariant that failed.
- Safety boundaries are part of the engineering: no live side effects, no unnecessary private data, and no uncontrolled production dependencies.