Distributed Testing, Simulation, and Deterministic Replay: Observability for Reproducible Distributed Bugs
Core Insight
In CheckoutService, a customer reports one order and two card captures. The simulator already knows how to replay retry timing, replication lag, idempotency records, and external payment effects. The problem is that production evidence is scattered: one service has the client request id, another has the idempotency key, the payment adapter has the provider capture id, and the trace ends before the retry reaches the second replica.
Observability for reproducible distributed bugs is not just dashboards, metrics, or logs. It is evidence design. The system must record enough causal information to reconstruct the failing history: who acted, what identity was used, which message or timer fired, which state was durable, which external effect happened, and which invariant failed.
The trade-off is evidence completeness versus cost, privacy, and noise. Recording every payload, message, and internal transition is expensive and unsafe. Recording only aggregate metrics is cheap but cannot drive replay. A good observability design captures the small set of identifiers, event facts, and boundary decisions needed to turn an incident into a deterministic test.
Observability For Replay
Operational observability often asks:
is the service healthy?
where is latency high?
which dependency is failing?
how many errors happened?
Replay-oriented observability asks a different set of questions:
which logical operation was this?
which retry attempt was this?
which replica handled each attempt?
which timer fired first?
which message was delayed or lost?
which state was durable at crash time?
which external effect was observed?
which invariant did this history violate?
Metrics are useful for detecting a problem, but they are usually too aggregated to reproduce it. Logs are useful when they carry stable identities and event facts, but free-form text without correlation is hard to replay. Traces are useful when they show causality across services, but only if they include the domain identities that make the bug meaningful.
The goal is a replay packet: a compact incident artifact that can seed a local or simulated reproduction.
replay packet:
operation identity
client attempts
selected trace spans
message and timer events
fault or crash evidence
dependency requests and responses
relevant durable state facts
observed external effects
failed invariant
The packet does not need all production data. It needs enough causal shape to build the replay safely.
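As a rough illustration, the packet sections listed above could be held in a small typed structure. This is only a sketch: the class names, field names, and the missing_sections helper are hypothetical and not part of any existing tool.

```python
from dataclasses import dataclass, field, fields, asdict
from typing import List, Optional
import json


@dataclass(frozen=True)
class OperationIdentity:
    # Hypothetical identity fields, taken from the duplicate capture example.
    tenant: str
    idempotency_key: str
    request_hash: str


@dataclass
class ReplayPacket:
    # One field per packet section; all names are illustrative.
    operation: OperationIdentity
    client_attempts: List[str] = field(default_factory=list)
    trace_spans: List[str] = field(default_factory=list)
    message_timer_events: List[str] = field(default_factory=list)
    fault_evidence: List[str] = field(default_factory=list)
    dependency_calls: List[str] = field(default_factory=list)
    durable_facts: List[str] = field(default_factory=list)
    external_effects: List[str] = field(default_factory=list)
    failed_invariant: Optional[str] = None

    def missing_sections(self) -> List[str]:
        """Names of sections that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def to_json(self) -> str:
        """Serialize with sorted keys so packets diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)


packet = ReplayPacket(
    operation=OperationIdentity("m1", "k1", "h1"),
    client_attempts=["a1", "a2"],
    message_timer_events=["m44 held", "t19 fired"],
    external_effects=["p778", "p779"],
    failed_invariant="at_most_one_capture_per_key",
)
print(packet.missing_sections())
```

A completeness check like missing_sections gives the responder a quick list of which evidence still has to be gathered before a replay is attempted.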
Correlation Is The Spine
Distributed bugs cross process boundaries. Without correlation, each service tells a local story that may be true but incomplete.
Useful correlation identifiers include:
- trace id
- client request id
- idempotency key
- tenant or merchant scope
- operation type
- retry attempt id
- message id
- timer id
- log index or version
- epoch, term, or membership configuration id
- provider request id or external effect id
For the duplicate capture incident, no single identifier is enough. The trace id fails because a retry may start a new trace. The idempotency key fails on its own because it may be scoped by merchant and request hash. The provider capture id fails because the second capture may carry no reference to the first local request id.
The incident becomes replayable when those identities are joined:
merchant=m1
idempotency_key=k1
request_hash=h1
client_attempts=a1,a2
replicas=A,B
provider_captures=p778,p779
replication_message=m44
retry_timer=t19
Correlation also needs causality. A list of events is weaker than an ordered partial history.
a1 sent before m44 delivered
t19 fired before B saw k1
p778 recorded before A crashed
a2 routed to B after timeout
Those relationships are what the deterministic replay must preserve.
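One way to make the ordered partial history concrete is to store each relationship as an (earlier, later) pair and check that the pairs admit at least one consistent order. The sketch below uses the standard library's graphlib for this. The event names are illustrative, and reading "after timeout" as "after t19 fired" is an assumption made for the example.

```python
from graphlib import TopologicalSorter, CycleError

# (earlier, later) pairs recorded from the incident; names are illustrative.
HAPPENS_BEFORE = [
    ("a1_sent", "m44_delivered"),
    ("t19_fired", "B_saw_k1"),
    ("p778_recorded", "A_crashed"),
    ("t19_fired", "a2_routed_to_B"),  # assumes the timeout is timer t19
]


def linearize(edges):
    """Return one total order consistent with the partial order.

    Raises CycleError if the recorded relationships contradict each other.
    """
    predecessors = {}
    for earlier, later in edges:
        predecessors.setdefault(earlier, set())
        predecessors.setdefault(later, set()).add(earlier)
    return list(TopologicalSorter(predecessors).static_order())


def respects(order, edges):
    """Check that a candidate schedule preserves every recorded relationship."""
    position = {event: index for index, event in enumerate(order)}
    return all(position[a] < position[b] for a, b in edges)


order = linearize(HAPPENS_BEFORE)
print(order)
```

Any schedule the simulator explores can be passed through respects to confirm it is still a faithful replay of the incident rather than a different bug.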
What To Record
The right record depends on the property, but several categories recur.
Record operation identity:
operation=confirm_order
tenant=m1
idempotency_key=k1
request_hash=h1
client_attempt=a2
Record boundary decisions:
routed attempt a2 to replica B
accepted retry as same request hash
classified provider 500 as unknown
served read from follower C under lease L7
Record time and scheduling facts:
timer t19 fired at logical time 50
replication message m44 delivered after t19
backoff jitter selected 37 ms
scheduler ran retry handler before replication apply
Record communication:
sent message m44 A -> B
held m44 during partition p3
delivered m44 after retry a2
duplicated provider response r9
Record durability:
in-flight record k1 fsynced on A
outcome p778 not fsynced before crash
log entry 91 committed under config C12
snapshot includes key k1 through version 88
Record external effects:
provider saw capture request q1
provider returned capture id p778
provider saw duplicate capture request q2
email service accepted message e44
queue published job j19
Record invariant results:
invariant: at most one provider capture per scoped idempotency key
observed: p778 and p779 for (m1,k1)
status: failed
These facts are small, structured, and safer than dumping full request bodies or databases.
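A minimal sketch of how such facts might be emitted as structured events and then checked against the capture invariant follows. The event kind "provider_capture", the field names, and both helper functions are assumptions for illustration, not an existing logging API.

```python
import json
from collections import defaultdict


def make_event(kind, **facts):
    """Render one structured event as a JSON line with sorted keys."""
    return json.dumps({"kind": kind, **facts}, sort_keys=True)


def check_single_capture(event_lines):
    """Invariant: at most one provider capture per scoped idempotency key."""
    captures = defaultdict(list)
    for line in event_lines:
        event = json.loads(line)
        if event["kind"] == "provider_capture":
            scope = (event["tenant"], event["idempotency_key"])
            captures[scope].append(event["capture_id"])
    return [
        {"invariant": "single_capture", "scope": scope,
         "observed": ids, "status": "failed"}
        for scope, ids in captures.items()
        if len(ids) > 1
    ]


log = [
    make_event("routing", attempt="a2", replica="B"),
    make_event("provider_capture", tenant="m1", idempotency_key="k1", capture_id="p778"),
    make_event("provider_capture", tenant="m1", idempotency_key="k1", capture_id="p779"),
    make_event("provider_capture", tenant="m1", idempotency_key="k2", capture_id="p780"),
]
violations = check_single_capture(log)
print(violations)
```

Because every event carries the scoped key, the invariant can be evaluated from the log alone, without access to request bodies or database contents.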
Worked Example
The incident begins with a customer report:
order: order-1
merchant: m1
symptom: two captures
captures: p778, p779
The first log search finds local fragments:
service A:
trace=tr1 attempt=a1 key=k1 sent capture p778
service B:
trace=tr2 attempt=a2 key=k1 sent capture p779
payment adapter:
provider_request=q2 capture=p779
replication:
delayed message m44 A -> B
This is suggestive, but not yet replayable. The missing questions are causal:
did B receive the idempotency record before attempt a2?
did A durably record p778 before crashing?
was a2 the same request body as a1?
did the retry timer fire before m44 delivered?
did the provider treat q1 and q2 as the same idempotency identity?
A replay-ready incident record answers them:
operation:
merchant=m1
idempotency_key=k1
request_hash=h1
attempts:
a1 -> A at logical time 10
a2 -> B at logical time 50
events:
A durably records in-flight(k1,h1)
A sends replication m44 to B
network holds m44
A sends provider request q1
provider records capture p778
A crashes before outcome fsync
retry timer t19 fires
a2 routes to B
B has not applied m44
B sends provider request q2
provider records capture p779
invariant:
captures_for(m1,k1).count <= 1
observed captures: p778,p779
That record can become a deterministic replay:
seed the replay from the production incident with:
initial key state empty on A and B
delayed m44
crash A after provider response before outcome fsync
retry timer before delivery
provider model accepting q1 and q2 as separate captures
The replay may later be reduced by shrinking, but observability supplies the first faithful shape.
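To show how the incident record could drive a replay, here is a deliberately tiny model, not the real simulator. It assumes two replicas, one replication message, and a provider that treats every request as a new capture. The crash and timer steps are order markers only, since this sketch does not model recovery on A.

```python
KEY = ("m1", "k1")  # scoped idempotency key from the incident


def replay(schedule):
    """Run a fixed schedule and return the provider-side capture ids."""
    known = {"A": set(), "B": set()}  # idempotency records applied per replica
    held = {}                         # message id -> (destination, key)
    captures = []
    next_capture = 778
    for step in schedule:
        kind = step[0]
        if kind == "attempt":
            _, _attempt, replica = step
            if KEY in known[replica]:
                continue  # replica recognizes the retry; no provider call
            known[replica].add(KEY)
            if replica == "A":
                held["m44"] = ("B", KEY)  # replication sent, held by network
            # Provider model: each request becomes a separate capture.
            captures.append(f"p{next_capture}")
            next_capture += 1
        elif kind == "deliver":
            destination, key = held.pop(step[1])
            known[destination].add(key)
        elif kind in ("crash", "timer"):
            pass  # markers that preserve the recorded order
    return captures


INCIDENT = [
    ("attempt", "a1", "A"), ("crash", "A"), ("timer", "t19"),
    ("attempt", "a2", "B"), ("deliver", "m44"),
]
# Same events, but m44 is delivered before the retry reaches B.
REORDERED = [
    ("attempt", "a1", "A"), ("deliver", "m44"), ("crash", "A"),
    ("timer", "t19"), ("attempt", "a2", "B"),
]

print(replay(INCIDENT))
print(replay(REORDERED))
```

The incident schedule yields two captures and the reordered schedule yields one, which isolates the delayed replication message as the ordering that matters for this bug.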
Production Evidence Versus Simulation Evidence
Simulation can record perfect internal events. Production cannot. Production has privacy boundaries, sampling, rate limits, clock skew, log loss, and services owned by different teams.
That means production observability should record stable facts at boundaries:
- before and after external calls
- before and after durable writes
- when timers are scheduled and fired
- when retries are classified
- when messages are sent, held, dropped, or applied
- when invariants detect a violation
Avoid relying only on derived states:
bad:
order status became confirmed
better:
attempt a1 recorded in-flight k1
provider request q1 returned p778
outcome p778 was not durable before crash
attempt a2 triggered provider request q2
The second version can explain an execution. The first version only reports where the system landed.
The same event schema should work in tests and production when possible. If the simulator records message_id, timer_id, idempotency_key, and invariant_name, production should use the same vocabulary. Shared vocabulary makes incident replay cheaper.
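A shared vocabulary can be as simple as one module that both the simulator and the production emitter import. The sketch below is hypothetical: the event kinds and required fields are chosen to mirror the names mentioned above, and a real schema would likely carry more.

```python
from enum import Enum


class EventKind(str, Enum):
    # Hypothetical shared event names used by harness and production alike.
    MESSAGE_SEND = "message_send"
    TIMER_FIRE = "timer_fire"
    INVARIANT_RESULT = "invariant_result"


# Fields each kind must carry for replay tooling to consume the event.
REQUIRED_FIELDS = {
    EventKind.MESSAGE_SEND: {"message_id"},
    EventKind.TIMER_FIRE: {"timer_id"},
    EventKind.INVARIANT_RESULT: {"invariant_name", "idempotency_key"},
}


def validate(event: dict) -> EventKind:
    """Reject events whose kind or fields fall outside the shared vocabulary."""
    kind = EventKind(event["kind"])  # raises ValueError for unknown kinds
    missing = REQUIRED_FIELDS[kind] - set(event)
    if missing:
        raise ValueError(f"{kind.value} is missing fields: {sorted(missing)}")
    return kind


print(validate({"kind": "timer_fire", "timer_id": "t19"}))
```

Running the same validate call in the harness and at the production logging boundary means a renamed or ad hoc event is rejected early instead of being guessed at during an incident.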
Privacy And Volume Boundaries
Replay evidence must not become an excuse to collect everything.
Prefer identifiers and hashes over full payloads:
request_hash=h1
payload_schema=confirm_order_v3
amount_bucket=50_to_100
merchant=m1
Record the minimum payload fields required by the invariant. If exact amount matters to the bug, record amount under the approved data policy. If only request equality matters, record a hash.
Use sampling carefully. Random sampling can drop the only event that makes a rare bug reproducible. For invariant failures, crashes, ambiguous outcomes, and external side effects, biased capture is often better:
always capture:
failed invariants
duplicate external effect detection
unknown provider outcomes
retry after timeout
crash during in-flight operation
Keep retention aligned with debugging needs. A duplicate capture discovered during reconciliation may need event records from hours or days earlier. If logs expire before the invariant is checked, replay evidence disappears.
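The hashing and biased capture ideas above can be sketched in a few lines. The event kind names, the sampling function, and the choice of SHA-256 over a canonical JSON encoding are assumptions for illustration. A real deployment would follow its own data policy, for example by using a keyed hash where payloads are guessable.

```python
import hashlib
import json
import random

# Event kinds that bypass random sampling; names are illustrative.
ALWAYS_CAPTURE = {
    "invariant_failed",
    "duplicate_external_effect",
    "unknown_provider_outcome",
    "retry_after_timeout",
    "crash_during_in_flight",
}


def request_hash(payload: dict) -> str:
    """Hash a canonical encoding so equal requests compare equal without storing the body."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def should_capture(kind: str, sample_rate: float, rng: random.Random) -> bool:
    """Always keep replay-critical events; sample everything else."""
    if kind in ALWAYS_CAPTURE:
        return True
    return rng.random() < sample_rate


rng = random.Random(0)
first = request_hash({"order": "order-1", "amount": 75})
retry = request_hash({"amount": 75, "order": "order-1"})
print(first == retry)
print(should_capture("invariant_failed", 0.0, rng))
```

Sorting keys before hashing keeps the hash stable when two attempts serialize the same request in a different field order, which is what makes it usable for the "same request body" question.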
Common Failure Modes
One mistake is collecting metrics without causality. A spike in duplicate captures says the property failed, but not which schedule caused it.
Another mistake is logging free-form messages without stable identifiers. Text is readable to humans but hard to join across services and hard to turn into replay input.
A third mistake is sampling away rare bug evidence. The rarer the failure, the more important it is to capture the complete causal path when the invariant trips.
A fourth mistake is recording only successes. Ambiguous outcomes, retries, conflicts, rejected duplicate keys, and dependency unknowns are the events that explain distributed failures.
A fifth mistake is using different names in tests and production. If the harness says "timer_fire" and production says "retry woke up", replay tooling has to guess that they are the same kind of event.
Practice
Take one distributed bug your test harness can replay and design its production evidence.
- What invariant would identify the bug?
- Which operation identity links all events?
- Which retry, message, timer, and dependency events must be recorded?
- Which durable-state boundary matters?
- Which external effects must be counted?
- Which fields can be hashes instead of raw data?
- Which events should bypass random sampling?
- How long must the evidence be retained to support replay?
Then write a replay packet schema. Keep it small enough to be safe and practical, but complete enough that another engineer could build the first deterministic reproduction from it.
Connections
- Builds on Testing Client Semantics, Idempotency, and Exactly-Once Claims, because client-visible guarantees require evidence that joins retries, identities, and external effects.
- Prepares for CI Integration, Runtime Budgets, and Failure Triage, where replay evidence must be routed into automated test runs without overwhelming CI.
- Connects to reliability practice because incident response depends on turning production symptoms into reproducible engineering evidence.
Resources
- [DOC] OpenTelemetry Traces
- [PAPER] Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
- [DOC] Jepsen Analyses
- [BOOK] Designing Data-Intensive Applications
Key Takeaways
- Replay-oriented observability records causal evidence, not just health signals.
- Correlation identifiers must connect client attempts, retries, messages, timers, durable state, and external effects.
- Production evidence should be structured, privacy-aware, and aligned with the event vocabulary used by the simulator.
- The best incident record is small enough to store safely and complete enough to seed a deterministic reproduction.