Distributed Testing, Simulation, and Deterministic Replay: Observability for Reproducible Distributed Bugs

LESSON

Distributed Testing, Simulation, and Deterministic Replay

020 30 min intermediate

Core Insight

In CheckoutService, a customer reports one order and two card captures. The simulator already knows how to replay retry timing, replication lag, idempotency records, and external payment effects. The problem is that production evidence is scattered: one service has the client request id, another has the idempotency key, the payment adapter has the provider capture id, and the trace ends before the retry reaches the second replica.

Observability for reproducible distributed bugs is not just dashboards, metrics, or logs. It is evidence design. The system must record enough causal information to reconstruct the failing history: who acted, what identity was used, which message or timer fired, which state was durable, which external effect happened, and which invariant failed.

The trade-off is evidence completeness versus cost, privacy, and noise. Recording every payload, message, and internal transition is expensive and unsafe. Recording only aggregate metrics is cheap but cannot drive replay. A good observability design captures the small set of identifiers, event facts, and boundary decisions needed to turn an incident into a deterministic test.

Observability For Replay

Operational observability often asks:

is the service healthy?
where is latency high?
which dependency is failing?
how many errors happened?

Replay-oriented observability asks a different set of questions:

which logical operation was this?
which retry attempt was this?
which replica handled each attempt?
which timer fired first?
which message was delayed or lost?
which state was durable at crash time?
which external effect was observed?
which invariant did this history violate?

Metrics are useful for detecting a problem, but they are usually too aggregated to reproduce it. Logs are useful when they carry stable identities and event facts, but free-form text without correlation is hard to replay. Traces are useful when they show causality across services, but only if they include the domain identities that make the bug meaningful.

The goal is a replay packet: a compact incident artifact that can seed a local or simulated reproduction.

replay packet:
  operation identity
  client attempts
  selected trace spans
  message and timer events
  fault or crash evidence
  dependency requests and responses
  relevant durable state facts
  observed external effects
  failed invariant

The packet does not need all production data. It needs enough causal shape to build the replay safely.
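As a minimal sketch, the packet outline above could be expressed as a small typed structure. The class and field names (`OperationIdentity`, `ReplayPacket`, `to_json`) are assumptions made for illustration, not a schema the lesson prescribes. The related outline entries (trace spans, message, timer, fault, and dependency events) are folded into one `events` list to keep the sketch short.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass(frozen=True)
class OperationIdentity:
    # Hypothetical field names: the scoped identity that links all evidence.
    tenant: str
    idempotency_key: str
    request_hash: str


@dataclass
class ReplayPacket:
    operation: OperationIdentity
    client_attempts: list[str] = field(default_factory=list)
    # Spans plus message, timer, fault, and dependency events, as small dicts.
    events: list[dict] = field(default_factory=list)
    durable_state_facts: list[str] = field(default_factory=list)
    external_effects: list[str] = field(default_factory=list)
    failed_invariant: str = ""

    def to_json(self) -> str:
        # Sorted keys keep the artifact stable across runs and easy to diff.
        return json.dumps(asdict(self), sort_keys=True)


packet = ReplayPacket(
    operation=OperationIdentity("m1", "k1", "h1"),
    client_attempts=["a1", "a2"],
    events=[
        {"kind": "message_held", "message_id": "m44"},
        {"kind": "timer_fired", "timer_id": "t19"},
    ],
    durable_state_facts=[
        "in-flight k1 fsynced on A",
        "outcome p778 not fsynced before crash",
    ],
    external_effects=["p778", "p779"],
    failed_invariant="at most one provider capture per scoped idempotency key",
)
```

Serializing to JSON lets the packet be attached to an incident ticket or loaded by a test harness without access to production systems.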

Correlation Is The Spine

Distributed bugs cross process boundaries. Without correlation, each service tells a local story that may be true but incomplete.

Useful correlation identifiers include:

trace id
client request id
idempotency key
tenant or merchant scope
operation type
retry attempt id
message id
timer id
log index or version
epoch, term, or membership configuration id
provider request id or external effect id

For the duplicate capture incident, no single identifier is enough. The trace id fails because a retry may create a new trace. The idempotency key fails on its own because it may be scoped by merchant and request hash. The provider capture id fails because the service that issued the second capture may not know the local request id of the first.

The incident becomes replayable when those identities are joined:

merchant=m1
idempotency_key=k1
request_hash=h1
client_attempts=a1,a2
replicas=A,B
provider_captures=p778,p779
replication_message=m44
retry_timer=t19

Correlation also needs causality. A list of events is weaker than an ordered partial history.

a1 sent before m44 delivered
t19 fired before B saw k1
p778 recorded before A crashed
a2 routed to B after timeout

Those relationships are what the deterministic replay must preserve.
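The ordering relationships above can be treated as happens-before constraints that any candidate replay schedule must satisfy. The following is a sketch under stated assumptions: the event names are invented labels for the four facts listed, "after timeout" is assumed to mean "after t19 fired", and `violations` is a hypothetical helper.

```python
# Each pair means: the first event must occur before the second.
HAPPENS_BEFORE = [
    ("a1_sent", "m44_delivered"),
    ("t19_fired", "B_saw_k1"),
    ("p778_recorded", "A_crashed"),
    ("t19_fired", "a2_routed_to_B"),  # assumption: the timeout is t19 firing
]


def violations(schedule, constraints):
    """Return the constraints that a totally ordered schedule breaks."""
    position = {event: index for index, event in enumerate(schedule)}
    broken = []
    for before, after in constraints:
        if before not in position or after not in position:
            broken.append((before, after))  # missing evidence counts as broken
        elif position[before] >= position[after]:
            broken.append((before, after))
    return broken


faithful = [
    "a1_sent", "p778_recorded", "A_crashed",
    "t19_fired", "a2_routed_to_B", "m44_delivered", "B_saw_k1",
]
# Same events, but replication reaches B before the retry timer fires.
unfaithful = [
    "a1_sent", "m44_delivered", "B_saw_k1",
    "p778_recorded", "A_crashed", "t19_fired", "a2_routed_to_B",
]

faithful_breaks = violations(faithful, HAPPENS_BEFORE)
unfaithful_breaks = violations(unfaithful, HAPPENS_BEFORE)
```

A replay harness could run a check like this before executing a schedule, so that a reproduction which silently reorders the timer and the replication message is rejected rather than reported as "could not reproduce".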

What To Record

The right record depends on the property, but several categories recur.

Record operation identity:

operation=confirm_order
tenant=m1
idempotency_key=k1
request_hash=h1
client_attempt=a2

Record boundary decisions:

routed attempt a2 to replica B
accepted retry as same request hash
classified provider 500 as unknown
served read from follower C under lease L7

Record time and scheduling facts:

timer t19 fired at logical time 50
replication message m44 delivered after t19
backoff jitter selected 37 ms
scheduler ran retry handler before replication apply

Record communication:

sent message m44 A -> B
held m44 during partition p3
delivered m44 after retry a2
duplicated provider response r9

Record durability:

in-flight record k1 fsynced on A
outcome p778 not fsynced before crash
log entry 91 committed under config C12
snapshot includes key k1 through version 88

Record external effects:

provider saw capture request q1
provider returned capture id p778
provider saw duplicate capture request q2
email service accepted message e44
queue published job j19

Record invariant results:

invariant: at most one provider capture per scoped idempotency key
observed: p778 and p779 for (m1,k1)
status: failed

These facts are small, structured, and safer than dumping full request bodies or databases.
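The invariant record above can be produced mechanically from structured external-effect events. This is a minimal sketch that assumes a hypothetical event shape (`kind`, `tenant`, `idempotency_key`, `capture_id`) and a hypothetical function name, `check_single_capture`. Capture ids are deduplicated so that a repeated report of the same capture is not counted as a second capture.

```python
from collections import defaultdict

INVARIANT = "at most one provider capture per scoped idempotency key"


def check_single_capture(effects):
    """Return one failure record per scope that has more than one capture."""
    captures = defaultdict(set)
    for effect in effects:
        if effect["kind"] == "provider_capture":
            scope = (effect["tenant"], effect["idempotency_key"])
            captures[scope].add(effect["capture_id"])

    failures = []
    for scope, capture_ids in captures.items():
        if len(capture_ids) > 1:
            failures.append({
                "invariant": INVARIANT,
                "scope": scope,
                "observed": sorted(capture_ids),
                "status": "failed",
            })
    return failures


effects = [
    {"kind": "provider_capture", "tenant": "m1", "idempotency_key": "k1", "capture_id": "p778"},
    {"kind": "provider_capture", "tenant": "m1", "idempotency_key": "k1", "capture_id": "p779"},
    # A different key with a single capture does not violate the invariant.
    {"kind": "provider_capture", "tenant": "m1", "idempotency_key": "k2", "capture_id": "p900"},
]

failures = check_single_capture(effects)
```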

Worked Example

The incident begins with a customer report:

order: order-1
merchant: m1
symptom: two captures
captures: p778, p779

The first log search finds local fragments:

service A:
  trace=tr1 attempt=a1 key=k1 sent capture p778

service B:
  trace=tr2 attempt=a2 key=k1 sent capture p779

payment adapter:
  provider_request=q2 capture=p779

replication:
  delayed message m44 A -> B

This is suggestive, but not yet replayable. The missing questions are causal:

did B receive the idempotency record before attempt a2?
did A durably record p778 before crashing?
was a2 the same request body as a1?
did the retry timer fire before m44 delivered?
did the provider treat q1 and q2 as the same idempotency identity?

A replay-ready incident record answers them:

operation:
  merchant=m1
  idempotency_key=k1
  request_hash=h1

attempts:
  a1 -> A at logical time 10
  a2 -> B at logical time 50

events:
  A durably records in-flight(k1,h1)
  A sends replication m44 to B
  network holds m44
  A sends provider request q1
  provider records capture p778
  A crashes before outcome fsync
  retry timer t19 fires
  a2 routes to B
  B has not applied m44
  B sends provider request q2
  provider records capture p779

invariant:
  captures_for(m1,k1).count <= 1
  observed captures: p778,p779

That record can become a deterministic replay:

seed the replay from the production incident with:
  initial key state empty on A and B
  delayed m44
  crash A after provider response before outcome fsync
  retry timer before delivery
  provider model accepting q1 and q2 as separate captures

The replay may later be reduced by shrinking, but observability supplies the first faithful shape.
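To make the seeding concrete, here is a deliberately tiny deterministic model of the incident. Everything in it is an assumption made for illustration: the `Provider` and `Replica` classes, the rule that a replica skips the provider call once it knows the key, and the capture id numbering chosen to match the incident. It is a sketch of the replay shape, not the simulator the lesson refers to.

```python
class Provider:
    """Provider model that accepts every request as a separate capture."""

    def __init__(self):
        self.captures = []

    def capture(self, request_id):
        capture_id = f"p{778 + len(self.captures)}"  # ids chosen to match the incident
        self.captures.append(capture_id)
        return capture_id


class Replica:
    def __init__(self, name):
        self.name = name
        self.known_keys = set()  # idempotency records this replica has applied

    def handle_attempt(self, key, provider, request_id):
        if key in self.known_keys:
            return None  # treated as a duplicate: no second provider call
        self.known_keys.add(key)
        return provider.capture(request_id)


def replay(deliver_m44_before_retry):
    provider = Provider()
    a, b = Replica("A"), Replica("B")  # initial key state empty on A and B

    a.handle_attempt("k1", provider, "q1")  # a1 -> A, provider records p778
    if deliver_m44_before_retry:
        b.known_keys.add("k1")              # m44 applied on B before the retry
    # A crashes here before the outcome fsync; it takes no further steps.
    b.handle_attempt("k1", provider, "q2")  # t19 fires, a2 -> B
    return provider.captures


incident = replay(deliver_m44_before_retry=False)
control = replay(deliver_m44_before_retry=True)
```

Running both schedules side by side shows that the held message is the deciding fact: the same attempts produce one capture when m44 is applied first and two when the retry wins the race.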

Production Evidence Versus Simulation Evidence

Simulation can record perfect internal events. Production cannot. Production has privacy boundaries, sampling, rate limits, clock skew, log loss, and services owned by different teams.

That means production observability should record stable facts at boundaries:

before and after external calls
before and after durable writes
when timers are scheduled and fired
when retries are classified
when messages are sent, held, dropped, or applied
when invariants detect a violation

Avoid relying only on derived states:

bad:
  order status became confirmed

better:
  attempt a1 recorded in-flight k1
  provider request q1 returned p778
  outcome p778 was not durable before crash
  attempt a2 triggered provider request q2

The second version can explain an execution. The first version only reports where the system landed.

The same event schema should work in tests and production when possible. If the simulator records message_id, timer_id, idempotency_key, and invariant_name, production should use the same vocabulary. Shared vocabulary makes incident replay cheaper.
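A shared vocabulary can be enforced with a single event constructor that both the simulator and production code import. The sketch below is one possible shape; `EventKind`, `REQUIRED_FIELDS`, and `make_event` are hypothetical names, and the required fields per kind are assumptions built from the identifiers mentioned above.

```python
from enum import Enum


class EventKind(str, Enum):
    MESSAGE_SENT = "message_sent"
    MESSAGE_APPLIED = "message_applied"
    TIMER_FIRED = "timer_fired"
    INVARIANT_FAILED = "invariant_failed"


# Assumed minimum fields for each kind of event.
REQUIRED_FIELDS = {
    EventKind.MESSAGE_SENT: {"message_id", "idempotency_key"},
    EventKind.MESSAGE_APPLIED: {"message_id", "idempotency_key"},
    EventKind.TIMER_FIRED: {"timer_id"},
    EventKind.INVARIANT_FAILED: {"invariant_name", "idempotency_key"},
}


def make_event(kind, source, **fields):
    """Build an event dict; reject unknown kinds and missing identifiers."""
    kind = EventKind(kind)  # raises ValueError for names outside the vocabulary
    missing = REQUIRED_FIELDS[kind] - set(fields)
    if missing:
        raise ValueError(f"{kind.value} is missing fields: {sorted(missing)}")
    return {"kind": kind.value, "source": source, **fields}


sim_event = make_event("timer_fired", "simulator", timer_id="t19")
prod_event = make_event("timer_fired", "production", timer_id="t19")
```

Because both environments go through the same constructor, replay tooling can match events on `kind` and identifiers alone, and an ad hoc name is rejected when the event is built rather than discovered during an incident.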

Privacy And Volume Boundaries

Replay evidence must not become an excuse to collect everything.

Prefer identifiers and hashes over full payloads:

request_hash=h1
payload_schema=confirm_order_v3
amount_bucket=50_to_100
merchant=m1

Record the minimum payload fields required by the invariant. If exact amount matters to the bug, record amount under the approved data policy. If only request equality matters, record a hash.
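When only request equality matters, the hash can be computed over a canonical encoding of the relevant fields so that two services produce the same value for the same request. This is a minimal sketch: the function name `request_hash`, the choice of SHA-256 over sorted-key JSON, and the example field values are assumptions, not the scheme the lesson's system uses.

```python
import hashlib
import json


def request_hash(fields):
    """Hash a canonical JSON encoding of the request fields."""
    # Sorted keys and fixed separators make the encoding order-independent.
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = request_hash({"order": "order-1", "amount": 7500})
retry = request_hash({"amount": 7500, "order": "order-1"})
changed = request_hash({"order": "order-1", "amount": 7600})
```

A plain hash of low-entropy fields can be guessed by trying likely values, so a keyed hash (for example via Python's `hmac` module) may be the safer choice where the data policy requires it.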

Use sampling carefully. Random sampling can drop the only event that makes a rare bug reproducible. For invariant failures, crashes, ambiguous outcomes, and external side effects, biased capture is often better:

always capture:
  failed invariants
  duplicate external effect detection
  unknown provider outcomes
  retry after timeout
  crash during in-flight operation
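The always-capture list can be sketched as a sampling decision that bypasses the random draw for evidence-critical events. The event kind names and the `should_capture` function are hypothetical, and the random generator is passed in so the decision is reproducible in tests.

```python
import random

# Hypothetical kind names for the always-capture categories listed above.
ALWAYS_CAPTURE = {
    "invariant_failed",
    "duplicate_external_effect",
    "provider_outcome_unknown",
    "retry_after_timeout",
    "crash_during_in_flight",
}


def should_capture(event_kind, sample_rate, rng):
    """Keep every evidence-critical event; sample the rest at sample_rate."""
    if event_kind in ALWAYS_CAPTURE:
        return True
    return rng.random() < sample_rate


rng = random.Random(7)  # seeded so the decision sequence is repeatable
kept_retry = should_capture("retry_after_timeout", 0.0, rng)
kept_routine = should_capture("request_served", 0.0, rng)
```

Even with the sample rate set to zero, the retry-after-timeout event is kept while routine traffic is dropped, which is the bias the list describes.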

Keep retention aligned with debugging needs. A duplicate capture discovered during reconciliation may need event records from hours or days earlier. If logs expire before the invariant is checked, replay evidence disappears.

Common Failure Modes

One mistake is collecting metrics without causality. A spike in duplicate captures says the property failed, but not which schedule caused it.

Another mistake is logging free-form messages without stable identifiers. Text is readable to humans but hard to join across services and hard to turn into replay input.

A third mistake is sampling away rare bug evidence. The rarer the failure, the more important it is to capture the complete causal path when the invariant trips.

A fourth mistake is recording only successes. Ambiguous outcomes, retries, conflicts, rejected duplicate keys, and dependency unknowns are the events that explain distributed failures.

A fifth mistake is using different names in tests and production. If the harness says timer_fire and production logs "retry woke up", replay tooling has to guess that they are the same kind of event.

Practice

Take one distributed bug your test harness can replay and design its production evidence.

What invariant would identify the bug?
Which operation identity links all events?
Which retry, message, timer, and dependency events must be recorded?
Which durable-state boundary matters?
Which external effects must be counted?
Which fields can be hashes instead of raw data?
Which events should bypass random sampling?
How long must the evidence be retained to support replay?

Then write a replay packet schema. Keep it small enough to be safe and practical, but complete enough that another engineer could build the first deterministic reproduction from it.

Connections

Builds on Testing Client Semantics, Idempotency, and Exactly-Once Claims, because client-visible guarantees require evidence that joins retries, identities, and external effects.
Prepares for CI Integration, Runtime Budgets, and Failure Triage, where replay evidence must be routed into automated test runs without overwhelming CI.
Connects to reliability practice because incident response depends on turning production symptoms into reproducible engineering evidence.

Resources

[DOC] OpenTelemetry Traces
[PAPER] Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
[DOC] Jepsen Analyses
[BOOK] Designing Data-Intensive Applications

Key Takeaways

Replay-oriented observability records causal evidence, not just health signals.
Correlation identifiers must connect client attempts, retries, messages, timers, durable state, and external effects.
Production evidence should be structured, privacy-aware, and aligned with the event vocabulary used by the simulator.
The best incident record is small enough to store safely and complete enough to seed a deterministic reproduction.
