Distributed Testing, Simulation, and Deterministic Replay: Deterministic Replay for Inputs, Time, and Scheduling

013 30 min intermediate

Core Insight

In LedgerService, the trace shows a transfer acknowledged before its log entry was safely replicated. Rerunning the test might not fail again: the retry timer could fire later, the network message could arrive earlier, the random client generator could choose different accounts, or the host runtime could schedule a task in a different order. The bug is real, but an ordinary rerun gives no guarantee of reproducing the execution you need to debug.

Deterministic replay means recording and then controlling the choices that made the execution happen. For distributed tests, the important choices are usually inputs, time, and scheduling: which client operation arrived, what the clock returned, which timer fired, which message was delivered, which dependency response appeared, and which task ran next.

The trade-off is scope versus reproducibility. A narrow replay boundary is easier to build but may miss the nondeterminism that caused the bug. A broad boundary can reproduce more failures but requires more instrumentation, more stable state snapshots, and stricter control over the runtime.

What Replay Must Control

Replay starts by deciding what the harness owns. Anything outside that boundary can still introduce nondeterminism.

A deterministic distributed replay usually controls three categories.

The first is inputs. Inputs include client operations, generated workload values, network messages, external dependency responses, crash commands, restart commands, configuration, initial state, random seeds, and any data read from the outside world.

input examples:
- client-2 invokes transfer A -> C amount 10
- payment-adapter returns approved for req-17
- network injects partition n1 | n2,n3
- generator chooses account A from seed 91827

The second is time. Time includes logical clocks, deadlines, sleeps, retry timers, lease expiration, heartbeat intervals, cache TTLs, and scheduled callbacks. A replay that calls the host clock directly is already leaking nondeterminism.

time examples:
- now() returns simulated step 120
- retry timer t44 becomes fireable at step 150
- lease L9 expires at logical time 3000
- scheduler advances to next timer without sleeping
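
The time examples above can be sketched as a small harness-owned clock. This is a minimal illustration, not a reference design: the names SimClock, call_at, and advance_to_next_timer are hypothetical, and logical time is assumed to be a plain integer step count.

```python
import heapq
from typing import Callable, List, Tuple


class SimClock:
    """Logical clock owned by the harness; it never reads the host clock."""

    def __init__(self) -> None:
        self._now = 0
        self._seq = 0  # tie-breaker: equal deadlines fire in creation order
        self._timers: List[Tuple[int, int, str, Callable[[], None]]] = []

    def now(self) -> int:
        return self._now

    def call_at(self, deadline: int, name: str, callback: Callable[[], None]) -> None:
        heapq.heappush(self._timers, (deadline, self._seq, name, callback))
        self._seq += 1

    def advance_to_next_timer(self) -> str:
        """Jump to the earliest deadline and fire it, without sleeping."""
        deadline, _, name, callback = heapq.heappop(self._timers)
        self._now = max(self._now, deadline)
        callback()
        return name


clock = SimClock()
fired = []
clock.call_at(150, "t44", lambda: fired.append(("t44", clock.now())))
clock.call_at(120, "t17", lambda: fired.append(("t17", clock.now())))
clock.advance_to_next_timer()
clock.advance_to_next_timer()
```

Because the timers fire in deadline order and time only moves when the harness says so, the same sequence of calls always yields the same firing order and the same now() values.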

The third is scheduling. Scheduling decides which enabled action happens next: deliver a message, run a node callback, fire a timer, crash a process, resume a client, or apply a dependency response. In thread-based systems, this may mean controlling task interleavings. In simulator-based systems, it means choosing the next event from the enabled set.

scheduling examples:
- step 31: deliver msg-40 to n2
- step 32: run n2 append_entries handler
- step 33: fire client retry timer t17
- step 34: crash n1 before durable flush completes
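
One way to picture "choosing the next event from the enabled set" is a scheduler that draws from a seeded source and writes down every decision. The sketch below is illustrative only; SeededScheduler and run are hypothetical names, and the enabled actions are plain strings loosely based on the examples above.

```python
import random
from typing import List


class SeededScheduler:
    """Picks the next action from the enabled set and records each decision."""

    def __init__(self, seed: int) -> None:
        self._rng = random.Random(seed)
        self.decisions: List[str] = []

    def choose(self, enabled: List[str]) -> str:
        if not enabled:
            raise ValueError("no enabled actions: the simulation is stuck or finished")
        # Sort first so the pick never depends on set or dict iteration order.
        action = self._rng.choice(sorted(enabled))
        self.decisions.append(action)
        return action


def run(seed: int) -> List[str]:
    # Toy enabled set; each chosen action is removed once it has "happened".
    pending = {"deliver msg-40 to n2", "fire client retry timer t17", "crash n1"}
    scheduler = SeededScheduler(seed)
    while pending:
        pending.remove(scheduler.choose(list(pending)))
    return scheduler.decisions


first = run(72109)
second = run(72109)  # same seed, same enabled sets: same decision list
```

The recorded decisions list, not the seed, is what a replay would later enforce step by step.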

If the replay owns those choices, a failing run can become a repeatable debugging artifact instead of a story someone hopes to see again.

Replay Record

A replay record is the minimal evidence needed to run the same execution again.

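It usually includes a replay identifier, the initial state or snapshot, the random seed, and the ordered steps: inputs, scheduled actions, injected faults, and checks.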

A useful record is stable enough to survive normal code movement. It should not depend on memory addresses, thread ids assigned by the operating system, wall-clock timestamps, or generated UUIDs that are not themselves recorded.

One compact replay record might look like this:

replay_id: ledger-2026-06-05-seed-72109
initial_state: snapshots/ledger-three-node-empty-v4
seed: 72109

steps:
1  input  c1 invokes transfer A -> B 10 req-1
2  action deliver req-1 to n1
3  action n1 appends log index 40
4  fault  partition n1 | n2,n3
5  action n1 returns ok for req-1
6  input  c2 invokes transfer A -> C 10 req-2
7  action deliver req-2 to n1
8  action n1 appends log index 41
9  action n1 returns ok for req-2
10 fault  crash n1 before durable flush index 41
11 fault  heal partition
12 action n2 becomes leader
13 input  c3 invokes read_total
14 action c3 receives 290
15 check  total_balance_conserved fails

This is not only a log. It is a script the harness can enforce.
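
To be enforceable, the record has to be structured data the harness can load. The sketch below shows one possible shape; the class names and the JSON encoding are assumptions made for illustration, and only the first three steps of the record above are reproduced.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass(frozen=True)
class Step:
    index: int
    kind: str    # "input", "action", "fault", or "check"
    detail: str


@dataclass
class ReplayRecord:
    replay_id: str
    initial_state: str
    seed: int
    steps: List[Step] = field(default_factory=list)

    def to_json(self) -> str:
        # sort_keys keeps the serialized form stable across runs.
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(text: str) -> "ReplayRecord":
        raw = json.loads(text)
        steps = [Step(**s) for s in raw.pop("steps")]
        return ReplayRecord(steps=steps, **raw)


record = ReplayRecord(
    replay_id="ledger-2026-06-05-seed-72109",
    initial_state="snapshots/ledger-three-node-empty-v4",
    seed=72109,
    steps=[
        Step(1, "input", "c1 invokes transfer A -> B 10 req-1"),
        Step(2, "action", "deliver req-1 to n1"),
        Step(3, "action", "n1 appends log index 40"),
    ],
)
restored = ReplayRecord.from_json(record.to_json())
```

Nothing in this shape refers to memory addresses, OS thread ids, or wall-clock timestamps, which is what lets the record survive a round trip through storage.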

Worked Example

Suppose a randomized exploration run finds the LedgerService failure. The first run was selected by seed, but seed alone is not always enough. If code changes add one extra random draw, the same seed can diverge. If a timer uses the host clock, the schedule can diverge. If an external dependency returns a live response, the replay can diverge.

The harness should therefore check the replay step by step.

expected step 8:
  selected_action: n1 append log index 41 for req-2
  precondition: n1 alive, req-2 delivered, log index 40 exists
  result:
    log[n1][41] = transfer A -> C 10
    trace event ev-32 emitted

actual step 8 during replay:
  selected_action: n1 append log index 41 for req-2
  result matches expected state digest

If the replay diverges, the harness should fail early:

replay divergence at step 8:
  expected enabled action: append log index 41
  actual enabled actions:
  - fire retry timer t19
  - deliver heartbeat msg-88

likely missing control: timer creation or scheduler decision

That divergence is useful. It tells the team that the replay boundary is incomplete. A replay system should not quietly drift and then report a different failure.
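
A fail-early replay loop might look roughly like the following. It is a sketch under simplifying assumptions: actions and state digests are plain strings, and replay, ReplayDivergence, and ToyNode are hypothetical names rather than part of any real harness.

```python
from typing import Callable, List, Tuple


class ReplayDivergence(Exception):
    """Raised as soon as the replay stops matching the recorded execution."""


def replay(
    expected: List[Tuple[str, str]],            # (action, digest after the action)
    enabled_actions: Callable[[], List[str]],
    apply_action: Callable[[str], str],         # runs the action, returns new digest
) -> None:
    for step, (action, digest) in enumerate(expected, start=1):
        enabled = sorted(enabled_actions())
        if action not in enabled:
            raise ReplayDivergence(f"step {step}: expected {action!r}, enabled: {enabled}")
        actual = apply_action(action)
        if actual != digest:
            raise ReplayDivergence(f"step {step}: digest {actual!r} != recorded {digest!r}")


class ToyNode:
    """Stand-in node whose only enabled action is appending the next log index."""

    def __init__(self, first_index: int) -> None:
        self.next_index = first_index
        self.log: List[int] = []

    def enabled_actions(self) -> List[str]:
        return [f"append log index {self.next_index}"]

    def apply_action(self, action: str) -> str:
        self.log.append(self.next_index)
        self.next_index += 1
        return ",".join(str(i) for i in self.log)


recorded = [("append log index 40", "40"), ("append log index 41", "40,41")]
node = ToyNode(first_index=40)
replay(recorded, node.enabled_actions, node.apply_action)  # matches, returns quietly

drifted = ToyNode(first_index=41)  # wrong initial state
try:
    replay(recorded, drifted.enabled_actions, drifted.apply_action)
    outcome = "no divergence"
except ReplayDivergence as err:
    outcome = str(err)
```

The drifted node is stopped at step 1, before it applies anything, so the report points at the first mismatch rather than at some later and unrelated failure.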

Levels of Replay

Different systems choose different replay levels.

| Level | What it replays | Strength | Cost |
|---|---|---|---|
| Workload replay | same client operations and faults | simple and portable | misses internal scheduling races |
| Trace-guided replay | same external inputs plus selected internal events | good debugging signal | requires instrumentation |
| Simulator replay | same deterministic event schedule | strong for modeled systems | only as faithful as the simulator |
| Runtime record/replay | process-level execution choices | high fidelity | expensive and environment-specific |

For this track, the simulator and trace-guided levels are the main tools. They are practical for distributed test harnesses because they can make messages, timers, failures, and random choices explicit.

Runtime record/replay can be powerful for low-level concurrency bugs, but it is not a replacement for a clear distributed model. It may reproduce the process execution without explaining whether the workload, consistency model, or failure assumptions were valid.

Building Replayable Distributed Tests

Replayability is easier to design in from the start than to add after a flaky failure.

Use deterministic clocks. Code under test should ask the harness for time, not the host clock. Sleeps should become scheduled timers.

Use deterministic random sources. The generator, scheduler, simulated network, and id creation should draw from named seeded sources or record every draw.
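
One possible reading of "named seeded sources" is to derive an independent generator per component from a root seed and a name. The helper named_rng below is a hypothetical illustration; hashing with SHA-256 is just one convenient way to derive a stable per-name seed.

```python
import hashlib
import random


def named_rng(root_seed: int, name: str) -> random.Random:
    """Derive a per-component generator from the root seed and a stable name."""
    material = f"{root_seed}:{name}".encode("utf-8")
    derived = int.from_bytes(hashlib.sha256(material).digest()[:8], "big")
    return random.Random(derived)


# The workload stream on its own.
workload = named_rng(72109, "workload")
before = [workload.randrange(100) for _ in range(5)]

# Same root seed, but the network source now makes an extra draw first.
network = named_rng(72109, "network")
network.random()
workload_again = named_rng(72109, "workload")
after = [workload_again.randrange(100) for _ in range(5)]
# before == after: the extra network draw did not shift the workload stream.
```

With a single shared generator, that extra draw would have shifted every later workload value, which is the fragility the named sources are meant to contain.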

Use an injected scheduler. Components should not directly deliver messages, run callbacks, or advance time. They should expose enabled actions to the harness.

Use stable state summaries. Replays should compare meaningful state digests such as log term/index, durable key versions, queue contents, and invariant-relevant data.
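
A state digest can be as simple as a hash over a canonical encoding of the invariant-relevant fields. The function state_digest and the field names in the sample states are assumptions chosen for illustration.

```python
import hashlib
import json
from typing import Any, Dict


def state_digest(state: Dict[str, Any]) -> str:
    """Hash a canonical JSON encoding so key order and whitespace cannot matter."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Same meaningful state, built in a different key order.
a = state_digest({"node": "n1", "log": {"term": 3, "index": 41}})
b = state_digest({"log": {"index": 41, "term": 3}, "node": "n1"})

# One invariant-relevant field differs.
c = state_digest({"node": "n1", "log": {"term": 3, "index": 40}})
```

Here a and b are equal while c differs, so the digest reacts to the log index but not to incidental ordering. Fields such as memory addresses or host timestamps should simply be left out of the dictionary.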

Use controlled external dependencies. Payment adapters, storage calls, DNS, object stores, and identity services should return recorded or simulated responses during replay.
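
During replay, a dependency can be replaced by a stub that only returns what was recorded. RecordedDependency is a hypothetical name, and keying responses by request id is an assumption; the sample entry mirrors the payment-adapter input example earlier in the lesson.

```python
from typing import Dict


class RecordedDependency:
    """Replay stand-in for an external service: answers only from the recording."""

    def __init__(self, name: str, responses: Dict[str, str]) -> None:
        self.name = name
        self._responses = dict(responses)

    def call(self, request_id: str) -> str:
        try:
            return self._responses[request_id]
        except KeyError:
            # Refusing to improvise keeps live behavior out of the replay.
            raise LookupError(
                f"{self.name}: no recorded response for {request_id}"
            ) from None


payment_adapter = RecordedDependency("payment-adapter", {"req-17": "approved"})
answer = payment_adapter.call("req-17")
```

An unrecorded request raises instead of falling through to a live call, so a gap in the recording shows up as an explicit error rather than as silent drift.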

Use divergence checks. Replay should verify that the expected action is enabled and that the resulting state matches the recorded transition closely enough for the bug being debugged.

Common Failure Modes

One mistake is recording inputs but not time. Retry bugs, lease bugs, heartbeat bugs, and cache expiration bugs often depend on exactly when the system believes time advanced.

Another mistake is recording time but not scheduling. If two messages are both deliverable and the replay does not control which one arrives first, the failure can disappear.

A third mistake is trusting a seed without recording the decisions it produced. Seeds are useful, but they become fragile when code changes alter the order of random draws.
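
Recording the decisions themselves can be sketched as a thin wrapper around the random source. RecordingChoices and ReplayedChoices are hypothetical names; the labels are an assumed convention for noticing when the code asks for a different decision than the one recorded.

```python
import random
from typing import List, Sequence, Tuple


class RecordingChoices:
    """Exploration mode: draw from a seed, but log every value that was chosen."""

    def __init__(self, seed: int) -> None:
        self._rng = random.Random(seed)
        self.log: List[Tuple[str, str]] = []

    def pick(self, label: str, options: Sequence[str]) -> str:
        value = self._rng.choice(list(options))
        self.log.append((label, value))
        return value


class ReplayedChoices:
    """Replay mode: return logged values and never touch a random generator."""

    def __init__(self, log: List[Tuple[str, str]]) -> None:
        self._log = list(log)
        self._pos = 0

    def pick(self, label: str, options: Sequence[str]) -> str:
        recorded_label, value = self._log[self._pos]
        if recorded_label != label or value not in options:
            raise RuntimeError(
                f"choice {self._pos}: recorded {recorded_label!r}={value!r}, asked {label!r}"
            )
        self._pos += 1
        return value


explore = RecordingChoices(seed=91827)
chosen = [explore.pick("account", ["A", "B", "C"]) for _ in range(4)]

rerun = ReplayedChoices(explore.log)
replayed = [rerun.pick("account", ["A", "B", "C"]) for _ in range(4)]
```

The replayed values come from the log, so they stay the same even if a later code change adds or reorders draws elsewhere; a mismatched label surfaces as an error instead of a silently different value.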

A fourth mistake is replaying after a different initial state. Distributed bugs often depend on durable log prefixes, queue contents, membership state, or cached metadata. The initial snapshot is part of the replay.

A fifth mistake is letting replay drift. If the expected action is no longer enabled, the tool should report divergence instead of continuing into a different execution.

Practice

Take a failing trace from the previous lesson and decide what the replay record must contain:

  1. Which initial state or snapshot is required?
  2. Which client inputs and fault actions must be replayed?
  3. Which time values, timers, leases, or deadlines must be controlled?
  4. Which scheduler decisions must be fixed?
  5. Which random choices must be recorded as values instead of only as a seed?
  6. Which state digest would detect replay divergence early?

Then intentionally remove one category. If the failure no longer reproduces, that category belongs inside the replay boundary.
