Distributed Testing, Simulation, and Deterministic Replay: Deterministic Replay for Inputs, Time, and Scheduling
Core Insight
In LedgerService, the trace shows a transfer acknowledged before its log entry was safely replicated. Rerunning the test might not fail again: the retry timer could fire later, the network message could arrive earlier, the random client generator could choose different accounts, or the host runtime could schedule a task in a different order. The bug is real, but a rerun that may or may not fail is not a reproduction you can debug.
Deterministic replay means recording and then controlling the choices that made the execution happen. For distributed tests, the important choices are usually inputs, time, and scheduling: which client operation arrived, what the clock returned, which timer fired, which message was delivered, which dependency response appeared, and which task ran next.
The trade-off is scope versus reproducibility. A narrow replay boundary is easier to build but may miss the nondeterminism that caused the bug. A broad boundary can reproduce more failures but requires more instrumentation, more stable state snapshots, and stricter control over the runtime.
What Replay Must Control
Replay starts by deciding what the harness owns. Anything outside that boundary can still introduce nondeterminism.
A deterministic distributed replay usually controls three categories.
The first is inputs. Inputs include client operations, generated workload values, network messages, external dependency responses, crash commands, restart commands, configuration, initial state, random seeds, and any data read from the outside world.
input examples:
- client-2 invokes transfer A -> C amount 10
- payment-adapter returns approved for req-17
- network injects partition n1 | n2,n3
- generator chooses account A from seed 91827
The second is time. Time includes logical clocks, deadlines, sleeps, retry timers, lease expiration, heartbeat intervals, cache TTLs, and scheduled callbacks. A replay that calls the host clock directly is already leaking nondeterminism.
time examples:
- now() returns simulated step 120
- retry timer t44 becomes fireable at step 150
- lease L9 expires at logical time 3000
- scheduler advances to next timer without sleeping
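The time examples above can be sketched as a harness-owned clock. This is a minimal illustration, not a definitive implementation: `SimClock`, `call_at`, and `advance_to_next_timer` are hypothetical names, and the timer names and due times are borrowed from the examples.

```python
import heapq
from typing import Callable, List, Tuple


class SimClock:
    """Hypothetical harness-owned clock: time moves only when the harness says so."""

    def __init__(self) -> None:
        self._now = 0
        self._seq = 0  # tie-breaker: timers due at the same step fire in creation order
        self._timers: List[Tuple[int, int, str, Callable[[], None]]] = []

    def now(self) -> int:
        return self._now

    def call_at(self, due: int, name: str, callback: Callable[[], None]) -> None:
        heapq.heappush(self._timers, (due, self._seq, name, callback))
        self._seq += 1

    def advance_to_next_timer(self) -> str:
        """Jump straight to the earliest timer and fire it, without sleeping."""
        due, _, name, callback = heapq.heappop(self._timers)
        self._now = max(self._now, due)
        callback()
        return name


clock = SimClock()
fired = []
clock.call_at(3000, "lease-L9", lambda: fired.append(("lease-L9", clock.now())))
clock.call_at(150, "retry-t44", lambda: fired.append(("retry-t44", clock.now())))

clock.advance_to_next_timer()
clock.advance_to_next_timer()
print(fired)  # [('retry-t44', 150), ('lease-L9', 3000)]
```

Because code under test only ever sees `clock.now()`, the same sequence of harness calls produces the same time values on every run.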
The third is scheduling. Scheduling decides which enabled action happens next: deliver a message, run a node callback, fire a timer, crash a process, resume a client, or apply a dependency response. In thread-based systems, this may mean controlling task interleavings. In simulator-based systems, it means choosing the next event from the enabled set.
scheduling examples:
- step 31: deliver msg-40 to n2
- step 32: run n2 append_entries handler
- step 33: fire client retry timer t17
- step 34: crash n1 before durable flush completes
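Choosing the next event from the enabled set can be sketched as follows. The function `run_schedule` and the enabled sets are hypothetical, and the sketch assumes the enabled set at each step is known in advance. In a real simulator it would depend on the actions already taken.

```python
import random
from typing import List, Optional


def run_schedule(enabled_per_step: List[List[str]],
                 seed: Optional[int] = None,
                 recorded: Optional[List[str]] = None) -> List[str]:
    """Choose one enabled action per step and return the decisions made."""
    rng = random.Random(seed)
    decisions = []
    for step, enabled in enumerate(enabled_per_step, start=1):
        options = sorted(enabled)  # fixed order, so iteration order cannot leak in
        if recorded is not None:
            action = recorded[step - 1]  # replay: the record decides
            if action not in options:
                raise RuntimeError(f"step {step}: {action!r} is not enabled")
        else:
            action = rng.choice(options)  # exploration: the seeded RNG decides
        decisions.append(action)
    return decisions


# Hypothetical enabled sets for three consecutive steps.
enabled_sets = [
    ["deliver msg-40 to n2", "fire client retry timer t17"],
    ["run n2 append_entries handler", "fire client retry timer t17"],
    ["fire client retry timer t17", "crash n1"],
]

first = run_schedule(enabled_sets, seed=31)
again = run_schedule(enabled_sets, seed=31)
replayed = run_schedule(enabled_sets, recorded=first)
print(first == again == replayed)  # True
```

The returned decision list is what a replay record would store as its scheduler decisions.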
If the replay owns those choices, a failing run can become a repeatable debugging artifact instead of a story someone hopes to see again.
Replay Record
A replay record is the minimal evidence needed to run the same execution again.
It usually includes:
- test version and protocol version
- initial cluster configuration
- initial durable state or snapshot reference
- random seeds and generated workload decisions
- external inputs and dependency responses
- scheduler decisions by simulated step
- simulated clock values and timer firings
- network deliveries, drops, delays, and partitions
- crash and restart actions
- invariants checked and the first failing assertion
- optional trace ids that link replay steps to human-readable events
A useful record is stable enough to survive normal code movement. It should not depend on memory addresses, thread ids assigned by the operating system, wall-clock timestamps, or generated UUIDs that are not themselves recorded.
One compact replay record might look like this:
replay_id: ledger-2026-06-05-seed-72109
initial_state: snapshots/ledger-three-node-empty-v4
seed: 72109
steps:
  1 input c1 invokes transfer A -> B 10 req-1
  2 action deliver req-1 to n1
  3 action n1 appends log index 40
  4 fault partition n1 | n2,n3
  5 action n1 returns ok for req-1
  6 input c2 invokes transfer A -> C 10 req-2
  7 action deliver req-2 to n1
  8 action n1 appends log index 41
  9 action n1 returns ok for req-2
  10 fault crash n1 before durable flush index 41
  11 fault heal partition
  12 action n2 becomes leader
  13 input c3 invokes read_total
  14 action c3 receives 290
  15 check total_balance_conserved fails
This is not only a log. It is a script the harness can enforce.
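One way to turn such a record into something a harness can enforce is to parse each step into a typed value before running it. The sketch below assumes the step-line layout used in the record above (number, kind, free-text detail). The `Step` class and `parse_steps` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List

KINDS = {"input", "action", "fault", "check"}


@dataclass(frozen=True)
class Step:
    number: int
    kind: str    # one of KINDS
    detail: str  # remainder of the line, interpreted by the harness


def parse_steps(text: str) -> List[Step]:
    """Parse step lines and reject records with gaps or unknown step kinds."""
    steps: List[Step] = []
    for line in text.strip().splitlines():
        number, kind, detail = line.split(maxsplit=2)
        if kind not in KINDS:
            raise ValueError(f"unknown step kind: {kind}")
        if int(number) != len(steps) + 1:
            raise ValueError(f"step numbers must be consecutive, got {number}")
        steps.append(Step(int(number), kind, detail))
    return steps


RECORD = """
1 input c1 invokes transfer A -> B 10 req-1
2 action deliver req-1 to n1
3 action n1 appends log index 40
4 fault partition n1 | n2,n3
5 action n1 returns ok for req-1
"""

steps = parse_steps(RECORD)
faults = [s.detail for s in steps if s.kind == "fault"]
print(faults)  # ['partition n1 | n2,n3']
```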
Worked Example
Suppose a randomized exploration run finds the LedgerService failure. That run was generated from a seed, but the seed alone is not always enough to reproduce it. If a code change adds one extra random draw, the same seed produces a different sequence of choices. If a timer reads the host clock, the schedule can diverge. If an external dependency returns a live response, the replay can diverge.
The harness should therefore check the replay step by step.
expected step 8:
  selected_action: n1 append log index 41 for req-2
  precondition: n1 alive, req-2 delivered, log index 40 exists
  result:
    log[n1][41] = transfer A -> C 10
    trace event ev-32 emitted
actual step 8 during replay:
  selected_action: n1 append log index 41 for req-2
  result matches expected state digest
If the replay diverges, the harness should fail early:
replay divergence at step 8:
  expected enabled action: append log index 41
  actual enabled actions:
    - fire retry timer t19
    - deliver heartbeat msg-88
  likely missing control: timer creation or scheduler decision
That divergence is useful. It tells the team that the replay boundary is incomplete. A replay system should not quietly drift and then report a different failure.
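A fail-early check of this kind might look like the following sketch. `ReplayDivergence` and `require_enabled` are hypothetical names, and the action strings are taken from the divergence report above.

```python
from typing import Iterable


class ReplayDivergence(Exception):
    """Raised when the recorded action is not enabled at the recorded step."""


def require_enabled(step: int, expected: str, enabled: Iterable[str]) -> None:
    actual = sorted(enabled)  # sorted so the report itself is deterministic
    if expected not in actual:
        listing = "\n".join(f"  - {action}" for action in actual)
        raise ReplayDivergence(
            f"replay divergence at step {step}:\n"
            f"expected enabled action: {expected}\n"
            f"actual enabled actions:\n{listing}"
        )


try:
    require_enabled(8, "append log index 41",
                    {"fire retry timer t19", "deliver heartbeat msg-88"})
except ReplayDivergence as err:
    message = str(err)
    print(message)
```

Raising at the first mismatch keeps the report tied to the step where control was lost, rather than to whatever failure the drifted execution eventually produces.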
Levels of Replay
Different systems choose different replay levels.
| Level | What it replays | Strength | Cost or limitation |
|---|---|---|---|
| Workload replay | same client operations and faults | simple and portable | misses internal scheduling races |
| Trace-guided replay | same external inputs plus selected internal events | good debugging signal | requires instrumentation |
| Simulator replay | same deterministic event schedule | strong for modeled systems | only as faithful as the simulator |
| Runtime record/replay | process-level execution choices | high fidelity | expensive and environment-specific |
For this track, the simulator and trace-guided levels are the main tools. They are practical for distributed test harnesses because they can make messages, timers, failures, and random choices explicit.
Runtime record/replay can be powerful for low-level concurrency bugs, but it is not a replacement for a clear distributed model. It may reproduce the process execution without explaining whether the workload, consistency model, or failure assumptions were valid.
Building Replayable Distributed Tests
Replayability is easier to design in from the start than to add after a flaky failure.
Use deterministic clocks. Code under test should ask the harness for time, not the host clock. Sleeps should become scheduled timers.
Use deterministic random sources. The generator, scheduler, simulated network, and id creation should draw from named seeded sources or record every draw.
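As a rough sketch of named seeded sources, each component can derive its own generator from a root seed, so an extra draw in one component does not shift the values seen by another. The helper `named_source` and the component names are assumptions made for illustration.

```python
import random
from typing import List


def named_source(root_seed: int, name: str) -> random.Random:
    """Derive an independent generator for one component from the root seed."""
    return random.Random(f"{root_seed}:{name}")


def network_delays(extra_workload_draw: bool) -> List[int]:
    workload = named_source(72109, "workload")
    network = named_source(72109, "network")
    if extra_workload_draw:
        workload.random()  # simulates a code change that adds one workload draw
    workload.choice(["A", "B", "C"])
    return [network.randint(1, 50) for _ in range(3)]


# The network's draws are unaffected by the extra workload draw.
print(network_delays(False) == network_delays(True))  # True
```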
Use an injected scheduler. Components should not directly deliver messages, run callbacks, or advance time. They should expose enabled actions to the harness.
Use stable state summaries. Replays should compare meaningful state digests such as log term/index, durable key versions, queue contents, and invariant-relevant data.
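A state digest can be as simple as a hash over a canonical serialization of the invariant-relevant fields. The field names below are illustrative assumptions. The point of the sketch is that key order and other incidental details do not affect the digest.

```python
import hashlib
import json


def state_digest(state: dict) -> str:
    """Hash a canonical JSON form so equal states always give equal digests."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


recorded = state_digest({"node": "n1", "log": {"term": 3, "index": 41}, "queue": ["req-2"]})
replayed = state_digest({"queue": ["req-2"], "log": {"index": 41, "term": 3}, "node": "n1"})
drifted = state_digest({"node": "n1", "log": {"term": 3, "index": 40}, "queue": ["req-2"]})

print(recorded == replayed)  # True: same state, different key order
print(recorded == drifted)   # False: log index differs
```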
Use controlled external dependencies. Payment adapters, storage calls, DNS, object stores, and identity services should return recorded or simulated responses during replay.
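During replay, a dependency can be replaced by a stand-in that serves only recorded responses. The class below is a hypothetical sketch: `RecordedPaymentAdapter` and its `authorize` method are not taken from any real payment library.

```python
from typing import Dict, List


class RecordedPaymentAdapter:
    """Hypothetical stand-in that serves recorded responses instead of live calls."""

    def __init__(self, responses: Dict[str, str]) -> None:
        self._responses = dict(responses)
        self.calls: List[str] = []  # request ids seen during replay, in order

    def authorize(self, request_id: str) -> str:
        self.calls.append(request_id)
        if request_id not in self._responses:
            # An unrecorded request means the replay has left the recorded path.
            raise LookupError(f"no recorded response for {request_id}")
        return self._responses[request_id]


adapter = RecordedPaymentAdapter({"req-17": "approved"})
print(adapter.authorize("req-17"))  # approved
```

Failing on an unrecorded request, instead of falling back to a live call, keeps the dependency inside the replay boundary.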
Use divergence checks. Replay should verify that the expected action is enabled and that the resulting state matches the recorded transition closely enough for the bug being debugged.
Common Failure Modes
One mistake is recording inputs but not time. Retry bugs, lease bugs, heartbeat bugs, and cache expiration bugs often depend on exactly when the system believes time advanced.
Another mistake is recording time but not scheduling. If two messages are both deliverable and the replay does not control which one arrives first, the failure can disappear.
A third mistake is trusting a seed without recording the decisions it produced. Seeds are useful, but they become fragile when code changes alter the order of random draws.
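Recording the decisions a seed produced can be sketched with a thin wrapper around the random source. `RecordedRandom` is a hypothetical name. In replay mode the recorded values are returned and the seed is ignored.

```python
import random
from typing import List, Optional, Sequence


class RecordedRandom:
    """Hypothetical random source that logs each draw, or replays a log."""

    def __init__(self, seed: int, replay: Optional[List[str]] = None) -> None:
        self._rng = random.Random(seed)
        self._replay = list(replay) if replay is not None else None
        self.draws: List[str] = []

    def choice(self, options: Sequence[str]) -> str:
        if self._replay is not None:
            value = self._replay.pop(0)  # the recorded value wins over the seed
            if value not in options:
                raise ValueError(f"recorded value {value!r} is not a valid option")
        else:
            value = self._rng.choice(options)
        self.draws.append(value)
        return value


accounts = ["A", "B", "C"]
explore = RecordedRandom(seed=91827)
original = [explore.choice(accounts) for _ in range(4)]

# Replay with a different seed still yields the recorded choices.
replay = RecordedRandom(seed=0, replay=explore.draws)
print([replay.choice(accounts) for _ in range(4)] == original)  # True
```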
A fourth mistake is replaying after a different initial state. Distributed bugs often depend on durable log prefixes, queue contents, membership state, or cached metadata. The initial snapshot is part of the replay.
A fifth mistake is letting replay drift. If the expected action is no longer enabled, the tool should report divergence instead of continuing into a different execution.
Practice
Take a failing trace from the previous lesson and decide what the replay record must contain:
- Which initial state or snapshot is required?
- Which client inputs and fault actions must be replayed?
- Which time values, timers, leases, or deadlines must be controlled?
- Which scheduler decisions must be fixed?
- Which random choices must be recorded as values instead of only as a seed?
- Which state digest would detect replay divergence early?
Then intentionally remove one category. If the failure no longer reproduces, that category belongs inside the replay boundary.
Connections
- Builds on Trace Capture, Causality, and Event Logs, because replay uses those event logs to reproduce the same causal path.
- Prepares for Replaying Production Incidents Without Recreating Production, where the challenge is preserving the useful incident shape while removing unsafe production dependencies.
- Connects to simulation harness design because deterministic replay is a property of the whole harness boundary, not one logging call.
Resources
- [DOC] rr: lightweight recording and deterministic debugging
- [PAPER] FoundationDB: A Distributed Unbundled Transactional Key Value Store
- [DOC] Jepsen
- [DOC] Microsoft CHESS
Key Takeaways
- Deterministic replay controls the inputs, time, and scheduling choices that made a distributed execution happen.
- A replay record should be an enforceable script, not merely a pile of logs.
- Seeds help, but recorded decisions and divergence checks make replay robust when code or instrumentation changes.
- The right replay boundary is a trade-off: broad enough to capture the bug, narrow enough to understand and maintain.