Distributed Testing, Simulation, and Deterministic Replay: Trace Capture, Causality, and Event Logs

LESSON

Distributed Testing, Simulation, and Deterministic Replay

012 30 min intermediate

Distributed Testing, Simulation, and Deterministic Replay: Trace Capture, Causality, and Event Logs

Core Insight

In LedgerService, a Jepsen-style history can tell you that two transfers returned ok and a later balance read returned 290 instead of 300. That is enough to prove the observed behavior violates the model, but it is not enough to explain the internal path. Did the leader acknowledge before replication? Did a follower apply an old log entry after recovery? Did a retry reuse the same request id or accidentally create a second command?

Trace capture is the internal evidence layer. It records the events that connect a failing external history to the nodes, messages, timers, state transitions, and scheduler choices that produced it. A good trace does not just say "something happened at 12:03." It says what caused the event, what operation it belonged to, what message or timer carried it, and what deterministic step can replay it.

The trade-off is completeness versus signal. Capturing every byte, counter, and log line can make failures impossible to read. Capturing only high-level requests hides the causal link needed for replay. Useful event logs choose the fields that preserve causality and nondeterministic choices without drowning the learner or the debugger.

History, Trace, and Event Log

A history records what clients observed. A trace records what the system did internally. An event log is the durable representation of those trace events.

Those terms overlap in everyday conversation, but separating them helps when testing distributed systems:

Artifact Main question Example
History What did clients observe? c2 transfer A -> C returned ok
Trace What internal path produced it? request r88 appended on n1, not replicated to n2
Event log How is the trace stored for analysis and replay? step=44 node=n1 event=append_log request=r88

The external history checks the claim. The internal trace explains the failure. The event log gives tools a stable format to analyze, shrink, compare, and replay.

If the history says a linearizable transfer system lost money, the trace should help answer where the illegal behavior entered the system.

Causality Beats Wall-Clock Order

Wall-clock timestamps are useful context, but they are weak evidence for distributed causality. Two nodes may have skewed clocks. A delayed log line may be written after the action it describes. A retry may start before the original response arrives, even if the printed timestamps look close enough to confuse the reader.

Causality asks what depended on what.

Important causal links include:

A trace should preserve these links explicitly. One common pattern is to assign stable ids:

operation_id: client-visible operation, such as transfer-884
request_id: one client attempt, such as req-17
message_id: one network message, such as msg-991
timer_id: one scheduled timeout, such as timer-38
event_id: one recorded event, such as ev-2044
parent_event_id: the event that caused this event

That gives the debugger a happens-before skeleton. The event deliver msg-991 depends on send msg-991. The event fire timer-38 depends on schedule timer-38. The event ack req-17 depends on the internal condition the node believed was sufficient for acknowledgement.

What to Capture

A useful trace event usually captures a small stable set of fields:

step: 184
event_id: ev-184
parent_event_id: ev-177
node: n1
component: raft-log
operation_id: transfer-884
request_id: req-17
event: append_log_entry
logical_time: term=9 index=44
message_id: null
state_version: log_hash=7ac2
payload_digest: 42f1

The exact schema depends on the system, but the intent is consistent.

Capture identity fields so related events can be grouped. Capture causal fields so the path can be reconstructed. Capture logical state, not every private object in memory. Capture payload digests when full payloads are too large or sensitive. Capture nondeterministic choices, such as random values, selected scheduler actions, generated ids, and selected network delays.

Avoid relying on accidental host behavior:

The goal is not to record the whole universe. The goal is to record enough to explain and reproduce the relevant causal path.

Worked Example

The failing external history from the previous lesson was:

t1  c1 invoke transfer A -> B 10
t2  nemesis partitions n1 from n2,n3
t3  c2 invoke transfer A -> C 10
t4  c1 ok
t5  c2 ok
t6  nemesis heals
t7  c3 invoke read_total
t8  c3 ok 290

A trace can connect those client-visible events to internal state:

step 21  ev-21  c1 -> n1 receive req-1 transfer A->B 10
step 22  ev-22  n1 append log index 40 for req-1
step 23  ev-23  n1 send append index 40 to n2 msg-40
step 24  ev-24  network delays msg-40 due to partition
step 25  ev-25  n1 returns ok for req-1

step 31  ev-31  c2 -> n1 receive req-2 transfer A->C 10
step 32  ev-32  n1 appends log index 41 for req-2
step 33  ev-33  n1 returns ok for req-2
step 34  ev-34  n1 crashes before durable flush of index 41

step 44  ev-44  n2 becomes leader with log through index 40
step 45  ev-45  n2 applies index 40 only
step 52  ev-52  c3 read_total returns 290

This trace suggests a bug in the acknowledgement boundary: n1 returned success for req-2 before the command was safely durable or replicated according to the protocol's own claim. The external history proves the violation. The trace identifies the suspect mechanism.

Now imagine the trace omitted the durable-flush event, the log index, and the message id. The team might still know a transfer disappeared, but they would not know whether it was a client retry bug, a log truncation bug, a duplicate request bug, or a recovery bug.

Event Logs for Replay

For deterministic replay, an event log should be ordered by simulated step, not by host timestamp. Each step should name the action selected by the scheduler and the consequences that followed.

step 184
selected_action: deliver(msg-991)
precondition: msg-991 deliverable from n1 to n2
events:
- n2 receive append_entries(term=9,index=44)
- n2 write log index 44
- n2 send ack msg-1002
invariants_checked:
- total_balance_conserved
- committed_entries_prefix_ordered

This format serves three jobs:

  1. It explains why the step was legal.
  2. It records what changed.
  3. It lets the replay engine choose the same action again.

Replay logs often need both high-level and low-level entries. High-level events help humans read the failure. Low-level events preserve determinism. A good format lets tools filter by operation id, node, message id, invariant, or state transition.

Trace Design Mistakes

One mistake is confusing logging volume with useful evidence. More log lines can make the causal path harder to see if the events lack ids, parent links, and state versions.

Another mistake is recording only success paths. Distributed bugs often live in timeout, retry, cancellation, crash, recovery, and duplicate delivery paths. Those events need the same trace discipline as ordinary requests.

A third mistake is recording timestamps but not causal links. Timestamp order may be helpful, but send-deliver, schedule-fire, append-ack, and crash-recover links are what make the execution explainable.

A fourth mistake is capturing sensitive or huge payloads when a digest would work. Trace systems should support debugging without turning tests into a privacy or storage problem.

A fifth mistake is letting instrumentation change the schedule. If tracing adds blocking I/O, lock contention, or hidden time dependencies, it can mask the bug it is trying to capture.

Practice

Take one failing history from a simulated distributed test and design the trace schema you would need to debug it:

  1. Which operation ids and request ids should appear?
  2. Which messages and timers need stable ids?
  3. Which state version or log index proves the causal path?
  4. Which nondeterministic choices must be recorded for replay?
  5. Which payloads can be replaced with digests?
  6. Which events should be visible in the human failure report?

Then remove one field from the schema and ask what explanation becomes impossible. That is the quickest way to distinguish essential causal evidence from noisy logging.

Connections

Resources

Key Takeaways

PREVIOUS Distributed Testing, Simulation, and Deterministic Replay: Workload Modeling and Jepsen-Style Histories NEXT Distributed Testing, Simulation, and Deterministic Replay: Deterministic Replay for Inputs, Time, and Scheduling