Distributed Testing, Simulation, and Deterministic Replay: Trace Capture, Causality, and Event Logs
LESSON
Distributed Testing, Simulation, and Deterministic Replay: Trace Capture, Causality, and Event Logs
Core Insight
In LedgerService, a Jepsen-style history can tell you that two transfers returned ok and a later balance read returned 290 instead of 300. That is enough to prove the observed behavior violates the model, but it is not enough to explain the internal path. Did the leader acknowledge before replication? Did a follower apply an old log entry after recovery? Did a retry reuse the same request id or accidentally create a second command?
Trace capture is the internal evidence layer. It records the events that connect a failing external history to the nodes, messages, timers, state transitions, and scheduler choices that produced it. A good trace does not just say "something happened at 12:03." It says what caused the event, what operation it belonged to, what message or timer carried it, and what deterministic step can replay it.
The trade-off is completeness versus signal. Capturing every byte, counter, and log line can make failures impossible to read. Capturing only high-level requests hides the causal link needed for replay. Useful event logs choose the fields that preserve causality and nondeterministic choices without drowning the learner or the debugger.
History, Trace, and Event Log
A history records what clients observed. A trace records what the system did internally. An event log is the durable representation of those trace events.
Those terms overlap in everyday conversation, but separating them helps when testing distributed systems:
| Artifact | Main question | Example |
|---|---|---|
| History | What did clients observe? | c2 transfer A -> C returned ok |
| Trace | What internal path produced it? | request r88 appended on n1, not replicated to n2 |
| Event log | How is the trace stored for analysis and replay? | step=44 node=n1 event=append_log request=r88 |
The external history checks the claim. The internal trace explains the failure. The event log gives tools a stable format to analyze, shrink, compare, and replay.
If the history says a linearizable transfer system lost money, the trace should help answer where the illegal behavior entered the system.
Causality Beats Wall-Clock Order
Wall-clock timestamps are useful context, but they are weak evidence for distributed causality. Two nodes may have skewed clocks. A delayed log line may be written after the action it describes. A retry may start before the original response arrives, even if the printed timestamps look close enough to confuse the reader.
Causality asks what depended on what.
Important causal links include:
- client invocation to server receive
- request handling to durable write
- log append to acknowledgement
- message send to message delivery
- timer creation to timer firing
- crash to recovery
- random choice to selected branch
- scheduler decision to resulting event
A trace should preserve these links explicitly. One common pattern is to assign stable ids:
operation_id: client-visible operation, such as transfer-884
request_id: one client attempt, such as req-17
message_id: one network message, such as msg-991
timer_id: one scheduled timeout, such as timer-38
event_id: one recorded event, such as ev-2044
parent_event_id: the event that caused this event
That gives the debugger a happens-before skeleton. The event deliver msg-991 depends on send msg-991. The event fire timer-38 depends on schedule timer-38. The event ack req-17 depends on the internal condition the node believed was sufficient for acknowledgement.
What to Capture
A useful trace event usually captures a small stable set of fields:
step: 184
event_id: ev-184
parent_event_id: ev-177
node: n1
component: raft-log
operation_id: transfer-884
request_id: req-17
event: append_log_entry
logical_time: term=9 index=44
message_id: null
state_version: log_hash=7ac2
payload_digest: 42f1
The exact schema depends on the system, but the intent is consistent.
Capture identity fields so related events can be grouped. Capture causal fields so the path can be reconstructed. Capture logical state, not every private object in memory. Capture payload digests when full payloads are too large or sensitive. Capture nondeterministic choices, such as random values, selected scheduler actions, generated ids, and selected network delays.
Avoid relying on accidental host behavior:
- wall-clock time should not be the replay clock
- thread scheduling should not be the replay schedule
- random number generation should be seeded and recorded
- generated ids should be deterministic or logged
- external calls should record inputs and controlled responses
The goal is not to record the whole universe. The goal is to record enough to explain and reproduce the relevant causal path.
Worked Example
The failing external history from the previous lesson was:
t1 c1 invoke transfer A -> B 10
t2 nemesis partitions n1 from n2,n3
t3 c2 invoke transfer A -> C 10
t4 c1 ok
t5 c2 ok
t6 nemesis heals
t7 c3 invoke read_total
t8 c3 ok 290
A trace can connect those client-visible events to internal state:
step 21 ev-21 c1 -> n1 receive req-1 transfer A->B 10
step 22 ev-22 n1 append log index 40 for req-1
step 23 ev-23 n1 send append index 40 to n2 msg-40
step 24 ev-24 network delays msg-40 due to partition
step 25 ev-25 n1 returns ok for req-1
step 31 ev-31 c2 -> n1 receive req-2 transfer A->C 10
step 32 ev-32 n1 appends log index 41 for req-2
step 33 ev-33 n1 returns ok for req-2
step 34 ev-34 n1 crashes before durable flush of index 41
step 44 ev-44 n2 becomes leader with log through index 40
step 45 ev-45 n2 applies index 40 only
step 52 ev-52 c3 read_total returns 290
This trace suggests a bug in the acknowledgement boundary: n1 returned success for req-2 before the command was safely durable or replicated according to the protocol's own claim. The external history proves the violation. The trace identifies the suspect mechanism.
Now imagine the trace omitted the durable-flush event, the log index, and the message id. The team might still know a transfer disappeared, but they would not know whether it was a client retry bug, a log truncation bug, a duplicate request bug, or a recovery bug.
Event Logs for Replay
For deterministic replay, an event log should be ordered by simulated step, not by host timestamp. Each step should name the action selected by the scheduler and the consequences that followed.
step 184
selected_action: deliver(msg-991)
precondition: msg-991 deliverable from n1 to n2
events:
- n2 receive append_entries(term=9,index=44)
- n2 write log index 44
- n2 send ack msg-1002
invariants_checked:
- total_balance_conserved
- committed_entries_prefix_ordered
This format serves three jobs:
- It explains why the step was legal.
- It records what changed.
- It lets the replay engine choose the same action again.
Replay logs often need both high-level and low-level entries. High-level events help humans read the failure. Low-level events preserve determinism. A good format lets tools filter by operation id, node, message id, invariant, or state transition.
Trace Design Mistakes
One mistake is confusing logging volume with useful evidence. More log lines can make the causal path harder to see if the events lack ids, parent links, and state versions.
Another mistake is recording only success paths. Distributed bugs often live in timeout, retry, cancellation, crash, recovery, and duplicate delivery paths. Those events need the same trace discipline as ordinary requests.
A third mistake is recording timestamps but not causal links. Timestamp order may be helpful, but send-deliver, schedule-fire, append-ack, and crash-recover links are what make the execution explainable.
A fourth mistake is capturing sensitive or huge payloads when a digest would work. Trace systems should support debugging without turning tests into a privacy or storage problem.
A fifth mistake is letting instrumentation change the schedule. If tracing adds blocking I/O, lock contention, or hidden time dependencies, it can mask the bug it is trying to capture.
Practice
Take one failing history from a simulated distributed test and design the trace schema you would need to debug it:
- Which operation ids and request ids should appear?
- Which messages and timers need stable ids?
- Which state version or log index proves the causal path?
- Which nondeterministic choices must be recorded for replay?
- Which payloads can be replaced with digests?
- Which events should be visible in the human failure report?
Then remove one field from the schema and ask what explanation becomes impossible. That is the quickest way to distinguish essential causal evidence from noisy logging.
Connections
- Builds on Workload Modeling and Jepsen-Style Histories, because the external history proves the violation while the internal trace explains it.
- Prepares for Deterministic Replay for Inputs, Time, and Scheduling, where logged choices become the inputs to a replay engine.
- Connects to observability because tracing is useful only when it preserves causal structure, not merely when it emits more data.
Resources
- [PAPER] Time, Clocks, and the Ordering of Events in a Distributed System
- [PAPER] Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
- [DOC] OpenTelemetry Traces
- [BOOK] Designing Data-Intensive Applications
Key Takeaways
- Histories record what clients observed; traces record the internal causal path; event logs store that evidence for tools and replay.
- Useful trace capture preserves ids, parent links, logical state, nondeterministic choices, and scheduler decisions.
- Wall-clock order is not enough for distributed debugging; causal links such as send-deliver and schedule-fire are the important evidence.
- Deterministic replay depends on event logs that capture both the selected action and the state-changing consequences of that action.
← Back to Distributed Testing, Simulation, and Deterministic Replay