Distributed Testing, Simulation, and Deterministic Replay: Logical Time, Clocks, and Event Ordering

LESSON

Distributed Testing, Simulation, and Deterministic Replay

004 30 min intermediate

Distributed Testing, Simulation, and Deterministic Replay: Logical Time, Clocks, and Event Ordering

Core Insight

A replay harness cannot trust wall-clock time to explain why a distributed bug happened. Two nodes can look at different clocks, messages can arrive after later events, and a retry can be caused by a timeout even when the original operation later commits.

In CheckoutService, imagine a payment confirmation, an inventory reservation, and a client retry all appear in logs with timestamps around 10:15:03. The timestamp alone does not tell you whether the retry caused the duplicate payment, whether the reservation happened before the confirmation, or whether a delayed message arrived after a crash. To replay and debug the run, the harness needs an event order that records causality, not just time-of-day.

Logical time gives the harness a way to talk about "before" and "after" when physical clocks are unreliable or irrelevant. The trade-off is that logical time is less intuitive than wall time: it may not tell you how many milliseconds passed, but it can tell you which event could have influenced another event. That is exactly the distinction deterministic replay needs.

Wall Time Is Not Causal Order

Wall time is the time read from a physical or operating-system clock. It is useful for latency, deadlines, rate limits, and human debugging. It is not a safe proof of causal order in a distributed system.

Several things break the naive story:

Nodes read different clocks.
Clocks can drift, jump, or be corrected.
Logs are written after events are observed locally, not at one global point.
Network delay can make an earlier send arrive after a later send.
Timeout behavior depends on local clock reads and scheduler decisions.

Suppose the harness records:

10:15:03.100 inventory-1: reserve(item-17) accepted
10:15:03.090 order-2: confirm(order-101) started
10:15:03.130 client-A: retry confirm(order-101)
10:15:03.125 payment-1: capture(order-101) accepted

The timestamps are helpful clues, but they do not prove the true order. order-2 may have started before or after inventory-1 accepted the reservation, depending on clock skew. The client retry may have been scheduled by a local timeout before the payment response was delivered. The payment capture may have been caused by the first confirm or by the retry.

If the oracle from lesson 2 checks idempotency, it needs the causality graph behind those observations. If the failure model from lesson 3 includes clock skew or timeout behavior, the harness must decide which clock reads are controlled and which events are ordered only by message delivery, durable writes, or scheduler steps.

Logical Time

Logical time is a numbering scheme that captures event relationships. The numbers are not trying to be real clock readings. They are trying to preserve the rule "if one event caused another, the first event must come before the second."

The simplest example is a Lamport clock. Each process keeps a counter. It increments the counter for local events, attaches the counter to messages, and moves its own counter forward when it receives a message with a larger value.

inventory-1 local reserve accepted      L=7
inventory-1 sends reservation event     L=8
order-2 receives reservation event      L=max(3, 8)+1 = 9
order-2 confirms order                  L=10

Lamport clocks tell you that if event A causally happened before event B, A's timestamp will be smaller than B's. They do not tell you the reverse. If A has a smaller Lamport timestamp than B, A may have caused B, or they may simply be unrelated events that received different numbers.

Vector clocks keep more information. Each process carries a small map of counters, one per process. That lets the system distinguish "A happened before B" from "A and B were concurrent."

inventory-1 accepted reservation:  {inventory: 7, order: 2}
order-2 started confirmation:      {inventory: 4, order: 9}

Neither vector dominates the other, so the events are concurrent from the harness's point of view. That means an oracle should not require one to precede the other unless the service contract or a later message creates that order.

Event Ordering for Replay

A deterministic replay log should record enough ordering evidence to reproduce the run and enough causal evidence to explain it.

Common event types include:

Client request observed by the harness.
Message sent, delivered, dropped, duplicated, or reordered.
Timer scheduled, fired, cancelled, or advanced.
Durable write started and committed.
Crash, restart, and recovery boundary.
Dependency call issued and response delivered.
Client-visible response returned.

The harness can then describe a run as a sequence of scheduler decisions plus causal edges:

e1 client-A -> order-2: confirm(order-101, key=k1)
e2 order-2 -> payment-1: capture(order-101)
e3 timer(client-A retry) scheduled for logical step 12
e4 payment-1 -> order-2: accepted
e5 client-A timeout fires
e6 client-A -> order-3: confirm(order-101, key=k1)
e7 order-2 -> client-A: success

The replay engine may choose to reproduce the exact scheduler sequence. The debugger may use causal edges to ask a different question: could e6 have influenced e7, or was the retry independent of the success response? Those are different uses of order, and both matter.

This is where logical time changes design decisions. Instead of sprinkling calls to the host clock through the service, the harness should route time through controllable APIs. Instead of logging only timestamps, it should record message identifiers, timer identifiers, sender and receiver, durable commit points, and client request identifiers. Those fields give replay a structured history rather than a pile of timestamped text.

Worked Example

The team sees two payment captures for one order. The first log view looks like this:

10:15:03.100 client-A retry confirm(order-101, k1)
10:15:03.101 order-2 capture payment(order-101)
10:15:03.099 order-1 capture payment(order-101)
10:15:03.103 payment-1 accepted capture #1
10:15:03.104 payment-1 accepted capture #2

By wall time, this looks confusing. The retry appears before one capture and after another. The team cannot tell whether the retry caused both captures or whether two order replicas raced independently.

A causal replay log is clearer:

e1 client-A -> order-1: confirm(order-101, k1)
e2 order-1 -> payment-1: capture(order-101, k1)
e3 client-A timeout fires before e2 response delivered
e4 client-A -> order-2: confirm(order-101, k1)
e5 order-2 does not observe in-flight capture for k1
e6 order-2 -> payment-1: capture(order-101, k1)
e7 payment-1 accepts e2
e8 payment-1 accepts e6

Now the failure is specific. The duplicate capture did not come from clock order. It came from missing idempotency coordination across replicas while the client retry was concurrent with the first capture. The oracle can fail the history, the failure model can explain why the timeout was allowed, and the next run can replay the same interleaving.

Common Ordering Mistakes

One mistake is sorting logs by timestamp and treating the result as truth. That can hide exactly the bugs distributed tests are meant to expose. A sorted log is a view, not an ordering proof.

Another mistake is using logical time as if it measured latency. A Lamport timestamp of 20 does not mean twenty milliseconds, and a vector clock does not tell you how long two concurrent branches ran. Use wall time for duration and logical time for causality.

A third mistake is recording timer fires without recording timer scheduling. If a retry fires, the harness should know which operation scheduled it, which clock advanced it, and whether the original response was already deliverable but delayed.

The last mistake is over-ordering concurrent events. If two events are genuinely concurrent under the model, the oracle should allow either order unless the contract says otherwise. Over-ordering creates false failures and makes the test suite brittle.

Practice

Take a short failure history with a client request, a timeout, a retry, a message delay, and a durable commit. Write it twice:

As wall-clock logs sorted by timestamp.
As causal events with message, timer, and commit identifiers.

Then mark which events definitely happened before another event, which events were concurrent, and which ordering cannot be inferred from the evidence. That exercise usually exposes whether the harness has enough data for deterministic replay.

Connections

Builds on Failure Models for Distributed Test Harnesses by making clock and timeout assumptions explicit.
Prepares for Schedule Control and Interleaving Exploration, where the harness starts choosing event orders deliberately rather than only recording them.
Connects to consistency and replication because linearizability, causality, and session guarantees all depend on what order the system can justify.

Resources

[PAPER] Time, Clocks, and the Ordering of Events in a Distributed System
[BOOK] Designing Data-Intensive Applications
[DOC] Jepsen
[BOOK] Distributed Systems, 4th edition

Key Takeaways

Wall-clock timestamps help with latency and debugging, but they do not prove causal order.
Logical time records which events could have influenced other events.
Replay logs need message, timer, commit, and client identifiers, not just sorted timestamps.
Correct oracles should respect concurrency instead of inventing order the system never promised.

← Back to Distributed Testing, Simulation, and Deterministic Replay

← Back to Distributed Systems

← Back to Learning Hub