Time, Clocks, and Causality
Core Insight
Imagine editing a shared note from two devices. Your phone is offline on a train, so you change the title to "Trip Plan." At the same time, your laptop is online at home, and you change the title to "June Itinerary." Both devices later sync. The phone says its edit happened at 10:03:12. The laptop says its edit happened at 10:02:55.
Which edit should win?
The tempting answer is "keep the latest timestamp." But that answer assumes wall-clock time is the same as meaningful order. It is not. The phone clock may be fast. The laptop clock may be slow. More importantly, neither edit had seen the other before acting. From the system's point of view, the edits are independent, even if one clock number is larger.
That is the center of distributed time. Wall clocks are useful for logs, alerts, retention windows, deadlines, and human explanations. They are weaker when the system needs to know whether one event could have influenced another. The stronger idea is causality: one event causally follows another when information could have flowed from the first event to the second.
The design question is not "which timestamp is bigger?" It is "what had this participant seen when it made the decision?" If the system cannot answer that question, it can accidentally erase real work while appearing perfectly deterministic.
Three Kinds Of Time Evidence
Distributed systems use several kinds of time evidence, and each answers a different question.
A wall clock tries to answer, "What time was it for this machine?" Wall clocks are the familiar timestamps in logs and databases. They are useful, but they can drift, jump backward or forward during correction, differ across machines, or reflect when an event was observed rather than when it became meaningful.
A monotonic clock tries to answer, "How much time elapsed on this machine?" It is useful for measuring durations, timeouts, and latency because it should not jump backward like a corrected wall clock. By itself it cannot order events across machines, because a monotonic reading is only meaningful relative to other readings from the same machine.
A logical clock tries to answer, "What events had this participant already seen?" It does not measure physical time. It records ordering evidence created by local sequence and messages.
For the shared note, those questions lead to different decisions:
wall clock:
  phone edit timestamp = 10:03:12
  laptop edit timestamp = 10:02:55
monotonic clock:
  phone sync took 450 ms
  laptop sync took 120 ms
logical evidence:
  phone edit had not seen laptop edit
  laptop edit had not seen phone edit
The wall clock can help explain when users probably acted. The monotonic clock can help tune sync timeouts. The logical evidence decides whether one edit should replace another or whether the system has a conflict.
Happens-Before
The foundation term is happens-before. It names the order created by local sequence and message flow.
Three rules are enough for a working mental model:
- If one event happens earlier in the same process, it happens before the later local event.
- If one event sends a message and another event receives that message, the send happens before the receive.
- If A happens before B, and B happens before C, then A happens before C.
If neither event happens before the other, they are concurrent from the system's point of view. Concurrent does not mean simultaneous in physical time. It means the system has no evidence that either event observed or influenced the other.
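The three rules can be sketched as a Lamport-style logical clock, the counter scheme from the paper listed under Resources. This is a minimal illustration, not code from the lesson: the class name `LamportClock` and its method names are assumptions chosen for readability.

```python
class LamportClock:
    """Minimal Lamport-style logical clock (illustrative names, not a library API)."""

    def __init__(self) -> None:
        self.time = 0

    def local_event(self) -> int:
        # Rule 1: a later local event always gets a larger number.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Sending is itself a local event; its number travels with the message.
        return self.local_event()

    def receive(self, message_time: int) -> int:
        # Rule 2: the receive must be numbered after the send it observed.
        self.time = max(self.time, message_time) + 1
        return self.time


phone, server, laptop = LamportClock(), LamportClock(), LamportClock()

edit = phone.local_event()        # phone edits the title
sync = phone.send()               # phone sends the sync message
stored = server.receive(sync)     # server receives and stores
update = server.send()            # server sends the update
shown = laptop.receive(update)    # laptop displays the new title

# Rule 3 (transitivity) falls out of the numbers: each step is larger than the last.
print(edit, sync, stored, update, shown)
```

One caveat worth keeping in mind: a single counter guarantees that a causally earlier event gets a smaller number, but a smaller number alone does not prove causal order. Two concurrent events still receive numbers that can be compared, so this scheme cannot detect concurrency by itself.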
Here is a causal chain:
phone:
  edit title to "Trip Plan"
  send sync message -------->
server:
  receive sync
  store title "Trip Plan"
  send update -------->
laptop:
  receive update
  display "Trip Plan"
The laptop display happened after the phone edit in the meaningful sense because information flowed through messages. Even if one machine's wall clock is wrong, the message path proves causal order.
Now compare two offline edits:
phone:
  edit title to "Trip Plan"
laptop:
  edit title to "June Itinerary"
no message passed between them before either edit
These edits are concurrent. The system should not pretend one edit intentionally replaced the other. If the product cares about preserving user work, it needs a conflict rule.
Worked Path: From Timestamp Winner To Causal Conflict
Suppose the note service stores a title and a version. Each device can edit offline and later sync.
At the start, both devices have seen the same version:
base version:
  title = "Draft"
  version = {phone: 0, laptop: 0}
The phone edits offline:
phone local state:
  title = "Trip Plan"
  version = {phone: 1, laptop: 0}
The laptop also edits offline:
laptop local state:
  title = "June Itinerary"
  version = {phone: 0, laptop: 1}
When the server later compares the versions, neither one includes the other's history. The phone version has seen one phone event and zero laptop events. The laptop version has seen zero phone events and one laptop event.
{phone: 1, laptop: 0}
{phone: 0, laptop: 1}
comparison:
  phone edit did not include laptop edit
  laptop edit did not include phone edit
result: concurrent conflict
A timestamp-only system might choose the phone edit because its wall clock says 10:03:12. A causal system can say something more honest: both edits changed the same field without seeing each other. The product can keep both versions, ask the user to resolve, or apply a domain-specific merge rule.
Now consider a different path. The phone edit syncs first, and the laptop receives it before editing:
phone edit:
  version = {phone: 1, laptop: 0}
laptop receives phone edit:
  seen version = {phone: 1, laptop: 0}
laptop edits title:
  version = {phone: 1, laptop: 1}
Now the laptop version includes the phone version's history. Replacing the title may be reasonable because the later edit had the earlier edit in view. The decision is no longer based on which wall-clock timestamp is larger. It is based on what each edit had seen.
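Both paths in this walkthrough can be reproduced with two small operations on a version map. The sketch below is an assumption about how a device might maintain its version, not the note service's actual code; the helper names `bump` and `merge_seen` are hypothetical.

```python
def bump(version: dict[str, int], device: str) -> dict[str, int]:
    """Return a new version recording one more local event on `device`."""
    new_version = dict(version)
    new_version[device] = new_version.get(device, 0) + 1
    return new_version


def merge_seen(mine: dict[str, int], received: dict[str, int]) -> dict[str, int]:
    """Record that this device has now seen everything in `received`."""
    devices = mine.keys() | received.keys()
    return {d: max(mine.get(d, 0), received.get(d, 0)) for d in devices}


base = {"phone": 0, "laptop": 0}

# Path 1: both devices edit offline from the same base.
phone_offline = bump(base, "phone")      # {phone: 1, laptop: 0}
laptop_offline = bump(base, "laptop")    # {phone: 0, laptop: 1}

# Path 2: the laptop receives the phone edit before making its own edit.
laptop_seen = merge_seen(base, phone_offline)   # {phone: 1, laptop: 0}
laptop_after = bump(laptop_seen, "laptop")      # {phone: 1, laptop: 1}
```

In the second path the laptop's version carries the phone's counter forward, which is the recorded evidence that the later edit had the earlier one in view.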
Logical Clocks And Version Vectors
A logical clock records ordering evidence rather than physical time. A simple counter is enough when all operations flow through one ordered log. Each new event gets a larger number, and the number tells you order inside that stream.
For multi-device or multi-region systems, a single counter is often not enough, because one number cannot show that two histories diverged. A version vector records what each participant has seen. In the note example:
{phone: 1, laptop: 0}
means "this version includes one phone event and zero laptop events." Another version:
{phone: 1, laptop: 1}
includes everything in the first version plus one laptop event. It is causally later than the first.
The comparison rule is:
- If every counter in version A is less than or equal to the corresponding counter in version B, and at least one is strictly smaller, A happened before B.
- If every counter in B is less than or equal to the corresponding counter in A, and at least one is strictly smaller, B happened before A.
- If each version is ahead of the other in at least one counter, the versions are concurrent.
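The comparison rule translates fairly directly into code. The following is a minimal sketch under two assumptions that the rule itself does not state: a participant missing from a version counts as zero, and two identical versions are reported as "equal". The function name `compare` and its return strings are illustrative.

```python
def compare(a: dict[str, int], b: dict[str, int]) -> str:
    """Classify version vector `a` relative to `b`."""
    participants = a.keys() | b.keys()
    # Assumption: a participant absent from a version has counter 0.
    a_behind = any(a.get(p, 0) < b.get(p, 0) for p in participants)
    b_behind = any(b.get(p, 0) < a.get(p, 0) for p in participants)

    if a_behind and b_behind:
        return "concurrent"   # each version is ahead somewhere
    if a_behind:
        return "before"       # a happened before b
    if b_behind:
        return "after"        # b happened before a
    return "equal"            # same history (case added for completeness)


print(compare({"phone": 1, "laptop": 0}, {"phone": 1, "laptop": 1}))  # before
print(compare({"phone": 1, "laptop": 0}, {"phone": 0, "laptop": 1}))  # concurrent
```

A caller would typically treat "before" and "after" as safe replacement and route "concurrent" to whatever conflict rule the product has chosen.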
Version vectors cost storage, code, and operational care. They are not free. But they protect systems where silently dropping concurrent work would break the product promise.
Where Physical Time Still Matters
Causality does not make wall clocks useless. It only narrows what they can prove.
Wall-clock time is often the right tool for human and operational boundaries. Logs need timestamps so people can reconstruct an incident. Certificates and tokens expire at real times. Backups, retention policies, billing periods, leases, and alert windows all need some idea of physical time. A support person asking "what happened around 10:00?" needs timestamps, not a version vector.
Monotonic time is often the right tool for local waiting. A timeout should usually measure elapsed time on the caller, not compare wall-clock timestamps from different machines. If a machine's wall clock jumps backward during correction, a timeout based on wall-clock time can behave strangely. A monotonic clock avoids that local problem.
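As a rough illustration of a local timeout, the sketch below uses Python's `time.monotonic()`, whose readings are only meaningful as differences on the same machine. The helper `wait_until` and its parameters are hypothetical, written for this example.

```python
import time


def wait_until(condition, timeout_seconds: float, poll_seconds: float = 0.01) -> bool:
    """Poll `condition` until it returns True or the local timeout elapses."""
    # The deadline is computed from the monotonic clock, so a wall-clock
    # correction during the wait cannot shorten or extend it.
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_seconds)
    return False


# Measuring a duration follows the same pattern: subtract two monotonic readings.
start = time.monotonic()
synced = wait_until(lambda: True, timeout_seconds=0.5)
elapsed = time.monotonic() - start
```

The same loop written with `time.time()` would tie the timeout to the wall clock, which is the situation the paragraph above warns about.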
Causal evidence becomes necessary when the system wants to explain dependence. Did this notification use the edited email address? Did this permission check see the revocation? Did this replica overwrite a value it had actually observed, or did it make an independent concurrent edit? Those questions are not answered by "which timestamp is larger." They are answered by "what information had crossed the boundary before the decision?"
The practical rule is simple: use physical time for deadlines, duration, expiration, and human investigation; use causal evidence when replacement, merge, or conflict depends on what an actor had already seen.
Conflict Rules Match The Promise
Once you can distinguish causal replacement from concurrent conflict, conflict handling becomes a product decision instead of a timestamp accident.
Different state wants different rules:
- A shopping cart can often merge concurrent additions.
- A document editor may keep both versions and ask for resolution.
- A counter can merge increments if the operation is designed for that.
- A permission or security setting may require coordination so conflicting changes cannot happen independently.
- A feature flag may prefer consensus because two official versions would be dangerous.
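Two of these rules are small enough to sketch. The cart merge below handles concurrent additions only, and the counter uses the grow-only design often called a G-Counter, where each replica increments only its own slot. Both are simplified assumptions for illustration; the function names and sample data are hypothetical, and removals or decrements would need extra metadata.

```python
def merge_carts(a: set[str], b: set[str]) -> set[str]:
    """Merge concurrent additions by keeping every item either side added."""
    return a | b


def merge_counters(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Merge grow-only counters by taking the larger count per replica."""
    replicas = a.keys() | b.keys()
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in replicas}


def counter_value(counter: dict[str, int]) -> int:
    """The counter's value is the sum of every replica's increments."""
    return sum(counter.values())


# Concurrent cart additions on two devices.
cart = merge_carts({"tent", "stove"}, {"tent", "map"})

# Each device has seen some of the other's increments, but not all.
phone_counter = {"phone": 2, "laptop": 1}
laptop_counter = {"phone": 1, "laptop": 3}
merged = merge_counters(phone_counter, laptop_counter)
total = counter_value(merged)
```

Both merges give the same result regardless of which side is merged into which, which is what makes them safe to apply to concurrent updates.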
The dangerous default is last write wins by wall-clock timestamp. It is simple and sometimes acceptable for low-value state. It is unsafe when the product cannot tolerate silent loss. A fast clock can make an older edit look newer. A slow clock can make a causally later edit look older. A timestamp can create an answer without proving that the answer preserves meaning.
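The fast-clock case can be made concrete with the shared-note timestamps. In this sketch the 30-second skew on the phone is an invented number for illustration, and `Edit`, `hms`, and `last_write_wins` are hypothetical names.

```python
from dataclasses import dataclass


def hms(hours: int, minutes: int, seconds: int) -> int:
    """Convert a time of day to seconds since midnight."""
    return hours * 3600 + minutes * 60 + seconds


@dataclass(frozen=True)
class Edit:
    device: str
    title: str
    true_seconds: int      # when the edit really happened (unknown to the system)
    skew_seconds: int = 0  # how far this device's clock runs ahead (assumed)

    @property
    def reported_seconds(self) -> int:
        # The only value a timestamp-based rule ever gets to see.
        return self.true_seconds + self.skew_seconds


def last_write_wins(a: Edit, b: Edit) -> Edit:
    """Keep whichever edit reports the larger wall-clock timestamp."""
    return a if a.reported_seconds >= b.reported_seconds else b


# Assumed scenario: the phone edited at 10:02:42 but its clock is 30 s fast.
phone = Edit("phone", "Trip Plan", hms(10, 2, 42), skew_seconds=30)
laptop = Edit("laptop", "June Itinerary", hms(10, 2, 55))

winner = last_write_wins(phone, laptop)
print(winner.title)
```

Under that assumed skew the phone reports 10:03:12 and wins, even though its edit was physically earlier, and the laptop's title disappears without any conflict being recorded.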
The trade-off is clear. Wall-clock timestamps are cheap and easy to understand. Causal metadata is heavier, but it lets the system detect the difference between "this update intentionally replaced that one" and "these updates happened independently."
Operational Signals And Limits
Causality is not only an application-level concern. Operational systems need to know when time evidence is trustworthy enough for the decision being made.
Useful signals include:
- clock offset and drift between hosts
- NTP or time-sync health
- events whose timestamps move backward
- conflict rates in replicated state
- size and age of unresolved conflict queues
- version-vector metadata growth
- cases where last-write-wins overwrote user-visible fields
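One of these signals, timestamps that move backward, is simple to check from an event stream. The sketch below assumes events arrive as `(host, wall_clock_seconds)` pairs in the order each host emitted them; that input shape and the function name `backward_timestamps` are assumptions, not a format any particular logging system uses.

```python
def backward_timestamps(
    events: list[tuple[str, float]],
) -> list[tuple[str, float, float]]:
    """Return (host, previous, current) wherever a host's timestamp went backward."""
    last_seen: dict[str, float] = {}
    regressions = []
    for host, timestamp in events:
        previous = last_seen.get(host)
        if previous is not None and timestamp < previous:
            regressions.append((host, previous, timestamp))
        # Compare each event with the one immediately before it on the same host.
        last_seen[host] = timestamp
    return regressions


events = [
    ("host-a", 100.0),
    ("host-b", 99.0),    # a different host being behind is offset, not regression
    ("host-a", 101.5),
    ("host-a", 101.2),   # host-a's own clock stepped backward
    ("host-b", 99.5),
]
found = backward_timestamps(events)
```

A nonzero result does not say which reading was wrong, only that wall-clock order and emission order disagreed on that host.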
Clock health still matters. Deadlines, leases, certificates, retention, and alert windows often rely on physical time. But healthy clocks do not eliminate the need for causal reasoning. Synchronized clocks can make logs easier to read; they still do not prove what one event had seen before it acted.
The limit is complexity. Tracking causality for every tiny piece of state may be unnecessary. The right question is: would silent reordering or silent overwrite break the promise? If yes, carry causal evidence or coordinate the decision. If no, a simpler timestamp or merge policy may be good enough.
Practice Prompt
Pick one kind of state: a profile, shopping cart, document, counter, permission, feature flag, or configuration value. Fill in:
state:
events that must keep causal order:
events that can be concurrent:
metadata that shows what each event had seen:
merge or conflict rule:
where wall-clock time is useful:
where wall-clock time would be unsafe:
operational signal that would reveal bad ordering assumptions:
If the merge rule is only "keep the largest timestamp," ask what happens when two offline edits both matter.
Resources
- [PAPER] Time, Clocks, and the Ordering of Events in a Distributed System
  - Focus: The original happens-before and logical clock model.
- [ARTICLE] Why Logical Clocks are Easy
  - Focus: Practical intuition for logical clocks, causal relationships, and distributed ordering.
- [BOOK] Designing Data-Intensive Applications
  - Focus: Conflict resolution, replication, logical clocks, and the limits of wall-clock ordering.
Key Takeaways
- Wall clocks help with human time, logs, retention, and deadlines, but they do not prove causality.
- Happens-before captures order created by local sequence and message flow.
- Concurrent events are unordered by evidence, not necessarily simultaneous.
- Version vectors help detect whether one version includes another's history or whether two versions conflict.
- Conflict rules should match the product promise instead of blindly trusting the largest timestamp.