Time, Clocks, and Causality

LESSON

Distributed Systems Foundations

Core Insight

Imagine editing a shared note from two devices. On a train, with your phone offline, you change the title to "Trip Plan." Around the same time, on your laptop at home, you change the title to "June Itinerary." Both devices later sync. The phone says its edit happened at 10:03:12. The laptop says its edit happened at 10:02:55.

Which edit should win?

The tempting answer is "keep the latest timestamp." But that answer assumes wall-clock time is the same as meaningful order. It is not. The phone clock may be fast. The laptop clock may be slow. More importantly, neither edit had seen the other before acting. From the system's point of view, the edits are independent, even if one clock number is larger.

That is the center of distributed time. Wall clocks are useful for logs, alerts, retention windows, deadlines, and human explanations. They are weaker when the system needs to know whether one event could have influenced another. The stronger idea is causality: one event causally follows another when information could have flowed from the first event to the second.

The design question is not "which timestamp is bigger?" It is "what had this participant seen when it made the decision?" If the system cannot answer that question, it can accidentally erase real work while appearing perfectly deterministic.

Three Kinds Of Time Evidence

Distributed systems use several kinds of time evidence, and each answers a different question.

A wall clock tries to answer, "What time was it for this machine?" Wall clocks are the familiar timestamps in logs and databases. They are useful, but they can drift, jump backward or forward during correction, differ across machines, or reflect when an event was observed rather than when it became meaningful.

A monotonic clock tries to answer, "How much time elapsed on this machine?" It is useful for measuring durations, timeouts, and latency because, unlike a wall clock, it does not jump backward when the system time is corrected. By itself, it cannot compare events across machines.

A logical clock tries to answer, "What events had this participant already seen?" It does not measure physical time. It records ordering evidence created by local sequence and messages.

For the shared note, those questions lead to different decisions:

wall clock:
  phone edit timestamp = 10:03:12
  laptop edit timestamp = 10:02:55

monotonic clock:
  phone sync took 450 ms
  laptop sync took 120 ms

logical evidence:
  phone edit had not seen laptop edit
  laptop edit had not seen phone edit

The wall clock can help explain when users probably acted. The monotonic clock can help tune sync timeouts. The logical evidence decides whether one edit should replace another or whether the system has a conflict.

Happens-Before

The foundation term is happens-before. It names the order created by local sequence and message flow.

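Three rules are enough for a working mental model:

1. Local sequence: if one participant performs event A and then event B, A happens before B.
2. Message flow: sending a message happens before receiving that message.
3. Transitivity: if A happens before B and B happens before C, then A happens before C.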

If neither event happens before the other, they are concurrent from the system's point of view. Concurrent does not mean simultaneous in physical time. It means the system has no evidence that either event observed or influenced the other.

Here is a causal chain:

phone:
  edit title to "Trip Plan"
  send sync message -------->

server:
                         receive sync
                         store title "Trip Plan"
                         send update -------->

laptop:
                                           receive update
                                           display "Trip Plan"

The laptop display happened after the phone edit in the meaningful sense because information flowed through messages. Even if one machine's wall clock is wrong, the message path proves causal order.

Now compare two edits made before any sync:

phone:
  edit title to "Trip Plan"

laptop:
  edit title to "June Itinerary"

no message passed between them before either edit

These edits are concurrent. The system should not pretend one edit intentionally replaced the other. If the product cares about preserving user work, it needs a conflict rule.
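The two diagrams can be checked mechanically. The following is a minimal sketch, assuming the usual happens-before rules (local order, send before receive, transitivity) and treating them as edges in a graph. The event names and the helper functions `build_graph`, `happens_before`, and `concurrent` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def build_graph(local_sequences, messages):
    """Return adjacency sets holding the direct happens-before edges."""
    edges = defaultdict(set)
    # Local order: consecutive events on one participant are ordered.
    for events in local_sequences.values():
        for earlier, later in zip(events, events[1:]):
            edges[earlier].add(later)
    # Message flow: a send happens before the matching receive.
    for send, receive in messages:
        edges[send].add(receive)
    return edges


def happens_before(edges, a, b):
    """Transitivity: a happens before b if b is reachable from a."""
    stack, seen = list(edges[a]), set()
    while stack:
        event = stack.pop()
        if event == b:
            return True
        if event not in seen:
            seen.add(event)
            stack.extend(edges[event])
    return False


def concurrent(edges, a, b):
    """Two distinct events are concurrent if neither reaches the other."""
    return (a != b
            and not happens_before(edges, a, b)
            and not happens_before(edges, b, a))


# The causal chain: phone -> server -> laptop.
chain = build_graph(
    {
        "phone": ["phone.edit", "phone.send"],
        "server": ["server.receive", "server.store", "server.send"],
        "laptop": ["laptop.receive", "laptop.display"],
    },
    [("phone.send", "server.receive"), ("server.send", "laptop.receive")],
)

# The two edits with no message between them.
isolated = build_graph(
    {"phone": ["phone.edit"], "laptop": ["laptop.edit"]},
    [],
)

chain_ordered = happens_before(chain, "phone.edit", "laptop.display")
edits_concurrent = concurrent(isolated, "phone.edit", "laptop.edit")
```

Here `chain_ordered` is true because a message path connects the phone edit to the laptop display, and `edits_concurrent` is true because no path exists in either direction. Real systems rarely store the full event graph; compact metadata such as version vectors carries the same evidence.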

Worked Path: From Timestamp Winner To Causal Conflict

Suppose the note service stores a title and a version. Each device can edit offline and later sync.

At the start, both devices have seen the same version:

base version:
  title = "Draft"
  version = {phone: 0, laptop: 0}

The phone edits offline:

phone local state:
  title = "Trip Plan"
  version = {phone: 1, laptop: 0}

The laptop also edits offline:

laptop local state:
  title = "June Itinerary"
  version = {phone: 0, laptop: 1}

When the server later compares the versions, neither one includes the other's history. The phone version has seen one phone event and zero laptop events. The laptop version has seen zero phone events and one laptop event.

{phone: 1, laptop: 0}
{phone: 0, laptop: 1}

comparison:
  phone edit did not include laptop edit
  laptop edit did not include phone edit
  result: concurrent conflict

A timestamp-only system might choose the phone edit because its wall clock says 10:03:12. A causal system can say something more honest: both edits changed the same field without seeing each other. The product can keep both versions, ask the user to resolve, or apply a domain-specific merge rule.

Now consider a different path. The phone edit syncs first, and the laptop receives it before editing:

phone edit:
  version = {phone: 1, laptop: 0}

laptop receives phone edit:
  seen version = {phone: 1, laptop: 0}

laptop edits title:
  version = {phone: 1, laptop: 1}

Now the laptop version includes the phone version's history. Replacing the title may be reasonable because the later edit had the earlier edit in view. The decision is no longer based on which wall-clock timestamp is larger. It is based on what each edit had seen.
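Both paths can be replayed in a few lines. This is a sketch under stated assumptions, not the note service's actual design: the `Device` class and its `edit` and `receive` methods are hypothetical, and `receive` deliberately skips conflict checking because in this walkthrough the incoming version already includes everything the receiver has seen.

```python
class Device:
    """Hypothetical device that tracks a title and a version vector."""

    def __init__(self, name, title, version):
        self.name = name
        self.title = title
        self.version = dict(version)  # copy so devices do not share state

    def edit(self, new_title):
        # A local edit adds one event to this device's own counter.
        self.version[self.name] = self.version.get(self.name, 0) + 1
        self.title = new_title

    def receive(self, title, version):
        # Merge by taking the larger counter for every participant.
        # Assumption: the incoming version includes local history,
        # so adopting the incoming title cannot drop local work.
        for participant, count in version.items():
            self.version[participant] = max(
                self.version.get(participant, 0), count
            )
        self.title = title


base = {"phone": 0, "laptop": 0}

# Path 1: both devices edit without syncing first.
phone = Device("phone", "Draft", base)
laptop = Device("laptop", "Draft", base)
phone.edit("Trip Plan")
laptop.edit("June Itinerary")

# Path 2: the laptop receives the phone edit before editing.
phone2 = Device("phone", "Draft", base)
laptop2 = Device("laptop", "Draft", base)
phone2.edit("Trip Plan")
laptop2.receive(phone2.title, phone2.version)
laptop2.edit("June Itinerary")
```

After path 1 the two versions are `{phone: 1, laptop: 0}` and `{phone: 0, laptop: 1}`, so neither contains the other. After path 2 the laptop holds `{phone: 1, laptop: 1}`, which contains the phone's history.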

Logical Clocks And Version Vectors

A logical clock records ordering evidence rather than physical time. A simple counter is enough when all operations flow through one ordered log. Each new event gets a larger number, and the number tells you order inside that stream.
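One well-known single-counter scheme is the Lamport clock, which the lesson does not name but which fits this description. The sketch below assumes the standard rules: increment on every local event, and on receive jump past the sender's counter. The class and variable names are illustrative.

```python
class LamportClock:
    """Single logical counter, one instance per participant."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Every local event, including a send, advances the counter.
        self.time += 1
        return self.time

    def receive(self, sender_time):
        # A receive must be numbered after the send that caused it.
        self.time = max(self.time, sender_time) + 1
        return self.time


phone, server, laptop = LamportClock(), LamportClock(), LamportClock()

phone_edit = phone.tick()                 # phone edits the title
phone_send = phone.tick()                 # phone sends the sync message
server_recv = server.receive(phone_send)  # server receives it

laptop_edit = laptop.tick()               # unrelated edit on the laptop
```

The causal chain gets increasing numbers (`phone_edit < phone_send < server_recv`). The unrelated laptop edit gets the same number as the phone edit, which shows the limit of one counter: the numbers respect causal order, but comparing two numbers cannot reveal that two events were concurrent.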

For multi-device or multi-region systems, a single counter is often not enough. A version vector keeps one counter per participant and records how many of that participant's events this version has seen. In the note example:

{phone: 1, laptop: 0}

means "this version includes one phone event and zero laptop events." Another version:

{phone: 1, laptop: 1}

includes everything in the first version plus one laptop event. It is causally later than the first.

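The comparison rule is:

1. If every counter in A is less than or equal to the matching counter in B, and at least one is smaller, A is causally earlier than B.
2. If every counter matches, A and B carry the same history.
3. If A is ahead on one counter and B is ahead on another, the versions are concurrent.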

Version vectors cost storage, code, and operational care. They are not free. But they protect systems where silently dropping concurrent work would break the product promise.
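A common way to compare two version vectors is counter by counter. The sketch below assumes vectors are plain dictionaries and that a missing participant counts as zero; the function name `compare` and its string results are hypothetical.

```python
def compare(a, b):
    """Classify version vector a relative to version vector b."""
    participants = set(a) | set(b)
    # A missing counter is treated as zero events seen.
    a_within_b = all(a.get(p, 0) <= b.get(p, 0) for p in participants)
    b_within_a = all(b.get(p, 0) <= a.get(p, 0) for p in participants)

    if a_within_b and b_within_a:
        return "equal"
    if a_within_b:
        return "before"      # b has seen everything a has, and more
    if b_within_a:
        return "after"       # a has seen everything b has, and more
    return "concurrent"      # each side has seen something the other has not


phone_only = {"phone": 1, "laptop": 0}
laptop_only = {"phone": 0, "laptop": 1}
both = {"phone": 1, "laptop": 1}

offline_result = compare(phone_only, laptop_only)
synced_result = compare(phone_only, both)
```

For the note example, `offline_result` is `"concurrent"` and `synced_result` is `"before"`, matching the two paths in the worked example.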

Where Physical Time Still Matters

Causality does not make wall clocks useless. It only narrows what they can prove.

Wall-clock time is often the right tool for human and operational boundaries. Logs need timestamps so people can reconstruct an incident. Certificates and tokens expire at real times. Backups, retention policies, billing periods, leases, and alert windows all need some idea of physical time. A support person asking "what happened around 10:00?" needs timestamps, not a version vector.

Monotonic time is often the right tool for local waiting. A timeout should usually measure elapsed time on the caller, not compare wall-clock timestamps from different machines. If a machine's wall clock jumps backward during correction, a timeout based on wall-clock time can behave strangely. A monotonic clock avoids that local problem.
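As a small illustration of local waiting, the sketch below uses Python's `time.monotonic()` to enforce a deadline. The helper `wait_until` and its parameters are hypothetical; only the standard-library calls are real.

```python
import time


def wait_until(condition, timeout_s, poll_s=0.01):
    """Poll condition() until it is true or timeout_s elapses.

    The deadline uses time.monotonic(), so a wall-clock correction
    during the wait cannot shorten or extend the timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_s, remaining))


ready_at_once = wait_until(lambda: True, timeout_s=0.05)
never_ready = wait_until(lambda: False, timeout_s=0.05)
```

The same loop written with `time.time()` would depend on the wall clock staying stable for the whole wait, which is the local problem a monotonic clock avoids.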

Causal evidence becomes necessary when the system wants to explain dependence. Did this notification use the edited email address? Did this permission check see the revocation? Did this replica overwrite a value it had actually observed, or did it make an independent concurrent edit? Those questions are not answered by "which timestamp is larger." They are answered by "what information had crossed the boundary before the decision?"

The practical rule is simple: use physical time for deadlines, duration, expiration, and human investigation; use causal evidence when replacement, merge, or conflict depends on what an actor had already seen.

Conflict Rules Match The Promise

Once you can distinguish causal replacement from concurrent conflict, conflict handling becomes a product decision instead of a timestamp accident.

Different state wants different rules:

The dangerous default is last write wins by wall-clock timestamp. It is simple and sometimes acceptable for low-value state. It is unsafe when the product cannot tolerate silent loss. A fast clock can make an older edit look newer. A slow clock can make a causally later edit look older. A timestamp can create an answer without proving that the answer preserves meaning.

The trade-off is clear. Wall-clock timestamps are cheap and easy to understand. Causal metadata is heavier, but it lets the system detect the difference between "this update intentionally replaced that one" and "these updates happened independently."
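The difference shows up when both policies run on the same writes. This is a minimal sketch, assuming each write carries a device-reported wall-clock string and a version vector; the `Write` type and the functions `last_write_wins` and `causal_resolve` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Write:
    value: str
    wall_clock: str  # "HH:MM:SS" as reported by the device, same day
    version: dict    # version vector: participant -> events seen


def includes(newer, older):
    """True if version vector newer has seen everything in older."""
    participants = set(newer) | set(older)
    return all(newer.get(p, 0) >= older.get(p, 0) for p in participants)


def last_write_wins(a, b):
    # Trusts device clocks: the larger timestamp replaces the other.
    return a if a.wall_clock >= b.wall_clock else b


def causal_resolve(a, b):
    # Replace only when one write had the other in view.
    if includes(a.version, b.version):
        return [a]
    if includes(b.version, a.version):
        return [b]
    return [a, b]  # concurrent: keep both for a merge or a user decision


# Case 1: independent edits, as in the opening example.
phone = Write("Trip Plan", "10:03:12", {"phone": 1, "laptop": 0})
laptop = Write("June Itinerary", "10:02:55", {"phone": 0, "laptop": 1})
lww_concurrent = last_write_wins(phone, laptop)
causal_concurrent = causal_resolve(phone, laptop)

# Case 2: the laptop saw the phone edit first, but its clock runs slow.
laptop_later = Write("June Itinerary", "10:02:55", {"phone": 1, "laptop": 1})
lww_skewed = last_write_wins(phone, laptop_later)
causal_skewed = causal_resolve(phone, laptop_later)
```

In case 1, last write wins keeps "Trip Plan" and silently drops the laptop edit, while the causal rule keeps both. In case 2, last write wins still keeps "Trip Plan" even though the laptop edit was made with it in view; the causal rule keeps "June Itinerary".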

Operational Signals And Limits

Causality is not only an application-level concern. Operational systems need to know when time evidence is trustworthy enough for the decision being made.

Useful signals include:

Clock health still matters. Deadlines, leases, certificates, retention, and alert windows often rely on physical time. But healthy clocks do not eliminate the need for causal reasoning. Synchronized clocks can make logs easier to read; they still do not prove what one event had seen before it acted.

The limit is complexity. Tracking causality for every tiny piece of state may be unnecessary. The right question is: would silent reordering or silent overwrite break the promise? If yes, carry causal evidence or coordinate the decision. If no, a simpler timestamp or merge policy may be good enough.

Practice Prompt

Pick one kind of state: a profile, shopping cart, document, counter, permission, feature flag, or configuration value. Fill in:

state:
events that must keep causal order:
events that can be concurrent:
metadata that shows what each event had seen:
merge or conflict rule:
where wall-clock time is useful:
where wall-clock time would be unsafe:
operational signal that would reveal bad ordering assumptions:

If the merge rule is only "keep the largest timestamp," ask what happens when two offline edits both matter.
