Clocks, Leases, and Safe Reads

LESSON

016 30 min advanced

The core idea: Lease-based reads are only safe when the system can prove that no earlier leader could still believe it has authority, and that proof depends on bounded clock uncertainty rather than on optimism.

Core Insight

Harbor Point's reservation service has a familiar pressure: traders want "remaining capacity" reads to feel local and instant, but the value is only useful if it is current. After shard 184 moves leadership from ny-db-3 to md-db-2, the write side may be settled by consensus. The read side still has a sharper question: can Madrid answer from local memory immediately, or might the old New York leader still believe its read lease is valid for a few more milliseconds?

That is where clock uncertainty stops being an infrastructure footnote and becomes part of the correctness contract. A lease is a promise that one replica may act as read authority until an expiration time. But no machine observes real time directly. Each machine has a local clock with bounded error, so "it is 09:17:01.200" really means "real time is somewhere inside this interval."

If two leaders use raw wall-clock timestamps without respecting that interval, the old leader can keep serving reads after the new leader has started serving them too. That overlap is enough to break linearizable reads even if the election was otherwise correct. Harbor Point cares because shard 184 backs issuer reservation limits during market open. A stale read is not a cosmetic dashboard bug; it can make a desk act on capacity that has already been consumed.

The useful correction is that a fast read is not automatically safe just because the write path is settled by consensus. Fast reads are a privilege earned by proof. Lease-based reads let a leader answer from local state only while the system can conservatively prove that no earlier leader could still be the read authority.

Time Is An Interval

Harbor Point runs a bounded-time service beside its replication layer. Each replica can ask for the current time and receive an interval rather than a single scalar:

now = [earliest, latest]

When synchronization is healthy, the interval might be only a few milliseconds wide. During a time-service incident, a VM pause, or an overloaded host, the interval widens because the system is less certain about where real time actually is. Suppose md-db-2 sees:

09:17:01.120 <= real time <= 09:17:01.127

That means it is unsafe to reason from the midpoint alone. The clock is not saying "real time is exactly 09:17:01.123." It is saying real time could already be as late as 09:17:01.127, or as early as 09:17:01.120. Lease logic has to use the pessimistic edge of the interval for each decision.

Before New York stalled, ny-db-3 held a leader lease for shard 184 until real-time expiration 09:17:01.150. Consensus can elect md-db-2 as the new term-42 writer, but lease-based reads are a different claim. They say not only "Madrid is the elected leader" but also "no previous leader could still legitimately answer this read as current." To make that statement safe, Harbor Point uses the uncertainty interval in two conservative ways:

old leader may serve reads only while latest < old_lease_expiry
new leader may start lease reads only when earliest > old_lease_expiry

The asymmetry is the point. The old leader stops early because its clock might be slow. The new leader starts late because its clock might be fast. Those two rules create a guard band with no overlap. There may be a brief period where neither replica may use the lease fast path, but a short gap is much safer than overlapping authority.
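The two rules above can be sketched in Python. This is a minimal illustration under stated assumptions, not Harbor Point's implementation: `TimeInterval`, `old_leader_may_serve`, and `new_leader_may_start` are hypothetical names, times are written as milliseconds past 09:17:01.000 for readability, and the `straddling` and `after` readings are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    """Bounded-time reading: real time lies within [earliest, latest]."""
    earliest: int  # ms past 09:17:01.000 (illustrative unit)
    latest: int


def old_leader_may_serve(now: TimeInterval, old_lease_expiry: int) -> bool:
    # Pessimistic for the old leader: assume real time is as late as possible.
    return now.latest < old_lease_expiry


def new_leader_may_start(now: TimeInterval, old_lease_expiry: int) -> bool:
    # Pessimistic for the successor: assume real time is as early as possible.
    return now.earliest > old_lease_expiry


T_EXP = 150  # old lease expiry, 09:17:01.150

readings = [
    ("before", TimeInterval(120, 127)),      # the interval quoted in the text
    ("straddling", TimeInterval(146, 153)),  # hypothetical: interval contains T_EXP
    ("after", TimeInterval(151, 158)),       # hypothetical: interval entirely past T_EXP
]

for name, now in readings:
    print(name, old_leader_may_serve(now, T_EXP), new_leader_may_start(now, T_EXP))
# before True False
# straddling False False
# after False True
```

The `straddling` row is the guard band: neither predicate holds, so neither replica may use the lease fast path. There is no reading for which both predicates return true.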

The guard band is the price of keeping the read contract honest:

real time ------------------------------>

old lease expiry at Texp = 09:17:01.150

ny-db-3 can still prove lease validity only if:
  TT.now().latest < Texp

md-db-2 can start lease reads only if:
  TT.now().earliest > Texp

     safe for old          guard band           safe for new
-------------------|--------------------|----------------------->
    latest < Texp      uncertainty gap      earliest > Texp

This is why teams talk about clock uncertainty instead of clock skew alone. Skew is the physical reality; uncertainty is the modeled bound that the software uses to stay safe in spite of that reality.

How A Lease Makes Reads Local

Harbor Point's replication layer still uses consensus to decide who may append writes for term 42. The lease does not replace consensus. It converts recent quorum agreement into a fast local read path.

Every 100 milliseconds, md-db-2 sends heartbeats to md-db-4 and, when it is reachable, to ny-db-3. Heartbeat acknowledgments from a quorum mean two things at once: the quorum still recognizes term 42, and the acknowledging replicas will not grant a conflicting read lease before the current lease window closes. With that proof in hand, the leader can answer many "current remaining capacity" reads from local state without appending a no-op or round-tripping to a quorum for every request.

In practice, Harbor Point does not treat the full configured lease duration as usable. It subtracts a safety margin for clock uncertainty and message delay. A simplified model looks like this:

configured lease duration: 300 ms
worst-case clock uncertainty: 8 ms
network / processing reserve: 12 ms
usable local-read window after quorum renewal: 280 ms

The exact arithmetic varies by implementation, but the principle is stable: the lease window consumed by the application is shorter than the window granted by the protocol. That difference is not waste. It is the margin that prevents a leader from reading one step beyond what the rest of the cluster can safely justify.
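The simplified model above reduces to one subtraction. The sketch below assumes the purely subtractive model from the table; the function name `usable_read_window_ms` and the clamp to zero are illustrative choices, not something the protocol prescribes.

```python
def usable_read_window_ms(lease_ms: int, uncertainty_ms: int, reserve_ms: int) -> int:
    """Usable local-read window under the simplified subtractive model.

    Clamped at zero (an assumption of this sketch): if the margins consume
    the whole lease, the fast path is simply unavailable.
    """
    return max(0, lease_ms - uncertainty_ms - reserve_ms)


print(usable_read_window_ms(300, 8, 12))  # 280
print(usable_read_window_ms(15, 8, 12))   # 0 (hypothetical lease shorter than its margins)
```

A result of zero is the signal to use a quorum-confirmed read rather than the lease.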

For shard 184, the sequence after leadership changes is:

09:17:01.091  md-db-2 becomes active leader for term 42
09:17:01.130  quorum heartbeat acknowledgments arrive
09:17:01.130  lease granted until 09:17:01.430
09:17:01.138  uncertainty-adjusted safe window opens for local reads
09:17:01.418  uncertainty-adjusted safe window closes

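Between 09:17:01.138 and 09:17:01.418, the new lease's own margins allow md-db-2 to serve a linearizable read of "remaining reservation capacity" from local state, because the lease proof says no competing leader can exist in that interval. The handoff rule from the previous section still applies on top of this window: the fast path cannot open until md-db-2's earliest bound has passed the old lease expiry of 09:17:01.150, so right after a handoff the effective opening can be later than the 09:17:01.138 shown in the timeline. Outside the window, the leader has options, but "pretend the lease is probably still good" is not one of them. It must renew the lease or switch to a slower read protocol that confirms leadership again.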

This is also why safe reads are not identical to "reads from the leader." A leader without a current proof is just a replica that was probably leader a moment ago. Production systems need a stronger statement than probability.
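The shard 184 timeline can be reproduced with a small window calculation. This sketch rests on assumptions read off the timeline rather than stated by it: the lease is counted from the moment quorum acknowledgments arrive, the 8 ms uncertainty is applied at the opening edge, and the 12 ms reserve at the closing edge. It models only the new lease's own margins, not any additional wait for a previous leader's lease. `SafeReadWindow`, `safe_read_window`, and `may_read_locally` are hypothetical names, and times are milliseconds past 09:17:01.000.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafeReadWindow:
    opens_ms: int   # first instant the lease fast path may be used
    closes_ms: int  # first instant it may no longer be used


def safe_read_window(ack_ms: int, lease_ms: int = 300,
                     uncertainty_ms: int = 8, reserve_ms: int = 12) -> SafeReadWindow:
    """Derive the uncertainty-adjusted window from the quorum-ack time."""
    lease_expiry_ms = ack_ms + lease_ms
    return SafeReadWindow(ack_ms + uncertainty_ms, lease_expiry_ms - reserve_ms)


def may_read_locally(window: SafeReadWindow, local_now_ms: int) -> bool:
    """True only while the leader still holds a current lease proof."""
    return window.opens_ms <= local_now_ms < window.closes_ms


window = safe_read_window(ack_ms=130)  # acks arrive at 09:17:01.130
print(window)                          # SafeReadWindow(opens_ms=138, closes_ms=418)
print(may_read_locally(window, 200))   # True
print(may_read_locally(window, 420))   # False
```

When `may_read_locally` returns false, the leader is in the "probably still leader" state described above and has to renew the lease or take the slower path.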

When The Fast Path Should Close

Suppose Harbor Point's time service degrades and the uncertainty interval widens from 8 milliseconds to 70 milliseconds after a synchronization alarm. Nothing about the replicated log has become corrupt. md-db-2 may still be the legitimate writer. But the usable portion of each lease window shrinks sharply because the guard band around expiration gets larger. The system sees this in metrics before users see correctness problems:

clock_uncertainty_ms jumps above the normal envelope.
lease_read_margin_ms approaches zero.
read_path_fallback_total starts increasing because more requests take the quorum-confirmed path.

The trade-off is direct: give up latency before giving up correctness. Harbor Point falls back to a ReadIndex-style path, where the leader confirms with a quorum that it is still current and then serves the read from local state at a verified log boundary. That adds a round trip, so p99 read latency may rise from 3 milliseconds to 18 milliseconds, but the semantics remain sound. For a market-facing reservation system, a few extra milliseconds are cheaper than one wrong capacity decision.
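A ReadIndex-style fallback of the kind described above might look like the following. It is a sketch under assumptions, not Harbor Point's code: the quorum check, the apply wait, and the local read are passed in as hypothetical callables so the control flow can be shown without a real replication layer, and the sample values (commit index 57, capacity 1200) are invented.

```python
from typing import Callable


class NotLeaderError(Exception):
    """Raised when a quorum no longer confirms this replica as leader."""


def read_with_fallback(
    lease_is_provably_valid: bool,
    commit_index: int,
    confirm_leadership_with_quorum: Callable[[], bool],
    wait_until_applied: Callable[[int], None],
    read_local_state: Callable[[], int],
) -> tuple[int, str]:
    """Return (value, path), where path is "lease" or "read_index"."""
    if lease_is_provably_valid:
        # Fast path: the lease proof is current, so local state is enough.
        return read_local_state(), "lease"

    # Slow path: pin a log boundary, then prove leadership is still current.
    read_index = commit_index
    if not confirm_leadership_with_quorum():
        raise NotLeaderError("quorum did not confirm leadership")
    # Serve only after local state has applied everything up to the boundary.
    wait_until_applied(read_index)
    return read_local_state(), "read_index"


applied_up_to = []  # records the boundary the stub was asked to wait for
value, path = read_with_fallback(
    lease_is_provably_valid=False,
    commit_index=57,
    confirm_leadership_with_quorum=lambda: True,
    wait_until_applied=applied_up_to.append,
    read_local_state=lambda: 1200,
)
print(value, path, applied_up_to)  # 1200 read_index [57]
```

The design choice worth noticing is that the lease check never degrades into a guess: an invalid lease always routes through quorum confirmation, and a failed confirmation raises instead of returning possibly stale data.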

This fallback logic also explains why lease-based reads and failover tuning cannot be reasoned about in isolation. If follower acknowledgments slow down because a replication link is saturated, the leader renews its lease later and the safe local-read window collapses more often. The system is still correct, but it spends more time on the expensive read path. That is the bridge to flow control: backpressure is not only a write-side concern when lease renewal depends on timely quorum communication.

Operational Failure Modes

Read latency spikes after a clock synchronization alert, even though write leadership never changed. The uncertainty interval widened, so the lease fast path lost most or all of its safe window. Reads are falling back to quorum confirmation. Alert on time uncertainty directly, not just on NTP daemon health, and treat the increased fallback rate as protective behavior while clock quality recovers.

A stale read appears right after failover, even though the election log shows a clean term transition. The read path relied on raw local wall-clock comparisons or skipped the uncertainty guard band, allowing the old and new leaders to believe the lease was valid at the same time. Store lease expiry in a form all replicas interpret conservatively, check latest < expiry before serving old-leader lease reads, and require the successor to wait until earliest > expiry before enabling its own lease fast path.

The leader frequently refuses lease reads during peak traffic, but time synchronization looks healthy. Lease renewal depends on timely quorum acknowledgments. Replication lag, congested heartbeat channels, or overloaded followers can reduce the usable read window even when clocks are accurate. Separate heartbeat and replication latency metrics, watch how much margin remains before lease expiry, and investigate follower overload before tuning the lease duration upward.

Connections

Secondary Indexes Across Shards showed that optimized read paths still need explicit freshness contracts; leases apply the same discipline to leader-local reads.
Membership Changes and Replica Set Evolution continues the authority story by separating suspicion, committed configuration, and safe participation in a replica set.
Replication Flow Control and Backpressure explains why timely follower acknowledgments affect both write throughput and lease-read availability.

Resources

[PAPER] Spanner: Google's Globally-Distributed Database
- Focus: Read the sections on TrueTime and external consistency to see how bounded uncertainty becomes an explicit API for correctness decisions.
[DOC] etcd API guarantees
- Focus: Compare Harbor Point's lease-read fast path with etcd's linearizable read guarantees and the cases where extra quorum confirmation is required.
[DOC] CockroachDB architecture: transaction layer
- Focus: Look for how hybrid logical clocks, uncertainty windows, and leaseholders interact in a production SQL system.
[PAPER] Paxos Made Live: An Engineering Perspective
- Focus: Pay attention to how a formally correct leader protocol still needs operational guardrails around time, failure detection, and stale authority.

Key Takeaways

A lease is a time-bounded proof, not a performance hint; local reads are safe only while the system can prove that no earlier leader still has authority.
Clock uncertainty turns "now" into an interval, so old leaders must stop early and successors must start late around lease handoff.
The correct degraded mode is slower reads, not unsafe reads: when uncertainty, heartbeat delay, or replication pressure erodes lease margin, the system should fall back to quorum-confirmed reads.
