Coordination APIs: Locks, Leases, Watches, and Compare-And-Swap

Lesson 020 · Consensus and Coordination · 30 min · intermediate

The core idea: Coordination APIs turn consensus history into usable authority primitives, with a trade-off between expressive low-level operations and safer contracts that expose revisions, leases, watches, and stale-client failure modes.

Core Insight

Imagine a platform team exposes a consensus store to application teams. If the only interface is "write keys and read keys," every team will invent its own locking, leadership, rollout, and watch logic on top. Some of those inventions will work in tests and fail when a client pauses, a watch disconnects, or two controllers race.

Coordination APIs package consensus into safer primitives. Compare-and-swap turns an update into a guarded decision. Leases and sessions give ownership expiry semantics. Watches expose committed history as a recovery stream. Locks and elections combine those pieces into common authority patterns.

The misconception is that the consensus protocol is the product API. It is not. The protocol protects the store's history. The API decides what application developers are allowed to claim from that history, how they prove freshness, and how they recover when their assumptions become stale.

The design pressure is practical: the API should make the safe path obvious. A low-level key-value interface is expressive, but it can push every client into rebuilding the same fragile coordination patterns. A sharper API can prevent common split-brain, stale-read, and missed-watch mistakes.

Compare-And-Swap Turns State Into a Guarded Decision

Compare-and-swap, often exposed as a transaction or conditional update, says:

if revision(key) == R:
    write new value
else:
    fail and let the client retry with fresh state

That is a small primitive with large consequences. It lets clients express "only perform this change if the world is still the world I inspected." Without that guard, two clients can read the same value and overwrite each other.
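The guard above can be sketched with a toy in-memory store. This is a minimal illustration, not any particular product's API: `RevisionedStore`, `Entry`, and `compare_and_swap` are hypothetical names, and a real service would commit each write through its consensus log rather than a Python dictionary.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Entry:
    value: str
    mod_revision: int  # store revision at which this key was last written


class RevisionedStore:
    """Toy in-memory store (hypothetical); stands in for a consensus-backed service."""

    def __init__(self) -> None:
        self.revision = 0  # global, monotonically increasing
        self.data: Dict[str, Entry] = {}

    def get(self, key: str) -> Tuple[Optional[str], int]:
        """Return (value, mod_revision); a missing key reports revision 0."""
        entry = self.data.get(key)
        return (entry.value, entry.mod_revision) if entry else (None, 0)

    def compare_and_swap(self, key: str, expected_revision: int, value: str) -> Tuple[bool, int]:
        """Write only if the key's revision still matches what the caller read."""
        _, current = self.get(key)
        if current != expected_revision:
            return False, current  # stale precondition: caller must re-read
        self.revision += 1
        self.data[key] = Entry(value, self.revision)
        return True, self.revision


store = RevisionedStore()
_, rev = store.get("/config/replicas")  # two clients both observe revision 0
ok_a, rev_a = store.compare_and_swap("/config/replicas", rev, "3")
ok_b, rev_b = store.compare_and_swap("/config/replicas", rev, "5")
print(ok_a, rev_a)  # True 1
print(ok_b, rev_b)  # False 1
```

The second client is not silently overwritten and does not silently overwrite: it receives the current revision and has to decide again from fresh state.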

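In a coordination system, CAS protects decisions such as claiming a leader key only if it is still missing, or updating a shared record only if its revision has not changed since the client read it.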

The important part is not only atomicity. It is evidence. A failed CAS tells the client its assumptions are stale and must be refreshed before it acts. That failure is a feature: it turns a hidden race into a visible retry path.

CAS also sets a boundary around business logic. The client may calculate a desired change locally, but the coordination service decides whether the precondition still holds in the committed order.
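One way to picture that boundary is a read, compute, CAS loop. The sketch below assumes a hypothetical `TinyStore` with integer values and a helper named `update_with_retry`; the attempt limit is an arbitrary choice for illustration.

```python
from typing import Callable, Dict, Tuple


class TinyStore:
    """Hypothetical stand-in for the coordination service."""

    def __init__(self) -> None:
        self.revision = 0
        self.data: Dict[str, Tuple[int, int]] = {}  # key -> (value, mod_revision)

    def get(self, key: str) -> Tuple[int, int]:
        return self.data.get(key, (0, 0))

    def cas(self, key: str, expected_revision: int, value: int) -> bool:
        if self.get(key)[1] != expected_revision:
            return False
        self.revision += 1
        self.data[key] = (value, self.revision)
        return True


def update_with_retry(store: TinyStore, key: str,
                      change: Callable[[int], int], max_attempts: int = 5) -> int:
    """Compute the change locally, but let the store decide whether it applies."""
    for _ in range(max_attempts):
        value, revision = store.get(key)         # refresh assumptions
        new_value = change(value)                # local business logic
        if store.cas(key, revision, new_value):  # precondition checked by the store
            return new_value
    raise RuntimeError(f"gave up on {key} after {max_attempts} stale attempts")


store = TinyStore()
update_with_retry(store, "/rollout/batch", lambda v: v + 1)
update_with_retry(store, "/rollout/batch", lambda v: v + 1)
print(store.get("/rollout/batch"))  # (2, 2)
```

The client never applies `change` to a value it has not just re-read, so a lost race costs a retry rather than a lost update.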

Leases and Locks Need Expiry Semantics

A lock without failure handling is dangerous. If the owner crashes, does the lock last forever? If it expires, how does a resumed owner know it lost authority?

This is why practical coordination APIs use sessions, leases, or TTL-bound ownership. A client owns something only while its lease remains valid. The service can revoke or expire that ownership when the client stops renewing.
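A minimal sketch of TTL-bound ownership, assuming a hypothetical `LeaseTable` on the service side. The clock is injected so expiry can be demonstrated without sleeping; the key behavior is that an expired lease cannot be revived by a late renewal.

```python
from typing import Callable, Dict


class LeaseTable:
    """Toy lease manager (hypothetical names); `clock` returns the current time."""

    def __init__(self, clock: Callable[[], float]) -> None:
        self.clock = clock
        self.next_id = 1
        self.ttls: Dict[int, float] = {}
        self.deadlines: Dict[int, float] = {}  # lease_id -> expiry time

    def grant(self, ttl: float) -> int:
        lease_id = self.next_id
        self.next_id += 1
        self.ttls[lease_id] = ttl
        self.deadlines[lease_id] = self.clock() + ttl
        return lease_id

    def is_live(self, lease_id: int) -> bool:
        deadline = self.deadlines.get(lease_id)
        return deadline is not None and self.clock() < deadline

    def keep_alive(self, lease_id: int) -> bool:
        """Extend a live lease; an expired lease cannot be revived."""
        if not self.is_live(lease_id):
            self.deadlines.pop(lease_id, None)
            return False
        self.deadlines[lease_id] = self.clock() + self.ttls[lease_id]
        return True


now = [0.0]  # fake clock for the demonstration
leases = LeaseTable(lambda: now[0])
lease = leases.grant(ttl=10.0)
now[0] = 8.0
print(leases.keep_alive(lease))  # True: renewed before the deadline
now[0] = 30.0                    # the client paused past the new deadline
print(leases.keep_alive(lease))  # False: ownership is gone, the client must re-acquire
```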

The lock or election API should expose enough information for clients and downstream systems to distinguish current authority from stale authority:

acquire lease
receive revision or fencing token
act with token
downstream validates token

That connects this API directly to fencing. A lock service can grant authority, but protected resources still need a way to reject stale actors. If the API says only "lock acquired" and hides the revision or token, it makes correct downstream enforcement harder.
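On the downstream side, token validation can be as small as remembering the highest token seen. The sketch below assumes a hypothetical `FencedResource` and assumes tokens increase with each new grant of authority, as a store revision would.

```python
from typing import List, Tuple


class FencedResource:
    """Hypothetical protected resource that tracks the highest token it has accepted."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.log: List[Tuple[int, str]] = []

    def write(self, token: int, payload: str) -> bool:
        if token < self.highest_token:
            return False  # stale actor: a newer owner has already written
        self.highest_token = token
        self.log.append((token, payload))
        return True


resource = FencedResource()
print(resource.write(41, "deploy v1"))        # True
print(resource.write(42, "deploy v2"))        # True: a new owner took over
print(resource.write(41, "deploy v1-retry"))  # False: the old owner resumed after a pause
```

None of this is possible if the lock API returns only a boolean, which is why the revision or token belongs in the acquire response.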

Good lock APIs are also honest about liveness. They define renewal deadlines, expiry behavior, session loss, and what a client must do after reconnecting. A client that lost its session should assume it lost ownership until it proves otherwise.
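A client can encode that pessimism locally. The sketch below is one possible shape, with hypothetical names (`OwnershipView`, `renewal_succeeded`, `session_lost`); it assumes the client's local clock is good enough for a conservative check and is meant to complement fencing, not replace it.

```python
from typing import Callable, Optional


class OwnershipView:
    """Client-side view of its own lease; all names here are illustrative."""

    def __init__(self, clock: Callable[[], float], ttl: float) -> None:
        self.clock = clock
        self.ttl = ttl
        self.valid_until: Optional[float] = None  # None means "assume not owner"

    def renewal_succeeded(self, request_sent_at: float) -> None:
        # Count the TTL from when the request left, not when the reply arrived,
        # so network delay can only shorten the client's belief, never extend it.
        self.valid_until = request_sent_at + self.ttl

    def session_lost(self) -> None:
        self.valid_until = None

    def may_act(self) -> bool:
        return self.valid_until is not None and self.clock() < self.valid_until


now = [0.0]
view = OwnershipView(lambda: now[0], ttl=10.0)
view.renewal_succeeded(request_sent_at=0.0)
print(view.may_act())  # True
now[0] = 11.0          # a long pause: the local deadline has passed
print(view.may_act())  # False
view.session_lost()    # the reconnect reported a lost session
now[0] = 12.0
print(view.may_act())  # False until a new acquisition or renewal succeeds
```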

Watches Turn Committed History Into Reconciliation Input

Watches let clients observe changes from a known revision:

watch /deployments from revision 1200

This is the backbone of many control planes. Controllers do not poll blindly. They watch ordered metadata changes, update local caches, and reconcile actual state toward desired state.

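Good watch APIs need clear semantics for which revision a watch starts from, the order in which events are delivered, and what the client is told when the revision it asks for is no longer available.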

If those semantics are vague, controllers can miss changes, process stale state, or replay actions unsafely.

The recovery contract is as important as the streaming contract. A useful watch API gives the client a stable way to say, "I last processed revision 1200; show me everything after that, or tell me I must resync from a fresh snapshot."
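That contract can be sketched as a log that either returns every event after a given revision or refuses explicitly. `EventLog`, `events_after`, and `CompactedError` are hypothetical names; the sketch uses small revision numbers rather than 1200 to keep the output readable.

```python
from typing import List, Tuple

Event = Tuple[int, str, str]  # (revision, key, value)


class CompactedError(Exception):
    """The requested start revision is older than the retained history."""


class EventLog:
    """Toy committed history (hypothetical); a real watch would stream, not return a list."""

    def __init__(self) -> None:
        self.revision = 0
        self.compacted_through = 0  # events at or below this revision are gone
        self.events: List[Event] = []

    def append(self, key: str, value: str) -> int:
        self.revision += 1
        self.events.append((self.revision, key, value))
        return self.revision

    def compact(self, revision: int) -> None:
        self.compacted_through = max(self.compacted_through, revision)
        self.events = [e for e in self.events if e[0] > self.compacted_through]

    def events_after(self, last_seen: int) -> List[Event]:
        """Return every retained event with revision > last_seen, in commit order."""
        if last_seen < self.compacted_through:
            # Some events the caller needs were discarded: it must resync.
            raise CompactedError(f"revision {last_seen} compacted; resync from a snapshot")
        return [e for e in self.events if e[0] > last_seen]


log = EventLog()
for image in ("v1", "v2", "v3"):
    log.append("/deployments/web", image)
print(log.events_after(1))  # [(2, '/deployments/web', 'v2'), (3, '/deployments/web', 'v3')]
log.compact(2)
try:
    log.events_after(1)
except CompactedError as err:
    print("resync needed:", err)
```

The explicit error is the important part: a watcher that silently received only revision 3 would believe it had seen everything after revision 1.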

Worked Example: A Safer Leader Election API

Suppose a deployment controller needs exactly one active leader.

The fragile version stores a key:

/leaders/deploy = controller-a

That looks simple, but it leaves too many questions open. Did controller-a acquire the key conditionally? Does the key expire? What revision proves the current ownership? What should another controller watch? What happens when controller-a reconnects after a pause?

A safer API sequence looks like this:

create session or lease -> lease_id=17
CAS /leaders/deploy if missing -> owner=controller-a, lease=17, token=41
watch /leaders/deploy from revision R
act only while lease is live
include token=41 in protected downstream actions

Now the authority claim has evidence. The CAS prevents two simultaneous acquisitions. The lease gives failure semantics. The revision gives watchers a recovery point. The token gives downstream systems a way to reject stale work.
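The sequence can be modeled end to end in a few dozen lines. Everything in this sketch is hypothetical (`ElectionStore`, `campaign`, `is_leader`); the starting revision and lease counter are chosen only so the numbers match the example above, and the single-process model ignores replication entirely.

```python
from typing import Callable, Dict, Optional, Tuple


class ElectionStore:
    """Toy single-process model of the sequence above; every name is hypothetical."""

    def __init__(self, clock: Callable[[], float]) -> None:
        self.clock = clock
        self.revision = 40    # chosen so the first token is 41
        self.next_lease = 17  # chosen so the first lease id is 17
        self.lease_deadline: Dict[int, float] = {}
        self.leader: Optional[Tuple[str, int, int]] = None  # (owner, lease_id, token)

    def grant_lease(self, ttl: float) -> int:
        lease_id = self.next_lease
        self.next_lease += 1
        self.lease_deadline[lease_id] = self.clock() + ttl
        return lease_id

    def _expire(self) -> None:
        # Drop the leader key once the lease attached to it has run out.
        if self.leader and self.clock() >= self.lease_deadline[self.leader[1]]:
            self.leader = None

    def campaign(self, owner: str, lease_id: int) -> Optional[int]:
        """CAS 'create if missing'; returns a fencing token on success, else None."""
        self._expire()
        if self.leader is not None:
            return None  # someone else holds the key
        if self.clock() >= self.lease_deadline[lease_id]:
            return None  # the candidate's own lease has already expired
        self.revision += 1
        self.leader = (owner, lease_id, self.revision)
        return self.revision

    def is_leader(self, owner: str, token: int) -> bool:
        self._expire()
        return self.leader is not None and self.leader[0] == owner and self.leader[2] == token


now = [0.0]
store = ElectionStore(lambda: now[0])
lease_a = store.grant_lease(ttl=10.0)
token_a = store.campaign("controller-a", lease_a)
lease_b = store.grant_lease(ttl=10.0)
print(store.campaign("controller-b", lease_b))   # None: the key already exists

now[0] = 12.0                                    # controller-a pauses past its lease
lease_b = store.grant_lease(ttl=10.0)
token_b = store.campaign("controller-b", lease_b)
print(lease_a, token_a, token_b)                 # 17 41 42
print(store.is_leader("controller-a", token_a))  # False: token 41 is now stale
```

When controller-a resumes, it still holds token 41 in memory; the store's answer and any downstream token check are what tell it that the claim no longer holds.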

API Design Review

When designing a coordination API, ask four questions:

  1. Which operation creates authority?
  2. Which revision or token proves that authority?
  3. How does a client learn it is stale?
  4. How does a watcher recover after losing its stream?

Those questions keep the API honest. They turn consensus from a hidden implementation detail into a usable control surface.

Add three more questions for production use:

  1. What can a client cache, and what must be read linearizably?
  2. What happens after compaction removes the revision a watcher wanted?
  3. Which operations are intentionally not provided because they are too easy to misuse?

That last question is part of the API trade-off. A powerful primitive can support many designs, but a constrained primitive can encode the safety rules most clients should not have to rediscover.

Common Failure Modes

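The failures usually come from missing evidence at the API boundary: a lock granted without a revision or fencing token, ownership that never expires, an unconditional write where a guarded one was needed, or a watch that cannot resume from a known revision.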

These are not failures of Paxos or Raft alone. They are failures to expose the protocol's evidence in a form application code can use correctly.

Connections

The previous lesson showed why linearizable reads, leases, and fencing matter for authority. This lesson turns those ideas into API shape: revisions, tokens, conditional updates, and recovery contracts.

The next lesson shifts from client-facing semantics to operations. Once a coordination API becomes a control-plane dependency, latency, disk behavior, network placement, and sizing determine whether those safe operations remain usable under load.
