Coordination APIs: Locks, Leases, Watches, and Compare-And-Swap
The core idea: Coordination APIs turn consensus history into usable authority primitives, with a trade-off between expressive low-level operations and safer contracts that expose revisions, leases, watches, and stale-client failure modes.
Core Insight
Imagine a platform team exposes a consensus store to application teams. If the only interface is "write keys and read keys," every team will invent its own locking, leadership, rollout, and watch logic on top. Some of those inventions will work in tests and fail when a client pauses, a watch disconnects, or two controllers race.
Coordination APIs package consensus into safer primitives. Compare-and-swap turns an update into a guarded decision. Leases and sessions give ownership expiry semantics. Watches expose committed history as a recovery stream. Locks and elections combine those pieces into common authority patterns.
The misconception is that the consensus protocol is the product API. It is not. The protocol protects the store's history. The API decides what application developers are allowed to claim from that history, how they prove freshness, and how they recover when their assumptions become stale.
The design pressure is practical: the API should make the safe path obvious. A low-level key-value interface is expressive, but it can push every client into rebuilding the same fragile coordination patterns. A sharper API can prevent common split-brain, stale-read, and missed-watch mistakes.
Compare-And-Swap Turns State Into a Guarded Decision
Compare-and-swap, often exposed as a transaction or conditional update, says:
if revision(key) == R:
    write new value
else:
    fail and let the client retry with fresh state
That is a small primitive with large consequences. It lets clients express "only perform this change if the world is still the world I inspected." Without that guard, two clients can read the same value and overwrite each other.
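The guard can be sketched with a toy in-memory store. This is a minimal single-process illustration, not any real service's API: `RevisionStore`, `Entry`, and the `cas` signature are hypothetical names, and the convention that revision 0 means "key absent" is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Entry:
    value: str
    revision: int  # store revision at which this key was last written


class RevisionStore:
    """Hypothetical single-process stand-in for a consensus-backed store."""

    def __init__(self) -> None:
        self._data: Dict[str, Entry] = {}
        self._revision = 0  # global counter, increases on every committed write

    def get(self, key: str) -> Optional[Entry]:
        return self._data.get(key)

    def cas(self, key: str, expected_revision: int, value: str) -> Tuple[bool, int]:
        """Write only if the key's revision still equals expected_revision.

        Assumption for this sketch: expected_revision == 0 means "key must not exist".
        Returns (succeeded, current revision of the key).
        """
        current = self._data.get(key)
        current_revision = current.revision if current else 0
        if current_revision != expected_revision:
            return False, current_revision  # caller's view is stale
        self._revision += 1
        self._data[key] = Entry(value, self._revision)
        return True, self._revision


store = RevisionStore()
created, created_rev = store.cas("/config/replicas", 0, "3")

# Two clients inspect the same revision, then both try to write.
seen_by_a = store.get("/config/replicas").revision
seen_by_b = store.get("/config/replicas").revision
a_ok, _ = store.cas("/config/replicas", seen_by_a, "5")
b_ok, b_current = store.cas("/config/replicas", seen_by_b, "4")
```

Client B's write is refused rather than silently replacing A's value, and the returned revision tells B how far its view has drifted.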
In a coordination system, CAS protects decisions such as:
- configuration updates,
- leader election records,
- lock acquisition,
- rollout state transitions,
- membership metadata.
The important part is not only atomicity. It is evidence. A failed CAS tells the client its assumptions are stale and must be refreshed before it acts. That failure is a feature: it turns a hidden race into a visible retry path.
CAS also sets a boundary around business logic. The client may calculate a desired change locally, but the coordination service decides whether the precondition still holds in the committed order.
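One way to picture the retry path is a read, compute, CAS loop that recomputes from fresh state after every rejection. The sketch below is illustrative only: the dict-backed `state`, the per-key revision counter, and the `before_write` hook (used here to simulate a racing writer) are all assumptions introduced for the example.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical in-memory state: key -> (value, revision). Revision 0 means absent.
State = Dict[str, Tuple[int, int]]


def cas(state: State, key: str, expected_revision: int, value: int) -> bool:
    """Commit the write only if the key is still at expected_revision."""
    _, current = state.get(key, (0, 0))
    if current != expected_revision:
        return False
    state[key] = (value, current + 1)
    return True


def update_with_retry(
    state: State,
    key: str,
    compute: Callable[[int], int],
    max_attempts: int = 5,
    before_write: Optional[Callable[[int], None]] = None,
) -> int:
    """Read, compute locally, then CAS. On rejection, re-read and recompute.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        value, revision = state.get(key, (0, 0))
        desired = compute(value)  # business logic runs on the client
        if before_write is not None:
            before_write(attempt)  # hook that lets the demo inject a rival write
        if cas(state, key, revision, desired):  # the store decides if it still holds
            return attempt
    raise RuntimeError("precondition kept changing; giving up")


state: State = {"/rollout/step": (1, 1)}


def rival(attempt: int) -> None:
    # Another client wins the race during our first attempt only.
    if attempt == 1:
        cas(state, "/rollout/step", 1, 10)


attempts = update_with_retry(state, "/rollout/step", lambda v: v + 1, before_write=rival)
```

The second attempt computes its change from the rival's value rather than from the stale one, which is the behavior a blind retry of the original write would not give.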
Leases and Locks Need Expiry Semantics
A lock without failure handling is dangerous. If the owner crashes, does the lock last forever? If it expires, how does a resumed owner know it lost authority?
This is why practical coordination APIs use sessions, leases, or TTL-bound ownership. A client owns something only while its lease remains valid. The service can revoke or expire that ownership when the client stops renewing.
The lock or election API should expose enough information for clients and downstream systems to distinguish current authority from stale authority:
acquire lease
receive revision or fencing token
act with token
downstream validates token
That connects this API directly to fencing. A lock service can grant authority, but protected resources still need a way to reject stale actors. If the API says only "lock acquired" and hides the revision or token, it makes correct downstream enforcement harder.
Good lock APIs are also honest about liveness. They define renewal deadlines, expiry behavior, session loss, and what a client must do after reconnecting. A client that lost its session should assume it lost ownership until it proves otherwise.
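The acquire, token, validate flow above can be sketched as follows. This is a simplified model under stated assumptions: `LeaseLock`, `Grant`, and `ProtectedResource` are hypothetical names, time is passed in explicitly as `now` instead of being read from a clock, and the downstream check is a plain "reject anything older than the highest token seen" rule.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Grant:
    owner: str
    token: int         # fencing token, strictly increasing with each grant
    expires_at: float  # deadline on the lock service's clock


class LeaseLock:
    """Hypothetical lock service with TTL-bound ownership."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self._grant: Optional[Grant] = None
        self._next_token = 1

    def acquire(self, owner: str, now: float) -> Optional[Grant]:
        if self._grant is not None and self._grant.expires_at > now:
            return None  # a live lease already exists
        self._grant = Grant(owner, self._next_token, now + self.ttl)
        self._next_token += 1
        return self._grant

    def renew(self, token: int, now: float) -> bool:
        grant = self._grant
        if grant is None or grant.token != token or grant.expires_at <= now:
            return False  # caller must assume it lost ownership
        grant.expires_at = now + self.ttl
        return True


class ProtectedResource:
    """Downstream system that rejects writes carrying an older token."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.writes: List[str] = []

    def write(self, token: int, payload: str) -> bool:
        if token < self.highest_token:
            return False  # stale actor
        self.highest_token = token
        self.writes.append(payload)
        return True


lock = LeaseLock(ttl=10.0)
storage = ProtectedResource()

a = lock.acquire("controller-a", now=0.0)
b_early = lock.acquire("controller-b", now=5.0)  # refused: lease still live

# controller-a pauses past its renewal deadline; controller-b takes over.
b = lock.acquire("controller-b", now=15.0)
b_write = storage.write(b.token, "from-b")

# controller-a resumes, still believing it owns the lock.
a_write = storage.write(a.token, "from-a")
a_renew = lock.renew(a.token, now=16.0)
```

The resumed owner learns it is stale from two independent places: the failed renewal and the downstream rejection. Only the second protects the resource if the client never checks the first.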
Watches Turn Committed History Into Reconciliation Input
Watches let clients observe changes from a known revision:
watch /deployments from revision 1200
This is the backbone of many control planes. Controllers do not poll blindly. They watch ordered metadata changes, update local caches, and reconcile actual state toward desired state.
Good watch APIs need clear semantics:
- what revision the watch starts from,
- whether events are ordered per key, per prefix, or globally,
- what happens if the watcher falls too far behind,
- how the client resumes after disconnect,
- whether deletes, compaction, and lease expiry appear as explicit events.
If those semantics are vague, controllers can miss changes, process stale state, or replay actions unsafely.
The recovery contract is as important as the streaming contract. A useful watch API gives the client a stable way to say, "I last processed revision 1200; show me everything after that, or tell me I must resync from a fresh snapshot."
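A minimal sketch of that recovery contract, assuming a toy in-memory history: `EventLog`, `Watcher`, `CompactedError`, and the mode strings returned by `catch_up` are hypothetical names chosen for illustration. The watcher either replays events after its last processed revision or, if that revision has been compacted away, rebuilds its cache from a snapshot.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    revision: int
    key: str
    value: str


class CompactedError(Exception):
    """The requested revision is older than the retained history."""


class EventLog:
    """Hypothetical ordered history standing in for committed revisions."""

    def __init__(self) -> None:
        self._events: List[Event] = []
        self._state: Dict[str, str] = {}
        self.revision = 0
        self.compacted_through = 0  # events at or below this revision are gone

    def put(self, key: str, value: str) -> int:
        self.revision += 1
        self._state[key] = value
        self._events.append(Event(self.revision, key, value))
        return self.revision

    def compact(self, revision: int) -> None:
        self.compacted_through = max(self.compacted_through, revision)
        self._events = [e for e in self._events if e.revision > self.compacted_through]

    def events_after(self, revision: int, prefix: str) -> List[Event]:
        if revision < self.compacted_through:
            raise CompactedError(f"revision {revision} was compacted")
        return [e for e in self._events
                if e.revision > revision and e.key.startswith(prefix)]

    def snapshot(self, prefix: str) -> Tuple[Dict[str, str], int]:
        current = {k: v for k, v in self._state.items() if k.startswith(prefix)}
        return current, self.revision


class Watcher:
    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self.cache: Dict[str, str] = {}
        self.last_revision = 0

    def catch_up(self, log: EventLog) -> str:
        try:
            events = log.events_after(self.last_revision, self.prefix)
        except CompactedError:
            # History is gone: rebuild from a fresh snapshot instead of guessing.
            self.cache, self.last_revision = log.snapshot(self.prefix)
            return "resync"
        for event in events:
            self.cache[event.key] = event.value
            self.last_revision = event.revision
        return "incremental"


log = EventLog()
log.put("/deployments/web", "v1")
log.put("/deployments/api", "v1")
watcher = Watcher("/deployments/")
first_mode = watcher.catch_up(log)

# The watcher disconnects; two more writes land and history is compacted.
log.put("/deployments/web", "v2")
log.put("/deployments/api", "v2")
log.compact(3)
second_mode = watcher.catch_up(log)
```

The important design choice is that falling behind compaction produces an explicit error, so the client cannot mistake a gap in history for "nothing happened."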
Worked Example: A Safer Leader Election API
Suppose a deployment controller needs exactly one active leader.
The fragile version stores a key:
/leaders/deploy = controller-a
That looks simple, but it leaves too many questions open. Did controller-a acquire the key conditionally? Does the key expire? What revision proves the current ownership? What should another controller watch? What happens when controller-a reconnects after a pause?
A safer API sequence looks like this:
create session or lease -> lease_id=17
CAS /leaders/deploy if missing -> owner=controller-a, lease=17, token=41
watch /leaders/deploy from revision R
act only while lease is live
include token=41 in protected downstream actions
Now the authority claim has evidence. The CAS guarantees that at most one of several simultaneous acquisition attempts succeeds. The lease gives failure semantics. The revision gives watchers a recovery point. The token gives downstream systems a way to reject stale work.
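The sequence can be modeled end to end in a few dozen lines. The sketch below is an assumption-laden toy, not a real client library: `Coordinator`, `campaign`, `grant_lease`, and `keep_alive` are hypothetical names, time is passed in as `now`, and the fencing token is taken to be the revision at which the leader key was created.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Leader:
    owner: str
    lease_id: int
    token: int  # creation revision of the key, used here as the fencing token


class Coordinator:
    """Hypothetical coordination service: leases plus create-if-missing."""

    def __init__(self) -> None:
        self._revision = 0
        self._next_lease = 1
        self._lease_expiry: Dict[int, float] = {}
        self._keys: Dict[str, Leader] = {}

    def grant_lease(self, ttl: float, now: float) -> int:
        lease_id = self._next_lease
        self._next_lease += 1
        self._lease_expiry[lease_id] = now + ttl
        return lease_id

    def keep_alive(self, lease_id: int, ttl: float, now: float) -> bool:
        self._expire(now)
        if lease_id not in self._lease_expiry:
            return False  # lease is gone; the client must assume it lost leadership
        self._lease_expiry[lease_id] = now + ttl
        return True

    def campaign(self, key: str, owner: str, lease_id: int, now: float) -> Optional[Leader]:
        """Create the leader key only if it is missing and the lease is live."""
        self._expire(now)
        if lease_id not in self._lease_expiry or key in self._keys:
            return None
        self._revision += 1
        self._keys[key] = Leader(owner, lease_id, self._revision)
        return self._keys[key]

    def leader(self, key: str, now: float) -> Optional[Leader]:
        self._expire(now)
        return self._keys.get(key)

    def _expire(self, now: float) -> None:
        dead = [lid for lid, deadline in self._lease_expiry.items() if deadline <= now]
        for lid in dead:
            del self._lease_expiry[lid]
        for key in [k for k, v in self._keys.items() if v.lease_id in dead]:
            del self._keys[key]
            self._revision += 1  # a delete is also a committed change


KEY = "/leaders/deploy"
coord = Coordinator()
lease_a = coord.grant_lease(ttl=10, now=0)
lease_b = coord.grant_lease(ttl=10, now=0)

won_a = coord.campaign(KEY, "controller-a", lease_a, now=1)
won_b = coord.campaign(KEY, "controller-b", lease_b, now=1)  # refused: key exists

coord.keep_alive(lease_b, ttl=10, now=5)
# controller-a pauses and misses its renewal; its lease and key expire.
won_b_later = coord.campaign(KEY, "controller-b", lease_b, now=12)
a_alive = coord.keep_alive(lease_a, ttl=10, now=13)
```

Because the new leader's token is larger than the old one, a downstream system that compares tokens can reject any late work from the paused controller.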
API Design Review
When designing a coordination API, ask four questions:
- Which operation creates authority?
- Which revision or token proves that authority?
- How does a client learn it is stale?
- How does a watcher recover after losing its stream?
Those questions keep the API honest. They turn consensus from a hidden implementation detail into a usable control surface.
Add three more questions for production use:
- What can a client cache, and what must be read with linearizable consistency?
- What happens after compaction removes the revision a watcher wanted?
- Which operations are intentionally not provided because they are too easy to misuse?
That last question is part of the API trade-off. A powerful primitive can support many designs, but a constrained primitive can encode the safety rules most clients should not have to rediscover.
Common Failure Modes
The failures usually come from missing evidence at the API boundary:
- a lock API returns ownership but no fencing token,
- a watch API streams changes but gives no reliable resume point,
- clients use cached reads for authority decisions,
- a CAS failure is treated as a transient error instead of a signal that the client's assumptions are stale,
- lease expiry is hidden behind client libraries until stale actors have already acted.
These are not failures of Paxos or Raft alone. They are failures to expose the protocol's evidence in a form application code can use correctly.
Connections
The previous lesson showed why linearizable reads, leases, and fencing matter for authority. This lesson turns those ideas into API shape: revisions, tokens, conditional updates, and recovery contracts.
The next lesson shifts from client-facing semantics to operations. Once a coordination API becomes a control-plane dependency, latency, disk behavior, network placement, and sizing determine whether those safe operations remain usable under load.
Resources
- [DOC] etcd API
  - Focus: Study transactions, watches, leases, and revisions as one API surface.
- [DOC] ZooKeeper Programmer's Guide
  - Focus: Compare znodes, watches, sessions, and ephemeral nodes.
- [PAPER] The Chubby Lock Service for Loosely-Coupled Distributed Systems
  - Focus: Read for why coordination APIs need semantics beyond a raw consensus log.
Key Takeaways
- Coordination APIs turn consensus history into guarded application decisions that clients can use without reimplementing the protocol.
- Compare-and-swap exposes stale assumptions instead of silently overwriting state.
- Locks and leases need explicit expiry, revision, session, and fencing-token semantics.
- Watches are safe only when clients can resume from known revisions or resync after falling behind.