Distributed Schedulers and Control Planes: Leases, Leadership, and Ownership Transfer
LESSON
Distributed Schedulers and Control Planes: Leases, Leadership, and Ownership Transfer
The core idea: A lease gives one controller replica time-bounded authority to act, so the design trade-off is between fast failover and strong protection against two owners acting at once.
Core Insight
Suppose the repair controller for fraud-batch runs as three replicas: ctrl-a, ctrl-b, and ctrl-c. They all watch the same workload queue, and they can all see that one GPU binding is stuck in unknown. If all three replicas try to repair it at once, the reconciliation loop from the previous lesson is no longer safe. The system needs high availability, but it also needs one clear owner for each dangerous transition.
A lease is the usual compromise. It is a durable record that says "this actor may act for this scope until this time, if it keeps renewing." The scope might be the whole scheduler, one shard of the scheduling queue, one tenant, or one resource partition. A leader is simply the actor currently holding the lease for that scope.
The non-obvious point is that a lease is not proof that the old leader is dead. It is a rule that other actors can use to decide when authority may transfer. A paused process, delayed packet, slow disk, or skewed clock can make an old leader look absent even though it will resume later. Safe control planes combine leases with fencing tokens, conditional writes, and ownership checks so an expired leader cannot keep mutating state after a new leader takes over.
Why Leases Exist
Control planes often run multiple replicas because the controller itself can fail. If the only scheduler process crashes, no new work is placed. If the only repair loop stops, failed work stays failed. Running multiple replicas improves liveness, but it creates a safety problem: replicated controllers may duplicate the same action.
Leases draw an authority boundary around a scope of work. For example:
scheduler-global: one leader performs all scheduling.scheduler-shard-12: one leader owns a subset of queue keys.tenant-fraud-repair: one leader repairs workloads for a tenant.node-gpu-7-drain: one leader coordinates a node-local transition.
The lease holder can process the associated work queue and write owned transitions. Other replicas can watch, stay warm, and attempt takeover only after the lease is not renewed. This keeps standby replicas useful without letting every replica act as if it owns the system.
The main trade-off is lease duration. Short leases detect leader failure quickly and improve failover. They also increase sensitivity to pauses, clock problems, and transient API slowness. Long leases reduce false takeover, but they make the platform wait longer after a real leader failure. The right value depends on how expensive duplicate action is compared with delayed progress.
What A Lease Record Carries
A useful lease record is more than a boolean lock. It usually contains enough information to make takeover and fencing debuggable:
scope: scheduler-shard-12
holder: ctrl-a
lease_duration: 15s
renew_time: 10:41:05.120
epoch: 1842
resource_version: 991477
The holder names the current actor. The renew_time and lease_duration define when others may try to acquire. The resource_version or compare-and-swap condition prevents two contenders from both believing they updated the same record. The epoch is a fencing token: every privileged write made by the leader can include the epoch, and the authority can reject writes from older epochs.
That fencing token matters because process state can lag behind authority. Imagine ctrl-a holds epoch 1842, then stalls for twenty seconds. ctrl-b sees the lease expire and acquires epoch 1843. When ctrl-a wakes up, it may still have local memory that says it is leader. If the binding API accepts writes only from the current epoch, ctrl-a becomes harmless: its stale repair request is rejected.
Without fencing, the system has only a polite agreement between processes. With fencing, the authority enforces the ownership transfer.
Leadership Loop
A leader election loop is itself a small reconciler. It repeatedly reads the lease record, tries a conditional update, and renews before expiry:
while running:
lease = read(scope)
if lease is empty or expired:
try_acquire(scope, holder=self, epoch=lease.epoch + 1)
if lease.holder == self:
renew_before_deadline(scope, epoch=lease.epoch)
process_owned_queue(scope, epoch=lease.epoch)
sleep(jittered_interval)
The loop should separate "I am currently leader" from "my last renewal succeeded." If renewal fails, the replica should stop starting new privileged work. It may finish local cleanup that does not mutate shared authority, but it should not bind workloads, release quota, or declare ownership over new transitions.
This is also why leader election should be boring and conservative. A controller that continues acting after losing the lease is worse than a controller that pauses too often. Pausing hurts liveness; stale authority can break safety across many workloads.
Ownership Transfer
Ownership transfer is the path from old leader to new leader. A safe transfer has three parts:
- The old leader stops renewing or loses its ability to renew.
- The new leader acquires the lease through an authoritative conditional write.
- Shared mutation points reject actions from the old leader's epoch.
For fraud-batch, a transfer might look like this:
10:41:05 ctrl-a renews scheduler-shard-12 at epoch 1842
10:41:08 ctrl-a pauses during garbage collection
10:41:22 ctrl-b observes expired lease and acquires epoch 1843
10:41:23 ctrl-b resumes repair work for fraud-batch
10:41:24 ctrl-a wakes up and tries to write a replacement request with epoch 1842
10:41:24 API rejects ctrl-a because scheduler-shard-12 is now epoch 1843
The transfer is safe because authority moved in durable state and the old epoch was fenced out. It is live because ctrl-b can resume queue processing after the timeout instead of waiting for a human to decide whether ctrl-a is dead.
The hard part is choosing which actions require fencing. Reading state does not. Updating local metrics usually does not. Writing a binding, claiming a shard, releasing quota, or declaring a task failed does. Dangerous writes should carry enough ownership evidence for the receiving API to reject stale actors.
Failure Modes
- Lease as a death detector: treating expiry as proof that the old leader is gone. The fix is to treat expiry as permission to attempt takeover, then fence stale writes.
- No fencing token: the old leader resumes and writes after a new leader takes over. The fix is epochs or generations checked by the authoritative mutation point.
- Lease too short: normal pauses cause repeated leadership churn. The fix is to tune duration, renewal interval, and retry jitter against real latency distributions.
- Lease too long: real failure leaves work idle for too long. The fix is to align lease duration with recovery objectives and the cost of delayed scheduling.
- Oversized scope: one global leader becomes a bottleneck. The fix is sharding, but each shard still needs clear ownership.
- Split ownership model: a controller uses leases for queue processing but writes bindings without checking the lease epoch. The fix is end-to-end authority, not just election ceremony.
Design Rules
- Lease scopes should match the work that needs exclusive authority.
- Renewal should happen well before expiry, with jitter and backoff for API pressure.
- Losing renewal should stop privileged actuation quickly.
- Every dangerous write should include a fencing token or equivalent ownership proof.
- Lease duration should be set from failure and latency assumptions, not copied blindly.
- Leadership state should be observable: holder, epoch, renew time, renewal errors, and takeover count.
Connections
- The previous lesson,
004.md, explained why reconcilers need idempotent actions. Leases add the ownership rule that decides which replica may perform those actions. - The next lesson,
006.md, uses this ownership foundation when walking through filtering, scoring, and binding in scheduler architecture. consensus-and-coordinationprovides the deeper background for compare-and-swap, leadership, fencing, and coordination services.
Resources
- [DOC] Kubernetes Leases
- Focus: Study how lease objects represent time-bounded ownership in a real control plane.
- [DOC] client-go Leader Election
- Focus: Look for renewal deadlines, retry periods, callbacks, and the assumptions behind leader transitions.
- [PAPER] The Chubby Lock Service for Loosely-Coupled Distributed Systems
- Focus: Use the paper to connect practical leases and locks with production coordination needs.
- [DOC] ZooKeeper Recipes and Solutions
- Focus: Compare leader election and lock recipes with lease-based controller ownership.
Key Takeaways
- Leases give one controller replica time-bounded authority for a clearly defined scope.
- Expiry is not proof that the old leader is dead; safe transfer requires fencing stale actors at mutation points.
- The central trade-off is faster failover versus stronger protection against false takeover and split ownership.
- Leadership is only useful when dangerous writes check the same ownership model that elected the leader.