Distributed Schedulers and Control Planes: Leases, Leadership, and Ownership Transfer

LESSON

Distributed Schedulers and Control Planes

005 35 min advanced

Distributed Schedulers and Control Planes: Leases, Leadership, and Ownership Transfer

The core idea: A lease gives one controller replica time-bounded authority to act, so the design trade-off is between fast failover and strong protection against two owners acting at once.

Core Insight

Suppose the repair controller for fraud-batch runs as three replicas: ctrl-a, ctrl-b, and ctrl-c. They all watch the same workload queue, and they can all see that one GPU binding is stuck in unknown. If all three replicas try to repair it at once, the reconciliation loop from the previous lesson is no longer safe. The system needs high availability, but it also needs one clear owner for each dangerous transition.

A lease is the usual compromise. It is a durable record that says "this actor may act for this scope until this time, if it keeps renewing." The scope might be the whole scheduler, one shard of the scheduling queue, one tenant, or one resource partition. A leader is simply the actor currently holding the lease for that scope.

The non-obvious point is that a lease is not proof that the old leader is dead. It is a rule that other actors can use to decide when authority may transfer. A paused process, delayed packet, slow disk, or skewed clock can make an old leader look absent even though it will resume later. Safe control planes combine leases with fencing tokens, conditional writes, and ownership checks so an expired leader cannot keep mutating state after a new leader takes over.

Why Leases Exist

Control planes often run multiple replicas because the controller itself can fail. If the only scheduler process crashes, no new work is placed. If the only repair loop stops, failed work stays failed. Running multiple replicas improves liveness, but it creates a safety problem: replicated controllers may duplicate the same action.

Leases draw an authority boundary around a scope of work. For example:

The lease holder can process the associated work queue and write owned transitions. Other replicas can watch, stay warm, and attempt takeover only after the lease is not renewed. This keeps standby replicas useful without letting every replica act as if it owns the system.

The main trade-off is lease duration. Short leases detect leader failure quickly and improve failover. They also increase sensitivity to pauses, clock problems, and transient API slowness. Long leases reduce false takeover, but they make the platform wait longer after a real leader failure. The right value depends on how expensive duplicate action is compared with delayed progress.

What A Lease Record Carries

A useful lease record is more than a boolean lock. It usually contains enough information to make takeover and fencing debuggable:

scope: scheduler-shard-12
holder: ctrl-a
lease_duration: 15s
renew_time: 10:41:05.120
epoch: 1842
resource_version: 991477

The holder names the current actor. The renew_time and lease_duration define when others may try to acquire. The resource_version or compare-and-swap condition prevents two contenders from both believing they updated the same record. The epoch is a fencing token: every privileged write made by the leader can include the epoch, and the authority can reject writes from older epochs.

That fencing token matters because process state can lag behind authority. Imagine ctrl-a holds epoch 1842, then stalls for twenty seconds. ctrl-b sees the lease expire and acquires epoch 1843. When ctrl-a wakes up, it may still have local memory that says it is leader. If the binding API accepts writes only from the current epoch, ctrl-a becomes harmless: its stale repair request is rejected.

Without fencing, the system has only a polite agreement between processes. With fencing, the authority enforces the ownership transfer.

Leadership Loop

A leader election loop is itself a small reconciler. It repeatedly reads the lease record, tries a conditional update, and renews before expiry:

while running:
  lease = read(scope)

  if lease is empty or expired:
    try_acquire(scope, holder=self, epoch=lease.epoch + 1)

  if lease.holder == self:
    renew_before_deadline(scope, epoch=lease.epoch)
    process_owned_queue(scope, epoch=lease.epoch)

  sleep(jittered_interval)

The loop should separate "I am currently leader" from "my last renewal succeeded." If renewal fails, the replica should stop starting new privileged work. It may finish local cleanup that does not mutate shared authority, but it should not bind workloads, release quota, or declare ownership over new transitions.

This is also why leader election should be boring and conservative. A controller that continues acting after losing the lease is worse than a controller that pauses too often. Pausing hurts liveness; stale authority can break safety across many workloads.

Ownership Transfer

Ownership transfer is the path from old leader to new leader. A safe transfer has three parts:

  1. The old leader stops renewing or loses its ability to renew.
  2. The new leader acquires the lease through an authoritative conditional write.
  3. Shared mutation points reject actions from the old leader's epoch.

For fraud-batch, a transfer might look like this:

10:41:05  ctrl-a renews scheduler-shard-12 at epoch 1842
10:41:08  ctrl-a pauses during garbage collection
10:41:22  ctrl-b observes expired lease and acquires epoch 1843
10:41:23  ctrl-b resumes repair work for fraud-batch
10:41:24  ctrl-a wakes up and tries to write a replacement request with epoch 1842
10:41:24  API rejects ctrl-a because scheduler-shard-12 is now epoch 1843

The transfer is safe because authority moved in durable state and the old epoch was fenced out. It is live because ctrl-b can resume queue processing after the timeout instead of waiting for a human to decide whether ctrl-a is dead.

The hard part is choosing which actions require fencing. Reading state does not. Updating local metrics usually does not. Writing a binding, claiming a shard, releasing quota, or declaring a task failed does. Dangerous writes should carry enough ownership evidence for the receiving API to reject stale actors.

Failure Modes

Design Rules

Connections

Resources

Key Takeaways

PREVIOUS Distributed Schedulers and Control Planes: Reconciliation Loops, Work Queues, and Idempotent Actuation NEXT Distributed Schedulers and Control Planes: Scheduling Architecture, Filtering, Scoring, and Binding