Distributed Schedulers and Control Planes: Safety, Liveness, and Failure Modes

Distributed Schedulers and Control Planes

Lesson 003, 35 min, advanced

The core idea: A scheduler is useful only if it preserves safety while still making progress under partial failure, so the design trade-off is between refusing uncertain actions and repairing stalled work quickly enough to matter.

Core Insight

Suppose fraud-batch needs one exclusive GPU, must run in eu-central, and belongs to a tenant with a strict quota. The scheduler sees node-gpu-7 as eligible, but the quota controller is slow to publish its latest usage, and the node heartbeat has not arrived for several seconds. The control plane now faces a choice: bind the job anyway, wait for stronger evidence, or place the job in a recoverable pending state.

That choice is where safety and liveness separate. Safety means the control plane never makes a forbidden state true: two workloads cannot own the same exclusive GPU, a tenant cannot exceed quota, a workload cannot run in the wrong region, and a controller cannot mutate state it does not own. Liveness means the system eventually makes useful progress when the required authority and capacity exist: eligible work should not sit forever in a queue because one cache entry is stale or one repair loop missed an event.

Distributed schedulers spend most of their complexity budget on this boundary. If they act too aggressively, they can violate safety with duplicate placement, quota bypass, or split ownership. If they act too conservatively, they can preserve every invariant while the platform fails to run available work. Failure modes are the patterns that push the system toward one side of that trade-off.

Safety Is About Forbidden States

Safety properties say what must never happen. In a scheduler and control plane, they are usually expressed as invariants over durable state:

One exclusive resource has at most one accepted owner.
A workload runs only in an allowed region, zone, or tenancy boundary.
A binding is accepted only by the authority that owns placement.
A repair controller does not create duplicate work for a task that is already running.
A quota reservation is either committed once or clearly released.

These rules are stronger than "the scheduler should usually choose a good node." They define states the system must not enter, even during retries, restarts, leadership changes, stale watches, and network partitions. Safety is why a scheduler often writes through an authoritative API instead of treating its local cache as truth. The cache can guide a proposal; the authoritative write decides whether the proposal is still valid.
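To make the idea of an invariant over durable state concrete, here is a minimal sketch that checks two of the rules above (single owner per exclusive resource, allowed region) against a list of accepted bindings. The `Binding` record, its field names, and the `node-gpu-7/gpu0` resource naming are assumptions made for illustration, not part of any real scheduler API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    """One accepted binding, as it might appear in durable state (hypothetical shape)."""
    workload: str
    resource: str               # e.g. "node-gpu-7/gpu0" (illustrative naming)
    region: str
    allowed_regions: frozenset


def violations(bindings):
    """Return a description of every invariant the given bindings violate."""
    problems = []

    # Invariant: one exclusive resource has at most one accepted owner.
    owner_counts = Counter(b.resource for b in bindings)
    for resource, count in owner_counts.items():
        if count > 1:
            problems.append(f"{resource} has {count} owners")

    # Invariant: a workload runs only in an allowed region.
    for b in bindings:
        if b.region not in b.allowed_regions:
            problems.append(f"{b.workload} runs in disallowed region {b.region}")

    return problems
```

A checker like this is useful as an audit or test oracle. It does not enforce anything by itself; enforcement still has to happen at the authoritative write.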

The common design mistake is to treat a positive observation as a reservation. Seeing that node-gpu-7 looked free a moment ago is not the same as owning its GPU. A safe design turns the observation into a conditional write: bind only if the resource is still available, the workload version has not changed, and the quota reservation is still valid. If the write is rejected, the scheduler learns something and retries through the normal control flow.
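The conditional write described above can be sketched as a compare-and-set against an authoritative store. `PlacementStore` and `bind_if` are hypothetical names for an in-memory stand-in; a real placement authority would need to evaluate these checks and commit the write atomically.

```python
from dataclasses import dataclass, field


@dataclass
class PlacementStore:
    """In-memory stand-in for an authoritative placement API (hypothetical)."""
    owners: dict = field(default_factory=dict)             # resource -> workload
    workload_versions: dict = field(default_factory=dict)  # workload -> current version
    reservations: set = field(default_factory=set)         # valid quota reservation ids

    def bind_if(self, workload, resource, expected_version, reservation_id):
        """Commit the binding only if every precondition still holds."""
        # The resource must be free, or already owned by this same workload
        # (which makes a retried bind a harmless no-op).
        if self.owners.get(resource) not in (None, workload):
            return False, "resource already owned"
        # The workload spec must not have changed since the scheduler read it.
        if self.workload_versions.get(workload) != expected_version:
            return False, "workload version changed"
        # The quota reservation must still be valid.
        if reservation_id not in self.reservations:
            return False, "quota reservation invalid"
        self.owners[resource] = workload
        return True, "bound"
```

The rejection reason is returned rather than swallowed, so the scheduler can decide whether to re-read the workload, pick another node, or wait for quota.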

Liveness Is About Eventual Progress

Liveness properties say what should eventually happen. They do not promise that every request succeeds immediately. They promise that when the required conditions are present, the control plane does not get stuck forever.

For fraud-batch, liveness might mean:

A pending job is reconsidered when quota becomes available.
A failed binding eventually returns to the scheduling queue.
A node that recovers eventually reports status and accepts eligible work.
A controller that restarts resumes from durable state instead of losing the work item.
A transient API failure produces backoff and retry, not silent abandonment.

Liveness depends on durable queues, watches, periodic resync, leases, timeouts, and idempotent actuation. A pure event-driven controller can miss progress if an event is dropped or a process restarts between receiving and recording it. A pure polling controller can waste capacity and overload the API. Most production control planes combine both: watch for fast reaction, resync for recovery, and keep enough durable state to know which action is still owed.
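A minimal sketch of the watch-plus-resync combination, assuming a durable store that maps each workload to a status string. The `Controller` class and its method names are illustrative only. Watch events feed the queue on the fast path, and `resync` rescans durable state so that a dropped event delays work instead of losing it.

```python
from collections import deque


class Controller:
    """Toy controller combining event-driven wakeups with periodic resync."""

    def __init__(self, store):
        self.store = store      # durable state: workload -> status (assumed shape)
        self.queue = deque()
        self.queued = set()     # keys currently in the queue, for deduplication

    def enqueue(self, key):
        if key not in self.queued:  # dedupe so resync does not pile up work
            self.queued.add(key)
            self.queue.append(key)

    def on_event(self, key):
        """Fast path: a watch event names a workload to reconsider."""
        self.enqueue(key)

    def resync(self):
        """Recovery path: rescan durable state for work that is still owed."""
        for key, status in self.store.items():
            if status == "pending":
                self.enqueue(key)

    def drain(self, try_schedule):
        """Process queued keys; try_schedule(key) returns True on success."""
        while self.queue:
            key = self.queue.popleft()
            self.queued.discard(key)
            if self.store.get(key) == "pending" and try_schedule(key):
                self.store[key] = "bound"
```

In this sketch a workload whose event was never delivered stays pending until the next `resync` call, which bounds the delay by the resync period.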

Liveness also needs bounded waiting. Waiting for stronger evidence is reasonable when the next action is dangerous, but an unbounded wait turns uncertainty into an outage. A job that waits for a missing heartbeat should either observe recovery, time out into a repair path, or be requeued with a reason that another controller can inspect.
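Bounded waiting can be expressed as a small decision function. The sketch below assumes timestamps in seconds; the thresholds `fresh_for` and `max_wait` are illustrative values, not recommendations. Every outcome carries a reason string that another controller could inspect.

```python
def heartbeat_decision(now, last_heartbeat, waiting_since,
                       fresh_for=10.0, max_wait=60.0):
    """Decide what to do with a job that is waiting on a node heartbeat.

    Returns (action, reason), where action is "proceed", "wait", or "repair".
    """
    # Recovery observed: the heartbeat is recent enough to trust.
    if last_heartbeat is not None and now - last_heartbeat <= fresh_for:
        return "proceed", "heartbeat fresh"
    # The wait is bounded: past the deadline, hand off to the repair path.
    if now - waiting_since >= max_wait:
        return "repair", "heartbeat missing past deadline"
    # Still inside the deadline: keep waiting, with a visible reason.
    return "wait", "heartbeat stale; will recheck before deadline"
```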

Failure Modes At The Boundary

The same failure can look different depending on which property it threatens.

Stale eligibility: the scheduler cache says a node has free GPU capacity, but the authoritative resource state has changed. Acting directly from the cache threatens safety; rejecting every stale cache entry forever threatens liveness.
Ambiguous ownership: two controllers both believe they own the same transition, such as pending-to-bound or bound-to-failed. That can violate safety through duplicate writes and hurt liveness through oscillation.
Lost wakeup: a quota release or node recovery event is missed. Safety may still hold, but liveness fails because eligible work is never reconsidered.
Retry amplification: many controllers observe missing status and all retry at once. Liveness logic turns into load that prevents the API from recovering.
Fail-open repair: a repair loop starts replacement work before proving that the previous owner is gone. This improves apparent progress while risking duplicate execution.
Fail-closed repair: the system refuses replacement until every uncertainty is resolved. This protects invariants while leaving capacity idle and work pending.
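One common mitigation for retry amplification is exponential backoff with random jitter, so that controllers that observed the same failure do not all retry at the same moment. This technique is not prescribed by the lesson; the sketch below is one possible form, and the `base` and `cap` values are assumptions.

```python
import random


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full-jitter backoff: a uniform delay in [0, min(cap, base * 2**attempt)].

    attempt is the zero-based retry count; rng is any object with a
    uniform(a, b) method, such as the random module or a random.Random.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

The cap keeps the worst-case delay bounded, which matters for liveness: a retry that backs off without limit is another form of unbounded waiting.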

Good designs make these modes visible in state. They distinguish pending, reserved, bound, running, unknown, failed, and released instead of hiding everything behind a single boolean. They also record who made the last transition and which evidence was used. That evidence turns "the scheduler is stuck" into a more precise diagnosis.
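The states named above can be modelled as an explicit state machine in which each transition has exactly one owner and every change records its actor and evidence. The transition table and the controller names (`quota`, `scheduler`, `node-agent`, `repair`) are assumptions chosen for illustration; a real system would define its own.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    PENDING = "pending"
    RESERVED = "reserved"
    BOUND = "bound"
    RUNNING = "running"
    UNKNOWN = "unknown"
    FAILED = "failed"
    RELEASED = "released"


# (from, to) -> the single controller allowed to make that transition (assumed).
OWNERS = {
    (State.PENDING, State.RESERVED): "quota",
    (State.RESERVED, State.BOUND): "scheduler",
    (State.BOUND, State.RUNNING): "node-agent",
    (State.BOUND, State.UNKNOWN): "repair",
    (State.RUNNING, State.UNKNOWN): "repair",
    (State.UNKNOWN, State.RUNNING): "node-agent",
    (State.UNKNOWN, State.FAILED): "repair",
    (State.FAILED, State.RELEASED): "repair",
    (State.RELEASED, State.PENDING): "scheduler",
}


@dataclass
class Workload:
    name: str
    state: State = State.PENDING
    history: list = field(default_factory=list)  # (from, to, actor, evidence)

    def transition(self, to, actor, evidence):
        """Apply a transition only if it is legal and the actor owns it."""
        owner = OWNERS.get((self.state, to))
        if owner is None:
            raise ValueError(f"illegal transition {self.state.value} -> {to.value}")
        if owner != actor:
            raise PermissionError(f"{actor} does not own {self.state.value} -> {to.value}")
        self.history.append((self.state, to, actor, evidence))
        self.state = to
```

With the history attached to the workload, a stuck job can be diagnosed by reading its last transition and the evidence behind it.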

Worked Example: A Conservative Binding

Return to fraud-batch. The scheduler finds node-gpu-7 and wants to bind the workload. A safety-first placement flow might be:

1. Read desired workload version: gpu=1, region=eu-central, tenant=fraud.
2. Read observed node cache: node-gpu-7 appears eligible.
3. Request quota reservation through the quota authority.
4. Write a conditional binding through the placement authority.
5. If accepted, wait for node-local status.
6. If status is missing past a timeout, move to repair through an owned transition.

The important detail is that steps 3 and 4 are not advisory. They are the points where safety is enforced. The scheduler can score nodes from cache, but quota and binding must be committed by the systems that own those invariants.
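Steps 3 and 4 of the flow can be sketched with two small in-memory authorities. `QuotaAuthority`, `PlacementAuthority`, and `place` are hypothetical names; steps 1 and 2 (reading the workload version and scoring from cache) are assumed to have already produced the `node` argument. The sketch also shows the reservation being released when the binding is rejected, so the quota is either committed once or clearly released.

```python
class QuotaAuthority:
    """Owns the quota invariant: a tenant never exceeds its GPU limit."""

    def __init__(self, limits):
        self.limits = dict(limits)  # tenant -> max GPUs
        self.used = {}              # tenant -> GPUs currently reserved
        self.reservations = {}      # workload -> (tenant, gpus)

    def reserve(self, workload, tenant, gpus):
        if workload in self.reservations:  # idempotent: already reserved
            return True
        if self.used.get(tenant, 0) + gpus > self.limits.get(tenant, 0):
            return False
        self.used[tenant] = self.used.get(tenant, 0) + gpus
        self.reservations[workload] = (tenant, gpus)
        return True

    def release(self, workload):
        tenant, gpus = self.reservations.pop(workload, (None, 0))
        if tenant is not None:
            self.used[tenant] -= gpus


class PlacementAuthority:
    """Owns the placement invariant: one owner per exclusive node GPU."""

    def __init__(self):
        self.owners = {}  # node -> workload

    def bind(self, workload, node):
        # setdefault writes only if the node is unowned, then reports the owner.
        return self.owners.setdefault(node, workload) == workload


def place(workload, tenant, node, quota, placement):
    """Steps 3-4: reserve quota, then bind; undo the reservation on rejection."""
    if not quota.reserve(workload, tenant, 1):
        return "pending", "waiting for quota"
    if not placement.bind(workload, node):
        quota.release(workload)
        return "pending", "binding rejected; node already owned"
    return "bound", f"awaiting status from {node}"
```

Calling `place` twice for the same workload and node returns `bound` both times without double-counting quota, which is what lets the scheduler retry safely after a crash between the two steps.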

Now consider liveness. If the quota controller is slow, the job must not become invisible. The scheduler should record why the workload is waiting and arrange a wakeup: a quota watch, a delayed retry, or a periodic resync. If node-gpu-7 stops heartbeating after the binding is accepted, the repair controller should not immediately start a replacement, because that risks duplicate execution. It should follow an explicit timeout and release path, then requeue the work once the old ownership has been resolved.
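The timeout-and-release path can be sketched as a function that advances a binding by at most one state per call. This sketch assumes ownership is backed by a lease that the node must renew, and that a node stops running the workload once its lease expires; both are assumptions, as are the timeout values and the name `repair_step`.

```python
def repair_step(state, now, bound_at, lease_renewed_at,
                status_timeout=30.0, lease_ttl=90.0):
    """Advance a binding by at most one step along the repair path.

    Returns (new_state, reason). Times are in seconds.
    """
    # No node-local status arrived in time: mark unknown, do not replace yet.
    if state == "bound" and now - bound_at >= status_timeout:
        return "unknown", "no node status within timeout"
    # Only release once the old owner's lease has provably expired.
    if state == "unknown" and now - lease_renewed_at >= lease_ttl:
        return "released", "old owner's lease expired"
    # Ownership is resolved, so the work can safely return to the queue.
    if state == "released":
        return "pending", "requeued after release"
    return state, "no action"
```

Because the function never jumps from bound straight to pending, a replacement cannot start while the previous owner might still be running.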

The control plane is healthy when both halves are true: forbidden states are rejected, and recoverable states eventually move forward.

Design Rules

Make invariants explicit before optimizing placement speed.
Separate observation from authority: caches propose, authoritative writes commit.
Give every state transition one owner.
Use idempotent actions so retries do not create duplicate effects.
Combine watch-driven wakeups with periodic resync so missed events do not block liveness.
Treat unknown as a real state with timeout and repair rules.
Prefer visible pending reasons over silent waits.
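The rule about idempotent actions can be illustrated with an operation id that the actuator remembers. `Actuator` and the `bind-42` id are hypothetical; in a real system the record of applied operations would need to be durable, not held in process memory.

```python
class Actuator:
    """Applies start requests at most once per operation id."""

    def __init__(self):
        self.applied = {}   # operation id -> result of the first application
        self.started = []   # workloads actually started, in order

    def start(self, op_id, workload):
        # A retry with the same op_id returns the recorded result
        # instead of starting the workload a second time.
        if op_id in self.applied:
            return self.applied[op_id]
        self.started.append(workload)
        result = f"started {workload}"
        self.applied[op_id] = result
        return result
```

A caller that times out and retries `start("bind-42", "fraud-batch")` gets the same result back, and the workload is started only once.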

Connections

The previous lesson, 002.md, separated desired state from observed state. This lesson adds the safety and liveness properties that constrain actions between those states.
The next lesson, 004.md, turns these properties into reconciliation loops, work queues, and idempotent actuation.
consensus-and-coordination gives the vocabulary for authority, leases, and ownership when multiple actors might act on the same state.

Resources

[BOOK] Designing Data-Intensive Applications
- Focus: Use the chapters on consistency, transactions, and distributed failure to sharpen the distinction between invariants and eventual progress.
[DOC] Kubernetes Controllers
- Focus: Study how control loops compare desired and current state while retrying through API-owned state transitions.
[DOC] Kubernetes Scheduler
- Focus: Connect filtering, scoring, and binding to the safety boundary between scheduler cache and authoritative placement.
[PAPER] Large-scale cluster management at Google with Borg
- Focus: Look for how production schedulers handle placement, recovery, and shared cluster invariants at large scale.

Key Takeaways

Safety says which scheduler states must never become true; liveness says which eligible work must eventually move forward.
Scheduler caches can guide proposals, but authoritative writes must enforce quota, placement, ownership, and tenancy invariants.
The central trade-off is refusing uncertain actions versus repairing stalled work quickly enough to preserve useful progress.
Durable queues, explicit state transitions, timeouts, and idempotent retries are what keep safety and liveness from fighting blindly.
