Distributed Schedulers and Control Planes: Safety, Liveness, and Failure Modes

003 35 min advanced

The core idea: A scheduler is useful only if it preserves safety while still making liveness progress under partial failure, so the design trade-off is between refusing uncertain actions and repairing stalled work quickly enough to matter.

Core Insight

Suppose fraud-batch needs one exclusive GPU, must run in eu-central, and belongs to a tenant with a strict quota. The scheduler sees node-gpu-7 as eligible, but the quota controller is slow to publish its latest usage, and the node heartbeat has not arrived for several seconds. The control plane now faces a choice: bind the job anyway, wait for stronger evidence, or place the job in a recoverable pending state.

That choice is where safety and liveness separate. Safety means the control plane never makes a forbidden state true: two workloads cannot own the same exclusive GPU, a tenant cannot exceed quota, a workload cannot run in the wrong region, and a controller cannot mutate state it does not own. Liveness means the system eventually makes useful progress when the required authority and capacity exist: eligible work should not sit forever in a queue because one cache entry is stale or one repair loop missed an event.

Distributed schedulers spend most of their complexity budget on this boundary. If they act too aggressively, they can violate safety with duplicate placement, quota bypass, or split ownership. If they act too conservatively, they can preserve every invariant while the platform fails to run available work. Failure modes are the patterns that push the system toward one side of that trade-off.

Safety Is About Forbidden States

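Safety properties say what must never happen. In a scheduler and control plane, they are usually expressed as invariants over durable state: an exclusive GPU has at most one owner, a tenant's reserved usage never exceeds its quota, a workload runs only in its required region, and a controller mutates only the state it owns.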

These rules are stronger than "the scheduler should usually choose a good node." They define states the system must not enter, even during retries, restarts, leadership changes, stale watches, and network partitions. Safety is why a scheduler often writes through an authoritative API instead of treating its local cache as truth. The cache can guide a proposal; the authoritative write decides whether the proposal is still valid.
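
One way to make such invariants concrete is to write them as predicates over a snapshot of durable state. The sketch below is illustrative only: the Binding record, its fields, and the gpu_quota table are hypothetical names, and it assumes each binding consumes exactly one exclusive GPU.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    # Hypothetical durable record of one accepted placement.
    workload: str
    tenant: str
    node: str
    gpu: int              # index of the exclusive GPU on the node
    node_region: str
    required_region: str


def violations(bindings, gpu_quota):
    """Return a list of invariant violations; an empty list means the snapshot is safe."""
    found = []

    # Invariant: an exclusive GPU has at most one owner.
    owners = Counter((b.node, b.gpu) for b in bindings)
    for (node, gpu), count in owners.items():
        if count > 1:
            found.append(f"gpu {node}/{gpu} has {count} owners")

    # Invariant: a tenant never holds more GPUs than its quota.
    used = Counter(b.tenant for b in bindings)
    for tenant, count in used.items():
        limit = gpu_quota.get(tenant, 0)
        if count > limit:
            found.append(f"tenant {tenant} uses {count} gpus, quota is {limit}")

    # Invariant: a workload runs only in its required region.
    for b in bindings:
        if b.node_region != b.required_region:
            found.append(f"{b.workload} is in {b.node_region}, needs {b.required_region}")

    return found
```

A checker like this is useful in tests and audits, but it only detects a forbidden state after the fact. Preventing the state still requires the authoritative write to refuse it.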

The common design mistake is to treat a positive observation as a reservation. Seeing that node-gpu-7 looked free a moment ago is not the same as owning its GPU. A safe design turns the observation into a conditional write: bind only if the resource is still available, the workload version has not changed, and the quota reservation is still valid. If the write is rejected, the scheduler learns something and retries through the normal control flow.
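
The conditional write described above can be sketched as a single guarded operation. This is a minimal in-memory illustration, not a real API: PlacementStore, bind_if, and Conflict are hypothetical names, and a production store would have to make the checks and the write atomic (for example with a transaction or compare-and-swap), which this single-threaded sketch only assumes.

```python
from dataclasses import dataclass, field


class Conflict(Exception):
    """Raised when a precondition of the conditional write no longer holds."""


@dataclass
class PlacementStore:
    # Maps (node, gpu index) to the owning workload; an absent key means free.
    gpu_owner: dict = field(default_factory=dict)
    # Maps workload name to its current spec version.
    workload_version: dict = field(default_factory=dict)
    # Quota reservation ids that are still live.
    valid_reservations: set = field(default_factory=set)

    def bind_if(self, workload, expected_version, node, gpu, reservation_id):
        """Bind only if every precondition still holds at write time."""
        if self.workload_version.get(workload) != expected_version:
            raise Conflict("workload spec changed since it was read")
        if reservation_id not in self.valid_reservations:
            raise Conflict("quota reservation is no longer valid")
        if (node, gpu) in self.gpu_owner:
            raise Conflict("gpu is already owned")
        # All checks passed: the observation becomes ownership only here.
        self.gpu_owner[(node, gpu)] = workload
```

A rejected write is not an error to hide. The Conflict reason tells the scheduler which piece of its cached view was stale, so it can refresh that piece and retry.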

Liveness Is About Eventual Progress

Liveness properties say what should eventually happen. They do not promise that every request succeeds immediately. They promise that when the required conditions are present, the control plane does not get stuck forever.

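For fraud-batch, liveness might mean that a slow quota answer leaves the job visibly waiting with a scheduled retry, that an accepted binding either reaches running or times out into repair, and that work on a lost node is eventually released and requeued.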

Liveness depends on durable queues, watches, periodic resync, leases, timeouts, and idempotent actuation. A pure event-driven controller can miss progress if an event is dropped or a process restarts between receiving and recording it. A pure polling controller can waste capacity and overload the API. Most production control planes combine both: watch for fast reaction, resync for recovery, and keep enough durable state to know which action is still owed.
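
The watch-plus-resync combination can be sketched as one work queue fed from two sources. The names here are assumptions for illustration: list_pending stands in for a read of durable state that returns the keys still owed work, and act stands in for an idempotent action on one key.

```python
from collections import deque


class Controller:
    """Reconciler fed by both watch events and periodic resync."""

    def __init__(self, list_pending, act):
        self.list_pending = list_pending   # hypothetical: keys still owed work
        self.act = act                     # hypothetical: idempotent action for one key
        self.queue = deque()
        self.queued = set()                # dedupe so a key is queued at most once

    def enqueue(self, key):
        if key not in self.queued:
            self.queued.add(key)
            self.queue.append(key)

    def on_event(self, key):
        # Fast path: a watch event names the key that changed.
        self.enqueue(key)

    def resync(self):
        # Recovery path: re-list durable state in case an event was dropped.
        for key in self.list_pending():
            self.enqueue(key)

    def drain(self):
        # Process everything currently queued, one key at a time.
        while self.queue:
            key = self.queue.popleft()
            self.queued.discard(key)
            self.act(key)
```

Because act is idempotent, it does not matter whether a key arrives through an event, a resync, or both. A dropped event delays the work until the next resync instead of losing it.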

Liveness also needs bounded waiting. Waiting for stronger evidence is reasonable when the next action is dangerous, but an unbounded wait turns uncertainty into an outage. A job that waits for a missing heartbeat should either observe recovery, time out into a repair path, or be requeued with a reason that another controller can inspect.
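
Bounded waiting can be expressed as a pure decision function that always returns one of three outcomes together with a reason. This is a sketch under stated assumptions: the function name, the Outcome values, and the parameters are hypothetical, and all times are seconds on the same monotonic clock.

```python
from enum import Enum


class Outcome(Enum):
    KEEP_WAITING = "keep_waiting"
    PROCEED = "proceed"
    REPAIR = "repair"


def check_wait(now, wait_started, last_heartbeat, heartbeat_max_age, wait_limit):
    """Decide what a job waiting on a node heartbeat should do at time `now`.

    last_heartbeat is None if no heartbeat has ever been seen.
    Returns (Outcome, reason) so another controller can inspect why.
    """
    fresh = last_heartbeat is not None and now - last_heartbeat <= heartbeat_max_age
    if fresh:
        return Outcome.PROCEED, "heartbeat recovered"
    if now - wait_started >= wait_limit:
        # The wait is bounded: past the limit, hand off to a repair path.
        return Outcome.REPAIR, f"no fresh heartbeat within {wait_limit}s"
    return Outcome.KEEP_WAITING, "heartbeat stale, deadline not reached"
```

Keeping the decision separate from the clock and the I/O makes the timeout behavior easy to test, and the returned reason is what gets recorded on the waiting job.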

Failure Modes At The Boundary

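The same failure can look different depending on which property it threatens. A stale cache entry, for example, threatens safety if the scheduler binds from it without a conditional write, and threatens liveness if the scheduler keeps skipping a node that has since become free.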

Good designs make these modes visible in state. They distinguish pending, reserved, bound, running, unknown, failed, and released instead of hiding everything behind a single boolean. They also record who made the last transition and which evidence was used. That evidence turns "the scheduler is stuck" into a more precise diagnosis.
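
The explicit states above can be sketched as a small state machine that records the actor and evidence for every transition. The state names come from the text; the ALLOWED transition table, the Workload class, and the actor strings are assumptions for illustration, and a real platform would define its own table.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    PENDING = "pending"
    RESERVED = "reserved"
    BOUND = "bound"
    RUNNING = "running"
    UNKNOWN = "unknown"
    FAILED = "failed"
    RELEASED = "released"


# Assumed transition table, chosen only to illustrate the idea.
ALLOWED = {
    State.PENDING: {State.RESERVED, State.FAILED},
    State.RESERVED: {State.BOUND, State.RELEASED},
    State.BOUND: {State.RUNNING, State.UNKNOWN, State.RELEASED},
    State.RUNNING: {State.UNKNOWN, State.FAILED, State.RELEASED},
    State.UNKNOWN: {State.RUNNING, State.FAILED, State.RELEASED},
    State.FAILED: {State.RELEASED},
    State.RELEASED: {State.PENDING},
}


@dataclass
class Workload:
    name: str
    state: State = State.PENDING
    # Each entry is (old state, new state, actor, evidence).
    history: list = field(default_factory=list)

    def transition(self, new_state, actor, evidence):
        """Move to new_state if allowed, recording who did it and why."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} is not allowed")
        self.history.append((self.state, new_state, actor, evidence))
        self.state = new_state
```

With this shape, a stuck workload can be diagnosed from its last history entry: which controller moved it, into which state, and on what evidence.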

Worked Example: A Conservative Binding

Return to fraud-batch. The scheduler finds node-gpu-7 and wants to bind the workload. A safety-first placement flow might be:

1. Read desired workload version: gpu=1, region=eu-central, tenant=fraud.
2. Read observed node cache: node-gpu-7 appears eligible.
3. Request quota reservation through the quota authority.
4. Write a conditional binding through the placement authority.
5. If accepted, wait for node-local status.
6. If status is missing past a timeout, move to repair through an owned transition.

The important detail is that steps 3 and 4 are not advisory. They are the points where safety is enforced. The scheduler can score nodes from cache, but quota and binding must be committed by the systems that own those invariants.
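
Steps 1 through 4 of the flow can be sketched as follows. Everything named here is hypothetical: QuotaAuthority and PlacementAuthority stand in for the systems that own those invariants, and the sketch simplifies each node to a single exclusive GPU so that a node name identifies the resource. Steps 5 and 6 (waiting for status and moving to repair) are left out.

```python
class QuotaAuthority:
    """Hypothetical owner of the quota invariant."""

    def __init__(self, limits):
        self.limits = dict(limits)
        self.used = {}

    def reserve(self, tenant, gpus):
        if self.used.get(tenant, 0) + gpus > self.limits.get(tenant, 0):
            return False
        self.used[tenant] = self.used.get(tenant, 0) + gpus
        return True

    def release(self, tenant, gpus):
        self.used[tenant] = self.used.get(tenant, 0) - gpus


class PlacementAuthority:
    """Hypothetical owner of the one-owner-per-GPU invariant."""

    def __init__(self):
        self.owner = {}

    def bind_if_free(self, node, workload):
        if node in self.owner:
            return False
        self.owner[node] = workload
        return True


def place(spec, node_cache, quota, placement):
    """Run steps 1 to 4 for one workload spec and return (state, detail)."""
    # Step 2: the cache only proposes candidates; it is not authoritative.
    candidates = [
        node for node, info in node_cache.items()
        if info["region"] == spec["region"] and info["free_gpus"] >= spec["gpu"]
    ]
    if not candidates:
        return "pending", "no eligible node in cache"
    # Step 3: quota is enforced by its owner, not by the scheduler.
    if not quota.reserve(spec["tenant"], spec["gpu"]):
        return "pending", "quota reservation refused"
    # Step 4: binding is enforced by its owner and may reject a stale proposal.
    for node in candidates:
        if placement.bind_if_free(node, spec["name"]):
            return "bound", node
    # Every candidate was taken: give the reservation back before a later retry.
    quota.release(spec["tenant"], spec["gpu"])
    return "pending", "all cached candidates were already bound"
```

Every exit that is not a successful bind returns a pending state with a reason, so a refused proposal stays visible rather than silently dropped.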

Now consider liveness. If the quota controller is slow, the job cannot remain invisible. The scheduler should record why the workload is waiting and arrange a wakeup: quota watch, delayed retry, or periodic resync. If node-gpu-7 stops heartbeating after the binding is accepted, the repair controller should not immediately create duplicate execution. It should follow an explicit timeout and release path, then requeue work once the old ownership has been resolved.
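
The timeout-then-release-then-requeue ordering can be sketched with a lease. This is an assumption-heavy illustration: the Lease record, repair_step, and the grace period are hypothetical, and the sketch assumes the node stops running the workload once its lease has expired, which a real system would need to enforce separately.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    workload: str
    node: str
    expires_at: float      # the node must renew before this time
    released: bool = False


def repair_step(lease, now, grace, queue):
    """Advance one lease through wait, release, and requeue; return the action taken."""
    if lease.released:
        # Already handled: repeating the step must not requeue twice.
        return "none"
    if now < lease.expires_at + grace:
        # Ownership may still be live; starting a second copy now could
        # produce duplicate execution.
        return "wait"
    # Resolve the old ownership first, then make the work schedulable again.
    lease.released = True
    queue.append(lease.workload)
    return "requeued"
```

The released flag makes the step idempotent, so a repair controller that restarts or runs the step twice still requeues the workload only once.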

The control plane is healthy when both halves are true: forbidden states are rejected, and recoverable states eventually move forward.

Design Rules

Connections

Resources

Key Takeaways
