Distributed Schedulers and Control Planes: Desired State, Observed State, and Control Flow

002 35 min advanced

The core idea: A control plane works by comparing desired state with observed state and issuing bounded actions, so the design trade-off is between converging quickly and avoiding unsafe actions based on stale or partial evidence.

Core Insight

Suppose a team submits a workload named fraud-batch with a desired state: run three replicas, use GPU nodes, keep data in eu-central, and restart failed work automatically. The scheduler does not directly make that world true. It sees API objects, node heartbeats, quota records, placement bindings, and controller status updates. Some of those signals are fresh, some are delayed, and some are temporarily missing.

The common misconception is that a control plane is a command pipeline: request comes in, scheduler chooses, nodes obey. Real control planes behave more like feedback systems. They hold a desired state, sample an observed state, compare the two, and send actions that should reduce the gap. The system is correct only when those actions remain safe despite delay, retries, duplicate events, and partial failure.

That distinction changes how schedulers and reconcilers are designed. A scheduler may create a binding, a node agent may report that work is running, and a repair controller may later notice that the observed state no longer matches the intended state. Each step needs a clear control-flow rule: what evidence is it allowed to trust, which object is it allowed to change, and when should it stop acting because another controller owns the next move?
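The feedback view can be sketched in a few lines. This is a minimal illustration, not any particular platform's API: the names Desired, Observed, reconcile, max_age, and max_step are all hypothetical. The sketch assumes each observation carries its own age, and that a controller may change at most max_step replicas per pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Desired:
    replicas: int  # what the platform should make true


@dataclass(frozen=True)
class Observed:
    running: int        # best current evidence of what is true
    age_seconds: float  # how old that evidence is


def reconcile(desired: Desired, observed: Observed,
              max_age: float = 10.0, max_step: int = 1) -> str:
    """Return one bounded action that should reduce the gap (hypothetical policy)."""
    if observed.age_seconds > max_age:
        # Evidence is too old to act on safely; wait for a fresher sample.
        return "wait"
    gap = desired.replicas - observed.running
    if gap > 0:
        return f"start {min(gap, max_step)}"
    if gap < 0:
        return f"stop {min(-gap, max_step)}"
    return "noop"


# fraud-batch wants three replicas; two are seen running in a fresh sample.
print(reconcile(Desired(replicas=3), Observed(running=2, age_seconds=1.0)))
```

Because the function returns a small step rather than the whole correction, a wrong observation costs at most one unnecessary action per pass, at the price of slower convergence.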

From Intent To Observation

Desired state is the durable statement of what the platform should make true. It is usually stored in an API object, queue entry, workflow record, or declarative configuration. Observed state is the platform's best current evidence about what is actually true: node capacity, pod status, lease ownership, quota usage, health probes, placement records, and controller heartbeats.

A simple scheduler loop looks like this:

desired state     observed state      control action
-------------     --------------      --------------
job needs GPU  ->  node-b has GPU  ->  propose node-b
job bound      ->  node-b accepts  ->  wait for running
job should run ->  node-b failed   ->  requeue or repair

The arrows are not instantaneous. The scheduler may observe node-b before a drain starts. The node may accept a task and then fail before reporting status. The repair controller may see a missing heartbeat before the scheduler sees the failed binding. This is why control planes need versioned objects, owner references, leases, idempotent writes, and clear status fields. They give controllers enough structure to compare state without pretending the observation is perfect.
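One way to picture versioned objects and idempotent writes is an in-memory store with compare-and-set semantics. The sketch below is an assumption for illustration only: Store, Conflict, and put are hypothetical names, and a real control plane would persist objects and handle concurrency rather than use a plain dictionary.

```python
class Conflict(Exception):
    """Raised when a write was based on an outdated version of the object."""


class Store:
    """Hypothetical authoritative store: every object carries a version."""

    def __init__(self):
        self._objects = {}  # name -> (version, value)

    def get(self, name):
        # Objects that do not exist yet are reported as version 0 with no value.
        return self._objects.get(name, (0, None))

    def put(self, name, value, expected_version):
        version, current = self.get(name)
        if current is not None and current == value:
            # Idempotent: repeating a write that already landed changes nothing.
            return version
        if version != expected_version:
            # The writer decided on stale evidence; it must re-read and re-decide.
            raise Conflict(f"{name}: expected v{expected_version}, found v{version}")
        self._objects[name] = (version + 1, value)
        return version + 1


store = Store()
seen_version, _ = store.get("fraud-batch-0")
store.put("fraud-batch-0", {"node": "node-b"}, seen_version)  # first write lands
store.put("fraud-batch-0", {"node": "node-b"}, seen_version)  # retry is harmless
```

A second controller that tries to bind the same object to a different node using the old version gets a Conflict instead of silently overwriting the first decision.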

The useful mental model is not "the control plane sends orders." It is "the control plane keeps narrowing the difference between intent and reality." That narrowing can be fast, slow, noisy, or temporarily paused, depending on how reliable the observations are and how expensive a wrong action would be.

Control Flow Through A Placement

Consider a workload that needs one GPU replica. A well-behaved control flow might look like this:

1. API server stores desired workload: replicas=1, gpu=true.
2. Scheduler watches unscheduled work and cached node state.
3. Scheduler proposes node-gpu-7 and writes a binding through the authority.
4. Node agent observes the binding and starts the workload.
5. Status controller reports running, failed, or unknown.
6. Repair controller reacts if desired and observed state diverge.

Each step has a different ownership boundary. The scheduler owns the placement proposal. The authoritative API owns whether the binding is accepted. The node agent owns local execution. The status path owns observed reality. The repair loop owns convergence after drift. If those ownership boundaries blur, the system can oscillate: the scheduler places work, a repair controller undoes it, another controller retries, and operators see a cluster that is busy but not making progress.
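The ownership boundaries above can be made explicit as a table that maps each field to the single component allowed to write it. This is a rough sketch under stated assumptions: the field names, the writer names, and apply_write are hypothetical and chosen only to mirror the five owners listed in the text.

```python
# Hypothetical mapping: each field has exactly one writer.
FIELD_OWNER = {
    "proposed_node": "scheduler",           # placement proposal
    "binding_accepted": "api",              # authoritative accept or reject
    "local_state": "node-agent",            # local execution
    "status": "status-controller",          # observed reality
    "repair_action": "repair-controller",   # convergence after drift
}


def apply_write(obj: dict, writer: str, field: str, value) -> dict:
    """Return a copy of obj with field set, or refuse if writer is not the owner."""
    owner = FIELD_OWNER.get(field)
    if owner is None:
        raise KeyError(f"unknown field: {field}")
    if owner != writer:
        raise PermissionError(f"{writer} may not write {field}; owner is {owner}")
    updated = dict(obj)
    updated[field] = value
    return updated


workload = {"replicas": 1, "gpu": True}
workload = apply_write(workload, "scheduler", "proposed_node", "node-gpu-7")
workload = apply_write(workload, "api", "binding_accepted", True)
```

With a guard like this, a repair controller that tries to overwrite proposed_node fails loudly, which surfaces the blurred boundary instead of letting two controllers undo each other's work.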

Control flow also needs backpressure. If observations are missing, a controller should not respond by simply retrying faster or acting more aggressively. A fast retry loop can amplify an outage by flooding the API, double-allocating capacity, or repeatedly evicting work that would have recovered. The trade-off is practical: aggressive loops converge quickly when evidence is accurate, but conservative loops are safer when evidence is old, contradictory, or incomplete.
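A common way to add backpressure to a retry loop is capped exponential backoff with jitter. The sketch below assumes hypothetical parameter values (a 1 second base and a 60 second cap); suitable numbers depend on the platform and on how costly a wrong action is.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Upper bound on the wait before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def jittered_delay(attempt: int, rng: random.Random,
                   base: float = 1.0, cap: float = 60.0) -> float:
    """Pick a random wait in [0, bound] so many controllers do not retry in lockstep."""
    return rng.uniform(0.0, backoff_delay(attempt, base, cap))


rng = random.Random(7)  # seeded only to make the example repeatable
for attempt in range(5):
    bound = backoff_delay(attempt)
    wait = jittered_delay(attempt, rng)
    print(f"attempt {attempt}: wait {wait:.2f}s (bound {bound:.0f}s)")
```

The cap keeps a long outage from pushing retries arbitrarily far apart, while the doubling keeps a controller from flooding the API during the first seconds of a failure.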

Worked Example: The Stuck Replica

Imagine fraud-batch has desired state replicas=3, but the dashboard shows only two running replicas. A naive operator might ask, "Why did the scheduler fail to create the third one?" A control-plane view asks a better question: which part of the desired-versus-observed flow is stuck?

There are several different answers:

Those cases require different fixes. Increasing scheduler replicas helps only one of them. The better diagnostic path follows the state transition: desired object, scheduling queue, binding record, node-local action, status update, and repair decision. The control plane becomes debuggable when each transition leaves evidence behind.
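The diagnostic path can be sketched as a walk over the transitions in order, stopping at the first one that left no evidence. The stage names come from the paragraph above; the evidence dictionary and first_missing_stage are hypothetical, and in practice the evidence would come from API objects, logs, and status fields.

```python
from typing import Dict, Optional

# Transitions in the order the text lists them.
STAGES = [
    "desired object",
    "scheduling queue",
    "binding record",
    "node-local action",
    "status update",
    "repair decision",
]


def first_missing_stage(evidence: Dict[str, bool]) -> Optional[str]:
    """Return the earliest stage with no recorded evidence, or None if all are present."""
    for stage in STAGES:
        if not evidence.get(stage, False):
            return stage
    return None


# Hypothetical evidence for the third fraud-batch replica.
third_replica = {
    "desired object": True,
    "scheduling queue": True,
    "binding record": True,
    "node-local action": False,
}
print(first_missing_stage(third_replica))  # prints: node-local action
```

In this made-up case the binding exists but the node never acted on it, so adding scheduler replicas would not help; the investigation belongs at the node agent.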

Operational Failure Modes

Connections

Resources

Key Takeaways
