Distributed Schedulers and Control Planes: Desired State, Observed State, and Control Flow
LESSON
The core idea: A control plane works by comparing desired state with observed state and issuing bounded actions, so the design trade-off is between converging quickly and avoiding unsafe actions based on stale or partial evidence.
Core Insight
Suppose a team submits a workload named fraud-batch with a desired state: run three replicas, use GPU nodes, keep data in eu-central, and restart failed work automatically. The scheduler does not directly make that world true. It sees API objects, node heartbeats, quota records, placement bindings, and controller status updates. Some of those signals are fresh, some are delayed, and some are temporarily missing.
The common misconception is that a control plane is a command pipeline: request comes in, scheduler chooses, nodes obey. Real control planes behave more like feedback systems. They hold a desired state, sample an observed state, compare the two, and send actions that should reduce the gap. The system is correct only when those actions remain safe despite delay, retries, duplicate events, and partial failure.
That distinction changes how schedulers and reconcilers are designed. A scheduler may create a binding, a node agent may report that work is running, and a repair controller may later notice that the observed state no longer matches the intended state. Each step needs a clear control-flow rule: what evidence is it allowed to trust, which object is it allowed to change, and when should it stop acting because another controller owns the next move?
From Intent To Observation
Desired state is the durable statement of what the platform should make true. It is usually stored in an API object, queue entry, workflow record, or declarative configuration. Observed state is the platform's best current evidence about what is actually true: node capacity, pod status, lease ownership, quota usage, health probes, placement records, and controller heartbeats.
A simple scheduler loop looks like this:
desired state       observed state       control action
-------------       --------------       --------------
job needs GPU   ->  node-b has GPU   ->  propose node-b
job bound       ->  node-b accepts   ->  wait for running
job should run  ->  node-b failed    ->  requeue or repair
The arrows are not instantaneous. The scheduler may observe node-b before a drain starts. The node may accept a task and then fail before reporting status. The repair controller may see a missing heartbeat before the scheduler sees the failed binding. This is why control planes need versioned objects, owner references, leases, idempotent writes, and clear status fields. They give controllers enough structure to compare state without pretending the observation is perfect.
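One way to picture versioned objects and idempotent writes is a store that accepts a write only when the writer saw the latest version. The sketch below is illustrative, not the API of any particular system: `VersionedStore`, `Conflict`, the object name `fraud-batch-0`, and the `hold` field are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


class Conflict(Exception):
    """Raised when a write is based on an outdated version of an object."""


@dataclass
class VersionedStore:
    # name -> (resource_version, value); hypothetical in-memory stand-in
    _objects: dict = field(default_factory=dict)

    def get(self, name):
        """Return (version, value); a missing object reads as version 0."""
        return self._objects.get(name, (0, None))

    def put(self, name, value, expected_version):
        """Write only if the caller saw the latest version (compare-and-swap)."""
        current_version, current_value = self.get(name)
        if current_value == value:
            return current_version  # idempotent: a retried write is a no-op
        if expected_version != current_version:
            raise Conflict(
                f"{name}: expected v{expected_version}, found v{current_version}"
            )
        self._objects[name] = (current_version + 1, value)
        return current_version + 1


store = VersionedStore()
v1 = store.put("fraud-batch-0", {"node": None}, expected_version=0)

# The scheduler observes v1, but another controller updates the object first.
seen_by_scheduler, _ = store.get("fraud-batch-0")
store.put("fraud-batch-0", {"node": None, "hold": True}, expected_version=v1)

try:
    store.put("fraud-batch-0", {"node": "node-b"}, expected_version=seen_by_scheduler)
    outcome = "bound"
except Conflict:
    outcome = "stale observation rejected; re-read and retry"
```

The scheduler's write fails because its observation was one version behind, which is the safe result: it has to re-read before acting again.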
The useful mental model is not "the control plane sends orders." It is "the control plane keeps narrowing the difference between intent and reality." That narrowing can be fast, slow, noisy, or temporarily paused, depending on how reliable the observations are and how expensive a wrong action would be.
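A minimal sketch of that narrowing, assuming replica count is the only dimension being reconciled. The names `Observation` and `reconcile` and the thresholds `max_age` and `max_step` are assumptions made for illustration, not values taken from a real scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    running: int        # replicas reported as running
    age_seconds: float  # how old this report is


def reconcile(desired_replicas, obs, max_age=30.0, max_step=1):
    """Return a signed replica change; 0 means 'do nothing this round'."""
    if obs.age_seconds > max_age:
        return 0  # evidence too old: wait for a fresher sample
    gap = desired_replicas - obs.running
    # Bound the action so one bad observation cannot cause a large change.
    return max(-max_step, min(max_step, gap))


steps = []
running = 0
while (delta := reconcile(3, Observation(running, age_seconds=1.0))) != 0:
    running += delta  # pretend the action took effect immediately
    steps.append(running)
```

With fresh observations the loop converges one replica at a time. With a stale observation it returns 0 and pauses rather than guessing.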
Control Flow Through A Placement
Consider a workload that needs one GPU replica. A well-behaved control flow might look like this:
1. API server stores desired workload: replicas=1, gpu=true.
2. Scheduler watches unscheduled work and cached node state.
3. Scheduler proposes node-gpu-7 and writes a binding through the authority.
4. Node agent observes the binding and starts the workload.
5. Status controller reports running, failed, or unknown.
6. Repair controller reacts if desired and observed state diverge.
Each step has a different ownership boundary. The scheduler owns the placement proposal. The authoritative API owns whether the binding is accepted. The node agent owns local execution. The status path owns observed reality. The repair loop owns convergence after drift. If those ownership boundaries blur, the system can oscillate: the scheduler places work, a repair controller undoes it, another controller retries, and operators see a cluster that is busy but not making progress.
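The ownership boundaries above can be made mechanical by rejecting writes from anyone other than a field's owner. The sketch below is one hypothetical way to do that. The field names and controller names in `FIELD_OWNERS` are invented for illustration and only loosely mirror the roles described in the text.

```python
class OwnershipError(Exception):
    """Raised when a controller writes a field it does not own."""


# Hypothetical field-to-owner table; exactly one writer per field.
FIELD_OWNERS = {
    "placement.proposal": "scheduler",
    "binding.accepted": "api-authority",
    "status.phase": "status-controller",
    "repair.requeue": "repair-controller",
}


def write_field(obj, field_name, value, writer):
    """Apply the write only if `writer` is the registered owner of the field."""
    owner = FIELD_OWNERS.get(field_name)
    if owner != writer:
        raise OwnershipError(
            f"{writer} may not write {field_name} (owner: {owner})"
        )
    obj[field_name] = value
    return obj


workload = {}
write_field(workload, "placement.proposal", "node-gpu-7", writer="scheduler")
write_field(workload, "binding.accepted", True, writer="api-authority")

try:
    # The repair controller must not rewrite the scheduler's proposal directly.
    write_field(workload, "placement.proposal", None, writer="repair-controller")
    blocked = False
except OwnershipError:
    blocked = True
```

Under this rule the repair controller can still request a requeue through its own field, but it cannot undo a placement by overwriting it, which removes one source of oscillation.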
Control flow also needs backpressure. If observations are missing, a controller should not necessarily act harder. A fast retry loop can amplify an outage by flooding the API, double-allocating capacity, or repeatedly evicting work that would have recovered. The trade-off is practical: aggressive loops converge quickly when evidence is accurate, but conservative loops are safer when evidence is old, contradictory, or incomplete.
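A common form of backpressure is capped exponential backoff between retries. The following is a minimal sketch: the function name `backoff_delays` and the defaults (`base`, `cap`, `jitter`) are assumptions for illustration, and real controllers tune these values to their own API limits.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, jitter=0.0, rng=None):
    """Return the wait (in seconds) before each of `attempts` retries."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # Double the wait each time, but never exceed the cap.
        delay = min(cap, base * 2 ** attempt)
        # Optional jitter shortens the wait by a random fraction so that
        # many controllers do not retry in lockstep.
        delay -= delay * jitter * rng.random()
        delays.append(delay)
    return delays


delays = backoff_delays(8)
```

Without jitter the waits grow 1, 2, 4, 8, 16, 32 seconds and then stay at the 60-second cap, so a controller facing missing observations backs away from the API instead of pressing harder.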
Worked Example: The Stuck Replica
Imagine fraud-batch has desired state replicas=3, but the dashboard shows only two running replicas. A naive operator might ask, "Why did the scheduler fail to create the third one?" A control-plane view asks a better question: which part of the desired-versus-observed flow is stuck?
There are several different answers:
- The desired state may be blocked by quota, so the scheduler correctly refuses to bind another GPU.
- The scheduler may have written a binding, but the node agent has not observed it yet.
- The node agent may have started the process, but status has not propagated.
- The workload may have crashed, and the repair loop is waiting before retrying.
- A stale cache may make the scheduler believe no eligible GPU node exists.
Those cases require different fixes. Adding scheduler replicas addresses at most one of them. The better diagnostic path follows the state transition: desired object, scheduling queue, binding record, node-local action, status update, and repair decision. The control plane becomes debuggable when each transition leaves evidence behind.
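The diagnostic path can be sketched as a walk over the transitions that stops at the first one with no evidence. This is a simplified illustration: the stage names and the `evidence` dictionary are hypothetical, the chain is treated as strictly linear, and the repair decision is left out because it only exists after a failure.

```python
# Transitions in the order a healthy replica passes through them.
TRANSITIONS = [
    "desired_object",
    "scheduling_queue",
    "binding_record",
    "node_local_action",
    "status_update",
]


def first_missing_transition(evidence):
    """Return the earliest transition that left no evidence, or None."""
    for name in TRANSITIONS:
        if not evidence.get(name):
            return name
    return None


# Hypothetical evidence for the third fraud-batch replica: a binding was
# written, but nothing shows the node agent acting on it.
third_replica = {
    "desired_object": True,
    "scheduling_queue": True,
    "binding_record": True,
    "node_local_action": False,
    "status_update": False,
}
stuck_at = first_missing_transition(third_replica)
```

Here the walk stops at `node_local_action`, which points at the node agent rather than the scheduler.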
Operational Failure Modes
- Action without ownership: two controllers mutate the same field and create a loop. The design response is to assign one writer per field or one authority per state transition.
- Observation treated as commitment: a controller sees free capacity and assumes it has reserved it. The design response is to make reservation or binding an explicit authoritative write.
- Retry storms: controllers react to missing status by retrying too quickly. The design response is backoff, queue rate limits, and idempotent operations.
- Unclear terminal states: work sits between pending, bound, running, failed, and unknown. The design response is a state machine with explicit timeout and repair rules.
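The last response, a state machine with explicit timeout rules, can be sketched in a few lines. The allowed transitions and the timeout values below are assumptions chosen for illustration. The text names the states but does not prescribe these particular edges or limits.

```python
# Which state changes are legal; anything else is rejected.
ALLOWED = {
    "pending": {"bound"},
    "bound": {"running", "failed", "unknown"},
    "running": {"failed", "unknown"},
    "unknown": {"running", "failed"},
    "failed": {"pending"},  # the repair loop requeues failed work
}

# Seconds a state may go without a status update (hypothetical limits).
TIMEOUTS = {"bound": 60.0, "running": 120.0}


def transition(state, new_state):
    """Move to `new_state` only if the state machine allows it."""
    if new_state not in ALLOWED[state]:
        raise ValueError(f"illegal transition: {state} -> {new_state}")
    return new_state


def apply_timeout(state, seconds_since_update):
    """Demote a silent state to 'unknown' instead of trusting it forever."""
    limit = TIMEOUTS.get(state)
    if limit is not None and seconds_since_update > limit:
        return transition(state, "unknown")
    return state


state = transition("pending", "bound")
state = apply_timeout(state, seconds_since_update=90.0)  # no status for 90 s
```

After 90 seconds without a status update the bound work becomes `unknown`, which gives the repair loop an explicit state to act on instead of an ambiguous one.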
Connections
- The previous lesson, 001.md, explains why stale observations and authority boundaries define safe scheduling decisions.
- The next lesson, 003.md, uses this desired-versus-observed model to separate safety from liveness under failure.
- consistency-and-replication gives language for why observed state can lag behind committed intent.
Resources
- [DOC] Kubernetes Controllers
  - Focus: Study the control-loop model: watch shared state, compare desired and current state, then act.
- [DOC] Kubernetes Scheduler
  - Focus: Connect filtering, scoring, and binding to the larger desired-versus-observed control flow.
- [PAPER] Large-scale cluster management at Google with Borg
  - Focus: Look for the separation between declarative job intent, scheduling decisions, and observed task state.
Key Takeaways
- Desired state is the durable intent; observed state is delayed evidence about reality.
- Control planes make progress by comparing those two states and issuing bounded, owned actions.
- The central trade-off is convergence speed versus safety when observations are stale or incomplete.
- Debugging a scheduler starts by locating the broken transition, not by assuming the scheduler alone is at fault.