Distributed Schedulers and Control Planes: Watch Streams, Caches, and Staleness Boundaries

LESSON

Distributed Schedulers and Control Planes

012 35 min advanced

Distributed Schedulers and Control Planes: Watch Streams, Caches, and Staleness Boundaries

The core idea: Watch streams and caches let controllers react without overwhelming the authoritative store, so the design trade-off is between low-latency local decisions and explicit boundaries around stale state.

Core Insight

Suppose the risk-api rollout from the previous lesson has enabled scheduler-policy-v5 for a small canary. The scheduler, rollout controller, autoscaler, quota controller, and admission layer all need to notice new objects, policy updates, replica health, pending reasons, and capacity changes. If every controller polls the API store before every decision, the control plane becomes slow and expensive. If every controller trusts a local cache blindly, decisions can be made from old reality.

The usual answer is a watch stream backed by a local cache. A controller first lists the objects it cares about, then watches for later changes. The cache becomes a local working view: fast to read, indexed for the controller's needs, and updated by events from the authoritative store. That makes reconciliation practical at scale.

The non-obvious part is that the cache is not truth. It is a snapshot plus a stream of changes, and both can lag, disconnect, compact, or restart. A safe control plane treats cache freshness as a property of each decision. Some decisions can tolerate slightly stale state because reconciliation will repair them. Other decisions, such as binding scarce capacity or admitting a quota-consuming request, need stronger checks against an authoritative version.

List, Watch, and Cache

A common controller data path looks like this:

authoritative store -> API server -> list response -> local cache
                                      watch stream -> local cache
                                      work queue   -> reconciler

The initial list gives the controller a complete starting point for a scope: all pods in a namespace, all placement policies in a region, or all pending scheduling requests for a lane. The response carries a version marker that says, in effect, "this is the world as of this point in the store's history."

The watch then streams later changes: added, modified, deleted, and sometimes bookmark events. The controller applies those events to its local cache and enqueues work for reconciliation. Instead of scanning the whole world, the controller can react to the small set of objects that changed.

The cache is the controller's read-optimized memory of that world. It may have indexes by tenant, node, zone, priority class, owner reference, or policy version. For the scheduler, those indexes can answer questions like:

This design moves many reads away from the authoritative store. It also introduces a new responsibility: the controller must know when its local view is fresh enough for the action it is about to take.

Resource Versions and Freshness

Freshness needs a concrete handle. In many control planes, each object or list response has a monotonically advancing version, revision, timestamp, or generation. Kubernetes calls this idea resourceVersion; etcd exposes revisions. The exact name matters less than the contract: a client can say which point in history its cache has reached.

A version marker helps answer three questions:

For example, a scheduler may read a cached node object that says allocatable_gpu=1 at version 9001. Before binding risk-api, it should not assume that GPU is still free just because the cache says so. Another scheduler, quota controller, or operator may have changed the state. The binding path needs an authoritative compare-and-update, lease, reservation, or conflict check.

That distinction is the staleness boundary. The cache can find candidates quickly. The authoritative write path must prove that the decision is still safe.

Where Staleness Is Acceptable

Not every stale read is dangerous. A controller that notices risk-api has seven replicas when it eventually should have eight can enqueue a reconcile and try again later. If the cache is slightly behind, the next watch event or periodic resync can repair the view. This is why reconciliation works well for many desired-state loops.

Staleness is usually acceptable when:

Staleness is less acceptable when:

The practical design is often mixed. Read broadly from cache, then confirm narrowly at the boundary where a decision becomes authoritative.

Missed Events, Relist, and Compaction

Watch streams are not magic. Networks disconnect, clients restart, servers compact old history, and buffers overflow. A controller cannot assume it will see every event forever.

A robust controller handles watch failure as a normal path:

1. list objects and record version 5000
2. watch changes after version 5000
3. apply events to cache and enqueue affected keys
4. if the watch is interrupted, resume from the last safe version
5. if that version is too old, relist and rebuild the cache

Relisting is expensive, but it is the cost of returning to a coherent view. The controller should make relist visible in metrics because repeated relists can signal API pressure, slow consumers, excessive object churn, or a watch stream that cannot keep up.

Controllers also need to handle delete events carefully. If a delete is missed and the cache still believes an object exists, the controller can keep reconciling a ghost. If a delete arrives before related cleanup is visible, the controller can release capacity too early. Owner references, finalizers, tombstones, and authoritative existence checks are tools for making deletion safe under lag.

Worked Example: Stale Policy During a Canary

Imagine scheduler-policy-v5 is active for risk-api canary placements. The rollout controller updates the policy object:

policyVersion: v5-canary
scope: risk-api eu-central
active: true
maxZoneSkew: 1

The scheduler watches policy objects and node objects. Its cache receives the policy update quickly, but a node-capacity watch is delayed for twenty seconds. During that window, the scheduler sees a candidate node in zone-b with one free GPU even though another binding already consumed it.

A weak design binds directly from the cached node view:

cache says node-b7 has 1 GPU
bind risk-api replica to node-b7
later discover conflict or overcommitment

A stronger design separates selection from commitment:

cache selects node-b7 as a candidate
binding request includes observed node version
authoritative path rejects if capacity changed
scheduler retries with a fresh candidate

The stronger design still benefits from the cache. It avoids scanning every node through the API server. But it does not let the cache cross the boundary where exclusive capacity is claimed.

The same pattern applies to rollout gates. A cached view can compute "no pending-reason regression" cheaply, but advancing the rollout should include a freshness check: did every required watch reach at least the policy activation version, and are the key signals current enough to trust?

Operational Failure Modes

Connections

Resources

Key Takeaways

PREVIOUS Distributed Schedulers and Control Planes: Rollouts, Reconfiguration, and Safe Change NEXT Distributed Schedulers and Control Planes: Admission, Policy, and API Control Surfaces