Distributed Schedulers and Control Planes: Watch Streams, Caches, and Staleness Boundaries

LESSON

Distributed Schedulers and Control Planes

012 35 min advanced

Distributed Schedulers and Control Planes: Watch Streams, Caches, and Staleness Boundaries

The core idea: Watch streams and caches let controllers react without overwhelming the authoritative store, so the design trade-off is between low-latency local decisions and explicit boundaries around stale state.

Core Insight

Suppose the risk-api rollout from the previous lesson has enabled scheduler-policy-v5 for a small canary. The scheduler, rollout controller, autoscaler, quota controller, and admission layer all need to notice new objects, policy updates, replica health, pending reasons, and capacity changes. If every controller polls the API store before every decision, the control plane becomes slow and expensive. If every controller trusts a local cache blindly, decisions can be made from old reality.

The usual answer is a watch stream backed by a local cache. A controller first lists the objects it cares about, then watches for later changes. The cache becomes a local working view: fast to read, indexed for the controller's needs, and updated by events from the authoritative store. That makes reconciliation practical at scale.

The non-obvious part is that the cache is not truth. It is a snapshot plus a stream of changes, and both can lag, disconnect, compact, or restart. A safe control plane treats cache freshness as a property of each decision. Some decisions can tolerate slightly stale state because reconciliation will repair them. Other decisions, such as binding scarce capacity or admitting a quota-consuming request, need stronger checks against an authoritative version.

List, Watch, and Cache

A common controller data path looks like this:

authoritative store -> API server -> list response -> local cache
                                      watch stream -> local cache
                                      work queue   -> reconciler

The initial list gives the controller a complete starting point for a scope: all pods in a namespace, all placement policies in a region, or all pending scheduling requests for a lane. The response carries a version marker that says, in effect, "this is the world as of this point in the store's history."

The watch then streams later changes: added, modified, deleted, and sometimes bookmark events. The controller applies those events to its local cache and enqueues work for reconciliation. Instead of scanning the whole world, the controller can react to the small set of objects that changed.

The cache is the controller's read-optimized memory of that world. It may have indexes by tenant, node, zone, priority class, owner reference, or policy version. For the scheduler, those indexes can answer questions like:

which pods are pending in eu-central
which nodes still have GPU capacity
which placement policy applies to risk-api
which reservations belong to the recovery lane
which objects changed since the last scheduling cycle

This design moves many reads away from the authoritative store. It also introduces a new responsibility: the controller must know when its local view is fresh enough for the action it is about to take.

Resource Versions and Freshness

Freshness needs a concrete handle. In many control planes, each object or list response has a monotonically advancing version, revision, timestamp, or generation. Kubernetes calls this idea resourceVersion; etcd exposes revisions. The exact name matters less than the contract: a client can say which point in history its cache has reached.

A version marker helps answer three questions:

What did I see? The controller can attach the observed version to a decision.
What changed since then? The watch stream can resume from a known point when possible.
Is my write still valid? An update can use optimistic concurrency so it fails if the object changed after the controller read it.

For example, a scheduler may read a cached node object that says allocatable_gpu=1 at version 9001. Before binding risk-api, it should not assume that GPU is still free just because the cache says so. Another scheduler, quota controller, or operator may have changed the state. The binding path needs an authoritative compare-and-update, lease, reservation, or conflict check.

That distinction is the staleness boundary. The cache can find candidates quickly. The authoritative write path must prove that the decision is still safe.

Where Staleness Is Acceptable

Not every stale read is dangerous. A controller that notices risk-api has seven replicas when it eventually should have eight can enqueue a reconcile and try again later. If the cache is slightly behind, the next watch event or periodic resync can repair the view. This is why reconciliation works well for many desired-state loops.

Staleness is usually acceptable when:

the action is idempotent
the next reconcile can repair a missed or duplicate action
the action does not consume scarce exclusive capacity
the controller records enough state to detect conflicts later
the user-visible consequence is delay rather than corruption

Staleness is less acceptable when:

two actors could claim the same capacity
admission would allow a request beyond quota
a rollout gate could advance based on missing failures
an emergency override must stop new action immediately
a policy change affects security, isolation, or ownership

The practical design is often mixed. Read broadly from cache, then confirm narrowly at the boundary where a decision becomes authoritative.

Missed Events, Relist, and Compaction

Watch streams are not magic. Networks disconnect, clients restart, servers compact old history, and buffers overflow. A controller cannot assume it will see every event forever.

A robust controller handles watch failure as a normal path:

1. list objects and record version 5000
2. watch changes after version 5000
3. apply events to cache and enqueue affected keys
4. if the watch is interrupted, resume from the last safe version
5. if that version is too old, relist and rebuild the cache

Relisting is expensive, but it is the cost of returning to a coherent view. The controller should make relist visible in metrics because repeated relists can signal API pressure, slow consumers, excessive object churn, or a watch stream that cannot keep up.

Controllers also need to handle delete events carefully. If a delete is missed and the cache still believes an object exists, the controller can keep reconciling a ghost. If a delete arrives before related cleanup is visible, the controller can release capacity too early. Owner references, finalizers, tombstones, and authoritative existence checks are tools for making deletion safe under lag.

Worked Example: Stale Policy During a Canary

Imagine scheduler-policy-v5 is active for risk-api canary placements. The rollout controller updates the policy object:

policyVersion: v5-canary
scope: risk-api eu-central
active: true
maxZoneSkew: 1

The scheduler watches policy objects and node objects. Its cache receives the policy update quickly, but a node-capacity watch is delayed for twenty seconds. During that window, the scheduler sees a candidate node in zone-b with one free GPU even though another binding already consumed it.

A weak design binds directly from the cached node view:

cache says node-b7 has 1 GPU
bind risk-api replica to node-b7
later discover conflict or overcommitment

A stronger design separates selection from commitment:

cache selects node-b7 as a candidate
binding request includes observed node version
authoritative path rejects if capacity changed
scheduler retries with a fresh candidate

The stronger design still benefits from the cache. It avoids scanning every node through the API server. But it does not let the cache cross the boundary where exclusive capacity is claimed.

The same pattern applies to rollout gates. A cached view can compute "no pending-reason regression" cheaply, but advancing the rollout should include a freshness check: did every required watch reach at least the policy activation version, and are the key signals current enough to trust?

Operational Failure Modes

Cache treated as truth: a controller commits scarce capacity from a stale snapshot. The fix is authoritative conflict checks at binding, admission, or reservation time.
No freshness signal: operators cannot tell whether a calm dashboard reflects a healthy system or a stuck watch. The fix is watch lag, last event version, relist count, and cache-sync metrics.
Relist storm: many controllers lose watches and rebuild caches at once. The fix is backoff, bookmarks, pagination, and scoped watches.
Dropped delete semantics: stale objects remain in cache or cleanup runs too early. The fix is tombstone handling, finalizers, owner references, and existence checks.
Unbounded watch scope: every controller watches more objects than it needs. The fix is filtering by namespace, field, label, tenant, lane, or controller ownership.
Rollout gate from stale signals: a canary advances while failures are delayed in a cache. The fix is freshness requirements for the signals that gate activation.

Connections

The previous lesson, 011.md, treated rollouts as controlled state transitions. Watch freshness decides when rollout evidence is current enough to act on.
The next lesson, 013.md, moves to admission and policy. Admission paths often read from cache for context but need authoritative checks when accepting writes.
consistency-and-replication provides deeper background for revisions, stale reads, and consistency boundaries.

Resources

[DOC] Kubernetes API Concepts
- Focus: Study resourceVersion, watches, bookmarks, pagination, and efficient change detection.
[DOC] Kubernetes Controllers
- Focus: Connect watch-driven observation with reconciliation against desired state.
[DOC] etcd API Guarantees
- Focus: Look at revisions, watch behavior, and the consistency expectations behind control-plane storage.
[DOC] client-go tools/cache
- Focus: Inspect informers, stores, indexers, resync, and deleted object handling as implementation vocabulary.

Key Takeaways

Watch streams and caches make large control planes responsive without turning every decision into a direct store read.
A cache is a fast local view, not the authoritative boundary for scarce capacity, admission, ownership, or safety gates.
Version markers, relist paths, and cache-sync metrics make staleness visible enough to reason about.
The central trade-off is lower read pressure and faster local decisions versus explicit safeguards for stale or incomplete observations.

← Back to Distributed Schedulers and Control Planes

← Back to Distributed Systems

← Back to Learning Hub