Distributed Schedulers and Control Planes: Observability, Debuggability, and Hidden Coupling
LESSON
The core idea: Control-plane observability must explain decisions, not just symptoms, because the hardest scheduler incidents come from hidden coupling between controllers, caches, policies, quotas, and shared API paths.
Core Insight
Imagine the risk-api rollback from the previous lesson is mostly successful: traffic is serving, scheduler-policy-v4 is active again, and the leaked reservation was repaired. Two recovery replicas still remain pending. The service dashboard says error rate is acceptable. The deployment page says rollout progress is slow. The scheduler logs contain a few "unschedulable" lines. The autoscaler keeps adding desired capacity. No single signal explains the system.
This is the normal shape of a control-plane incident. The user-facing service may be fine while the control plane is unhealthy, or the control plane may look busy while it is repeatedly making the same invalid decision. Metrics, logs, traces, events, and status fields each expose a different slice. Debuggability comes from joining those slices around one object, one generation, and one decision path.
The non-obvious problem is hidden coupling. A scheduler may appear to be blocked by placement, but the real cause is an admission policy that added a label, a quota controller that consumed headroom, an informer cache that lagged behind node updates, or a repair loop that keeps changing status. None of those dependencies looks like a direct function call in the scheduler. They are coupled through shared state and shared control surfaces.
The trade-off is detail versus operability. Recording every decision input forever is expensive and noisy, but recording only aggregate symptoms makes incidents impossible to explain. A mature control plane chooses where high-cardinality detail is worth the cost: priority workloads, failed decisions, policy changes, rollout windows, and incident sampling.
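The selection rule described above can be sketched as a small predicate. This is a minimal illustration, not a prescribed design: the `DecisionContext` fields, the `"high"` priority tier, and the 10% incident sample rate are all assumptions chosen for the example.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionContext:
    """Hypothetical summary of one controller decision."""
    object_id: str
    priority_tier: str            # assumed tiers: "high" or "standard"
    decision_failed: bool
    in_rollout_window: bool
    policy_changed_recently: bool
    incident_active: bool


def should_record_detail(ctx: DecisionContext,
                         incident_sample_rate: float = 0.1) -> bool:
    """Return True when full decision inputs are worth their cost."""
    if ctx.priority_tier == "high":
        return True
    if ctx.decision_failed or ctx.in_rollout_window or ctx.policy_changed_recently:
        return True
    if ctx.incident_active:
        # Hash-based sampling is deterministic per object, so a sampled
        # object keeps a complete trail instead of a patchy one.
        bucket = zlib.crc32(ctx.object_id.encode("utf-8")) % 10_000
        return bucket < incident_sample_rate * 10_000
    return False
```

Sampling by a hash of the object ID rather than by a random draw is a deliberate choice here: every controller that applies the same rule keeps or drops the same objects, which is what makes a cross-controller timeline possible later.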
What Control-Plane Observability Must Show
Application observability usually asks whether requests are fast, successful, and correct. Control-plane observability has to ask a different set of questions:
- What desired state was the controller trying to satisfy?
- Which observed state did it read, and how fresh was that state?
- Which constraints, policies, quotas, and priorities affected the decision?
- Which action did it attempt?
- Did the action commit, conflict, time out, or get retried?
- Which controller owns the next step?
- What partial progress or repair state exists now?
A scheduler that only exports "scheduling attempts per second" and "scheduling failures" is observable only at the surface. During a real incident, operators need to know whether a workload was filtered out by hard constraints, scored poorly by soft preferences, blocked by quota, waiting on a stale cache, preempted by higher priority work, or bound successfully but stuck at startup.
Good control-plane telemetry preserves the decision path:
object identity
-> desired generation
-> queue wait
-> cache/resource version used
-> filters and scores
-> bind or update result
-> observed generation in status
-> condition and reason
That path does not need to be logged at maximum detail for every object forever. It does need to be reconstructable for a sampled workload, an incident window, or a high-priority tenant. Otherwise the system gives operators symptoms without a causal trail.
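One way to keep that path reconstructable is to emit a single structured record per decision. The sketch below is an assumption about shape, not a real scheduler's schema: `DecisionRecord` and its field names are hypothetical, and the sample values are borrowed from this lesson's running example.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class DecisionRecord:
    """Hypothetical record mirroring the decision path, one field per step."""
    object_id: str
    desired_generation: int
    queue_wait: float                       # in whatever unit the queue reports
    cache_resource_version: int
    filter_rejections: Dict[str, str] = field(default_factory=dict)  # target -> reason
    scores: Dict[str, float] = field(default_factory=dict)
    result: str = "pending"                 # e.g. "bound", "unschedulable", "conflict"
    observed_generation: Optional[int] = None
    condition: str = ""
    reason: str = ""

    def to_log_line(self) -> str:
        """Serialize with sorted keys so records are easy to diff and grep."""
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative values only.
record = DecisionRecord(
    object_id="risk-api/replica-042",
    desired_generation=42,
    queue_wait=3.0,
    cache_resource_version=918812,
    filter_rejections={"zone-a": "topology", "zone-b": "topology", "zone-c": "quota"},
    result="unschedulable",
    observed_generation=42,
    condition="SchedulingRetrying",
    reason="insufficient eligible capacity",
)
print(record.to_log_line())
```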
Signals Have Different Jobs
The common telemetry signals overlap, but they are not interchangeable.
| Signal | Best at | Weakness |
|---|---|---|
| Metrics | rates, saturation, queue age, error counts, SLO burn | weak for single-object causality |
| Logs | local decisions, reasons, conflicts, exceptions | noisy and easy to lose without stable IDs |
| Traces | request paths and cross-service timing | miss asynchronous work if spans are not linked |
| Events | user-facing object transitions and warnings | often lossy, aggregated, or delayed |
| Status conditions | durable progress and readiness state | can be stale without observed-generation discipline |
| Audit records | who changed authoritative state | not enough to explain controller reasoning |
For a scheduler, metrics might show queue age rising. Logs might show that risk-api-7f2 failed a topology filter. Events might tell the user "insufficient zone capacity." Status might say SchedulingRetrying=True for generation 42. Audit records might show that a policy changed five minutes earlier. The incident becomes understandable only when those signals share stable identifiers and consistent vocabulary.
Two fields are especially useful:
- generation: which desired-state version the object is currently asking for
- observed generation: which desired-state version a controller has actually processed
Without those fields, a status condition can look current while describing an old decision. With them, an operator can tell whether the controller is behind, whether the condition belongs to the active rollout, and whether a rollback or repair step has actually been observed.
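The comparison between the two fields is simple enough to sketch directly. The `Condition` type and the returned labels below are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """Hypothetical status condition stamped with the generation it describes."""
    type: str
    status: bool
    observed_generation: int


def classify_condition(generation: int, cond: Condition) -> str:
    """Say whether a condition describes the desired state currently requested."""
    if cond.observed_generation < generation:
        return "stale"      # the controller has not processed the latest spec yet
    if cond.observed_generation == generation:
        return "current"    # the condition belongs to the active generation
    return "invalid"        # status claims a generation newer than the spec


# A condition that is True but was written for generation 41 should not be
# trusted while the object is asking for generation 42.
old = Condition("SchedulingRetrying", True, observed_generation=41)
print(classify_condition(42, old))
```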
Debugging by Decision Timeline
A practical debugging question is not "what is wrong with the cluster?" It is "why did this controller make this decision for this object at this time?"
Start with one workload:
risk-api / replica-042 / generation 42
Then reconstruct the timeline:
00:00 desired replicas increase to 8
00:01 replica-042 enters scheduler queue
00:04 scheduler reads node cache at resourceVersion 918812
00:04 topology filter rejects zone-a and zone-b
00:04 quota plugin rejects zone-c for tenant risk
00:05 event emitted: FailedScheduling / insufficient eligible capacity
00:06 autoscaler sees low readiness and adds desired capacity
00:07 quota controller frees capacity in zone-c
00:09 scheduler cache observes quota update
00:10 replica-042 binds to node-c12
This timeline changes the diagnosis. If you only saw the event at 00:05, you might add more nodes. If you only saw the autoscaler at 00:06, you might blame capacity. If you saw the cache update at 00:09, you would suspect a freshness or propagation delay between quota and scheduling.
Decision timelines also reveal when controllers are doing reasonable local work that creates bad global behavior. The scheduler rejected the replica based on its current cache. The autoscaler reacted to low readiness. The quota controller eventually freed capacity. Each piece can be locally defensible while the combined loop oscillates or overreacts.
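A timeline like the one above can be assembled by merging per-controller streams and then measuring the gap between a write and the moment another controller observes it. The sketch below assumes each entry carries a time offset, a source, and a subject; the `Entry` type, the subject key `quota/risk-zone-c`, and the controller names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Entry:
    t: int          # offset from incident start, in the timeline's own units
    source: str     # controller that emitted the entry
    subject: str    # object or shared resource the entry is about
    message: str


def build_timeline(streams: Iterable[Iterable[Entry]], subjects: Set[str]) -> List[Entry]:
    """Merge per-controller streams, keeping entries about the chosen subjects."""
    merged = [e for stream in streams for e in stream if e.subject in subjects]
    return sorted(merged, key=lambda e: (e.t, e.source))


def propagation_lag(timeline: List[Entry], subject: str,
                    writer: str, reader: str) -> Optional[int]:
    """Time between the writer's first entry and the reader's next one on a subject."""
    write_t = next((e.t for e in timeline
                    if e.subject == subject and e.source == writer), None)
    if write_t is None:
        return None
    read_t = next((e.t for e in timeline
                   if e.subject == subject and e.source == reader and e.t >= write_t), None)
    return None if read_t is None else read_t - write_t


scheduler = [
    Entry(4, "scheduler", "replica-042", "quota plugin rejects zone-c"),
    Entry(9, "scheduler", "quota/risk-zone-c", "cache observes quota update"),
    Entry(10, "scheduler", "replica-042", "binds to node-c12"),
]
quota = [Entry(7, "quota-controller", "quota/risk-zone-c", "frees capacity in zone-c")]
autoscaler = [Entry(6, "autoscaler", "risk-api", "adds desired capacity")]

timeline = build_timeline([scheduler, quota, autoscaler],
                          {"replica-042", "risk-api", "quota/risk-zone-c"})
lag = propagation_lag(timeline, "quota/risk-zone-c", "quota-controller", "scheduler")
```

Note that the quota entries are keyed by the shared resource, not by the replica. A timeline filtered only by object name would drop them, which is exactly how this kind of coupling stays hidden.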
Hidden Coupling
Hidden coupling appears when two components influence each other through shared state, timing, or side effects rather than through an explicit interface that operators can see.
Common forms include:
- shared API pressure: a rollout, scheduler, and repair loop all depend on the same API server or datastore under incident load
- cache freshness: one controller acts on watch state that lags behind another controller's write
- quota and priority side effects: a quota or preemption decision changes what the scheduler sees as feasible
- status feedback: an autoscaler reacts to readiness that is delayed by placement or image pull, then adds more work
- policy injection: admission adds labels, tolerations, defaults, or constraints that later affect placement
- cleanup interference: repair removes partial state that another controller still considered useful progress
- shared worker pools: unrelated controllers block each other because they share rate limits, clients, queues, or locks
Hidden coupling is not automatically bad. Shared state is how control planes coordinate. The problem is unmanaged coupling: nobody can tell which dependency matters during an incident, so every team debugs its own component in isolation.
One useful test is to ask whether the operator can answer:
Which other controllers can change the inputs of this decision
without calling this controller directly?
If the answer is unknown, the system needs better telemetry, clearer ownership, or a narrower control surface.
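That test can be made concrete with an inventory of who writes each decision input. The sketch below is illustrative only: the input names, the controller names, and the `WRITERS_BY_INPUT` table are hypothetical, and a real inventory would have to be derived from audit records or ownership metadata.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical inventory: decision input -> controllers that can write it.
WRITERS_BY_INPUT: Dict[str, Set[str]] = {
    "quota headroom": {"quota-controller"},
    "node labels": {"admission-policy", "node-agent"},
    "replica binding": {"scheduler"},
}


def indirect_writers(inputs: Iterable[str],
                     writers_by_input: Dict[str, Set[str]],
                     decider: str) -> Tuple[Dict[str, Set[str]], List[str]]:
    """Split decision inputs into known outside writers and unknown ownership."""
    known: Dict[str, Set[str]] = {}
    unknown: List[str] = []
    for name in inputs:
        if name not in writers_by_input:
            unknown.append(name)        # nobody can say who changes this input
            continue
        others = writers_by_input[name] - {decider}
        if others:
            known[name] = others        # coupling that bypasses the decider
    return known, unknown


known, unknown = indirect_writers(
    ["quota headroom", "node labels", "replica binding", "image cache"],
    WRITERS_BY_INPUT,
    decider="scheduler",
)
```

A non-empty `unknown` list is the signal the text describes: an input whose writers nobody can name is a candidate for better telemetry or clearer ownership.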
Worked Example: Pending Replicas After Rollback
Return to risk-api. Two replicas are pending after rollback, and the autoscaler has added more desired capacity.
A surface-level dashboard shows:
Ready replicas: 6 / 8
Pending replicas: 2
Scheduler failures: elevated
Autoscaler scale-ups: 3 in 10 minutes
API latency: normal
That is enough to know there is work to do, but not enough to choose the action. A useful control-plane debug view would join the signals by workload and generation:
workload: risk-api
generation: 42
rollback observed by scheduler: true
pending replica: replica-042
last scheduling reason: quota/risk-zone-c
scheduler cache RV: 918812
quota controller latest RV: 918946
autoscaler reason: low ready replicas
repair condition: RepairComplete=True, observedGeneration=42
Now the hidden coupling is visible. The scheduler's last decision read its cache at resource version 918812, while the quota controller had already written 918946, so the scheduler was not using the newest quota state. The autoscaler was responding to readiness lag, not a durable lack of capacity. Repair is complete for generation 42, so leaked reservations are not the current blocker. The next action is probably to watch the scheduler catch up or inspect cache/watch lag, not to roll back again or add emergency capacity.
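The reasoning over the debug view can be written down as a small triage function. This is a sketch under stated assumptions: `DebugView` and `suggest_next_step` are hypothetical names, the suggested actions are illustrative, and the code assumes the two resource versions are integers from one comparable sequence, which the view above implies but a real API server may not guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebugView:
    """Hypothetical joined view for one workload and generation."""
    generation: int
    rollback_observed_by_scheduler: bool
    last_scheduling_reason: str
    scheduler_cache_rv: int
    quota_latest_rv: int
    repair_complete: bool
    repair_observed_generation: int


def suggest_next_step(view: DebugView) -> str:
    """Rule out each loop in turn before blaming capacity."""
    if not view.rollback_observed_by_scheduler:
        return "wait for the scheduler to observe the rollback"
    if not view.repair_complete or view.repair_observed_generation != view.generation:
        return "check the repair loop for this generation"
    quota_blocked = view.last_scheduling_reason.startswith("quota/")
    # Assumes both versions come from one ordered sequence (see lead-in).
    if quota_blocked and view.scheduler_cache_rv < view.quota_latest_rv:
        return "inspect scheduler cache/watch lag"
    return "investigate capacity and constraints"


view = DebugView(
    generation=42,
    rollback_observed_by_scheduler=True,
    last_scheduling_reason="quota/risk-zone-c",
    scheduler_cache_rv=918812,
    quota_latest_rv=918946,
    repair_complete=True,
    repair_observed_generation=42,
)
print(suggest_next_step(view))
```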
This is what "debuggable" means for a control plane: the system gives the operator a small enough causal graph to choose a corrective action.
Operational Failure Modes
- Symptom-only dashboards: charts show pending work and error counts, but not which constraint or controller is blocking progress. The fix is decision-level reasons, queue age, and object-linked conditions.
- Uncorrelated logs: logs mention object names but not generation, policy revision, operation ID, or resource version. The fix is stable correlation fields across controllers.
- Stale status looks current: a condition is true but belongs to an old generation. The fix is observed-generation discipline and clear condition transitions.
- Events are treated as complete truth: operators rely on the last event even though it was aggregated, delayed, or superseded. The fix is to join events with status, metrics, and controller state.
- Hidden shared limits: unrelated controllers slow each other through shared clients, queues, API budgets, or datastore partitions. The fix is per-controller saturation metrics and rate-limit visibility.
- Debugging stops at team boundaries: each team proves its controller is locally correct. The fix is a cross-controller decision timeline centered on the object and incident window.
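For the uncorrelated-logs failure mode, one possible guard is a formatter that refuses to emit a decision log without the correlation fields named above. The field names, the `format_decision_log` helper, and the operation ID value are hypothetical; a real system would agree on its own set across controllers.

```python
import json
import logging

# Assumed minimum set of correlation fields shared by every controller.
REQUIRED_FIELDS = ("object", "generation", "policy_revision",
                   "operation_id", "resource_version")


def format_decision_log(controller: str, message: str, **fields) -> str:
    """Build a JSON log line, rejecting entries that cannot be correlated."""
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        raise ValueError("missing correlation fields: " + ", ".join(missing))
    return json.dumps({"controller": controller, "message": message, **fields},
                      sort_keys=True)


logging.basicConfig(level=logging.INFO, format="%(message)s")
logging.getLogger("scheduler").info(format_decision_log(
    "scheduler",
    "topology filter rejected zone-a",
    object="risk-api/replica-042",
    generation=42,
    policy_revision="scheduler-policy-v4",
    operation_id="op-example-001",      # illustrative value
    resource_version=918812,
))
```

Failing loudly at the call site is a design choice for the sketch; a production controller might prefer to emit the line with an explicit marker for the missing fields rather than drop a decision log during an incident.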
Connections
- The previous lesson, 018.md, separated recovery, rollback, and repair. Observability has to expose which of those loops is active and which one is blocked.
- The next lesson, 020.md, turns these debugging needs into testing needs: deterministic replay is useful because control-plane incidents depend on timing, cache freshness, and controller interleavings.
- production-reliability-and-observability covers service-level telemetry; this lesson adapts those ideas to asynchronous desired-state systems.
Resources
- [BOOK] Site Reliability Engineering: Monitoring Distributed Systems
  - Focus: Use the distinction between symptoms and causes when deciding which control-plane signals deserve alerts.
- [DOC] OpenTelemetry Signals
  - Focus: Compare metrics, logs, and traces as complementary evidence rather than one universal answer.
- [DOC] Kubernetes Events API
  - Focus: Treat events as user-facing transition evidence that must be joined with durable status and controller telemetry.
- [DOC] Kubernetes API Concepts
  - Focus: Study resource versions, watches, and consistency boundaries because cache freshness shapes scheduler decisions.
- [DOC] Kubernetes Debug Running Pods
  - Focus: Notice how object state, events, logs, and runtime inspection work together during a real debug path.
Key Takeaways
- Control-plane observability has to explain decisions: desired generation, observed state, constraints, action result, and next owner.
- Metrics, logs, traces, events, status, and audit records each answer different questions; debugging improves when they share stable correlation fields.
- Hidden coupling often flows through shared state, caches, quotas, admission policy, status feedback, and API pressure.
- The practical goal is a decision timeline for one object and generation, not a pile of unrelated symptoms.