Distributed Schedulers and Control Planes: Design Review for SLOs, Capacity, and Failure Scenarios
LESSON
The core idea: A scheduler control plane design is reviewable only when SLOs, capacity assumptions, failure scenarios, invariants, and operating evidence are tied to specific controller decisions.
Core Insight
Imagine a team proposes a new distributed scheduler for the risk-api platform. It will support multi-region recovery, tenant quotas, priority classes, autoscaling, repair controllers, human overrides, and a regional fast path that avoids global coordination for some placements. Each individual mechanism sounds reasonable because the track has covered all of them. The design review question is harder: will the whole system meet its service goals when capacity is tight and parts of the control plane are stale, slow, or wrong?
Weak design reviews argue from diagrams. Strong design reviews argue from commitments. If the service must recover four replicas in eu-west within five minutes, the review has to show where the capacity comes from, which controller has authority, which signals prove progress, which failures are tolerated, and which invariants cannot be violated while the system is trying to recover.
The non-obvious lesson is that SLOs are not only application promises. They also shape the scheduler and the control plane. A recovery-time objective implies reserved capacity, faster queues for urgent work, bounded retries, observable partial progress, and tested failure paths. A fairness objective implies quota accounting, preemption rules, noisy-neighbor controls, and auditing. The central trade-off is ambition versus controllability: every stronger promise needs capacity, coordination, instrumentation, testing, and operational authority to back it up.
Review From Commitments
A useful design review begins by turning goals into control-plane commitments.
| User or business goal | Control-plane commitment | Review evidence |
|---|---|---|
| Recover critical service quickly | scheduler can place protected replicas within a deadline | reserved capacity, queue priority, recovery tests |
| Keep tenants isolated | one tenant cannot consume another tenant's protected capacity | quota model, admission checks, preemption policy |
| Avoid cascading overload | controllers do not amplify failure with retries and scale-ups | backoff, rate limits, queue metrics, simulation |
| Make incidents debuggable | operators can reconstruct one decision path | correlation fields, status conditions, events, traces |
| Allow emergency action | humans can override safely and temporarily | scoped override API, TTL, audit, runbook |
| Reduce global bottlenecks | local decisions are safe inside bounded authority | budget allocation, summary telemetry, repair ownership |
The review should reject vague promises. "The scheduler is highly available" is not enough. Ask what happens when the leader changes after a write timeout. "The system supports multi-region recovery" is not enough. Ask how many replicas can be placed if the global scheduler is slow and one regional capacity pool is unhealthy.
The design review is successful when every important claim has a mechanism and every mechanism has a failure scenario.
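One way to make that criterion checkable is to record each claim with its mechanisms, and each mechanism with its failure scenarios, then list the gaps. The following is a minimal sketch; the `Claim` and `Mechanism` types and the example entries are hypothetical illustrations, not part of any real review tool.

```python
from dataclasses import dataclass, field


@dataclass
class Mechanism:
    name: str
    failure_scenarios: list[str] = field(default_factory=list)


@dataclass
class Claim:
    text: str
    mechanisms: list[Mechanism] = field(default_factory=list)


def review_gaps(claims: list[Claim]) -> list[str]:
    """Return one message per claim without a mechanism
    and per mechanism without a failure scenario."""
    gaps = []
    for claim in claims:
        if not claim.mechanisms:
            gaps.append(f"claim has no mechanism: {claim.text}")
        for mech in claim.mechanisms:
            if not mech.failure_scenarios:
                gaps.append(f"mechanism has no failure scenario: {mech.name}")
    return gaps


# Hypothetical review input: one vague claim, one partially backed claim.
claims = [
    Claim("The scheduler is highly available"),
    Claim(
        "Recover critical service quickly",
        [
            Mechanism("reserved recovery capacity", ["region loses capacity"]),
            Mechanism("priority queue"),
        ],
    ),
]

for gap in review_gaps(claims):
    print(gap)
```

A review could treat a non-empty gap list as a reason to send the proposal back rather than as a score.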
SLOs Shape the Scheduler
SLOs give the design a target, but only if they are specific enough to drive decisions. For a scheduler control plane, useful SLOs may include:
- time to admit high-priority work
- time from admission to binding
- time from binding to ready capacity
- recovery time for critical services
- percentage of scheduling decisions explained by durable reasons
- maximum queue age for priority classes
- fairness windows for tenant capacity
- maximum time an override may remain active
- maximum time orphaned reservations may exist
Each SLO implies a different design pressure. A low bind-latency SLO pressures the scheduler queue, cache freshness, and API write path. A recovery SLO pressures capacity reserves and topology policy. A fairness SLO pressures quota and preemption. A debuggability SLO pressures status, events, and correlation fields.
One common mistake is to review only application SLOs such as request success rate and latency. Those are necessary, but they can hide control-plane failure. A service can keep serving while the scheduler is unable to place recovery work. By the time user-facing metrics degrade, the control plane may already be behind.
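A control-plane SLO such as maximum queue age can be evaluated directly from scheduler state, before any user-facing metric moves. The sketch below assumes a hypothetical `PendingItem` record and made-up thresholds in `MAX_QUEUE_AGE_S`; real values would come from the SLO document under review.

```python
from dataclasses import dataclass

# Hypothetical SLO thresholds: maximum queue age in seconds per priority class.
MAX_QUEUE_AGE_S = {"critical": 30.0, "standard": 300.0}


@dataclass
class PendingItem:
    workload: str
    priority_class: str
    enqueued_at: float  # seconds, on the same clock as `now`


def queue_age_violations(items: list[PendingItem], now: float) -> dict[str, float]:
    """Return the oldest queue age per priority class that exceeds its SLO."""
    oldest: dict[str, float] = {}
    for item in items:
        age = now - item.enqueued_at
        oldest[item.priority_class] = max(oldest.get(item.priority_class, 0.0), age)
    return {
        cls: age
        for cls, age in oldest.items()
        if cls in MAX_QUEUE_AGE_S and age > MAX_QUEUE_AGE_S[cls]
    }


pending = [
    PendingItem("risk-api-recovery", "critical", enqueued_at=100.0),
    PendingItem("batch-report", "standard", enqueued_at=50.0),
]
violations = queue_age_violations(pending, now=145.0)
print(violations)  # only the critical class is over its threshold
```

Here the critical recovery item has waited 45 seconds against a 30 second limit, so the control plane is already out of SLO even though the service may still be serving normally.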
Capacity Is a Design Argument
Capacity should not appear late in the review as a spreadsheet attachment. It is part of the scheduler's correctness story.
The review should ask:
- What capacity is protected for critical recovery?
- What capacity is opportunistic and reclaimable?
- Which workloads can be preempted, drained, throttled, or delayed?
- How much headroom exists per region, zone, node pool, and capacity class?
- What happens when capacity exists in aggregate but is fragmented by topology or resource shape?
- How quickly can new capacity become usable?
- Which controller owns the decision to spend emergency capacity?
- How does the system avoid autoscaling into a blocked scheduler path?
For example, "20 percent headroom" is not automatically useful. If the headroom is spread across zones that do not satisfy topology constraints, or across nodes without the right memory shape, the recovery workload may still be unschedulable. A capacity model has to match the placement constraints the scheduler actually enforces.
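The gap between aggregate headroom and placeable capacity can be shown with a small calculation. This sketch assumes a single memory dimension, hypothetical zone names, and a hypothetical `allowed_zones` set standing in for topology constraints; a real scheduler would check more resource dimensions.

```python
from dataclasses import dataclass


@dataclass
class Node:
    zone: str
    free_mem_gib: int


def placeable_replicas(nodes: list[Node], request_gib: int, allowed_zones: set[str]) -> int:
    """Count replicas that fit on individual nodes inside the allowed zones."""
    return sum(
        node.free_mem_gib // request_gib
        for node in nodes
        if node.zone in allowed_zones
    )


nodes = [
    Node("zone-a", 12),
    Node("zone-a", 12),
    Node("zone-b", 20),
    Node("zone-c", 36),
]
request_gib = 16

# Naive view: total free memory divided by the request size.
aggregate_free = sum(node.free_mem_gib for node in nodes)
naive = aggregate_free // request_gib

# Shape-aware view: each replica must fit on one node in an allowed zone.
actual = placeable_replicas(nodes, request_gib, {"zone-a", "zone-b"})

print(f"aggregate free: {aggregate_free} GiB, naive fit: {naive}, placeable: {actual}")
```

With these numbers the aggregate suggests five replicas, but only one can actually be placed, because the two zone-a nodes are each too small and zone-c is excluded by topology.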
The design should distinguish:
- hard capacity: committed quota, reserved recovery pool, unique binding
- reclaimable capacity: lower-priority work that may be preempted or drained
- soft capacity: hints, forecasts, autoscaler targets, expected future nodes
Those categories should not be mixed. Soft future capacity should not satisfy a hard recovery commitment unless the SLO explicitly allows waiting for it.
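The separation between capacity classes can be enforced in the check that decides whether a hard commitment is backed. This is a minimal sketch with a hypothetical `CapacityClass` enum and pool counts; it deliberately leaves reclaimable capacity out of the check, since whether preemption may back a hard commitment is a policy decision the design has to state.

```python
from enum import Enum


class CapacityClass(Enum):
    HARD = "hard"
    RECLAIMABLE = "reclaimable"
    SOFT = "soft"


def hard_commitment_covered(
    needed: int,
    pools: dict[CapacityClass, int],
    slo_allows_waiting: bool = False,
) -> bool:
    """Check a hard commitment against hard capacity only.

    Soft (forecast) capacity counts only when the SLO explicitly
    allows waiting for it. Reclaimable capacity is never counted here.
    """
    usable = pools.get(CapacityClass.HARD, 0)
    if slo_allows_waiting:
        usable += pools.get(CapacityClass.SOFT, 0)
    return usable >= needed


# Hypothetical pool sizes, in replica slots.
pools = {
    CapacityClass.HARD: 3,
    CapacityClass.RECLAIMABLE: 2,
    CapacityClass.SOFT: 4,
}

print(hard_commitment_covered(4, pools))                           # False
print(hard_commitment_covered(4, pools, slo_allows_waiting=True))  # True
```

The first call fails even though nine slots exist across all classes, which is the point: the answer depends on which class the commitment is allowed to draw from.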
Failure Scenario Matrix
A design review should force the system through realistic failure scenarios. The point is not to invent every possible outage. The point is to cover the boundaries where this control plane is likely to lie to itself.
| Scenario | Review question | Evidence to require |
|---|---|---|
| API write commits but client times out | Can retry avoid duplicate binding or reservation? | stable operation ID, read-after-timeout test |
| Scheduler cache is stale | Does the controller publish the resource version it used? | decision telemetry, stale-cache simulation |
| Autoscaler sees low readiness | Can it distinguish capacity shortage from placement lag? | pending reasons, bounded scale-up policy |
| Region loses capacity | Which replicas recover, where, and under whose budget? | recovery budget, topology plan, failover test |
| Policy rollback happens mid-recovery | Which side effects are preserved, drained, or repaired? | revision metadata, rollback runbook, repair conditions |
| Human override is applied | Does it expire and avoid fighting reconciliation? | override API, TTL, audit, postcondition checks |
| Two schedulers act concurrently | Which invariant prevents double commitment? | authoritative bind path, conflict tests |
| Repair loop runs during incident | Can it tell leaked state from useful partial progress? | owner references, finalizers, deadlines |
This matrix should connect directly to testing. If a scenario is important enough to appear in review, it should have some combination of unit test, integration test, simulation, replay case, chaos experiment, or runbook drill.
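As an illustration of turning one matrix row into a test, the timeout-after-commit scenario can be exercised against an in-memory stand-in for the bind path. The `BindStore` class and its `bind` method are hypothetical and exist only for this sketch; the technique shown is retrying with a stable operation ID so that a replay returns the original result instead of creating a second binding.

```python
class BindStore:
    """In-memory stand-in for an authoritative bind path (hypothetical)."""

    def __init__(self):
        self.bindings = {}    # workload -> node
        self.operations = {}  # operation_id -> (workload, node)

    def bind(self, operation_id, workload, node):
        # Replaying a known operation returns the original result unchanged.
        if operation_id in self.operations:
            return self.operations[operation_id]
        # A different operation may not rebind an already bound workload.
        if workload in self.bindings:
            raise ValueError(f"{workload} already bound to {self.bindings[workload]}")
        self.bindings[workload] = node
        self.operations[operation_id] = (workload, node)
        return workload, node


def bind_with_lost_response(store, operation_id, workload, node):
    """Simulate a write that commits while the client only sees a timeout."""
    store.bind(operation_id, workload, node)
    raise TimeoutError("response lost after commit")


store = BindStore()
try:
    bind_with_lost_response(store, "op-1", "risk-api-0", "node-a")
except TimeoutError:
    # The retry reuses the SAME operation ID, so it is safe to repeat.
    result = store.bind("op-1", "risk-api-0", "node-a")

print(result, len(store.bindings))
```

A retry that generated a fresh operation ID would instead hit the conflict check, which is the second behavior a review would want covered by a test.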
Review Questions
A senior review usually comes down to a few categories.
Authority
- What is the authoritative desired state?
- Which controller owns each transition?
- Where are leases, ownership transfer, and conflict checks required?
- Which decisions can be local, and which require global coordination?
Progress
- What partial progress is recorded?
- How does the system distinguish pending, stuck, failed, and repaired?
- Which deadlines turn waiting into a new decision?
- What is allowed to lag safely?
Capacity
- Which commitments are backed by reserved capacity?
- Which work is reclaimable?
- What are the fragmentation risks?
- How does the design behave when autoscaling is slower than recovery needs?
Safety
- What must never happen, even during failure?
- What duplicate side effects are possible after timeout, retry, or restart?
- How are finalizers, reservations, and orphaned children cleaned up?
- Which human actions are allowed, scoped, and audited?
Evidence
- Can one scheduling decision be reconstructed?
- Are generation, observed generation, operation ID, resource version, and policy revision visible?
- Which simulations or replay artifacts prove the failure paths?
- Which metrics alert before users experience the whole failure?
The review should not require perfect answers to every question. It should require honest boundaries. A design that says "we cannot guarantee fairness during regional disaster, but we preserve recovery capacity and reconcile fairness over the next hour" is much stronger than a design that claims fairness without explaining the control path.
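The Progress questions above become easier to answer when the states are defined by recorded fields rather than by intuition. The sketch below is one possible classification, assuming a hypothetical `WorkItem` record, a hypothetical `stall_after_s` threshold, and state names chosen for illustration; a real design would define its own states and ordering.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkItem:
    generation: int           # version of the desired state
    observed_generation: int  # version the controller last acted on
    last_progress_at: float   # time of the last recorded partial progress
    deadline_at: float        # after this, waiting must become a new decision
    error: Optional[str] = None


def classify(item: WorkItem, now: float, stall_after_s: float = 60.0) -> str:
    """Map recorded fields to a progress state (illustrative ordering)."""
    if item.error is not None:
        return "failed"
    if now >= item.deadline_at:
        return "deadline-exceeded"
    if item.observed_generation < item.generation:
        return "pending"  # controller has not yet acted on the latest spec
    if now - item.last_progress_at > stall_after_s:
        return "stuck"
    return "progressing"


print(classify(WorkItem(2, 1, 0.0, 300.0), now=10.0))
print(classify(WorkItem(2, 2, 0.0, 300.0), now=100.0))
print(classify(WorkItem(2, 2, 90.0, 300.0), now=100.0))
```

The deadline check comes before the stall check on purpose: once the deadline passes, the question is no longer whether the item is slow but which controller makes the next decision.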
Worked Example: Reviewing risk-api Recovery
Suppose the proposal says:
risk-api must recover four serving replicas in eu-west
within five minutes after eu-central loses capacity.
Tenant fairness must remain bounded.
Operators may apply emergency overrides.
A weak review accepts the architecture because it has a scheduler, autoscaler, rollback controller, and dashboards.
A stronger review asks for the path:
1. Admission marks risk-api recovery as priority critical.
2. Global policy grants eu-west a recovery budget.
3. Regional scheduler places replicas within that budget.
4. Binding uses stable workload identity and conflict checks.
5. Rollout publishes bound, starting, ready, and serving counts.
6. Autoscaler is bounded by pending reasons and recovery deadline.
7. Repair cleans leaked reservations with owner-aware deadlines.
8. Human override can pause expansion or avoid a zone for 30 minutes.
9. Observability joins generation, policy revision, quota state, and bind result.
10. Simulation covers stale quota, timeout-after-commit, rollback, and leader restart.
Now the reviewers can see the trade-offs. The design spends coordination at budget allocation and binding, not every local placement. It preserves fast regional progress but accepts that global fairness is reconciled over a window rather than instantly. It depends on reserved capacity, visible partial progress, and tested repair behavior. Those are concrete claims that can be challenged.
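Step 8 of that path, a scoped override that expires, can be sketched as data plus a filter applied at placement time. The `Override` type, its field names, and the zone names are hypothetical; the 30 minute TTL comes from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Override:
    action: str        # for example "avoid-zone"
    scope: str         # the zone the override applies to
    created_at: float  # seconds, on the same clock as `now`
    ttl_s: float       # lifetime in seconds
    author: str        # retained for audit

    def active(self, now: float) -> bool:
        return now < self.created_at + self.ttl_s


def allowed_zones(zones: list[str], overrides: list[Override], now: float) -> list[str]:
    """Drop zones named by active avoid-zone overrides; expired ones are ignored."""
    avoided = {
        o.scope for o in overrides if o.action == "avoid-zone" and o.active(now)
    }
    return [zone for zone in zones if zone not in avoided]


override = Override(
    action="avoid-zone",
    scope="zone-b",
    created_at=0.0,
    ttl_s=30 * 60,  # 30 minutes
    author="oncall",
)

print(allowed_zones(["zone-a", "zone-b"], [override], now=600.0))
print(allowed_zones(["zone-a", "zone-b"], [override], now=1800.0))
```

Because expiry is evaluated on every placement decision, the override stops applying without a cleanup step, so a forgotten override cannot fight reconciliation indefinitely.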
Operational Failure Modes
- SLO without control path: the design promises recovery or fairness but cannot name the controller, state, and capacity that enforce it. The fix is to translate each SLO into specific control-plane commitments.
- Capacity model ignores shape: aggregate headroom exists but does not satisfy topology, memory, GPU, or locality constraints. The fix is resource-shape and topology-aware capacity review.
- Failure scenarios are too polite: the review covers clean node failure but not stale caches, timeout-after-commit, rollback during recovery, or human override. The fix is a scenario matrix tied to tests.
- Invariants are implicit: reviewers assume unique binding or quota safety without naming the authoritative write path. The fix is explicit invariant ownership.
- Observability is symptom-only: dashboards show pending work but not decision reasons or generation. The fix is decision-level telemetry.
- Operations are an afterthought: runbooks, overrides, repair, and promotion are not reviewed until incidents happen. The fix is to review operational control as part of the design.
Connections
- The previous lesson, 022.md, showed how alternative architectures move coordination boundaries. This lesson asks whether those boundaries are defensible under SLO, capacity, and failure pressure.
- The next lesson, 024.md, is the capstone. Use this review frame as the checklist for the scheduler control plane you design there.
- reliability-engineering-foundations and capacity-planning-and-performance-engineering provide adjacent depth for SLOs, overload, and capacity evidence.
Resources
- [BOOK] Site Reliability Engineering: Service Level Objectives
  - Focus: Translate user-facing reliability promises into measurable engineering commitments.
- [BOOK] Site Reliability Engineering: Addressing Cascading Failures
  - Focus: Use overload and retry amplification as design-review scenarios for control planes.
- [DOC] Kubernetes: Resource Management for Pods and Containers
  - Focus: Connect requests, limits, and resource shape to scheduler capacity assumptions.
- [DOC] Kubernetes: Scheduling, Preemption and Eviction
  - Focus: Review how placement, priority, preemption, taints, and disruption interact.
- [DOC] Kubernetes: Pod Disruption Budgets
  - Focus: Study how disruption policy constrains operational and automated movement.
Key Takeaways
- A scheduler design review should translate SLOs into concrete control-plane commitments, capacity assumptions, and evidence.
- Capacity review must account for topology, shape, priority, reclaimability, and recovery deadlines, not only aggregate headroom.
- Failure scenarios should include stale caches, timeout-after-commit retries, rollback, repair, autoscaling feedback, human overrides, and concurrent schedulers.
- The central trade-off is stronger service promises versus the coordination, capacity, observability, testing, and operational authority needed to keep them true.