Distributed Schedulers and Control Planes: Design Review for SLOs, Capacity, and Failure Scenarios

Lesson 023, Distributed Schedulers and Control Planes (35 min, advanced)

The core idea: A scheduler control plane design is reviewable only when SLOs, capacity assumptions, failure scenarios, invariants, and operating evidence are tied to specific controller decisions.

Core Insight

Imagine a team proposes a new distributed scheduler for the risk-api platform. It will support multi-region recovery, tenant quotas, priority classes, autoscaling, repair controllers, human overrides, and a regional fast path that avoids global coordination for some placements. Each mechanism sounds reasonable on its own because earlier lessons in this track covered all of them. The design review question is harder: will the whole system meet its service goals when capacity is tight and parts of the control plane are stale, slow, or wrong?

Weak design reviews argue from diagrams. Strong design reviews argue from commitments. If the service must recover four replicas in eu-west within five minutes, the review has to show where the capacity comes from, which controller has authority, which signals prove progress, which failures are tolerated, and which invariants cannot be violated while the system is trying to recover.

The non-obvious lesson is that SLOs are not only application promises. They shape the scheduler and control plane. A recovery-time objective implies reserved capacity, faster queues for urgent work, bounded retries, observable partial progress, and tested failure paths. A fairness objective implies quota accounting, preemption rules, noisy-neighbor controls, and audit. The central trade-off is ambition versus controllability: every stronger promise needs capacity, coordination, instrumentation, testing, and operational authority to back it up.

Review From Commitments

A useful design review begins by turning goals into control-plane commitments.

User or business goal	Control-plane commitment	Review evidence
Recover critical service quickly	scheduler can place protected replicas within a deadline	reserved capacity, queue priority, recovery tests
Keep tenants isolated	one tenant cannot consume another tenant's protected capacity	quota model, admission checks, preemption policy
Avoid cascading overload	controllers do not amplify failure with retries and scale-ups	backoff, rate limits, queue metrics, simulation
Make incidents debuggable	operators can reconstruct one decision path	correlation fields, status conditions, events, traces
Allow emergency action	humans can override safely and temporarily	scoped override API, TTL, audit, runbook
Reduce global bottlenecks	local decisions are safe inside bounded authority	budget allocation, summary telemetry, repair ownership

The review should reject vague promises. "The scheduler is highly available" is not enough. Ask what happens when the leader changes after a write timeout. "The system supports multi-region recovery" is not enough. Ask how many replicas can be placed if the global scheduler is slow and one regional capacity pool is unhealthy.

The design review is successful when every important claim has a mechanism and every mechanism has a failure scenario.
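The rule above can be checked mechanically once claims are written down. The following is a minimal sketch, assuming a hypothetical `Claim` record and `review_gaps` helper (neither comes from a real tool); it only illustrates the "claim, mechanism, failure scenario" chain.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """One design claim and the mechanisms that are supposed to back it.

    `mechanisms` maps a mechanism name to the failure scenarios the
    review has examined for that mechanism (hypothetical structure).
    """
    text: str
    mechanisms: dict = field(default_factory=dict)


def review_gaps(claims):
    """Return claims without a mechanism and mechanisms without a scenario."""
    gaps = []
    for claim in claims:
        if not claim.mechanisms:
            gaps.append(f"claim without mechanism: {claim.text}")
        for mechanism, scenarios in claim.mechanisms.items():
            if not scenarios:
                gaps.append(f"mechanism without failure scenario: {mechanism}")
    return gaps


claims = [
    Claim(
        "recover critical service quickly",
        {"reserved capacity": ["region loses capacity"], "queue priority": []},
    ),
    Claim("the scheduler is highly available"),
]
gaps = review_gaps(claims)
for gap in gaps:
    print(gap)
```

In this example the vague availability claim is flagged because it names no mechanism, and "queue priority" is flagged because nobody has asked how it fails.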

SLOs Shape the Scheduler

SLOs give the design a target, but only if they are specific enough to drive decisions. For a scheduler control plane, useful SLOs may include:

time to admit high-priority work
time from admission to binding
time from binding to ready capacity
recovery time for critical services
percentage of scheduling decisions explained by durable reasons
maximum queue age for priority classes
fairness windows for tenant capacity
maximum time an override may remain active
maximum time orphaned reservations may exist

Each SLO implies a different design pressure. A low bind-latency SLO pressures the scheduler queue, cache freshness, and API write path. A recovery SLO pressures capacity reserves and topology policy. A fairness SLO pressures quota and preemption. A debuggability SLO pressures status, events, and correlation fields.

One common mistake is to review only application SLOs such as request success rate and latency. Those are necessary, but they can hide control-plane failure. A service can keep serving while the scheduler is unable to place recovery work. By the time user-facing metrics degrade, the control plane may already be behind.
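To make the difference concrete, here is a small sketch of two control-plane SLO signals from the list above: admission-to-binding time and maximum queue age for a priority class. The `WorkItem` fields, the 60-second target, and the sample timestamps are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkItem:
    priority: str
    admitted_at: float            # seconds on an arbitrary clock
    bound_at: Optional[float]     # None while the item is still pending


def bind_slo_report(items, now, priority, bind_target_s):
    """Summarise admission-to-binding behaviour for one priority class."""
    selected = [w for w in items if w.priority == priority]
    pending_ages = [now - w.admitted_at for w in selected if w.bound_at is None]
    met = sum(
        1
        for w in selected
        if w.bound_at is not None and w.bound_at - w.admitted_at <= bind_target_s
    )
    return {
        # Pending items count against the target: they have not been bound yet.
        "within_target": met / len(selected) if selected else 1.0,
        "max_queue_age_s": max(pending_ages, default=0.0),
    }


items = [
    WorkItem("critical", admitted_at=0.0, bound_at=20.0),
    WorkItem("critical", admitted_at=10.0, bound_at=100.0),
    WorkItem("critical", admitted_at=30.0, bound_at=None),
    WorkItem("batch", admitted_at=5.0, bound_at=None),
]
report = bind_slo_report(items, now=200.0, priority="critical", bind_target_s=60.0)
print(report)
```

A report like this can show a growing queue age for critical work while request success rate is still healthy, which is exactly the gap that application-only SLOs hide.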

Capacity Is a Design Argument

Capacity should not appear late in the review as a spreadsheet attachment. It is part of the scheduler's correctness story.

The review should ask:

What capacity is protected for critical recovery?
What capacity is opportunistic and reclaimable?
Which workloads can be preempted, drained, throttled, or delayed?
How much headroom exists per region, zone, node pool, and capacity class?
What happens when capacity exists in aggregate but is fragmented by topology or resource shape?
How quickly can new capacity become usable?
Which controller owns the decision to spend emergency capacity?
How does the system avoid autoscaling into a blocked scheduler path?

For example, "20 percent headroom" is not automatically useful. If the headroom is spread across zones that do not satisfy topology constraints, or across nodes without the right memory shape, the recovery workload may still be unschedulable. A capacity model has to match the placement constraints the scheduler actually enforces.
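The gap between aggregate headroom and placeable headroom is easy to show with numbers. The sketch below uses invented node sizes, an invented replica shape (4 CPU, 16 GiB), and a deliberately simple topology rule (at most `max_per_zone` replicas per zone); a real scheduler enforces richer constraints.

```python
# Free capacity per node; all values are made up for illustration.
nodes = [
    {"zone": "eu-west-1a", "free_cpu": 6, "free_mem_gib": 8},
    {"zone": "eu-west-1a", "free_cpu": 2, "free_mem_gib": 40},
    {"zone": "eu-west-1b", "free_cpu": 8, "free_mem_gib": 36},
    {"zone": "eu-west-1c", "free_cpu": 4, "free_mem_gib": 12},
]
replica = {"cpu": 4, "mem_gib": 16}


def aggregate_fit(nodes, replica):
    """Replicas that 'fit' if free capacity were one big pool."""
    total_cpu = sum(n["free_cpu"] for n in nodes)
    total_mem = sum(n["free_mem_gib"] for n in nodes)
    return min(total_cpu // replica["cpu"], total_mem // replica["mem_gib"])


def placeable(nodes, replica, max_per_zone):
    """Replicas that fit on individual nodes, capped per zone."""
    per_zone = {}
    for n in nodes:
        fits = min(
            n["free_cpu"] // replica["cpu"],
            n["free_mem_gib"] // replica["mem_gib"],
        )
        per_zone[n["zone"]] = per_zone.get(n["zone"], 0) + fits
    return sum(min(count, max_per_zone) for count in per_zone.values())


print(aggregate_fit(nodes, replica))               # 5
print(placeable(nodes, replica, max_per_zone=2))   # 2
```

The pooled view suggests five replicas fit, but only one node has both enough CPU and enough memory, so only two replicas are actually placeable. A four-replica recovery commitment would fail despite the apparent headroom.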

The design should distinguish:

hard capacity        -> committed quota, reserved recovery pool, unique binding
reclaimable capacity -> lower-priority work that may be preempted or drained
soft capacity        -> hints, forecasts, autoscaler targets, expected future nodes

Those categories should not be mixed. Soft future capacity should not satisfy a hard recovery commitment unless the SLO explicitly allows waiting for it.
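One way to keep the categories apart is to tag every pool with its class and make the SLO state which classes may back a commitment. This is a sketch under assumptions: the `CapacityClass` enum, the pool counts, and the choice to let reclaimable capacity count toward recovery are illustrative, not part of the proposal.

```python
from enum import Enum


class CapacityClass(Enum):
    HARD = "hard"                # committed quota, reserved recovery pool
    RECLAIMABLE = "reclaimable"  # lower-priority work that may be preempted
    SOFT = "soft"                # forecasts, autoscaler targets, future nodes


def usable_replicas(pools, allowed_classes):
    """Count replica slots only from capacity classes the SLO permits."""
    return sum(count for cls, count in pools if cls in allowed_classes)


# (class, replica slots) per pool; numbers are invented.
pools = [
    (CapacityClass.HARD, 2),
    (CapacityClass.RECLAIMABLE, 1),
    (CapacityClass.SOFT, 3),
]
needed = 4

strict = usable_replicas(pools, {CapacityClass.HARD, CapacityClass.RECLAIMABLE})
relaxed = usable_replicas(pools, set(CapacityClass))

print(strict >= needed)   # False: the commitment is not backed today
print(relaxed >= needed)  # True only if the SLO allows waiting for soft capacity
```

The design choice here is that the caller must name the allowed classes explicitly, so soft capacity can never satisfy a hard commitment by accident.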

Failure Scenario Matrix

A design review should force the system through realistic failure scenarios. The point is not to invent every possible outage. The point is to cover the boundaries where this control plane is likely to lie to itself.

Scenario	Review question	Evidence to require
API write commits but client times out	Can retry avoid duplicate binding or reservation?	stable operation ID, read-after-timeout test
Scheduler cache is stale	Does the controller publish the resource version it used?	decision telemetry, stale-cache simulation
Autoscaler sees low readiness	Can it distinguish capacity shortage from placement lag?	pending reasons, bounded scale-up policy
Region loses capacity	Which replicas recover, where, and under whose budget?	recovery budget, topology plan, failover test
Policy rollback happens mid-recovery	Which side effects are preserved, drained, or repaired?	revision metadata, rollback runbook, repair conditions
Human override is applied	Does it expire and avoid fighting reconciliation?	override API, TTL, audit, postcondition checks
Two schedulers act concurrently	Which invariant prevents double commitment?	authoritative bind path, conflict tests
Repair loop runs during incident	Can it tell leaked state from useful partial progress?	owner references, finalizers, deadlines

This matrix should connect directly to testing. If a scenario is important enough to appear in review, it should have some combination of unit test, integration test, simulation, replay case, chaos experiment, or runbook drill.
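As an example of turning one matrix row into a test, the sketch below simulates the first scenario: the write commits but the client sees a timeout. `BindingStore`, `FlakyClient`, and `bind_with_retry` are hypothetical in-memory stand-ins, not a real API; the point is that a stable operation ID makes the retry safe.

```python
class BindingStore:
    """In-memory stand-in for the authoritative bind path."""

    def __init__(self):
        self.by_operation = {}
        self.by_replica = {}

    def bind(self, operation_id, replica, node):
        # A repeated operation ID returns the original result (idempotent).
        if operation_id in self.by_operation:
            return self.by_operation[operation_id]
        # A different operation may not bind an already bound replica.
        if replica in self.by_replica:
            raise ValueError(f"{replica} already bound to {self.by_replica[replica]}")
        self.by_replica[replica] = node
        self.by_operation[operation_id] = (replica, node)
        return (replica, node)


class FlakyClient:
    """Commits the first write, then loses the response once."""

    def __init__(self, store):
        self.store = store
        self.drop_next_response = True

    def bind(self, operation_id, replica, node):
        result = self.store.bind(operation_id, replica, node)
        if self.drop_next_response:
            self.drop_next_response = False
            raise TimeoutError("response lost after commit")
        return result


def bind_with_retry(client, operation_id, replica, node, attempts=3):
    """Retry with the SAME operation ID so a committed write is not repeated."""
    for _ in range(attempts):
        try:
            return client.bind(operation_id, replica, node)
        except TimeoutError:
            continue
    raise TimeoutError("bind was not confirmed")


store = BindingStore()
result = bind_with_retry(FlakyClient(store), "op-1", "risk-api-0", "node-a")
print(result, store.by_replica)
```

If the retry generated a fresh operation ID instead, the second attempt would hit the conflict check rather than silently creating a duplicate, which is the invariant the review should ask to see tested.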

Review Questions

A senior review usually comes down to a few categories.

Authority

What is the authoritative desired state?
Which controller owns each transition?
Where are leases, ownership transfer, and conflict checks required?
Which decisions can be local, and which require global coordination?

Progress

What partial progress is recorded?
How does the system distinguish pending, stuck, failed, and repaired?
Which deadlines turn waiting into a new decision?
What is allowed to lag safely?

Capacity

Which commitments are backed by reserved capacity?
Which work is reclaimable?
What are the fragmentation risks?
How does the design behave when autoscaling is slower than recovery needs?

Safety

What must never happen, even during failure?
What duplicate side effects are possible after timeout, retry, or restart?
How are finalizers, reservations, and orphaned children cleaned up?
Which human actions are allowed, scoped, and audited?
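A scoped, expiring override can be sketched in a few lines. The `Override` record, the `avoid-zone` action name, and the in-memory audit list are assumptions for illustration; the property being shown is that the override stops applying after its TTL without anyone having to remember to remove it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Override:
    action: str       # hypothetical action name, e.g. "avoid-zone"
    scope: str        # what the action applies to, e.g. a zone
    created_by: str
    created_at: float  # seconds on an arbitrary clock
    ttl_s: float

    def active(self, now):
        return self.created_at <= now < self.created_at + self.ttl_s


audit_log = []


def apply_override(overrides, override):
    """Record who asked for what before the override takes effect."""
    audit_log.append((override.created_at, override.created_by,
                      override.action, override.scope, override.ttl_s))
    overrides.append(override)


def excluded_zones(overrides, now):
    """Zones the scheduler should skip right now; expired overrides are ignored."""
    return {o.scope for o in overrides if o.action == "avoid-zone" and o.active(now)}


overrides = []
apply_override(overrides, Override("avoid-zone", "eu-west-1b", "oncall-a", 0.0, 1800.0))

print(excluded_zones(overrides, now=100.0))    # {'eu-west-1b'}
print(excluded_zones(overrides, now=1800.0))   # set()
```

Because the scheduler derives exclusions from the override list on each decision, reconciliation does not fight the human action while it is active and resumes normal placement once it expires.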

Evidence

Can one scheduling decision be reconstructed?
Are generation, observed generation, operation ID, resource version, and policy revision visible?
Which simulations or replay artifacts prove the failure paths?
Which metrics alert before the failure becomes fully visible to users?
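The evidence questions can be applied to a single decision record. The sketch below assumes a hypothetical record shape, a plain dict carrying the five fields named above, and reports what a reviewer could not reconstruct from it.

```python
REQUIRED_FIELDS = (
    "generation",
    "observed_generation",
    "operation_id",
    "resource_version",
    "policy_revision",
)


def evidence_gaps(record):
    """List what is missing or stale in one scheduling decision record."""
    findings = [f"missing {name}" for name in REQUIRED_FIELDS
                if record.get(name) is None]
    generation = record.get("generation")
    observed = record.get("observed_generation")
    if generation is not None and observed is not None and observed < generation:
        findings.append("controller has not observed the latest generation")
    return findings


# Invented example record: the policy revision was never written down,
# and the controller acted on generation 6 while the spec is at 7.
decision = {
    "generation": 7,
    "observed_generation": 6,
    "operation_id": "op-123",
    "resource_version": "4821",
    "policy_revision": None,
}
findings = evidence_gaps(decision)
print(findings)
```

A check like this is cheap to run over sampled decisions and gives the "percentage of scheduling decisions explained by durable reasons" SLO something concrete to measure.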

The review should not require perfect answers to every question. It should require honest boundaries. A design that says "we cannot guarantee fairness during regional disaster, but we preserve recovery capacity and reconcile fairness over the next hour" is much stronger than a design that claims fairness without explaining the control path.

Worked Example: Reviewing `risk-api` Recovery

Suppose the proposal says:

risk-api must recover four serving replicas in eu-west
within five minutes after eu-central loses capacity.
Tenant fairness must remain bounded.
Operators may apply emergency overrides.

A weak review accepts the architecture because it has a scheduler, autoscaler, rollback controller, and dashboards.

A stronger review asks for the path:

1. Admission marks risk-api recovery as priority critical.
2. Global policy grants eu-west a recovery budget.
3. Regional scheduler places replicas within that budget.
4. Binding uses stable workload identity and conflict checks.
5. Rollout publishes bound, starting, ready, and serving counts.
6. Autoscaler is bounded by pending reasons and recovery deadline.
7. Repair cleans leaked reservations with owner-aware deadlines.
8. Human override can pause expansion or avoid a zone for 30 minutes.
9. Observability joins generation, policy revision, quota state, and bind result.
10. Simulation covers stale quota, timeout-after-commit, rollback, and leader restart.

Now the reviewers can see the trade-offs. The design spends coordination at budget allocation and binding, not every local placement. It preserves fast regional progress but accepts that global fairness is reconciled over a window rather than instantly. It depends on reserved capacity, visible partial progress, and tested repair behavior. Those are concrete claims that can be challenged.
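Steps 2 and 3 of the path, where coordination is spent on the budget grant rather than on each placement, can be sketched as follows. `RegionalBudget` and its methods are hypothetical names, and the grant of four replicas simply mirrors the example proposal.

```python
class RegionalBudget:
    """Recovery budget granted to one region by global policy."""

    def __init__(self, region, granted_replicas):
        self.region = region
        self.granted = granted_replicas
        self.placed = []

    def try_place(self, replica):
        """Place locally while inside the budget; no global call is needed."""
        if len(self.placed) >= self.granted:
            return False  # outside bounded authority: escalate to global policy
        self.placed.append(replica)
        return True


budget = RegionalBudget("eu-west", granted_replicas=4)
requests = [f"risk-api-{i}" for i in range(5)]

escalated = [r for r in requests if not budget.try_place(r)]

print(budget.placed)   # four replicas placed by the regional scheduler
print(escalated)       # ['risk-api-4'] needs a new grant from global policy
```

The regional scheduler makes four decisions without global coordination, and the fifth request is the only one that has to wait for the global path. That is the bounded-authority trade-off the review is being asked to accept.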

Operational Failure Modes

SLO without control path: the design promises recovery or fairness but cannot name the controller, state, and capacity that enforce it. The fix is to translate each SLO into specific control-plane commitments.
Capacity model ignores shape: aggregate headroom exists but does not satisfy topology, memory, GPU, or locality constraints. The fix is resource-shape and topology-aware capacity review.
Failure scenarios are too polite: the review covers clean node failure but not stale caches, timeout-after-commit, rollback during recovery, or human override. The fix is a scenario matrix tied to tests.
Invariants are implicit: reviewers assume unique binding or quota safety without naming the authoritative write path. The fix is explicit invariant ownership.
Observability is symptom-only: dashboards show pending work but not decision reasons or generation. The fix is decision-level telemetry.
Operations are an afterthought: runbooks, overrides, repair, and promotion are not reviewed until incidents happen. The fix is to review operational control as part of the design.

Connections

The previous lesson, 022.md, showed how alternative architectures move coordination boundaries. This lesson asks whether those boundaries are defensible under SLO, capacity, and failure pressure.
The next lesson, 024.md, is the capstone. Use this review frame as the checklist for the scheduler control plane you design there.
reliability-engineering-foundations and capacity-planning-and-performance-engineering provide adjacent depth for SLOs, overload, and capacity evidence.

Resources

[BOOK] Site Reliability Engineering: Service Level Objectives
- Focus: Translate user-facing reliability promises into measurable engineering commitments.
[BOOK] Site Reliability Engineering: Addressing Cascading Failures
- Focus: Use overload and retry amplification as design-review scenarios for control planes.
[DOC] Kubernetes: Resource Management for Pods and Containers
- Focus: Connect requests, limits, and resource shape to scheduler capacity assumptions.
[DOC] Kubernetes: Scheduling, Preemption and Eviction
- Focus: Review how placement, priority, preemption, taints, and disruption interact.
[DOC] Kubernetes: Pod Disruption Budgets
- Focus: Study how disruption policy constrains operational and automated movement.

Key Takeaways

A scheduler design review should translate SLOs into concrete control-plane commitments, capacity assumptions, and evidence.
Capacity review must account for topology, shape, priority, reclaimability, and recovery deadlines, not only aggregate headroom.
Failure scenarios should include stale caches, timeout-after-commit retries, rollback, repair, autoscaling feedback, human overrides, and concurrent schedulers.
The central trade-off is stronger service promises versus the coordination, capacity, observability, testing, and operational authority needed to keep them true.
