Distributed Schedulers and Control Planes: Design Review for SLOs, Capacity, and Failure Scenarios

023 35 min advanced

The core idea: A scheduler control plane design is reviewable only when SLOs, capacity assumptions, failure scenarios, invariants, and operating evidence are tied to specific controller decisions.

Core Insight

Imagine a team proposes a new distributed scheduler for the risk-api platform. It will support multi-region recovery, tenant quotas, priority classes, autoscaling, repair controllers, human overrides, and a regional fast path that avoids global coordination for some placements. Each individual mechanism sounds reasonable because the track has covered all of them. The design review question is harder: will the whole system meet its service goals when capacity is tight and parts of the control plane are stale, slow, or wrong?

Weak design reviews argue from diagrams. Strong design reviews argue from commitments. If the service must recover four replicas in eu-west within five minutes, the review has to show where the capacity comes from, which controller has authority, which signals prove progress, which failures are tolerated, and which invariants cannot be violated while the system is trying to recover.

The non-obvious lesson is that SLOs are not only application promises. They shape the scheduler and control plane. A recovery-time objective implies reserved capacity, faster queues for urgent work, bounded retries, observable partial progress, and tested failure paths. A fairness objective implies quota accounting, preemption rules, noisy-neighbor controls, and audit. The central trade-off is ambition versus controllability: every stronger promise needs capacity, coordination, instrumentation, testing, and operational authority to back it up.

Review From Commitments

A useful design review begins by turning goals into control-plane commitments.

| User or business goal | Control-plane commitment | Review evidence |
|---|---|---|
| Recover critical service quickly | scheduler can place protected replicas within a deadline | reserved capacity, queue priority, recovery tests |
| Keep tenants isolated | one tenant cannot consume another tenant's protected capacity | quota model, admission checks, preemption policy |
| Avoid cascading overload | controllers do not amplify failure with retries and scale-ups | backoff, rate limits, queue metrics, simulation |
| Make incidents debuggable | operators can reconstruct one decision path | correlation fields, status conditions, events, traces |
| Allow emergency action | humans can override safely and temporarily | scoped override API, TTL, audit, runbook |
| Reduce global bottlenecks | local decisions are safe inside bounded authority | budget allocation, summary telemetry, repair ownership |

The review should reject vague promises. "The scheduler is highly available" is not enough. Ask what happens when the leader changes after a write timeout. "The system supports multi-region recovery" is not enough. Ask how many replicas can be placed if the global scheduler is slow and one regional capacity pool is unhealthy.

The design review is successful when every important claim has a mechanism and every mechanism has a failure scenario.
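The rule above can be checked mechanically. The following is a minimal sketch, assuming review claims are recorded as simple records; the `Claim` type and `review_gaps` function are hypothetical names introduced for illustration, not part of any existing tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Claim:
    """One claim made by the design document (hypothetical record type)."""
    text: str
    mechanism: str = ""  # controller, API, or policy that backs the claim
    failure_scenarios: List[str] = field(default_factory=list)


def review_gaps(claims: List[Claim]) -> List[str]:
    """Return one message per claim that lacks a mechanism or a failure scenario."""
    gaps = []
    for claim in claims:
        if not claim.mechanism:
            gaps.append(f"no mechanism: {claim.text}")
        elif not claim.failure_scenarios:
            gaps.append(f"no failure scenario: {claim.text}")
    return gaps


claims = [
    Claim("The scheduler is highly available"),
    Claim(
        "Retries never double-bind",
        mechanism="stable operation ID on the bind path",
        failure_scenarios=["leader changes after a write timeout"],
    ),
    Claim("Tenants are isolated", mechanism="admission quota checks"),
]

gaps = review_gaps(claims)
for gap in gaps:
    print(gap)
```

In this example the vague availability claim is flagged for having no mechanism, and the isolation claim is flagged because nobody has said how its mechanism behaves under failure.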

SLOs Shape the Scheduler

SLOs give the design a target, but only if they are specific enough to drive decisions. For a scheduler control plane, useful SLOs may cover bind latency, recovery time, tenant fairness, and debuggability.

Each SLO implies a different design pressure. A low bind-latency SLO pressures the scheduler queue, cache freshness, and API write path. A recovery SLO pressures capacity reserves and topology policy. A fairness SLO pressures quota and preemption. A debuggability SLO pressures status, events, and correlation fields.

One common mistake is to review only application SLOs such as request success rate and latency. Those are necessary, but they can hide control-plane failure. A service can keep serving while the scheduler is unable to place recovery work. By the time user-facing metrics degrade, the control plane may already be behind.
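One way to keep control-plane health visible is to evaluate it separately from application metrics. The sketch below assumes two hypothetical signals, per-bind latencies and the age of each pending recovery placement, and assumed targets (a 2-second bind p99 and a 300-second recovery bound); none of these names or numbers come from a real scheduler.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile of a non-empty sequence; p is in (0, 100]."""
    if not values:
        raise ValueError("percentile of empty sequence")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]


def control_plane_slo_report(bind_latencies_s, pending_recovery_ages_s,
                             bind_p99_target_s=2.0, max_pending_age_s=300.0):
    """Summarize scheduler-level SLOs without looking at request metrics."""
    p99 = percentile(bind_latencies_s, 99)
    # With no pending recovery work, the oldest age is treated as zero.
    oldest = max(pending_recovery_ages_s, default=0.0)
    return {
        "bind_p99_s": p99,
        "bind_ok": p99 <= bind_p99_target_s,
        "oldest_pending_recovery_s": oldest,
        "recovery_ok": oldest <= max_pending_age_s,
    }


# Binds look healthy, but one recovery placement has been pending for 410 s.
report = control_plane_slo_report(
    bind_latencies_s=[0.2, 0.3, 0.25, 0.4, 1.1],
    pending_recovery_ages_s=[40.0, 410.0],
)
print(report)
```

Here the bind path is within target while recovery work is already late, which is the situation that application-only dashboards tend to miss.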

Capacity Is a Design Argument

Capacity should not appear late in the review as a spreadsheet attachment. It is part of the scheduler's correctness story.

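The review should ask where recovery capacity comes from, whether it is reserved or merely expected, and whether it matches the placement constraints the scheduler enforces.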

For example, "20 percent headroom" is not automatically useful. If the headroom is spread across zones that do not satisfy topology constraints, or across nodes without the right memory shape, the recovery workload may still be unschedulable. A capacity model has to match the placement constraints the scheduler actually enforces.
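The gap between aggregate headroom and placeable headroom can be illustrated with a small calculation. This is a sketch under assumed inputs: the zone names, node shapes, replica size, and per-zone spread cap are all hypothetical, and the fit rule is deliberately simplified to per-node CPU and memory.

```python
from dataclasses import dataclass


@dataclass
class Node:
    zone: str
    free_cpu: int
    free_mem_gib: int


def placeable_replicas(nodes, cpu, mem_gib, allowed_zones, max_per_zone):
    """Count replicas that fit, honoring a zone allow-list and a per-zone cap."""
    per_zone = {}
    for node in nodes:
        if node.zone not in allowed_zones:
            continue
        # A replica needs both CPU and memory on the same node.
        fits = min(node.free_cpu // cpu, node.free_mem_gib // mem_gib)
        per_zone[node.zone] = per_zone.get(node.zone, 0) + fits
    return sum(min(count, max_per_zone) for count in per_zone.values())


nodes = [
    Node("eu-west-1a", free_cpu=16, free_mem_gib=8),   # CPU-rich, memory-poor
    Node("eu-west-1a", free_cpu=2, free_mem_gib=64),   # memory-rich, CPU-poor
    Node("eu-west-1b", free_cpu=8, free_mem_gib=32),
    Node("eu-west-1c", free_cpu=8, free_mem_gib=32),   # excluded by policy below
]

# Naive view: total free CPU divided by the per-replica CPU request.
naive = sum(n.free_cpu for n in nodes) // 4

# Constraint-aware view: same nodes, but with shape, zone, and spread rules.
actual = placeable_replicas(
    nodes, cpu=4, mem_gib=16,
    allowed_zones={"eu-west-1a", "eu-west-1b"}, max_per_zone=2,
)
print(naive, actual)
```

The naive figure suggests room for eight replicas, while only two are placeable once node shape and zone policy are applied, so a four-replica recovery would still be stuck.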

The design should distinguish:

hard capacity        -> committed quota, reserved recovery pool, unique binding
reclaimable capacity -> lower-priority work that may be preempted or drained
soft capacity        -> hints, forecasts, autoscaler targets, expected future nodes

Those categories should not be mixed. Soft future capacity should not satisfy a hard recovery commitment unless the SLO explicitly allows waiting for it.
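A capacity check can make this separation explicit. The sketch below assumes capacity is reported as per-class replica counts; the `CapacityClass` enum, the `countable_replicas` function, and the example pool sizes are hypothetical.

```python
from enum import Enum


class CapacityClass(Enum):
    HARD = "hard"                # committed quota, reserved recovery pool
    RECLAIMABLE = "reclaimable"  # lower-priority work that may be preempted
    SOFT = "soft"                # forecasts, autoscaler targets, future nodes


def countable_replicas(pools, allow_preemption, slo_allows_waiting=False):
    """Sum only the capacity classes that may back a hard commitment."""
    total = 0
    for capacity_class, replicas in pools:
        if capacity_class is CapacityClass.HARD:
            total += replicas
        elif capacity_class is CapacityClass.RECLAIMABLE and allow_preemption:
            total += replicas
        elif capacity_class is CapacityClass.SOFT and slo_allows_waiting:
            total += replicas
    return total


eu_west_pools = [
    (CapacityClass.HARD, 2),
    (CapacityClass.RECLAIMABLE, 1),
    (CapacityClass.SOFT, 3),
]
required = 4

strict = countable_replicas(eu_west_pools, allow_preemption=True)
with_forecast = countable_replicas(
    eu_west_pools, allow_preemption=True, slo_allows_waiting=True
)
print(strict >= required, with_forecast >= required)
```

With these numbers the commitment is met only if forecast capacity is counted, which is the kind of hidden assumption the review should surface.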

Failure Scenario Matrix

A design review should force the system through realistic failure scenarios. The point is not to invent every possible outage. The point is to cover the boundaries where this control plane is likely to lie to itself.

| Scenario | Review question | Evidence to require |
|---|---|---|
| API write commits but client times out | Can retry avoid duplicate binding or reservation? | stable operation ID, read-after-timeout test |
| Scheduler cache is stale | Does the controller publish the resource version it used? | decision telemetry, stale-cache simulation |
| Autoscaler sees low readiness | Can it distinguish capacity shortage from placement lag? | pending reasons, bounded scale-up policy |
| Region loses capacity | Which replicas recover, where, and under whose budget? | recovery budget, topology plan, failover test |
| Policy rollback happens mid-recovery | Which side effects are preserved, drained, or repaired? | revision metadata, rollback runbook, repair conditions |
| Human override is applied | Does it expire and avoid fighting reconciliation? | override API, TTL, audit, postcondition checks |
| Two schedulers act concurrently | Which invariant prevents double commitment? | authoritative bind path, conflict tests |
| Repair loop runs during incident | Can it tell leaked state from useful partial progress? | owner references, finalizers, deadlines |

This matrix should connect directly to testing. If a scenario is important enough to appear in review, it should have some combination of unit test, integration test, simulation, replay case, chaos experiment, or runbook drill.
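As one example of turning a matrix row into a test, the timeout-after-commit scenario can be exercised against an in-memory stand-in for the bind path. Everything here is an assumption for illustration: `BindStore`, `BindConflict`, and `bind_with_lost_reply` are hypothetical, and a real bind path would persist this state in the authoritative store.

```python
class BindConflict(Exception):
    """Raised when a different operation tries to bind an already-bound replica."""


class BindStore:
    """In-memory stand-in for an authoritative bind path (hypothetical)."""

    def __init__(self):
        self._by_operation = {}  # operation_id -> node chosen at first commit
        self._by_replica = {}    # replica -> node

    def bind(self, operation_id, replica, node):
        if operation_id in self._by_operation:
            # Replayed operation: return the original result, change nothing.
            return self._by_operation[operation_id]
        if replica in self._by_replica:
            raise BindConflict(f"{replica} already bound to {self._by_replica[replica]}")
        self._by_operation[operation_id] = node
        self._by_replica[replica] = node
        return node

    def bindings(self):
        return dict(self._by_replica)


def bind_with_lost_reply(store, operation_id, replica, node):
    """Simulate a write that commits but whose reply never reaches the client."""
    store.bind(operation_id, replica, node)
    raise TimeoutError("reply lost after commit")


store = BindStore()
try:
    bind_with_lost_reply(store, "op-1", "risk-api-0", "node-a")
except TimeoutError:
    # The retry reuses the operation ID even though the scheduler now prefers node-b.
    result = store.bind("op-1", "risk-api-0", "node-b")

print(result, store.bindings())
```

The retry observes the original placement instead of creating a second one, and a bind for the same replica under a new operation ID is rejected as a conflict rather than silently overwriting state.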

Review Questions

A senior review usually comes down to a few categories: authority, progress, capacity, safety, and evidence.

The review should not require perfect answers to every question. It should require honest boundaries. A design that says "we cannot guarantee fairness during regional disaster, but we preserve recovery capacity and reconcile fairness over the next hour" is much stronger than a design that claims fairness without explaining the control path.

Worked Example: Reviewing risk-api Recovery

Suppose the proposal says:

risk-api must recover four serving replicas in eu-west
within five minutes after eu-central loses capacity.
Tenant fairness must remain bounded.
Operators may apply emergency overrides.

A weak review accepts the architecture because it has a scheduler, autoscaler, rollback controller, and dashboards.

A stronger review asks for the path:

1. Admission marks risk-api recovery as priority critical.
2. Global policy grants eu-west a recovery budget.
3. Regional scheduler places replicas within that budget.
4. Binding uses stable workload identity and conflict checks.
5. Rollout publishes bound, starting, ready, and serving counts.
6. Autoscaler is bounded by pending reasons and recovery deadline.
7. Repair cleans leaked reservations with owner-aware deadlines.
8. Human override can pause expansion or avoid a zone for 30 minutes.
9. Observability joins generation, policy revision, quota state, and bind result.
10. Simulation covers stale quota, timeout-after-commit, rollback, and leader restart.

Now the reviewers can see the trade-offs. The design spends coordination at budget allocation and binding, not every local placement. It preserves fast regional progress but accepts that global fairness is reconciled over a window rather than instantly. It depends on reserved capacity, visible partial progress, and tested repair behavior. Those are concrete claims that can be challenged.
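The budget and override steps of that path can be sketched together. This is an illustration under stated assumptions, not the proposal's implementation: `RegionalScheduler` and `Override` are hypothetical types, time is passed in explicitly as seconds, zone names are invented, and placement ignores node-level fit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Override:
    """Operator instruction to avoid a zone; expires on its own (hypothetical)."""
    avoid_zone: str
    created_at_s: float
    ttl_s: float
    author: str  # kept for audit

    def active(self, now_s):
        return now_s < self.created_at_s + self.ttl_s


class RegionalScheduler:
    """Places replicas locally inside a budget granted once by global policy."""

    def __init__(self, region, budget_replicas):
        self.region = region
        self.budget_replicas = budget_replicas
        self.placed = []  # list of (replica, zone)

    def place(self, replica, zones, overrides, now_s):
        if len(self.placed) >= self.budget_replicas:
            return None  # budget exhausted: needs a new grant from global policy
        avoided = {o.avoid_zone for o in overrides if o.active(now_s)}
        for zone in zones:
            if zone not in avoided:
                self.placed.append((replica, zone))
                return zone
        return None  # every candidate zone is currently avoided


zones = ["eu-west-1a", "eu-west-1b"]
overrides = [Override("eu-west-1a", created_at_s=0, ttl_s=1800, author="oncall")]

scheduler = RegionalScheduler("eu-west", budget_replicas=4)
results = [
    scheduler.place(f"risk-api-{i}", zones, overrides, now_s=60) for i in range(5)
]

# After 30 minutes the override no longer applies to new placements.
later = RegionalScheduler("eu-west", budget_replicas=1)
after_expiry = later.place("risk-api-0", zones, overrides, now_s=1800)
print(results, after_expiry)
```

The first four placements proceed without any global call and avoid the overridden zone, the fifth is refused because it exceeds the granted budget, and the override stops influencing placement once its TTL has elapsed.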
