Distributed Schedulers and Control Planes: Capstone: Build a Distributed Scheduler Control Plane
LESSON
Distributed Schedulers and Control Planes: Capstone: Build a Distributed Scheduler Control Plane
The core idea: A distributed scheduler control plane is credible only when desired state, authority, placement, repair, observability, testing, and human control all support the same SLO and capacity story.
Core Insight
You are designing the scheduler control plane for a multi-region platform that runs risk-api, billing-api, search workers, fraud detection jobs, and best-effort analytics. The platform has three regions, several node pools, hard tenant quotas, regional recovery budgets, priority classes, and a requirement that critical services recover useful capacity quickly during regional failure.
The design is not finished when you draw a scheduler box. The scheduler is only one part of the control plane. Admission shapes intent. Queues express priority. Filters and scores choose candidates. Binding commits scarce capacity. Autoscalers, rollout controllers, quota controllers, repair controllers, regional schedulers, and human overrides all change the state that scheduling depends on.
This capstone asks you to build the complete control story. The trade-off is precision versus survivability: a single global scheduler can make cleaner decisions with fresh state, but it may become slow or fragile during stress; regional or optimistic designs can keep moving, but they need budgets, invariants, repair, and observability to prevent hidden damage.
Scenario
Your platform must support these workloads:
| Workload | Requirement | Scheduling pressure |
|---|---|---|
risk-api |
critical, multi-region recovery, low startup latency | reserved capacity, topology, fast rollback |
billing-api |
high correctness, steady traffic | strong isolation and disruption control |
| search indexing | bursty, memory-heavy workers | shape-aware placement and backpressure |
| fraud detection jobs | urgent batch work with deadlines | priority and preemption policy |
| analytics jobs | best-effort and reclaimable | opportunistic capacity and safe eviction |
The platform has these constraints:
- each tenant has quota and protected capacity
- some capacity is reserved for recovery
- regions may lose nodes, zones, or watch connectivity
- API writes can commit after the client sees a timeout
- scheduler caches can be stale
- operators need scoped emergency overrides
- the system must be debuggable from one workload generation
- every unsafe side effect must have a repair or finalization path
The service objective for the design review is:
Critical workloads recover at least four serving replicas in a healthy region
within five minutes of a regional capacity loss, while preserving unique binding,
tenant isolation, and bounded control-plane load.
What to Design
Your capstone design should include six parts.
1. State Model
Define the objects that make up the control surface. At minimum, name:
Workloador equivalent desired-state objectSchedulingRequestor queue itemBindingor placement commitmentReservationfor capacity held before bindSchedulerPolicyfor filters, scores, and priority rulesQuotaor tenant capacity budgetRecoveryBudgetfor critical failoverOverridefor human operational control- status conditions with
observedGeneration
For each object, say which controller owns it, which fields are authoritative, which fields are status, and what cleanup path exists.
2. Scheduling Path
Design the path from desired work to running capacity:
admit -> queue -> filter -> score -> reserve -> bind -> start -> ready -> serving
Name the hard gates and soft preferences. For example:
- hard gates: quota, topology safety, unique binding, required labels, node health
- soft preferences: locality, cost, fragmentation, warm images, balanced spread
- commit points: reservation and binding writes
- retry points: queue reattempt, bind conflict, timeout after possible commit
The path should show how a controller avoids duplicate reservations or bindings when a write times out after committing.
3. Authority Boundaries
Decide where coordination is required and where local progress is allowed.
A reasonable design might use:
- global policy controller for tenant budgets and recovery allocations
- regional schedulers for local placement inside assigned budgets
- authoritative API write path for binding and ownership transfer
- repair controller for leaked reservations, stale finalizers, and expired overrides
- autoscaler that reads pending reasons before scaling up
- rollout controller that exposes bound, ready, and serving progress
You may choose a centralized, hierarchical, optimistic, or regional design. The choice matters less than the argument. State which invariants require strong coordination and which decisions can use hints or local budgets.
4. Failure and Recovery Plan
Walk through at least five failure scenarios:
- regional capacity loss during recovery
- stale scheduler cache
- timeout after reservation or binding commit
- scheduler leader restart during placement
- policy rollback while replicas are starting
- autoscaler amplification during low readiness
- repair loop running during incident
- human override that must expire
For each scenario, answer:
- What desired state changes?
- Which controller acts first?
- What partial progress is recorded?
- What invariant must still hold?
- What telemetry proves the state of the operation?
- What repair or rollback path exists?
5. Observability, Testing, and Replay
The design must let an operator reconstruct one decision:
workload generation
queue time
policy revision
quota state
cache resource version
filter and score reasons
reservation or bind result
status condition
next owning controller
It must also include tests:
- unit tests for filters, scoring, quota, status transitions, and policy parsing
- integration tests for API conflict, observed generation, and finalizers
- property tests for invariants such as unique binding
- simulation for stale watches, queue ordering, restarts, and retries
- deterministic replay for at least one duplicate-reservation or rollback race
- operational drills for override expiry and recovery runbooks
6. Operating Model
Define how people operate the system:
- alerts before user-facing impact, such as queue age and recovery budget exhaustion
- runbooks for recovery override, zone avoidance, autoscaler pause, and repair approval
- break-glass access with audit and TTL
- dashboards that separate scheduling, binding, startup, readiness, and serving
- promotion and rollback of scheduler policy revisions
- post-incident cleanup checks for reservations, overrides, and emergency quotas
This section is part of the architecture, not an appendix. A control plane that cannot be operated safely is not complete.
Reference Architecture
One defensible architecture is hierarchical:
operator/API
|
v
global control plane
- tenant quota
- recovery budgets
- scheduler policy revisions
- override admission
|
v
regional schedulers
- queue and local cache
- filter and score
- reserve and bind through authoritative API
- publish decision status
|
v
node pools, rollout controllers, autoscalers, repair controllers
This architecture spends global coordination on policy, budgets, and binding authority. It lets regional schedulers make fast placement decisions inside bounded capacity. It accepts that global fairness may be reconciled over a window during disaster, but it does not accept duplicate binding or unowned reservations.
Your own design can be different. A centralized scheduler may be valid if the recovery SLO and load model support it. An optimistic shared-state design may be valid if conflict rates, retries, and fairness are tested. A more distributed design may be valid if its local decisions are bounded and repair is explicit.
Readiness Review
Before calling the design complete, check these claims:
| Claim | Evidence |
|---|---|
| Critical recovery meets the five-minute target | capacity model, failover drill, queue and readiness metrics |
| Unique binding is preserved | authoritative commit path, invariant test, replay after timeout |
| Tenant isolation is bounded | quota model, preemption policy, audit of emergency capacity |
| Autoscaling does not amplify placement lag | pending reasons, scale-up limits, simulation |
| Rollback and repair are state-aware | revision metadata, owner references, finalizers, repair conditions |
| Operators can act safely | scoped overrides, TTL, audit, runbooks, cleanup checks |
| Decisions are debuggable | generation, policy revision, resource version, reasons, status |
| Coordination boundaries are justified | invariant map, budget model, conflict and skew analysis |
If you cannot attach evidence to a claim, treat the claim as a design risk.
Common Failure Patterns
- Architecture without invariants: the design names controllers but not the properties they preserve. Add explicit safety and liveness claims.
- SLO without capacity: the recovery target depends on capacity that is only expected, not reserved or reclaimable. Tie the target to real capacity classes.
- Local schedulers without authority boundaries: regional progress is fast but can violate global budget or unique binding. Define the commit path and budget expiry.
- Autoscaler fights scheduler: low readiness creates more desired work while placement is blocked. Feed pending reasons and deadlines into scaling.
- Human overrides are invisible: manual fixes bypass reconciliation and remain after the incident. Use explicit override objects with TTL and audit.
- Testing stops at happy path: the design has no replay for timeout-after-commit, stale cache, leader restart, or rollback race. Add simulation and deterministic replay.
Deliverable
Produce a design document with:
- one architecture diagram or ASCII state flow
- object model with ownership, status, and cleanup
- scheduling algorithm with hard gates and soft preferences
- authority and coordination boundary map
- capacity and recovery budget model
- failure scenario matrix
- observability plan for decision reconstruction
- test and replay plan
- runbook and override model
- readiness review with residual risks
The result should let another engineer understand not only what the scheduler does, but what the control plane promises, what it refuses to promise, and how those promises will be checked.
Connections
- The previous lesson,
023.md, provided the design-review frame. This capstone applies it as a complete architecture exercise. consensus-and-coordinationsupports the authority, lease, and ownership boundaries in the design.cloud-platform-and-microservicesprovides adjacent context for platform APIs, controllers, rollout surfaces, and operating models.
Resources
- [DOC] Kubernetes Scheduling Framework
- Focus: Use extension points to reason about filter, score, reserve, permit, bind, and post-bind phases.
- [DOC] Kubernetes Controllers
- Focus: Ground the capstone in reconciliation loops and desired-state control.
- [PAPER] Omega: Flexible, Scalable Schedulers for Large Compute Clusters
- Focus: Compare centralized, two-level, and optimistic shared-state scheduling trade-offs.
- [PAPER] Large-scale cluster management at Google with Borg
- Focus: Study production-scale scheduling, isolation, and operational constraints.
- [BOOK] Site Reliability Engineering: Service Level Objectives
- Focus: Tie scheduler control-plane design back to explicit service commitments.
Key Takeaways
- A distributed scheduler control plane combines state models, placement decisions, authority boundaries, repair, observability, tests, and operations into one reliability argument.
- Hard invariants such as unique binding and quota commitment need authoritative checks; preferences and local progress can often use budgets, hints, and repair.
- The capstone design is ready only when its SLO, capacity, failure scenarios, and operational controls have concrete evidence.
- The central trade-off is precise global optimization versus bounded local progress under real failure and load.