Distributed Schedulers and Control Planes: Capstone: Build a Distributed Scheduler Control Plane


Lesson 024 · 35 min · advanced · capstone

The core idea: A distributed scheduler control plane is credible only when desired state, authority, placement, repair, observability, testing, and human control all support the same SLO and capacity story.

Core Insight

You are designing the scheduler control plane for a multi-region platform that runs risk-api, billing-api, search workers, fraud detection jobs, and best-effort analytics. The platform has three regions, several node pools, hard tenant quotas, regional recovery budgets, priority classes, and a requirement that critical services recover useful capacity quickly during regional failure.

The design is not finished when you draw a scheduler box. The scheduler is only one part of the control plane. Admission shapes intent. Queues express priority. Filters and scores choose candidates. Binding commits scarce capacity. Autoscalers, rollout controllers, quota controllers, repair controllers, regional schedulers, and human overrides all change the state that scheduling depends on.

This capstone asks you to build the complete control story. The trade-off is precision versus survivability: a single global scheduler can make cleaner decisions with fresh state, but it may become slow or fragile during stress; regional or optimistic designs can keep moving, but they need budgets, invariants, repair, and observability to prevent hidden damage.

Scenario

Your platform must support these workloads:

Workload	Requirement	Scheduling pressure
`risk-api`	critical, multi-region recovery, low startup latency	reserved capacity, topology, fast rollback
`billing-api`	high correctness, steady traffic	strong isolation and disruption control
search indexing	bursty, memory-heavy workers	shape-aware placement and backpressure
fraud detection jobs	urgent batch work with deadlines	priority and preemption policy
analytics jobs	best-effort and reclaimable	opportunistic capacity and safe eviction

The platform has these constraints:

each tenant has quota and protected capacity
some capacity is reserved for recovery
regions may lose nodes, zones, or watch connectivity
API writes can commit after the client sees a timeout
scheduler caches can be stale
operators need scoped emergency overrides
any scheduling decision must be debuggable starting from a single workload generation
every unsafe side effect must have a repair or finalization path

The service objective for the design review is:

Critical workloads recover at least four serving replicas in a healthy region
within five minutes of a regional capacity loss, while preserving unique binding,
tenant isolation, and bounded control-plane load.
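One way to make this objective checkable is a small capacity and time-budget calculation. The sketch below is illustrative only: the function name, the stage names, and every number are assumptions rather than measurements, and a real model would draw them from drills and capacity data.

```python
def recovery_check(reserved_cpu, reclaimable_cpu, cpu_per_replica,
                   stage_seconds, min_replicas=4, deadline_seconds=300):
    """Return (ok, replicas_possible, total_seconds) for one healthy region.

    Hypothetical model: capacity counts only if it is reserved or reclaimable,
    and recovery time is the sum of sequential stages.
    """
    replicas_possible = int((reserved_cpu + reclaimable_cpu) // cpu_per_replica)
    total_seconds = sum(stage_seconds.values())
    ok = replicas_possible >= min_replicas and total_seconds <= deadline_seconds
    return ok, replicas_possible, total_seconds


# Assumed stage durations in seconds; replace with drill measurements.
stages = {"detect": 60, "schedule": 20, "evict": 45, "start": 90, "ready": 30}

with_reclaim = recovery_check(reserved_cpu=6.0, reclaimable_cpu=4.0,
                              cpu_per_replica=2.0, stage_seconds=stages)
without_reclaim = recovery_check(reserved_cpu=6.0, reclaimable_cpu=0.0,
                                 cpu_per_replica=2.0, stage_seconds=stages)
```

With these assumed inputs, the first call passes (five replicas in 245 seconds) and the second fails on capacity (three replicas), which shows why the target has to name the capacity class it depends on.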

What to Design

Your capstone design should include six parts.

1. State Model

Define the objects that make up the control surface. At minimum, name:

Workload or equivalent desired-state object
SchedulingRequest or queue item
Binding or placement commitment
Reservation for capacity held before bind
SchedulerPolicy for filters, scores, and priority rules
Quota or tenant capacity budget
RecoveryBudget for critical failover
Override for human operational control
status conditions with observedGeneration

For each object, say which controller owns it, which fields are authoritative, which fields are status, and what cleanup path exists.
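A minimal sketch of how authoritative fields, status, and observedGeneration might be separated on one object. The class names, the controller name, and the field layout are hypothetical and are not taken from any specific API.

```python
from dataclasses import dataclass, field


@dataclass
class Condition:
    type: str                 # e.g. "Scheduled" or "Ready"
    status: bool
    reason: str
    observed_generation: int  # spec generation the owning controller last acted on


@dataclass
class Workload:
    name: str
    tenant: str
    generation: int           # authoritative: bumped on every spec change
    replicas: int             # authoritative: desired state
    owner: str = "workload-controller"             # hypothetical owning controller
    conditions: list = field(default_factory=list)  # status only, never desired state

    def is_current(self, condition_type: str) -> bool:
        """True only if the named condition describes the current generation."""
        return any(
            c.type == condition_type and c.observed_generation == self.generation
            for c in self.conditions
        )


wl = Workload(name="risk-api", tenant="risk", generation=3, replicas=4)
wl.conditions.append(Condition("Scheduled", True, "Bound", observed_generation=2))
stale = wl.is_current("Scheduled")   # status still describes generation 2

wl.conditions[0].observed_generation = 3
fresh = wl.is_current("Scheduled")   # status has caught up with the spec
```

The point of the comparison is that a condition with an older observed generation says nothing about the current spec, so consumers such as rollout controllers should treat it as unknown rather than as success.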

2. Scheduling Path

Design the path from desired work to running capacity:

admit -> queue -> filter -> score -> reserve -> bind -> start -> ready -> serving

Name the hard gates and soft preferences. For example:

hard gates: quota, topology safety, unique binding, required labels, node health
soft preferences: locality, cost, fragmentation, warm images, balanced spread
commit points: reservation and binding writes
retry points: queue reattempt, bind conflict, timeout after possible commit
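The split between hard gates and soft preferences can be sketched as a filter that returns rejection reasons and a score that only ranks the survivors. The node fields, reason strings, and weights below are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    zone: str
    healthy: bool
    free_cpu: float
    warm_image: bool


def hard_gates(node, cpu_request, quota_left, avoid_zones):
    """Return the reasons a node is rejected; an empty list means it passes."""
    reasons = []
    if not node.healthy:
        reasons.append("node-unhealthy")
    if node.zone in avoid_zones:
        reasons.append("topology-unsafe")
    if node.free_cpu < cpu_request:
        reasons.append("insufficient-cpu")
    if quota_left < cpu_request:
        reasons.append("quota-exceeded")
    return reasons


def soft_score(node, cpu_request):
    """Higher is better. The weights are illustrative, not tuned."""
    score = 10.0 if node.warm_image else 0.0
    score -= node.free_cpu - cpu_request   # prefer a tighter fit (less fragmentation)
    return score


def pick(nodes, cpu_request, quota_left, avoid_zones=frozenset()):
    """Apply hard gates first, then rank survivors; keep reasons for debugging."""
    rejected, candidates = {}, []
    for n in nodes:
        reasons = hard_gates(n, cpu_request, quota_left, avoid_zones)
        if reasons:
            rejected[n.name] = reasons
        else:
            candidates.append(n)
    best = max(candidates, key=lambda n: (soft_score(n, cpu_request), n.name),
               default=None)
    return best, rejected


nodes = [
    Node("a1", "zone-a", True, 8.0, False),
    Node("b1", "zone-b", True, 4.0, True),
    Node("c1", "zone-c", False, 16.0, True),
]
best, rejected = pick(nodes, cpu_request=2.0, quota_left=4.0, avoid_zones={"zone-a"})
```

A score can never rescue a node that failed a gate, and the rejection reasons are kept so that a pending workload can report why no node fit.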

The path should show how a controller avoids duplicate reservations or bindings when a write times out after committing.
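One common way to handle a timeout after a possible commit is to attach a stable request ID to the write and to read authoritative state before retrying. The sketch below uses an in-memory stand-in for the API; `FakeApi`, `bind_once`, and the ID format are hypothetical.

```python
class BindConflict(Exception):
    """Raised when a different request already owns the binding."""


class FakeApi:
    """In-memory stand-in for the authoritative write path (an assumption)."""

    def __init__(self, drop_first_ack=False):
        self.bindings = {}              # replica_id -> (node, request_id)
        self._drop_ack = drop_first_ack

    def create_binding(self, replica_id, node, request_id):
        existing = self.bindings.get(replica_id)
        if existing is not None:
            if existing[1] == request_id:
                return existing[0]            # idempotent replay of the same request
            raise BindConflict(replica_id)    # unique binding: another request won
        self.bindings[replica_id] = (node, request_id)
        if self._drop_ack:
            self._drop_ack = False
            raise TimeoutError("ack lost after commit")
        return node

    def get_binding(self, replica_id):
        return self.bindings.get(replica_id)


def bind_once(api, replica_id, node, request_id, max_attempts=3):
    """Bind at most once, even if an acknowledgement is lost."""
    for _ in range(max_attempts):
        try:
            return api.create_binding(replica_id, node, request_id)
        except TimeoutError:
            # Outcome unknown: consult authoritative state before retrying.
            current = api.get_binding(replica_id)
            if current is not None:
                if current[1] == request_id:
                    return current[0]         # our write did commit
                raise BindConflict(replica_id)
    raise TimeoutError("bind outcome still unknown")


api = FakeApi(drop_first_ack=True)
bound_node = bind_once(api, "risk-api-0", "node-b1", request_id="req-42")
```

The same request ID must be reused across retries; generating a fresh ID per attempt would turn a harmless replay into a conflict or a duplicate.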

3. Authority Boundaries

Decide where coordination is required and where local progress is allowed.

A reasonable design might use:

global policy controller for tenant budgets and recovery allocations
regional schedulers for local placement inside assigned budgets
authoritative API write path for binding and ownership transfer
repair controller for leaked reservations, stale finalizers, and expired overrides
autoscaler that reads pending reasons before scaling up
rollout controller that exposes bound, ready, and serving progress

You may choose a centralized, hierarchical, optimistic, or regional design. The choice matters less than the argument. State which invariants require strong coordination and which decisions can use hints or local budgets.
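As one illustration of local progress inside a bounded budget, a regional scheduler could hold a time-limited budget lease issued by the global policy controller. This is a sketch under assumptions: the `BudgetLease` type, the region name, and the explicit `now` parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BudgetLease:
    region: str
    cpu_limit: float     # budget granted by the global controller
    expires_at: float    # seconds on the caller's clock
    cpu_used: float = 0.0

    def try_reserve(self, cpu, now):
        """Reserve locally without global coordination, within lease bounds."""
        if now >= self.expires_at:
            return False, "lease-expired"      # must renew before placing more work
        if self.cpu_used + cpu > self.cpu_limit:
            return False, "budget-exhausted"   # must ask for a larger budget
        self.cpu_used += cpu
        return True, "ok"


lease = BudgetLease(region="eu-west", cpu_limit=8.0, expires_at=100.0)
first = lease.try_reserve(6.0, now=10.0)
second = lease.try_reserve(4.0, now=20.0)
late = lease.try_reserve(1.0, now=100.0)
```

The lease bounds how far a partitioned region can drift from global policy: once it expires, the region stops admitting new reservations. Unique binding still needs the authoritative write path, because a budget limits quantity, not ownership.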

4. Failure and Recovery Plan

Walk through at least five of these failure scenarios:

regional capacity loss during recovery
stale scheduler cache
timeout after reservation or binding commit
scheduler leader restart during placement
policy rollback while replicas are starting
autoscaler amplification during low readiness
repair loop running during incident
human override that must expire

For each scenario, answer:

What desired state changes?
Which controller acts first?
What partial progress is recorded?
What invariant must still hold?
What telemetry proves the state of the operation?
What repair or rollback path exists?
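As a worked illustration of one scenario, autoscaler amplification, the sketch below damps scale-up when pending replicas are blocked on placement. The reason strings, the blocked set, and the step limit are assumptions, not a real autoscaler API.

```python
# Pending reasons for which adding more desired replicas cannot help (assumed set).
BLOCKED_REASONS = {"quota-exceeded", "no-capacity", "bind-conflict"}


def scale_up_delta(desired, ready, pending_by_reason, max_step=2):
    """Return how many replicas to add this cycle.

    desired: replicas implied by current load
    ready: replicas currently serving
    pending_by_reason: pending replica count keyed by pending reason
    """
    blocked = sum(n for reason, n in pending_by_reason.items()
                  if reason in BLOCKED_REASONS)
    if blocked > 0:
        return 0   # placement is the bottleneck, so more desired work only adds load
    in_flight = sum(pending_by_reason.values())
    shortfall = desired - ready - in_flight
    return max(0, min(shortfall, max_step))


progressing = scale_up_delta(desired=8, ready=3, pending_by_reason={"starting": 2})
blocked = scale_up_delta(desired=8, ready=3, pending_by_reason={"no-capacity": 3})
covered = scale_up_delta(desired=4, ready=3, pending_by_reason={"starting": 1})
```

In terms of the questions above, the invariant is bounded control-plane load, the partial progress is the pending count by reason, and the telemetry is the reason distribution itself.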

5. Observability, Testing, and Replay

The design must let an operator reconstruct one decision:

workload generation
queue time
policy revision
quota state
cache resource version
filter and score reasons
reservation or bind result
status condition
next owning controller
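The fields listed above can be captured in one structured record per decision. The following sketch assumes a JSON log line and hypothetical field names and values; it is not a prescribed schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DecisionRecord:
    workload: str
    generation: int
    queued_seconds: float
    policy_revision: str
    quota_remaining_cpu: float
    cache_resource_version: str
    rejected: dict      # node name -> list of filter reasons
    scores: dict        # node name -> score
    result: str         # reservation or bind outcome
    condition: str      # status condition written after the decision
    next_owner: str     # controller expected to act next

    def to_log_line(self):
        """Serialize with sorted keys so that records diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)


record = DecisionRecord(
    workload="risk-api",
    generation=7,
    queued_seconds=1.8,
    policy_revision="rev-12",
    quota_remaining_cpu=4.0,
    cache_resource_version="88231",
    rejected={"a1": ["topology-unsafe"]},
    scores={"b1": 8.0},
    result="bound:b1",
    condition="Scheduled=True",
    next_owner="rollout-controller",
)
line = record.to_log_line()
```

Recording the cache resource version and policy revision alongside the reasons is what lets a later reader tell a wrong decision from a correct decision made on stale input.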

It must also include tests:

unit tests for filters, scoring, quota, status transitions, and policy parsing
integration tests for API conflict, observed generation, and finalizers
property tests for invariants such as unique binding
simulation for stale watches, queue ordering, restarts, and retries
deterministic replay for at least one duplicate-reservation or rollback race
operational drills for override expiry and recovery runbooks
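A property test and a deterministic replay can share one seeded simulation. The sketch below is a toy model with hypothetical names: it compares a check against authoritative state with a check against a periodically refreshed cache, and uses the seed as the replay handle.

```python
import random


def run_schedule(seed, use_cas=True, replicas=5, steps=40, refresh_every=10):
    """Replay one interleaving of bind attempts, fully determined by the seed."""
    rng = random.Random(seed)
    bindings = {}   # authoritative state: replica -> node
    cache = {}      # scheduler's snapshot, stale between refreshes
    history = []    # every bind that was accepted
    for step in range(steps):
        if step % refresh_every == 0:
            cache = dict(bindings)
        replica = rng.randrange(replicas)
        node = rng.choice(["n1", "n2", "n3"])
        # Correct path checks authoritative state; buggy path trusts the cache.
        view = bindings if use_cas else cache
        if replica not in view:
            bindings[replica] = node
            history.append((step, replica, node))
    return history


def unique_binding_holds(history):
    """Invariant: no replica is bound more than once."""
    seen = set()
    for _, replica, _ in history:
        if replica in seen:
            return False
        seen.add(replica)
    return True


# Property check over many seeds for the authoritative path.
violations = [s for s in range(200) if not unique_binding_holds(run_schedule(s))]

# The stale-cache variant breaks the invariant; the seed reproduces the same race.
buggy = run_schedule(7, use_cas=False)
replayed = run_schedule(7, use_cas=False)
```

In this toy model the stale-cache variant always fails, because ten attempts over five replicas against an empty snapshot must repeat a replica. A real simulation would need many seeds to find a failing interleaving and would then pin that seed as a regression test.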

6. Operating Model

Define how people operate the system:

alerts before user-facing impact, such as queue age and recovery budget exhaustion
runbooks for recovery override, zone avoidance, autoscaler pause, and repair approval
break-glass access with audit and TTL
dashboards that separate scheduling, binding, startup, readiness, and serving
promotion and rollback of scheduler policy revisions
post-incident cleanup checks for reservations, overrides, and emergency quotas

This section is part of the architecture, not an appendix. A control plane that cannot be operated safely is not complete.
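A scoped override with a TTL can be modelled as an explicit object that a repair loop expires and audits. The sketch below assumes hypothetical names, scopes, and an explicit `now` parameter in place of a real clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Override:
    name: str
    scope: str          # e.g. "region=eu-west,tenant=risk" (illustrative format)
    created_by: str
    created_at: float   # seconds on the controller's clock
    ttl_seconds: float

    def active(self, now):
        return now < self.created_at + self.ttl_seconds


def reconcile_overrides(overrides, now, audit_log):
    """Keep active overrides; drop expired ones and record each removal."""
    kept = []
    for o in overrides:
        if o.active(now):
            kept.append(o)
        else:
            audit_log.append(
                f"expired override {o.name} scope={o.scope} by={o.created_by}"
            )
    return kept


recover = Override("recover-risk", "region=eu-west", "alice", 0.0, 900.0)
avoid = Override("avoid-zone-a", "zone=zone-a", "bob", 0.0, 300.0)
audit = []
remaining = reconcile_overrides([recover, avoid], now=600.0, audit_log=audit)
```

Because expiry is handled by reconciliation rather than by the person who created the override, a forgotten manual fix cannot outlive its TTL, and the audit entry records that it was removed.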

Reference Architecture

One defensible architecture is hierarchical:

operator/API
    |
    v
global control plane
  - tenant quota
  - recovery budgets
  - scheduler policy revisions
  - override admission
    |
    v
regional schedulers
  - queue and local cache
  - filter and score
  - reserve and bind through authoritative API
  - publish decision status
    |
    v
node pools, rollout controllers, autoscalers, repair controllers

This architecture spends global coordination on policy, budgets, and binding authority. It lets regional schedulers make fast placement decisions inside bounded capacity. It accepts that global fairness may be reconciled over a window during disaster, but it does not accept duplicate binding or unowned reservations.

Your own design can be different. A centralized scheduler may be valid if the recovery SLO and load model support it. An optimistic shared-state design may be valid if conflict rates, retries, and fairness are tested. A more distributed design may be valid if its local decisions are bounded and repair is explicit.

Readiness Review

Before calling the design complete, check these claims:

Claim	Evidence
Critical recovery meets the five-minute target	capacity model, failover drill, queue and readiness metrics
Unique binding is preserved	authoritative commit path, invariant test, replay after timeout
Tenant isolation is bounded	quota model, preemption policy, audit of emergency capacity
Autoscaling does not amplify placement lag	pending reasons, scale-up limits, simulation
Rollback and repair are state-aware	revision metadata, owner references, finalizers, repair conditions
Operators can act safely	scoped overrides, TTL, audit, runbooks, cleanup checks
Decisions are debuggable	generation, policy revision, resource version, reasons, status
Coordination boundaries are justified	invariant map, budget model, conflict and skew analysis

If you cannot attach evidence to a claim, treat the claim as a design risk.

Common Failure Patterns

Architecture without invariants: the design names controllers but not the properties they preserve. Add explicit safety and liveness claims.
SLO without capacity: the recovery target depends on capacity that is only expected, not reserved or reclaimable. Tie the target to real capacity classes.
Local schedulers without authority boundaries: regional progress is fast but can violate global budget or unique binding. Define the commit path and budget expiry.
Autoscaler fights scheduler: low readiness creates more desired work while placement is blocked. Feed pending reasons and deadlines into scaling.
Human overrides are invisible: manual fixes bypass reconciliation and remain after the incident. Use explicit override objects with TTL and audit.
Testing stops at happy path: the design has no replay for timeout-after-commit, stale cache, leader restart, or rollback race. Add simulation and deterministic replay.

Deliverable

Produce a design document with:

one architecture diagram or ASCII state flow
object model with ownership, status, and cleanup
scheduling algorithm with hard gates and soft preferences
authority and coordination boundary map
capacity and recovery budget model
failure scenario matrix
observability plan for decision reconstruction
test and replay plan
runbook and override model
readiness review with residual risks

The result should let another engineer understand not only what the scheduler does, but what the control plane promises, what it refuses to promise, and how those promises will be checked.

Connections

The previous lesson, 023.md, provided the design-review frame. This capstone applies it as a complete architecture exercise.
consensus-and-coordination supports the authority, lease, and ownership boundaries in the design.
cloud-platform-and-microservices provides adjacent context for platform APIs, controllers, rollout surfaces, and operating models.

Resources

[DOC] Kubernetes Scheduling Framework
- Focus: Use extension points to reason about filter, score, reserve, permit, bind, and post-bind phases.
[DOC] Kubernetes Controllers
- Focus: Ground the capstone in reconciliation loops and desired-state control.
[PAPER] Omega: Flexible, Scalable Schedulers for Large Compute Clusters
- Focus: Compare centralized, two-level, and optimistic shared-state scheduling trade-offs.
[PAPER] Large-scale cluster management at Google with Borg
- Focus: Study production-scale scheduling, isolation, and operational constraints.
[BOOK] Site Reliability Engineering: Service Level Objectives
- Focus: Tie scheduler control-plane design back to explicit service commitments.

Key Takeaways

A distributed scheduler control plane combines state models, placement decisions, authority boundaries, repair, observability, tests, and operations into one reliability argument.
Hard invariants such as unique binding and quota commitment need authoritative checks; preferences and local progress can often use budgets, hints, and repair.
The capstone design is ready only when its SLO, capacity, failure scenarios, and operational controls have concrete evidence.
The central trade-off is precise global optimization versus bounded local progress under real failure and load.
