Distributed Schedulers and Control Planes: Capstone: Build a Distributed Scheduler Control Plane

024 35 min advanced CAPSTONE

The core idea: A distributed scheduler control plane is credible only when desired state, authority, placement, repair, observability, testing, and human control all support the same SLO and capacity story.

Core Insight

You are designing the scheduler control plane for a multi-region platform that runs risk-api, billing-api, search workers, fraud detection jobs, and best-effort analytics. The platform has three regions, several node pools, hard tenant quotas, regional recovery budgets, priority classes, and a requirement that critical services recover useful capacity quickly during regional failure.

The design is not finished when you draw a scheduler box. The scheduler is only one part of the control plane. Admission shapes intent. Queues express priority. Filters and scores choose candidates. Binding commits scarce capacity. Autoscalers, rollout controllers, quota controllers, repair controllers, regional schedulers, and human overrides all change the state that scheduling depends on.

This capstone asks you to build the complete control story. The trade-off is precision versus survivability: a single global scheduler can make cleaner decisions with fresh state, but it may become slow or fragile during stress; regional or optimistic designs can keep moving, but they need budgets, invariants, repair, and observability to prevent hidden damage.

Scenario

Your platform must support these workloads:

Workload              Requirement                                           Scheduling pressure
risk-api              critical, multi-region recovery, low startup latency  reserved capacity, topology, fast rollback
billing-api           high correctness, steady traffic                      strong isolation and disruption control
search indexing       bursty, memory-heavy workers                          shape-aware placement and backpressure
fraud detection jobs  urgent batch work with deadlines                      priority and preemption policy
analytics jobs        best-effort and reclaimable                           opportunistic capacity and safe eviction

The platform has these constraints:

The service objective for the design review is:

Critical workloads recover at least four serving replicas in a healthy region
within five minutes of a regional capacity loss, while preserving unique binding,
tenant isolation, and bounded control-plane load.
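
To make the objective checkable in a failover drill, it helps to state it as a predicate over observed events. The sketch below is one possible encoding, not a prescribed one: the `ReadyEvent` shape and the `recovery_met` helper are hypothetical names, and only the four-replica and five-minute figures come from the objective itself.

```python
from dataclasses import dataclass

RECOVERY_TARGET_REPLICAS = 4      # from the stated objective
RECOVERY_WINDOW_SECONDS = 5 * 60  # from the stated objective


@dataclass(frozen=True)
class ReadyEvent:
    """A replica starting to serve (hypothetical event shape)."""
    workload: str
    region: str
    seconds_after_loss: float  # time elapsed since the regional capacity loss


def recovery_met(events: list[ReadyEvent], workload: str, healthy_region: str) -> bool:
    """True if enough replicas were serving in one healthy region inside the window."""
    in_window = [
        e for e in events
        if e.workload == workload
        and e.region == healthy_region
        and e.seconds_after_loss <= RECOVERY_WINDOW_SECONDS
    ]
    return len(in_window) >= RECOVERY_TARGET_REPLICAS
```

A drill that produces three replicas inside the window and a fourth at six minutes fails this check, which is the point: the objective counts serving replicas, not scheduled ones.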

What to Design

Your capstone design should include six parts.

1. State Model

Define the objects that make up the control surface. At minimum, name:

For each object, say which controller owns it, which fields are authoritative, which fields are status, and what cleanup path exists.
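
As a starting point, the ownership questions above can be written directly into the object shape. This is a minimal sketch under assumptions: `Reservation` is a hypothetical object, and the `spec`/`status`/`finalizers` split is borrowed from common control-plane conventions rather than mandated by this capstone.

```python
from dataclasses import dataclass, field


@dataclass
class Reservation:
    """Hypothetical control object: capacity held for a workload before binding."""
    name: str
    owner: str        # the controller responsible for this object
    generation: int   # bumped whenever spec changes
    spec: dict        # authoritative fields: the desired state
    status: dict = field(default_factory=dict)           # observed state, never desired state
    finalizers: list[str] = field(default_factory=list)  # cleanup steps that block deletion
    deletion_requested: bool = False


def can_delete(obj: Reservation) -> bool:
    """An object is removable only after every cleanup step has cleared its finalizer."""
    return obj.deletion_requested and not obj.finalizers


def finish_cleanup(obj: Reservation, finalizer: str) -> None:
    """Called by the controller that completed one cleanup step."""
    if finalizer in obj.finalizers:
        obj.finalizers.remove(finalizer)
```

The design choice worth defending in your document is the cleanup path: a reservation with no owner and no finalizer is exactly the kind of object that leaks capacity after a failure.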

2. Scheduling Path

Design the path from desired work to running capacity:

admit -> queue -> filter -> score -> reserve -> bind -> start -> ready -> serving

Name the hard gates and soft preferences. For example:

The path should also show how a controller avoids duplicate reservations or bindings when a write commits but its acknowledgement times out.
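
One common way to handle the timeout-after-commit case is to make the bind idempotent: the caller retries with the same request ID, and the authoritative store recognizes the retry. The sketch below assumes an in-memory `BindStore` standing in for the authoritative API; the class, its method names, and the result strings are all illustrative.

```python
from typing import Optional


class BindStore:
    """In-memory stand-in for the authoritative API (an assumption for illustration)."""

    def __init__(self) -> None:
        self._bindings: dict[str, tuple[str, str]] = {}  # replica -> (node, request_id)

    def bind(self, replica: str, node: str, request_id: str) -> str:
        existing = self._bindings.get(replica)
        if existing is None:
            self._bindings[replica] = (node, request_id)
            return "bound"
        if existing == (node, request_id):
            return "already-bound"  # retry of a write that had already committed
        return "conflict"           # another decision holds the binding

    def node_for(self, replica: str) -> Optional[str]:
        entry = self._bindings.get(replica)
        return entry[0] if entry else None


def bind_after_lost_reply(store: BindStore, replica: str, node: str, request_id: str) -> str:
    """Simulate a commit whose reply is lost, then a retry with the same request ID."""
    store.bind(replica, node, request_id)         # committed, but the caller saw a timeout
    return store.bind(replica, node, request_id)  # safe retry: no second binding is created
```

The retry returns `already-bound` rather than creating a second binding, and a different scheduler proposing another node for the same replica gets `conflict`. A real store would also need the same guarantee across restarts, which this sketch does not model.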

3. Authority Boundaries

Decide where coordination is required and where local progress is allowed.

A reasonable design might use:

You may choose a centralized, hierarchical, optimistic, or regional design. The choice matters less than the argument. State which invariants require strong coordination and which decisions can use hints or local budgets.
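
An invariant map can be as simple as a table from decision to coordination level, plus a budget object for the decisions that are allowed to proceed locally. The assignments below are one possible argument, not a recommendation, and `Coordination`, `INVARIANT_MAP`, and `RegionalBudget` are hypothetical names.

```python
from enum import Enum


class Coordination(Enum):
    STRONG = "strong"        # must go through the authoritative commit path
    LOCAL_BUDGET = "budget"  # a region may decide alone while inside a granted budget
    HINT = "hint"            # stale data is acceptable; mistakes are repaired later


# Hypothetical invariant map: each entry is a claim your design document must defend.
INVARIANT_MAP = {
    "unique_binding": Coordination.STRONG,
    "tenant_quota": Coordination.LOCAL_BUDGET,
    "node_score": Coordination.HINT,
}


class RegionalBudget:
    """Capacity a region may commit without asking the global control plane."""

    def __init__(self, granted_cpu: int) -> None:
        self.granted_cpu = granted_cpu
        self.used_cpu = 0

    def try_spend(self, cpu: int) -> bool:
        if self.used_cpu + cpu > self.granted_cpu:
            return False  # caller must request more budget from the global plane
        self.used_cpu += cpu
        return True
```

The budget turns a global invariant into a local check: the region can keep placing work during a partition, but only up to the amount the global plane already agreed to.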

4. Failure and Recovery Plan

Walk through at least five failure scenarios:

For each scenario, answer:

5. Observability, Testing, and Replay

The design must let an operator reconstruct one decision:

workload generation
queue time
policy revision
quota state
cache resource version
filter and score reasons
reservation or bind result
status condition
next owning controller
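
The fields above map naturally onto a single structured record emitted per decision. The following is a sketch only: the field names mirror the list, while the types, the `DecisionRecord` name, and the JSON log format are assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecisionRecord:
    """One scheduling decision, captured with enough context to replay it."""
    workload_generation: int
    queue_time_seconds: float
    policy_revision: str
    quota_state: str
    cache_resource_version: str
    filter_reasons: dict   # node name -> reason the node was rejected
    score_reasons: dict    # node name -> score assigned
    bind_result: str
    status_condition: str
    next_owner: str        # controller expected to act next

    def to_log_line(self) -> str:
        # Sorted keys keep log lines diffable across scheduler versions.
        return json.dumps(asdict(self), sort_keys=True)
```

Recording the cache resource version alongside the policy revision is what lets a replay distinguish "the policy was wrong" from "the scheduler acted on stale state".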

It must also include tests:

6. Operating Model

Define how people operate the system:

This section is part of the architecture, not an appendix. A control plane that cannot be operated safely is not complete.
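
Human overrides are easier to reason about when they are admitted like any other object. The sketch below illustrates the scoped, time-limited, audited override that the readiness review asks for; the `Override` fields, the `"*"` wildcard convention, and the one-hour TTL ceiling are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Override:
    """Hypothetical operator override; field names are illustrative."""
    author: str
    scope: str          # for example one tenant or one node pool
    reason: str
    created_at: float   # seconds since epoch
    ttl_seconds: float

    def active(self, now: float) -> bool:
        return now < self.created_at + self.ttl_seconds


def admit_override(o: Override, audit_log: list, max_ttl_seconds: float = 3600) -> bool:
    """Reject unscoped or long-lived overrides, and record every decision."""
    ok = bool(o.scope) and o.scope != "*" and 0 < o.ttl_seconds <= max_ttl_seconds
    audit_log.append((o.author, o.scope, o.reason, "admitted" if ok else "rejected"))
    return ok
```

Because expiry is a property of the override itself, cleanup does not depend on the operator remembering to remove it.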

Reference Architecture

One defensible architecture is hierarchical:

operator/API
    |
    v
global control plane
  - tenant quota
  - recovery budgets
  - scheduler policy revisions
  - override admission
    |
    v
regional schedulers
  - queue and local cache
  - filter and score
  - reserve and bind through authoritative API
  - publish decision status
    |
    v
node pools, rollout controllers, autoscalers, repair controllers

This architecture spends global coordination on policy, budgets, and binding authority. It lets regional schedulers make fast placement decisions inside bounded capacity. It accepts that global fairness may be reconciled over a window during disaster, but it does not accept duplicate binding or unowned reservations.
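
Inside a regional scheduler, the filter and score steps separate hard gates from soft preferences. Here is a minimal sketch of that split, assuming a simplified `Node` shape and a made-up scoring rule (a zone bonus plus free CPU); none of these names or weights come from a real scheduler.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    name: str
    free_cpu: int
    isolated_pool: bool  # True if the node belongs to an isolation-grade pool
    zone: str


def filter_nodes(nodes: list[Node], cpu_needed: int, needs_isolation: bool) -> list[Node]:
    """Hard gates: a node that fails any gate is never a candidate."""
    return [
        n for n in nodes
        if n.free_cpu >= cpu_needed and (n.isolated_pool or not needs_isolation)
    ]


def score(node: Node, preferred_zone: str) -> int:
    """Soft preferences: higher is better, and nothing here can veto a node."""
    zone_bonus = 10 if node.zone == preferred_zone else 0
    return zone_bonus + node.free_cpu


def pick(nodes: list[Node], cpu_needed: int, needs_isolation: bool,
         preferred_zone: str) -> Optional[Node]:
    candidates = filter_nodes(nodes, cpu_needed, needs_isolation)
    if not candidates:
        return None  # stays pending; the reason should be published in status
    # Node name breaks ties so that a replayed decision is deterministic.
    return max(candidates, key=lambda n: (score(n, preferred_zone), n.name))
```

Keeping the gates out of the score function matters for review: a preference with a very large weight behaves like a gate in practice, but nobody can see it in the invariant list.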

Your own design can be different. A centralized scheduler may be valid if the recovery SLO and load model support it. An optimistic shared-state design may be valid if conflict rates, retries, and fairness are tested. A more distributed design may be valid if its local decisions are bounded and repair is explicit.

Readiness Review

Before calling the design complete, check these claims:

Claim                                           Evidence
Critical recovery meets the five-minute target  capacity model, failover drill, queue and readiness metrics
Unique binding is preserved                     authoritative commit path, invariant test, replay after timeout
Tenant isolation is bounded                     quota model, preemption policy, audit of emergency capacity
Autoscaling does not amplify placement lag      pending reasons, scale-up limits, simulation
Rollback and repair are state-aware             revision metadata, owner references, finalizers, repair conditions
Operators can act safely                        scoped overrides, TTL, audit, runbooks, cleanup checks
Decisions are debuggable                        generation, policy revision, resource version, reasons, status
Coordination boundaries are justified           invariant map, budget model, conflict and skew analysis

If you cannot attach evidence to a claim, treat the claim as a design risk.

Common Failure Patterns

Deliverable

Produce a design document with:

The result should let another engineer understand not only what the scheduler does, but what the control plane promises, what it refuses to promise, and how those promises will be checked.

Connections

Resources

Key Takeaways
