Distributed Schedulers and Control Planes: Fairness, Priority, Preemption, and Backpressure
The core idea: A scheduler needs fairness to prevent starvation, priority to protect urgent work, preemption to recover scarce capacity, and backpressure to stop overload from turning scheduling into churn.
Core Insight
Suppose fraud-batch, risk-api, and dozens of analyst notebooks all want the same eu-central GPU pool. risk-api serves user traffic and has an SLO. fraud-batch is important but can wait. The notebooks are interactive, bursty, and owned by many tenants. If the scheduler only picks the next feasible placement, the loudest queue can starve quieter work, urgent services can wait behind batch jobs, and repeated retries can flood the control plane.
Fairness, priority, preemption, and backpressure are the scheduler's pressure valves. Fairness says no tenant or workload class should consume the whole system indefinitely. Priority says some work deserves earlier service because its delay is more expensive. Preemption says the platform may remove or move lower-priority work to make room for higher-priority work. Backpressure says the system should slow, defer, or reject incoming work before the queue becomes meaningless.
The hard part is that these mechanisms pull against each other. Priority can undermine fairness. Preemption can improve urgent liveness while wasting completed work and destabilizing running services. Backpressure protects the control plane, but it also makes users feel blocked. The design trade-off is not "fair or fast"; it is how to serve urgent work without making the rest of the platform unpredictable.
Fairness Is Not First-Come First-Served
First-come first-served looks fair until one tenant submits ten thousand jobs before everyone else. In a shared scheduler, fairness usually means each tenant, queue, project, or workload class receives a defensible share of scheduling attention and capacity over time.
Common fairness signals include:
- per-tenant queue depth
- dominant resource usage across CPU, memory, GPU, and storage
- historical share versus configured share
- workload age
- starvation time
- reserved capacity or quota
- burst allowance and debt
For fraud-batch, fairness may mean the fraud tenant gets a defined share of the GPU pool, but cannot permanently block risk-api or every other tenant. For analyst notebooks, fairness may mean interactive sessions get quick small allocations but are capped so they do not consume the whole accelerator pool.
Fairness has to be measured against multiple resources. A tenant using little CPU but all of the GPUs may look small in one metric and dominant in another. This is why schedulers often reason about the scarcest or dominant resource, not only the number of tasks placed.
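The dominant-resource idea can be sketched in a few lines. This is a minimal illustration, not a production scheduler: the capacity figures, the tenant usage numbers, and the function names `dominant_share` and `next_tenant` are all assumptions made up for the example.

```python
from typing import Dict

Resources = Dict[str, float]


def dominant_share(usage: Resources, capacity: Resources) -> float:
    """Largest fraction of any single resource that the tenant currently holds."""
    return max(
        usage.get(name, 0.0) / total
        for name, total in capacity.items()
        if total > 0
    )


def next_tenant(usages: Dict[str, Resources], capacity: Resources) -> str:
    """Pick the tenant with the smallest dominant share to serve next."""
    return min(usages, key=lambda tenant: dominant_share(usages[tenant], capacity))


# Hypothetical pool and usage figures, chosen only to show the effect.
CAPACITY = {"cpu": 400.0, "memory_gib": 1600.0, "gpu": 12.0}
USAGE = {
    "fraud": {"cpu": 20.0, "memory_gib": 80.0, "gpu": 9.0},
    "notebooks": {"cpu": 120.0, "memory_gib": 200.0, "gpu": 1.0},
}
```

With these numbers the fraud tenant holds only 5% of the CPU but 75% of the GPUs, so its dominant share is 0.75. The notebooks hold more CPU, yet their dominant share is 0.3, so `next_tenant` would serve them first.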
Priority Is A Policy Decision
Priority is the scheduler's answer to "who should wait when not everyone can run?" It should be explicit policy, not an accidental side effect of queue order.
Useful priority policies name:
- which workload classes can jump ahead
- whether priority is tenant-local or global
- how long lower-priority work may be delayed
- whether priority can trigger preemption
- who is allowed to assign high priority
- what operational evidence justifies the priority
For example, risk-api may have high priority because user-facing errors are immediately visible. fraud-batch may have medium priority because delay hurts analytics but does not break request serving. Analyst notebooks may have low priority with a small guaranteed slice for interactivity.
Priority becomes dangerous when it is too easy to set. If every team marks every job as critical, priority collapses into noise. If priority is too rigid, important background work can starve and create a future incident. A scheduler needs admission controls, quotas, and audit trails around priority assignment.
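One way to picture admission control around priority is a small gate that checks a per-tenant ceiling, enforces a scarce high-priority budget, and records every decision. The sketch below is illustrative only: the class name `PriorityAdmission`, the three priority classes, and the tenant names are assumptions, not part of any real scheduler API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical priority classes, ordered from least to most urgent.
PRIORITY_RANK = {"low": 0, "medium": 1, "high": 2}


@dataclass
class PriorityAdmission:
    allowed: Dict[str, str]      # tenant -> highest class it may request
    high_budget: Dict[str, int]  # tenant -> max concurrent high-priority workloads
    in_use: Dict[str, int] = field(default_factory=dict)
    audit_log: List[Tuple[str, str, str, str]] = field(default_factory=list)

    def admit(self, tenant: str, workload: str, requested: str) -> str:
        ceiling = self.allowed.get(tenant, "low")  # unknown tenants default to low
        used = self.in_use.get(tenant, 0)
        if PRIORITY_RANK[requested] > PRIORITY_RANK[ceiling]:
            decision = "denied: above tenant ceiling"
        elif requested == "high" and used >= self.high_budget.get(tenant, 0):
            decision = "denied: high-priority budget exhausted"
        else:
            decision = "admitted"
            if requested == "high":
                self.in_use[tenant] = used + 1
        # Every request is recorded, admitted or not, so priority use can be audited.
        self.audit_log.append((tenant, workload, requested, decision))
        return decision


policy = PriorityAdmission(
    allowed={"risk": "high", "fraud": "medium"},
    high_budget={"risk": 1},
)
```

Here a second high-priority request from the risk tenant is denied once its budget of one is in use, and the fraud tenant cannot request high priority at all. A real system would also release budget when workloads finish, which this sketch omits.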
Preemption Is A Controlled Violation
Preemption means the scheduler evicts, suspends, or moves lower-priority work so higher-priority work can run. It is a powerful tool because it gives the control plane a way out when urgent work is blocked by existing allocations.
Preemption should be treated as a controlled violation of normal stability, not as a routine scheduling shortcut. A good preemption path asks:
1. Is the pending workload important enough to preempt?
2. Which running workloads are legal victims?
3. How much useful work would be lost?
4. Will preemption actually free the right resources and topology?
5. Are disruption budgets, grace periods, and ownership rules respected?
6. How is the victim requeued, rescheduled, or declared complete?
Imagine all healthy GPUs in eu-central-a are occupied by low-priority notebooks, while risk-api needs two replicas in that zone to recover an SLO violation. Preemption may be the right response. The scheduler can select notebook victims, ask the control plane to terminate them with a grace period, and bind risk-api once the GPU capacity is actually released.
The binding and reconciliation lessons still apply. Preemption is not complete when the scheduler decides on a victim. It is complete only when ownership changes are committed, resources are released, and the high-priority binding is accepted. Otherwise the scheduler can count capacity that does not exist yet.
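The victim-selection questions can be sketched as a function that filters for legal victims, prefers the ones that lose the least work, and refuses to preempt at all unless the result would actually satisfy the pending workload. The `Running` record, its fields, and the numeric priorities are hypothetical. The sketch covers only the decision; committing ownership changes and waiting for resources to be released is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Running:
    name: str
    priority: int                 # higher number = more important (assumption)
    zone: str
    gpus: int
    minutes_of_work_at_risk: int  # work lost if evicted now
    preemptible: bool = True      # stands in for disruption budgets and ownership rules


def select_victims(
    running: List[Running],
    needed_gpus: int,
    zone: str,
    pending_priority: int,
    free_gpus: int = 0,
) -> Optional[List[Running]]:
    """Return victims that free enough GPUs in the zone, or None if preemption cannot help."""
    legal = [
        w for w in running
        if w.preemptible and w.zone == zone and w.priority < pending_priority
    ]
    # Evict the lowest priority first, and within a priority the least work lost.
    legal.sort(key=lambda w: (w.priority, w.minutes_of_work_at_risk))
    victims: List[Running] = []
    freed = free_gpus
    for w in legal:
        if freed >= needed_gpus:
            break
        victims.append(w)
        freed += w.gpus
    # Refuse to evict anything unless the pending workload would actually fit.
    return victims if freed >= needed_gpus else None


RUNNING = [
    Running("nb-a", 1, "eu-central-a", 1, 90),
    Running("nb-b", 1, "eu-central-a", 1, 5),
    Running("nb-c", 1, "eu-central-b", 1, 1),
    Running("nb-d", 1, "eu-central-a", 1, 30, preemptible=False),
    Running("fraud-w1", 5, "eu-central-a", 2, 300),
]
```

Returning `None` instead of a partial victim list is the design choice that matters here: evicting work that still does not make room for the pending workload wastes progress without restoring capacity.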
Backpressure Keeps The Queue Honest
Backpressure prevents the scheduler and control plane from accepting more work than they can reason about. Without it, overload creates misleading signals: queue age grows, retries pile up, caches lag, and users submit duplicates because nothing appears to happen.
Backpressure can appear at several points:
- admission rejects work that cannot fit policy or quota
- queues cap per-tenant pending work
- controllers slow retries with rate limits and jitter
- schedulers limit concurrent binding attempts
- APIs return explicit retry-after or blocked reasons
- autoscalers receive bounded demand instead of infinite panic
For fraud-batch, backpressure might say "accepted but pending on GPU quota" instead of letting the tenant submit thousands of identical workers. For notebooks, it might say "interactive GPU limit reached" and keep the queue short enough that users understand the delay. For risk-api, backpressure may reserve a protected lane so urgent repair work is not buried behind batch retries.
Backpressure is not failure. It is a control signal. A platform without backpressure often fails later and less clearly: the API slows down, controllers retry harder, scheduler state becomes stale, and every tenant sees unpredictable delay.
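Two of these backpressure points, a per-tenant pending cap with an explicit retry hint and jittered retry delays, can be sketched as follows. The names `BoundedQueue`, `Admission`, and `backoff_delay`, the 30-second retry hint, and the backoff parameters are assumptions for illustration.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Admission:
    accepted: bool
    reason: str
    retry_after_s: Optional[float] = None


class BoundedQueue:
    """Caps pending work per tenant and tells the caller why it was refused."""

    def __init__(self, per_tenant_limit: int) -> None:
        self.per_tenant_limit = per_tenant_limit
        self.pending = defaultdict(int)

    def submit(self, tenant: str) -> Admission:
        if self.pending[tenant] >= self.per_tenant_limit:
            # An explicit signal instead of silently growing the queue.
            return Admission(False, "tenant pending limit reached", retry_after_s=30.0)
        self.pending[tenant] += 1
        return Admission(True, "accepted, pending placement")


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap_s, base_s * 2 ** attempt))
```

The jitter spreads retries from many controllers over time, so a shared failure does not turn into synchronized retry waves against the control plane.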
Worked Example: The GPU Rush
Consider this shared pool after a regional incident:
available GPUs: 12
risk-api needed: 4 high priority replicas
fraud-batch needed: 12 medium priority workers
notebooks needed: 30 low priority sessions
A naive scheduler might fill the pool with whichever queue it sees first. A better scheduler can apply layered policy:
1. Reserve a capacity lane large enough for high-priority risk-api repair.
2. Give fraud-batch a fair medium-priority share without letting it take all GPUs.
3. Admit only a bounded number of notebook sessions and queue the rest with reasons.
4. If risk-api cannot place because notebooks already occupy the topology it needs, preempt legal notebook victims.
5. Back off repeated fraud-batch retries while quota and capacity are unavailable.
This policy does not make everyone happy immediately. It makes the system explainable. risk-api recovers because priority has teeth. fraud-batch keeps some progress because fairness prevents total starvation. Notebooks receive clear backpressure instead of silent waiting. Operators can inspect which mechanism made each decision.
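The first three steps of the layered policy can be worked through with the numbers above. The share caps used here (6 GPUs for fraud-batch, 2 for notebooks) and the numeric priorities are assumptions chosen for the example; the source does not specify them.

```python
from typing import Dict, List, Tuple


def allocate(
    total: int,
    demands: List[Tuple[str, int, int]],  # (name, priority, GPUs wanted)
    caps: Dict[str, int],                 # name -> maximum GPUs it may hold
) -> Dict[str, Dict[str, int]]:
    """Serve demands in priority order, bounded by per-workload caps and remaining capacity."""
    remaining = total
    plan: Dict[str, Dict[str, int]] = {}
    for name, _priority, wanted in sorted(demands, key=lambda d: -d[1]):
        granted = min(wanted, caps.get(name, wanted), remaining)
        remaining -= granted
        plan[name] = {"granted": granted, "queued": wanted - granted}
    return plan


DEMANDS = [
    ("risk-api", 3, 4),     # high priority
    ("fraud-batch", 2, 12), # medium priority
    ("notebooks", 1, 30),   # low priority
]
CAPS = {"fraud-batch": 6, "notebooks": 2}  # hypothetical share limits

plan = allocate(12, DEMANDS, CAPS)
```

Under these assumptions risk-api receives all 4 GPUs it needs, fraud-batch receives 6 with 6 workers queued, and notebooks receive 2 with 28 sessions queued. Preemption and retry backoff (steps 4 and 5) only come into play when capacity is already occupied, which this static sketch does not model.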
Operational Failure Modes
- Priority inflation: every workload is marked critical. The fix is admission policy, audit, and scarce high-priority budget.
- Starvation by politeness: low-priority or quiet tenants never get scheduled. The fix is aging, minimum shares, or starvation-aware queueing.
- Preemption storms: the scheduler repeatedly evicts work without creating stable capacity. The fix is victim selection, cooldowns, and proof that preemption will satisfy the pending workload.
- Backpressure too late: queues fill before admission starts rejecting or deferring. The fix is explicit limits near the point of submission.
- Fairness on the wrong resource: CPU fairness hides GPU domination. The fix is multi-resource accounting and dominant-resource reasoning.
- Invisible blocked reasons: users cannot distinguish quota, priority, topology, or capacity delay. The fix is structured pending reasons and queue metrics.
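Aging, named above as one fix for starvation, can be sketched as a bounded boost to a workload's priority based on how long it has waited. The step size, the boost cap, and the numeric priority scale are assumptions for illustration.

```python
def effective_priority(
    base: int,
    waited_s: float,
    aging_step_s: float = 300.0,  # assumption: one priority point per 5 minutes waited
    max_boost: int = 5,           # assumption: cap so aged work cannot outrank everything
) -> int:
    """Raise priority gradually with wait time, up to a fixed cap."""
    boost = min(max_boost, int(waited_s // aging_step_s))
    return base + boost
```

With these defaults, a workload with base priority 1 that has waited an hour reaches priority 6, enough to pass a freshly submitted priority-5 job but not a priority-10 service. The cap keeps aging from silently defeating the priority policy.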
Connections
- The previous lesson, 007.md, explained locality and topology. Fairness and preemption must respect those same topology constraints when deciding who can run or be displaced.
- The next lesson, 009.md, goes deeper into capacity models, quotas, and overcommitment, which provide the accounting behind fairness and backpressure.
- production-reliability-and-observability connects these scheduling policies to SLOs, overload behavior, and operational visibility.
Resources
- [DOC] Kubernetes Pod Priority and Preemption
  - Focus: Study how priority classes and preemption decisions interact with pending pods and disruption.
- [DOC] Kubernetes Resource Quotas
  - Focus: Connect quota enforcement with fairness, admission, and backpressure.
- [PAPER] Dominant Resource Fairness
  - Focus: Use the paper to understand fairness when workloads consume different mixes of CPU, memory, GPU, and other scarce resources.
- [PAPER] Large-scale cluster management at Google with Borg
  - Focus: Look for how production cluster managers combine priorities, quotas, and utilization goals.
Key Takeaways
- Fairness prevents one tenant or workload class from dominating shared scheduler capacity indefinitely.
- Priority protects urgent work, but it needs admission control and audit so every job does not become critical.
- Preemption can recover scarce capacity for urgent work, but it must be controlled because it wastes work and can cause churn.
- Backpressure is a control signal that keeps queues, retries, and scheduler state understandable under overload.