Distributed Schedulers and Control Planes: Capacity Models, Quotas, and Overcommitment
The core idea: A scheduler can only make fair and safe decisions when capacity is modeled explicitly, so the design trade-off is between high utilization through overcommitment and predictable behavior when real demand arrives.
Core Insight
Suppose risk-api, fraud-batch, and analyst notebooks share the same eu-central GPU and CPU pool. The notebooks reserve small interactive sessions, fraud-batch asks for many workers, and risk-api keeps a high-priority lane for SLO repair. Everyone says they need capacity, but they do not consume it in the same way. Some workloads reserve more than they use; others spike above their steady state; GPU requests are discrete and cannot be split like CPU time.
This is why capacity is not just "how many machines exist." A control plane needs a model of allocatable resources, reserved resources, actual usage, quotas, limits, burst rules, and safety margins. The scheduler places work against that model, admission checks whether a tenant is allowed to ask for more, and controllers update the model as work starts, stops, fails, or is preempted.
The non-obvious failure is that a beautiful fairness policy can collapse if the capacity model is wrong. If quota tracks only submitted jobs, users are blocked even when machines are idle. If quota tracks only observed usage, tenants can over-reserve and cause failures later. If overcommitment is too aggressive, the platform looks efficient until many workloads become active at the same time. The scheduler's behavior is only as honest as its accounting.
What Capacity Means
Capacity appears in several forms:
- Physical capacity: what the fleet contains, such as CPU cores, memory, GPUs, disks, and network.
- Allocatable capacity: what remains after system daemons, overhead, maintenance buffers, and reserved lanes.
- Requested capacity: what workloads ask the scheduler to reserve.
- Limited capacity: the maximum a workload may consume at runtime.
- Observed usage: what workloads are actually consuming now.
- Committed capacity: what the control plane has promised through bindings, reservations, or quota.
- Recoverable capacity: what can return after preemption, drain, failure recovery, or cleanup.
For fraud-batch, one worker may request 1 GPU, 8 CPU cores, and 32 GiB of memory. The GPU request is hard and exclusive. CPU may be shared and throttled. Memory is dangerous to overcommit because a process that exceeds its memory can be killed. Network bandwidth may not be represented in the same quota object, even when it turns out to be the actual bottleneck.
Schedulers need this distinction because different resources fail differently. CPU contention usually slows work down. Memory pressure can evict or kill work. GPU shortage blocks placement entirely. Storage and network pressure can make placed work technically running but operationally useless.
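One way to keep these forms of capacity apart is to store them as separate fields and derive the rest. The sketch below is a minimal illustration, not any particular scheduler's data model; the class name `ResourceAccount`, its fields, and the CPU numbers are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class ResourceAccount:
    """Hypothetical accounting record for one resource in one pool."""
    physical: float         # what the fleet contains
    system_overhead: float  # daemons, overhead, maintenance buffers
    reserved_lanes: float   # capacity held for protected lanes
    requested: float        # sum of requests from placed workloads
    observed: float         # what workloads are consuming right now

    @property
    def allocatable(self) -> float:
        # What remains after overhead and reserved lanes are subtracted.
        return self.physical - self.system_overhead - self.reserved_lanes

    @property
    def unrequested(self) -> float:
        # Allocatable capacity that no workload has asked to reserve.
        return self.allocatable - self.requested

    @property
    def requested_but_idle(self) -> float:
        # Capacity that is promised to workloads but not being consumed.
        return max(self.requested - self.observed, 0.0)


# Assumed CPU numbers for illustration only.
cpu = ResourceAccount(physical=512, system_overhead=32, reserved_lanes=80,
                      requested=400, observed=150)
print(cpu.allocatable, cpu.unrequested, cpu.requested_but_idle)
```

With these assumed numbers the pool has no unrequested CPU left, yet 250 requested cores are idle. A model that stored only one of those figures would report the pool as either full or mostly empty, and both answers would mislead.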
Quotas And Reservations
Quota is a policy boundary: how much a tenant, project, queue, or workload class is allowed to claim. Reservation is an operational claim: a specific slice of capacity is being held for a specific purpose or lane.
A quota system might say:
tenant fraud:
    gpu.requests <= 24
    cpu.requests <= 400
    memory.requests <= 2 TiB
    high-priority-gpu <= 4
That does not mean the tenant is currently using all of that capacity. It means admission and scheduling should reject or defer requests beyond the configured boundary. Quota prevents one tenant from turning fairness into a negotiation at scheduling time.
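An admission check against this kind of quota can be sketched as a comparison of committed requests plus the new request against each limit. The function names `admit` and `worker_request`, the key `memory.requests_gib` (2 TiB expressed as 2048 GiB), and the committed figures are assumptions for illustration; the per-worker shape follows the fraud-batch example above.

```python
# Quota for tenant "fraud", using the limits from the text.
FRAUD_QUOTA = {
    "gpu.requests": 24,
    "cpu.requests": 400,
    "memory.requests_gib": 2048,  # 2 TiB
    "high-priority-gpu": 4,
}


def worker_request(workers: int) -> dict:
    """Request for N fraud-batch workers: 1 GPU, 8 CPU, 32 GiB each."""
    return {
        "gpu.requests": workers,
        "cpu.requests": 8 * workers,
        "memory.requests_gib": 32 * workers,
    }


def admit(quota: dict, committed: dict, request: dict):
    """Return (admitted, reasons). Admission looks at requests, not usage."""
    reasons = []
    for key, amount in request.items():
        limit = quota.get(key)
        if limit is None:
            reasons.append(f"{key}: no quota configured")
            continue
        already = committed.get(key, 0)
        if already + amount > limit:
            reasons.append(f"{key}: {already} committed + {amount} requested > {limit}")
    return (not reasons, reasons)


# Assumed current commitments: 20 workers already admitted.
committed = worker_request(20)
ok_four, _ = admit(FRAUD_QUOTA, committed, worker_request(4))
ok_five, why_five = admit(FRAUD_QUOTA, committed, worker_request(5))
print(ok_four, ok_five, why_five)
```

Four more workers fit exactly at the GPU boundary, while five are rejected with a reason that names the exhausted dimension. Returning the reason rather than a bare boolean is what lets the platform tell a user why a request was deferred.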
Reservations are more concrete. risk-api might reserve four GPUs for regional recovery, or the platform team might reserve a cell for control-plane components. Reservations can protect urgent work, but they also lower visible utilization if they are too large or too static. A scheduler should expose reserved-but-idle capacity so operators can decide whether the protection is still worth the cost.
The accounting path matters. A safe design updates quota or reservation state at the same ownership boundary as binding. If quota is charged before binding and the binding fails, the charge must be released. If binding succeeds without a quota charge, the tenant can exceed policy. These are control-plane state transitions, not bookkeeping afterthoughts.
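The reserve-then-bind transition can be sketched as follows. This is an in-memory illustration under assumed names (`QuotaLedger`, `BindError`, `schedule`); a real control plane would persist the ledger and make the charge and the binding atomic or at least reconcilable.

```python
class BindError(Exception):
    """Raised by the (hypothetical) bind step when placement fails."""


class QuotaLedger:
    """Tracks committed quota for one tenant and one resource."""

    def __init__(self, limit: int):
        self.limit = limit
        self.committed = 0

    def reserve(self, amount: int) -> bool:
        if self.committed + amount > self.limit:
            return False
        self.committed += amount
        return True

    def release(self, amount: int) -> None:
        self.committed = max(self.committed - amount, 0)


def schedule(ledger: QuotaLedger, amount: int, bind) -> str:
    """Charge quota first, then bind; undo the charge if binding fails."""
    if not ledger.reserve(amount):
        return "rejected: quota exceeded"
    try:
        bind()
    except BindError:
        ledger.release(amount)  # without this, the charge leaks
        return "retry: bind failed, quota released"
    return "bound"


def failing_bind():
    raise BindError("node went away")


ledger = QuotaLedger(limit=24)
first = schedule(ledger, 4, bind=lambda: None)   # succeeds, 4 committed
second = schedule(ledger, 4, bind=failing_bind)  # fails, charge rolled back
third = schedule(ledger, 30, bind=lambda: None)  # exceeds the limit
print(first, second, third, ledger.committed)
```

After the three calls only the successful binding remains charged. The failed bind leaves no residue, and the oversized request never reaches the bind step at all.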
Overcommitment
Overcommitment means promising more capacity than the platform can deliver if everyone uses their maximum at once. It is common because most workloads do not use all reserved resources all the time. A cluster with no overcommitment can be predictable but wasteful. A cluster with reckless overcommitment can be efficient during calm periods and unstable during bursts.
Different resources tolerate overcommitment differently:
- CPU can often be overcommitted because work can be throttled or time-sliced.
- Memory is risky because pressure can cause evictions, swapping, or process death.
- GPUs are usually hard to overcommit unless the hardware and runtime explicitly support partitioning.
- Network and storage can be hidden bottlenecks because demand is bursty and harder to reserve.
For analyst notebooks, overcommitting CPU may be acceptable because many sessions are idle. Overcommitting GPU memory may be unacceptable because two active notebooks can fail each other. For risk-api, overcommitment might be disallowed on the recovery lane because predictable capacity matters more than utilization.
The trade-off is explicit: overcommitment increases average utilization and apparent capacity, but it shifts risk into the moment when correlated demand appears. That risk must be visible in policy, metrics, and failure handling.
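Resource-specific overcommitment can be expressed as a ratio per resource, with the cost of that choice computed alongside it. The ratios and capacities below are assumed values chosen for illustration, not recommendations, and the function names are hypothetical.

```python
# Assumed policy: CPU may be promised at 2x, memory and GPU are not overcommitted.
OVERCOMMIT_RATIO = {"cpu": 2.0, "memory_gib": 1.0, "gpu": 1.0}


def schedulable_capacity(allocatable: dict, ratios: dict) -> dict:
    """Capacity the scheduler may promise, per resource."""
    return {name: amount * ratios.get(name, 1.0)
            for name, amount in allocatable.items()}


def correlated_shortfall(allocatable: dict, ratios: dict) -> dict:
    """Demand that cannot be served if every promise is used at once."""
    promised = schedulable_capacity(allocatable, ratios)
    return {name: promised[name] - allocatable[name] for name in allocatable}


allocatable = {"cpu": 400, "memory_gib": 2048, "gpu": 8}
promised = schedulable_capacity(allocatable, OVERCOMMIT_RATIO)
shortfall = correlated_shortfall(allocatable, OVERCOMMIT_RATIO)
print(promised)
print(shortfall)
```

Reporting the shortfall next to the promised capacity keeps the risk visible: under this assumed policy the platform can promise 800 CPU cores, and 400 of them have no backing if all workloads become active together.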
Worked Example: Twelve GPUs, Three Demand Shapes
Imagine a pool with twelve healthy GPUs:
pool: eu-central-gpu
physical GPUs: 12
reserved for risk-api recovery: 4
available for shared scheduling: 8
Now three demand streams arrive:
risk-api: needs 4 GPUs, high priority, no overcommitment
fraud-batch: wants 12 GPUs, medium priority, can make partial progress
notebooks: wants 30 sessions, low priority, bursty and interactive
A useful capacity model can produce this result:
1. Hold 4 GPUs for risk-api recovery until the SLO risk clears.
2. Admit fraud-batch up to its tenant GPU quota, but schedule only as capacity appears.
3. Admit a bounded number of notebooks and backpressure the rest with a visible reason.
4. Overcommit CPU for notebooks, but not GPU memory.
5. Release reservations when bound work exits, fails, or is preempted.
The scheduler is not just finding free GPUs. It is reconciling physical capacity, quota boundaries, protected lanes, resource-specific overcommitment rules, and workload priority. If the control plane records only "12 GPUs exist," it cannot explain why fraud-batch is waiting. If it records quota, reservation, and actual usage separately, the pending reason becomes clear: shared GPU capacity is exhausted while protected capacity is reserved for risk-api.
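The GPU side of this example can be sketched as a small placement function that keeps the protected lane out of the shared pool and records a pending reason for anything it cannot place. The `Demand` type and `place_shared` function are hypothetical, the fraud-batch quota of 24 comes from the quota example earlier, and the notebook GPU demand and quota of 2 are assumptions, since the text gives notebook demand in sessions rather than GPUs.

```python
from dataclasses import dataclass


@dataclass
class Demand:
    name: str
    priority: int      # higher is placed first
    gpus_wanted: int
    gpu_quota: int


def place_shared(physical: int, reserved: int, reserved_for: str, demands):
    """Place GPU demand on the shared pool; return (placed, pending reasons)."""
    free = physical - reserved  # the protected lane is never handed out here
    placed, pending = {}, {}
    for d in sorted(demands, key=lambda d: d.priority, reverse=True):
        admitted = min(d.gpus_wanted, d.gpu_quota)
        granted = min(admitted, free)
        free -= granted
        placed[d.name] = granted
        if admitted < d.gpus_wanted:
            pending[d.name] = "request exceeds tenant GPU quota"
        elif granted < admitted:
            pending[d.name] = (
                f"shared GPU capacity exhausted; "
                f"{reserved} GPUs reserved for {reserved_for}"
            )
    return placed, pending


demands = [
    Demand("fraud-batch", priority=2, gpus_wanted=12, gpu_quota=24),
    Demand("notebooks", priority=1, gpus_wanted=2, gpu_quota=2),  # assumed
]
placed, pending = place_shared(12, 4, "risk-api", demands)
print(placed)
print(pending)
```

Under these assumptions fraud-batch makes partial progress on the 8 shared GPUs, notebooks wait, and both pending reasons point at the reservation rather than at a vague "no capacity" message.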
Operational Failure Modes
- Requested equals used: the platform treats all reservations as actual usage and appears full while machines are idle. The fix is to track requested, limited, and observed usage separately.
- Used equals safe: quota is based only on current usage, so tenants can reserve more than the platform can safely run later. The fix is admission against requests and committed capacity.
- Leaked reservation: a failed bind or deleted workload leaves capacity held forever. The fix is owner references, finalizers, timeouts, and reconciliation.
- Overcommitment without class rules: CPU, memory, GPU, network, and storage are overcommitted as if they fail the same way. The fix is resource-specific policy.
- Hidden protected capacity: reserved lanes look like missing capacity to users. The fix is observable reservation state and pending reasons.
- Quota lag: quota updates arrive after scheduling decisions, creating repeated bind failures or policy bypass. The fix is authoritative quota reservation on the scheduling path.
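The leaked-reservation fix can be illustrated with a reconciliation pass that releases reservations whose owner no longer exists or that stayed unbound past a timeout. The `Reservation` fields, the `reconcile` function, and the 300-second timeout are assumptions for this sketch; a real controller would read owners from its API and run the pass periodically.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    owner: str         # workload that holds the capacity
    gpus: int
    bound: bool        # True once the workload is actually placed
    created_at: float  # seconds, same clock as `now`


def reconcile(reservations, live_owners, now, bind_timeout_s=300.0):
    """Split reservations into (kept, released)."""
    kept, released = [], []
    for r in reservations:
        orphaned = r.owner not in live_owners
        stale = (not r.bound) and (now - r.created_at > bind_timeout_s)
        if orphaned or stale:
            released.append(r)
        else:
            kept.append(r)
    return kept, released


reservations = [
    Reservation("fraud-batch-7", gpus=1, bound=True, created_at=0.0),
    Reservation("fraud-batch-8", gpus=1, bound=False, created_at=0.0),   # never bound
    Reservation("notebook-deleted", gpus=1, bound=True, created_at=0.0), # owner gone
]
kept, released = reconcile(
    reservations,
    live_owners={"fraud-batch-7", "fraud-batch-8"},
    now=1000.0,
)
print([r.owner for r in kept], [r.owner for r in released])
```

The pass returns the released reservations instead of silently dropping them, so the caller can credit quota back and emit an event for each leak it cleaned up.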
Connections
- The previous lesson, 008.md, introduced fairness, priority, preemption, and backpressure. Capacity models and quotas provide the accounting those policies need.
- The next lesson, 010.md, uses capacity signals as inputs to autoscaling feedback loops, where bad accounting can create oscillation.
- production-reliability-and-observability connects capacity accounting to overload signals, SLO risk, and operational dashboards.
Resources
- [DOC] Kubernetes Resource Management for Pods and Containers
  - Focus: Study requests, limits, allocatable resources, and how different resource types behave under pressure.
- [DOC] Kubernetes Resource Quotas
  - Focus: Connect quota enforcement with admission, tenant boundaries, and scheduler-visible capacity.
- [DOC] Kubernetes Limit Ranges
  - Focus: Look at default and bounded request/limit policy as a control surface for overcommitment.
- [PAPER] Large-scale cluster management at Google with Borg
  - Focus: Look for how reservations, priorities, and utilization goals interact in a production cluster manager.
Key Takeaways
- Capacity models must separate physical, allocatable, requested, limited, observed, committed, and recoverable capacity.
- Quotas define who may claim capacity; reservations hold capacity for a purpose or protection lane.
- Overcommitment improves utilization but creates risk when many workloads become active together.
- The central trade-off is higher average utilization versus predictable behavior during bursts, failures, and recovery.