Distributed Schedulers and Control Planes: Capacity Models, Quotas, and Overcommitment

LESSON

Distributed Schedulers and Control Planes

009 35 min advanced

The core idea: A scheduler can only make fair and safe decisions when capacity is modeled explicitly, so the design trade-off is between high utilization through overcommitment and predictable behavior when real demand arrives.

Core Insight

Suppose risk-api, fraud-batch, and analyst notebooks share the same eu-central GPU and CPU pool. The notebooks reserve small interactive sessions, fraud-batch asks for many workers, and risk-api keeps a high-priority lane for SLO repair. Everyone says they need capacity, but they do not consume it in the same way. Some workloads reserve more than they use; others spike above their steady state; GPU requests are discrete and cannot be split like CPU time.

This is why capacity is not just "how many machines exist." A control plane needs a model of allocatable resources, reserved resources, actual usage, quotas, limits, burst rules, and safety margins. The scheduler places work against that model, admission checks whether a tenant is allowed to ask for more, and controllers update the model as work starts, stops, fails, or is preempted.
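One way to picture that model is a per-resource ledger that keeps each quantity in its own field. The following is a minimal sketch, not a real scheduler API: the `ResourceLedger` class, its field names, and the example numbers are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ResourceLedger:
    """Hypothetical accounting record for one resource in one pool."""
    allocatable: float          # capacity the scheduler may hand out
    reserved: float = 0.0       # held back for protected lanes
    requested: float = 0.0      # sum of requests from bound shared work
    used: float = 0.0           # observed usage reported by nodes
    safety_margin: float = 0.0  # headroom never offered to tenants

    def schedulable(self) -> float:
        """Capacity still open to new shared requests."""
        free = self.allocatable - self.reserved - self.requested - self.safety_margin
        return max(0.0, free)

    def requested_but_idle(self) -> float:
        """Capacity that is claimed by requests but not in use right now."""
        return max(0.0, self.requested - self.used)


# Illustrative numbers only: 12 GPUs, 4 reserved, 6 requested, 5 busy.
gpu = ResourceLedger(allocatable=12, reserved=4, requested=6, used=5)
print(gpu.schedulable(), gpu.requested_but_idle())
```

Keeping requested and used as separate fields is the point of the sketch: a model that stores only one of them cannot distinguish a full pool from an over-reserved one.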

The non-obvious failure is that a well-designed fairness policy can collapse if the capacity model is wrong. If quota is charged for every submitted job, queued work that holds no resources still counts against the tenant, so users are blocked even when machines are idle. If quota tracks only observed usage, tenants can over-reserve and cause failures later. If overcommitment is too aggressive, the platform looks efficient until many workloads become active at the same time. The scheduler's behavior is only as honest as its accounting.

What Capacity Means

Capacity appears in several forms:

For fraud-batch, one worker may request 1 GPU, 8 CPU, and 32 GiB memory. The GPU request is hard and exclusive. CPU may be shared and throttled. Memory is dangerous to overcommit because a process that exceeds its limit can be killed. Network bandwidth may not be represented in the same quota object even though it can become the actual bottleneck.

Schedulers need this distinction because different resources fail differently. CPU contention usually slows work down. Memory pressure can evict or kill work. GPU shortage blocks placement entirely. Storage and network pressure can make placed work technically running but operationally useless.

Quotas And Reservations

Quota is a policy boundary: how much a tenant, project, queue, or workload class is allowed to claim. Reservation is an operational claim: a specific slice of capacity is being held for a specific purpose or lane.

A quota system might say:

tenant fraud:
  gpu.requests <= 24
  cpu.requests <= 400
  memory.requests <= 2 TiB
  high-priority-gpu <= 4

That does not mean the tenant is currently using all of that capacity. It means admission and scheduling should reject or defer requests beyond the configured boundary. Quota prevents one tenant from turning fairness into a negotiation at scheduling time.
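The boundary check itself can be small. The sketch below assumes quota and current claims are plain dictionaries keyed by resource name. The `admit` function, the key names, and the choice to reject resources that have no quota entry are illustrative assumptions; 2 TiB is written as 2048 GiB.

```python
from typing import Dict, Tuple

# Quota for tenant fraud, taken from the example above (2 TiB = 2048 GiB).
FRAUD_QUOTA = {"gpu": 24, "cpu": 400, "memory_gib": 2048, "high_priority_gpu": 4}


def admit(quota: Dict[str, int], claimed: Dict[str, int],
          request: Dict[str, int]) -> Tuple[bool, str]:
    """Check a request against quota; return (admitted, reason)."""
    for name, amount in request.items():
        if name not in quota:
            # Assumption: resources without a quota entry are rejected.
            return False, f"{name}: no quota defined"
        total = claimed.get(name, 0) + amount
        if total > quota[name]:
            return False, f"{name}: {total} would exceed quota {quota[name]}"
    return True, "admitted"


# One fraud-batch worker: 1 GPU, 8 CPU, 32 GiB memory.
worker = {"gpu": 1, "cpu": 8, "memory_gib": 32}
ok, reason = admit(FRAUD_QUOTA, {"gpu": 23, "cpu": 184, "memory_gib": 736}, worker)
full, why = admit(FRAUD_QUOTA, {"gpu": 24, "cpu": 192, "memory_gib": 768}, worker)
print(ok, reason)
print(full, why)
```

Returning a reason string alongside the decision lets a rejected or deferred request explain itself instead of simply staying pending.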

Reservations are more concrete. risk-api might reserve four GPUs for regional recovery, or the platform team might reserve a cell for control-plane components. Reservations can protect urgent work, but they also lower visible utilization if they are too large or too static. A scheduler should expose reserved-but-idle capacity so operators can decide whether the protection is still worth the cost.
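Exposing reserved-but-idle capacity can be as simple as a report over reservation records. This is a sketch under assumptions: the `Reservation` record, its fields, and the sample numbers (including the platform CPU reservation) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Reservation:
    owner: str     # lane or team holding the capacity, e.g. "risk-api"
    resource: str  # e.g. "gpu"
    held: int      # units set aside
    in_use: int    # units currently backing running work


def reserved_but_idle(reservations: List[Reservation]) -> Dict[str, int]:
    """Map 'owner/resource' to the units held but not currently used."""
    report = {}
    for r in reservations:
        idle = r.held - r.in_use
        if idle > 0:
            report[f"{r.owner}/{r.resource}"] = idle
    return report


report = reserved_but_idle([
    Reservation("risk-api", "gpu", held=4, in_use=1),
    Reservation("platform", "cpu", held=16, in_use=16),
])
print(report)
```

A report like this does not decide anything by itself. It gives operators the number they need to judge whether a reservation is still sized correctly.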

The accounting path matters. A safe design updates quota or reservation state at the same ownership boundary as binding, so the component that commits a placement also commits the accounting change. If remaining quota is decremented before binding and the binding fails, the quota must be released. If binding succeeds without a quota reservation, the tenant can exceed policy. These are control-plane state transitions, not bookkeeping afterthoughts.
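The reserve, bind, release sequence can be sketched in a few lines. This is a single-process illustration, not a distributed implementation: `QuotaLedger`, `place`, and the `bind` callback are hypothetical names, and a real control plane would need the quota update and the binding to be atomic or reconciled across failures.

```python
from typing import Callable


class QuotaLedger:
    """Hypothetical in-memory quota counter for one tenant and resource."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.claimed = 0

    def reserve(self, amount: int) -> bool:
        if self.claimed + amount > self.limit:
            return False
        self.claimed += amount
        return True

    def release(self, amount: int) -> None:
        self.claimed = max(0, self.claimed - amount)


def place(ledger: QuotaLedger, amount: int, bind: Callable[[], None]) -> str:
    """Reserve quota, attempt the binding, and undo the reservation on failure."""
    if not ledger.reserve(amount):
        return "rejected: quota exceeded"
    try:
        bind()
    except RuntimeError:
        ledger.release(amount)  # binding failed, so the claim must not leak
        return "failed: binding error, quota released"
    return "bound"


def failing_bind() -> None:
    raise RuntimeError("node went away")


ledger = QuotaLedger(limit=24)
results = [place(ledger, 1, lambda: None), place(ledger, 1, failing_bind)]
print(results, ledger.claimed)
```

After both calls only the successful binding still holds quota, which is the invariant the paragraph above describes.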

Overcommitment

Overcommitment means promising more capacity than the platform can deliver if everyone uses their maximum at once. It is common because most workloads do not use all reserved resources all the time. A cluster with no overcommitment can be predictable but wasteful. A cluster with reckless overcommitment can be efficient during calm periods and unstable during bursts.

Different resources tolerate overcommitment differently:

For analyst notebooks, overcommitting CPU may be acceptable because many sessions are idle. Overcommitting GPU memory may be unacceptable because two active notebooks can fail each other. For risk-api, overcommitment might be disallowed on the recovery lane because predictable capacity matters more than utilization.

The trade-off is explicit: overcommitment increases average utilization and apparent capacity, but it shifts risk into the moment when correlated demand appears. That risk must be visible in policy, metrics, and failure handling.
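A per-resource overcommit ratio makes that trade-off concrete. The ratios below are invented for illustration (CPU promised at twice physical, memory and GPU not overcommitted), and the function names are hypothetical. `shortfall` estimates unmet demand when a given fraction of promised capacity becomes active at once.

```python
# Hypothetical policy: promised capacity = physical capacity * ratio.
OVERCOMMIT_RATIO = {"cpu": 2.0, "memory": 1.0, "gpu": 1.0}


def promisable(resource: str, physical: float) -> float:
    """Total capacity the platform is willing to promise for a resource."""
    return physical * OVERCOMMIT_RATIO[resource]


def can_promise(resource: str, physical: float, promised: float, request: float) -> bool:
    """True if the new request still fits under the overcommit ceiling."""
    return promised + request <= promisable(resource, physical)


def shortfall(physical: float, promised: float, active_fraction: float) -> float:
    """Unmet demand if `active_fraction` of promised capacity is used at once."""
    return max(0.0, promised * active_fraction - physical)


print(can_promise("cpu", physical=64, promised=120, request=8))
print(can_promise("gpu", physical=12, promised=12, request=1))
print(shortfall(physical=64, promised=128, active_fraction=0.4))
print(shortfall(physical=64, promised=128, active_fraction=0.75))
```

With a ratio of 2.0, the pool is safe while fewer than half of the promised cores are busy. The shortfall only appears when demand is correlated, which is why the ratio belongs in visible policy rather than in a hidden default.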

Worked Example: Twelve GPUs, Three Demand Shapes

Imagine a pool with twelve healthy GPUs:

pool: eu-central-gpu
physical GPUs: 12
reserved for risk-api recovery: 4
available for shared scheduling: 8

Now three demand streams arrive:

risk-api:     needs 4 GPUs, high priority, no overcommitment
fraud-batch:  wants 12 GPUs, medium priority, can make partial progress
notebooks:    wants 30 sessions, low priority, bursty and interactive

A useful capacity model can produce this result:

1. Hold 4 GPUs for risk-api recovery until the SLO risk clears.
2. Admit fraud-batch up to its tenant GPU quota, but schedule only as capacity appears.
3. Admit a bounded number of notebooks and backpressure the rest with a visible reason.
4. Overcommit CPU for notebooks, but not GPU memory.
5. Release the quota and capacity claims held by bound work when it exits, fails, or is preempted.

The scheduler is not just finding free GPUs. It is reconciling physical capacity, quota boundaries, protected lanes, resource-specific overcommitment rules, and workload priority. If the control plane records only "12 GPUs exist," it cannot explain why fraud-batch is waiting. If it records quota, reservation, and actual usage separately, the pending reason becomes clear: shared GPU capacity is exhausted while protected capacity is reserved for risk-api.
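The twelve-GPU example can be replayed as a small planning function. This is a sketch under stated assumptions: the fraud tenant's GPU quota of 24 is carried over from the quota example, the notebook session cap of 10 is invented, and notebook GPU placement is not modeled (only session admission is).

```python
PHYSICAL_GPUS = 12
RISK_API_RESERVED = 4      # protected recovery lane
FRAUD_GPU_QUOTA = 24       # assumed: reuses the tenant quota shown earlier
NOTEBOOK_SESSION_CAP = 10  # hypothetical admission bound for notebooks


def plan(fraud_wanted: int, notebooks_wanted: int) -> dict:
    """Split demand into placed, pending, and backpressured work with reasons."""
    shared = PHYSICAL_GPUS - RISK_API_RESERVED
    fraud_admitted = min(fraud_wanted, FRAUD_GPU_QUOTA)
    fraud_placed = min(fraud_admitted, shared)
    notebooks_admitted = min(notebooks_wanted, NOTEBOOK_SESSION_CAP)
    return {
        "risk-api": {"held": RISK_API_RESERVED},
        "fraud-batch": {
            "admitted": fraud_admitted,
            "placed": fraud_placed,
            "pending": fraud_admitted - fraud_placed,
            "reason": "shared GPU capacity exhausted; "
                      f"{RISK_API_RESERVED} GPUs reserved for risk-api",
        },
        "notebooks": {
            "admitted": notebooks_admitted,
            "backpressured": notebooks_wanted - notebooks_admitted,
            "reason": "session admission cap reached",
        },
    }


result = plan(fraud_wanted=12, notebooks_wanted=30)
for name, state in result.items():
    print(name, state)
```

Because reserved, shared, and quota-bounded quantities are tracked separately, the pending reason for fraud-batch falls out of the arithmetic rather than having to be guessed after the fact.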

Operational Failure Modes

Connections

Resources

Key Takeaways
