Distributed Schedulers and Control Planes: Autoscaling Feedback Loops and Stability

010 35 min advanced

The core idea: Autoscaling is a feedback loop between observed load, desired capacity, scheduling, and real startup time, so the design trade-off is between reacting quickly and avoiding oscillation when signals lag or demand spikes.

Core Insight

Suppose risk-api suddenly sees traffic double in eu-central, while fraud-batch and notebooks are already competing for the same GPU and CPU pool. The autoscaler observes high request latency and high CPU usage, decides the service needs more replicas, and writes a new desired replica count. The scheduler then tries to place those replicas, but some nodes are full, quota state is changing, and new replicas take time to start and become useful.

The tempting mental model is simple: load rises, add replicas; load falls, remove replicas. Real autoscaling is slower and more coupled. Metrics are delayed, placement is constrained, startup takes time, and the act of adding work changes the scheduler queue and capacity model. If the autoscaler reacts to every noisy sample, it can create thrash: scale up, fail to place, retry, overcorrect, then scale down just as new capacity becomes ready.

A stable autoscaling design treats scaling as a control loop, not a reflex. It chooses signals carefully, smooths noisy inputs, respects cooldowns, caps rate of change, and distinguishes "more replicas would help" from "more replicas can actually be placed and become healthy." The scheduler and autoscaler are separate controllers, but they are part of one system.

The Autoscaling Loop

A basic loop has five stages:

measure -> decide -> update desired state -> schedule/actuate -> observe effect

Each stage can lag: metrics are delayed, placement is constrained, and startup takes time.

For risk-api, the autoscaler may decide replicas=12 from a metric sample taken thirty seconds ago. The scheduler may place only eight replicas because the protected GPU lane is full. Four pending replicas may still count as desired capacity but not as serving capacity. If the autoscaler ignores that gap, it can keep raising desired replicas even though the bottleneck is placement, not replica count.
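The gap between desired and serving capacity can be made explicit in the decision step. The following is a minimal sketch, not a real autoscaler API: `ReplicaStatus`, `next_desired`, and the hold policy are hypothetical names and assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    desired: int  # replica count last written by the autoscaler
    ready: int    # replicas that are healthy and serving traffic
    pending: int  # replicas created but not yet placed by the scheduler


def next_desired(status: ReplicaStatus, target: int) -> int:
    """Return the desired replica count for the next loop iteration.

    Assumed policy: while any replica is still unplaced, do not raise the
    desired count further, because the bottleneck is placement and not
    replica count. Scale-down requests pass through unchanged.
    """
    if target > status.desired and status.pending > 0:
        return status.desired  # hold: a higher target cannot help yet
    return target


# The risk-api situation from the text: 12 desired, 8 placed, 4 pending.
status = ReplicaStatus(desired=12, ready=8, pending=4)
print(next_desired(status, target=16))  # 12
```

A production controller would also look at why the replicas are pending and for how long, but even this simple check stops desired state from climbing while placement is the limit.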

The useful question is not only "what target should the autoscaler choose?" It is "which part of the loop is currently limiting progress?"

Choosing Signals

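Autoscaling signals should describe demand or saturation that additional capacity can actually relieve. Common signals include request rate, request latency, CPU utilization, queue depth and queue age, and accelerator saturation.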

Not every high metric should trigger scaling. High latency caused by a downstream database outage may not improve with more risk-api replicas. High CPU during startup can make a new replica look overloaded before it is useful. A deep queue caused by quota exhaustion may need capacity, but a deep queue caused by invalid work needs rejection or policy repair.

Signals should also match the resource being scaled. If the autoscaler is adding application replicas, request rate and per-replica work may be useful. If it is adding nodes, unschedulable pods and resource requests matter more. If it is scaling GPU workers, accelerator saturation and queue age may matter more than average CPU.
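The startup case above can be illustrated with a small calculation. This is a sketch under stated assumptions: the function name `desired_from_cpu`, the 60-second warm-up window, and the rule "total observed work divided by the per-replica target" are all hypothetical choices for illustration, not a specific autoscaler's algorithm.

```python
import math
from typing import List, Tuple


def desired_from_cpu(
    replicas: List[Tuple[float, float]],
    target_cpu_percent: float,
    warmup_seconds: float = 60.0,
) -> int:
    """Estimate how many serving replicas the observed work requires.

    `replicas` holds one (cpu_percent, age_seconds) pair per replica.
    Replicas younger than `warmup_seconds` are ignored, since their CPU
    reflects startup cost rather than request load.
    """
    warm = [cpu for cpu, age in replicas if age >= warmup_seconds]
    if not warm:
        return len(replicas)  # no trustworthy sample yet: hold the current size
    # Total work seen by warm replicas, spread at the target utilization.
    return max(1, math.ceil(sum(warm) / target_cpu_percent))


# Three warm replicas at 90% CPU and one replica that started 5 seconds ago.
samples = [(90, 600), (90, 600), (90, 600), (100, 5)]
print(desired_from_cpu(samples, target_cpu_percent=60))                    # 5
print(desired_from_cpu(samples, target_cpu_percent=60, warmup_seconds=0))  # 7
```

Counting the starting replica inflates the estimate from 5 to 7, which is the kind of overcorrection a noisy or mismatched signal produces.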

Stability Controls

Stable autoscaling usually needs damping. Without it, the controller reacts faster than the system can show the result of its previous action.

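Common stability controls include smoothing noisy inputs, cooldowns between scaling decisions, caps on the rate of change such as a maximum step size, and a delay before scaling down.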

The trade-off is direct. More damping reduces oscillation and protects the scheduler from bursts of desired-state changes. Too much damping makes the system slow to react during real demand spikes. A high-priority API may accept faster scale-up and slower scale-down. A batch worker pool may accept slower reactions to avoid wasting capacity.
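A minimal sketch of how damping might be combined in one controller is shown here. The class name `DampedScaler`, every default value, and the choice of an exponential moving average for smoothing are assumptions for illustration. The asymmetric cooldowns reflect the "faster scale-up, slower scale-down" preference described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DampedScaler:
    alpha: float = 0.3             # weight of the newest sample in the moving average
    max_step_up: int = 6           # largest allowed increase per decision
    max_step_down: int = 2         # largest allowed decrease per decision
    up_cooldown_s: float = 60.0    # minimum wait after any change before scaling up
    down_cooldown_s: float = 300.0 # minimum wait after any change before scaling down
    smoothed: Optional[float] = None
    last_change_at: Optional[float] = None

    def observe(self, sample: float) -> float:
        """Fold a raw metric sample into an exponential moving average."""
        if self.smoothed is None:
            self.smoothed = sample
        else:
            self.smoothed = self.alpha * sample + (1 - self.alpha) * self.smoothed
        return self.smoothed

    def decide(self, now: float, current: int, target: int) -> int:
        """Move from `current` toward `target`, honoring cooldown and step caps."""
        if target == current:
            return current
        cooldown = self.up_cooldown_s if target > current else self.down_cooldown_s
        if self.last_change_at is not None and now - self.last_change_at < cooldown:
            return current  # still inside the cooldown window: hold
        if target > current:
            new = min(target, current + self.max_step_up)
        else:
            new = max(target, current - self.max_step_down)
        self.last_change_at = now
        return new


scaler = DampedScaler()
print(scaler.decide(now=0, current=4, target=12))    # 10, capped at +6
print(scaler.decide(now=30, current=10, target=24))  # 10, held by cooldown
print(scaler.decide(now=400, current=10, target=5))  # 8, capped at -2
```

Raising `alpha`, the step caps, or shortening the cooldowns makes the loop faster and less stable; the right values depend on how long placement and startup actually take.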

Scheduler Coupling

Autoscalers and schedulers interact through desired state and pending work. That coupling creates feedback paths:

autoscaler raises desired replicas
scheduler tries to place replicas
pending replicas increase queue pressure
cluster autoscaler may add nodes
new nodes become ready later
application metrics improve later still
autoscaler reads new metrics and adjusts again

If each controller reads only its local signal, the combined system can behave badly. The application autoscaler may keep increasing replicas because latency remains high. The scheduler may keep reporting unschedulable work. The cluster autoscaler may add nodes, but nodes take minutes to become ready. By the time capacity appears, the application autoscaler may have overshot.

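A safer design shares enough state to explain the bottleneck, such as how many replicas are ready, how many are pending because of capacity, and whether cluster provisioning is already in progress.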

This does not mean one controller owns everything. It means each controller should distinguish demand that it can satisfy from demand blocked by another boundary.
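One way to picture that distinction is a small classifier over shared state. This is a sketch only: the `Bottleneck` enum, the `classify` function, and its inputs are hypothetical, and a real system would derive them from scheduler and node-provisioner status rather than from plain integers.

```python
from enum import Enum


class Bottleneck(Enum):
    NONE = "none"
    DEMAND = "demand"              # more replicas would help and can be requested
    PLACEMENT = "placement"        # replicas exist but cannot be scheduled
    PROVISIONING = "provisioning"  # unplaced replicas are waiting on new nodes
    READINESS = "readiness"        # replicas are placed but not yet serving


def classify(
    desired: int,
    scheduled: int,
    ready: int,
    nodes_provisioning: int,
    latency_high: bool,
) -> Bottleneck:
    """Name the stage that currently limits progress (assumed priority order)."""
    if desired > scheduled:
        # Some replicas are unplaced; check whether capacity is already on its way.
        return Bottleneck.PROVISIONING if nodes_provisioning > 0 else Bottleneck.PLACEMENT
    if scheduled > ready:
        return Bottleneck.READINESS
    if latency_high:
        return Bottleneck.DEMAND
    return Bottleneck.NONE


def may_scale_up(bottleneck: Bottleneck) -> bool:
    """The application autoscaler acts only on demand it can itself satisfy."""
    return bottleneck is Bottleneck.DEMAND


state = classify(desired=10, scheduled=6, ready=6, nodes_provisioning=2, latency_high=True)
print(state, may_scale_up(state))  # Bottleneck.PROVISIONING False
```

Each controller still owns its own decision; the shared classification only tells the application autoscaler when raising desired replicas would add pressure without adding capacity.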

Worked Example: Stabilizing risk-api

Imagine risk-api normally runs four replicas. During an incident, request rate doubles and latency rises.

An unstable loop might behave like this:

00:00  latency high, scale 4 -> 12
00:30  only 6 replicas ready, latency still high, scale 12 -> 24
01:00  many replicas pending, cluster adds nodes
02:30  nodes arrive, 24 replicas start, traffic stabilizes
03:00  utilization low, scale 24 -> 5
03:30  traffic spikes again while old replicas terminate

A more stable loop records the same pressure but adds guardrails:

00:00  latency high, scale 4 -> 10 with max step of +6
00:30  6 ready, 4 pending due to capacity, hold decision during cooldown
01:00  cluster provisioning in progress, do not keep increasing desired replicas
02:30  10 ready, latency falling, hold scale-down delay
05:00  reduce gradually if low utilization persists

The second loop is slower to declare victory, but it is easier to reason about. It separates demand, placement, provisioning, readiness, and scale-down. It also prevents desired state from racing far ahead of usable capacity.
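The effect of desired state racing ahead of usable capacity can be reproduced in a toy simulation. This sketch does not replay the timelines above exactly: the `simulate` function, the fixed need of 10 replicas, and the 3-tick startup delay are assumptions, and scale-down is not modeled. It compares a controller that looks only at ready replicas with one that also counts replicas already starting.

```python
def simulate(
    count_in_flight: bool,
    ticks: int = 8,
    needed: int = 10,
    startup_ticks: int = 3,
    start: int = 4,
) -> int:
    """Return the peak desired replica count over the run."""
    ready = start
    desired = start
    peak = desired
    in_flight = []  # remaining startup ticks for each replica still starting

    for _ in range(ticks):
        # Advance startup: replicas whose timer reaches zero become ready.
        in_flight = [t - 1 for t in in_flight]
        ready += sum(1 for t in in_flight if t == 0)
        in_flight = [t for t in in_flight if t > 0]

        # The guarded controller counts capacity that is already on its way.
        capacity_seen = ready + (len(in_flight) if count_in_flight else 0)
        shortfall = needed - capacity_seen
        if shortfall > 0:
            in_flight.extend([startup_ticks] * shortfall)
            desired += shortfall
        peak = max(peak, desired)

    return peak


print(simulate(count_in_flight=False))  # 22: re-requests the same shortfall each tick
print(simulate(count_in_flight=True))   # 10: requests the shortfall once, then waits
```

Both controllers see the same demand. The only difference is whether the decision accounts for actions that have not yet shown their effect, which is the core of the guardrails in the stable timeline.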

Operational Failure Modes

Connections

Resources

Key Takeaways
