Distributed Schedulers and Control Planes: Autoscaling Feedback Loops and Stability
The core idea: Autoscaling is a feedback loop between observed load, desired capacity, scheduling, and real startup time, so the design trade-off is between reacting quickly and avoiding oscillation when signals lag or demand spikes.
Core Insight
Suppose risk-api suddenly sees traffic double in eu-central, while fraud-batch and notebooks are already competing for the same GPU and CPU pool. The autoscaler observes high request latency and high CPU usage, decides the service needs more replicas, and writes a new desired replica count. The scheduler then tries to place those replicas, but some nodes are full, quota state is changing, and new replicas take time to start and become useful.
The tempting mental model is simple: load rises, add replicas; load falls, remove replicas. Real autoscaling is slower and more coupled. Metrics are delayed, placement is constrained, startup takes time, and the act of adding work changes the scheduler queue and capacity model. If the autoscaler reacts to every noisy sample, it can create thrash: scale up, fail to place, retry, overcorrect, then scale down just as new capacity becomes ready.
A stable autoscaling design treats scaling as a control loop, not a reflex. It chooses signals carefully, smooths noisy inputs, respects cooldowns, caps rate of change, and distinguishes "more replicas would help" from "more replicas can actually be placed and become healthy." The scheduler and autoscaler are separate controllers, but they are part of one system.
The Autoscaling Loop
A basic loop has five stages:
measure -> decide -> update desired state -> schedule/actuate -> observe effect
Each stage can lag:
- Measure: metrics arrive after scraping, aggregation, or queueing delay.
- Decide: the autoscaler computes a target from current and historical signals.
- Update desired state: the control plane records a new replica count or capacity request.
- Schedule/actuate: the scheduler binds work and the runtime starts it.
- Observe effect: the new capacity becomes healthy, receives traffic, and changes metrics.
For risk-api, the autoscaler may decide replicas=12 from a metric sample taken thirty seconds ago. The scheduler may place only eight replicas because the protected GPU lane is full. Four pending replicas may still count as desired capacity but not as serving capacity. If the autoscaler ignores that gap, it can keep raising desired replicas even though the bottleneck is placement, not replica count.
The useful question is not only "what target should the autoscaler choose?" It is "which part of the loop is currently limiting progress?"
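That question can be made concrete with a small diagnostic. The following is a minimal sketch, not a definitive implementation: `LoopSnapshot`, `limiting_stage`, the field names, and the 60-second staleness threshold are all hypothetical, chosen only to mirror the risk-api numbers in the text.

```python
from dataclasses import dataclass


@dataclass
class LoopSnapshot:
    """Hypothetical view of one service at one instant."""
    desired: int         # replicas the autoscaler has asked for
    ready: int           # replicas that pass readiness and serve traffic
    pending: int         # replicas still waiting for placement
    starting: int        # replicas placed but not yet ready
    metric_age_s: float  # age of the newest metric sample, in seconds


def limiting_stage(s: LoopSnapshot, max_metric_age_s: float = 60.0) -> str:
    """Name the loop stage that most plausibly limits progress right now."""
    if s.metric_age_s > max_metric_age_s:
        return "measure"   # decisions would rest on stale data
    if s.pending > 0:
        return "schedule"  # placement, not replica count, is the bottleneck
    if s.starting > 0:
        return "startup"   # capacity is placed but not serving yet
    if s.ready >= s.desired:
        return "decide"    # the loop has converged; only a new target changes anything
    return "actuate"       # desired state is not yet reflected in any replica


# The risk-api case from the text: 12 desired, 8 placed and ready, 4 pending.
snap = LoopSnapshot(desired=12, ready=8, pending=4, starting=0, metric_age_s=30.0)
print(limiting_stage(snap))  # schedule
```

The ordering of the checks is a design choice: stale metrics are reported first because every later judgment depends on them, and placement is checked before startup because a pending replica cannot start at all.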
Choosing Signals
Autoscaling signals should describe demand or saturation that additional capacity can actually relieve. Common signals include:
- CPU or memory utilization
- request rate
- request latency
- queue depth or oldest item age
- work-in-progress per replica
- GPU utilization or accelerator queue length
- error rate caused by saturation
- pending work that is blocked by capacity
Not every high metric should trigger scaling. High latency caused by a downstream database outage may not improve with more risk-api replicas. High CPU during startup can make a new replica look overloaded before it is useful. A deep queue caused by quota exhaustion may need capacity, but a deep queue caused by invalid work needs rejection or policy repair.
Signals should also match the resource being scaled. If the autoscaler is adding application replicas, request rate and per-replica work may be useful. If it is adding nodes, unschedulable pods and resource requests matter more. If it is scaling GPU workers, accelerator saturation and queue age may matter more than average CPU.
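One way to encode "scale only when more capacity would help" is a guard that checks a saturation signal alongside latency. This is a hedged sketch under stated assumptions: the `Signals` fields, the function name, and every threshold are hypothetical, and a real system would choose signals that fit its own bottlenecks.

```python
from typing import NamedTuple, Tuple


class Signals(NamedTuple):
    """Hypothetical signal bundle for one service."""
    latency_ms: float             # observed request latency
    cpu_utilization: float        # mean utilization of ready replicas, 0..1
    downstream_error_rate: float  # fraction of dependency calls failing, 0..1


def scale_up_is_plausible(
    s: Signals,
    latency_target_ms: float = 200.0,      # assumed latency target
    saturation_threshold: float = 0.75,    # assumed "replicas are busy" level
    dependency_error_limit: float = 0.05,  # assumed "dependency is unhealthy" level
) -> Tuple[bool, str]:
    """Return (decision, reason) for whether adding replicas is likely to help."""
    if s.latency_ms <= latency_target_ms:
        return False, "latency within target"
    if s.downstream_error_rate > dependency_error_limit:
        return False, "latency likely caused by a failing dependency"
    if s.cpu_utilization < saturation_threshold:
        return False, "replicas are not saturated; look for another bottleneck"
    return True, "replicas are saturated and latency is over target"


print(scale_up_is_plausible(Signals(350.0, 0.90, 0.00)))
print(scale_up_is_plausible(Signals(350.0, 0.90, 0.20)))
```

Returning a reason string with the decision keeps the "why not" visible to operators, which matters when latency is high but the autoscaler correctly declines to act.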
Stability Controls
Stable autoscaling usually needs damping. Without it, the controller reacts faster than the system can show the result of its previous action.
Common stability controls include:
- Smoothing: average noisy signals over a useful window.
- Cooldowns: wait after scaling before taking another action.
- Rate limits: cap how much desired capacity can change at once.
- Hysteresis: require a stronger signal to reverse direction than to continue.
- Readiness gates: count only replicas that are actually serving.
- Scale-down delays: avoid removing capacity immediately after a short dip.
- Min and max bounds: keep the loop inside operationally safe limits.
- Blocked reasons: stop treating unschedulable work as proof that demand is infinite.
The trade-off is direct. More damping reduces oscillation and protects the scheduler from bursts of desired-state changes. Too much damping makes the system slow to react during real demand spikes. A high-priority API may accept faster scale-up and slower scale-down. A batch worker pool may accept slower reactions to avoid wasting capacity.
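Several of these controls fit in one small controller. The sketch below is illustrative only: `DampedAutoscaler` and all its parameters are hypothetical names, and the proportional target (ready replicas times observed utilization over target utilization, rounded up) is an assumption similar in spirit to the rule described in the Kubernetes Horizontal Pod Autoscaling documentation, not a copy of any real implementation. It covers smoothing, cooldown, rate limiting, a readiness gate, a scale-down delay, and bounds; it does not implement separate hysteresis thresholds.

```python
import math
from collections import deque


class DampedAutoscaler:
    """Hypothetical replica autoscaler with basic damping controls."""

    def __init__(self, target_util=0.6, window=3, cooldown_s=60, max_step=6,
                 min_replicas=2, max_replicas=20, scale_down_delay_s=300):
        self.target_util = target_util
        self.samples = deque(maxlen=window)  # smoothing window
        self.cooldown_s = cooldown_s
        self.max_step = max_step
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.scale_down_delay_s = scale_down_delay_s
        self.last_change_s = -math.inf       # time of the last scaling action
        self.low_since_s = None              # when the target first dropped below desired

    def decide(self, now_s, utilization, desired, ready):
        """Return the new desired replica count for this tick."""
        # Smoothing: average the most recent samples.
        self.samples.append(utilization)
        smoothed = sum(self.samples) / len(self.samples)

        # Readiness gate: size the target from replicas that actually serve.
        raw = math.ceil(max(ready, 1) * smoothed / self.target_util)

        # Track how long the target has been below the current desired count.
        if raw < desired:
            if self.low_since_s is None:
                self.low_since_s = now_s
        else:
            self.low_since_s = None

        # Cooldown: let the previous action take effect first.
        if now_s - self.last_change_s < self.cooldown_s:
            return desired

        # Scale-down delay: ignore short dips.
        if raw < desired and now_s - self.low_since_s < self.scale_down_delay_s:
            return desired

        # Rate limit, then min and max bounds.
        step = max(-self.max_step, min(self.max_step, raw - desired))
        new_desired = max(self.min_replicas, min(self.max_replicas, desired + step))
        if new_desired != desired:
            self.last_change_s = now_s
        return new_desired


scaler = DampedAutoscaler(target_util=0.5, window=1, max_step=3)
print(scaler.decide(now_s=0, utilization=0.9, desired=4, ready=4))   # 7
print(scaler.decide(now_s=30, utilization=0.9, desired=7, ready=4))  # 7 (cooldown)
```

In the usage lines, the unbounded target would be 8 replicas, but the step cap of 3 limits the first move to 7, and the second call holds because the cooldown has not elapsed. A high-priority API might run this with a short cooldown and a long scale-down delay; a batch pool might lengthen both.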
Scheduler Coupling
Autoscalers and schedulers interact through desired state and pending work. That coupling creates feedback paths:
autoscaler raises desired replicas
scheduler tries to place replicas
pending replicas increase queue pressure
cluster autoscaler may add nodes
new nodes become ready later
application metrics improve later still
autoscaler reads new metrics and adjusts again
If each controller reads only its local signal, the combined system can behave badly. The application autoscaler may keep increasing replicas because latency remains high. The scheduler may keep reporting unschedulable work. The cluster autoscaler may add nodes, but nodes take minutes to become ready. By the time capacity appears, the application autoscaler may have overshot.
A safer design shares enough state to explain the bottleneck:
- desired replicas
- ready replicas
- pending replicas
- pending reasons
- quota blocks
- node provisioning state
- recent scale actions
- startup and readiness delay
This does not mean one controller owns everything. It means each controller should distinguish demand that it can satisfy from demand blocked by another boundary.
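A sketch of what such shared state could look like, and how an application autoscaler might consult it before raising desired replicas. Everything here is an assumption made for illustration: `PendingReason`, `ScalingState`, and `may_raise_desired` are hypothetical names, and a real control plane would expose pending reasons in its own vocabulary.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Tuple


class PendingReason(Enum):
    """Hypothetical reasons a replica is waiting for placement."""
    INSUFFICIENT_CAPACITY = "insufficient_capacity"
    QUOTA_EXCEEDED = "quota_exceeded"
    NODE_PROVISIONING = "node_provisioning"


@dataclass
class ScalingState:
    """Hypothetical state shared between autoscaler and scheduler."""
    desired: int
    ready: int
    pending_reasons: Dict[PendingReason, int] = field(default_factory=dict)

    @property
    def pending(self) -> int:
        return sum(self.pending_reasons.values())


def may_raise_desired(state: ScalingState) -> Tuple[bool, str]:
    """Decide whether raising desired replicas could add serving capacity."""
    if state.pending == 0:
        return True, "nothing is pending; new replicas can plausibly be placed"
    if state.pending_reasons.get(PendingReason.QUOTA_EXCEEDED, 0) > 0:
        return False, "blocked by quota; more desired replicas cannot be placed"
    if state.pending_reasons.get(PendingReason.NODE_PROVISIONING, 0) > 0:
        return False, "nodes are provisioning; wait for capacity to arrive"
    return False, "blocked by capacity; the bottleneck is placement"


state = ScalingState(desired=12, ready=8,
                     pending_reasons={PendingReason.INSUFFICIENT_CAPACITY: 4})
print(may_raise_desired(state))
```

The autoscaler still owns the replica target; it only reads the scheduler's explanation so that blocked placement is not mistaken for unmet demand.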
Worked Example: Stabilizing risk-api
Imagine risk-api normally runs four replicas. During an incident, request rate doubles and latency rises.
An unstable loop might behave like this:
00:00 latency high, scale 4 -> 12
00:30 only 6 replicas ready, latency still high, scale 12 -> 24
01:00 many replicas pending, cluster adds nodes
02:30 nodes arrive, 24 replicas start, traffic stabilizes
03:00 utilization low, scale 24 -> 5
03:30 traffic spikes again while old replicas terminate
A more stable loop records the same pressure but adds guardrails:
00:00 latency high, scale 4 -> 10 with max step of +6
00:30 6 ready, 4 pending due to capacity, hold decision during cooldown
01:00 cluster provisioning in progress, do not keep increasing desired replicas
02:30 10 ready, latency falling, hold scale-down delay
05:00 reduce gradually if low utilization persists
The second loop is slower to declare victory, but it is easier to reason about. It separates demand, placement, provisioning, readiness, and scale-down. It also prevents desired state from racing far ahead of usable capacity.
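The first three rows of the stable timeline can be replayed with a small guard function. This is a sketch under assumptions: `guarded_step` is a hypothetical helper, the "wanted" values of 12 and 24 are taken from the unstable timeline, and the scale-down delay from the later rows is left out to keep the example short.

```python
def guarded_step(current_desired: int, wanted: int, pending: int,
                 in_cooldown: bool, max_step: int = 6) -> int:
    """Apply the hold and max-step guardrails to a raw scaling target."""
    # Hold while an earlier action is still settling or placement is blocked.
    if in_cooldown or pending > 0:
        return current_desired
    # Otherwise move toward the raw target, at most max_step at a time.
    step = max(-max_step, min(max_step, wanted - current_desired))
    return current_desired + step


# 00:00 latency high; the raw target is 12, but the step is capped at +6.
d0 = guarded_step(current_desired=4, wanted=12, pending=0, in_cooldown=False)
# 00:30 the raw target is 24, but 4 replicas are pending and cooldown is active.
d1 = guarded_step(current_desired=d0, wanted=24, pending=4, in_cooldown=True)
# 01:00 cooldown is over, but replicas are still pending while nodes provision.
d2 = guarded_step(current_desired=d1, wanted=24, pending=4, in_cooldown=False)
print(d0, d1, d2)  # 10 10 10
```

Desired state stays at 10 while placement catches up, instead of climbing to 24 as in the unstable loop.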
Operational Failure Modes
- Scaling on the wrong signal: adding replicas for latency caused by a downstream dependency. The fix is to pair saturation signals with bottleneck diagnosis.
- Counting pending as serving: desired replicas rise, but ready capacity does not. The fix is to track desired, pending, bound, starting, and ready separately.
- No cooldown: the autoscaler reacts before earlier changes take effect. The fix is cooldown, rate limits, and readiness-aware decisions.
- Scale-down too fast: capacity disappears after a temporary dip, causing another spike. The fix is stabilization windows and gradual reduction.
- Treating unschedulable work as infinite demand: placement failures trigger more desired replicas. The fix is pending reasons, quota awareness, and scheduler feedback.
- Competing autoscalers: application, queue, and cluster autoscalers each amplify the others. The fix is clear ownership, bounded actions, and shared observability.
Connections
- The previous lesson, 009.md, explained capacity, quotas, and overcommitment. Autoscaling depends on those signals to avoid scaling against imaginary capacity.
- The next lesson, 011.md, applies the same stability thinking to rollouts, reconfiguration, and safe change.
- control-theory-and-feedback-systems gives deeper vocabulary for feedback, delay, damping, and oscillation.
Resources
- [DOC] Kubernetes Horizontal Pod Autoscaling
  - Focus: Study how measured utilization, desired replicas, readiness, and stabilization interact.
- [DOC] Kubernetes Cluster Autoscaler FAQ
  - Focus: Connect unschedulable work, node provisioning delay, and scheduler-visible capacity.
- [DOC] KEDA Concepts
  - Focus: Look at event-driven scaling signals, triggers, cooldowns, and activation thresholds.
- [BOOK] Site Reliability Engineering: Handling Overload
  - Focus: Use overload handling to reason about when scaling helps and when backpressure is the safer control.
Key Takeaways
- Autoscaling is a feedback loop with delayed measurement, scheduling, startup, and observed effect.
- Stable autoscaling needs smoothing, cooldowns, rate limits, bounds, and readiness-aware decisions.
- Scheduler feedback matters because pending replicas may be blocked by quota, topology, provisioning, or capacity rather than application demand.
- The central trade-off is fast reaction to real demand versus damping enough to avoid oscillation and scheduler churn.