Distributed Schedulers and Control Planes: Human Overrides, Runbooks, and Operational Control

021 35 min advanced

The core idea: Human intervention should enter a scheduler control plane through explicit, bounded, auditable desired state, because ad hoc fixes can fight reconciliation and create a second incident.

Core Insight

Imagine the simulation tests from the previous lesson found the duplicate-reservation bug, but a live incident still happens. risk-api is degraded, the scheduler is slow to react to a quota update, and the on-call engineer has a credible short-term fix: pause autoscaling, stop placing new work in one zone, and allow two high-priority recovery replicas to use reserved capacity.

The tempting move is to "just patch the objects." Edit a deployment here, delete a stuck reservation there, add a label directly to a node, maybe remove a finalizer. In a desired-state control plane, those edits are not isolated manual actions. They become inputs that controllers will observe, reconcile, retry, or overwrite. A human override is another control-plane decision, and it must be designed as carefully as an automated one.

The non-obvious lesson is that manual control is not the absence of automation. Good operational control gives humans a safe way to change authority temporarily. The trade-off is speed versus guardrails: during an incident, operators need fast action, but the system still needs scope, ownership, expiration, audit, and an undo path.

Overrides Are Desired State

An override should be represented as explicit state, not as a hidden side effect. That state should answer:

For example, an emergency placement override might look conceptually like this:

kind: SchedulerOverride
name: risk-api-eu-west-recovery
spec:
  target: workload/risk-api
  scope: region/eu-west
  reason: recovery-capacity
  expiresAfter: 45m
  changes:
    pauseAutoscaler: true
    avoidZones: [eu-west-a]
    allowReservedCapacityClass: recovery
status:
  observedGeneration: 12
  active: true
  affectedControllers: [scheduler, autoscaler, quota-controller]

The exact API shape is less important than the properties. The override is visible, scoped, time-bounded, and processed by normal controllers. It is not a one-off shell command that leaves the next engineer guessing which state is authoritative.
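To make the "scoped" and "time-bounded" properties concrete, here is a minimal Python sketch that mirrors a few fields of the example object. The class, its method names, and the timestamp are illustrative assumptions, not part of any real scheduler API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SchedulerOverride:
    """Hypothetical in-memory mirror of the SchedulerOverride example."""
    name: str
    target: str              # e.g. "workload/risk-api"
    scope: str               # e.g. "region/eu-west"
    reason: str
    created_at: datetime
    expires_after: timedelta

    def is_active(self, now: datetime) -> bool:
        # Time-bounded: the override stops applying once its TTL elapses,
        # even if nobody remembers to delete it.
        return now < self.created_at + self.expires_after

    def applies_to(self, workload: str, region: str) -> bool:
        # Scoped: only the named workload in the named region is affected.
        return (self.target == f"workload/{workload}"
                and self.scope == f"region/{region}")


created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # illustrative timestamp
override = SchedulerOverride(
    name="risk-api-eu-west-recovery",
    target="workload/risk-api",
    scope="region/eu-west",
    reason="recovery-capacity",
    created_at=created,
    expires_after=timedelta(minutes=45),
)

print(override.is_active(created + timedelta(minutes=10)))  # True: within the TTL
print(override.is_active(created + timedelta(minutes=50)))  # False: past the TTL
print(override.applies_to("risk-api", "us-east"))           # False: outside the scope
```

Controllers that consult `is_active` and `applies_to` before acting would ignore the override automatically once it expires or when the request falls outside its scope, which is what makes the state safe to leave behind.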

A Taxonomy of Human Controls

Different incidents need different kinds of intervention. Treating every intervention as "manual mode" makes the control surface too blunt.

| Control | What it changes | Risk |
|---|---|---|
| Pause | stops expansion, rollout, repair, or autoscaling temporarily | stale bad state remains longer |
| Placement constraint | avoids a zone, node class, topology, or capacity pool | stranded capacity or overload elsewhere |
| Priority or quota override | allows urgent work to bypass normal fairness limits | tenant unfairness or starvation |
| Drain or cordon | stops new work or moves existing work away from infrastructure | disruption and capacity pressure |
| Repair approval | allows cleanup of ambiguous partial state | deleting useful progress |
| Failover or traffic shift | moves user traffic to a safer region or pool | overload, data locality, or consistency effects |
| Break-glass access | grants temporary elevated authority | accidental broad change or weak audit |

The design job is to make each control narrow enough to be safe and strong enough to be useful. A "pause autoscaler for this workload for 30 minutes" control is easier to reason about than "disable autoscaling." An "avoid this node pool for new placement" control is safer than deleting every existing workload in the pool.
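A narrow placement constraint can be sketched as a filter that only affects candidates for new placements. The `AvoidPool` type, the `candidate_pools` function, and the pool names below are hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvoidPool:
    """Hypothetical constraint: skip one node pool for one workload's new placements."""
    pool: str
    workload: str


def candidate_pools(workload: str, pools: list[str],
                    constraints: list[AvoidPool]) -> list[str]:
    # Filter only the candidates for NEW placements. Replicas already running
    # in an avoided pool are never touched by this function.
    avoided = {c.pool for c in constraints if c.workload == workload}
    return [p for p in pools if p not in avoided]


pools = ["pool-a", "pool-b", "pool-c"]
constraints = [AvoidPool(pool="pool-a", workload="risk-api")]

print(candidate_pools("risk-api", pools, constraints))     # ['pool-b', 'pool-c']
print(candidate_pools("billing-api", pools, constraints))  # ['pool-a', 'pool-b', 'pool-c']
```

Because the constraint names a workload, other tenants keep their normal placement options, and because it only filters candidates, nothing running is evicted as a side effect.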

Runbooks as Control-Plane Clients

A runbook is not just a document. In a mature system, it is a controlled path through the API.

A useful runbook has preconditions:

It also has an execution path:

And it has postconditions:

This framing prevents runbooks from becoming folklore. The runbook should reduce judgment under pressure without pretending judgment is unnecessary. If the operator cannot verify preconditions or postconditions, the runbook is not operationally complete.
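One way to picture a runbook as a control-plane client is a small runner that refuses to act when preconditions fail and escalates when postconditions fail. This is a sketch under assumptions: the `Runbook` class, the state dictionary, and the 60-second lag threshold are invented for illustration, and `pause_autoscaler` stands in for a real API call.

```python
from dataclasses import dataclass
from typing import Callable

Check = Callable[[dict], bool]


@dataclass
class Runbook:
    """Hypothetical runbook: named checks around one controlled action."""
    name: str
    preconditions: dict[str, Check]
    execute: Callable[[dict], None]
    postconditions: dict[str, Check]

    def run(self, state: dict) -> str:
        failed = [n for n, check in self.preconditions.items() if not check(state)]
        if failed:
            # Do nothing if the situation is not the one the runbook was written for.
            return f"refused: preconditions not met: {failed}"
        self.execute(state)
        failed = [n for n, check in self.postconditions.items() if not check(state)]
        if failed:
            # The action ran but did not have the intended effect.
            return f"escalate: postconditions not met: {failed}"
        return "ok"


def pause_autoscaler(state: dict) -> None:
    # Stand-in for a call to the control-plane API; here it only mutates a dict.
    state["autoscaler_paused"] = True


runbook = Runbook(
    name="pause-autoscaler-risk-api",
    preconditions={"scheduler_lag_high": lambda s: s["scheduler_lag_s"] > 60},
    execute=pause_autoscaler,
    postconditions={"autoscaler_paused": lambda s: s["autoscaler_paused"]},
)

lagging = {"scheduler_lag_s": 120, "autoscaler_paused": False}
healthy = {"scheduler_lag_s": 5, "autoscaler_paused": False}

print(runbook.run(lagging))  # ok
print(runbook.run(healthy))  # refused: preconditions not met: ['scheduler_lag_high']
```

The operator still decides whether to start the runbook; the checks only make the verification steps explicit and repeatable.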

Guardrails for Manual Authority

Human overrides need guardrails because they bypass some normal automated judgment. The key guardrails are practical:

The goal is not bureaucracy. The goal is to keep the override inside the same safety model as the rest of the control plane. A fast command with no scope, TTL, or audit may solve one symptom while creating hidden coupling that takes hours to unwind.
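A guardrail of this kind can be sketched as an admission check that rejects override requests lacking scope, an owner, a reason, or a TTL, and records every attempt. The field names, the one-hour ceiling, and the list-based audit log are assumptions made for the example.

```python
from datetime import timedelta

MAX_TTL = timedelta(hours=1)  # assumed policy ceiling, not a recommended value


def admit_override(request: dict, audit_log: list[dict]) -> list[str]:
    """Return rejection reasons; an empty list means the override is admitted."""
    problems = []
    if not request.get("target"):
        problems.append("missing target")
    if not request.get("scope"):
        problems.append("missing scope")
    if not request.get("reason"):
        problems.append("missing reason")
    if not request.get("requested_by"):
        problems.append("missing owner")
    ttl = request.get("ttl")
    if ttl is None:
        problems.append("missing ttl")
    elif ttl > MAX_TTL:
        problems.append("ttl exceeds maximum")
    # Audit every attempt, admitted or not, so rejected requests are visible too.
    audit_log.append({"request": request, "admitted": not problems, "problems": problems})
    return problems


audit_log: list[dict] = []

good = {
    "target": "workload/risk-api",
    "scope": "region/eu-west",
    "reason": "recovery-capacity",
    "requested_by": "oncall@example.com",  # illustrative owner
    "ttl": timedelta(minutes=30),
}
bad = {"target": "workload/risk-api"}

print(admit_override(good, audit_log))  # []
print(admit_override(bad, audit_log))
# ['missing scope', 'missing reason', 'missing owner', 'missing ttl']
print(len(audit_log))                   # 2
```

The check adds seconds to the operator's path, while the audit entries give the next engineer a record of which overrides were requested and why.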

Worked Example: A Safe Recovery Override

Suppose risk-api is still below its recovery target. The scheduler is rejecting replicas because one zone's capacity cache is stale, and the autoscaler is amplifying the problem by adding more desired replicas.

A risky manual response is:

delete pending replicas
increase desired replicas by hand
remove quota finalizers
label random nodes as eligible

This can fight four controllers at once. The scheduler may recreate pending work. The autoscaler may keep scaling. The quota controller may restore limits. Repair may interpret the finalizer edits as completed cleanup.
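The conflict between a hand edit and a reconciler can be shown with a toy autoscaler. This is a deliberately simplified sketch: the function, the replica counts, and the `paused` flag are assumptions, and a real autoscaler would derive its target from metrics.

```python
def autoscaler_reconcile(deployment: dict, target_replicas: int,
                         paused: bool = False) -> None:
    """Toy reconciler: it owns `replicas` and rewrites it on every pass."""
    if paused:
        # An explicit pause is an input the controller understands and honors.
        return
    deployment["replicas"] = target_replicas


# Ad hoc fix: the operator raises replicas by hand to 10, but the autoscaler
# has computed 14 from stale signals and overwrites the edit on its next pass.
deployment = {"name": "risk-api", "replicas": 8}
deployment["replicas"] = 10
autoscaler_reconcile(deployment, target_replicas=14)
after_manual_edit = deployment["replicas"]

# Explicit control: expansion is paused, so the same computed target is skipped.
deployment = {"name": "risk-api", "replicas": 8}
autoscaler_reconcile(deployment, target_replicas=14, paused=True)
after_pause = deployment["replicas"]

print(after_manual_edit)  # 14
print(after_pause)        # 8
```

The hand edit survives only until the next reconcile pass, whereas the pause changes what the controller itself decides to do.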

A safer runbook turns the response into bounded control-plane state:

1. Confirm scheduler lag and quota freshness are the blocker.
2. Create a 30-minute override scoped to risk-api in eu-west.
3. Pause autoscaler expansion for that workload.
4. Avoid the stale zone for new placement only.
5. Allow two replicas to use recovery capacity.
6. Watch SchedulerOverrideObserved=True and ReadyReplicas.
7. Remove the override, or let it expire, once normal scheduling resumes.
8. Verify no emergency capacity or reservations remain.

This intervention still changes production behavior. It is not "safe" because humans are involved. It is safer because the control plane can see it, enforce it, expire it, and report whether the right controllers acted on it.
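The "report whether the right controllers acted on it" property can be sketched as a check over per-controller acknowledgements. The acknowledgement map is an assumption for illustration; the generation number and controller names follow the conceptual `status` fields shown earlier.

```python
def override_observed(generation: int, affected: list[str],
                      acks: dict[str, int]) -> bool:
    # True only when every affected controller has acknowledged this
    # generation of the override (or a newer one).
    return all(acks.get(controller, -1) >= generation for controller in affected)


affected = ["scheduler", "autoscaler", "quota-controller"]

# The quota controller is still acting on the previous generation.
acks = {"scheduler": 12, "autoscaler": 12, "quota-controller": 11}
print(override_observed(12, affected, acks))  # False

# Once it catches up, the override can be reported as observed.
acks["quota-controller"] = 12
print(override_observed(12, affected, acks))  # True
```

An operator watching a condition like `SchedulerOverrideObserved=True` is relying on this kind of aggregation: the override counts as in effect only after every controller it targets has seen the current version.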

Operational Failure Modes

Connections

Resources

Key Takeaways
