Distributed Schedulers and Control Planes: Human Overrides, Runbooks, and Operational Control

021 35 min advanced

The core idea: Human intervention should enter a scheduler control plane through explicit, bounded, auditable desired state, because ad hoc fixes can fight reconciliation and create a second incident.

Core Insight

Imagine the simulation tests from the previous lesson found the duplicate-reservation bug, but a live incident still happens. The risk-api service is degraded, the scheduler is slow to react to a quota update, and the on-call engineer has a credible short-term fix: pause autoscaling, stop placing new work in one zone, and allow two high-priority recovery replicas to use reserved capacity.

The tempting move is to "just patch the objects." Edit a deployment here, delete a stuck reservation there, add a label directly to a node, maybe remove a finalizer. In a desired-state control plane, those edits are not isolated manual actions. They become inputs that controllers will observe, reconcile, retry, or overwrite. A human override is another control-plane decision, and it must be designed as carefully as an automated one.

The non-obvious lesson is that manual control is not the absence of automation. Good operational control gives humans a safe way to change authority temporarily. The trade-off is speed versus guardrails: during an incident, operators need fast action, but the system still needs scope, ownership, expiration, audit, and an undo path.

Overrides Are Desired State

An override should be represented as explicit state, not as a hidden side effect. That state should answer:

who requested the override
what object, tenant, region, zone, or controller it affects
what behavior is changed
why the override exists
when it expires
which controller has observed it
which condition proves it took effect
how the system should return to normal

For example, an emergency placement override might look conceptually like this:

kind: SchedulerOverride
name: risk-api-eu-west-recovery
spec:
  target: workload/risk-api              # what the override affects
  scope: region/eu-west                  # where it applies
  reason: recovery-capacity              # why it exists
  expiresAfter: 45m                      # when it expires
  changes:                               # what behavior is changed
    pauseAutoscaler: true
    avoidZones: [eu-west-a]
    allowReservedCapacityClass: recovery
status:
  observedGeneration: 12                 # spec generation the controllers have processed
  active: true
  affectedControllers: [scheduler, autoscaler, quota-controller]

The exact API shape is less important than the properties. The override is visible, scoped, time-bounded, and processed by normal controllers. It is not a one-off shell command that leaves the next engineer guessing which state is authoritative.
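The properties above can be checked mechanically before an override is accepted. The following is a minimal sketch, not the lesson's API: the SchedulerOverride class, its field names, the requester address, and the MAX_TTL ceiling are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

MAX_TTL = timedelta(hours=2)  # assumed policy ceiling, not taken from the lesson


@dataclass(frozen=True)
class SchedulerOverride:
    """Hypothetical in-memory form of the override object sketched above."""
    name: str
    requested_by: str
    target: str            # e.g. "workload/risk-api"
    scope: str             # e.g. "region/eu-west"
    reason: str
    created_at: datetime
    ttl: timedelta
    changes: dict = field(default_factory=dict)

    def validate(self) -> list:
        """Return a list of problems; an empty list means the override is acceptable."""
        problems = []
        if not self.requested_by:
            problems.append("missing requester")
        if not self.reason:
            problems.append("missing reason")
        if not self.target or not self.scope:
            problems.append("override must name a target and a scope")
        if self.ttl <= timedelta(0) or self.ttl > MAX_TTL:
            problems.append("ttl must be positive and at most MAX_TTL")
        if not self.changes:
            problems.append("override changes nothing")
        return problems

    def expires_at(self) -> datetime:
        return self.created_at + self.ttl

    def is_active(self, now: datetime) -> bool:
        # Expiry is the default outcome; staying active requires a new object or renewal.
        return now < self.expires_at()


created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
override = SchedulerOverride(
    name="risk-api-eu-west-recovery",
    requested_by="oncall@example.com",   # hypothetical actor
    target="workload/risk-api",
    scope="region/eu-west",
    reason="recovery-capacity",
    created_at=created,
    ttl=timedelta(minutes=45),
    changes={"pauseAutoscaler": True, "avoidZones": ["eu-west-a"]},
)
```

Rejecting an override at admission time is cheaper than discovering during the incident review that it had no owner or no expiry.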

A Taxonomy of Human Controls

Different incidents need different kinds of intervention. Treating every intervention as "manual mode" makes the control surface too blunt.

Control	What it changes	Risk
Pause	stops expansion, rollout, repair, or autoscaling temporarily	stale bad state remains longer
Placement constraint	avoids a zone, node class, topology, or capacity pool	stranded capacity or overload elsewhere
Priority or quota override	allows urgent work to bypass normal fairness limits	tenant unfairness or starvation
Drain or cordon	stops new work or moves existing work away from infrastructure	disruption and capacity pressure
Repair approval	allows cleanup of ambiguous partial state	deleting useful progress
Failover or traffic shift	moves user traffic to a safer region or pool	overload, data locality, or consistency effects
Break-glass access	grants temporary elevated authority	accidental broad change or weak audit

The design job is to make each control narrow enough to be safe and strong enough to be useful. A "pause autoscaler for this workload for 30 minutes" control is easier to reason about than "disable autoscaling." An "avoid this node pool for new placement" control is safer than deleting every existing workload in the pool.
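One way to make "narrow enough" concrete is to treat scope as data and refuse controls that leave too many dimensions open. This is an illustrative sketch only: the Scope class, its three dimensions, and the max_breadth threshold are assumptions, not part of any real scheduler API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scope:
    # None means "any"; every field left as None widens the blast radius.
    workload: Optional[str] = None
    region: Optional[str] = None
    node_pool: Optional[str] = None

    def matches(self, workload: str, region: str, node_pool: str) -> bool:
        """True when the control applies to this workload placement."""
        return (
            self.workload in (None, workload)
            and self.region in (None, region)
            and self.node_pool in (None, node_pool)
        )

    def breadth(self) -> int:
        """Count unconstrained dimensions; higher means broader."""
        return sum(v is None for v in (self.workload, self.region, self.node_pool))


def admit(scope: Scope, max_breadth: int = 1) -> bool:
    """Reject controls that leave too many dimensions unconstrained."""
    return scope.breadth() <= max_breadth


narrow = Scope(workload="risk-api", region="eu-west")   # any node pool
blanket = Scope()                                       # "disable everywhere"
```

Here the narrow scope is admitted and the blanket scope is refused, which mirrors the difference between pausing one workload's autoscaler and disabling autoscaling outright.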

Runbooks as Control-Plane Clients

A runbook is not just a document. In a mature system, it is a controlled path through the API.

A useful runbook has preconditions:

the symptom that justifies the action
the telemetry that identifies the affected object or scope
the checks that rule out more dangerous causes
the approval or role required for the action

It also has an execution path:

the exact override or operation to apply
the expected status condition after the operation
the metrics or events to watch
the deadline for reassessment
the rollback or expiry path

And it has postconditions:

the override expired or was removed
affected controllers have observed the post-override generation and returned to normal behavior
no orphaned reservations, finalizers, or emergency quotas remain
the incident timeline records what changed and why

This framing prevents runbooks from becoming folklore. The runbook should reduce judgment under pressure without pretending judgment is unnecessary. If the operator cannot verify preconditions or postconditions, the runbook is not operationally complete.
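The three parts of a runbook can be sketched as a small data structure whose action only runs when its preconditions hold. Everything here is hypothetical: the Runbook class, the state dictionary keys, the 30-second lag threshold, and the role name are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Check = Tuple[str, Callable[[Dict], bool]]   # (label, predicate over observed state)


@dataclass
class Runbook:
    name: str
    preconditions: List[Check]
    action: Callable[[Dict], None]
    postconditions: List[Check]

    def apply(self, state: Dict) -> List[str]:
        """Run the action only if every precondition holds; return the failed labels."""
        failed = [label for label, check in self.preconditions if not check(state)]
        if not failed:
            self.action(state)
        return failed

    def verify_cleanup(self, state: Dict) -> List[str]:
        """Return the postconditions that do not hold yet."""
        return [label for label, check in self.postconditions if not check(state)]


pause_autoscaler = Runbook(
    name="pause-autoscaler-risk-api",
    preconditions=[
        ("scheduler lag above threshold", lambda s: s["scheduler_lag_s"] > 30),
        ("operator holds incident role", lambda s: s["operator_role"] == "incident-responder"),
    ],
    action=lambda s: s["overrides"].append("pause-autoscaler/risk-api"),
    postconditions=[
        ("no override remains", lambda s: not s["overrides"]),
        ("no orphaned reservations", lambda s: s["orphaned_reservations"] == 0),
    ],
)
```

Calling apply on a state where scheduler lag is low returns the failed label and leaves the state untouched, so the operator learns why the runbook refused instead of applying the right action to the wrong failure mode.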

Guardrails for Manual Authority

Human overrides need guardrails because they bypass some normal automated judgment. The key guardrails are practical:

scope: limit by workload, tenant, region, zone, node pool, controller, or time window
time to live: emergency state should expire unless renewed intentionally
audit: record actor, reason, ticket, command, diff, and observed effect
dry run: show what would change before committing when time allows
idempotency: rerunning the runbook should not create duplicate state
conflict checks: fail or pause when desired state changed since the runbook was prepared
status feedback: publish whether the override was observed and by whom
least privilege: break-glass roles should be narrow and temporary
undo path: define how the system returns to normal control

The goal is not bureaucracy. The goal is to keep the override inside the same safety model as the rest of the control plane. A fast command with no scope, TTL, or audit may solve one symptom while creating hidden coupling that takes hours to unwind.
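Several of these guardrails (dry run, idempotency, conflict checks, audit) can live in the write path itself. The sketch below uses a hypothetical in-memory OverrideStore as a stand-in for a control-plane API; the version counter plays the role of an optimistic-concurrency token, and none of the names come from a real system.

```python
class ConflictError(Exception):
    """Raised when desired state changed since the runbook was prepared."""


class OverrideStore:
    """In-memory stand-in for a control-plane API (hypothetical)."""

    def __init__(self):
        self.objects = {}   # name -> (version, spec)
        self.audit = []     # append-only audit trail

    def apply(self, name, spec, actor, expected_version, dry_run=False):
        version, current = self.objects.get(name, (0, None))
        if current == spec:
            return version, "unchanged"                    # idempotent rerun
        if version != expected_version:
            raise ConflictError(
                f"{name}: expected v{expected_version}, found v{version}")
        if dry_run:
            return version, {"from": current, "to": spec}  # show the diff, write nothing
        self.objects[name] = (version + 1, spec)
        self.audit.append({"actor": actor, "name": name, "from": current, "to": spec})
        return version + 1, "applied"


store = OverrideStore()
spec = {"pauseAutoscaler": True, "ttlMinutes": 30}
preview = store.apply("risk-api-recovery", spec, "alice", expected_version=0, dry_run=True)
first = store.apply("risk-api-recovery", spec, "alice", expected_version=0)
rerun = store.apply("risk-api-recovery", spec, "alice", expected_version=0)
```

The rerun returns "unchanged" and writes no second audit record, while a different spec submitted against the stale version 0 would raise ConflictError rather than silently overwrite newer state.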

Worked Example: A Safe Recovery Override

Suppose risk-api is still below its recovery target. The scheduler is rejecting replicas because one zone's capacity cache is stale, and the autoscaler is amplifying the problem by adding more desired replicas.

A risky manual response is:

delete pending replicas
increase desired replicas by hand
remove quota finalizers
label random nodes as eligible

This can fight four controllers at once. The scheduler may recreate pending work. The autoscaler may keep scaling. The quota controller may restore limits. Repair may interpret the finalizer edits as completed cleanup.
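The autoscaler case is easy to reproduce in miniature. In this toy sketch the reconcile function, the state keys, and the load figures are all invented for illustration; the point is only that a hand-edited replica count survives one reconcile pass if, and only if, the controller can see a reason to leave it alone.

```python
def autoscaler_reconcile(state):
    """One reconcile pass: derive desired replicas from load unless paused."""
    if state.get("pause_override_active"):
        return state["desired_replicas"]        # controller honors the override
    # Ceiling division: enough replicas to cover the current load.
    target = -(-state["load"] // state["load_per_replica"])
    state["desired_replicas"] = target
    return target


# Manual edit: the operator pins 6 replicas by hand, but load still implies 10.
manual = {"load": 1000, "load_per_replica": 100, "desired_replicas": 6}
autoscaler_reconcile(manual)     # the next pass overwrites the manual edit

# Override: the same value, plus state the controller understands.
paused = {"load": 1000, "load_per_replica": 100, "desired_replicas": 6,
          "pause_override_active": True}
autoscaler_reconcile(paused)     # the value is left alone
```

After one pass the manually edited state is back at 10 replicas, while the paused state stays at 6.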

A safer runbook turns the response into bounded control-plane state:

1. Confirm scheduler lag and quota freshness are the blocker.
2. Create a 30-minute override scoped to risk-api in eu-west.
3. Pause autoscaler expansion for that workload.
4. Avoid the stale zone for new placement only.
5. Allow two replicas to use recovery capacity.
6. Watch SchedulerOverrideObserved=True and ReadyReplicas.
7. Remove the override, or let it expire, once normal scheduling resumes.
8. Verify no emergency capacity or reservations remain.

This intervention still changes production behavior. Human involvement does not make it safe. It is safer because the control plane can see it, enforce it, expire it, and report whether the right controllers acted on it.
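A placement function that enforces such an override might look like the sketch below. It is a simplified model under stated assumptions: the Capacity and PlacementOverride classes, the slot counts, and the numeric clock are hypothetical, and a real scheduler would weigh far more than free slots.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Capacity:
    zones: dict            # zone name -> free ordinary slots
    reserved_free: int     # free slots in the recovery capacity class


@dataclass
class PlacementOverride:
    workload: str
    avoid_zones: frozenset
    reserved_allowance: int    # replicas allowed onto recovery capacity
    expires_at: float          # seconds on the caller's clock
    reserved_used: int = 0

    def applies(self, workload: str, now: float) -> bool:
        return workload == self.workload and now < self.expires_at


def place(workload: str, cap: Capacity, override: Optional[PlacementOverride],
          now: float) -> Tuple[str, Optional[str]]:
    """Return ("zone", name), ("reserved", None), or ("pending", None)."""
    active = override is not None and override.applies(workload, now)
    for zone in sorted(cap.zones):
        if cap.zones[zone] <= 0:
            continue
        if active and zone in override.avoid_zones:
            continue          # skipped for new placement only; running replicas stay
        cap.zones[zone] -= 1
        return ("zone", zone)
    if (active and cap.reserved_free > 0
            and override.reserved_used < override.reserved_allowance):
        cap.reserved_free -= 1
        override.reserved_used += 1
        return ("reserved", None)
    return ("pending", None)


# eu-west-a reports free slots, but its capacity cache is considered stale.
cap = Capacity(zones={"eu-west-a": 5, "eu-west-b": 0}, reserved_free=3)
ov = PlacementOverride("risk-api", frozenset({"eu-west-a"}),
                       reserved_allowance=2, expires_at=1800.0)
results = [place("risk-api", cap, ov, now=0.0) for _ in range(3)]
```

The first two risk-api replicas land on recovery capacity and the third stays pending, because the allowance is two even though a third reserved slot is free. Other workloads are unaffected, and once the clock reaches expires_at the scheduler returns to normal placement without anyone deleting anything.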

Operational Failure Modes

Override fights reconciliation: a manual edit is repeatedly undone or amplified by controllers. The fix is to expose the override as desired state that controllers understand.
No expiration: emergency capacity, labels, or bypasses remain after the incident. The fix is TTL by default and explicit renewal.
Too broad a control: disabling an entire controller solves one workload but damages others. The fix is scoped controls by workload, tenant, region, or policy.
Invisible break-glass action: nobody can later explain who changed authority or why. The fix is audit, reason fields, and incident-linked records.
Runbook lacks preconditions: operators apply the right action to the wrong failure mode. The fix is decision checks tied to telemetry and status.
No cleanup verification: the service recovers but leaked reservations, finalizers, or quotas remain. The fix is postcondition checks and repair status.
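The last failure mode lends itself to an automated check, provided emergency artifacts are tagged with the override that created them. The sketch below assumes such tagging exists; the find_leaks function, the artifact identifiers, and the "override" key are hypothetical.

```python
def find_leaks(artifacts, active_overrides):
    """Return ids of artifacts still tagged with an override that is no longer active."""
    return [
        a["id"] for a in artifacts
        if a.get("override") is not None and a["override"] not in active_overrides
    ]


artifacts = [
    {"id": "reservation/r-17", "override": "risk-api-eu-west-recovery"},
    {"id": "quota/emergency-2", "override": "risk-api-eu-west-recovery"},
    {"id": "reservation/r-03", "override": None},   # ordinary reservation, never flagged
]

during_incident = find_leaks(artifacts, {"risk-api-eu-west-recovery"})
after_expiry = find_leaks(artifacts, set())
```

While the override is active nothing is reported. After it expires, the two tagged artifacts show up as leaks until a repair step removes them, which gives the postcondition check something concrete to assert.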

Connections

The previous lesson, 020.md, showed how simulation and replay expose failure paths before production. Human runbooks should be tested against those same failure paths.
The next lesson, 022.md, explores architectures that avoid some coordination pressure. Even coordination-avoiding designs still need explicit operational control surfaces.
The incident-management-and-operational-learning topic is adjacent context for turning overrides, runbook outcomes, and incident evidence into better operating practice.

Resources

[BOOK] Site Reliability Engineering: Emergency Response
- Focus: Use the discussion of preparation and response to think about runbooks as operational systems, not just notes.
[DOC] Kubernetes: Safely Drain a Node
- Focus: Study cordon and drain as explicit operational controls with disruption consequences.
[DOC] Kubernetes: Taints and Tolerations
- Focus: Connect placement overrides with scheduler-visible state rather than hidden manual choices.
[DOC] Kubernetes: Pod Disruption Budgets
- Focus: Look at how disruption policy protects availability during human or automated operations.
[DOC] Kubernetes Auditing
- Focus: Treat audit records as part of the control surface for privileged operational actions.

Key Takeaways

Human overrides are control-plane inputs and should be represented as explicit, bounded, auditable desired state.
The main trade-off is fast intervention versus guardrails such as scope, TTL, status feedback, least privilege, and undo paths.
Runbooks should define preconditions, exact operations, expected status, deadlines, and postcondition cleanup checks.
A safe manual intervention cooperates with reconciliation instead of secretly fighting the controllers.
