Distributed Schedulers and Control Planes: Human Overrides, Runbooks, and Operational Control
LESSON
The core idea: Human intervention should enter a scheduler control plane through explicit, bounded, auditable desired state, because ad hoc fixes can fight reconciliation and create a second incident.
Core Insight
Imagine the simulation tests from the previous lesson found the duplicate-reservation bug, but a live incident still happens. risk-api is degraded, the scheduler is slow to react to a quota update, and the on-call engineer has a credible short-term fix: pause autoscaling, stop placing new work in one zone, and allow two high-priority recovery replicas to use reserved capacity.
The tempting move is to "just patch the objects." Edit a deployment here, delete a stuck reservation there, add a label directly to a node, maybe remove a finalizer. In a desired-state control plane, those edits are not isolated manual actions. They become inputs that controllers will observe, reconcile, retry, or overwrite. A human override is another control-plane decision, and it must be designed as carefully as an automated one.
The non-obvious lesson is that manual control is not the absence of automation. Good operational control gives humans a safe way to change authority temporarily. The trade-off is speed versus guardrails: during an incident, operators need fast action, but the system still needs scope, ownership, expiration, audit, and an undo path.
Overrides Are Desired State
An override should be represented as explicit state, not as a hidden side effect. That state should answer:
- who requested the override
- what object, tenant, region, zone, or controller it affects
- what behavior is changed
- why the override exists
- when it expires
- which controller has observed it
- which condition proves it took effect
- how the system should return to normal
For example, an emergency placement override might look conceptually like this:
    kind: SchedulerOverride
    name: risk-api-eu-west-recovery
    spec:
      target: workload/risk-api
      scope: region/eu-west
      reason: recovery-capacity
      expiresAfter: 45m
      changes:
        pauseAutoscaler: true
        avoidZones: [eu-west-a]
        allowReservedCapacityClass: recovery
    status:
      observedGeneration: 12
      active: true
      affectedControllers: [scheduler, autoscaler, quota-controller]
The exact API shape is less important than the properties. The override is visible, scoped, time-bounded, and processed by normal controllers. It is not a one-off shell command that leaves the next engineer guessing which state is authoritative.
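One way to make these properties concrete is to model the override as a typed record that can say whether it is still in force and whether it should be accepted at all. The sketch below is illustrative only: the field names, the `validate` helper, and the `max_ttl` policy are assumptions made for this example, not part of any real scheduler API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SchedulerOverride:
    """Hypothetical in-memory form of a SchedulerOverride object."""
    name: str
    requested_by: str
    target: str               # e.g. "workload/risk-api"
    scope: str                # e.g. "region/eu-west"
    reason: str
    created_at: datetime
    expires_after: timedelta
    changes: dict = field(default_factory=dict)

    def expires_at(self) -> datetime:
        return self.created_at + self.expires_after

    def is_active(self, now: datetime) -> bool:
        # Past its deadline the override is treated as absent, so a
        # forgotten override cannot keep changing behavior.
        return now < self.expires_at()


def validate(override: SchedulerOverride, max_ttl: timedelta) -> list[str]:
    """Return reasons to reject the override; an empty list means accept."""
    problems = []
    if not override.requested_by:
        problems.append("missing requester")
    if not override.reason:
        problems.append("missing reason")
    if override.expires_after <= timedelta(0):
        problems.append("expiry must be positive")
    elif override.expires_after > max_ttl:
        problems.append("expiry exceeds maximum TTL")
    if not override.changes:
        problems.append("no behavior change specified")
    return problems
```

Rejecting an override that lacks a requester, a reason, or a bounded expiry at admission time is what keeps the "who, why, and until when" questions answerable later.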
A Taxonomy of Human Controls
Different incidents need different kinds of intervention. Treating every intervention as "manual mode" makes the control surface too blunt.
| Control | What it changes | Risk |
|---|---|---|
| Pause | stops expansion, rollout, repair, or autoscaling temporarily | stale bad state remains longer |
| Placement constraint | avoids a zone, node class, topology, or capacity pool | stranded capacity or overload elsewhere |
| Priority or quota override | allows urgent work to bypass normal fairness limits | tenant unfairness or starvation |
| Drain or cordon | stops new work or moves existing work away from infrastructure | disruption and capacity pressure |
| Repair approval | allows cleanup of ambiguous partial state | deleting useful progress |
| Failover or traffic shift | moves user traffic to a safer region or pool | overload, data locality, or consistency effects |
| Break-glass access | grants temporary elevated authority | accidental broad change or weak audit |
The design job is to make each control narrow enough to be safe and strong enough to be useful. A "pause autoscaler for this workload for 30 minutes" control is easier to reason about than "disable autoscaling." An "avoid this node pool for new placement" control is safer than deleting every existing workload in the pool.
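Narrowness can be checked mechanically rather than left to reviewer intuition. The following is a minimal sketch under assumed names: the `Scope` selector, its three dimensions, and the `max_breadth` threshold are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scope:
    """Hypothetical scope selector; None means "any value"."""
    workload: Optional[str] = None
    region: Optional[str] = None
    zone: Optional[str] = None

    def matches(self, workload: str, region: str, zone: str) -> bool:
        return (
            self.workload in (None, workload)
            and self.region in (None, region)
            and self.zone in (None, zone)
        )

    def breadth(self) -> int:
        # Number of unconstrained dimensions; higher means a blunter control.
        return [self.workload, self.region, self.zone].count(None)


def require_narrow(scope: Scope, max_breadth: int = 1) -> None:
    """Reject controls that leave too many dimensions unconstrained."""
    if scope.breadth() > max_breadth:
        raise ValueError(f"scope too broad: {scope.breadth()} open dimensions")
```

Here `Scope(workload="risk-api", region="eu-west")` passes the check, while an empty `Scope()`, the equivalent of "disable autoscaling" everywhere, is refused.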
Runbooks as Control-Plane Clients
A runbook is not just a document. In a mature system, it is a controlled path through the API.
A useful runbook has preconditions:
- the symptom that justifies the action
- the telemetry that identifies the affected object or scope
- the checks that rule out more dangerous causes
- the approval or role required for the action
It also has an execution path:
- the exact override or operation to apply
- the expected status condition after the operation
- the metrics or events to watch
- the deadline for reassessment
- the rollback or expiry path
And it has postconditions:
- the override expired or was removed
- affected controllers observed the normal generation again
- no orphaned reservations, finalizers, or emergency quotas remain
- the incident timeline records what changed and why
This framing prevents runbooks from becoming folklore. The runbook should reduce judgment under pressure without pretending judgment is unnecessary. If the operator cannot verify preconditions or postconditions, the runbook is not operationally complete.
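The three-part structure above can be sketched as a small data type in which preconditions gate the operation and postconditions are checked separately once the incident winds down. The `Runbook` class and the `execute` and `verify_cleanup` helpers are hypothetical names for this illustration; real checks would query telemetry and controller status rather than local callables.

```python
from dataclasses import dataclass
from typing import Callable

Check = Callable[[], bool]


@dataclass
class Runbook:
    """Hypothetical runbook: named checks around one operation."""
    name: str
    preconditions: dict[str, Check]
    apply: Callable[[], None]
    postconditions: dict[str, Check]


def failed(checks: dict[str, Check]) -> list[str]:
    """Labels of the checks that do not currently hold."""
    return [label for label, check in checks.items() if not check()]


def execute(runbook: Runbook) -> list[str]:
    """Apply the operation only if every precondition holds.

    Returns the unmet preconditions; an empty list means it was applied.
    """
    unmet = failed(runbook.preconditions)
    if not unmet:
        runbook.apply()
    return unmet


def verify_cleanup(runbook: Runbook) -> list[str]:
    """Postconditions that still fail, for example a lingering override."""
    return failed(runbook.postconditions)
```

Returning the labels of failed checks, rather than a bare boolean, gives the operator something specific to record in the incident timeline.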
Guardrails for Manual Authority
Human overrides need guardrails because they bypass some normal automated judgment. The key guardrails are practical:
- scope: limit by workload, tenant, region, zone, node pool, controller, or time window
- time to live: emergency state should expire unless renewed intentionally
- audit: record actor, reason, ticket, command, diff, and observed effect
- dry run: show what would change before committing when time allows
- idempotency: rerunning the runbook should not create duplicate state
- conflict checks: fail or pause when desired state changed since the runbook was prepared
- status feedback: publish whether the override was observed and by whom
- least privilege: break-glass roles should be narrow and temporary
- undo path: define how the system returns to normal control
The goal is not bureaucracy. The goal is to keep the override inside the same safety model as the rest of the control plane. A fast command with no scope, TTL, or audit may solve one symptom while creating hidden coupling that takes hours to unwind.
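Several of these guardrails (dry run, idempotency, conflict checks, and audit) can live in the write path itself. The sketch below uses a hypothetical in-memory `OverrideStore` as a stand-in for the control-plane API; the generation counter and the shape of the audit record are assumptions for illustration, and `INC-1` in the usage note is a made-up ticket identifier.

```python
class ConflictError(Exception):
    """Desired state changed since the runbook was prepared."""


class OverrideStore:
    """Hypothetical in-memory stand-in for the control-plane API."""

    def __init__(self):
        self.generation = 0   # bumped on every committed change
        self.overrides = {}   # override name -> spec
        self.audit_log = []   # append-only record of committed changes

    def apply(self, name, spec, actor, reason, expected_generation, dry_run=False):
        if self.overrides.get(name) == spec:
            # Idempotent: rerunning the same step creates no duplicate state.
            return {"changed": False, "generation": self.generation}
        if expected_generation != self.generation:
            # Conflict check: refuse to write over state the operator never saw.
            raise ConflictError(
                f"expected generation {expected_generation}, found {self.generation}"
            )
        diff = {"name": name, "before": self.overrides.get(name), "after": spec}
        if dry_run:
            # Report the change without committing or auditing it.
            return {"changed": False, "would_change": diff,
                    "generation": self.generation}
        self.overrides[name] = spec
        self.generation += 1
        self.audit_log.append({"actor": actor, "reason": reason, "diff": diff})
        return {"changed": True, "generation": self.generation}
```

Calling `apply("o1", spec, "oncall", "INC-1", 0)` twice commits one change and one audit record; a different spec submitted against the stale generation `0` raises `ConflictError` instead of silently overwriting newer state.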
Worked Example: A Safe Recovery Override
Suppose risk-api is still below its recovery target. The scheduler is rejecting replicas because one zone's capacity cache is stale, and the autoscaler is amplifying the problem by adding more desired replicas.
A risky manual response is:
    delete pending replicas
    increase desired replicas by hand
    remove quota finalizers
    label random nodes as eligible
This can fight four controllers at once. The scheduler may recreate pending work. The autoscaler may keep scaling. The quota controller may restore limits. Repair may interpret the finalizer edits as completed cleanup.
A safer runbook turns the response into bounded control-plane state:
1. Confirm scheduler lag and quota freshness are the blocker.
2. Create a 30-minute override scoped to risk-api in eu-west.
3. Pause autoscaler expansion for that workload.
4. Avoid the stale zone for new placement only.
5. Allow two replicas to use recovery capacity.
6. Watch SchedulerOverrideObserved=True and ReadyReplicas.
7. Remove or let the override expire after normal scheduling resumes.
8. Verify no emergency capacity or reservations remain.
This intervention still changes production behavior. Human involvement alone does not make it "safe." It is safer because the control plane can see it, enforce it, expire it, and report whether the right controllers acted on it.
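To illustrate what "the control plane can enforce it and expire it" might mean inside the controllers, here is a minimal sketch in which the scheduler and autoscaler each consult the override during their normal decision. The dictionary keys (`expires_at`, `avoid_zones`, `pause_autoscaler`) and the use of plain seconds for time are simplifying assumptions, not a real controller interface.

```python
def override_active(override, now):
    """An override counts only while it exists and has not expired."""
    return override is not None and now < override["expires_at"]


def eligible_zones(zones, override, now):
    """Zones the scheduler may use for new placements."""
    if not override_active(override, now):
        return list(zones)
    avoided = set(override.get("avoid_zones", []))
    return [zone for zone in zones if zone not in avoided]


def autoscaler_target(current, proposed, override, now):
    """Desired replica count after applying any active pause."""
    paused = override_active(override, now) and override.get("pause_autoscaler")
    if paused and proposed > current:
        return current   # block expansion only; scale-down is still allowed
    return proposed
```

Because both functions fall back to normal behavior once `now` reaches `expires_at`, no separate cleanup step is needed to stop the override from taking effect. Existing placements are untouched, which matches the "new placement only" intent of step 4.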
Operational Failure Modes
- Override fights reconciliation: a manual edit is repeatedly undone or amplified by controllers. The fix is to expose the override as desired state that controllers understand.
- No expiration: emergency capacity, labels, or bypasses remain after the incident. The fix is TTL by default and explicit renewal.
- Too broad a control: disabling an entire controller fixes the problem for one workload but damages others. The fix is scoped controls by workload, tenant, region, or policy.
- Invisible break-glass action: nobody can later explain who changed authority or why. The fix is audit, reason fields, and incident-linked records.
- Runbook lacks preconditions: operators apply the right action to the wrong failure mode. The fix is decision checks tied to telemetry and status.
- No cleanup verification: the service recovers but leaked reservations, finalizers, or quotas remain. The fix is postcondition checks and repair status.
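The last failure mode lends itself to an automated postcondition check. The sketch below assumes, hypothetically, that every reservation, quota grant, or label created under an override carries a `created_by_override` tag naming it; anything whose override is no longer active is reported as leaked.

```python
def leaked_resources(resources, active_override_names):
    """IDs of resources still tagged with an override that is no longer active."""
    return [
        resource["id"]
        for resource in resources
        if resource.get("created_by_override") is not None
        and resource["created_by_override"] not in active_override_names
    ]
```

An empty result is evidence for the postcondition "no orphaned reservations or emergency quotas remain"; a non-empty one is a list of objects for repair to examine rather than a license to delete them blindly.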
Connections
- The previous lesson, 020.md, showed how simulation and replay expose failure paths before production. Human runbooks should be tested against those same failure paths.
- The next lesson, 022.md, explores architectures that avoid some coordination pressure. Even coordination-avoiding designs still need explicit operational control surfaces.
- incident-management-and-operational-learning is adjacent context for turning overrides, runbook outcomes, and incident evidence into better operating practice.
Resources
- [BOOK] Site Reliability Engineering: Emergency Response
  - Focus: Use the discussion of preparation and response to think about runbooks as operational systems, not just notes.
- [DOC] Kubernetes: Safely Drain a Node
  - Focus: Study cordon and drain as explicit operational controls with disruption consequences.
- [DOC] Kubernetes: Taints and Tolerations
  - Focus: Connect placement overrides with scheduler-visible state rather than hidden manual choices.
- [DOC] Kubernetes: Pod Disruption Budgets
  - Focus: Look at how disruption policy protects availability during human or automated operations.
- [DOC] Kubernetes Auditing
  - Focus: Treat audit records as part of the control surface for privileged operational actions.
Key Takeaways
- Human overrides are control-plane inputs and should be represented as explicit, bounded, auditable desired state.
- The main trade-off is fast intervention versus guardrails such as scope, TTL, status feedback, least privilege, and undo paths.
- Runbooks should define preconditions, exact operations, expected status, deadlines, and postcondition cleanup checks.
- A safe manual intervention cooperates with reconciliation instead of secretly fighting the controllers.