Distributed Schedulers and Control Planes: Recovery, Rollback, and Repair Controllers

Lesson 018, 35 min, advanced

The core idea: Recovery restores useful service, rollback changes the authoritative intent, and repair reconciles leftover side effects; a control plane that treats them as one action creates duplicate work and hides damage.

Core Insight

Imagine risk-api loses capacity in eu-central, so the control plane tries to recover in eu-west. The scheduler binds replacement replicas, a rollout controller waits for readiness, and an autoscaler sees that traffic is still under-served. Then the team discovers that scheduler-policy-v5 is placing replicas onto nodes that look cheap but start too slowly for the recovery objective.

The obvious request is "roll it back." That sounds like one operation, but the control plane has several different jobs. It must recover enough serving capacity for users. It must move the scheduler or rollout policy back to a known-good revision. It must also repair the partial state already created by the bad attempt: reservations, bindings, finalizers, stale conditions, and workloads that are neither healthy nor clearly dead.

The non-obvious lesson is that rollback is not the opposite of recovery. A rollback changes the desired control decision. It does not automatically undo every side effect that previous controllers already committed. Repair controllers are the loops that make those side effects converge back to a safe shape. Without that separation, the system may restore an old policy while leaving new damage behind, or it may clean up aggressively and remove the capacity that recovery still needs.

Three Different Responses

Recovery, rollback, and repair often run during the same incident, but they answer different questions.

Response	Question	Typical owner	Main risk
Recovery	How do we regain useful service?	rollout, scheduler, failover, autoscaler	creating unsafe duplicate work or overload
Rollback	Which previous intent should become authoritative again?	release, policy, API, configuration controller	assuming old code implies old state
Repair	What partial side effects need reconciliation?	garbage collection, reservation, node, finalizer, custom repair controller	deleting useful progress or missing leaked state

Recovery is about service continuity. It may create replacement replicas, move traffic, relax non-critical constraints, or allocate emergency capacity. The recovery controller should publish what it has restored and what is still degraded.

Rollback is about authority. It says that revision v4 should govern future decisions instead of revision v5, or that a previous rollout template should replace the current one. A rollback needs versioned desired state, observed generation, and a way to tell each controller which revision it has acted on.

Repair is about convergence after imperfect progress. It looks for objects that no longer match the authoritative intent: an old reservation with no owner, a workload stuck behind a finalizer, a bound replica on a node that should no longer receive work, or a condition that says progress is blocked even though the cause is gone.

The design problem is not choosing one of the three. The design problem is deciding which controller owns each part of the response and which state proves that part is complete.

Authority and State Boundaries

A rollback or repair operation is dangerous when authority is vague. If three controllers can all decide to delete a failed recovery replica, then the first successful cleanup can be followed by two stale cleanup attempts, each acting on an observation that is no longer true. If a scheduler changes policy but existing bindings have no revision marker, later controllers cannot distinguish "placed by the bad policy" from "placed earlier and still valid."
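One common guard against stale cleanup is to make a delete conditional on the identity of the object the caller actually observed. The sketch below is a minimal illustration under assumptions: ObjectStore, its methods, and the uid values are hypothetical, and the in-memory dictionary only stands in for an authoritative API.

```python
from typing import Dict


class ObjectStore:
    """Toy in-memory stand-in for an API server; all names here are hypothetical."""

    def __init__(self) -> None:
        # object name -> unique ID of the current incarnation
        self._uids: Dict[str, str] = {}

    def create(self, name: str, uid: str) -> None:
        self._uids[name] = uid

    def exists(self, name: str) -> bool:
        return name in self._uids

    def delete(self, name: str, expected_uid: str) -> str:
        """Delete only the incarnation the caller observed."""
        current = self._uids.get(name)
        if current is None:
            return "already-gone"   # repeat of a cleanup that already succeeded
        if current != expected_uid:
            return "uid-mismatch"   # the name was reused; the caller's view is stale
        del self._uids[name]
        return "deleted"


store = ObjectStore()
store.create("risk-api-3", uid="uid-a")

# Three controllers all observed uid-a and all decide to clean up.
results = [store.delete("risk-api-3", expected_uid="uid-a") for _ in range(3)]

# A replacement is created under the same name, then a late stale attempt arrives.
store.create("risk-api-3", uid="uid-b")
late = store.delete("risk-api-3", expected_uid="uid-a")

print(results)  # ['deleted', 'already-gone', 'already-gone']
print(late)     # uid-mismatch
```

The repeated attempts become harmless no-ops, and the late attempt cannot remove the replacement. This narrows the damage from vague authority but does not replace a clear owner for the cleanup decision.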

A useful control plane keeps a state path like this:

desired revision
    -> admitted intent
    -> scheduled and bound children
    -> started workloads
    -> ready and serving status
    -> cleanup and repair conditions

Each step should expose enough metadata for the next step to act safely:

the desired generation or policy revision that produced the action
the controller that owns the child object or reservation
the condition that describes current progress
the deadline after which partial progress should be reconsidered
the finalizer or cleanup marker that prevents silent leaks
the lease or leadership state that says which actor may repair

This metadata is not decoration. It is what lets rollback and repair avoid guessing. If a replacement replica was created by scheduler-policy-v5, a repair controller can decide whether to keep it, drain it, or recreate it under v4. If a reservation belongs to a workload that no longer exists, ownership and garbage collection can clean it up. If a finalizer is still present, the deleting controller can explain which external side effect has not been cleaned.
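As a rough illustration, the metadata listed above could travel with each child object as a small record. This is a sketch under assumptions, not a real API: ChildRecord, its field names, and the may_repair rule are hypothetical and chosen only to mirror the list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ChildRecord:
    """Hypothetical metadata carried by a bound replica or a reservation."""
    name: str
    policy_revision: str            # revision that produced the action
    owner: Optional[str]            # controller or parent that owns this child
    condition: str                  # current progress, e.g. "Pulling" or "Ready"
    deadline_s: float               # time after which partial progress is reconsidered
    finalizers: List[str] = field(default_factory=list)  # pending external cleanup
    lease_holder: Optional[str] = None                    # actor allowed to repair


def may_repair(child: ChildRecord, actor: str, now_s: float) -> bool:
    """Assumed rule: only the lease holder may repair, and only once the
    deadline has passed or the child has lost its owner."""
    if child.lease_holder != actor:
        return False
    return child.owner is None or now_s >= child.deadline_s


replica = ChildRecord(
    name="risk-api-3",
    policy_revision="scheduler-policy-v5",
    owner="risk-api-rollout",
    condition="Pulling",
    deadline_s=300.0,
    lease_holder="repair-controller",
)

print(may_repair(replica, "repair-controller", now_s=120.0))  # False: deadline not reached
print(may_repair(replica, "repair-controller", now_s=300.0))  # True
print(may_repair(replica, "autoscaler", now_s=400.0))         # False: not the lease holder
```

Because the record names the revision, the owner, and the lease holder, a repair decision becomes a check against recorded facts instead of a guess.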

Rollback Is State-Aware

The simplest rollback story is a release story: deploy version v5, detect trouble, return to v4. That mental model is useful but incomplete for schedulers and control planes because many side effects are outside the rolled-back object.

Suppose scheduler-policy-v5 changed scoring to prefer cheaper nodes in eu-west. During the first five minutes of the incident it created these effects:

six risk-api replicas were bound under the new scoring policy
two replicas became ready and now serve traffic
two replicas are still pulling images
one reservation exists for a replica that was later deleted
one node was marked temporarily unhealthy
the autoscaler increased desired capacity because readiness lagged

Rolling the policy object back to v4 does not answer what should happen to those six replicas. Keeping the two ready replicas may be the safest recovery choice. Recreating the two slow replicas may improve latency. Deleting the leaked reservation is repair, not rollback. Reducing desired capacity may be correct only after the rollout controller sees enough ready capacity.

A state-aware rollback has to name the treatment of existing side effects:

preserve useful progress that is safe under the restored policy
drain work that is serving but violates the restored constraints
requeue work whose placement is invalid but whose intent remains valid
delete orphaned or duplicate children that no longer have a valid owner
compensate for external side effects that cannot be undone by changing spec
pause expansion while controllers converge on the restored revision

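This is why many systems prefer to "roll forward" when the bad state is already widespread: they ship a new revision that fixes the defect and states how existing side effects should be treated, instead of restoring the old one. The question is not whether rollback is morally cleaner. The question is which path gives controllers a clearer, safer convergence target.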
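The per-object treatments listed above can be sketched as a small classification function. This is a minimal sketch under assumptions: SideEffect, its fields, and the decision order are hypothetical, and it covers only preserve, drain, requeue, and delete. Compensation and pausing expansion act on external systems and on the controllers themselves, so they are left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Treatment(Enum):
    PRESERVE = "preserve"
    DRAIN = "drain"
    REQUEUE = "requeue"
    DELETE = "delete"


@dataclass
class SideEffect:
    name: str
    owner: Optional[str]         # None means the owning workload no longer exists
    ready: bool                  # currently serving traffic
    meets_restored_policy: bool  # satisfies the constraints of the restored revision


def classify(effect: SideEffect) -> Treatment:
    """Assumed decision order for state left behind by the rolled-back revision."""
    if effect.owner is None:
        return Treatment.DELETE      # orphaned child, nothing left to serve
    if effect.meets_restored_policy:
        return Treatment.PRESERVE    # useful progress that is safe under the restored policy
    if effect.ready:
        return Treatment.DRAIN       # serving, but violates the restored constraints
    return Treatment.REQUEUE         # placement invalid, intent still valid


effects = [
    SideEffect("risk-api-1", owner="risk-api", ready=True, meets_restored_policy=True),
    SideEffect("risk-api-3", owner="risk-api", ready=False, meets_restored_policy=False),
    SideEffect("risk-api-5", owner="risk-api", ready=True, meets_restored_policy=False),
    SideEffect("reservation-7", owner=None, ready=False, meets_restored_policy=False),
]

plan = {e.name: classify(e).value for e in effects}
print(plan)
# {'risk-api-1': 'preserve', 'risk-api-3': 'requeue', 'risk-api-5': 'drain', 'reservation-7': 'delete'}
```

The point of the sketch is that the plan is computed per object from recorded facts, so a rollback never has to delete everything the previous revision created.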

Repair Controllers

A repair controller is a reconciliation loop specialized for partial, stale, or contradictory state. It does not usually decide the product-level desired state. It enforces invariants around ownership, lifecycle, reservations, leases, and cleanup.

Useful repair controllers tend to be boring and explicit. They scan for conditions such as:

child objects whose owner no longer exists
reservations that passed their deadline without a matching binding
bindings on nodes that moved into a terminal unhealthy state
workloads stuck behind finalizers after the external resource is gone
leases held by controllers that have not renewed
status conditions that reference an old observed generation
rollout or recovery attempts that exceeded their deadline

They then act through normal APIs rather than through hidden side channels. That matters because repair is part of the same distributed system as the failure. It should be rate-limited, observable, conflict-aware, and idempotent. A repair controller that performs fast direct deletes can be more damaging than the leak it is trying to clean up.
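A bounded, idempotent repair pass might look like the sketch below. It assumes a hypothetical reservation store keyed by name, a release callback that stands in for a normal API call, and a max_actions budget per pass as a crude rate limit. None of these names come from a real library.

```python
from typing import Callable, Dict, List


def repair_pass(
    reservations: Dict[str, dict],
    now_s: float,
    release: Callable[[str], None],
    max_actions: int,
) -> List[str]:
    """Release reservations that passed their deadline without a binding,
    performing at most max_actions releases in this pass."""
    released: List[str] = []
    # sorted() copies the keys, so release() may safely mutate the dict.
    for name in sorted(reservations):
        if len(released) >= max_actions:
            break  # budget spent; the remaining work waits for the next pass
        entry = reservations.get(name)
        if entry is None or entry["bound"] or now_s < entry["deadline_s"]:
            continue
        release(name)
        released.append(name)
    return released


store = {
    "res-1": {"bound": False, "deadline_s": 60.0},
    "res-2": {"bound": True, "deadline_s": 60.0},
    "res-3": {"bound": False, "deadline_s": 60.0},
    "res-4": {"bound": False, "deadline_s": 600.0},
    "res-5": {"bound": False, "deadline_s": 90.0},
}


def release(name: str) -> None:
    # Idempotent: releasing an already-released reservation is a no-op.
    store.pop(name, None)


first = repair_pass(store, now_s=120.0, release=release, max_actions=2)
second = repair_pass(store, now_s=120.0, release=release, max_actions=2)
third = repair_pass(store, now_s=120.0, release=release, max_actions=2)
print(first, second, third)  # ['res-1', 'res-3'] ['res-5'] []
```

Running the pass again after the work is done changes nothing, which is the property that makes repair safe to retry after a controller restart.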

Repair also needs a notion of strength of evidence. A delayed node heartbeat may be enough to stop new placement, but not enough to delete every bound workload. A reservation with an expired deadline and no owner is stronger evidence. A finalizer whose external resource has already disappeared is stronger still. The action should match the confidence of the observation.
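The three examples in this paragraph can be written as explicit rules that tie each action to the evidence it requires. This is a sketch only: the Action names and function signatures are hypothetical, and a real controller would likely weigh more signals than these booleans.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    STOP_NEW_PLACEMENT = "stop-new-placement"
    RELEASE_RESERVATION = "release-reservation"
    REMOVE_FINALIZER = "remove-finalizer"


def node_action(heartbeat_delayed: bool) -> Action:
    # Weak evidence: stop placing new work, but leave bound workloads alone.
    return Action.STOP_NEW_PLACEMENT if heartbeat_delayed else Action.NONE


def reservation_action(deadline_passed: bool, owner_exists: bool) -> Action:
    # Stronger evidence: both facts must hold before the reservation is released.
    if deadline_passed and not owner_exists:
        return Action.RELEASE_RESERVATION
    return Action.NONE


def finalizer_action(external_resource_exists: bool) -> Action:
    # Strongest evidence: there is nothing left for the finalizer to protect.
    if external_resource_exists:
        return Action.NONE
    return Action.REMOVE_FINALIZER


print(node_action(heartbeat_delayed=True).value)                            # stop-new-placement
print(reservation_action(deadline_passed=True, owner_exists=True).value)    # none
print(reservation_action(deadline_passed=True, owner_exists=False).value)   # release-reservation
print(finalizer_action(external_resource_exists=False).value)               # remove-finalizer
```

Note that no rule here maps a delayed heartbeat to a delete. The destructive actions are reachable only from the observations that justify them.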

Worked Example: Undoing a Bad Recovery Placement

Consider this incident timeline:

00:00 eu-central capacity drops
00:02 recovery controller asks for four extra risk-api replicas in eu-west
00:03 scheduler-policy-v5 binds replicas to cheap cold nodes
00:06 readiness is still low, autoscaler adds two more desired replicas
00:08 operators roll scheduler policy back to v4
00:09 repair controller finds one leaked reservation and two slow placements

A weak response treats rollback as a reset button:

install v4
delete all replicas created during v5
create replacements immediately
let autoscaler keep reacting to low readiness

That can turn a placement bug into a second outage. The two replicas that were ready disappear, new image pulls start, the autoscaler sees even lower readiness, and the control plane receives another burst of scheduling and binding work.

A stronger response separates the jobs:

1. Mark scheduler-policy-v4 as the desired revision.
2. Pause new expansion until controllers observe the rollback generation.
3. Keep v5-created replicas that are ready and still satisfy hard constraints.
4. Requeue slow or invalid placements with stable workload identity.
5. Delete the leaked reservation through owner-aware repair.
6. Publish conditions for Recovered, RollbackObserved, and RepairComplete.

This path may leave some imperfect placements in service temporarily. That is often the right trade-off. The system is explicit about which imperfection it accepts: temporary cost or locality inefficiency, not hidden duplicate capacity, orphaned reservations, or a permanent policy split.
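The three published conditions from the stronger response can be derived from observable state. The sketch below is one possible shape: IncidentState, its fields, the generation numbers, and the required-ready threshold are all assumptions made for illustration, not values from the timeline.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class IncidentState:
    ready_replicas: int
    required_ready: int                 # assumed recovery objective
    desired_generation: int             # generation written by the rollback
    observed_generations: Dict[str, int]  # last generation each controller acted on
    leaked_reservations: int
    pending_requeues: int


def conditions(s: IncidentState) -> Dict[str, bool]:
    return {
        "Recovered": s.ready_replicas >= s.required_ready,
        "RollbackObserved": all(
            g >= s.desired_generation for g in s.observed_generations.values()
        ),
        "RepairComplete": s.leaked_reservations == 0 and s.pending_requeues == 0,
    }


def expansion_paused(s: IncidentState) -> bool:
    # Hold new expansion until every controller has observed the rollback.
    return not conditions(s)["RollbackObserved"]


at_0009 = IncidentState(
    ready_replicas=2,
    required_ready=4,
    desired_generation=6,
    observed_generations={"scheduler": 6, "autoscaler": 5},
    leaked_reservations=1,
    pending_requeues=2,
)
print(conditions(at_0009))
# {'Recovered': False, 'RollbackObserved': False, 'RepairComplete': False}
print(expansion_paused(at_0009))  # True
```

Keeping the conditions separate lets an operator see that an incident can be recovered while repair is still incomplete, or rolled back while service is still degraded.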

Operational Failure Modes

Rollback fights recovery: a rollback deletes the only healthy replacement capacity. The fix is to classify existing side effects before deleting them.
Repair has no owner model: cleanup cannot tell leaked state from useful partial progress. The fix is owner references, stable operation IDs, finalizers, and revision metadata.
Manual cleanup bypasses the API: operators remove external state without updating authoritative objects. The fix is a runbook that drives cleanup through the control surface or records explicit repair conditions.
Repair storm: many repair loops scan and delete aggressively during a control-plane outage. The fix is rate limits, backoff, leader ownership, and graduated actions.
Rollback leaves stale status: conditions still report progress for the old generation. The fix is observed-generation checks and status transitions tied to the active revision.
Compensation is ignored: external side effects such as load balancer entries, reservations, or cloud resources remain after desired state changes. The fix is a repair controller with explicit finalization semantics.
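For the repair storm case, capped exponential backoff is one simple way to slow retries while the control plane is unhealthy. The function name and the default base and cap below are assumptions chosen for the sketch. A real controller would usually add random jitter so that many loops do not retry in lockstep.

```python
def backoff_delay_s(attempt: int, base_s: float = 1.0, cap_s: float = 60.0) -> float:
    """Delay before retry number `attempt` (0-based): doubles each time, up to cap_s."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    exponent = min(attempt, 32)
    return min(cap_s, base_s * (2 ** exponent))


delays = [backoff_delay_s(n) for n in range(8)]
print(delays)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

Backoff limits how often one loop acts. It still needs to be combined with leader ownership and graduated actions so that many loops do not each perform the same destructive step.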

Connections

The previous lesson, 017.md, showed why failure detection and retries produce partial progress. This lesson shows how that partial progress is recovered, rolled back, or repaired.
The next lesson, 019.md, depends on these distinctions because observability has to expose whether an incident is blocked on recovery, rollback observation, or repair.
The lesson distributed-testing-simulation-and-deterministic-replay is useful adjacent context for testing rollback and repair races under partitions, stale watches, and controller restarts.

Resources

[DOC] Kubernetes Controllers
- Focus: Use the controller pattern as the baseline for recovery and repair loops that repeatedly observe and reconcile state.
[DOC] Kubernetes Deployments
- Focus: Study rollout history, revisions, paused rollouts, and rollback behavior as a concrete control-plane surface.
[DOC] Kubernetes Finalizers
- Focus: Connect deletion, cleanup, and external side effects to explicit lifecycle state.
[DOC] Kubernetes Garbage Collection
- Focus: Look at owner references and cleanup as part of normal convergence, not only incident response.
[BOOK] Site Reliability Workbook: Canarying Releases
- Focus: Use rollout and rollback discipline to reason about progressive change and limiting blast radius.

Key Takeaways

Recovery restores useful service, rollback changes authoritative intent, and repair reconciles partial or stale side effects.
A rollback is safe only when controllers know how to treat already-created state: preserve, drain, requeue, delete, compensate, or pause.
Repair controllers should be ordinary, observable, idempotent reconciliation loops with clear ownership and rate limits.
The central trade-off is between correcting quickly and preserving useful progress; on either path, the system must avoid hidden leaks and duplicate work.
