Distributed Schedulers and Control Planes: Recovery, Rollback, and Repair Controllers

LESSON

Distributed Schedulers and Control Planes

018 35 min advanced

The core idea: Recovery restores useful service, rollback changes the authoritative intent, and repair reconciles leftover side effects; a control plane that treats them as one action creates duplicate work and hides damage.

Core Insight

Imagine risk-api loses capacity in eu-central, so the control plane tries to recover in eu-west. The scheduler binds replacement replicas, a rollout controller waits for readiness, and an autoscaler sees that traffic is still under-served. Then the team discovers that scheduler-policy-v5 is placing replicas onto nodes that look cheap but start too slowly for the recovery objective.

The obvious request is "roll it back." That sounds like one operation, but the control plane has several different jobs. It must recover enough serving capacity for users. It must move the scheduler or rollout policy back to a known-good revision. It must also repair the partial state already created by the bad attempt: reservations, bindings, finalizers, stale conditions, and workloads that are neither healthy nor clearly dead.

The non-obvious lesson is that rollback is not the opposite of recovery. A rollback changes the desired control decision. It does not automatically undo every side effect that previous controllers already committed. Repair controllers are the loops that make those side effects converge back to a safe shape. Without that separation, the system may restore an old policy while leaving new damage behind, or it may clean up aggressively and remove the capacity that recovery still needs.

Three Different Responses

Recovery, rollback, and repair often run during the same incident, but they answer different questions.

Response | Question | Typical owner | Main risk
Recovery | How do we regain useful service? | rollout, scheduler, failover, autoscaler | creating unsafe duplicate work or overload
Rollback | Which previous intent should become authoritative again? | release, policy, API, configuration controller | assuming old code implies old state
Repair | What partial side effects need reconciliation? | garbage collection, reservation, node, finalizer, custom repair controller | deleting useful progress or missing leaked state

Recovery is about service continuity. It may create replacement replicas, move traffic, relax non-critical constraints, or allocate emergency capacity. The recovery controller should publish what it has restored and what is still degraded.

Rollback is about authority. It says that revision v4 should govern future decisions instead of revision v5, or that a previous rollout template should replace the current one. A rollback needs versioned desired state, observed generation, and a way to tell each controller which revision it has acted on.
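
As a rough illustration of the observed-generation idea, the sketch below checks whether every controller has caught up with the rolled-back intent. The class names, field names, and generation numbers are hypothetical and chosen only for this example; they are not taken from any particular control plane.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyIntent:
    revision: str    # the revision that should govern future decisions
    generation: int  # assumed to increase on every change to desired state


@dataclass(frozen=True)
class ControllerStatus:
    name: str
    observed_generation: int  # last generation this controller acted on


def lagging(intent, statuses):
    """Names of controllers that have not yet acted on the current generation."""
    return [s.name for s in statuses if s.observed_generation < intent.generation]


def rollback_observed(intent, statuses):
    """True only when no controller is still acting on an older generation."""
    return not lagging(intent, statuses)


# Hypothetical snapshot shortly after the rollback to v4.
intent = PolicyIntent(revision="scheduler-policy-v4", generation=7)
statuses = [
    ControllerStatus("scheduler", 7),
    ControllerStatus("rollout", 7),
    ControllerStatus("autoscaler", 6),  # still reacting to the v5 world
]
```

In this snapshot the rollback is not yet observed, because the autoscaler is still one generation behind. The rollback counts as authoritative only once every controller reports the new generation.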

Repair is about convergence after imperfect progress. It looks for objects that no longer match the authoritative intent: an old reservation with no owner, a workload stuck behind a finalizer, a bound replica on a node that should no longer receive work, or a condition that says progress is blocked even though the cause is gone.

The design problem is not choosing one of the three. The design problem is deciding which controller owns each part of the response and which state proves that part is complete.

Authority and State Boundaries

A rollback or repair operation is dangerous when authority is vague. If three controllers can all decide to delete a failed recovery replica, then the first successful cleanup can be followed by two stale cleanup attempts. If a scheduler changes policy but existing bindings have no revision marker, later controllers cannot distinguish "placed by the bad policy" from "placed earlier and still valid."

A useful control plane keeps a state path like this:

desired revision
    -> admitted intent
    -> scheduled and bound children
    -> started workloads
    -> ready and serving status
    -> cleanup and repair conditions

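Each step should expose enough metadata for the next step to act safely: which revision produced it, which object owns it, and which finalizers still guard external side effects.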

This metadata is not decoration. It is what lets rollback and repair avoid guessing. If a replacement replica was created by scheduler-policy-v5, a repair controller can decide whether to keep it, drain it, or recreate it under v4. If a reservation belongs to a workload that no longer exists, ownership and garbage collection can clean it up. If a finalizer is still present, the deleting controller can explain which external side effect has not been cleaned.
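
A minimal sketch of why the revision marker matters, assuming each binding carries an optional `placed_by` field. The `Binding` type, the replica and node names, and the three labels are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Binding:
    replica: str
    node: str
    placed_by: Optional[str]  # revision marker; None models a binding with no marker


def classify(binding: Binding, bad_revision: str) -> str:
    """Sort a binding by what its revision marker says about the bad policy."""
    if binding.placed_by is None:
        return "unknown"     # no marker: the controller would have to guess
    if binding.placed_by == bad_revision:
        return "suspect"     # candidate for a keep, drain, or recreate decision
    return "unaffected"      # placed under another revision


# Hypothetical bindings observed after the rollback.
bindings = [
    Binding("risk-api-0", "node-a", "scheduler-policy-v4"),
    Binding("risk-api-7", "node-cold-3", "scheduler-policy-v5"),
    Binding("risk-api-9", "node-b", None),
]
result = {b.replica: classify(b, "scheduler-policy-v5") for b in bindings}
```

The "unknown" case is the one the metadata exists to prevent: without a marker, neither rollback nor repair can justify touching or ignoring the binding.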

Rollback Is State-Aware

The simplest rollback story is a release story: deploy version v5, detect trouble, return to v4. That mental model is useful but incomplete for schedulers and control planes because many side effects are outside the rolled-back object.

Suppose scheduler-policy-v5 changed scoring to prefer cheaper nodes in eu-west. During the first five minutes of the incident its effects spanned six replicas, two of which became ready and two of which started slowly, plus one leaked reservation.

Rolling the policy object back to v4 does not answer what should happen to those six replicas. Keeping the two ready replicas may be the safest recovery choice. Recreating the two slow replicas may improve latency. Deleting the leaked reservation is repair, not rollback. Reducing desired capacity may be correct only after the rollout controller sees enough ready capacity.

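A state-aware rollback has to name the treatment of existing side effects: which replicas are kept, which are recreated, which leftovers are handed to repair, and when desired capacity may shrink.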

This is why many systems prefer "roll forward" when the bad state is already widespread. The question is not whether rollback is morally cleaner. The question is which path gives controllers a clearer, safer convergence target.
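
One way to make the treatment of existing replicas explicit is a small decision function. The sketch below is an assumption-laden illustration: the `Replica` fields, the `startup_budget_s` parameter, and the three treatments are hypothetical, and a real controller would likely weigh more signals.

```python
from dataclasses import dataclass
from enum import Enum


class Treatment(Enum):
    KEEP = "keep"        # leave in service even though the bad revision placed it
    REQUEUE = "requeue"  # reschedule under the authoritative revision
    WAIT = "wait"        # still inside its startup budget; decide later


@dataclass(frozen=True)
class Replica:
    name: str
    ready: bool
    meets_hard_constraints: bool
    seconds_since_bind: int


def treatment(replica: Replica, startup_budget_s: int) -> Treatment:
    """Pick a treatment for a replica that the rolled-back revision created."""
    if not replica.meets_hard_constraints:
        return Treatment.REQUEUE  # hard constraints outrank readiness
    if replica.ready:
        return Treatment.KEEP     # useful capacity that recovery still needs
    if replica.seconds_since_bind > startup_budget_s:
        return Treatment.REQUEUE  # too slow for the recovery objective
    return Treatment.WAIT


# Hypothetical replicas with an assumed 120-second startup budget.
ready_ok = treatment(Replica("risk-api-4", True, True, 300), 120)
too_slow = treatment(Replica("risk-api-6", False, True, 300), 120)
starting = treatment(Replica("risk-api-8", False, True, 60), 120)
invalid = treatment(Replica("risk-api-5", True, False, 300), 120)
```

Checking hard constraints before readiness is a deliberate choice in this sketch: a ready replica in an invalid position is still requeued, while a ready replica that is merely suboptimal is kept.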

Repair Controllers

A repair controller is a reconciliation loop specialized for partial, stale, or contradictory state. It does not usually decide the product-level desired state. It enforces invariants around ownership, lifecycle, reservations, leases, and cleanup.

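Useful repair controllers tend to be boring and explicit. They scan for conditions such as reservations with no owner, workloads stuck behind a finalizer, bound replicas on nodes that should no longer receive work, and status conditions that report a block whose cause is gone.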

They then act through normal APIs rather than through hidden side channels. That matters because repair is part of the same distributed system as the failure. It should be rate-limited, observable, conflict-aware, and idempotent. A repair controller that performs fast direct deletes can be more damaging than the leak it is trying to clean up.
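
The sketch below illustrates the idempotent and conflict-aware parts of that list using an in-memory stand-in for an API server. `ReservationStore`, `repair_reservation`, and the reservation names are hypothetical; rate limiting and observability are left out to keep the example short.

```python
class Conflict(Exception):
    """Raised when an object changed after the caller last read it."""


class ReservationStore:
    """In-memory stand-in for an API with optimistic concurrency (assumed)."""

    def __init__(self):
        self._items = {}  # name -> (version, owner)

    def put(self, name, owner):
        version = self._items.get(name, (0, None))[0] + 1
        self._items[name] = (version, owner)

    def get(self, name):
        return self._items.get(name)

    def delete(self, name, expected_version):
        current = self._items.get(name)
        if current is None:
            return  # deleting something already gone is a no-op
        if current[0] != expected_version:
            raise Conflict(name)  # someone else wrote first; do not clobber
        del self._items[name]


def repair_reservation(store, name, live_owners):
    """Delete a reservation only if its owner is gone; safe to call repeatedly."""
    item = store.get(name)
    if item is None:
        return "already-gone"
    version, owner = item
    if owner in live_owners:
        return "kept"
    try:
        store.delete(name, expected_version=version)
    except Conflict:
        return "retry-later"  # re-read on the next reconcile pass
    return "deleted"


store = ReservationStore()
store.put("resv-eu-west-17", owner="risk-api-6")  # owner no longer exists
store.put("resv-eu-west-18", owner="risk-api-4")  # owner is still live
live = {"risk-api-4"}
first = repair_reservation(store, "resv-eu-west-17", live)
second = repair_reservation(store, "resv-eu-west-17", live)
kept = repair_reservation(store, "resv-eu-west-18", live)
```

Running the repair twice on the same leaked reservation deletes it once and then reports that it is already gone, which is the property that makes stale cleanup attempts harmless.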

Repair also needs a notion of strength of evidence. A delayed node heartbeat may be enough to stop new placement, but not enough to delete every bound workload. A reservation with an expired deadline and no owner is stronger evidence. A finalizer whose external resource has already disappeared is stronger still. The action should match the confidence of the observation.
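
Matching action to confidence can be sketched as an ordered evidence level plus a minimum level per action. The three levels follow the examples in the paragraph above, but the action names and the thresholds assigned to them are assumptions made for this illustration.

```python
from enum import IntEnum


class Evidence(IntEnum):
    WEAK = 1       # e.g. a delayed node heartbeat
    STRONG = 2     # e.g. a reservation past its deadline with no owner
    CONFIRMED = 3  # e.g. a finalizer whose external resource is already gone


# Assumed minimum evidence per action; a real system would tune these
# and scope the evidence to the specific object being repaired.
REQUIRED = {
    "stop_new_placement": Evidence.WEAK,
    "delete_reservation": Evidence.STRONG,
    "remove_finalizer": Evidence.CONFIRMED,
}


def allowed_actions(evidence: Evidence):
    """Actions whose required evidence level is met, in alphabetical order."""
    return sorted(a for a, need in REQUIRED.items() if evidence >= need)


weak = allowed_actions(Evidence.WEAK)
strong = allowed_actions(Evidence.STRONG)
confirmed = allowed_actions(Evidence.CONFIRMED)
```

With only a weak signal, the sketch permits stopping new placement and nothing destructive; deletion becomes available only as the evidence strengthens.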

Worked Example: Undoing a Bad Recovery Placement

Consider this incident timeline:

00:00 eu-central capacity drops
00:02 recovery controller asks for four extra risk-api replicas in eu-west
00:03 scheduler-policy-v5 binds replicas to cheap cold nodes
00:06 readiness is still low, autoscaler adds two more desired replicas
00:08 operators roll scheduler policy back to v4
00:09 repair controller finds one leaked reservation and two slow placements

A weak response treats rollback as a reset button:

install v4
delete all replicas created during v5
create replacements immediately
let autoscaler keep reacting to low readiness

That can turn a placement bug into a second outage. The two replicas that were ready disappear, new image pulls start, the autoscaler sees even lower readiness, and the control plane receives another burst of scheduling and binding work.

A stronger response separates the jobs:

1. Mark scheduler-policy-v4 as the desired revision.
2. Pause new expansion until controllers observe the rollback generation.
3. Keep v5-created replicas that are ready and still satisfy hard constraints.
4. Requeue slow or invalid placements with stable workload identity.
5. Delete the leaked reservation through owner-aware repair.
6. Publish conditions for Recovered, RollbackObserved, and RepairComplete.

This path may leave some imperfect placements in service temporarily. That is often the right trade-off. The system is explicit about which imperfection it accepts: temporary cost or locality inefficiency, not hidden duplicate capacity, orphaned reservations, or a permanent policy split.
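
The three published conditions and the expansion pause can be sketched as pure functions over an incident snapshot. The `IncidentState` fields, the controller names, and the numbers used here are hypothetical values chosen for the example, not figures from the incident.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class IncidentState:
    desired_generation: int
    observed_generations: Dict[str, int]  # controller name -> generation acted on
    ready_replicas: int
    required_ready: int                   # assumed recovery target
    leaked_reservations: int
    pending_requeues: int


def conditions(s: IncidentState) -> Dict[str, bool]:
    """Compute the three conditions separately so none can hide another."""
    rollback_observed = all(
        g >= s.desired_generation for g in s.observed_generations.values()
    )
    return {
        "Recovered": s.ready_replicas >= s.required_ready,
        "RollbackObserved": rollback_observed,
        "RepairComplete": s.leaked_reservations == 0 and s.pending_requeues == 0,
    }


def expansion_allowed(s: IncidentState) -> bool:
    """New expansion stays paused until every controller sees the rollback."""
    return conditions(s)["RollbackObserved"]


# Hypothetical snapshot right after the rollback, and a later one.
early = IncidentState(7, {"scheduler": 7, "rollout": 7, "autoscaler": 6}, 2, 4, 1, 2)
later = IncidentState(7, {"scheduler": 7, "rollout": 7, "autoscaler": 7}, 4, 4, 0, 0)
```

Keeping the conditions independent means a snapshot can honestly report, for example, that the rollback has been observed while recovery and repair are still in progress.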
