Distributed Schedulers and Control Planes: Failure Detection, Retries, and Partial Progress

017 35 min advanced

The core idea: Control planes cannot detect every failure perfectly, so they make progress by recording what was observed, retrying bounded and idempotent actions, and exposing partial progress instead of pretending operations are all-or-nothing.

Core Insight

Suppose risk-api needs four recovery replicas in eu-west after eu-central degrades. The scheduler selects nodes, the API server records bindings, kubelets start containers, images pull slowly, one node stops reporting heartbeats, and the rollout controller waits for readiness. Some parts of the operation succeeded. Some are delayed. Some may have failed. The control plane has to decide whether to retry, wait, roll back, or mark the operation stuck.

The naive view is that failure detection tells the system what happened: a node is alive or dead, a bind succeeded or failed, a pod is ready or not ready. Real control planes mostly observe hints. Heartbeats can be late. Watches can lag. A request can time out after the server committed it. A retry can run after the first attempt actually succeeded. Treating those hints as perfect facts creates duplicate work, lost capacity, and retry storms.

A robust scheduler and control plane assume uncertainty. They make each step idempotent where possible, attach stable identifiers to actions, record conditions that explain progress, and retry with backoff and deadlines. They also separate "not complete yet" from "failed beyond repair." That distinction is partial progress: the system may have done enough to continue safely, or enough to require cleanup, even if the whole operation is not finished.

Failure Detection Is Evidence

Failure detection in distributed control planes is usually evidence, not proof. Common signals include node heartbeats, readiness probes, request timeouts, and watch disconnects.

Each signal has a failure mode. A node heartbeat can be delayed by network pressure. A readiness probe can fail because a dependency is down. A request timeout can hide a successful write. A watch disconnect can mean either the server is unhealthy or the client fell behind.

The control plane should ask what each signal is allowed to decide. A missed heartbeat may be enough to stop placing new work on a node. It may not be enough to delete every workload immediately. A bind conflict may be enough to retry scheduling. It may not mean the workload itself is invalid. A rollout deadline may be enough to pause expansion. It may not mean all already-created replicas should be removed.
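One way to make "what each signal is allowed to decide" explicit is a small policy table. The sketch below is illustrative only: the Signal and Action names and the ALLOWED table are hypothetical and simply encode the three examples from the paragraph above, not the API of any real control plane.

```python
from enum import Enum, auto


class Signal(Enum):
    MISSED_HEARTBEAT = auto()
    BIND_CONFLICT = auto()
    ROLLOUT_DEADLINE_EXCEEDED = auto()


class Action(Enum):
    STOP_NEW_PLACEMENT = auto()
    RETRY_SCHEDULING = auto()
    PAUSE_ROLLOUT = auto()
    DELETE_WORKLOADS = auto()
    REMOVE_CREATED_REPLICAS = auto()


# Hypothetical policy: the actions each signal may trigger on its own.
# Destructive actions are absent on purpose; they need stronger evidence.
ALLOWED = {
    Signal.MISSED_HEARTBEAT: {Action.STOP_NEW_PLACEMENT},
    Signal.BIND_CONFLICT: {Action.RETRY_SCHEDULING},
    Signal.ROLLOUT_DEADLINE_EXCEEDED: {Action.PAUSE_ROLLOUT},
}


def may_decide(signal: Signal, action: Action) -> bool:
    """Return True only if this signal alone is allowed to trigger the action."""
    return action in ALLOWED.get(signal, set())
```

Under this sketch, may_decide(Signal.MISSED_HEARTBEAT, Action.DELETE_WORKLOADS) is False: a late heartbeat can stop new placement, but deleting workloads would have to be justified by something else.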

This is the same discipline as cache staleness and admission policy from earlier lessons: use the signal at the boundary where its uncertainty is acceptable.

Retries Need Shape

Retries are necessary because control-plane actions often fail temporarily. They are also dangerous because retries multiply load exactly when the system is already stressed.

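A useful retry has shape: a stable identity for the action, an idempotent write, backoff between attempts, and a deadline after which the operation is reported as stuck.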

For example, a scheduler binding request should not blindly create another binding each time an HTTP request times out. It should use a stable workload identity and an authoritative write path that either confirms the binding, rejects a conflict, or lets the scheduler observe the committed state before trying again.

The point is not to avoid retries. The point is to make each retry safer than the failure it is responding to.
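A minimal sketch of a shaped retry follows, assuming the caller can supply two hypothetical callbacks: attempt, which performs the write, and check_committed, which reads the authoritative state for the same stable identity. TransientError, the parameter names, and the default values are all illustrative assumptions.

```python
import random
import time
from typing import Callable, Optional


class TransientError(Exception):
    """Hypothetical error type for failures whose outcome is unknown."""


def retry_with_shape(
    attempt: Callable[[], str],
    check_committed: Callable[[], Optional[str]],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    deadline_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    start = time.monotonic()
    for n in range(max_attempts):
        # An earlier attempt may have committed even though its response
        # was lost, so observe the committed state before writing again.
        existing = check_committed()
        if existing is not None:
            return existing
        try:
            return attempt()
        except TransientError:
            pass  # outcome unknown; do not assume the write failed
        if n == max_attempts - 1:
            break
        # Exponential backoff with full jitter, capped at max_delay.
        delay = random.uniform(0.0, min(max_delay, base_delay * 2 ** n))
        if time.monotonic() - start + delay > deadline_seconds:
            break
        sleep(delay)
    # The last attempt may also have committed; check once more.
    existing = check_committed()
    if existing is not None:
        return existing
    raise TimeoutError("retry budget exhausted; report the operation as stuck")
```

The read-before-write step is what keeps a timed-out but committed request from producing a second side effect. The jittered backoff and the deadline bound the extra load placed on an already stressed control plane.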

Partial Progress as State

A large control-plane operation rarely jumps from "not started" to "done." It moves through stages:

admitted -> queued -> scheduled -> bound -> starting -> ready -> serving

Each stage can complete independently. If risk-api has four desired recovery replicas and only two are ready, that is not the same as zero. If all four are bound but image pulls are slow, the scheduler has done its part and another boundary is now limiting progress. If two replicas are bound to a node that later becomes unhealthy, the operation needs repair, not just another generic retry.

Partial progress should be recorded as state that other controllers can read, such as per-stage counts and conditions with reasons.

This record prevents controllers from guessing. The autoscaler should not keep raising desired replicas when the scheduler is blocked by topology. The rollout controller should not advance when ready replicas lag. Operators should not have to reconstruct progress from logs.
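As a rough illustration, a progress record might carry per-stage counts plus named conditions. The field names, the SchedulingBlocked condition, and the TopologyConstraint reason below are hypothetical, chosen to mirror the risk-api example rather than any specific API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Condition:
    status: bool
    reason: str


@dataclass
class RolloutProgress:
    desired: int
    bound: int = 0
    ready: int = 0
    conditions: Dict[str, Condition] = field(default_factory=dict)

    def limiting_stage(self) -> str:
        """Name the boundary that is currently limiting progress."""
        if self.ready >= self.desired:
            return "complete"
        if self.bound < self.desired:
            return "scheduling"
        return "startup"  # everything is bound, but not everything is ready


def autoscaler_may_scale_up(progress: RolloutProgress) -> bool:
    """Do not raise desired replicas while scheduling reports it is blocked."""
    blocked = progress.conditions.get("SchedulingBlocked")
    return not (blocked is not None and blocked.status)


def rollout_may_advance(progress: RolloutProgress) -> bool:
    """Advance only once ready replicas have caught up with desired."""
    return progress.ready >= progress.desired
```

With desired=4, bound=4, ready=2, limiting_stage() returns "startup": the scheduler has done its part, and a reader of the record can tell without consulting logs.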

Idempotency and Ownership

Idempotency means repeating an operation has the same intended effect as doing it once. In control planes, idempotency often comes from ownership and version checks rather than from the operation being naturally harmless.

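Useful patterns include looking up existing objects by owner and intent before creating new ones, guarding writes with version checks, and cleaning up orphaned objects through ownership or reconciliation.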

Imagine a controller creates a reservation for a risk-api recovery pod and then crashes before recording success. On restart, it should be able to find the reservation by owner and intent. It should not create a second reservation because the first status update was lost. If the original workload was deleted, the reservation should be cleaned up through ownership or reconciliation.

Ownership makes retry safe because the controller can ask: "Do I already own the thing I was trying to create?" Version checks make retry safe because the controller can ask: "Is the decision I made still based on current enough state?"
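The two questions can be sketched against a toy store. ReservationStore below is a hypothetical in-memory stand-in for an authoritative API server; the method names and the (owner, intent) key are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Reservation:
    owner: str
    intent: str
    node: str
    version: int = 1


class ConflictError(Exception):
    """Raised when a write was based on a stale version."""


class ReservationStore:
    """Hypothetical in-memory stand-in for an authoritative API server."""

    def __init__(self) -> None:
        self._items: Dict[Tuple[str, str], Reservation] = {}

    def find(self, owner: str, intent: str) -> Optional[Reservation]:
        return self._items.get((owner, intent))

    def ensure(self, owner: str, intent: str, node: str) -> Reservation:
        # Ownership check: "Do I already own the thing I was trying to create?"
        existing = self.find(owner, intent)
        if existing is not None:
            return existing
        created = Reservation(owner, intent, node)
        self._items[(owner, intent)] = created
        return created

    def move(self, owner: str, intent: str, node: str, expected_version: int) -> Reservation:
        # Version check: "Is my decision still based on current enough state?"
        current = self._items[(owner, intent)]  # KeyError if it never existed
        if current.version != expected_version:
            raise ConflictError("stale version; re-read before retrying")
        current.node = node
        current.version += 1
        return current

    def count(self) -> int:
        return len(self._items)
```

A controller that crashes after ensure() and calls it again on restart gets the original reservation back rather than a second one. A move() based on an outdated read is rejected instead of silently overwriting a newer decision.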

Worked Example: Binding During Node Failure

Imagine the scheduler is binding one risk-api recovery replica:

00:00 scheduler selects node-b7
00:01 binding request sent
00:02 API request times out
00:03 node-b7 heartbeat becomes delayed
00:04 scheduler sees workload still not ready

A weak controller retries from the beginning without checking what happened:

select another node
send another binding
possibly leave two reservations
increase desired replicas because readiness is still low
create more pressure on the control plane

A stronger controller treats each observation as partial evidence:

1. Read the workload by stable identity.
2. Check whether a binding or reservation already exists.
3. If the first binding committed, wait for kubelet or node-health evidence.
4. If the binding did not commit, retry with backoff and a fresh candidate.
5. If node-b7 is uncertain, stop new placement there but avoid immediate destructive cleanup.
6. Publish condition: Bound=True or SchedulingRetrying=True with a reason.

The stronger path may be slower for one replica, but it prevents the system from creating duplicate side effects. It also tells the next controller where progress stopped: scheduling, binding, node startup, readiness, or traffic serving.
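The six steps of the stronger controller can be sketched as a single reconcile function. ClusterView, its fields, the reason strings, and the candidate node names other than node-b7 are hypothetical; backoff between calls is assumed to be handled by the caller.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ClusterView:
    bindings: Dict[str, str] = field(default_factory=dict)  # workload id -> node
    uncertain_nodes: Set[str] = field(default_factory=set)
    no_new_placement: Set[str] = field(default_factory=set)


def reconcile_binding(view: ClusterView, workload_id: str, candidates: List[str]) -> Dict[str, str]:
    # Step 5: stop new placement on uncertain nodes, but delete nothing.
    view.no_new_placement |= view.uncertain_nodes

    # Steps 1-2: read by stable identity and look for an existing binding.
    node = view.bindings.get(workload_id)
    if node is not None:
        # Step 3: the first binding committed; wait for node evidence.
        # Step 6: publish where progress stopped.
        return {"condition": "Bound", "node": node, "reason": "WaitingForNodeEvidence"}

    # Step 4: the binding did not commit; pick a fresh, eligible candidate.
    for candidate in candidates:
        if candidate not in view.no_new_placement:
            return {"condition": "SchedulingRetrying", "node": candidate, "reason": "BindingNotCommitted"}
    return {"condition": "SchedulingRetrying", "node": "", "reason": "NoEligibleNode"}
```

If the timed-out request at 00:02 actually committed, the function reports Bound on node-b7 and selects no second node, even though node-b7 is now uncertain. Whether to repair that placement later is left to a separate decision with stronger evidence.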

