Distributed Schedulers and Control Planes: Failure Detection, Retries, and Partial Progress
LESSON
Distributed Schedulers and Control Planes: Failure Detection, Retries, and Partial Progress
The core idea: Control planes cannot detect every failure perfectly, so they make progress by recording what was observed, retrying bounded and idempotent actions, and exposing partial progress instead of pretending operations are all-or-nothing.
Core Insight
Suppose risk-api needs four recovery replicas in eu-west after eu-central degrades. The scheduler selects nodes, the API server records bindings, kubelets start containers, images pull slowly, one node stops reporting heartbeats, and the rollout controller waits for readiness. Some parts of the operation succeeded. Some are delayed. Some may have failed. The control plane has to decide whether to retry, wait, roll back, or mark the operation stuck.
The naive view is that failure detection tells the system what happened: a node is alive or dead, a bind succeeded or failed, a pod is ready or not ready. Real control planes mostly observe hints. Heartbeats can be late. Watches can lag. A request can time out after the server committed it. A retry can run after the first attempt actually succeeded. Treating those hints as perfect facts creates duplicate work, lost capacity, and retry storms.
A robust scheduler and control plane assume uncertainty. They make each step idempotent where possible, attach stable identifiers to actions, record conditions that explain progress, and retry with backoff and deadlines. They also separate "not complete yet" from "failed beyond repair." That distinction is partial progress: the system may have done enough to continue safely, or enough to require cleanup, even if the whole operation is not finished.
Failure Detection Is Evidence
Failure detection in distributed control planes is usually evidence, not proof. Common signals include:
- missed node heartbeats
- pod readiness and liveness transitions
- API request timeouts
- watch disconnects
- lease expiration
- queue age
- repeated bind conflicts
- controller error counts
- lack of observed generation updates
- user-visible SLO burn
Each signal has a failure mode. A node heartbeat can be delayed by network pressure. A readiness probe can fail because a dependency is down. A request timeout can hide a successful write. A watch disconnect can mean either the server is unhealthy or the client fell behind.
The control plane should ask what each signal is allowed to decide. A missed heartbeat may be enough to stop placing new work on a node. It may not be enough to delete every workload immediately. A bind conflict may be enough to retry scheduling. It may not mean the workload itself is invalid. A rollout deadline may be enough to pause expansion. It may not mean all already-created replicas should be removed.
This is the same discipline as cache staleness and admission policy from earlier lessons: use the signal at the boundary where its uncertainty is acceptable.
Retries Need Shape
Retries are necessary because control-plane actions often fail temporarily. They are also dangerous because retries multiply load exactly when the system is already stressed.
A useful retry has shape:
- idempotency: repeating the action should not create a different side effect.
- stable action identity: duplicate attempts can be recognized as the same intent.
- bounded attempts: the controller eventually changes state instead of retrying forever.
- backoff and jitter: retries spread out instead of lining up into bursts.
- deadline: the operation knows when the old intent is no longer useful.
- fresh read before commitment: the retry checks whether desired or observed state has changed.
- reasoned status: users can see whether the retry is waiting on quota, node health, image pull, or conflict.
For example, a scheduler binding request should not blindly create another binding each time an HTTP request times out. It should use a stable workload identity and an authoritative write path that either confirms the binding, rejects a conflict, or lets the scheduler observe the committed state before trying again.
The point is not to avoid retries. The point is to make each retry safer than the failure it is responding to.
Partial Progress as State
A large control-plane operation rarely jumps from "not started" to "done." It moves through stages:
admitted -> queued -> scheduled -> bound -> starting -> ready -> serving
Each stage can complete independently. If risk-api has four desired recovery replicas and only two are ready, that is not the same as zero. If all four are bound but image pulls are slow, the scheduler has done its part and another boundary is now limiting progress. If two replicas are bound to a node that later becomes unhealthy, the operation needs repair, not just another generic retry.
Partial progress should be recorded as state that other controllers can read:
- desired replicas
- admitted replicas
- pending replicas and reasons
- bound replicas
- starting replicas
- ready replicas
- serving replicas
- last attempted action
- last observed generation
- last error reason
- deadline or timeout state
This record prevents controllers from guessing. The autoscaler should not keep raising desired replicas when the scheduler is blocked by topology. The rollout controller should not advance when ready replicas lag. Operators should not have to reconstruct progress from logs.
Idempotency and Ownership
Idempotency means repeating an operation has the same intended effect as doing it once. In control planes, idempotency often comes from ownership and version checks rather than from the operation being naturally harmless.
Useful patterns include:
- owner references: cleanup and reconciliation know which object owns a child object.
- finalizers: deletion waits until required cleanup has happened.
- generation and observed generation: controllers report which desired-state version they have acted on.
- compare-and-swap writes: updates fail if the object changed since it was read.
- operation IDs: duplicate requests can be matched to one logical action.
- conditions: controllers publish named progress and failure states.
- leases: only the current owner performs a time-bounded action.
Imagine a controller creates a reservation for a risk-api recovery pod and then crashes before recording success. On restart, it should be able to find the reservation by owner and intent. It should not create a second reservation because the first status update was lost. If the original workload was deleted, the reservation should be cleaned up through ownership or reconciliation.
Ownership makes retry safe because the controller can ask: "Do I already own the thing I was trying to create?" Version checks make retry safe because the controller can ask: "Is the decision I made still based on current enough state?"
Worked Example: Binding During Node Failure
Imagine the scheduler is binding one risk-api recovery replica:
00:00 scheduler selects node-b7
00:01 binding request sent
00:02 API request times out
00:03 node-b7 heartbeat becomes delayed
00:04 scheduler sees workload still not ready
A weak controller retries from the beginning without checking what happened:
select another node
send another binding
possibly leave two reservations
increase desired replicas because readiness is still low
create more pressure on the control plane
A stronger controller treats each observation as partial evidence:
1. Read the workload by stable identity.
2. Check whether a binding or reservation already exists.
3. If the first binding committed, wait for kubelet or node-health evidence.
4. If the binding did not commit, retry with backoff and a fresh candidate.
5. If node-b7 is uncertain, stop new placement there but avoid immediate destructive cleanup.
6. Publish condition: Bound=True or SchedulingRetrying=True with a reason.
The stronger path may be slower for one replica, but it prevents the system from creating duplicate side effects. It also tells the next controller where progress stopped: scheduling, binding, node startup, readiness, or traffic serving.
Operational Failure Modes
- Timeout means failure: a client retries after a timeout even though the write committed. The fix is stable action identity, read-after-timeout, and conflict-aware writes.
- Retry storm: many controllers retry immediately during an API or network incident. The fix is exponential backoff, jitter, work-queue rate limiting, and circuit breakers.
- Progress hidden in logs: operators cannot tell whether work is admitted, bound, starting, or ready. The fix is conditions, observed generations, and stage-specific metrics.
- Duplicate ownership: two controllers believe they own the same action. The fix is leases, leader election, owner references, and authoritative state transitions.
- Permanent partial state: reservations, finalizers, or bindings remain after the workload disappears. The fix is cleanup controllers, timeouts, and ownership-based reconciliation.
- Failure detector too aggressive: delayed heartbeats trigger destructive action during transient network pressure. The fix is graduated responses: stop new placement first, repair or evict only after stronger evidence.
Connections
- The previous lesson,
016.md, showed how cost and utilization trade-offs can create pressure. Retry storms and hidden partial progress often appear when those trade-offs are too tight. - The next lesson,
018.md, builds on these mechanics with repair, rollback, and recovery controllers. distributed-testing-simulation-and-deterministic-replayconnects these patterns to testing retries, crashes, partitions, and controller races.
Resources
- [DOC] Kubernetes Pod Lifecycle
- Focus: Study phases, conditions, readiness, and how pod state represents partial progress.
- [DOC] Kubernetes Jobs
- Focus: Look at retries, backoff limits, completions, and failure accounting for batch work.
- [DOC] Kubernetes Controllers
- Focus: Connect reconciliation with repeated observation and safe progress toward desired state.
- [DOC] client-go workqueue
- Focus: Inspect rate-limited queues, retries, and backoff as implementation tools for controllers.
- [BOOK] Site Reliability Engineering: Addressing Cascading Failures
- Focus: Use cascading failure patterns to reason about retry amplification and overload.
Key Takeaways
- Failure detectors provide evidence, not perfect truth.
- Retries need idempotency, stable identity, backoff, deadlines, and fresh checks before commitment.
- Partial progress should be represented as state so controllers and operators can see where work stopped.
- The central trade-off is fast repair versus avoiding duplicate side effects and retry amplification.