Distributed Schedulers and Control Planes: Testing, Simulation, and Deterministic Replay

LESSON

Distributed Schedulers and Control Planes

020 35 min advanced

The core idea: Control-plane tests need to explore timing, failure, and controller interleavings, and deterministic replay turns rare scheduler bugs from anecdotes into reproducible engineering work.

Core Insight

Imagine the team fixes the risk-api observability gap from the previous lesson. A later incident now has a clear timeline: the scheduler read a stale quota cache, the autoscaler added capacity, repair finished correctly, and the pending replica eventually bound. The team can explain what happened. The harder question is whether they can prove the same pattern will not break the next rollback, tenant, or region.

A normal unit test can check that a scoring function ranks nodes correctly. An integration test can check that a controller updates status after a binding. Those tests are useful, but many control-plane bugs live between correct pieces: a watch event arrives late, a leader lease expires mid-update, a retry observes a new generation, an admission plugin mutates labels, and a repair controller cleans up state at the same time.

The practical answer is not "test everything in production." It is to build a testing ladder that includes small deterministic tests, API-level integration tests, simulated worlds, fault injection, and replay of real decision timelines. The main trade-off is realism versus repeatability: a real cluster exhibits realistic, messy timing, but its failures are hard to reproduce; a simulation is controllable, but it only finds bugs that its model is rich enough to express.

Why Unit Tests Miss Control-Plane Bugs

Control planes are asynchronous. A controller reads observed state, compares it with desired state, and writes an action or status. Another controller may be doing the same thing against overlapping state. The bug often depends on the order in which those reads and writes happen.

Consider a scheduler and quota controller:

1. scheduler reads quota: tenant risk has no zone-c capacity
2. quota controller frees capacity in zone-c
3. scheduler marks replica-042 unschedulable
4. autoscaler sees low readiness and adds desired replicas
5. scheduler cache observes the quota update

Each component may pass its own tests. The scheduler made a valid decision based on its cache. The quota controller freed capacity. The autoscaler responded to readiness. The risk lies in the combined behavior: the system may create extra desired capacity because it treated a temporarily stale view of quota as a durable shortage.

Useful tests need to cover both local behavior and cross-controller properties:

a scheduler never binds one workload to two nodes
a retry does not create duplicate reservations
rollback does not delete the only healthy recovery capacity
repair eventually removes orphaned state
an autoscaler does not amplify transient scheduler lag without a bound
status conditions refer to the generation they actually observed

Those are invariants. They describe what must remain true across many possible timings, not just what one function returns for one input.
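An invariant like "a scheduler never binds one workload to two nodes" can be written as a check over whole-system state rather than over one function's return value. The following is a minimal sketch; the `Binding` type, its fields, and the `double_bound_workloads` helper are hypothetical names chosen for illustration, not part of any real scheduler API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    """Hypothetical record of one workload-to-node binding."""
    workload: str
    node: str
    active: bool = True


def double_bound_workloads(bindings):
    """Return workloads that violate 'at most one active binding'."""
    counts = Counter(b.workload for b in bindings if b.active)
    return sorted(w for w, n in counts.items() if n > 1)


# The check runs against a snapshot of state, for example after every
# simulated step, regardless of which controller wrote the bindings.
state = [
    Binding("risk-api/replica-042", "node-a"),
    Binding("risk-api/replica-042", "node-b"),                # violation
    Binding("risk-api/replica-043", "node-c"),
    Binding("risk-api/replica-043", "node-d", active=False),  # released
]
violations = double_bound_workloads(state)
assert violations == ["risk-api/replica-042"]
```

Because the check only reads state, the same function can be reused in property tests, simulation runs, and replay assertions.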

Simulation as a Small World

A simulation is a small, controlled version of the control plane. It does not need to run every real binary or every cloud integration to be valuable. It needs to model the state and failures that matter to the decision under test.

For a scheduler control plane, the simulated world might include:

workloads with desired generations and priorities
nodes, zones, capacities, labels, taints, and health transitions
quotas, reservations, and tenant limits
watch streams with delay, disconnects, and stale caches
controller queues, retries, backoff, leases, and deadlines
admission mutations and policy revisions
API writes that can commit, conflict, or time out
repair and garbage-collection loops

The simulator can then generate many event orders:

seed: 82219
00:01 admit risk-api generation 42
00:02 delay quota watch by 8 seconds
00:03 schedule replica-042
00:04 timeout binding write after commit
00:05 restart scheduler leader
00:07 rollback policy to v4
00:08 repair scans reservations

The seed matters. If the simulator finds a duplicate reservation at seed 82219, the team should be able to rerun exactly that seed and inspect the same sequence. Random exploration without replay produces failures that nobody can reproduce, which leaves engineers with a bug report but no way to fix the bug.
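The role of the seed can be illustrated with a toy explorer. This is a sketch under simple assumptions: the event names are taken from the trace above, while `generate_schedule`, `has_bug`, and the failure condition itself are hypothetical stand-ins for a real simulator and a real invariant check.

```python
import random

EVENTS = [
    "admit risk-api generation 42",
    "delay quota watch",
    "schedule replica-042",
    "timeout binding write after commit",
    "restart scheduler leader",
    "rollback policy to v4",
    "repair scans reservations",
]


def generate_schedule(seed):
    """Produce an event order that depends only on the seed."""
    rng = random.Random(seed)  # private generator, never the global one
    return rng.sample(EVENTS, k=len(EVENTS))


def has_bug(schedule):
    """Toy failure: the leader restarts right after the timed-out write."""
    i = schedule.index("timeout binding write after commit")
    return schedule[i + 1 : i + 2] == ["restart scheduler leader"]


# Explore many seeds and keep the ones that expose the failure.
failing_seeds = [s for s in range(200) if has_bug(generate_schedule(s))]

# Rerunning a recorded seed reproduces the same order, so the failure
# can be inspected as often as needed.
for seed in failing_seeds:
    assert has_bug(generate_schedule(seed))
assert generate_schedule(82219) == generate_schedule(82219)
```

The important design choice is that every random decision flows from one `random.Random(seed)` instance. Any call to the global `random` module or to the wall clock inside the simulator would break reproducibility.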

Deterministic Replay

Deterministic replay means recording enough inputs, timing decisions, and nondeterministic choices to run the same scenario again. The goal is not to replay wall-clock time perfectly. The goal is to make the same logical interleaving happen.

Useful replay inputs include:

initial object state
controller versions and feature flags
policy and admission configuration
watch events and resource versions
queue order and retry delays
injected faults, timeouts, conflicts, and restarts
random seeds used for scheduling choices
external decisions that affect state, such as capacity or quota updates
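One way to hold the inputs listed above is a single serializable record that travels with the incident. The sketch below assumes JSON-compatible values; the `ReplayArtifact` class and all of its field names are hypothetical, chosen only to mirror the list.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ReplayArtifact:
    """Hypothetical bundle of everything needed to rerun one scenario."""
    seed: int
    controller_versions: dict
    feature_flags: dict
    initial_objects: list
    # Ordered watch events, injected faults, and external updates.
    events: list = field(default_factory=list)

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))


artifact = ReplayArtifact(
    seed=82219,
    controller_versions={"scheduler": "example-version"},
    feature_flags={"example-flag": True},
    initial_objects=[{"kind": "Workload", "name": "risk-api", "generation": 42}],
    events=[
        {"step": 1, "type": "watch-delay", "target": "quota", "seconds": 8},
        {"step": 2, "type": "fault", "name": "timeout-after-commit"},
    ],
)

# A stored artifact must load back into an identical record.
restored = ReplayArtifact.from_json(artifact.to_json())
assert restored == artifact
```

The version strings and flag names here are placeholders. What matters is that the artifact is complete enough for a replay harness to start from `initial_objects` and apply `events` in order.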

Replay forces a design discipline: controllers should take their nondeterminism through explicit seams. Time should come from a clock abstraction in tests. Random selection should use a seedable source. API responses should be modelable. Work queues should expose enough order to reproduce a failure.

This does not mean production code should become artificial. It means the parts that make decisions should be separable from the parts that talk to the real clock, network, and API server. That separation also improves debuggability because the decision inputs become visible.
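A minimal sketch of such seams is shown below, assuming the decision logic receives its clock and random source as arguments. `FakeClock`, `SystemClock`, `lease_expired`, and `pick_node` are hypothetical names for illustration, not an existing library API.

```python
import random
import time


class SystemClock:
    """Production seam: reads the real monotonic clock."""
    def now(self):
        return time.monotonic()


class FakeClock:
    """Test seam: time moves only when the test advances it."""
    def __init__(self, start=0.0):
        self._now = start

    def now(self):
        return self._now

    def advance(self, seconds):
        self._now += seconds


def lease_expired(acquired_at, lease_seconds, clock):
    """Decision logic sees time only through the injected clock."""
    return clock.now() - acquired_at >= lease_seconds


def pick_node(candidates, rng):
    """Break ties among equally scored nodes with an injected RNG."""
    return rng.choice(sorted(candidates))


clock = FakeClock()
assert not lease_expired(0.0, 15.0, clock)
clock.advance(15.0)          # no sleeping, no flakiness
assert lease_expired(0.0, 15.0, clock)

nodes = {"node-a", "node-b", "node-c"}
assert pick_node(nodes, random.Random(7)) == pick_node(nodes, random.Random(7))
```

Sorting the candidates before choosing matters: set iteration order is not a stable input, so the seed alone would not otherwise pin down the choice.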

Testing Ladder

Different tests buy different confidence. A useful test strategy stacks them instead of expecting one layer to do all the work.

Layer	What it catches	Example
Unit tests	pure policy, scoring, filtering, and status transitions	topology filter rejects nodes without required labels
API-level integration tests	controller behavior against real API semantics	reconcile updates status only for the observed generation
Property tests	invariants across many generated inputs	no workload has two active bindings
Simulation	timing, partitions, stale watches, retries, and controller restarts	rollback races with repair and quota update
Deterministic replay	reproduction of a real or simulated incident	seed `82219` recreates duplicate reservation
Staging or chaos tests	infrastructure interactions outside the model	API server latency and watch disconnects under load

The trade-off is where the cost lands. Unit tests are cheap and precise but miss interactions. Staging tests are realistic but expensive and flaky. Simulation and replay sit in the middle: they are more work to build than unit tests, but they can explore interleavings that are almost impossible to trigger reliably by hand.

Worked Example: Replaying a Duplicate Reservation

Suppose an incident created two reservations for one risk-api recovery replica. The observability layer captured this decision timeline:

workload: risk-api replica-042 generation 42
00:00 scheduler creates reservation res-a
00:01 API write commits but client sees timeout
00:02 scheduler leader restarts
00:03 new leader reads stale reservation cache
00:04 new leader retries and creates reservation res-b
00:06 repair sees two reservations with same owner intent

The first fix might be "read after timeout." That is a reasonable patch, but a good test encodes the invariant rather than the patch:

For any workload identity and generation,
there must be at most one active reservation for the same scheduling intent.

The replay should force the same failure path:

1. Start with replica-042 pending.
2. Commit reservation create, but return timeout to the scheduler.
3. Restart the leader before status is updated.
4. Delay the reservation watch for the new leader.
5. Retry scheduling.
6. Assert that the controller reuses, confirms, or conflicts with res-a instead of creating res-b.

This test is better than checking one specific error message. It names the system property the control plane must preserve across timeout, restart, cache lag, and retry. Once the property is encoded, future scheduler changes can be tested against the same failure shape.
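The replay steps can be sketched against an in-memory stand-in for the API server. Everything here is a hypothetical model: `FakeAPI`, `Scheduler`, and the choice to derive the reservation name from the workload identity and generation are assumptions made for illustration. Deterministic naming is only one of several ways to satisfy step 6, here through the conflict path.

```python
class Conflict(Exception):
    pass


class FakeAPI:
    """In-memory API stand-in with one scripted timeout-after-commit."""
    def __init__(self, timeout_after_commit_on_call):
        self.reservations = {}  # reservation name -> owner intent
        self.calls = 0
        self._fault_call = timeout_after_commit_on_call

    def create_reservation(self, name, owner):
        self.calls += 1
        if name in self.reservations:
            raise Conflict(name)
        self.reservations[name] = owner        # the write commits...
        if self.calls == self._fault_call:
            raise TimeoutError("client saw timeout")  # ...client never learns


class Scheduler:
    def __init__(self, api, leader_id, deterministic_names):
        self.api = api
        self.leader_id = leader_id
        self.cache = {}  # empty cache models the delayed watch
        self.deterministic_names = deterministic_names

    def reserve(self, owner):
        if owner in self.cache.values():
            return "cached"
        if self.deterministic_names:
            name = "res-" + owner              # same intent, same name
        else:
            name = "res-" + self.leader_id     # fresh name per leader
        try:
            self.api.create_reservation(name, owner)
            return "created"
        except Conflict:
            return "reused"
        except TimeoutError:
            return "unknown"


def replay(deterministic_names):
    owner = "risk-api/replica-042/gen-42"
    api = FakeAPI(timeout_after_commit_on_call=1)
    Scheduler(api, "a", deterministic_names).reserve(owner)  # steps 1-2
    new_leader = Scheduler(api, "b", deterministic_names)    # steps 3-4
    outcome = new_leader.reserve(owner)                      # step 5
    per_owner = list(api.reservations.values()).count(owner)
    return outcome, per_owner


# The naive scheduler reproduces the incident; the variant with
# deterministic names preserves the invariant (step 6).
assert replay(deterministic_names=False) == ("created", 2)
assert replay(deterministic_names=True) == ("reused", 1)
```

The assertion counts reservations per owner intent and does not inspect error messages or internal calls, so it keeps holding if the fix is later changed to a read-after-timeout or a confirm step.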

Operational Failure Modes

Only happy-path integration tests: controllers pass against clean API behavior but fail when writes commit after client timeouts. The fix is fault injection for timeouts, conflicts, partial commits, and retries.
No replay artifact: an incident is understood once but cannot be rerun. The fix is to persist seeds, event timelines, object snapshots, controller versions, and injected faults.
Simulation model is too polite: watches are always fresh and controllers never restart. The fix is to model stale caches, queue reorderings, leader loss, and delayed status.
Assertions check implementation details: tests break when code is refactored but miss safety regressions. The fix is invariant-based assertions around ownership, binding, status, and cleanup.
Chaos without diagnosis: staging failures create noise but no reproducible case. The fix is to connect chaos experiments to decision telemetry and replay inputs.
Replay hides external dependencies: capacity, quota, or admission inputs are not captured. The fix is to record the external state that changed the controller decision.

Connections

The previous lesson, 019.md, focused on reconstructing decision timelines. Those timelines become replay artifacts when the team captures enough inputs and nondeterminism.
The next lesson, 021.md, covers human overrides and runbooks. A runbook is safer when its failure modes have already been simulated and replayed.
distributed-testing-simulation-and-deterministic-replay goes deeper on model design, fault injection, and replay systems.

Resources

[DOC] Kubebuilder: Configuring EnvTest for Integration Tests
- Focus: See how Kubernetes controllers can be tested against API-server behavior without a full production cluster.
[DOC] Kubernetes API Concepts
- Focus: Study watches, resource versions, and consistency because replay needs to model the observations controllers actually saw.
[ARTICLE] FoundationDB: Testing
- Focus: Use deterministic simulation as a concrete example of finding distributed bugs through controlled fault exploration.
[PAPER] Lineage-Driven Fault Injection
- Focus: Connect fault injection to system outcomes instead of injecting random failures without a hypothesis.
[DOC] client-go workqueue
- Focus: Inspect rate-limited queues and retries as test targets for controller timing and backoff behavior.

Key Takeaways

Scheduler and control-plane bugs often live in interleavings between correct controllers, not inside one isolated function.
Simulations should model the state, timing, and failures that affect decisions, then preserve the seed or trace that exposed a bug.
Deterministic replay needs explicit inputs for time, randomness, API responses, watch events, queues, and external state.
A strong test strategy combines cheap local tests with invariants, simulation, replay, and carefully diagnosed staging failures.