Distributed Schedulers and Control Planes: Testing, Simulation, and Deterministic Replay
LESSON
The core idea: Control-plane tests need to explore timing, failure, and controller interleavings, and deterministic replay turns rare scheduler bugs from anecdotes into reproducible engineering work.
Core Insight
Imagine the team fixes the risk-api observability gap from the previous lesson. A later incident now has a clear timeline: the scheduler read a stale quota cache, the autoscaler added capacity, repair finished correctly, and the pending replica eventually bound. The team can explain what happened. The harder question is whether they can prove the same pattern will not break the next rollback, tenant, or region.
A normal unit test can check that a scoring function ranks nodes correctly. An integration test can check that a controller updates status after a binding. Those tests are useful, but many control-plane bugs live between correct pieces: a watch event arrives late, a leader lease expires mid-update, a retry observes a new generation, an admission plugin mutates labels, and a repair controller cleans up state at the same time.
The practical answer is not "test everything in production." It is to build a testing ladder that includes small deterministic tests, API-level integration tests, simulated worlds, fault injection, and replay of real decision timelines. The main trade-off is realism versus repeatability: a real cluster has realistic, messy timing, but its failures are hard to reproduce; a simulation is controllable, but it only finds bugs that its model is rich enough to express.
Why Unit Tests Miss Control-Plane Bugs
Control planes are asynchronous. A controller reads observed state, compares it with desired state, and writes an action or status. Another controller may be doing the same thing against overlapping state. The bug often depends on the order in which those reads and writes happen.
Consider a scheduler and quota controller:
1. scheduler reads quota: tenant risk has no zone-c capacity
2. quota controller frees capacity in zone-c
3. scheduler marks replica-042 unschedulable
4. autoscaler sees low readiness and adds desired replicas
5. scheduler cache observes the quota update
Each component may pass its own tests. The scheduler made a valid decision based on its cache. The quota controller freed capacity. The autoscaler responded to readiness. The combined behavior is the risk: extra desired capacity may be created because the system treated a briefly stale cache read as a durable shortage.
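The interleaving above can be sketched as a toy model in which the only thing that changes between runs is the order of the steps. This is a minimal sketch, not a real scheduler: `World`, `run`, and the step names are hypothetical, and the autoscaler rule is deliberately naive so that the amplification is visible.

```python
from dataclasses import dataclass, field


@dataclass
class World:
    """Toy shared state; every field name here is illustrative."""
    quota_free: dict = field(default_factory=lambda: {"zone-c": 0})
    desired_replicas: int = 3
    unschedulable: set = field(default_factory=set)


def run(order):
    """Apply the named steps in the given order and return the final world."""
    world = World()
    cache = {}  # the scheduler's private, possibly stale view of quota

    def scheduler_reads_quota():
        cache.update(world.quota_free)

    def quota_frees_capacity():
        world.quota_free["zone-c"] = 1

    def scheduler_decides():
        # Decides from the cache, not from the live quota state.
        if cache.get("zone-c", 0) == 0:
            world.unschedulable.add("replica-042")

    def autoscaler_reacts():
        # Naive rule: any unschedulable replica counts as a shortage.
        if world.unschedulable:
            world.desired_replicas += 1

    steps = {
        "read": scheduler_reads_quota,
        "free": quota_frees_capacity,
        "decide": scheduler_decides,
        "scale": autoscaler_reacts,
    }
    for name in order:
        steps[name]()
    return world


stale = run(["read", "free", "decide", "scale"])  # cache filled before the update
fresh = run(["free", "read", "decide", "scale"])  # cache filled after the update
print(stale.desired_replicas, fresh.desired_replicas)  # prints "4 3"
```

Every step function is individually reasonable; only the ordering differs between the two runs, which is the kind of difference a per-component unit test does not exercise.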
Useful tests need to cover both local behavior and cross-controller properties:
- a scheduler never binds one workload to two nodes
- a retry does not create duplicate reservations
- rollback does not delete the only healthy recovery capacity
- repair eventually removes orphaned state
- an autoscaler does not amplify transient scheduler lag without a bound
- status conditions refer to the generation they actually observed
Those are invariants. They describe what must remain true across many possible timings, not just what one function returns for one input.
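Two of the invariants listed above could be written as plain checker functions that a test or a simulator calls after every step. This is a sketch under assumed data shapes: the tuple and dict layouts, and the function names, are illustrative rather than taken from any real scheduler API.

```python
from collections import Counter


def check_single_binding(bindings):
    """bindings: iterable of (workload, node) pairs.

    Returns the workloads that are bound to more than one distinct node.
    """
    nodes_per_workload = {}
    for workload, node in bindings:
        nodes_per_workload.setdefault(workload, set()).add(node)
    return sorted(w for w, nodes in nodes_per_workload.items() if len(nodes) > 1)


def check_single_reservation(reservations):
    """reservations: iterable of dicts with workload, generation, active.

    Returns (workload, generation) keys that have more than one active
    reservation.
    """
    counts = Counter(
        (r["workload"], r["generation"]) for r in reservations if r["active"]
    )
    return sorted(key for key, n in counts.items() if n > 1)


bindings = [("replica-041", "node-a"), ("replica-042", "node-a"),
            ("replica-042", "node-b")]
reservations = [
    {"workload": "replica-042", "generation": 42, "active": True},
    {"workload": "replica-042", "generation": 42, "active": True},
    {"workload": "replica-042", "generation": 41, "active": False},
]
print(check_single_binding(bindings))          # ['replica-042']
print(check_single_reservation(reservations))  # [('replica-042', 42)]
```

Returning the violating keys instead of a boolean makes a failing run easier to diagnose, because the report names the workload that broke the property.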
Simulation as a Small World
A simulation is a small, controlled version of the control plane. It does not need to run every real binary or every cloud integration to be valuable. It needs to model the state and failures that matter to the decision under test.
For a scheduler control plane, the simulated world might include:
- workloads with desired generations and priorities
- nodes, zones, capacities, labels, taints, and health transitions
- quotas, reservations, and tenant limits
- watch streams with delay, disconnects, and stale caches
- controller queues, retries, backoff, leases, and deadlines
- admission mutations and policy revisions
- API writes that can commit, conflict, or time out
- repair and garbage-collection loops
The simulator can then generate many event orders:
seed: 82219
00:01 admit risk-api generation 42
00:02 delay quota watch by 8 seconds
00:03 schedule replica-042
00:04 timeout binding write after commit
00:05 restart scheduler leader
00:07 rollback policy to v4
00:08 repair scans reservations
The seed matters. If the simulator finds a duplicate reservation at seed 82219, the team should be able to rerun exactly that seed and inspect the same sequence. Random exploration without replay becomes a bug generator that cannot help engineers fix the bug.
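A minimal sketch of seed-driven event generation follows. The event vocabulary is borrowed from the trace above, while `EVENTS` and `generate_trace` are hypothetical names. The point it illustrates is narrow: all randomness flows through one private generator built from the seed, so the same seed yields the same sequence.

```python
import random

EVENTS = [
    "admit workload",
    "delay quota watch",
    "schedule replica",
    "timeout binding write after commit",
    "restart scheduler leader",
    "rollback policy",
    "repair scans reservations",
]


def generate_trace(seed, length=7):
    """Return a pseudo-random event order that depends only on the seed."""
    rng = random.Random(seed)  # private generator, never the global one
    return [(tick, rng.choice(EVENTS)) for tick in range(1, length + 1)]


first = generate_trace(82219)
again = generate_trace(82219)
assert first == again  # rerunning the seed reproduces the same sequence
for tick, event in first:
    print(f"00:{tick:02d} {event}")
```

Using the module-level `random` functions instead would let any other code that draws random numbers shift the sequence, which is one common way replay silently stops being deterministic.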
Deterministic Replay
Deterministic replay means recording enough inputs, timing decisions, and nondeterministic choices to run the same scenario again. The goal is not to replay wall-clock time perfectly. The goal is to make the same logical interleaving happen.
Useful replay inputs include:
- initial object state
- controller versions and feature flags
- policy and admission configuration
- watch events and resource versions
- queue order and retry delays
- injected faults, timeouts, conflicts, and restarts
- random seeds used for scheduling choices
- external decisions that affect state, such as capacity or quota updates
Replay forces a design discipline: controllers should take their nondeterminism through explicit seams. Time should come from a clock abstraction in tests. Random selection should use a seedable source. API responses should be modelable. Work queues should expose enough order to reproduce a failure.
This does not mean production code should become artificial. It means the parts that make decisions should be separable from the parts that talk to the real clock, network, and API server. That separation also improves debuggability because the decision inputs become visible.
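The seams described above might look like the following sketch. `FakeClock` and `LeaseHolder` are hypothetical; the assumption is that production code would pass in a clock backed by a real monotonic time source and an ordinary random generator, while tests pass in the controllable versions shown here.

```python
import random


class FakeClock:
    """Test clock that only moves when the test advances it."""

    def __init__(self, start=0.0):
        self._now = start

    def now(self):
        return self._now

    def advance(self, seconds):
        self._now += seconds


class LeaseHolder:
    """Hypothetical controller piece that takes time and randomness as inputs."""

    def __init__(self, clock, rng, lease_seconds=10.0):
        self.clock = clock
        self.rng = rng
        self.lease_seconds = lease_seconds
        self.renewed_at = clock.now()

    def is_leader(self):
        return self.clock.now() - self.renewed_at < self.lease_seconds

    def pick_node(self, candidates):
        # Sorting first makes the choice independent of input order;
        # the injected generator makes it reproducible from a seed.
        return self.rng.choice(sorted(candidates))


clock = FakeClock()
holder = LeaseHolder(clock, random.Random(7))
clock.advance(9.0)
still_leader = holder.is_leader()   # 9s elapsed, lease is 10s
clock.advance(2.0)
lost_lease = not holder.is_leader() # 11s elapsed, lease expired
print(still_leader, lost_lease)     # True True
```

A test can now place the lease expiry exactly between two steps of an update, which is hard to arrange with real sleeps.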
Testing Ladder
Different tests buy different confidence. A useful test strategy stacks them instead of expecting one layer to do all the work.
| Layer | What it catches | Example |
|---|---|---|
| Unit tests | pure policy, scoring, filtering, and status transitions | topology filter rejects nodes without required labels |
| API-level integration tests | controller behavior against real API semantics | reconcile updates status only for the observed generation |
| Property tests | invariants across many generated inputs | no workload has two active bindings |
| Simulation | timing, partitions, stale watches, retries, and controller restarts | rollback races with repair and quota update |
| Deterministic replay | reproduction of a real or simulated incident | seed 82219 recreates duplicate reservation |
| Staging or chaos tests | infrastructure interactions outside the model | API server latency and watch disconnects under load |
The trade-off is where the cost lands. Unit tests are cheap and precise but miss interactions. Staging tests are realistic but expensive and flaky. Simulation and replay sit in the middle: they are more work to build than unit tests, but they can explore interleavings that are almost impossible to trigger reliably by hand.
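As an illustration of the property-test layer in the table, the sketch below generates many random bind attempts and checks the "no workload has two active bindings" invariant after each run. It uses only the standard library instead of a property-testing framework, and `Binder`, `run_case`, and `find_violation` are hypothetical names for a toy model.

```python
import random


class Binder:
    """Toy binding store: a workload may be bound to at most one node."""

    def __init__(self):
        self.bound = {}

    def bind(self, workload, node):
        # Reject a second binding instead of overwriting or duplicating it.
        if workload in self.bound:
            return False
        self.bound[workload] = node
        return True


def run_case(seed, attempts=50):
    """Run one generated sequence of bind attempts; return accepted bindings."""
    rng = random.Random(seed)
    binder = Binder()
    accepted = []
    for _ in range(attempts):
        workload = f"replica-{rng.randrange(5)}"
        node = f"node-{rng.randrange(3)}"
        if binder.bind(workload, node):
            accepted.append((workload, node))
    return accepted


def find_violation(seeds):
    """Return the first seed whose run binds a workload twice, else None."""
    for seed in seeds:
        workloads = [w for w, _ in run_case(seed)]
        if len(workloads) != len(set(workloads)):
            return seed
    return None


print(find_violation(range(200)))  # None
```

When a violation is found, the function reports the seed and not just a failure, so the failing case can be rerun directly.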
Worked Example: Replaying a Duplicate Reservation
Suppose an incident created two reservations for one risk-api recovery replica. The observability layer captured this decision timeline:
workload: risk-api replica-042 generation 42
00:00 scheduler creates reservation res-a
00:01 API write commits but client sees timeout
00:02 scheduler leader restarts
00:03 new leader reads stale reservation cache
00:04 new leader retries and creates reservation res-b
00:06 repair sees two reservations with same owner intent
The first fix might be "read after timeout." That is a reasonable patch, but a good test encodes the invariant:
For any workload identity and generation,
there must be at most one active reservation for the same scheduling intent.
The replay should force the same failure path:
1. Start with replica-042 pending.
2. Commit reservation create, but return timeout to the scheduler.
3. Restart the leader before status is updated.
4. Delay the reservation watch for the new leader.
5. Retry scheduling.
6. Assert that the controller reuses, confirms, or conflicts with res-a instead of creating res-b.
This test is better than checking one specific error message. It names the system property the control plane must preserve across timeout, restart, cache lag, and retry. Once the property is encoded, future scheduler changes can be tested against the same failure shape.
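The replay steps above can be sketched against an in-memory fake. Everything here is a hypothetical model: `FakeAPI`, `reserve`, and the `read_through` flag stand in for the real API server, the scheduler's reservation path, and the "read after timeout" fix. They are not the actual scheduler code.

```python
class FakeAPI:
    """In-memory reservation store with one injectable fault."""

    def __init__(self):
        self.reservations = {}  # name -> (workload, generation)
        self.timeout_after_commit = False

    def create(self, name, workload, generation):
        self.reservations[name] = (workload, generation)  # the write commits
        if self.timeout_after_commit:
            self.timeout_after_commit = False
            raise TimeoutError("client saw timeout; write committed")

    def list_for(self, workload, generation):
        return sorted(n for n, owner in self.reservations.items()
                      if owner == (workload, generation))


def reserve(api, cache, name, workload, generation, read_through):
    """One scheduling attempt. read_through=False models the buggy path."""
    existing = cache.get((workload, generation), [])
    if not existing and read_through:
        # Empty or stale cache: confirm against the API before creating.
        existing = api.list_for(workload, generation)
    if existing:
        return existing[0]
    api.create(name, workload, generation)
    return name


def replay(read_through):
    api = FakeAPI()
    api.timeout_after_commit = True  # step 2: commit, but report a timeout
    try:
        reserve(api, {}, "res-a", "replica-042", 42, read_through)
    except TimeoutError:
        pass  # step 3: leader restarts before any status is recorded
    stale_cache = {}  # step 4: the new leader's watch has not delivered res-a
    reserve(api, stale_cache, "res-b", "replica-042", 42, read_through)  # step 5
    return api.list_for("replica-042", 42)


print(replay(read_through=False))  # ['res-a', 'res-b']
print(replay(read_through=True))   # ['res-a']
# Step 6: the invariant, at most one active reservation per intent.
assert len(replay(read_through=True)) <= 1
```

The assertion is on the number of reservations for the intent, not on which code path ran, so a different fix (for example, a create that conflicts on a deterministic name) would satisfy the same test.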
Operational Failure Modes
- Only happy-path integration tests: controllers pass against clean API behavior but fail when writes commit after client timeouts. The fix is fault injection for timeouts, conflicts, partial commits, and retries.
- No replay artifact: an incident is understood once but cannot be rerun. The fix is to persist seeds, event timelines, object snapshots, controller versions, and injected faults.
- Simulation model is too polite: watches are always fresh and controllers never restart. The fix is to model stale caches, queue reorderings, leader loss, and delayed status.
- Assertions check implementation details: tests break when code is refactored but miss safety regressions. The fix is invariant-based assertions around ownership, binding, status, and cleanup.
- Chaos without diagnosis: staging failures create noise but no reproducible case. The fix is to connect chaos experiments to decision telemetry and replay inputs.
- Replay hides external dependencies: capacity, quota, or admission inputs are not captured. The fix is to record the external state that changed the controller decision.
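One way to address the "no replay artifact" failure mode is to persist a small, self-describing record per incident or simulation run. The sketch below assumes a JSON file is an acceptable storage format; the `ReplayArtifact` fields mirror the items listed above, and the field names and the version string are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ReplayArtifact:
    """Inputs needed to rerun one scenario; JSON-native values only."""
    seed: int
    controller_versions: dict
    initial_objects: list
    events: list = field(default_factory=list)
    injected_faults: list = field(default_factory=list)

    def dumps(self):
        # sort_keys keeps the output stable so artifacts diff cleanly.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def loads(cls, text):
        return cls(**json.loads(text))


artifact = ReplayArtifact(
    seed=82219,
    controller_versions={"scheduler": "hypothetical-1.0"},
    initial_objects=[{"kind": "Replica", "name": "replica-042",
                      "generation": 42}],
    events=[{"t": 0, "event": "scheduler creates reservation res-a"}],
    injected_faults=[{"t": 1, "fault": "timeout after commit"}],
)
restored = ReplayArtifact.loads(artifact.dumps())
assert restored == artifact  # the round trip loses nothing
```

External inputs such as quota or capacity updates would be recorded as ordinary entries in `events`, so the replay does not depend on state that lives outside the artifact.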
Connections
- The previous lesson, 019.md, focused on reconstructing decision timelines. Those timelines become replay artifacts when the team captures enough inputs and nondeterminism.
- The next lesson, 021.md, covers human overrides and runbooks. A runbook is safer when its failure modes have already been simulated and replayed.
- distributed-testing-simulation-and-deterministic-replay goes deeper on model design, fault injection, and replay systems.
Resources
- [DOC] Kubebuilder: Configuring EnvTest for Integration Tests
  - Focus: See how Kubernetes controllers can be tested against API-server behavior without a full production cluster.
- [DOC] Kubernetes API Concepts
  - Focus: Study watches, resource versions, and consistency because replay needs to model the observations controllers actually saw.
- [ARTICLE] FoundationDB: Testing
  - Focus: Use deterministic simulation as a concrete example of finding distributed bugs through controlled fault exploration.
- [PAPER] Lineage-Driven Fault Injection
  - Focus: Connect fault injection to system outcomes instead of injecting random failures without a hypothesis.
- [DOC] client-go workqueue
  - Focus: Inspect rate-limited queues and retries as test targets for controller timing and backoff behavior.
Key Takeaways
- Scheduler and control-plane bugs often live in interleavings between correct controllers, not inside one isolated function.
- Simulations should model the state, timing, and failures that affect decisions, then preserve the seed or trace that exposed a bug.
- Deterministic replay needs explicit inputs for time, randomness, API responses, watch events, queues, and external state.
- A strong test strategy combines cheap local tests with invariants, simulation, replay, and carefully diagnosed staging failures.