Distributed Schedulers and Control Planes: Testing, Simulation, and Deterministic Replay

020 35 min advanced

The core idea: Control-plane tests need to explore timing, failure, and controller interleavings, and deterministic replay turns rare scheduler bugs from anecdotes into reproducible engineering work.

Core Insight

Imagine the team fixes the risk-api observability gap from the previous lesson. A later incident now has a clear timeline: the scheduler read a stale quota cache, the autoscaler added capacity, repair finished correctly, and the pending replica eventually bound. The team can explain what happened. The harder question is whether they can prove the same pattern will not break the next rollback, tenant, or region.

A normal unit test can check that a scoring function ranks nodes correctly. An integration test can check that a controller updates status after a binding. Those tests are useful, but many control-plane bugs live between correct pieces: a watch event arrives late, a leader lease expires mid-update, a retry observes a new generation, an admission plugin mutates labels, and a repair controller cleans up state at the same time.

The practical answer is not "test everything in production." It is to build a testing ladder that includes small deterministic tests, API-level integration tests, simulated worlds, fault injection, and replay of real decision timelines. The main trade-off is realism versus repeatability: a real cluster produces realistic, messy timing, but its failures are hard to reproduce; a simulation is controllable, but it only finds bugs that the model is rich enough to express.

Why Unit Tests Miss Control-Plane Bugs

Control planes are asynchronous. A controller reads observed state, compares it with desired state, and writes an action or status. Another controller may be doing the same thing against overlapping state. The bug often depends on the order in which those reads and writes happen.

Consider a scheduler and quota controller:

1. scheduler reads quota: tenant risk has no zone-c capacity
2. quota controller frees capacity in zone-c
3. scheduler marks replica-042 unschedulable
4. autoscaler sees low readiness and adds desired replicas
5. scheduler cache observes the quota update

Each component may pass its own tests. The scheduler made a valid decision based on its cache. The quota controller freed capacity. The autoscaler responded to readiness. The risk lies in the combined behavior: extra desired capacity may be created because the system treated a temporarily stale view of quota as a durable shortage.

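Useful tests need to cover both local behavior and cross-controller properties, for example that no workload ever has two active bindings, or that a temporarily stale read does not turn into durable extra capacity.

Properties like these are invariants. They describe what must remain true across many possible timings, not just what one function returns for one input.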
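
An invariant of this kind can be written as a predicate over a snapshot of cluster state, so that any test layer can call it after every step. The sketch below is a minimal illustration for one property, that no workload has two active bindings. The `Binding` type, its fields, and the function name are assumptions made for this example, not part of any real scheduler API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    """Hypothetical record of a workload bound to a node."""
    workload: str
    node: str
    active: bool = True


def double_bound_workloads(bindings):
    """Return the workloads that violate 'at most one active binding'.

    An empty result means the invariant holds for this snapshot.
    """
    counts = Counter(b.workload for b in bindings if b.active)
    return sorted(w for w, n in counts.items() if n > 1)
```

Returning the violating workloads, rather than a bare boolean, gives a failing test something concrete to print.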

Simulation as a Small World

A simulation is a small, controlled version of the control plane. It does not need to run every real binary or every cloud integration to be valuable. It needs to model the state and failures that matter to the decision under test.

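For a scheduler control plane, the simulated world might include workloads and their generations, a quota watch that can be delayed, API writes that can time out after they commit, a scheduler leader that can restart, policy versions that can be rolled back, and a repair controller that scans reservations.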

The simulator can then generate many event orders:

seed: 82219
00:01 admit risk-api generation 42
00:02 delay quota watch by 8 seconds
00:03 schedule replica-042
00:04 timeout binding write after commit
00:05 restart scheduler leader
00:07 rollback policy to v4
00:08 repair scans reservations

The seed matters. If the simulator finds a duplicate reservation at seed 82219, the team should be able to rerun exactly that seed and inspect the same sequence. Random exploration without replay becomes a bug generator that cannot help engineers fix the bug.
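
The role of the seed can be shown with a very small generator. This is a sketch under stated assumptions: the `FAULTS` list and `generate_schedule` function are hypothetical names, and the fault labels are borrowed from the trace above. The point is only that a private, seeded random source makes the same seed yield the same event order on every run.

```python
import random

# Hypothetical fault vocabulary, taken from the example trace.
FAULTS = [
    "delay quota watch",
    "timeout binding write after commit",
    "restart scheduler leader",
    "rollback policy",
    "repair scans reservations",
]


def generate_schedule(seed, steps=5):
    """Build a list of (logical_time, fault) pairs from a seed.

    A dedicated random.Random instance is used instead of the module-level
    functions, so unrelated code that draws random numbers cannot perturb
    the sequence.
    """
    rng = random.Random(seed)
    schedule = []
    t = 0
    for _ in range(steps):
        t += rng.randint(1, 3)  # logical ticks, not wall-clock seconds
        schedule.append((t, rng.choice(FAULTS)))
    return schedule
```

A failing run then only needs to log its seed; calling `generate_schedule` with that seed rebuilds the same schedule for inspection.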

Deterministic Replay

Deterministic replay means recording enough inputs, timing decisions, and nondeterministic choices to run the same scenario again. The goal is not to replay wall-clock time perfectly. The goal is to make the same logical interleaving happen.

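Useful replay inputs include the random seed, the ordered event timeline, the clock readings each controller observed, the API responses it received (including timeouts and errors), and the order in which work-queue items were processed.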

Replay forces a design discipline: controllers should take their nondeterminism through explicit seams. Time should come from a clock abstraction in tests. Random selection should use a seedable source. API responses should be modelable. Work queues should expose enough order to reproduce a failure.

This does not mean production code should become artificial. It means the parts that make decisions should be separable from the parts that talk to the real clock, network, and API server. That separation also improves debuggability because the decision inputs become visible.
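
One such seam is the clock. The following sketch assumes a hypothetical `Lease` class whose expiry logic reads time only through an injected clock object; `SystemClock`, `FakeClock`, and `Lease` are illustrative names, not taken from a real library. Production code would pass the system clock, while a test or replay passes a fake clock it can advance by hand.

```python
import time


class SystemClock:
    """Real clock for production use."""

    def now(self):
        return time.monotonic()


class FakeClock:
    """Test clock: time moves only when the test says so."""

    def __init__(self, start=0.0):
        self._now = start

    def now(self):
        return self._now

    def advance(self, seconds):
        self._now += seconds


class Lease:
    """Hypothetical leader lease that expires ttl seconds after renewal."""

    def __init__(self, clock, ttl):
        self._clock = clock
        self._ttl = ttl
        self._renewed_at = clock.now()

    def renew(self):
        self._renewed_at = self._clock.now()

    def expired(self):
        return self._clock.now() - self._renewed_at >= self._ttl
```

With `FakeClock`, a test can place a lease expiry exactly between two steps of an update, which is the kind of interleaving that is hard to hit with real sleeps.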

Testing Ladder

Different tests buy different confidence. A useful test strategy stacks them instead of expecting one layer to do all the work.

Layer | What it catches | Example
Unit tests | pure policy, scoring, filtering, and status transitions | topology filter rejects nodes without required labels
API-level integration tests | controller behavior against real API semantics | reconcile updates status only for the observed generation
Property tests | invariants across many generated inputs | no workload has two active bindings
Simulation | timing, partitions, stale watches, retries, and controller restarts | rollback races with repair and quota update
Deterministic replay | reproduction of a real or simulated incident | seed 82219 recreates duplicate reservation
Staging or chaos tests | infrastructure interactions outside the model | API server latency and watch disconnects under load

The trade-off is cost placement. Unit tests are cheap and precise but miss interactions. Staging tests are realistic but expensive and flaky. Simulation and replay sit in the middle: they are more work to build than unit tests, but they can explore interleavings that are almost impossible to trigger reliably by hand.

Worked Example: Replaying a Duplicate Reservation

Suppose an incident created two reservations for one risk-api recovery replica. The observability layer captured this decision timeline:

workload: risk-api replica-042 generation 42
00:00 scheduler creates reservation res-a
00:01 API write commits but client sees timeout
00:02 scheduler leader restarts
00:03 new leader reads stale reservation cache
00:04 new leader retries and creates reservation res-b
00:06 repair sees two reservations with same owner intent

The first fix might be "read after timeout." That is a reasonable patch, but a good test encodes the invariant:

For any workload identity and generation,
there must be at most one active reservation for the same scheduling intent.

The replay should force the same failure path:

1. Start with replica-042 pending.
2. Commit reservation create, but return timeout to the scheduler.
3. Restart the leader before status is updated.
4. Delay the reservation watch for the new leader.
5. Retry scheduling.
6. Assert that the controller reuses, confirms, or conflicts with res-a instead of creating res-b.

This test is better than checking one specific error message. It names the system property the control plane must preserve across timeout, restart, cache lag, and retry. Once the property is encoded, future scheduler changes can be tested against the same failure shape.
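
The replay steps above can be sketched as a small in-memory test. Everything here is a simplified assumption for illustration: `FakeStore` stands in for the API server, `Scheduler` is a toy controller, and the `confirm_before_create` flag switches between the buggy behavior (trust the cache) and one possible fix (check the authoritative store before creating). It is not the actual scheduler's code.

```python
from dataclasses import dataclass, field


class WriteTimeout(Exception):
    """The client saw a timeout; the write may still have committed."""


@dataclass
class FakeStore:
    """Hypothetical in-memory stand-in for the API server."""
    reservations: list = field(default_factory=list)  # (name, intent) pairs
    fail_after_commit: bool = False

    def create(self, name, intent):
        self.reservations.append((name, intent))  # the write commits...
        if self.fail_after_commit:
            self.fail_after_commit = False
            raise WriteTimeout(name)               # ...but the client sees a timeout

    def list_for(self, intent):
        return [n for n, i in self.reservations if i == intent]


class Scheduler:
    def __init__(self, store, cache, res_name, confirm_before_create):
        self.store = store
        self.cache = cache                  # possibly stale set of reserved intents
        self.res_name = res_name
        self.confirm = confirm_before_create

    def schedule(self, intent):
        if intent in self.cache:
            return
        if self.confirm and self.store.list_for(intent):
            return                          # reuse the existing reservation
        self.store.create(self.res_name, intent)
        self.cache.add(intent)


def replay(confirm_before_create):
    store = FakeStore()
    intent = ("risk-api/replica-042", 42)   # step 1: replica pending
    store.fail_after_commit = True          # step 2: commit, then timeout
    old_leader = Scheduler(store, set(), "res-a", confirm_before_create)
    try:
        old_leader.schedule(intent)
    except WriteTimeout:
        pass                                # step 3: leader restarts, no status update
    stale_cache = set()                     # step 4: reservation watch is delayed
    new_leader = Scheduler(store, stale_cache, "res-b", confirm_before_create)
    new_leader.schedule(intent)             # step 5: retry scheduling
    return store.list_for(intent)           # step 6: caller checks the invariant


# The buggy path violates the invariant; the confirming path preserves it.
assert len(replay(confirm_before_create=False)) == 2
assert len(replay(confirm_before_create=True)) == 1
```

The assertion counts active reservations per intent rather than inspecting error text, so it keeps its meaning if the fix changes. A read followed by a create can still race against another writer in a real system, so a conditional or idempotent create keyed on the intent would be a stronger fix; the replay test stays the same either way.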

