Distributed Schedulers and Control Planes: Scheduling Architecture, Filtering, Scoring, and Binding

006 35 min advanced

The core idea: A scheduler separates impossible placements from preferred placements and commits only through an authoritative binding step, so the design trade-off is between fast local decisions and safe placement under stale cluster state.

Core Insight

Suppose fraud-batch is waiting for one GPU worker in eu-central. It needs an exclusive GPU, cannot exceed tenant quota, prefers nodes close to its input data, and should avoid a rack that is already under network pressure. The scheduler has a queue entry for the workload and a cached view of nodes, but that cache is not the source of truth. Another workload may be binding to the same GPU while fraud-batch is being evaluated.

That is why production schedulers split the path into stages. Filtering removes nodes that cannot safely run the workload. Scoring ranks the remaining nodes by preferences and cost. Binding commits the selected placement through the authority that owns workload-to-node assignment. The local scheduler can be clever, but the final write must still be safe.

The common misconception is that scheduling is mostly about picking the "best" node. In distributed control planes, the harder question is when the scheduler is allowed to believe its own choice. A good architecture lets the scheduler use fast local state for exploration while making the binding step authoritative enough to reject stale, conflicting, or policy-breaking decisions.

The Scheduler Pipeline

A simple scheduling path looks like this:

pending queue -> snapshot/cache -> filter -> score -> reserve/assume -> bind -> observe

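Each stage answers a different question. The pending queue decides which workload is considered next. The snapshot/cache supplies what the scheduler currently believes about the cluster. Filter asks which nodes can run the workload at all. Score asks which feasible node is preferred. Reserve/assume holds the chosen capacity locally while the decision is in flight. Bind asks whether the authority accepts the placement. Observe asks whether the workload actually started.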

This shape keeps safety and liveness in view. Filters protect hard invariants. Scoring improves utilization and latency without pretending to be a proof of correctness. Binding is the point where desired placement becomes committed state. Observation closes the loop because a bound workload may still fail to start.

Filtering: Removing Unsafe Candidates

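Filters should be boring and strict. They answer yes-or-no questions about feasibility: is the node in the required region, does it have a free GPU of the right type, is it cordoned, and is its heartbeat recent enough to trust?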

For fraud-batch, node-gpu-7 may pass because it is in eu-central, has the right GPU type, and is not cordoned. node-gpu-9 may fail because it is in us-east. node-gpu-11 may fail because its last heartbeat is too old. A filter failure should usually record a reason that operators and later controllers can inspect.

Filtering has a hidden cost: the scheduler can only filter on evidence it has. If node health is stale, a strict filter may reject a node that has recovered. If quota data lags, a loose filter may pass a placement that binding later rejects. This is why filters should distinguish local feasibility checks from authoritative commitment checks. Local filtering narrows the search; it does not replace the final authority.
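A minimal sketch of reason-recording filters, using the region, GPU, cordon, and heartbeat checks described above. The `Workload` and `Node` types, the filter names, and the 30-second heartbeat threshold are assumptions made for illustration, not part of any particular scheduler.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Workload:
    name: str
    region: str
    gpus: int


@dataclass(frozen=True)
class Node:
    name: str
    region: str
    free_gpus: int
    cordoned: bool
    heartbeat_age_s: float


# Hypothetical threshold; a real system would tune this per environment.
MAX_HEARTBEAT_AGE_S = 30.0

# A filter returns None on pass, or a human-readable rejection reason.
Filter = Callable[[Workload, Node], Optional[str]]


def region_filter(w: Workload, n: Node) -> Optional[str]:
    return None if n.region == w.region else f"region {n.region} is not {w.region}"


def gpu_filter(w: Workload, n: Node) -> Optional[str]:
    return None if n.free_gpus >= w.gpus else "not enough free GPUs"


def cordon_filter(w: Workload, n: Node) -> Optional[str]:
    return "node is cordoned" if n.cordoned else None


def heartbeat_filter(w: Workload, n: Node) -> Optional[str]:
    if n.heartbeat_age_s > MAX_HEARTBEAT_AGE_S:
        return f"heartbeat is {n.heartbeat_age_s:.0f}s old"
    return None


FILTERS: List[Filter] = [region_filter, gpu_filter, cordon_filter, heartbeat_filter]


def run_filters(w: Workload, nodes: List[Node]) -> Tuple[List[Node], Dict[str, str]]:
    """Split nodes into feasible candidates and rejections with reasons."""
    feasible: List[Node] = []
    rejected: Dict[str, str] = {}
    for node in nodes:
        for check in FILTERS:
            reason = check(w, node)
            if reason is not None:
                rejected[node.name] = reason  # the first failing filter wins
                break
        else:
            feasible.append(node)
    return feasible, rejected
```

These checks run against the scheduler's cached view only. A node that passes here can still be rejected at bind time, which is the distinction between local feasibility and authoritative commitment.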

Scoring: Ranking Feasible Candidates

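Scoring handles preferences after the hard constraints are satisfied. It might prefer nodes close to the workload's input data, racks that do not already hold a replica, nodes under low network pressure, and placements that avoid fragmenting scarce GPU capacity.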

Scores are useful because many feasible placements are not equally good. Running fraud-batch next to its input data may reduce job time. Spreading replicas across racks may reduce blast radius. Avoiding fragmentation may keep the cluster useful for the next large GPU request.

The risk is false precision. A score is not an invariant; it is an opinion based on a model. A scheduler that treats score differences as exact truth can thrash workloads, overfit stale metrics, or spend too much time chasing tiny improvements. Mature schedulers often keep scoring modular, weighted, and observable so operators can tell whether a placement was driven by locality, fairness, fragmentation, or policy.
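One way to keep scoring modular, weighted, and observable is to return a per-scorer breakdown alongside the total. The sketch below assumes hypothetical node fields (`data_near`, `network_pressure`, `rack_has_replica`) and illustrative weights; none of these names or numbers come from a real scheduler.

```python
from typing import Callable, Dict, List, Tuple

NodeInfo = Dict[str, object]
Scorer = Callable[[NodeInfo], float]  # each scorer returns a value in [0, 1]


def locality(node: NodeInfo) -> float:
    return 1.0 if node["data_near"] else 0.0


def low_network_pressure(node: NodeInfo) -> float:
    # network_pressure is assumed to be normalized to [0, 1]
    return 1.0 - float(node["network_pressure"])


def rack_spread(node: NodeInfo) -> float:
    return 0.0 if node["rack_has_replica"] else 1.0


# (name, weight, scorer). The weights are illustrative assumptions.
WEIGHTED_SCORERS: List[Tuple[str, float, Scorer]] = [
    ("locality", 3.0, locality),
    ("network", 2.0, low_network_pressure),
    ("spread", 1.0, rack_spread),
]


def score_node(node: NodeInfo) -> Tuple[float, Dict[str, float]]:
    """Return the total score and the weighted contribution of each scorer."""
    breakdown = {name: weight * fn(node) for name, weight, fn in WEIGHTED_SCORERS}
    return sum(breakdown.values()), breakdown


def rank(nodes: List[NodeInfo]) -> List[Tuple[str, float, Dict[str, float]]]:
    """Order nodes by descending total score, breaking ties by name."""
    scored = [(str(n["name"]), *score_node(n)) for n in nodes]
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

Logging the breakdown with each decision lets an operator see that a placement was driven by locality rather than, say, rack spread, without re-running the scheduler.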

Binding: Turning A Choice Into Authority

Binding is the commit step. The scheduler chooses a candidate, but the control plane must still make the placement durable:

1. Scheduler selects node-gpu-7 from a local snapshot.
2. Scheduler records an assumed placement in its local cache.
3. Scheduler sends a binding request to the authoritative API.
4. API accepts only if workload version, node state, quota, and ownership still allow it.
5. Node agent observes the binding and attempts to start the workload.
6. Status updates confirm running, failed, or unknown.

The local assumed placement prevents the scheduler from immediately placing another workload onto the same apparent capacity while the bind is pending. The authoritative binding prevents the scheduler from committing a stale or conflicting choice. Both are needed. Local assumption is fast but provisional; binding is slower but authoritative.
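The interplay between the provisional assumption and the authoritative bind can be sketched as follows. `Authority`, `SchedulerCache`, and `place` are hypothetical names, and the in-memory authority stands in for whatever API owns workload-to-node assignment in a real control plane.

```python
from typing import Dict, Iterable, Set


class BindConflict(Exception):
    """Raised by the authority when a node is already taken."""


class Authority:
    """Stand-in for the authoritative API that owns node assignment."""

    def __init__(self) -> None:
        self._bound: Dict[str, str] = {}  # node -> workload

    def bind(self, workload: str, node: str) -> None:
        if node in self._bound:
            raise BindConflict(f"{node} already bound to {self._bound[node]}")
        self._bound[node] = workload


class SchedulerCache:
    """Fast local view; it may be stale relative to the authority."""

    def __init__(self, free_nodes: Iterable[str]) -> None:
        self.free: Set[str] = set(free_nodes)
        self.assumed: Dict[str, str] = {}  # workload -> node

    def assume(self, workload: str, node: str) -> None:
        # Hide the capacity from later cycles while the bind is pending.
        self.free.discard(node)
        self.assumed[workload] = node

    def confirm(self, workload: str) -> None:
        self.assumed.pop(workload)

    def forget(self, workload: str) -> None:
        # Clear the assumption. In this sketch the node returns to the free
        # set; a later cache refresh would reveal who actually holds it.
        self.free.add(self.assumed.pop(workload))


def place(cache: SchedulerCache, authority: Authority, workload: str, node: str) -> str:
    cache.assume(workload, node)
    try:
        authority.bind(workload, node)
    except BindConflict as exc:
        cache.forget(workload)
        return f"requeue: {exc}"
    cache.confirm(workload)
    return "bound"
```

Two schedulers that start from the same snapshot can both pick the same node; the second `place` call is rejected by the authority and requeued rather than double-booking the GPU.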

The lease and fencing model from the previous lesson applies here. If scheduler-shard-12 is currently owned by ctrl-b at epoch 1843, the binding write should include that ownership evidence. A stale scheduler replica should not be able to bind fraud-batch just because it still has a promising cache entry.
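A minimal sketch of a fenced bind, assuming the authority keeps the current lease per scheduler shard and rejects any request whose owner or epoch does not match. The `Lease`, `BindRequest`, and `FencedAuthority` types are hypothetical, as is the stale replica `ctrl-a` at epoch 1842 used in the usage lines.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Lease:
    owner: str
    epoch: int


@dataclass(frozen=True)
class BindRequest:
    workload: str
    node: str
    shard: str
    owner: str  # who the sender believes owns the shard
    epoch: int  # the epoch the sender believes is current


class FencedAuthority:
    """Accepts a bind only when it carries the shard's current lease."""

    def __init__(self, leases: Dict[str, Lease]) -> None:
        self.leases = dict(leases)
        self.bindings: Dict[str, str] = {}  # workload -> node

    def bind(self, req: BindRequest) -> Tuple[bool, str]:
        lease = self.leases.get(req.shard)
        if lease is None:
            return False, "unknown shard"
        if (req.owner, req.epoch) != (lease.owner, lease.epoch):
            return False, f"fenced: current lease is {lease.owner}@{lease.epoch}"
        self.bindings[req.workload] = req.node
        return True, "bound"


authority = FencedAuthority({"scheduler-shard-12": Lease("ctrl-b", 1843)})
stale = BindRequest("fraud-batch", "node-gpu-7", "scheduler-shard-12", "ctrl-a", 1842)
fresh = BindRequest("fraud-batch", "node-gpu-7", "scheduler-shard-12", "ctrl-b", 1843)
stale_result = authority.bind(stale)   # rejected: wrong owner and epoch
fresh_result = authority.bind(fresh)   # accepted: matches the current lease
```

The check lives in the authority, not in the scheduler, because a stale replica by definition does not know it is stale.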

Worked Example: One Placement Attempt

Consider one scheduling cycle for fraud-batch:

workload: fraud-batch
requirements:
  region: eu-central
  gpu: 1 exclusive
  tenant: fraud
preferences:
  close_to: fraud-inputs-eu
  spread_across: rack
  avoid: nodes with high network pressure

The scheduler reads a snapshot and finds four candidate nodes:

node-gpu-7   eu-central  gpu-free  data-near  rack-a  healthy
node-gpu-9   us-east     gpu-free  data-far   rack-c  healthy
node-gpu-11  eu-central  gpu-free  data-near  rack-b  stale heartbeat
node-gpu-13  eu-central  gpu-free  data-far   rack-d  healthy

Filtering removes node-gpu-9 because of region and node-gpu-11 because its heartbeat is too stale for a new placement. Scoring ranks node-gpu-7 over node-gpu-13 because it is closer to the input data. The scheduler assumes node-gpu-7 locally and sends a binding request.

If the binding succeeds, the node agent becomes responsible for local execution and status. If the binding fails because another workload already took the GPU, the scheduler clears the assumption, records the rejection reason, and requeues fraud-batch. That failure is not a bug by itself. It is the expected safety boundary between fast scheduling from cache and authoritative commitment.
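The cycle above can be traced in a few lines. This is a simplified sketch: the dictionary fields mirror the snapshot table, the score uses data locality only, and `bound_gpus` is a hypothetical stand-in for the authoritative record of which nodes already have their GPU taken.

```python
from typing import Dict, List, Optional, Set

NODES: List[Dict[str, object]] = [
    {"name": "node-gpu-7", "region": "eu-central", "gpu_free": True, "data_near": True, "healthy": True},
    {"name": "node-gpu-9", "region": "us-east", "gpu_free": True, "data_near": False, "healthy": True},
    {"name": "node-gpu-11", "region": "eu-central", "gpu_free": True, "data_near": True, "healthy": False},
    {"name": "node-gpu-13", "region": "eu-central", "gpu_free": True, "data_near": False, "healthy": True},
]
WORKLOAD = {"name": "fraud-batch", "region": "eu-central"}


def infeasible_reason(w: Dict[str, str], n: Dict[str, object]) -> Optional[str]:
    if n["region"] != w["region"]:
        return "wrong region"
    if not n["healthy"]:
        return "stale heartbeat"
    if not n["gpu_free"]:
        return "no free GPU"
    return None


def score(n: Dict[str, object]) -> float:
    # Locality only, to keep the trace short.
    return 1.0 if n["data_near"] else 0.0


def schedule_once(w: Dict[str, str], nodes: List[Dict[str, object]], bound_gpus: Set[str]) -> Dict[str, object]:
    """Run filter, score, and bind once against an authoritative set."""
    rejected: Dict[str, str] = {}
    candidates = []
    for n in nodes:
        reason = infeasible_reason(w, n)
        if reason is None:
            candidates.append(n)
        else:
            rejected[str(n["name"])] = reason
    if not candidates:
        return {"result": "unschedulable", "node": None, "rejected": rejected}
    best = str(max(candidates, key=score)["name"])
    if best in bound_gpus:
        # The cache said the GPU was free; the authority disagrees.
        return {"result": "requeue", "node": best, "rejected": rejected}
    bound_gpus.add(best)
    return {"result": "bound", "node": best, "rejected": rejected}
```

Calling `schedule_once(WORKLOAD, NODES, set())` binds to node-gpu-7, while calling it with `{"node-gpu-7"}` already in the authoritative set yields a requeue for the same choice, which is the safety boundary the paragraph above describes.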

Operational Failure Modes

Connections

Resources

Key Takeaways
