Distributed Schedulers and Control Planes: Scheduling Architecture, Filtering, Scoring, and Binding
The core idea: A scheduler separates impossible placements from preferred placements and commits only through an authoritative binding step, so the design trade-off is between fast local decisions and safe placement under stale cluster state.
Core Insight
Suppose fraud-batch is waiting for one GPU worker in eu-central. It needs an exclusive GPU, cannot exceed tenant quota, prefers nodes close to its input data, and should avoid a rack that is already under network pressure. The scheduler has a queue entry for the workload and a cached view of nodes, but that cache is not the source of truth. Another workload may be binding to the same GPU while fraud-batch is being evaluated.
That is why production schedulers split the path into stages. Filtering removes nodes that cannot safely run the workload. Scoring ranks the remaining nodes by preferences and cost. Binding commits the selected placement through the authority that owns workload-to-node assignment. The local scheduler can be clever, but the final write must still be safe.
The common misconception is that scheduling is mostly about picking the "best" node. In distributed control planes, the harder question is when the scheduler is allowed to believe its own choice. A good architecture lets the scheduler use fast local state for exploration while making the binding step authoritative enough to reject stale, conflicting, or policy-breaking decisions.
The Scheduler Pipeline
A simple scheduling path looks like this:
pending queue -> snapshot/cache -> filter -> score -> reserve/assume -> bind -> observe
Each stage answers a different question:
- Pending queue: which workload should be considered next?
- Snapshot/cache: what does the scheduler currently believe about nodes, quotas, topology, and existing bindings?
- Filter: which nodes are impossible because they violate hard requirements?
- Score: among feasible nodes, which placement is preferable?
- Reserve/assume: how does the scheduler avoid racing itself while the binding is in flight?
- Bind: can the authority commit this placement against current state?
- Observe: did the node agent accept and run the work?
This shape keeps safety and liveness in view. Filters protect hard invariants. Scoring improves utilization and latency without pretending to be a proof of correctness. Binding is the point where desired placement becomes committed state. Observation closes the loop because a bound workload may still fail to start.
Filtering: Removing Unsafe Candidates
Filters should be boring and strict. They answer yes-or-no questions about feasibility:
- Does the node have the required resource type, such as GPU?
- Is the node in an allowed region, zone, or tenancy boundary?
- Would the workload exceed quota or exclusive resource ownership?
- Is the node healthy enough to accept new work?
- Does the workload require a runtime, accelerator, label, or policy that the node lacks?
For fraud-batch, node-gpu-7 may pass because it is in eu-central, has the right GPU type, and is not cordoned. node-gpu-9 may fail because it is in us-east. node-gpu-11 may fail because its last heartbeat is too old. A filter failure should usually record a reason that operators and later controllers can inspect.
Filtering has a hidden cost: the scheduler can only filter on evidence it has. If node health is stale, a strict filter may reject a node that has recovered. If quota data lags, a loose filter may pass a placement that binding later rejects. This is why filters should distinguish local feasibility checks from authoritative commitment checks. Local filtering narrows the search; it does not replace the final authority.
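A minimal sketch of strict yes-or-no filtering that records a reason for each rejection. The `Node` fields, the reason strings, and the 30-second heartbeat threshold are assumptions made for illustration; they do not come from any particular scheduler.

```python
from dataclasses import dataclass

MAX_HEARTBEAT_AGE_S = 30.0  # assumed freshness threshold, not a standard value


@dataclass(frozen=True)
class Node:
    name: str
    region: str
    gpu_free: bool
    heartbeat_age_s: float
    cordoned: bool = False


def filter_node(node, required_region):
    """Return None if the node is feasible, else a short reason string."""
    if node.region != required_region:
        return "region_mismatch"
    if node.cordoned:
        return "cordoned"
    if not node.gpu_free:
        return "no_free_gpu"
    if node.heartbeat_age_s > MAX_HEARTBEAT_AGE_S:
        return "stale_heartbeat"
    return None


def filter_nodes(nodes, required_region):
    """Split nodes into feasible candidates and a name -> reason map."""
    feasible, rejected = [], {}
    for node in nodes:
        reason = filter_node(node, required_region)
        if reason is None:
            feasible.append(node)
        else:
            rejected[node.name] = reason
    return feasible, rejected


nodes = [
    Node("node-gpu-7", "eu-central", True, 2.0),
    Node("node-gpu-9", "us-east", True, 3.0),
    Node("node-gpu-11", "eu-central", True, 95.0),
]
feasible, rejected = filter_nodes(nodes, "eu-central")
print([n.name for n in feasible])
print(rejected)
```

These checks run against the scheduler's cached evidence only, so a node that passes here can still be rejected at bind time.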
Scoring: Ranking Feasible Candidates
Scoring handles preferences after the hard constraints are satisfied. It might prefer:
- nodes with more spare GPU memory
- nodes near the workload's input data
- nodes that spread replicas across failure domains
- nodes that reduce fragmentation
- nodes that preserve scarce capacity for stricter future workloads
- nodes with lower current network pressure
Scores are useful because many feasible placements are not equally good. Running fraud-batch next to its input data may reduce job time. Spreading replicas across racks may reduce blast radius. Avoiding fragmentation may keep the cluster useful for the next large GPU request.
The risk is false precision. A score is not an invariant; it is an opinion based on a model. A scheduler that treats score differences as exact truth can thrash workloads, overfit stale metrics, or spend too much time chasing tiny improvements. Mature schedulers often keep scoring modular, weighted, and observable so operators can tell whether a placement was driven by locality, fairness, fragmentation, or policy.
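One way to keep scoring modular, weighted, and observable is to return a per-signal breakdown alongside the total. The sketch below assumes each signal is already normalized to the range 0 to 1; the signal names and weights are hypothetical.

```python
# Hypothetical weights; a real deployment would tune and expose these.
WEIGHTS = {
    "locality": 0.5,
    "low_network_pressure": 0.3,
    "spare_gpu_memory": 0.2,
}


def score_node(signals, weights=WEIGHTS):
    """Return (total, breakdown) for one node.

    signals maps a signal name to a value in [0, 1]; missing signals count as 0.
    """
    breakdown = {name: w * signals.get(name, 0.0) for name, w in weights.items()}
    return sum(breakdown.values()), breakdown


def rank(candidates):
    """Sort node names by descending score, breaking ties by name."""
    scored = []
    for name, signals in candidates.items():
        total, breakdown = score_node(signals)
        scored.append((name, total, breakdown))
    return sorted(scored, key=lambda item: (-item[1], item[0]))


candidates = {
    "node-gpu-7": {"locality": 1.0, "low_network_pressure": 0.5, "spare_gpu_memory": 0.5},
    "node-gpu-13": {"locality": 0.0, "low_network_pressure": 1.0, "spare_gpu_memory": 1.0},
}
ranking = rank(candidates)
for name, total, breakdown in ranking:
    print(name, round(total, 3), breakdown)
```

Keeping the breakdown next to the total lets an operator see that locality, rather than spare memory, drove a given choice.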
Binding: Turning A Choice Into Authority
Binding is the commit step. The scheduler chooses a candidate, but the control plane must still make the placement durable:
1. Scheduler selects node-gpu-7 from a local snapshot.
2. Scheduler records an assumed placement in its local cache.
3. Scheduler sends a binding request to the authoritative API.
4. API accepts only if workload version, node state, quota, and ownership still allow it.
5. Node agent observes the binding and attempts to start the workload.
6. Status updates confirm running, failed, or unknown.
The local assumed placement prevents the scheduler from immediately placing another workload onto the same apparent capacity while the bind is pending. The authoritative binding prevents the scheduler from committing a stale or conflicting choice. Both are needed. Local assumption is fast but provisional; binding is slower but authoritative.
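The interplay between a provisional local assumption and an authoritative bind can be sketched as follows. `BindingAuthority`, `Scheduler`, and `BindConflict` are hypothetical names, and the bind here is synchronous and in-memory for brevity; a real bind is a remote call, which is why the assumption exists while it is in flight.

```python
class BindConflict(Exception):
    """Raised by the authority when a placement no longer fits current state."""


class BindingAuthority:
    """Hypothetical stand-in for the authoritative API that owns assignments."""

    def __init__(self, free_nodes):
        self._free = set(free_nodes)
        self.bindings = {}  # workload -> node, the committed truth

    def bind(self, workload, node):
        if node not in self._free:
            raise BindConflict(f"{node} has no free GPU")
        self._free.remove(node)
        self.bindings[workload] = node


class Scheduler:
    def __init__(self, authority, cached_free_nodes):
        self.authority = authority
        self.cached_free = set(cached_free_nodes)  # local view, may be stale
        self.assumed = {}  # workload -> node, provisional only

    def place(self, workload, node):
        if node not in self.cached_free or node in self.assumed.values():
            return False  # no apparent capacity, or already promised locally
        self.assumed[workload] = node
        try:
            self.authority.bind(workload, node)
            return True
        except BindConflict:
            return False
        finally:
            # Either outcome resolves the assumption, and in both cases the
            # node is no longer free as far as this cache should believe.
            del self.assumed[workload]
            self.cached_free.discard(node)


authority = BindingAuthority({"node-gpu-7"})
sched_a = Scheduler(authority, {"node-gpu-7"})
sched_b = Scheduler(authority, {"node-gpu-7"})  # same snapshot, about to go stale

ok_a = sched_a.place("fraud-batch", "node-gpu-7")
ok_b = sched_b.place("other-job", "node-gpu-7")
print(ok_a, ok_b, authority.bindings)
```

The second scheduler's cache still showed a free GPU, but only the authority's view decided the outcome.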
The lease and fencing model from the previous lesson applies here. If scheduler-shard-12 is currently owned by ctrl-b at epoch 1843, the binding write should include that ownership evidence. A stale scheduler replica should not be able to bind fraud-batch just because it still has a promising cache entry.
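A fenced binding write might look like the sketch below. The shard, owner, and epoch values come from the example above; the stale replica `ctrl-a` at epoch 1842, the `fenced_bind` function, and the in-memory dictionaries are assumptions for illustration.

```python
class StaleOwner(Exception):
    """Raised when a bind carries ownership evidence that is no longer current."""


# shard -> (owner, epoch), as recorded by the lease authority.
leases = {"scheduler-shard-12": ("ctrl-b", 1843)}
bindings = {}  # workload -> node


def fenced_bind(shard, owner, epoch, workload, node):
    """Commit a binding only if (owner, epoch) matches the current lease."""
    current = leases.get(shard)
    if current != (owner, epoch):
        raise StaleOwner(f"{owner}@{epoch} does not own {shard}")
    bindings[workload] = node


# A stale replica (hypothetical ctrl-a, previous epoch) is rejected even
# though its cache still suggests a plausible node.
try:
    fenced_bind("scheduler-shard-12", "ctrl-a", 1842, "fraud-batch", "node-gpu-13")
    stale_rejected = False
except StaleOwner:
    stale_rejected = True

# The current owner's write goes through.
fenced_bind("scheduler-shard-12", "ctrl-b", 1843, "fraud-batch", "node-gpu-7")
print(stale_rejected, bindings)
```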
Worked Example: One Placement Attempt
Consider one scheduling cycle for fraud-batch:
workload: fraud-batch
requirements:
  region: eu-central
  gpu: 1 exclusive
  tenant: fraud
preferences:
  close_to: fraud-inputs-eu
  spread_across: rack
  avoid: nodes with high network pressure
The scheduler reads a snapshot and finds four candidate nodes:
node-gpu-7   eu-central  gpu-free  data-near  rack-a  healthy
node-gpu-9   us-east     gpu-free  data-far   rack-c  healthy
node-gpu-11  eu-central  gpu-free  data-near  rack-b  stale heartbeat
node-gpu-13  eu-central  gpu-free  data-far   rack-d  healthy
Filtering removes node-gpu-9 because of region and node-gpu-11 because its heartbeat is too stale for a new placement. Scoring ranks node-gpu-7 over node-gpu-13 because it is closer to the input data. The scheduler assumes node-gpu-7 locally and sends a binding request.
If the binding succeeds, the node agent becomes responsible for local execution and status. If the binding fails because another workload already took the GPU, the scheduler clears the assumption, records the rejection reason, and requeues fraud-batch. That failure is not a bug by itself. It is the expected safety boundary between fast scheduling from cache and authoritative commitment.
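The whole cycle for this example can be condensed into a short sketch. The `NodeView` tuple, the `schedule_once` function, and the `try_bind` callback (a stand-in for the authoritative binding API) are hypothetical; scoring is reduced to the single locality preference that decides this case.

```python
from collections import deque
from typing import NamedTuple


class NodeView(NamedTuple):
    name: str
    region: str
    gpu_free: bool
    data_near: bool
    rack: str
    healthy: bool


SNAPSHOT = [
    NodeView("node-gpu-7", "eu-central", True, True, "rack-a", True),
    NodeView("node-gpu-9", "us-east", True, False, "rack-c", True),
    NodeView("node-gpu-11", "eu-central", True, True, "rack-b", False),  # stale heartbeat
    NodeView("node-gpu-13", "eu-central", True, False, "rack-d", True),
]


def schedule_once(workload, snapshot, try_bind, queue):
    """Run filter -> score -> bind once; requeue the workload on failure."""
    feasible = [
        n for n in snapshot
        if n.region == "eu-central" and n.gpu_free and n.healthy
    ]
    if not feasible:
        queue.append(workload)
        return None
    # Prefer data-near nodes; True sorts above False.
    best = max(feasible, key=lambda n: n.data_near)
    if try_bind(workload, best.name):
        return best.name
    queue.append(workload)  # expected outcome when the cache was stale
    return None


queue = deque()
placed = schedule_once("fraud-batch", SNAPSHOT, lambda w, n: True, queue)
lost_race = schedule_once("fraud-batch", SNAPSHOT, lambda w, n: False, queue)
print(placed, lost_race, list(queue))
```

The second call models another workload taking the GPU first: the scheduler gets no placement and fraud-batch returns to the queue.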
Operational Failure Modes
- Filter does too much: expensive policy checks slow the entire queue. The fix is to keep hard feasibility checks clear and move expensive preference logic into scoring or precomputed state.
- Score hides policy: a soft preference is accidentally used as if it were a hard constraint. The fix is to separate filter reasons from score contributions.
- Binding bypasses authority: the scheduler writes directly to node-local state. The fix is an authoritative binding API that can reject stale or conflicting decisions.
- Assumed state leaks: a failed bind leaves capacity reserved in the scheduler cache. The fix is timeout, rollback, and reconciliation of assumed placements.
- Stale cache causes churn: repeated bind failures come from old node or quota data. The fix is backoff, cache freshness signals, and better invalidation.
- No placement explanation: operators cannot tell why a node was chosen. The fix is structured filter failures and score breakdowns.
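For the leaked-assumption failure mode, one possible shape for a timeout is an assumed-placement cache whose entries carry a deadline. The `AssumedCache` class and the 30-second TTL are assumptions for illustration; time is passed in explicitly so the behavior is easy to test.

```python
class AssumedCache:
    """Tracks provisional placements and drops them after a deadline."""

    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self._entries = {}  # workload -> (node, deadline)

    def assume(self, workload, node, now):
        self._entries[workload] = (node, now + self.ttl_s)

    def resolve(self, workload):
        """Call when the bind result is known, whether success or failure."""
        self._entries.pop(workload, None)

    def expire(self, now):
        """Drop assumptions whose bind result never arrived; return them."""
        expired = [w for w, (_, deadline) in self._entries.items() if deadline <= now]
        for workload in expired:
            del self._entries[workload]
        return expired

    def reserved_nodes(self):
        return {node for node, _ in self._entries.values()}


cache = AssumedCache(ttl_s=30.0)
cache.assume("fraud-batch", "node-gpu-7", now=100.0)
cache.assume("other-job", "node-gpu-13", now=120.0)

still_reserved = cache.reserved_nodes()
expired = cache.expire(now=135.0)  # only the first assumption is past its deadline
print(expired, cache.reserved_nodes())
```

Expiry only bounds how long capacity stays hidden; reconciliation against the authority is still needed to learn what actually happened to the bind.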
Connections
- The previous lesson, 005.md, explained leases and fencing. Scheduler binding uses the same ownership idea to keep stale leaders from committing placements.
- The next lesson, 007.md, goes deeper into placement preferences such as locality, topology, spreading, and failure domains.
- cloud-platform-and-microservices gives adjacent context for declarative workload APIs and platform-owned placement.
Resources
- [DOC] Kubernetes Scheduler
  - Focus: Study filtering, scoring, and binding as separate stages in a production scheduler.
- [DOC] Kubernetes Scheduling Framework
  - Focus: Look at how scheduler plugins separate queueing, filtering, scoring, reserving, permitting, and binding.
- [PAPER] Large-scale cluster management at Google with Borg
  - Focus: Use Borg to connect scheduler architecture with resource constraints, placement decisions, and cluster utilization.
- [PAPER] Omega: Flexible, Scalable Schedulers for Large Compute Clusters
  - Focus: Compare centralized binding authority with optimistic, concurrent scheduling.
Key Takeaways
- Filtering removes unsafe or impossible nodes; scoring ranks feasible nodes by preference and cost.
- Scheduler caches make placement fast, but binding must be authoritative because cached state can be stale.
- Reserve or assume stages prevent the scheduler from racing itself while a binding is pending.
- The central trade-off is fast local decision-making versus safe commitment under concurrent, changing cluster state.