Distributed Schedulers and Control Planes: Placement, Locality, and Topology Constraints
LESSON
Distributed Schedulers and Control Planes: Placement, Locality, and Topology Constraints
The core idea: Placement constraints translate physical and logical topology into scheduler decisions, so the design trade-off is between putting work near what it needs and spreading work far enough to survive failures and preserve capacity.
Core Insight
Suppose fraud-batch now needs three GPU workers, not one. Its input data lives in eu-central-a, one rack has a slow top-of-rack switch, and the tenant requires replicas to stay inside eu-central. A scheduler that only asks "which node has a free GPU?" can make a placement that passes basic filtering but still performs badly or fails in one rack outage.
Placement is where the scheduler's abstract node choice meets geography, network shape, storage locality, hardware pools, and failure domains. Some topology facts are hard constraints: this workload must not leave a region, must use GPU nodes, and must avoid a drained rack. Other facts are preferences: stay near input data, spread replicas across racks, keep scarce GPU types available for stricter jobs, and avoid network hotspots.
The important distinction is not "locality good, spreading good." Both can be right and both can be harmful. Locality reduces latency and data movement, but it can concentrate risk and fragment capacity. Spreading improves fault tolerance and fairness, but it can increase network cost and make tightly coupled jobs slower. A usable placement model makes that trade-off explicit instead of hiding it inside a mysterious score.
Topology As Scheduler Input
Topology is any structure that makes two placements different even when their raw resource numbers look similar. Common topology domains include:
- Region: legal, compliance, latency, and disaster boundary.
- Zone: availability boundary inside a region.
- Rack or cell: shared power, network, maintenance, or blast-radius boundary.
- Node: local CPU, memory, GPU, disk, and runtime state.
- Storage location: where input data, cache, or persistent volumes are available.
- Network path: cost and congestion between workload, data, and dependencies.
- Hardware pool: accelerator type, generation, driver, or isolation mode.
Schedulers usually consume this information through labels, taints, topology keys, node conditions, storage metadata, and controller-owned status. The scheduler does not need to know every cable in the data center. It needs a stable abstraction that lets policies say "same zone," "different rack," "near this data," or "only this accelerator pool."
The abstraction must be fresh enough to be useful. A rack label that is wrong after maintenance can defeat anti-affinity. A node condition that lags during a network incident can pull work toward the very domain that should be avoided. Placement depends on topology metadata, so topology metadata becomes part of the control plane's safety surface.
Hard Constraints And Soft Preferences
A placement rule should first say whether it is hard or soft.
Hard constraints are non-negotiable. If fraud-batch must stay in eu-central, the scheduler should not place it in us-east to improve utilization. If a workload needs an NVIDIA A100, a different GPU type is not "almost good enough." Hard constraints belong in filtering because violating them creates an invalid placement.
Soft preferences guide choices among valid placements. For fraud-batch, the scheduler may prefer:
- nodes in the same zone as the input data
- replicas spread across racks
- nodes with lower network pressure
- placements that avoid fragmenting a scarce GPU pool
- nodes that already have the required container image or cache
Soft preferences belong in scoring because they express costs, not absolute legality. The scheduler might accept a data-far node if all data-near nodes are unhealthy, or it might pack batch replicas into one zone during low-priority backfill. The key is that operators should be able to see which preference lost and why.
Confusion between hard and soft rules creates bad systems. If locality is accidentally hard, jobs can sit pending while usable remote capacity is idle. If anti-affinity is accidentally soft for a critical service, one rack failure can take out every replica.
Locality Versus Spread
The two most common placement pressures point in opposite directions.
Locality tries to reduce distance:
workload <-> data
workload <-> dependency
workload <-> cache
workload <-> accelerator
Spreading tries to increase independence:
replica-a on rack-a
replica-b on rack-b
replica-c on rack-c
For fraud-batch, locality may prefer three workers in eu-central-a because input data is there. Spread may prefer one worker per rack or zone so a rack failure does not stop the whole job. Utilization may prefer packing workers onto partially used GPU nodes to leave another GPU pool free. These are all reasonable objectives, but they cannot all be maximized at once.
A practical scheduler makes the decision in layers. It filters for hard requirements, then scores feasible placements with weights that reflect the workload's goal. A latency-sensitive inference service may weight locality and zone affinity heavily. A replicated control-plane component may weight anti-affinity and failure-domain spread heavily. A best-effort batch job may accept remote data access to improve overall cluster utilization.
Worked Example: Three GPU Workers
Consider four candidate nodes after hard filtering for region and GPU type:
node zone rack data distance network gpu state
gpu-a1 eu-central-a rack-a near normal 1 free
gpu-a2 eu-central-a rack-a near degraded 1 free
gpu-b1 eu-central-b rack-b medium normal 1 free
gpu-c1 eu-central-c rack-c far normal 1 free
If fraud-batch only optimizes data locality, it may choose gpu-a1, gpu-a2, and then wait for another near node. That reduces data movement, but it concentrates work in rack-a and uses a degraded network path. If it only optimizes spreading, it may choose gpu-a1, gpu-b1, and gpu-c1. That improves failure isolation, but one worker now reads data from far away.
A balanced scoring model could do this:
score = data_locality + rack_spread + network_health + capacity_shape
Then the scheduler can choose gpu-a1, gpu-b1, and gpu-c1 for a critical replicated workload, or gpu-a1 and gpu-b1 first while waiting briefly for a healthy near node for a data-heavy batch job. The result depends on workload intent. The architecture should make that intent visible rather than embedding one global placement philosophy.
The binding step from the previous lesson still matters. Topology scoring chooses a candidate placement. It does not prove the placement is still available. The authoritative binding API must still confirm current node state, quota, and ownership before committing.
Operational Failure Modes
- Locality as a trap: data-near capacity is exhausted, but locality is hard-coded as mandatory. The fix is to separate required locality from preferred locality.
- Spread without cost awareness: replicas are spread across domains but pay huge network or storage costs. The fix is to score both independence and distance.
- Stale topology labels: a rack, zone, or accelerator label no longer reflects reality. The fix is ownership, validation, and audit of topology metadata.
- Hidden shared fate: two nodes appear separate but share a power domain, network cell, or storage dependency. The fix is to model the real failure domain, not just visible labels.
- Overweight preference: one score term dominates all others and causes surprising placements. The fix is score breakdowns, bounded weights, and placement explanations.
- Fragmented scarce capacity: spreading small jobs across special hardware blocks larger jobs later. The fix is capacity-shape scoring and reservation policy.
Connections
- The previous lesson,
006.md, introduced filtering, scoring, and binding. This lesson makes the scoring inputs more concrete by adding topology and locality. - The next lesson,
008.md, adds fairness, priority, preemption, and backpressure when many workloads compete for the same placement space. networking-and-failure-modelsprovides useful background for understanding why topology domains are also failure domains.
Resources
- [DOC] Kubernetes Assign Pods to Nodes
- Focus: Study node selectors, affinity, anti-affinity, taints, and tolerations as placement controls.
- [DOC] Kubernetes Pod Topology Spread Constraints
- Focus: Look at how topology keys turn failure domains into scheduler-visible constraints.
- [DOC] Kubernetes Scheduling Framework
- Focus: Connect topology rules to filter, score, reserve, and bind extension points.
- [PAPER] Large-scale cluster management at Google with Borg
- Focus: Look for how production cluster managers balance locality, availability, and utilization.
Key Takeaways
- Topology turns region, zone, rack, node, storage, network, and hardware facts into scheduler-visible placement inputs.
- Hard placement rules belong in filtering; soft locality and spread preferences belong in scoring.
- Locality improves latency and data movement, while spreading improves failure isolation; the central trade-off is deciding which pressure dominates for a workload.
- Placement explanations matter because topology scores hide policy, cost, and risk unless the control plane makes them visible.