Distributed Schedulers and Control Planes: Placement, Locality, and Topology Constraints

LESSON

Distributed Schedulers and Control Planes

007 35 min advanced

Distributed Schedulers and Control Planes: Placement, Locality, and Topology Constraints

The core idea: Placement constraints translate physical and logical topology into scheduler decisions, so the design trade-off is between putting work near what it needs and spreading work far enough to survive failures and preserve capacity.

Core Insight

Suppose fraud-batch now needs three GPU workers, not one. Its input data lives in eu-central-a, one rack has a slow top-of-rack switch, and the tenant requires replicas to stay inside eu-central. A scheduler that only asks "which node has a free GPU?" can make a placement that passes basic filtering but still performs badly or fails in one rack outage.

Placement is where the scheduler's abstract node choice meets geography, network shape, storage locality, hardware pools, and failure domains. Some topology facts are hard constraints: this workload must not leave a region, must use GPU nodes, and must avoid a drained rack. Other facts are preferences: stay near input data, spread replicas across racks, keep scarce GPU types available for stricter jobs, and avoid network hotspots.

The important distinction is not "locality good, spreading good." Both can be right and both can be harmful. Locality reduces latency and data movement, but it can concentrate risk and fragment capacity. Spreading improves fault tolerance and fairness, but it can increase network cost and make tightly coupled jobs slower. A usable placement model makes that trade-off explicit instead of hiding it inside a mysterious score.

Topology As Scheduler Input

Topology is any structure that makes two placements different even when their raw resource numbers look similar. Common topology domains include:

Region: legal, compliance, latency, and disaster boundary.
Zone: availability boundary inside a region.
Rack or cell: shared power, network, maintenance, or blast-radius boundary.
Node: local CPU, memory, GPU, disk, and runtime state.
Storage location: where input data, cache, or persistent volumes are available.
Network path: cost and congestion between workload, data, and dependencies.
Hardware pool: accelerator type, generation, driver, or isolation mode.

Schedulers usually consume this information through labels, taints, topology keys, node conditions, storage metadata, and controller-owned status. The scheduler does not need to know every cable in the data center. It needs a stable abstraction that lets policies say "same zone," "different rack," "near this data," or "only this accelerator pool."

The abstraction must be fresh enough to be useful. A rack label that is wrong after maintenance can defeat anti-affinity. A node condition that lags during a network incident can pull work toward the very domain that should be avoided. Placement depends on topology metadata, so topology metadata becomes part of the control plane's safety surface.

Hard Constraints And Soft Preferences

A placement rule should first say whether it is hard or soft.

Hard constraints are non-negotiable. If fraud-batch must stay in eu-central, the scheduler should not place it in us-east to improve utilization. If a workload needs an NVIDIA A100, a different GPU type is not "almost good enough." Hard constraints belong in filtering because violating them creates an invalid placement.

Soft preferences guide choices among valid placements. For fraud-batch, the scheduler may prefer:

nodes in the same zone as the input data
replicas spread across racks
nodes with lower network pressure
placements that avoid fragmenting a scarce GPU pool
nodes that already have the required container image or cache

Soft preferences belong in scoring because they express costs, not absolute legality. The scheduler might accept a data-far node if all data-near nodes are unhealthy, or it might pack batch replicas into one zone during low-priority backfill. The key is that operators should be able to see which preference lost and why.

Confusion between hard and soft rules creates bad systems. If locality is accidentally hard, jobs can sit pending while usable remote capacity is idle. If anti-affinity is accidentally soft for a critical service, one rack failure can take out every replica.

Locality Versus Spread

The two most common placement pressures point in opposite directions.

Locality tries to reduce distance:

workload <-> data
workload <-> dependency
workload <-> cache
workload <-> accelerator

Spreading tries to increase independence:

replica-a on rack-a
replica-b on rack-b
replica-c on rack-c

For fraud-batch, locality may prefer three workers in eu-central-a because input data is there. Spread may prefer one worker per rack or zone so a rack failure does not stop the whole job. Utilization may prefer packing workers onto partially used GPU nodes to leave another GPU pool free. These are all reasonable objectives, but they cannot all be maximized at once.

A practical scheduler makes the decision in layers. It filters for hard requirements, then scores feasible placements with weights that reflect the workload's goal. A latency-sensitive inference service may weight locality and zone affinity heavily. A replicated control-plane component may weight anti-affinity and failure-domain spread heavily. A best-effort batch job may accept remote data access to improve overall cluster utilization.

Worked Example: Three GPU Workers

Consider four candidate nodes after hard filtering for region and GPU type:

node        zone          rack    data distance   network       gpu state
gpu-a1      eu-central-a  rack-a  near            normal        1 free
gpu-a2      eu-central-a  rack-a  near            degraded      1 free
gpu-b1      eu-central-b  rack-b  medium          normal        1 free
gpu-c1      eu-central-c  rack-c  far             normal        1 free

If fraud-batch only optimizes data locality, it may choose gpu-a1, gpu-a2, and then wait for another near node. That reduces data movement, but it concentrates work in rack-a and uses a degraded network path. If it only optimizes spreading, it may choose gpu-a1, gpu-b1, and gpu-c1. That improves failure isolation, but one worker now reads data from far away.

A balanced scoring model could do this:

score = data_locality + rack_spread + network_health + capacity_shape

Then the scheduler can choose gpu-a1, gpu-b1, and gpu-c1 for a critical replicated workload, or gpu-a1 and gpu-b1 first while waiting briefly for a healthy near node for a data-heavy batch job. The result depends on workload intent. The architecture should make that intent visible rather than embedding one global placement philosophy.

The binding step from the previous lesson still matters. Topology scoring chooses a candidate placement. It does not prove the placement is still available. The authoritative binding API must still confirm current node state, quota, and ownership before committing.

Operational Failure Modes

Locality as a trap: data-near capacity is exhausted, but locality is hard-coded as mandatory. The fix is to separate required locality from preferred locality.
Spread without cost awareness: replicas are spread across domains but pay huge network or storage costs. The fix is to score both independence and distance.
Stale topology labels: a rack, zone, or accelerator label no longer reflects reality. The fix is ownership, validation, and audit of topology metadata.
Hidden shared fate: two nodes appear separate but share a power domain, network cell, or storage dependency. The fix is to model the real failure domain, not just visible labels.
Overweight preference: one score term dominates all others and causes surprising placements. The fix is score breakdowns, bounded weights, and placement explanations.
Fragmented scarce capacity: spreading small jobs across special hardware blocks larger jobs later. The fix is capacity-shape scoring and reservation policy.

Connections

The previous lesson, 006.md, introduced filtering, scoring, and binding. This lesson makes the scoring inputs more concrete by adding topology and locality.
The next lesson, 008.md, adds fairness, priority, preemption, and backpressure when many workloads compete for the same placement space.
networking-and-failure-models provides useful background for understanding why topology domains are also failure domains.

Resources

[DOC] Kubernetes Assign Pods to Nodes
- Focus: Study node selectors, affinity, anti-affinity, taints, and tolerations as placement controls.
[DOC] Kubernetes Pod Topology Spread Constraints
- Focus: Look at how topology keys turn failure domains into scheduler-visible constraints.
[DOC] Kubernetes Scheduling Framework
- Focus: Connect topology rules to filter, score, reserve, and bind extension points.
[PAPER] Large-scale cluster management at Google with Borg
- Focus: Look for how production cluster managers balance locality, availability, and utilization.

Key Takeaways

Topology turns region, zone, rack, node, storage, network, and hardware facts into scheduler-visible placement inputs.
Hard placement rules belong in filtering; soft locality and spread preferences belong in scoring.
Locality improves latency and data movement, while spreading improves failure isolation; the central trade-off is deciding which pressure dominates for a workload.
Placement explanations matter because topology scores hide policy, cost, and risk unless the control plane makes them visible.

← Back to Distributed Schedulers and Control Planes

← Back to Distributed Systems

← Back to Learning Hub