Distributed Schedulers and Control Planes: Placement, Locality, and Topology Constraints

The core idea: Placement constraints translate physical and logical topology into scheduler decisions, so the design trade-off is between putting work near what it needs and spreading work far enough to survive failures and preserve capacity.

Core Insight

Suppose fraud-batch now needs three GPU workers, not one. Its input data lives in eu-central-a, one rack has a slow top-of-rack switch, and the tenant requires replicas to stay inside eu-central. A scheduler that only asks "which node has a free GPU?" can make a placement that passes basic filtering but still performs badly or fails in one rack outage.

Placement is where the scheduler's abstract node choice meets geography, network shape, storage locality, hardware pools, and failure domains. Some topology facts are hard constraints: this workload must not leave a region, must use GPU nodes, and must avoid a drained rack. Other facts are preferences: stay near input data, spread replicas across racks, keep scarce GPU types available for stricter jobs, and avoid network hotspots.

The lesson is not that locality is good or that spreading is good. Each can be right and each can be harmful. Locality reduces latency and data movement, but it can concentrate risk and fragment capacity. Spreading improves fault tolerance and fairness, but it can increase network cost and make tightly coupled jobs slower. A usable placement model makes that trade-off explicit instead of hiding it inside an opaque score.

Topology As Scheduler Input

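Topology is any structure that makes two placements different even when their raw resource numbers look similar. The topology domains used in this lesson are regions, zones, racks, network paths, storage locations, and hardware pools such as a specific GPU type.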

Schedulers usually consume this information through labels, taints, topology keys, node conditions, storage metadata, and controller-owned status. The scheduler does not need to know every cable in the data center. It needs a stable abstraction that lets policies say "same zone," "different rack," "near this data," or "only this accelerator pool."
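As a minimal sketch of that abstraction, assume each node carries flat string labels. The label keys (region, zone, rack) and the helper names below are hypothetical and not taken from any particular scheduler. A policy such as "same zone" or "different rack" then reduces to comparing one label value.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    labels: dict = field(default_factory=dict)


def same_domain(a: Node, b: Node, topology_key: str) -> bool:
    """True when both nodes carry the same value for topology_key."""
    va, vb = a.labels.get(topology_key), b.labels.get(topology_key)
    # A missing label means "unknown", which is never treated as a match.
    return va is not None and va == vb


def group_by_domain(nodes, topology_key):
    """Map each value of topology_key to the names of the nodes in it."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node.labels.get(topology_key)].append(node.name)
    return dict(groups)


a1 = Node("gpu-a1", {"region": "eu-central", "zone": "eu-central-a", "rack": "rack-a"})
a2 = Node("gpu-a2", {"region": "eu-central", "zone": "eu-central-a", "rack": "rack-a"})
b1 = Node("gpu-b1", {"region": "eu-central", "zone": "eu-central-b", "rack": "rack-b"})
```

Here same_domain(a1, a2, "rack") holds while same_domain(a1, b1, "rack") does not, and all three nodes share the region domain. The scheduler never needs more detail than the label values.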

The abstraction must be fresh enough to be useful. A rack label that is wrong after maintenance can defeat anti-affinity. A node condition that lags during a network incident can pull work toward the very domain that should be avoided. Placement depends on topology metadata, so topology metadata becomes part of the control plane's safety surface.
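One way to guard against stale metadata is to record when each topology fact was last observed and refuse to rely on it past some age. The sketch below is illustrative only: TopologyFact, trusted_value, and the 300-second default are assumptions, and a real control plane would choose its own freshness signal.

```python
from dataclasses import dataclass


@dataclass
class TopologyFact:
    value: str
    observed_at: float  # seconds, on the same clock as `now`


def trusted_value(fact, now, max_age_s=300.0):
    """Return the label value only if it was observed recently enough."""
    if fact is None or now - fact.observed_at > max_age_s:
        return None
    return fact.value


def provably_different_rack(fact_a, fact_b, now, max_age_s=300.0):
    """Anti-affinity is satisfied only when both rack labels are fresh and differ."""
    a = trusted_value(fact_a, now, max_age_s)
    b = trusted_value(fact_b, now, max_age_s)
    return a is not None and b is not None and a != b


fresh_a = TopologyFact("rack-a", observed_at=1000.0)
fresh_b = TopologyFact("rack-b", observed_at=1000.0)
stale_b = TopologyFact("rack-b", observed_at=100.0)
```

With now=1100.0, the pair (fresh_a, fresh_b) counts as spread across racks, while (fresh_a, stale_b) does not, because the second label may no longer describe where the node really is.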

Hard Constraints And Soft Preferences

A placement rule should first say whether it is hard or soft.

Hard constraints are non-negotiable. If fraud-batch must stay in eu-central, the scheduler should not place it in us-east to improve utilization. If a workload needs an NVIDIA A100, a different GPU type is not "almost good enough." Hard constraints belong in filtering because violating them creates an invalid placement.
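A hard-constraint filter can be sketched as a function that either keeps a node or records why it was rejected. The node layout, the label keys, and the filter_feasible name are hypothetical. The point is only that a failed hard rule removes the node from consideration and is never traded against a score.

```python
def filter_feasible(nodes, required, drained_racks=frozenset()):
    """Split nodes into feasible names and a map of rejection reasons."""
    feasible, rejected = [], {}
    for node in nodes:
        labels = node["labels"]
        reasons = [
            f"{key} is {labels.get(key)!r}, need {want!r}"
            for key, want in required.items()
            if labels.get(key) != want
        ]
        if labels.get("rack") in drained_racks:
            reasons.append(f"rack {labels['rack']!r} is drained")
        if reasons:
            rejected[node["name"]] = reasons
        else:
            feasible.append(node["name"])
    return feasible, rejected


nodes = [
    {"name": "n1", "labels": {"region": "eu-central", "gpu": "a100", "rack": "rack-a"}},
    {"name": "n2", "labels": {"region": "us-east", "gpu": "a100", "rack": "rack-x"}},
    {"name": "n3", "labels": {"region": "eu-central", "gpu": "t4", "rack": "rack-b"}},
    {"name": "n4", "labels": {"region": "eu-central", "gpu": "a100", "rack": "rack-d"}},
]
feasible, rejected = filter_feasible(
    nodes,
    required={"region": "eu-central", "gpu": "a100"},
    drained_racks={"rack-d"},
)
```

Only n1 survives. Keeping the rejection reasons makes it possible to tell an operator why a pending job has no valid node.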

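Soft preferences guide choices among valid placements. For fraud-batch, the scheduler may prefer nodes near the input data, replicas spread across racks, placements that keep scarce GPU types available for stricter jobs, and network paths that are not hotspots.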

Soft preferences belong in scoring because they express costs, not absolute legality. The scheduler might accept a data-far node if all data-near nodes are unhealthy, or it might pack batch replicas into one zone during low-priority backfill. The key is that operators should be able to see which preference lost and why.
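To make "which preference lost" visible, a scorer can return per-preference terms instead of a single number. The following is a sketch under assumptions: the two signals, their numeric values, and the function names are invented for illustration, and it expects at least two candidates.

```python
LOCALITY = {"near": 1.0, "medium": 0.5, "far": 0.0}


def breakdown(node, weights):
    """Weighted contribution of each soft preference for one node."""
    raw = {
        "data_locality": LOCALITY[node["data_distance"]],
        "network_health": 1.0 if node["network"] == "normal" else 0.0,
    }
    return {name: weights[name] * value for name, value in raw.items()}


def choose_and_explain(candidates, weights):
    """Pick the best node and list the preferences it is worse on than the runner-up."""
    ranked = sorted(
        candidates,
        key=lambda n: sum(breakdown(n, weights).values()),
        reverse=True,
    )
    winner, runner_up = ranked[0], ranked[1]
    win, lose = breakdown(winner, weights), breakdown(runner_up, weights)
    sacrificed = [name for name in win if lose[name] > win[name]]
    return winner["name"], sacrificed


candidates = [
    {"name": "gpu-a2", "data_distance": "near", "network": "degraded"},
    {"name": "gpu-b1", "data_distance": "medium", "network": "normal"},
]
choice = choose_and_explain(candidates, {"data_locality": 1.0, "network_health": 2.0})
```

With these weights gpu-b1 wins, and the explanation reports that data_locality was the preference given up to get a healthy network path.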

Confusing hard and soft rules produces two opposite failures. If locality is accidentally hard, jobs can sit pending while usable remote capacity is idle. If anti-affinity is accidentally soft for a critical service, one rack failure can take out every replica.

Locality Versus Spread

The two most common placement pressures point in opposite directions.

Locality tries to reduce distance:

workload <-> data
workload <-> dependency
workload <-> cache
workload <-> accelerator

Spreading tries to increase independence:

replica-a on rack-a
replica-b on rack-b
replica-c on rack-c

For fraud-batch, locality may prefer three workers in eu-central-a because input data is there. Spread may prefer one worker per rack or zone so a rack failure does not stop the whole job. Utilization may prefer packing workers onto partially used GPU nodes to leave another GPU pool free. These are all reasonable objectives, but they cannot all be maximized at once.

A practical scheduler makes the decision in layers. It filters for hard requirements, then scores feasible placements with weights that reflect the workload's goal. A latency-sensitive inference service may weight locality and zone affinity heavily. A replicated control-plane component may weight anti-affinity and failure-domain spread heavily. A best-effort batch job may accept remote data access to improve overall cluster utilization.
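The layering can be sketched as a filter pass followed by a weighted score, with the weights selected by workload class. Everything here is hypothetical: the profile names, the weight values, and the precomputed signals stored on each node are assumptions chosen to show how the same candidates rank differently under different intent.

```python
PROFILES = {
    "latency_sensitive_inference": {"locality": 3.0, "spread": 1.0, "packing": 0.5},
    "replicated_control_plane": {"locality": 0.5, "spread": 3.0, "packing": 0.5},
    "best_effort_batch": {"locality": 0.5, "spread": 0.5, "packing": 3.0},
}


def place(nodes, hard_filters, profile):
    """Filter on hard requirements first, then score what remains."""
    weights = PROFILES[profile]
    feasible = [n for n in nodes if all(check(n) for check in hard_filters)]
    if not feasible:
        return None  # nothing valid: the workload stays pending
    best = max(
        feasible,
        key=lambda n: sum(weights[k] * n["signals"][k] for k in weights),
    )
    return best["name"]


def in_region(node):
    return node["region"] == "eu-central"


nodes = [
    {"name": "near", "region": "eu-central",
     "signals": {"locality": 1.0, "spread": 0.0, "packing": 0.0}},
    {"name": "spread", "region": "eu-central",
     "signals": {"locality": 0.0, "spread": 1.0, "packing": 0.0}},
    {"name": "packed", "region": "eu-central",
     "signals": {"locality": 0.0, "spread": 0.0, "packing": 1.0}},
    {"name": "remote", "region": "us-east",
     "signals": {"locality": 1.0, "spread": 1.0, "packing": 1.0}},
]
```

The remote node has the best raw signals but never wins, because the region filter removes it before scoring. Among the rest, each profile selects a different node.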

Worked Example: Three GPU Workers

Consider four candidate nodes after hard filtering for region and GPU type:

node        zone          rack    data distance   network       gpu state
gpu-a1      eu-central-a  rack-a  near            normal        1 free
gpu-a2      eu-central-a  rack-a  near            degraded      1 free
gpu-b1      eu-central-b  rack-b  medium          normal        1 free
gpu-c1      eu-central-c  rack-c  far             normal        1 free

If the scheduler optimizes only data locality for fraud-batch, it may choose gpu-a1 and gpu-a2, then wait for a third near node. That reduces data movement, but it concentrates work in rack-a and puts one worker on a degraded network path. If it optimizes only spreading, it may choose gpu-a1, gpu-b1, and gpu-c1. That improves failure isolation, but one worker now reads data from far away.

A balanced scoring model could do this:

score = data_locality + rack_spread + network_health + capacity_shape

Then the scheduler can choose gpu-a1, gpu-b1, and gpu-c1 for a critical replicated workload, or gpu-a1 and gpu-b1 first while waiting briefly for a healthy near node for a data-heavy batch job. The result depends on workload intent. The architecture should make that intent visible rather than embedding one global placement philosophy.
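The two outcomes can be reproduced with a small greedy sketch over the four candidate nodes. The numeric signal values, both weight sets, and the min_score cutoff are assumptions made for illustration. The lesson itself only names the four score terms. Because every candidate has exactly one free GPU, capacity_shape is a constant here and does not change the ranking.

```python
NODES = [
    {"name": "gpu-a1", "rack": "rack-a", "data": "near", "network": "normal"},
    {"name": "gpu-a2", "rack": "rack-a", "data": "near", "network": "degraded"},
    {"name": "gpu-b1", "rack": "rack-b", "data": "medium", "network": "normal"},
    {"name": "gpu-c1", "rack": "rack-c", "data": "far", "network": "normal"},
]
LOCALITY = {"near": 1.0, "medium": 0.5, "far": 0.0}


def score(node, chosen, w):
    """Weighted sum of the four terms, given the nodes already chosen."""
    used_racks = {n["rack"] for n in chosen}
    return (
        w["data_locality"] * LOCALITY[node["data"]]
        + w["rack_spread"] * (0.0 if node["rack"] in used_racks else 1.0)
        + w["network_health"] * (1.0 if node["network"] == "normal" else 0.0)
        + w["capacity_shape"] * 1.0  # identical for all four candidates
    )


def place(nodes, replicas, w, min_score=0.0):
    """Greedily pick the best remaining node until replicas are placed."""
    chosen, remaining = [], list(nodes)
    while remaining and len(chosen) < replicas:
        best = max(remaining, key=lambda n: score(n, chosen, w))
        if score(best, chosen, w) < min_score:
            break  # leave the rest pending rather than accept a poor node
        chosen.append(best)
        remaining.remove(best)
    return [n["name"] for n in chosen]


CRITICAL = {"data_locality": 1.0, "rack_spread": 3.0,
            "network_health": 1.0, "capacity_shape": 1.0}
DATA_HEAVY = {"data_locality": 3.0, "rack_spread": 1.0,
              "network_health": 2.0, "capacity_shape": 1.0}

critical_placement = place(NODES, 3, CRITICAL)
data_heavy_placement = place(NODES, 3, DATA_HEAVY, min_score=5.0)
```

Under the CRITICAL weights all three workers are placed on gpu-a1, gpu-b1, and gpu-c1. Under the DATA_HEAVY weights with a cutoff of 5.0, only gpu-a1 and gpu-b1 are placed: the remaining candidates (gpu-a2 and gpu-c1) both score 4.0, so the third worker stays pending until a better node appears.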

The binding step from the previous lesson still matters. Topology scoring chooses a candidate placement. It does not prove the placement is still available. The authoritative binding API must still confirm current node state, quota, and ownership before committing.
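A minimal, single-threaded sketch of that confirmation step is a versioned compare-and-set against an authoritative store. NodeStore, BindConflict, and the version field are hypothetical names. Quota and ownership checks are omitted, and a real binding API would also need to make the check and the update atomic.

```python
class BindConflict(Exception):
    """Raised when the authoritative state no longer matches the snapshot."""


class NodeStore:
    def __init__(self):
        self._nodes = {}

    def put(self, name, free_gpus):
        self._nodes[name] = {"free_gpus": free_gpus, "version": 1}

    def snapshot(self, name):
        node = self._nodes[name]
        return node["free_gpus"], node["version"]

    def bind(self, name, seen_version, gpus=1):
        """Commit only if the node is unchanged since the scheduler looked."""
        node = self._nodes[name]
        if node["version"] != seen_version:
            raise BindConflict(f"{name} changed since version {seen_version}")
        if node["free_gpus"] < gpus:
            raise BindConflict(f"{name} has no free GPU")
        node["free_gpus"] -= gpus
        node["version"] += 1
        return node["version"]


store = NodeStore()
store.put("gpu-a1", free_gpus=1)
_, seen_by_first = store.snapshot("gpu-a1")
_, seen_by_second = store.snapshot("gpu-a1")

store.bind("gpu-a1", seen_by_first)  # first scheduler wins the node
try:
    store.bind("gpu-a1", seen_by_second)
    second_outcome = "bound"
except BindConflict:
    second_outcome = "retry scheduling"
```

Both schedulers scored gpu-a1 from the same snapshot, but only the first bind commits. The second sees a changed version and has to go back through filtering and scoring.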

Operational Failure Modes

Connections

Resources

Key Takeaways
