Distributed Schedulers and Control Planes: Cross-Region Scheduling and Disaster Boundaries

015 35 min advanced

The core idea: Cross-region scheduling is a disaster-boundary design problem, so the trade-off is between using many regions as one large pool and preserving enough regional independence to survive failures.

Core Insight

Suppose risk-api normally runs in eu-central, with warm capacity in eu-west and a smaller emergency lane in us-east. payments-prod wants low latency for European users, fraud-labs wants spare GPUs wherever they exist, and the platform team wants a regional outage to degrade service without turning the whole control plane into one global incident.

The tempting model is to build one scheduler that sees every region and chooses the best place for each workload. That can improve utilization and make failover look simple on a whiteboard. It also creates a dangerous dependency: every scheduling decision now crosses wide-area links, reads global state that may be stale, and may need a single authority that is itself hard to keep available during a regional disaster.

A disaster boundary is the line across which failure should not automatically propagate. In a scheduling system, that boundary defines which state is local, which state can lag, which controller is authoritative, which work may move during failover, and which capacity is deliberately held idle. Cross-region scheduling is less about finding a globally optimal placement and more about deciding where global coordination is worth the risk.

Region, Zone, Cell, and Authority

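Regions, zones, and cells are all failure domains, but they are not interchangeable. A region is a geographic area, a zone is an isolated location inside a region with its own power and networking, and a cell is a slice of the platform that runs its own control-plane instance. They fail for different reasons and at different scales, so a policy written for one does not automatically fit another.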

The most important question is authority: who is allowed to decide?

local scheduler:
  authoritative for binding work inside one region or cell

global placement planner:
  recommends regional targets and failover intent

traffic controller:
  moves user traffic when a region degrades

disaster controller:
  activates reserved capacity and changes regional policy during declared events

If the global planner is unavailable, the local scheduler should still make local progress. The reverse also matters: if eu-central is partitioned from the global planner, it should keep admitting work only up to a locally configured limit, not accept all global traffic and all failover work by default. Authority needs to degrade intentionally.
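One way to picture intentional degradation is a local scheduler that treats planner input as advice and clamps it to a limit it owns. The following is a minimal sketch under stated assumptions: the names PlannerHint, LocalScheduler, max_failover_replicas, and partitioned_failover_replicas are hypothetical and do not come from any particular scheduler.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlannerHint:
    """Advice from a hypothetical global planner; never a binding decision."""
    accept_failover_replicas: int


@dataclass
class LocalScheduler:
    region: str
    free_slots: int
    # Hard local ceiling on failover work, configured ahead of time (assumption).
    max_failover_replicas: int
    # Smaller ceiling used when the planner cannot be reached (assumption).
    partitioned_failover_replicas: int = 0

    def failover_budget(self, hint: Optional[PlannerHint]) -> int:
        """Return how many failover replicas this region is willing to admit."""
        if hint is None:
            # Planner unreachable: fall back to the conservative local default.
            requested = self.partitioned_failover_replicas
        else:
            requested = hint.accept_failover_replicas
        # The local boundary always wins over global advice.
        return max(0, min(requested, self.max_failover_replicas, self.free_slots))

    def bind_local(self, replicas: int) -> int:
        """Bind local work without consulting the planner; returns replicas placed."""
        placed = min(replicas, self.free_slots)
        self.free_slots -= placed
        return placed


sched = LocalScheduler(
    region="eu-central",
    free_slots=100,
    max_failover_replicas=20,
    partitioned_failover_replicas=5,
)
```

In this sketch a planner that asks for 500 failover replicas gets 20, a missing planner yields a budget of 5, and bind_local keeps working in both cases. The specific limits are illustrative; the point is that the ceiling lives with the local authority.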

Active-Active and Active-Passive

Cross-region scheduling usually sits between two broad patterns.

In active-active, multiple regions serve production traffic at the same time. Work can be placed near users, capacity can be used continuously, and failover may require shifting only part of the load. The cost is more complex coordination. Data locality, tenant quotas, regional fairness, and traffic routing all interact.

In active-passive, one region is primary and another is standby. The standby may be cold, warm, or hot. The design is easier to reason about because normal authority is concentrated, but spare capacity may be underused and failover may be slower.

For risk-api, the platform might choose:

eu-central: active, serves most European traffic
eu-west: warm standby, runs 30 percent spare recovery capacity
us-east: emergency lane for degraded but acceptable service

That choice is not only a networking or database choice. It shapes the scheduler. The scheduler needs to know which workloads can run cross-region, which cannot because of data residency or latency, which quotas are regional, and which emergency placements are allowed only during a declared event.
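The scheduler-facing part of that choice can be sketched as a per-workload region allow-list plus a separate emergency list that only applies during a declared event. This is an illustrative sketch; the Workload fields and the eligible_regions function are hypothetical names, and the region assignments for risk-api simply mirror the example above.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Workload:
    name: str
    home_region: str
    # Regions allowed in normal operation (data residency, latency).
    allowed_regions: FrozenSet[str]
    # Regions allowed only while a disaster event is declared.
    emergency_regions: FrozenSet[str] = frozenset()


def eligible_regions(workload: Workload, declared_event: bool) -> FrozenSet[str]:
    """Return the regions where this workload may be placed right now."""
    regions = set(workload.allowed_regions)
    if declared_event:
        regions |= workload.emergency_regions
    return frozenset(regions)


risk_api = Workload(
    name="risk-api",
    home_region="eu-central",
    allowed_regions=frozenset({"eu-central", "eu-west"}),
    emergency_regions=frozenset({"us-east"}),
)
```

Calling eligible_regions(risk_api, declared_event=False) leaves out us-east; the emergency lane becomes eligible only when the event flag is set, which keeps the decision out of the hot scheduling path.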

Scheduling Across a Disaster Boundary

A cross-region scheduler needs to separate normal placement from disaster placement.

Normal placement may optimize for:

Disaster placement must respect a different set of constraints:

The worst design treats failover as a bigger version of normal scheduling. During a regional outage, signals are missing, watch streams lag, autoscalers are unstable, and operators are changing policy under pressure. The system should have precomputed boundaries: which workloads may leave the region, which capacity is held for them, what priority they receive, and when traffic should shift.
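Precomputed boundaries can be as simple as a static table that is consulted, not calculated, during the outage. The sketch below assumes a hypothetical DisasterPlan record and PLANS table; the reserved replica count and priority are made-up values, while the eu-west target and the 40 percent traffic shift follow the risk-api example used in this lesson.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class DisasterPlan:
    target_region: str          # where the workload may move
    reserved_replicas: int      # capacity held for it (hypothetical value)
    priority: int               # priority it receives in the target region
    traffic_shift_percent: int  # share of traffic to shift on activation


# Keyed by (workload, failed region). Built and reviewed before any incident.
PLANS: Dict[Tuple[str, str], DisasterPlan] = {
    ("risk-api", "eu-central"): DisasterPlan(
        target_region="eu-west",
        reserved_replicas=12,
        priority=100,
        traffic_shift_percent=40,
    ),
}


def lookup_plan(workload: str, failed_region: str) -> Optional[DisasterPlan]:
    """Return the prepared plan, or None if the workload must stay local."""
    return PLANS.get((workload, failed_region))
```

A workload with no entry gets None, which in this sketch means it does not leave the region. Nothing here depends on live capacity signals, so the lookup still works when watches lag.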

Global Pool Versus Regional Cells

One global pool can raise average utilization. A quiet region can accept work from a busy one, and batch jobs can chase spare capacity. But global pooling also couples failure domains. A bad rollout, noisy tenant, stale quota, or scheduler bug can spread across every region that trusts the same control loop.

Regional cells reduce that coupling. Each cell owns its local scheduling queue, cache, capacity model, and runtime enforcement. A global layer can make slower decisions: where to keep reserves, when to rebalance tenants, and how to prepare disaster capacity. The local layer makes fast binding decisions inside a bounded blast radius.

A useful split looks like this:

global layer:
  desired regional footprint
  disaster policy
  reserve targets
  traffic intent
  tenant-level capacity plans

regional layer:
  admission within regional policy
  local queue ordering
  node binding
  runtime enforcement
  local repair and backpressure

This split may leave capacity stranded in one region while another is busy. That is the cost of containment. The next lesson will examine cost, latency, and utilization directly; the key point here is that disaster boundaries are not free.
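The stranded-capacity cost can be made concrete with a small calculation. This is a deliberately simplified sketch: the capacity and demand numbers are hypothetical, and the global-pool function ignores residency and latency constraints that would limit real cross-region placement.

```python
from typing import Dict


def unplaced_with_global_pool(demand: Dict[str, int], capacity: Dict[str, int]) -> int:
    """Replicas left unplaced if all regions form one pool."""
    return max(0, sum(demand.values()) - sum(capacity.values()))


def unplaced_with_cells(demand: Dict[str, int], capacity: Dict[str, int]) -> int:
    """Replicas left unplaced if each region only serves its own demand."""
    return sum(max(0, demand[r] - capacity.get(r, 0)) for r in demand)


def stranded_with_cells(demand: Dict[str, int], capacity: Dict[str, int]) -> int:
    """Idle capacity that cells cannot lend to a busier region."""
    return sum(max(0, capacity[r] - demand.get(r, 0)) for r in capacity)


# Hypothetical numbers, in replica slots.
capacity = {"eu-central": 100, "eu-west": 100}
demand = {"eu-central": 130, "eu-west": 60}
```

With these numbers the global pool places everything, while regional cells leave 30 replicas unplaced in eu-central and 40 slots idle in eu-west. That gap is the price paid for keeping a scheduler bug or stale quota in one region from reaching the other.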

Worked Example: Failing Out of eu-central

Imagine eu-central starts dropping control-plane writes and risk-api readiness falls. Traffic is still partially flowing, but scheduler watches are delayed and node health is unreliable.

A weak failover path reacts late and globally:

eu-central metrics degrade
autoscalers request more replicas in eu-central
global scheduler sees stale capacity
batch work continues claiming spare GPUs in eu-west
operators manually shift traffic
risk-api competes with ordinary work in the recovery region

A stronger design has a disaster boundary:

1. Regional health marks eu-central as degraded.
2. Traffic controller shifts 40 percent of requests to eu-west.
3. Disaster controller activates eu-west recovery lane for risk-api.
4. Local eu-west scheduler protects reserved capacity from lower-priority tenants.
5. Global planner pauses non-critical cross-region migration into eu-west.
6. Operators can inspect which work moved, which stayed local, and why.
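Steps 3 and 4 above can be sketched as an admission check in the recovery region. This is a minimal illustration, not a real scheduler API: RegionCapacity and its fields are hypothetical, and the 30 out of 100 reserve mirrors the 30 percent spare recovery capacity assumed for eu-west earlier.

```python
from dataclasses import dataclass


@dataclass
class RegionCapacity:
    total: int
    reserved_for_recovery: int
    used_general: int = 0
    used_recovery: int = 0
    recovery_lane_active: bool = False  # set by the disaster controller

    def admit(self, replicas: int, recovery: bool) -> bool:
        """Admit work if it fits its lane; returns True when admitted."""
        if recovery:
            if not self.recovery_lane_active:
                # Recovery placements are only valid during a declared event.
                return False
            free = self.total - self.used_general - self.used_recovery
            if replicas > free:
                return False
            self.used_recovery += replicas
            return True
        # General work may never dip into the reserve, even while it sits idle.
        protected = max(self.reserved_for_recovery, self.used_recovery)
        if self.used_general + replicas > self.total - protected:
            return False
        self.used_general += replicas
        return True


eu_west = RegionCapacity(total=100, reserved_for_recovery=30)
```

In this sketch ordinary tenants can fill eu-west up to 70 slots and are refused beyond that, while risk-api recovery replicas are refused until recovery_lane_active is set and then fit into the 30 slots that were held back.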

The scheduler did not solve disaster recovery alone. It enforced a prepared policy at the moment when normal signals became less trustworthy. That is the difference between a disaster boundary and an improvised global rebalance.

Operational Failure Modes

Connections

Resources

Key Takeaways
