Distributed Schedulers and Control Planes: Multi-Tenant Isolation and Noisy Neighbor Control

Lesson 014, 35 min, advanced

The core idea: Multi-tenant control planes need isolation at admission, scheduling, runtime, and observability boundaries, so the design trade-off is between high shared-fleet utilization and predictable behavior when one tenant becomes noisy.

Core Insight

Suppose payments-prod, fraud-labs, and analytics-dev all share the eu-central compute fleet. payments-prod runs risk-api and needs predictable latency. fraud-labs runs GPU-heavy experiments that can saturate accelerators. analytics-dev runs notebooks that sit idle most of the day, then suddenly fan out hundreds of jobs after a product launch. The platform saves money by sharing capacity, but tenants should not be able to accidentally rewrite each other's policy, consume each other's quota, or turn one workload's burst into everyone else's outage.

Multi-tenancy is not just "put each tenant in a namespace." A namespace, project, queue, account, or cell is a naming boundary. Isolation requires enforcement boundaries: who may create work, what capacity they may claim, where that work may land, which runtime resources it can consume, how priority is resolved, and which signals prove that one tenant is harming another.

The difficult part is that perfect isolation wastes capacity, while aggressive sharing creates noisy neighbor risk. A good distributed scheduler makes that trade-off explicit. It lets safe sharing happen where resources are elastic or overcommittable, and it creates hard boundaries where a tenant could violate security, exhaust scarce capacity, or damage another tenant's SLO.

Isolation Dimensions

A control plane usually needs several isolation dimensions at once: admission, scheduling, runtime, and observability.

These dimensions reinforce each other. Admission can reject a request from analytics-dev that asks for the recovery-critical priority class. The scheduler can keep risk-api replicas away from nodes already crowded with experimental notebooks. Runtime controls can throttle CPU bursts. Observability can show that GPU queue age for fraud-labs is rising without blaming payments-prod.

No single mechanism is enough. Quota without runtime limits can admit the right number of jobs and still let one job consume the node. Runtime limits without scheduling policy can place incompatible tenants together. Scheduling policy without admission can leave tenants submitting impossible or unauthorized requests.

Noisy Neighbors

A noisy neighbor is a workload or tenant whose behavior degrades others that share a resource. The shared resource may be obvious, like CPU or GPUs, or hidden, like disk I/O, network egress, metadata API capacity, image pulls, API server QPS, or controller work queue depth.

Noisy neighbor failures often start with a narrow symptom:

analytics-dev starts 400 notebooks
image pulls saturate node network
fraud-labs GPU workers start slowly
risk-api rollout gate sees delayed readiness
autoscaler interprets lag as demand
scheduler queue grows across tenants

The mistake is to treat this as a generic capacity incident. The real question is which tenant crossed which boundary, and whether the platform had a boundary there at all. If the platform cannot attribute pressure to a tenant, workload class, node pool, or control-plane path, operators are left with blunt tools such as stopping all new work or adding expensive emergency capacity.

Noisy neighbor control needs both prevention and diagnosis. Prevention keeps one tenant from exceeding agreed limits. Diagnosis explains which shared resource is contested when limits were insufficient or missing.
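
The diagnosis half can be sketched as a small attribution step: given usage samples labeled by tenant and resource, find the contested resources and rank tenants by their share. This is a minimal sketch, not a real monitoring API. The function name, the sample format, the 90% threshold, and all numbers are assumptions made for illustration.

```python
from collections import defaultdict

# Assumption: a resource counts as contested above 90% of its capacity.
SATURATION_THRESHOLD = 0.9


def attribute_pressure(samples, capacity):
    """Return {resource: [(tenant, share_of_capacity), ...]} for contested resources.

    samples:  iterable of (tenant, resource, usage) tuples
    capacity: dict mapping resource name to total capacity in the same unit
    """
    usage = defaultdict(lambda: defaultdict(float))
    for tenant, resource, amount in samples:
        usage[resource][tenant] += amount

    contested = {}
    for resource, per_tenant in usage.items():
        total = sum(per_tenant.values())
        if total / capacity[resource] >= SATURATION_THRESHOLD:
            # Rank tenants by how much of the contested resource they hold.
            ranked = sorted(per_tenant.items(), key=lambda kv: kv[1], reverse=True)
            contested[resource] = [
                (tenant, round(used / capacity[resource], 2)) for tenant, used in ranked
            ]
    return contested


# Hypothetical samples: notebook image pulls dominate node network bandwidth.
samples = [
    ("analytics-dev", "node_network_gbps", 8.5),
    ("fraud-labs", "node_network_gbps", 1.0),
    ("payments-prod", "node_network_gbps", 0.3),
    ("fraud-labs", "gpu", 6),
    ("payments-prod", "gpu", 2),
]
capacity = {"node_network_gbps": 10.0, "gpu": 16}
report = attribute_pressure(samples, capacity)
```

With these made-up numbers only the node network is reported as contested, and analytics-dev is ranked first. That turns a generic saturation alert into a statement about one tenant and one shared resource.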

Enforcement Layers

The enforcement layers should line up with the point where the decision is safest.

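Admission is useful for requests that should never become desired state, such as a GPU request from a tenant with no GPU quota or a request for a priority class the tenant is not allowed to use.

Scheduling is useful for deciding where valid work should run, such as keeping notebooks away from GPU node pools or protecting a recovery lane from lower-priority work.

Runtime controls are useful after placement, such as throttling CPU bursts or allocating GPU memory exclusively to bound workers.

Reconciliation is useful for repair, when actual state has drifted from what policy allows.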

The design goal is not to duplicate every rule in every layer. The goal is to make each layer responsible for the decision it can enforce reliably, and to produce a reason that downstream controllers and operators can understand.
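
As an illustration of a layer that enforces its own decision and produces a reason, here is a minimal admission sketch. The policy table, the `admit` function, the reason strings, and the CPU quotas are hypothetical. Only the GPU figures (4 for payments-prod, 12 for fraud-labs, none for analytics-dev) and the recovery-critical priority class come from this lesson.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    tenant: str
    priority_class: str
    cpu: int
    gpu: int


# Hypothetical per-tenant policy; CPU quotas are invented for the example.
TENANT_POLICY = {
    "payments-prod": {
        "priority_classes": {"recovery-critical", "standard"},
        "cpu_quota": 64,
        "gpu_quota": 4,
    },
    "fraud-labs": {
        "priority_classes": {"standard"},
        "cpu_quota": 32,
        "gpu_quota": 12,
    },
    "analytics-dev": {
        "priority_classes": {"best-effort"},
        "cpu_quota": 16,
        "gpu_quota": 0,
    },
}


def admit(request, cpu_in_use=0, gpu_in_use=0):
    """Return (admitted, reason) so later layers and operators see why."""
    policy = TENANT_POLICY.get(request.tenant)
    if policy is None:
        return False, "unknown-tenant"
    if request.priority_class not in policy["priority_classes"]:
        return False, "priority-class-not-allowed"
    if gpu_in_use + request.gpu > policy["gpu_quota"]:
        return False, "gpu-quota-exceeded"
    if cpu_in_use + request.cpu > policy["cpu_quota"]:
        return False, "cpu-quota-exceeded"
    return True, "admitted"


escalation = admit(Request("analytics-dev", "recovery-critical", cpu=1, gpu=0))
gpu_notebook = admit(Request("analytics-dev", "best-effort", cpu=2, gpu=1))
experiment = admit(Request("fraud-labs", "standard", cpu=4, gpu=2))
```

The sketch checks only what admission can decide reliably: identity, allowed priority, and quota. Placement and runtime throttling are deliberately left to other layers.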

Fair Sharing Versus Hard Isolation

Some resources should be shared opportunistically. Idle CPU in analytics-dev can be useful to fraud-labs as long as it can be reclaimed. Idle batch capacity can run low-priority notebooks. Empty zones can absorb temporary overflow.

Other resources need hard isolation. Protected risk-api recovery capacity may stay unused during normal operation because its value appears during failure. GPU memory may be exclusive because overcommitment causes job failure rather than graceful slowdown. Security-sensitive tenants may require separate node pools or cells because runtime isolation is not the only risk.

A useful policy distinguishes at least three classes:

guaranteed: reserved, protected, and hard to preempt
burstable: allowed to use spare capacity but can be throttled or reclaimed
best-effort: admitted only when spare capacity exists

Those classes are not moral judgments about tenants. They are operational contracts. payments-prod may get guaranteed recovery capacity. fraud-labs may get quota for scheduled experiments and burst access when spare GPUs exist. analytics-dev may get best-effort notebooks with clear backpressure when the fleet is under pressure.

The scheduler must expose the reason when sharing stops. "Pending" is not enough. A tenant needs to know whether it is blocked by quota, protected capacity, topology, runtime limits, low priority, or a noisy-neighbor protection rule.
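
One way to picture the three classes and the reasons they expose is a single decision function. This is a sketch under simplifying assumptions: capacity is a single number, spare capacity already excludes anything reserved for guaranteed work, and the function name and reason strings are invented for illustration.

```python
from enum import Enum


class ServiceClass(Enum):
    GUARANTEED = "guaranteed"
    BURSTABLE = "burstable"
    BEST_EFFORT = "best-effort"


def decide(service_class, requested, reserved_free, quota_free, spare_free):
    """Return (placed, reason) for one request.

    reserved_free: unused capacity reserved for this tenant's guaranteed work
    quota_free:    unused part of this tenant's own quota
    spare_free:    idle capacity nobody has reserved (reclaimable if lent out)
    """
    if service_class is ServiceClass.GUARANTEED:
        if requested <= reserved_free:
            return True, "placed-on-reserved-capacity"
        return False, "reserved-capacity-exhausted"

    if service_class is ServiceClass.BURSTABLE:
        if requested <= quota_free:
            return True, "placed-within-quota"
        if requested <= quota_free + spare_free:
            return True, "placed-using-reclaimable-spare"
        return False, "blocked-by-quota-and-no-spare"

    # Best-effort work runs only on spare capacity.
    if requested <= spare_free:
        return True, "placed-on-spare-capacity"
    return False, "blocked-no-spare-capacity"


recovery = decide(ServiceClass.GUARANTEED, 4, reserved_free=4, quota_free=0, spare_free=0)
burst = decide(ServiceClass.BURSTABLE, 6, reserved_free=0, quota_free=4, spare_free=4)
notebook = decide(ServiceClass.BEST_EFFORT, 2, reserved_free=4, quota_free=0, spare_free=0)
```

Every branch returns a reason, so a blocked tenant sees something more specific than "Pending". In this example the best-effort notebook is refused even though reserved capacity sits idle, which is the operational contract described above.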

Worked Example: Notebook Burst During a Recovery Event

Imagine this starting point:

payments-prod:
  risk-api recovery lane: 4 GPUs guaranteed
fraud-labs:
  experiment quota: 12 GPUs burstable
analytics-dev:
  notebooks: best-effort CPU, no guaranteed GPUs

At 09:00, analytics-dev starts hundreds of notebooks. At 09:05, risk-api begins a regional recovery and needs its protected lane. At 09:10, fraud-labs submits a GPU experiment.

A weak design sees only global free capacity:

notebooks consume shared CPU and network
image pulls slow down GPU worker startup
fraud-labs jobs partially bind
risk-api recovery replicas wait behind ordinary work
operators see a generic cluster saturation alert

A stronger design applies boundaries at several points:

admission:
  analytics-dev notebooks admitted only within object and CPU quotas
  GPU requests rejected unless tenant has GPU quota

scheduling:
  risk-api recovery lane protected from lower-priority work
  notebooks kept away from GPU node pools
  fraud-labs jobs queued with visible burstable-quota reason

runtime:
  notebooks CPU-throttled and network/image-pull pressure limited
  GPU memory allocated exclusively to bound workers

observability:
  dashboards show tenant, resource, node pool, and pending reason

This design may leave some machines less than fully utilized during calm periods. That is the price of predictable recovery. The platform can still reclaim idle capacity through burstable classes, but reclaimability is a contract, not a hope.
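
The GPU side of this example can be replayed in a few lines. The sketch below is a toy model, not a real scheduler. The fleet size of 12 GPUs, the request sizes, and the reason strings are assumptions; the quotas and the 4-GPU protected lane come from the starting point above.

```python
TOTAL_GPUS = 12  # assumption: the lesson does not state the fleet size
PROTECTED = {"payments-prod": 4}  # guaranteed recovery lane
GPU_QUOTA = {"payments-prod": 4, "fraud-labs": 12, "analytics-dev": 0}


def schedule(events, total=TOTAL_GPUS):
    """Replay (time, tenant, gpus) events and return (time, tenant, outcome) tuples."""
    used = {tenant: 0 for tenant in GPU_QUOTA}
    log = []
    for time, tenant, gpus in events:
        # Admission: never admit beyond the tenant's GPU quota.
        if used[tenant] + gpus > GPU_QUOTA[tenant]:
            log.append((time, tenant, "rejected: gpu-quota"))
            continue
        raw_free = total - sum(used.values())
        # Scheduling: hold back other tenants' unused protected capacity.
        held_back = sum(
            max(reserved - used[owner], 0)
            for owner, reserved in PROTECTED.items()
            if owner != tenant
        )
        if gpus <= raw_free - held_back:
            used[tenant] += gpus
            log.append((time, tenant, "bound"))
        elif gpus <= raw_free:
            log.append((time, tenant, "pending: protected-capacity"))
        else:
            log.append((time, tenant, "pending: insufficient-capacity"))
    return log


# The timeline from the example, with hypothetical request sizes.
timeline = schedule([
    ("09:00", "analytics-dev", 1),
    ("09:05", "payments-prod", 4),
    ("09:10", "fraud-labs", 10),
])

# Variation: the experiment arrives before the recovery starts.
early_experiment = schedule([
    ("09:03", "fraud-labs", 10),
    ("09:05", "payments-prod", 4),
])
```

In the variation, the fraud-labs job waits with a protected-capacity reason even though 12 GPUs are idle, and the recovery still binds at 09:05. That idle time is the utilization cost the paragraph above describes.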

Operational Failure Modes

Connections

Resources

Key Takeaways
