Distributed Schedulers and Control Planes: Multi-Tenant Isolation and Noisy Neighbor Control

014 35 min advanced

The core idea: Multi-tenant control planes need isolation at the admission, scheduling, runtime, and observability boundaries. The design trade-off is between high shared-fleet utilization and predictable behavior when one tenant becomes noisy.

Core Insight

Suppose payments-prod, fraud-labs, and analytics-dev all share the eu-central compute fleet. payments-prod runs risk-api and needs predictable latency. fraud-labs runs GPU-heavy experiments that can saturate accelerators. analytics-dev runs notebooks that are idle most of the day, then suddenly fan out hundreds of jobs after a product launch. The platform saves money by sharing capacity, but the tenants should not be able to accidentally rewrite each other's policy, consume each other's quota, or make one workload's burst look like everyone else's outage.

Multi-tenancy is not just "put each tenant in a namespace." A namespace, project, queue, account, or cell is a naming boundary. Isolation requires enforcement boundaries: who may create work, what capacity they may claim, where that work may land, which runtime resources it can consume, how priority is resolved, and which signals prove that one tenant is harming another.

The difficult part is that perfect isolation wastes capacity, while aggressive sharing creates noisy neighbor risk. A good distributed scheduler makes that trade-off explicit. It lets safe sharing happen where resources are elastic or overcommittable, and it creates hard boundaries where a tenant could violate security, exhaust scarce capacity, or damage another tenant's SLO.

Isolation Dimensions

A control plane usually needs several isolation dimensions at once:

Identity isolation: tenants, service accounts, and operators have separate permissions.
API isolation: tenants can see and mutate only the objects they own.
Policy isolation: one tenant cannot opt into protected priority, placement, or rollout scopes without permission.
Quota isolation: tenants have bounded claims on CPU, memory, GPU, storage, network, and object count.
Placement isolation: sensitive or high-priority work can be separated by node pool, zone, cell, topology, or hardware class.
Runtime isolation: cgroups, containers, sandboxes, network policy, and storage controls limit what running work can consume or reach.
Operational isolation: metrics, alerts, logs, and incident ownership make tenant-specific impact visible.

These dimensions reinforce each other. Admission can reject a request from analytics-dev that asks for the recovery-critical priority class. The scheduler can keep risk-api replicas away from nodes already crowded with experimental notebooks. Runtime controls can throttle CPU bursts. Observability can show that GPU queue age for fraud-labs is rising without blaming payments-prod.

No single mechanism is enough. Quota without runtime limits can admit the right number of jobs and still let one job consume the node. Runtime limits without scheduling policy can place incompatible tenants together. Scheduling policy without admission can leave tenants submitting impossible or unauthorized requests.

Noisy Neighbors

A noisy neighbor is a workload or tenant whose behavior degrades others that share a resource. The shared resource may be obvious, like CPU or GPUs, or hidden, like disk I/O, network egress, metadata API capacity, image pulls, API server QPS, or controller work queue depth.

Noisy neighbor failures often start with a narrow symptom and then cascade:

analytics-dev starts 400 notebooks
image pulls saturate node network
fraud-labs GPU workers start slowly
risk-api rollout gate sees delayed readiness
autoscaler interprets lag as demand
scheduler queue grows across tenants

The mistake is to treat this as a generic capacity incident. The real question is which tenant crossed which boundary, and whether the platform had a boundary there at all. If the platform cannot attribute pressure to a tenant, workload class, node pool, or control-plane path, operators are left with blunt tools such as stopping all new work or adding expensive emergency capacity.

Noisy neighbor control needs both prevention and diagnosis. Prevention keeps one tenant from exceeding agreed limits. Diagnosis explains which shared resource is contested when limits are insufficient or missing.
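The diagnosis half can be sketched as a small attribution pass. This is a minimal illustration, not a real monitoring API: the function name attribute_pressure, the sample tuples, the resource name node_network_mbps, and the agreed-share table are all hypothetical. It sums usage per tenant and resource, then flags tenants that exceed an agreed share or that have no boundary defined at all.

```python
from collections import defaultdict


def attribute_pressure(samples, capacity, agreed_share):
    """Attribute pressure on shared resources to tenants.

    samples:      iterable of (tenant, resource, amount_used)
    capacity:     resource -> total capacity
    agreed_share: (tenant, resource) -> maximum fraction of capacity
    Returns a list of (tenant, resource, share, finding) tuples.
    """
    used = defaultdict(float)
    for tenant, resource, amount in samples:
        used[(tenant, resource)] += amount

    findings = []
    for (tenant, resource), amount in sorted(used.items()):
        share = amount / capacity[resource]
        limit = agreed_share.get((tenant, resource))
        if limit is None:
            # The platform had no boundary here at all.
            findings.append((tenant, resource, round(share, 2), "no boundary defined"))
        elif share > limit:
            findings.append((tenant, resource, round(share, 2), "over agreed share"))
    return findings


# Hypothetical numbers for the image-pull scenario above.
samples = [
    ("analytics-dev", "node_network_mbps", 700),
    ("fraud-labs", "node_network_mbps", 100),
    ("payments-prod", "node_network_mbps", 50),
]
capacity = {"node_network_mbps": 1000}
agreed_share = {
    ("fraud-labs", "node_network_mbps"): 0.3,
    ("payments-prod", "node_network_mbps"): 0.3,
}
print(attribute_pressure(samples, capacity, agreed_share))
```

With these assumed numbers, the only finding is analytics-dev on node network with no boundary defined, which answers both questions at once: which tenant crossed a line, and whether a line existed.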

Enforcement Layers

Each enforcement layer should sit at the point where its decision can be made most safely.

Admission is useful for requests that should never become desired state:

tenant exceeds object count or GPU quota
namespace is not allowed to use a protected priority class
workload omits required resource requests
tenant selects a protected node pool directly
policy label claims a canary or recovery lane outside its scope
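The admission checks listed above can be sketched as a pure function that returns rejection reasons. This is an illustrative model only: TenantPolicy, Request, admit, and the pool and priority names are hypothetical and do not correspond to any particular control plane's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TenantPolicy:
    gpu_quota: int = 0
    max_objects: int = 100
    allowed_priorities: frozenset = frozenset({"best-effort"})
    allowed_node_pools: frozenset = frozenset({"shared"})


@dataclass(frozen=True)
class Request:
    tenant: str
    priority: str
    gpus: int = 0
    cpu_request: Optional[float] = None  # None models an omitted resource request
    node_pool: Optional[str] = None      # None lets the scheduler choose the pool


def admit(req, policy, gpus_in_use, objects_in_use):
    """Return a list of rejection reasons; an empty list means admit."""
    reasons = []
    if objects_in_use + 1 > policy.max_objects:
        reasons.append("object count quota exceeded")
    if gpus_in_use + req.gpus > policy.gpu_quota:
        reasons.append("GPU quota exceeded")
    if req.priority not in policy.allowed_priorities:
        reasons.append(f"priority class {req.priority!r} not allowed for tenant")
    if req.cpu_request is None:
        reasons.append("missing required CPU request")
    if req.node_pool is not None and req.node_pool not in policy.allowed_node_pools:
        reasons.append(f"node pool {req.node_pool!r} is protected")
    return reasons


policy = TenantPolicy()
bad = Request("analytics-dev", "recovery-critical", gpus=1, node_pool="gpu-protected")
ok = Request("analytics-dev", "best-effort", cpu_request=0.5)
for reason in admit(bad, policy, gpus_in_use=0, objects_in_use=10):
    print("rejected:", reason)
print("ok request reasons:", admit(ok, policy, gpus_in_use=0, objects_in_use=10))
```

Returning every reason at once, rather than stopping at the first failure, is a design choice in this sketch: it lets a tenant fix a request in one round trip.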

Scheduling is useful for deciding where valid work should run:

spread a tenant across zones
avoid colocating noisy workload classes
reserve a lane for risk-api recovery
prefer nodes with local data while respecting tenant boundaries
backpressure low-priority work when shared resources are tight
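A filter-then-score pass is one common way to express placement rules like these. The sketch below is a simplified assumption, not a real scheduler: the node dictionaries, the field names, and pick_node are hypothetical. It filters out nodes reserved for another tenant, nodes hosting a workload class to avoid, and nodes without enough CPU, then prefers the zone where the tenant has the fewest replicas.

```python
def pick_node(workload, nodes, zones_in_use):
    """Return the chosen node name, or None if no node passes the filters.

    zones_in_use lists the zones of the tenant's already-placed replicas.
    """
    candidates = []
    for node in nodes:
        # Respect tenant boundaries: skip lanes reserved for someone else.
        if node["reserved_for"] not in (None, workload["tenant"]):
            continue
        # Avoid colocating with workload classes this workload must not share with.
        if workload["avoid_classes"] & node["resident_classes"]:
            continue
        if node["free_cpu"] < workload["cpu"]:
            continue
        candidates.append(node)
    if not candidates:
        return None
    # Score: spread across zones first, then prefer the node with more free CPU.
    best = min(candidates, key=lambda n: (zones_in_use.count(n["zone"]), -n["free_cpu"]))
    return best["name"]


nodes = [
    {"name": "n1", "zone": "a", "reserved_for": None, "resident_classes": {"notebook"}, "free_cpu": 8},
    {"name": "n2", "zone": "a", "reserved_for": None, "resident_classes": set(), "free_cpu": 8},
    {"name": "n3", "zone": "b", "reserved_for": None, "resident_classes": set(), "free_cpu": 6},
    {"name": "n4", "zone": "b", "reserved_for": "fraud-labs", "resident_classes": set(), "free_cpu": 16},
]
risk_api = {"tenant": "payments-prod", "cpu": 4, "avoid_classes": {"notebook"}}
print(pick_node(risk_api, nodes, zones_in_use=["a"]))  # prints n3
```

Here n1 is filtered for hosting notebooks, n4 is filtered as another tenant's lane, and n3 beats n2 because the tenant already has a replica in zone a.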

Runtime controls are useful after placement:

CPU shares and throttling
memory limits and eviction policy
GPU partitioning or exclusive allocation
disk and network I/O controls
sandboxing, network policy, and secret boundaries
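To make the first item concrete, the following is a toy model of how weighted CPU sharing and hard limits interact on a contended node. It is an assumption-laden sketch: share_cpu is a hypothetical function, the weights stand in for CPU shares, the limits stand in for throttling ceilings, and real kernels enforce these over time slices rather than in one calculation.

```python
def share_cpu(cores, demands, weights, limits):
    """Weighted fair split of `cores`, capped by each workload's demand and limit.

    Capacity a capped workload cannot use is redistributed to the others
    in proportion to their weights (a simple water-filling loop).
    """
    alloc = {w: 0.0 for w in demands}
    active = set(demands)
    remaining = float(cores)
    while active and remaining > 1e-9:
        total_weight = sum(weights[w] for w in active)
        satisfied = set()
        handed_out = 0.0
        for w in active:
            fair = remaining * weights[w] / total_weight
            cap = min(demands[w], limits[w]) - alloc[w]
            give = min(fair, cap)
            alloc[w] += give
            handed_out += give
            if give >= cap - 1e-9:
                satisfied.add(w)  # reached its demand or its limit
        remaining -= handed_out
        if not satisfied:
            break  # everyone took a full fair share; nothing left to redistribute
        active -= satisfied
    return alloc


# Hypothetical 8-core node: risk-api has a high weight but modest demand,
# the notebook wants far more than its limit allows.
alloc = share_cpu(
    cores=8,
    demands={"risk-api": 2, "notebook": 16},
    weights={"risk-api": 4, "notebook": 1},
    limits={"risk-api": 4, "notebook": 4},
)
print(alloc)
```

In this example risk-api receives the 2 cores it asks for, and the notebook is held at its 4-core limit even though 2 cores stay idle. That idle remainder is the utilization cost of a hard runtime boundary.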

Reconciliation is useful for repair:

release leaked reservations
rebalance tenants after a node drain
reduce quota after a policy change
clean up abandoned notebooks
restore runtime limits that drifted
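Two of these repairs, releasing leaked reservations and restoring drifted limits, fit a simple compare-desired-to-observed loop. The sketch below only plans actions and applies nothing; plan_repairs, the action tuples, and the workload identifiers are hypothetical names chosen for illustration.

```python
def plan_repairs(reservations, live_workloads, desired_limits, observed_limits):
    """Compare desired and observed state and return a list of repair actions.

    reservations:    reservation id -> workload that holds it
    live_workloads:  set of workloads that currently exist
    desired_limits:  workload -> CPU limit the policy asks for
    observed_limits: workload -> CPU limit actually in effect
    """
    actions = []
    for res_id, owner in sorted(reservations.items()):
        if owner not in live_workloads:
            # The owner is gone, so the reservation has leaked.
            actions.append(("release_reservation", res_id))
    for workload, desired in sorted(desired_limits.items()):
        if workload in live_workloads and observed_limits.get(workload) != desired:
            actions.append(("restore_limit", workload, desired))
    return actions


actions = plan_repairs(
    reservations={"r1": "risk-api-0", "r2": "notebook-17"},
    live_workloads={"risk-api-0", "notebook-3"},
    desired_limits={"notebook-3": 2.0, "risk-api-0": 4.0},
    observed_limits={"notebook-3": 8.0, "risk-api-0": 4.0},
)
for action in actions:
    print(action)
```

Because the function is idempotent over the same inputs, running it repeatedly is safe, which is the property a reconciliation loop generally depends on.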

The design goal is not to duplicate every rule in every layer. The goal is to make each layer responsible for the decision it can enforce reliably, and to produce a reason that downstream controllers and operators can understand.

Fair Sharing Versus Hard Isolation

Some resources should be shared opportunistically. Idle CPU in analytics-dev can be useful to fraud-labs as long as it can be reclaimed. Idle batch capacity can run low-priority notebooks. Empty zones can absorb temporary overflow.

Other resources need hard isolation. Protected risk-api recovery capacity may stay unused during normal operation because its value appears during failure. GPU memory may be exclusive because overcommitment causes job failure rather than graceful slowdown. Security-sensitive tenants may require separate node pools or cells because runtime isolation alone does not address every risk.

A useful policy distinguishes at least three classes:

guaranteed: reserved, protected, and hard to preempt
burstable: allowed to use spare capacity but can be throttled or reclaimed
best-effort: admitted only when spare capacity exists

Those classes are not moral judgments about tenants. They are operational contracts. payments-prod may get guaranteed recovery capacity. fraud-labs may get quota for scheduled experiments and burst access when spare GPUs exist. analytics-dev may get best-effort notebooks with clear backpressure when the fleet is under pressure.

The scheduler must expose the reason when sharing stops. "Pending" is not enough. A tenant needs to know whether it is blocked by quota, protected capacity, topology, runtime limits, low priority, or a noisy-neighbor protection rule.
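One way to read the three classes as contracts is a decision function that always returns a reason alongside its verdict. This is a sketch under stated assumptions: ServiceClass, decide, the parameter names, and the exact rules (for example, that bursting beyond quota pauses under pressure) are illustrative choices, not a specification from any scheduler.

```python
from enum import Enum


class ServiceClass(Enum):
    GUARANTEED = "guaranteed"
    BURSTABLE = "burstable"
    BEST_EFFORT = "best-effort"


def decide(service_class, requested, used, quota, spare, under_pressure):
    """Return (verdict, reason) for a request of `requested` units.

    used/quota: the tenant's current usage and its quota or reservation
    spare:      free capacity that is not held for guaranteed work
    """
    if service_class is ServiceClass.GUARANTEED:
        if used + requested <= quota:
            return "run", "within guaranteed reservation"
        return "pending", "quota: guaranteed reservation exhausted"

    if requested > spare:
        return "pending", "protected capacity: no unreserved capacity is free"

    if service_class is ServiceClass.BURSTABLE:
        if used + requested <= quota:
            return "run", "within burstable quota"
        if under_pressure:
            return "pending", "quota: burst beyond quota paused under pressure"
        return "run", "bursting on spare capacity (reclaimable)"

    if under_pressure:
        return "pending", "low priority: best-effort paused under pressure"
    return "run", "best-effort on spare capacity (reclaimable)"


print(decide(ServiceClass.BEST_EFFORT, 2, used=0, quota=0, spare=8, under_pressure=True))
print(decide(ServiceClass.BURSTABLE, 4, used=10, quota=12, spare=8, under_pressure=False))
```

The point of the sketch is the second element of each tuple: every path that stops short of running names the boundary that stopped it.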

Worked Example: Notebook Burst During a Recovery Event

Imagine this starting point:

payments-prod:
  risk-api recovery lane: 4 GPUs guaranteed
fraud-labs:
  experiment quota: 12 GPUs burstable
analytics-dev:
  notebooks: best-effort CPU, no guaranteed GPUs

At 09:00, analytics-dev starts hundreds of notebooks. At 09:05, risk-api begins a regional recovery and needs its protected lane. At 09:10, fraud-labs submits a GPU experiment.

A weak design sees only global free capacity:

notebooks consume shared CPU and network
image pulls slow down GPU worker startup
fraud-labs jobs partially bind
risk-api recovery replicas wait behind ordinary work
operators see a generic cluster saturation alert

A stronger design applies boundaries at several points:

admission:
  analytics-dev notebooks admitted only within object and CPU quotas
  GPU requests rejected unless tenant has GPU quota

scheduling:
  risk-api recovery lane protected from lower-priority work
  notebooks kept away from GPU node pools
  fraud-labs jobs queued with visible burstable-quota reason

runtime:
  notebooks CPU-throttled and network/image-pull pressure limited
  GPU memory allocated exclusively to bound workers

observability:
  dashboards show tenant, resource, node pool, and pending reason

This design may leave some machines less than fully utilized during calm periods. That is the price of predictable recovery. The platform can still reclaim idle capacity through burstable classes, but reclaimability is a contract, not a hope.
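The protected lane in this example can be modeled as a small ledger. The sketch assumes a fleet of 16 GPUs, a number the example does not state and that is chosen here only so the arithmetic works out; GpuLedger and its methods are hypothetical names. Quota enforcement for fraud-labs is left to admission and is not modeled.

```python
class GpuLedger:
    """Tracks GPU usage and keeps unclaimed guaranteed GPUs out of reach of others."""

    def __init__(self, total, reserved):
        self.total = total
        self.reserved = dict(reserved)  # tenant -> guaranteed GPUs
        self.in_use = {}                # tenant -> GPUs currently allocated

    def available_to(self, tenant):
        free = self.total - sum(self.in_use.values())
        # GPUs other tenants are guaranteed but have not claimed yet.
        held_for_others = sum(
            max(0, guaranteed - self.in_use.get(owner, 0))
            for owner, guaranteed in self.reserved.items()
            if owner != tenant
        )
        return max(0, free - held_for_others)

    def allocate(self, tenant, gpus):
        if gpus > self.available_to(tenant):
            return False
        self.in_use[tenant] = self.in_use.get(tenant, 0) + gpus
        return True


ledger = GpuLedger(total=16, reserved={"payments-prod": 4})  # 16 is an assumed fleet size
print(ledger.allocate("fraud-labs", 12))    # True: uses only unreserved GPUs
print(ledger.allocate("fraud-labs", 1))     # False: the remaining 4 are the recovery lane
print(ledger.allocate("payments-prod", 4))  # True: the lane is there when recovery starts
```

During calm periods the 4 reserved GPUs sit idle in this model, which is exactly the utilization cost described above.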

Operational Failure Modes

Namespace-only isolation: tenants are separated in names but not in quota, policy, runtime, or observability. The fix is layered enforcement.
Global queue hides tenant pressure: all pending work looks the same. The fix is pending reasons broken down by tenant, priority, resource, and node pool.
Protected capacity is invisible: users think the platform is wasting resources. The fix is explicit reservation state and documented burst rules.
Best-effort becomes permanent: opportunistic work is never reclaimed during pressure. The fix is priority, preemption, throttling, and clear eviction policy.
Runtime limits missing: admitted work consumes more than it requested. The fix is resource requests plus enforced runtime limits.
Control-plane noisy neighbor: one tenant floods the API, watches, or controller queues. The fix is API rate limits, scoped watches, object quotas, and per-tenant work queues.
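For the control-plane case, per-tenant API rate limiting is often built on token buckets. The sketch below is a minimal version with hypothetical class names (TokenBucket, TenantLimiter) and an explicit now argument instead of a real clock, so the behavior is deterministic. The rate and burst values are arbitrary.

```python
class TokenBucket:
    """Allows `burst` requests at once, then refills at `rate` tokens per second."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.updated = 0.0

    def allow(self, now):
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class TenantLimiter:
    """One bucket per tenant, so one tenant's flood cannot drain another's budget."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.buckets = {}

    def allow(self, tenant, now):
        if tenant not in self.buckets:
            self.buckets[tenant] = TokenBucket(self.rate, self.burst)
        return self.buckets[tenant].allow(now)


limiter = TenantLimiter(rate=1.0, burst=3)
flood = [limiter.allow("analytics-dev", now=0.0) for _ in range(5)]
print(flood)                                    # [True, True, True, False, False]
print(limiter.allow("payments-prod", now=0.0))  # True: separate bucket
print(limiter.allow("analytics-dev", now=2.0))  # True: tokens refilled over 2 seconds
```

Keying the buckets by tenant is the isolation decision here. A single shared bucket would protect the API server but would still let one tenant starve the others.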

Connections

The previous lesson, 013.md, covered admission and API control surfaces. Multi-tenant isolation depends on those front-door checks to keep unsafe requests out of desired state.
The next lesson, 015.md, moves these boundaries across regions, where isolation and disaster recovery policies interact.
multi-tenant-platform-architecture explores tenant models, cell boundaries, and platform product decisions in more depth.

Resources

[DOC] Kubernetes Multi-Tenancy
- Focus: Study tenant isolation as a combination of API, policy, workload, and cluster boundaries.
[DOC] Kubernetes Resource Quotas
- Focus: Connect tenant claims to admission-time limits and resource accounting.
[DOC] Kubernetes Limit Ranges
- Focus: Look at default and bounded resource requests and limits inside a namespace.
[DOC] Pod Priority and Preemption
- Focus: Understand how priority classes and preemption protect important work under contention.
[PAPER] Large-scale cluster management at Google with Borg
- Focus: Compare quota, priority, reservations, and shared-fleet utilization in a production cluster manager.

Key Takeaways

Multi-tenant isolation is layered: identity, API, policy, quota, placement, runtime, and observability all matter.
Noisy neighbor control requires attribution, not just more capacity.
Shared fleets need explicit contracts for guaranteed, burstable, and best-effort work.
The central trade-off is high utilization from sharing versus predictable behavior when tenants compete or fail.
