Backpressure, Load, and Cascading Failure

LESSON

Distributed Systems Foundations

012 20 min beginner

Backpressure, Load, and Cascading Failure

Core Insight

At noon, a report-generation service receives a burst of requests. Its workers read from a database that normally completes 200 reports per minute. A queue absorbs the first minute of extra demand, and nothing looks alarming. Then the database slows after a storage incident. Workers can finish only 80 reports per minute while clients keep submitting 240 reports per minute.

If the API accepts every request, it has not made more capacity. It has created a growing list of promises it cannot keep. Every minute the queue gains roughly 160 reports. Waiting time rises, reports become stale, callers hit their timeouts, and retries add more requests to a system that is already behind.

Backpressure is the set of controls that tells demand to slow down, wait elsewhere, accept a smaller result, or be rejected before overload spreads. It can appear as an admission decision, a bounded queue, a concurrency cap, a rate limit, a 429 response, a Retry-After hint, or a graceful reduction in optional features.

The aim is not to make every request succeed during an incident. It is to protect the highest-value promises and preserve a recovery path. Refusing a low-priority report early can be more reliable than accepting it, letting it wait for twenty minutes, and then consuming the capacity needed for a critical transaction.

Load Is A Rate And A Waiting Budget

The most useful starting point is to compare arrival rate with completion rate.

arrival rate:    240 reports/minute
completion rate:  80 reports/minute
net queue growth: 160 reports/minute

When arrivals exceed completions only briefly, a queue can smooth the burst. The service drains the accumulated work later. When the mismatch persists, queue length and queue age both grow. A queue is then recording unkept promises, not providing extra capacity.

The age matters as much as the count. A report requested two seconds ago may still be valuable. A report requested thirty minutes ago may be irrelevant, or worse, may cause a user to act on stale information. Each workload needs a waiting budget:

report deadline:       5 minutes
queue capacity budget: 2,000 jobs
database concurrency:  40 active queries
retry budget:          5% extra requests per minute

These numbers are policies rather than universal defaults. They say when a queued job becomes too old to be useful, how much work the system is willing to hold, how much parallelism the protected dependency can tolerate, and how much extra load retries may create.

Worked Trace: Break The Overload Loop

The report API follows this path:

clients -> report API -> bounded queue -> workers -> database

The database becomes slow at 12:00. Follow the system through the choices it can make.

1. Accepting Everything Creates A Hidden Failure

At first, the API accepts all 240 requests per minute. Workers start more database queries as the queue grows.

12:01
queue depth: 160
oldest job:  40 seconds
database:    80 active queries, rising latency

12:05
queue depth: 800
oldest job:  4 minutes
database:    connection pool exhausted

More workers do not help when the database is the bottleneck. They can make it worse by increasing contention, holding more connections, and causing every query to run longer. The API now returns timeouts. Some clients retry immediately. The system's observed arrival rate becomes larger than the original user demand.

slow database
  -> longer requests
  -> client timeouts
  -> retries
  -> more queued work and active queries
  -> even slower database

This positive feedback loop is a cascading failure. The initial problem was one slow dependency. The cascade happens when waiting and retries amplify it into overload across callers, workers, and connection pools.

2. Bound The Queue And Close Admission Early

The service gives the queue a capacity and age limit. When either limit is crossed, new low-priority report requests are not accepted.

admission rule:
  if queue_depth > 2,000
     or oldest_job_age > 5 minutes
     or database health is degraded:
       reject new standard reports with Retry-After

The rejection is useful information. It prevents the API from claiming it will deliver a report that has little chance of meeting its deadline. A user can be told that reports are delayed, offered an older cached report, or asked to try later. The response is less pleasant than immediate success, but it is honest and bounded.

Admission should be selective. A monthly accounting close may be more important than an optional dashboard refresh. One noisy tenant should not consume every slot. A system can reserve part of the queue or concurrency budget for high-priority work rather than applying one undifferentiated rule to all traffic.

3. Protect The Database With Concurrency Limits

The workers reduce active database work from 80 to a tested limit of 40. A limit does not make the database faster by itself. It prevents overload from turning one slow query into hundreds of competing slow queries.

worker policy:
  at most 40 active database queries
  remaining jobs stay in the bounded queue
  do not start more work merely because workers are idle

This is a concurrency limit. It is different from a queue: the queue limits waiting work; the concurrency limit limits work already stressing the dependency. Both are needed. An unbounded queue can hide a bad rate mismatch. Unlimited concurrency can collapse the dependency before the queue has a chance to help.

4. Shed Work And Enforce Deadlines

As the incident continues, the system drops jobs that can no longer be useful and degrades optional features. Cached reports may be served for a dashboard. Analytics enrichment can pause. Critical security and billing actions keep their protected capacity.

preserve first:
  payment confirmation
  account security actions
  committed order receipts

degrade or delay first:
  dashboard refresh
  recommendations
  optional analytics
  expired report jobs

Load shedding means deliberately refusing or simplifying lower-priority work to protect the path that matters most. A job that has exceeded its deadline is not a free future task; it may be harmful work consuming recovery capacity. The system should discard it with a traceable reason or send it to a repair path if the business requires later completion.

5. Make Retries Part Of The Capacity Plan

Retries are necessary for some transient errors, but they must be constrained. The API uses exponential backoff with jitter, an overall deadline, and a retry budget. A Retry-After response makes callers wait instead of retrying in synchronized waves.

retry policy:
  retry only retryable failures
  wait with randomized backoff
  stop at the request deadline
  do not exceed the retry budget
  reuse idempotency keys for state-changing work

Without jitter, many callers retry at exactly the same interval and create a second spike. Without a budget, every dependency slowdown can multiply traffic. Without an idempotency key, retries may duplicate a payment or order. Backpressure is therefore part of correctness as well as capacity management.

6. Recover Gradually

At 12:20, database latency improves. The system does not instantly reopen every gate. It first confirms that active-query latency, queue age, and error rates are returning to safe ranges. It drains still-useful work, discards expired jobs, increases admission in steps, and restores optional features after the core path is stable.

recovery order:
  stabilize protected dependency
  drain useful high-priority work
  raise admission gradually
  restore optional work last

Reopening all gates at once can recreate the very surge that caused the incident. Recovery is a controlled transition, not a binary switch.

Controls Apply Pressure At Different Boundaries

Admission control answers whether a new request may enter. It protects the whole path from promises the system cannot keep.

Rate limits answer how quickly one user, tenant, or upstream service may submit work. They prevent a noisy producer from consuming shared capacity.

Concurrency limits answer how much active work may reach a dependency. They protect the dependency's connection pool, CPU, memory, and tail latency.

Bounded queues absorb short variation but force a policy once waiting work is too large or too old.

Load shedding and degraded modes decide which features can return a smaller answer or no answer so critical paths survive.

Retry budgets and deadlines stop recovery behavior from becoming a new source of load.

None of these controls is sufficient alone. A rate limit cannot prevent a database from slowing. A queue cannot make a persistent rate mismatch disappear. A concurrency cap can protect the database while still leaving callers to wait forever if no deadline exists. The controls form a layered response to the same pressure.

Failure Modes And Operational Signals

The obvious failure is accepting every request until the system stops serving anything well. It often looks friendly at the API boundary and hostile to users later, after a long wait and an ambiguous timeout.

Another failure is protecting the wrong workload. If recommendation jobs retain full concurrency while payment confirmation waits, the system has made an implicit product decision without naming it. Priority classes should be explicit, tested, and visible in the admission rules.

A third failure is treating queue depth alone as health. A shallow queue can contain jobs that are already too old, and a deep queue can be safe for a short burst. Queue age, deadline expiry, completion rate, and dependency latency show whether the buffer is absorbing variation or accumulating failure.

Useful signals include:

Deadlines must travel with the work. If the caller has only two seconds left, handing a job to a queue that will wait for five minutes does not preserve the caller's promise. The worker needs to know the remaining budget and decline a database call that cannot finish before it expires. Likewise, a downstream call should receive a shorter deadline than its caller, leaving time for the caller to record a useful outcome or return a controlled pending response. This prevents a request from surviving independently in every layer after it has already become useless to the user.

For state-changing work, the expiration policy must also distinguish “safe to drop” from “must reconcile.” An expired dashboard refresh can be discarded. An expired payment authorization may need a compensating void or a status check before it disappears from the queue. Backpressure does not remove responsibility for side effects; it makes the system decide which delayed work still has an obligation attached to it.

The trade-off is completeness versus survival. Under normal load, serving every request at full fidelity may be correct. Under sustained overload, a smaller, explicit promise—reject, defer, cache, or degrade—can protect the state transitions that must remain trustworthy.

Design Check

Choose one path: checkout, search, login, video upload, report generation, notification send, or background import. Without looking back, write:

protected dependency:
normal arrival and completion rates:
maximum useful queue age and capacity:
concurrency limit:
admission rule and user response:
work to preserve:
work to shed or degrade:
retry deadline, backoff, and budget:
idempotency requirement:
recovery ramp and metric that permits it:

Then increase arrivals above completions for ten minutes. If the design has no point where it stops accepting work that cannot meet its promise, it is postponing overload rather than controlling it.

Resources

Key Takeaways

PREVIOUS Observability and Debugging Distributed Systems NEXT Schemas, Contracts, and Versioned Messages