Day 075: Rate Limiting and Backpressure for Workers
The hard part is not getting workers to go faster. It is deciding how fast the rest of the system can afford to let them go.
Today's "Aha!" Moment
Once a team has queues and worker pools, the next temptation is obvious: if backlog is growing, add workers or let them pull faster. Sometimes that helps. Very often it just moves the bottleneck from the queue to something more fragile, such as a database, a storage layer, or a third-party API with quotas.
Keep one example throughout the lesson. The learning platform accepts video uploads and pushes jobs into a transcode pipeline. Workers read the queue, fetch source files from object storage, call an external encoding provider, and write output metadata back to the database. The queue can hold an hour of work, but the provider only accepts a limited request rate and the database can only tolerate so many concurrent job updates. In that world, "drain faster" is not automatically a success condition.
That is the aha. Worker throughput is not just a performance choice. It is a pressure-control choice. Rate limiting decides how quickly workers may hit a dependency. Concurrency limits decide how many jobs or requests may be in flight at once. Backpressure is the system's response when arriving work is outrunning safe processing capacity. Those are related ideas, but they solve different failure modes.
Once you see the difference, a lot of confusing behavior starts to make sense. A deep queue does not always mean "scale up." Sometimes it means "the bottleneck is elsewhere." Sometimes it means "slow producers, reserve capacity for priority work, or let backlog accumulate briefly instead of detonating a downstream service." A healthy asynchronous system often looks patient from the outside because it is protecting the whole pipeline, not trying to win a short race.
Why This Matters
The problem: Queues make it easy to buffer work, but they do not make downstream capacity infinite. Without throughput control, background workers can turn backlog into overload, retries into storms, and temporary slowness into cascading failure.
Before:
- Worker fleets pull as fast as possible because backlog feels urgent.
- One stressed dependency causes retries that further increase pressure.
- Teams observe queue depth but miss the real saturation point elsewhere.
After:
- Rate and concurrency are treated as explicit control levers, not accidental defaults.
- Backpressure is a normal protective response, not a sign of architectural defeat.
- The system can preserve high-priority work and stay stable while backlog drains at a sustainable pace.
Real-world impact: Fewer overload incidents, fewer retry storms, safer use of external providers, and a more predictable relationship between queue depth and actual system health.
Learning Objectives
By the end of this session, you will be able to:
- Distinguish rate limits, concurrency limits, and backpressure - Explain what each one controls and why they are not interchangeable.
- Reason about overload in worker systems - Identify when backlog should trigger pacing instead of blind acceleration.
- Explain why stability beats reckless drain speed - Connect throughput policy to dependency safety and recovery behavior.
Core Concepts Explained
Concept 1: Concurrency Limits and Rate Limits Protect Different Things
Suppose the platform can safely run only 10 transcoding requests in parallel because each one is heavy, and the encoding provider also allows only 30 submissions per minute. Those are two different constraints.
Concurrency limits bound how many operations are active at the same time. They protect scarce in-flight capacity such as CPU, memory, connection pools, and expensive external sessions. Rate limits bound how quickly new operations may begin over time. They protect against bursts and provider quotas even if each individual request is short.
queue -> worker pool -> concurrency gate -> rate gate -> encoding provider
If you only cap concurrency, workers may still start too many requests per minute once jobs become shorter. If you only cap rate, you may still have too many long-running jobs in flight at once. Good worker systems often need both because one answers "how many active now?" and the other answers "how quickly can we begin more?"
MAX_IN_FLIGHT = 10           # concurrency budget: heavy transcodes active at once
MAX_STARTS_PER_MINUTE = 30   # rate budget: provider submission quota

def can_start_job(in_flight, starts_last_minute):
    # Two independent safety budgets: both must have headroom.
    if in_flight >= MAX_IN_FLIGHT:
        return False
    if starts_last_minute >= MAX_STARTS_PER_MINUTE:
        return False
    return True
The code is intentionally simple. The important teaching point is that the two checks represent different safety budgets, not duplicate knobs.
The trade-off is throughput versus protection. Tight limits reduce peak drain speed, but they prevent a worker fleet from turning one hot dependency into the real outage.
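The two budgets can also be combined into a single gate. The sketch below assumes a single-process Python worker and reuses the example limits (10 in flight, 30 starts per minute); the `DualGate` class and its method names are illustrative inventions, not a real library API.

```python
import threading
import time
from collections import deque

class DualGate:
    """Illustrative sketch combining a concurrency gate and a rate gate."""

    def __init__(self, max_in_flight=10, max_starts=30, window_seconds=60.0):
        self._slots = threading.Semaphore(max_in_flight)  # concurrency budget
        self._starts = deque()            # timestamps of recent starts
        self._max_starts = max_starts     # rate budget per window
        self._window = window_seconds
        self._lock = threading.Lock()

    def try_start(self, now=None):
        """Return True if a job may start; caller must call finish() when done."""
        now = time.monotonic() if now is None else now
        with self._lock:
            # Drop start timestamps that have aged out of the window.
            while self._starts and now - self._starts[0] >= self._window:
                self._starts.popleft()
            if len(self._starts) >= self._max_starts:
                return False              # rate budget exhausted
        if not self._slots.acquire(blocking=False):
            return False                  # concurrency budget exhausted
        with self._lock:
            self._starts.append(now)
        return True

    def finish(self):
        self._slots.release()  # free a slot; the rate window decays on its own
```

Note that the two refusals mean different things: a rate refusal clears by itself as timestamps age out, while a concurrency refusal clears only when some job finishes. That asymmetry is exactly why the two checks are separate budgets, not duplicate knobs.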
Concept 2: Backpressure Is What the System Does When Safe Capacity Has Been Reached
Backpressure begins when the system stops pretending it can consume arbitrary work immediately. It is the feedback path that says, "We are at or near a boundary, so behavior must change."
In the transcode example, backlog may rise because uploads spike during a product launch. If workers continue pulling at maximum pace, the provider starts returning 429 errors, retries pile up, and the database sees a wave of status writes. The queue is not the failure. The failure is converting buffered demand into unsafe active demand.
Backpressure can take several forms:
- workers stop claiming new jobs for a while
- producers slow down or defer low-priority job creation
- the system sheds or postpones non-essential work
- capacity is reserved for urgent classes of jobs
incoming uploads
       |
       v
queue depth rises
       |
       +--> if safe: keep draining
       |
       +--> if saturated: slow claims / slow producers / protect priorities
The key idea is that backpressure is not just "the queue is long." It is a control response to finite capacity. This is why healthy systems sometimes choose to let backlog grow temporarily. A larger queue is often preferable to a melted dependency and a retry storm that makes recovery slower.
The trade-off is responsiveness versus system survival. Backpressure means some work waits longer, but it keeps waiting from becoming collapse.
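The first backpressure form above, workers pausing their claims, can be sketched in a few lines. This is a toy loop assuming Python workers; the threshold values and signal names are invented for illustration, not tuned recommendations.

```python
import time

# Illustrative thresholds, not tuned values.
MAX_DEP_LATENCY_MS = 500   # downstream latency above this suggests saturation
MAX_ERROR_RATE = 0.05      # e.g. fraction of provider 429s plus timeouts

def should_claim(signals):
    """Decide whether a worker should claim the next job right now."""
    if signals["dep_latency_ms"] > MAX_DEP_LATENCY_MS:
        return False  # dependency is struggling: let the backlog buffer
    if signals["error_rate"] > MAX_ERROR_RATE:
        return False  # errors rising: pulling faster would feed a retry storm
    return True

def worker_loop(queue, read_signals, run_job, pause_seconds=5.0):
    while True:
        if not should_claim(read_signals()):
            time.sleep(pause_seconds)  # backpressure: pause claims, don't fail
            continue
        run_job(queue.get())
```

The important detail is that a refused claim is not an error. The worker simply waits, the queue absorbs demand, and draining resumes once the dependency signals recover.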
Concept 3: The Right Goal Is Sustainable Throughput, Not the Fastest Possible Empty Queue
Teams often optimize for one visible number: queue depth. That is understandable because queues are easy to graph. But emptying the queue quickly is not the same as operating the system well.
Imagine two outcomes:
- Option A: the queue drains in 12 minutes, but the provider rate-limits hard, retries spike, and several jobs fail repeatedly.
- Option B: the queue drains in 45 minutes, but workers stay inside quota, failure rate remains low, and high-priority jobs still move steadily.
In most production systems, Option B is the better result. Sustainable throughput is about the rate the whole pipeline can maintain without destabilizing itself.
That is why worker control loops usually watch several signals together:
- queue depth and queue age
- in-flight job count
- dependency latency and error rate
- retry volume
- priority class starvation
When those signals move together, the response should be deliberate. Sometimes the right action is adding workers. Sometimes it is lowering concurrency, reducing retries, or isolating workloads so one noisy class does not consume the entire budget.
The trade-off is short-term speed versus long-term reliability. Sustainable systems often look slower at the burst peak precisely because they recover better and finish more work correctly over the full incident window.
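Those combined signals can feed a small decision step. The policy below is a toy sketch with invented signal names and thresholds; its only point is that the right action depends on which signals move together, exactly as described above.

```python
def recommend_action(signals):
    """Toy control-loop policy: map combined signals to one action.

    Signal names and thresholds are illustrative assumptions."""
    deep_queue = signals["queue_depth"] > 1000
    dep_stressed = (signals["dep_error_rate"] > 0.05
                    or signals["dep_latency_ms"] > 500)
    retry_heavy = signals["retry_fraction"] > 0.2

    if deep_queue and not dep_stressed and not retry_heavy:
        return "add_workers"        # bottleneck really is consumer capacity
    if dep_stressed and retry_heavy:
        return "lower_concurrency_and_retries"  # pulling harder feeds the fire
    if dep_stressed:
        return "lower_concurrency"  # protect the dependency first
    return "hold_steady"
```

Notice that a deep queue alone recommends scaling up, but the same deep queue plus a stressed dependency recommends the opposite. That is the whole argument of this concept in four branches.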
Troubleshooting
Issue: Treating every growing queue as proof that workers should scale up immediately.
Why it happens / is confusing: Queue depth is visible, urgent, and emotionally persuasive.
Clarification / Fix: Check where the real bottleneck sits. If dependency latency, 429s, lock contention, or retries are already rising, more consumers may worsen the problem.
Issue: Using one knob for every overload problem.
Why it happens / is confusing: Rate, concurrency, and backpressure all feel like "slowing things down."
Clarification / Fix: Use concurrency limits for in-flight pressure, rate limits for start frequency, and backpressure for the broader system response once safe capacity is reached.
Issue: Thinking backlog is always worse than slowing work down.
Why it happens / is confusing: Waiting feels like failure, while aggressive drain feels productive.
Clarification / Fix: Backlog is often the buffer that protects the rest of the system. A growing queue is sometimes the system doing its job.
Advanced Connections
Connection 1: Backpressure ↔ Retry Policy
The parallel: Retry behavior changes the effective load on the system, so throughput control and retry policy must be designed together.
Real-world case: A provider outage often requires both lower dispatch rate and less aggressive retries, otherwise the recovery window gets flooded by repeat attempts.
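One common way to make retries less aggressive is exponential backoff with full jitter, in the spirit of the AWS article listed in the resources below. This sketch uses only the Python standard library; the base and cap values are illustrative defaults, not recommendations.

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full-jitter exponential backoff.

    The delay ceiling grows with each attempt (capped), and the actual
    delay is drawn uniformly below it so retrying clients spread out
    instead of synchronizing into waves against a recovering dependency.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```

Pairing this with a lower dispatch rate during an outage means the recovery window sees a thin, randomized trickle of retries rather than a synchronized flood.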
Connection 2: Throughput Control ↔ Worker Pool Architecture
The parallel: Pool sizing, queue partitioning, and throughput limits are all ways of allocating scarce execution budget across workloads.
Real-world case: Teams often isolate high-priority email jobs from heavy media jobs so backpressure on one path does not freeze the other.
Resources
Optional Deepening Resources
- These resources are optional and are not required for the core 30-minute path.
- [ARTICLE] Exponential Backoff and Jitter
- Link: https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
- Focus: Connect retries to load shaping instead of accidental retry storms.
- [DOC] Stripe Rate Limits
- Link: https://docs.stripe.com/rate-limits
- Focus: See how a real external dependency exposes finite capacity and why clients must adapt.
- [DOC] RabbitMQ Consumer Prefetch
- Link: https://www.rabbitmq.com/docs/consumer-prefetch
- Focus: Connect queue consumption to bounded in-flight work rather than unlimited draining.
- [ARTICLE] Apply Back Pressure When Overloaded
- Link: https://mechanical-sympathy.blogspot.com/2012/05/apply-back-pressure-when-overloaded.html
- Focus: Deepen the systems intuition for pacing and overload protection.
Key Insights
- Rate limits and concurrency limits answer different questions - One controls start frequency over time, the other controls how much work is active at once.
- Backpressure is a protective behavior, not a defect - It is how the system reacts when buffered demand exceeds safe processing capacity.
- Stable throughput matters more than heroic drain speed - The goal is to finish work safely and recover cleanly, not just to flatten the queue graph at any cost.
Knowledge Check (Test Questions)
1. What is the main difference between a concurrency limit and a rate limit?
- A) A concurrency limit bounds work currently in flight, while a rate limit bounds how quickly new work may start over time.
- B) A concurrency limit is only for CPUs, while a rate limit is only for APIs.
- C) They are different names for the same mechanism.
2. When is backpressure doing the right thing?
- A) When the system slows or reshapes work because safe downstream capacity has been reached.
- B) When every worker is permanently idle.
- C) When the queue is deleted to avoid backlog.
3. Why is a temporarily growing queue sometimes healthier than draining at maximum speed?
- A) Because buffering demand can protect dependencies from overload and avoid turning backlog into retries and cascading failure.
- B) Because queue depth has no relationship to system behavior.
- C) Because workers should never consume jobs quickly.
Answers
1. A: Concurrency limits cap active work now, while rate limits cap how rapidly new work begins. Many real systems need both.
2. A: Backpressure is the system's controlled response to finite capacity. It protects the pipeline by changing behavior when pressure becomes unsafe.
3. A: A queue is a buffer. Letting it absorb demand for a while is often safer than dumping that demand instantly onto a dependency that cannot cope.