LESSON
Day 493: Backpressure and Flow Control
The core idea: Backpressure makes overload explicit. A slower stage advertises limited capacity, upstream stages reduce in-flight work, and the system trades raw peak throughput for bounded memory, stable recovery, and predictable latency.
Today's "Aha!" Moment
In 061.md, Harbor Point made manifest-projector replay-safe: if Kafka redelivers booking-confirmed after a crash, the projector can collapse the duplicate into one row in passenger_manifest. That solves the semantic problem. It does not solve the load problem. On the Monday when online check-in opens for the July 14 sailing, Harbor Point sees a burst of 3,000 booking and check-in events per second, while the projector's PostgreSQL sink can only commit about 600 upserts per second because an index rebuild is still running.
If the consumer keeps polling the broker at full speed, it turns the rate mismatch into hidden state. Records pile up in the broker, then in the consumer fetch buffer, then in the in-memory worker queue waiting for database connections. The service looks "busy" right up until heap usage spikes, GC pauses stretch, offset commits stall, and a pod restart forces the same backlog to be deserialized again. The non-obvious point is that overload is not just a sizing mistake; it is a control problem. A production system needs a way for the slowest useful boundary to say "stop sending me more than this."
That is the difference between raw buffering and backpressure. Buffering absorbs short bursts. Backpressure propagates the fact that the burst is no longer short. Harbor Point wants the projector to slow its own polling, keep only a bounded number of manifest updates in flight, and let lag accumulate in a durable place it knows how to recover from. The trade-off is deliberate: the team accepts visible lag and sometimes throttled producers in exchange for avoiding memory blowups, retry storms, and multi-hour replay loops.
Why This Matters
Backpressure and flow control determine where excess work lives when the system cannot keep up. If Harbor Point chooses poorly, excess work lands in the least reliable place: process memory, thread pools, connection pools, or ad hoc retry loops. That is where incidents become confusing. The broker lag might look acceptable while the actual failure is a saturated database, a consumer with 80,000 deserialized records waiting in RAM, or an API tier creating even more downstream work because it cannot see the bottleneck.
A good control scheme makes the failure mode legible. Small bursts are absorbed cheaply. Sustained overload turns into bounded queue growth, reduced fetch sizes, slower acknowledgment, or explicit 429 Too Many Requests responses at the edge. Operators can then ask a real question: is the system temporarily behind, or is the sink permanently underprovisioned? Without that signal, teams tend to "fix" incidents by increasing queue sizes, worker counts, or retry rates, which often amplifies the original bottleneck.
The production consequence is straightforward. Systems with working backpressure degrade into lag and admission control. Systems without it degrade into memory exhaustion, connection thrash, and replay storms. Harbor Point cares because embarkation tools, finance jobs, and guest notifications all share the same event backbone. If one slow consumer can silently pull unbounded work into memory, a localized slowdown becomes a broader reliability incident.
Core Walkthrough
Part 1: Grounded Situation
Keep one Harbor Point pipeline in view:
booking-api -> outbox-relay -> booking-events topic
-> manifest-projector -> passenger_manifest in PostgreSQL
-> embarkation dashboard
During a normal hour, the manifest-projector consumes about 500 events per second and the database keeps up. During the online check-in release window, arrivals jump to 3,000 events per second for several minutes. At the same time, a migration adds write amplification to passenger_manifest, so each upsert now spends longer waiting on disk and index maintenance.
The mismatch is easy to quantify. If Harbor Point accepts 3,000 events per second and completes 600, the deficit is 2,400 events per second. At roughly 5 KB of deserialized state per event across objects, retries, and queue metadata, that is about 12 MB of extra heap pressure per second. Five minutes of "let it buffer" is enough to push gigabytes of extra working set into a consumer that was sized for a few hundred megabytes.
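Written out as a quick sanity check, with the per-event footprint and burst length taken as the rough figures above rather than measurements:

# Back-of-the-envelope backlog growth during the check-in burst.
arrival_rate = 3000                  # events per second during the burst
commit_rate = 600                    # upserts per second the slowed database sustains
bytes_per_event = 5 * 1024           # rough deserialized footprint per event
burst_seconds = 5 * 60               # five minutes of "let it buffer"

deficit = arrival_rate - commit_rate                   # 2,400 events per second falling behind
growth_per_second = deficit * bytes_per_event          # ~12 MB of extra working set per second
backlog_bytes = growth_per_second * burst_seconds
print(f"{backlog_bytes / 2**30:.1f} GiB of backlog")   # ~3.4 GiB in a consumer sized for a few hundred MB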
None of this means Harbor Point should panic and stop consuming instantly. It means the system needs a deliberate place for the backlog to live and a rule for how fast work is allowed to move downstream. In this lesson, the durable backlog is the Kafka topic, not the consumer heap. That choice only works if the consumer can limit how much work it pulls and if slow sink commits feed back into the fetch rate.
Part 2: Mechanism
It helps to separate two related ideas.
Flow control is a local rule between two parties: how much data may be in flight before the sender must wait for more credit. TCP receive windows, HTTP/2 stream windows, and "poll at most 500 records because only 500 permits remain" are all flow-control mechanisms.
Backpressure is the larger system effect: the slow sink causes credits to return more slowly, which reduces consumption upstream, which may eventually slow producers or force admission control. Flow control is the mechanism; backpressure is the propagated consequence.
Harbor Point can implement this in the projector with three decisions:
- Keep the in-memory work queue bounded.
- Tie broker fetch size to remaining downstream capacity.
- Commit offsets only after the corresponding manifest writes complete.
A simple credit-based loop, sketched here in Python, looks like this (consumer, upsert_manifest, and commit_offset stand in for the real client and sink code):

import threading
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 2000      # cap on records sitting between broker poll and committed write
MAX_POLL = 500            # never request more than this in a single fetch

permits = MAX_IN_FLIGHT   # free downstream capacity, guarded by the condition below
capacity = threading.Condition()
workers = ThreadPoolExecutor(max_workers=32)

def consume_loop(consumer):
    global permits
    while True:
        with capacity:
            # Wait for a write to finish instead of polling with zero capacity.
            while permits == 0:
                capacity.wait()
            available = permits
        # Fetch only as much as the sink can plausibly finish soon.
        batch = consumer.poll(max_records=min(available, MAX_POLL))
        for record in batch:
            with capacity:
                permits -= 1
            workers.submit(apply_record, record)

def apply_record(record):
    global permits
    try:
        upsert_manifest(record)    # write the manifest row in PostgreSQL
        commit_offset(record)      # commit the offset only after the write completes
    finally:
        with capacity:
            permits += 1
            capacity.notify()      # let the fetch loop take on more work
The exact API differs by runtime, but the control idea is stable. The fetch loop is not allowed to ingest more records than the sink can plausibly finish soon. When PostgreSQL slows down, permits release more slowly. Smaller polls follow automatically. Lag grows in Kafka, where the backlog is durable and observable, instead of in a Python queue or JVM heap.
That local mechanism is still not enough if upstream keeps generating work faster than Harbor Point can tolerate operationally. At some point, lag age threatens business freshness. The dashboard might be allowed to run two minutes behind during a burst, but not forty minutes behind before boarding closes. That is where backpressure becomes a policy question rather than only a queueing trick. Harbor Point may need to pause low-priority consumers, rate-limit new booking exports, reject optional enrichment jobs, or send 429 Too Many Requests from non-critical APIs so the critical manifest path stays within its freshness budget.
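A minimal sketch of that kind of edge admission control, assuming a hypothetical manifest_lag_age_seconds() reading from monitoring and illustrative thresholds rather than Harbor Point's real budget (respond and process_export are placeholders too):

# Shed optional work while the critical manifest path nears its freshness budget.
FRESHNESS_BUDGET_S = 120      # the dashboard may run up to two minutes behind
SHED_THRESHOLD_S = 90         # start rejecting optional work before the budget is spent

def handle_optional_export(request):
    if manifest_lag_age_seconds() > SHED_THRESHOLD_S:
        # 429 tells callers to back off instead of creating more downstream work.
        return respond(429, headers={"Retry-After": "30"})
    return process_export(request)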
The main design choice is where each boundary should absorb pressure:
- Broker or durable log for replayable backlog
- Bounded in-memory buffers for short smoothing only
- Edge admission control when freshness SLOs would otherwise be violated
- Load shedding for non-essential work that is cheaper to drop than delay
A healthy system usually uses all four. The mistake is letting any one layer pretend the others do not exist.
Part 3: Implications and Trade-offs
Backpressure always imposes a trade-off: the system gives up some apparent throughput to protect stability. Harbor Point will process fewer concurrent manifest updates once the database slows, and upstream clients may see higher latency or explicit throttling. That can feel like "wasted capacity" to a team that only watches request acceptance. In reality, it prevents the more expensive outcome where the service accepts work optimistically, crashes under memory pressure, and replays the same backlog even later.
Buffer size is the clearest example. A larger queue smooths short bursts and improves hardware utilization when slowdowns are brief. The same larger queue hides sustained overload for longer, increases tail latency, and lengthens recovery after a restart because more work sits between ingestion and durable completion. Harbor Point therefore wants buffer sizes based on burst tolerance, not on hope. "Can absorb 30 seconds at 3x load" is defensible. "Large enough that we have not seen OOM yet" is not.
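One way to turn "can absorb 30 seconds at 3x load" into an actual bound, using the rates from this pipeline and the rough 5 KB per-event figure from Part 1; this is a sizing sketch, not a rule:

# Derive the queue bound from an explicit burst-tolerance statement.
normal_rate = 500                   # events per second in a normal hour
burst_rate = 3 * normal_rate        # "3x load" during the burst
sink_rate = 600                     # what the slowed database actually commits
tolerance_seconds = 30              # how long the in-memory buffer alone must absorb the burst

queue_bound = (burst_rate - sink_rate) * tolerance_seconds    # 27,000 records
memory_bound = queue_bound * 5 * 1024                         # ~132 MiB at ~5 KB per event
print(queue_bound, memory_bound / 2**20)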
This also changes how the team interprets metrics. High CPU is not automatically a problem if queue depth and oldest-event age stay within budget. Low CPU is not automatically healthy if consumers are stalled on connection pools and lag is climbing. For this pipeline, the critical set is: broker lag, oldest unprocessed event age, in-flight write count, queue depth, database commit latency, offset-commit latency, and any throttle or reject rate at the edges. Those metrics tell Harbor Point whether it is absorbing a burst, falling into unstable amplification, or deliberately protecting the system by slowing intake.
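One hedged way to read those metrics together, with hypothetical inputs and thresholds, is a rough classification like this; the point is the combination, not any single number:

def interpret(oldest_event_age_s, freshness_budget_s, lag_growing,
              in_flight, in_flight_cap, throttle_rate):
    # Behind but within budget and bounded: the normal shape of a burst.
    if oldest_event_age_s <= freshness_budget_s and in_flight <= in_flight_cap:
        return "absorbing a burst"
    # Over budget while intake is being limited or shed: protection is working.
    if in_flight >= in_flight_cap or throttle_rate > 0:
        return "protecting the system by slowing intake"
    # Over budget, lag still climbing, nothing limiting intake: amplification risk.
    if lag_growing:
        return "unstable amplification"
    return "investigate: stalled consumers or saturated pools"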
The payoff is that overload becomes an engineering choice instead of an accident. Harbor Point can decide that the embarkation dashboard may be 90 seconds stale but must never lose confirmed bookings or force a three-hour replay after a pod crash. Once that policy is explicit, flow-control parameters, queue bounds, and throttling rules become tools for meeting it rather than knobs turned blindly.
Failure Modes and Misconceptions
- "A bigger queue makes the system more resilient." It only makes the first symptom arrive later. If the sink remains slower than the source, the extra queue mostly converts a throughput deficit into memory growth and longer recovery time.
- "Backpressure is a stream-processing feature, not a general systems concern." The same mechanism appears in TCP windows, broker fetch sizes, worker-pool semaphores, database connection pools, and HTTP 429 responses. Any place that controls in-flight work participates in flow control.
- "If consumers are idempotent, replay volume is harmless." Replay-safe code prevents duplicate effects, but it does not make duplicate or delayed work free. CPU, I/O, connection slots, and freshness budgets can still be exhausted by uncontrolled backlog.
- "Autoscaling removes the need for backpressure." Scaling helps only when additional workers increase true sink capacity. If PostgreSQL, a downstream API, or a shared lock is the bottleneck, more consumers usually add contention and make tail latency worse.
- "Lag is always bad, so the consumer should keep polling aggressively." Some lag is the correct outcome under bursty load. The real question is whether lag sits in a durable, observable layer with a bounded freshness impact, not whether it exists at all.
Connections
Connection 1: 061.md made replay safe; this lesson makes replay survivable
The previous lesson ensured that a redelivered booking-confirmed event does not create two manifest rows. This lesson deals with the next production pressure: when thousands of safe events arrive faster than Harbor Point can apply them, the system still needs a controlled way to fall behind.
Connection 2: 063.md uses the same control ideas in change-data-capture pipelines
A CDC connector that reads the source database faster than downstream consumers or sinks can absorb changes creates the same instability. The next lesson turns this load-regulation problem into an integration-architecture question.
Connection 3: ../networking-and-failure-models/003.md explains why retries can amplify overload
Timeouts and retries are reasonable local tools, but under overload they can multiply work. Backpressure is one of the main ways to keep retry behavior from turning a slow system into a collapsing one.
Resources
- [DOC] Reactive Streams
- Link: https://www.reactive-streams.org/
- Focus: Read the specification language around non-blocking backpressure to see how explicit demand signals bound publisher-consumer interaction.
- [DOC] Apache Flink: Monitoring Back Pressure
- Link: https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/monitoring/back_pressure/
- Focus: Study how an actual stream processor exposes blocked operators and why upstream slowdown is often the symptom of a downstream bottleneck.
- [DOC] Apache Kafka Documentation
- Link: https://kafka.apache.org/documentation/
- Focus: Look at consumer fetch and poll configuration to see how a pull-based log lets consumers control intake instead of accepting unbounded push delivery.
- [BOOK] Designing Data-Intensive Applications
- Link: https://dataintensive.net/
- Focus: Revisit the chapters on stream processing, batching, and failure handling with an eye toward where backlog is buffered and how slow consumers shape the system.
Key Takeaways
- Flow control limits local in-flight work; backpressure is the system-wide propagation of that limit when a downstream stage slows.
- A durable broker backlog is usually safer than an unbounded consumer heap, but only if fetch size and concurrency are tied to real sink capacity.
- Backpressure does not eliminate overload; it chooses how overload appears, trading some throughput and freshness for stability and recoverable behavior.
- Once duplicate effects are made safe, the next operational question is whether the system can regulate work without turning a temporary slowdown into a replay and memory incident.