LESSON
Day 436: Replication Flow Control and Backpressure
The core idea: A replicated leader must pace writes to the rate at which its critical followers can durably absorb the log; otherwise backlog turns one slow replica into cluster-wide latency, unstable failover, or exhausted storage.
Today's "Aha!" Moment
In 036.md, Harbor Point learned that safe lease-based reads depend on timely quorum acknowledgments and bounded clock uncertainty. That proof looks elegant on paper until the market opens and replication stops being a neat sequence diagram. At 09:30, Madrid leader md-db-2 is accepting reservation writes for shard 184, local follower md-db-4 is keeping up, and recovering replica ny-db-3 is still replaying a backlog across the Atlantic. The tempting response is "just push harder" so New York catches up while the write burst is still in flight.
That instinct is how leaders melt themselves. If md-db-2 keeps appending new entries, buffering megabytes of unsent WAL for ny-db-3, and sharing the same replication workers for heartbeats and catch-up traffic, the lagging follower stops being a local problem. Sender queues expand, WAL retention grows because old log segments cannot be discarded yet, heartbeats to healthy followers compete with bulk traffic, and the lease-read margin from the previous lesson starts collapsing. What looked like a performance issue becomes a correctness-adjacent availability issue.
Replication backpressure is the mechanism that forces the leader to admit an uncomfortable truth early: the cluster can only move as fast as the path that defines its safety contract. Sometimes that means capping a lagging follower's in-flight bytes so it does not consume unbounded memory. Sometimes it means falling back from lease reads because quorum acknowledgments arrive too close to lease expiry. Sometimes it means slowing or rejecting new writes before a quorum-critical replica falls so far behind that the shard is one fault away from losing safe progress. The misconception to remove is that lag hurts only the slow replica. In production, unmanaged lag changes the behavior of the whole shard.
Why This Matters
Harbor Point's reservation service has a business problem, not a benchmarking problem. During the opening auction, the API can see a 4x jump in writes for some issuers. If replication is allowed to build unlimited backlog during that burst, the system may look healthy for thirty seconds and then fail all at once: memory pressure spikes on the leader, retained WAL fills the disk budget, md-db-4 misses heartbeat deadlines because replication workers are saturated, and a failover that was supposed to take seconds becomes risky because only one replica is truly current.
With deliberate flow control, the system makes a different bargain. It distinguishes a follower that is merely slow from one that is part of the current durability path. It caps background catch-up before queues become invisible debt. It slows writers when the majority path, not just an optional replica, is losing ground. That is a harder message to send to application teams because it turns latent infrastructure trouble into visible latency or 429-style pressure sooner. It is also the honest one. Backpressure is how the database keeps "temporary slowdown" from becoming "replication state we can no longer safely recover from."
Learning Objectives
By the end of this session, you will be able to:
- Explain why replication lag must be modeled as a pipeline - Distinguish send backlog, durable lag, apply lag, and retained-log pressure instead of treating "replica is behind" as one number.
- Trace the mechanisms leaders use to apply backpressure - Follow how per-follower windows, heartbeat prioritization, snapshot cutovers, and write admission control keep backlog bounded.
- Evaluate when to throttle writes versus isolate a lagging replica - Decide which forms of lag threaten quorum safety, read freshness, or only recovery speed.
Core Concepts Explained
Concept 1: Replication lag is a pipeline, so backpressure has to target the blocked stage
Harbor Point's shard 184 now has a clear topology after the previous two lessons: md-db-2 is the active leader in term 42, md-db-4 is the local quorum follower, and ny-db-3 is catching up after the New York failover. At 09:30:00, the shard is receiving about 18,000 reservation writes per second. Each write becomes a log entry on the leader, but that entry still moves through several distinct stages before the cluster can forget about it.
For one follower, the leader cares about at least four positions:
leader last index -> newest entry accepted locally
sent index -> newest entry put on the wire to that follower
durable index -> newest entry the follower has fsynced
applied index -> newest entry visible in the follower's state machine
Those positions answer different operational questions. If sent index trails far behind leader last index, the bottleneck is usually sender scheduling, TCP congestion, or an in-flight window that is already full. If durable index trails far behind sent index, the follower is receiving data but cannot flush it fast enough. If applied index lags while durable index is close, commit safety may still be fine even though follower reads are stale. A single "replication lag = 2.4 seconds" metric hides all of that and makes the wrong fix look plausible.
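A minimal sketch of that per-stage accounting, in Python with hypothetical names (this is illustrative bookkeeping, not any particular system's replication API):

from dataclasses import dataclass

@dataclass
class FollowerProgress:
    sent_index: int     # newest entry put on the wire to this follower
    durable_index: int  # newest entry the follower has fsynced
    applied_index: int  # newest entry visible in the follower's state machine

    def lags(self, leader_last_index: int) -> dict:
        # Each gap implicates a different bottleneck.
        return {
            "send_backlog": leader_last_index - self.sent_index,   # sender scheduling, window, or network
            "durable_lag": self.sent_index - self.durable_index,   # follower flush (fsync) throughput
            "apply_lag": self.durable_index - self.applied_index,  # state-machine replay speed
        }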
At Harbor Point during the burst, the state might look like this:
leader last index: 813620
md-db-4 durable: 813612 (healthy quorum follower)
ny-db-3 durable: 809940 (recovering over WAN)
ny-db-3 applied: 809100 (replay still catching up)
The production danger is not only that New York is behind. It is that the leader must retain every log segment after 809940 until ny-db-3 catches up or is re-seeded by snapshot. If the sender also keeps buffering new entries for New York without a cap, backlog grows in two places at once: leader memory and retained WAL on disk. Flow control begins with admitting that "slow replica" is not one condition. The system needs to know whether it is limited by network, follower disk, replay throughput, or retention window so it can apply pressure at the right boundary.
The trade-off is straightforward. Richer per-stage accounting costs more instrumentation and more state in the replication layer, but without it teams tune blind. In Harbor Point's case, a dashboard that separates inflight_bytes, durable_lag_entries, apply_lag_entries, and retained_wal_bytes is not observability vanity. It is the map that tells operators whether to pace writers, switch a follower to snapshot catch-up, or simply stop serving local follower reads.
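Feeding the burst numbers through that accounting makes the point concrete. The sent index below is an assumption for illustration, since the lesson only gives durable and applied positions:

ny = FollowerProgress(sent_index=810500, durable_index=809940, applied_index=809100)
print(ny.lags(leader_last_index=813620))
# {'send_backlog': 3120, 'durable_lag': 560, 'apply_lag': 840}
# Retained-log pressure is separate again: every segment after 809940
# stays pinned until ny-db-3 advances or is re-seeded by snapshot.

One scalar "3,680 entries behind" would hide that the send backlog, the flush gap, and the replay debt each call for a different lever.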
Concept 2: Good flow control is targeted: cap optional laggers, protect quorum traffic, and throttle only when safety paths degrade
The naive leader policy is "send every follower everything as fast as possible." Real systems use more selective controls because not every follower matters equally at every moment. Harbor Point does not want ny-db-3 to compete with md-db-4 for the same replication budget while New York is recovering. The local quorum follower is what keeps commits fast and lease renewals credible. The remote recovering follower matters for fault tolerance and failback, but it should not be allowed to crowd out the majority path.
One practical design is a per-follower sliding window plus a separate priority lane for heartbeats and quorum-critical append traffic:
client writes
-> leader appends locally
-> replicate to quorum followers on priority lane
-> replicate to lagging followers within per-follower byte window
-> if follower backlog exceeds catch-up threshold, stop incremental stream and snapshot
In Harbor Point's policy, md-db-4 may have up to 4 MB of in-flight log data because it is on the fast local path and usually acknowledges within a few milliseconds. ny-db-3 may be capped at 512 KB of in-flight data while recovering over the WAN. That smaller window is not punitive. It prevents the leader from accumulating gigabytes of unsent or unacknowledged data for a follower that is already known to be slow. If ny-db-3 falls beyond the retained log window, the leader stops pretending incremental catch-up is efficient and sends a snapshot instead.
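A sketch of that window check, assuming simple byte accounting per follower (the caps mirror the policy above; class and method names are illustrative):

WINDOW_BYTES = {"md-db-4": 4 * 1024 * 1024, "ny-db-3": 512 * 1024}

class SendWindow:
    def __init__(self, follower_id: str):
        self.cap = WINDOW_BYTES[follower_id]
        self.inflight_bytes = 0  # sent but not yet acknowledged

    def can_send(self, entry_bytes: int, quorum_critical: bool) -> bool:
        # Heartbeats and quorum appends ride the priority lane and are
        # never blocked by the catch-up window.
        if quorum_critical:
            return True
        return self.inflight_bytes + entry_bytes <= self.cap

    def on_send(self, entry_bytes: int):
        self.inflight_bytes += entry_bytes

    def on_ack(self, acked_bytes: int):
        self.inflight_bytes -= acked_bytes

When can_send returns False, the leader stops pulling more log for that follower until acknowledgments drain the window; the backlog stays in retained WAL on disk instead of expanding in sender memory.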
The more important distinction is when backpressure crosses from "shape background traffic" to "slow foreground writes." If only ny-db-3 is behind, Harbor Point may continue committing writes on md-db-2 plus md-db-4 while capping New York's catch-up stream. But if md-db-4 starts acknowledging too slowly, the safety path changes. Now the shard is not merely losing recovery slack; it is losing the follower that makes quorum commits and the lease-read fast path practical. At that point the leader should reduce write admission, shrink batch sizes, or even reject new work rather than let majority lag drift indefinitely upward.
A simplified controller looks like this:
def admission_decision(quorum_durable_lag_ms, optional_backlog_mb, retained_wal_gb):
    # Check the majority path first: quorum lag is the only signal that
    # justifies slowing foreground writes.
    if quorum_durable_lag_ms > 40:
        return "throttle foreground writes"
    # Retained WAL pressure threatens the leader's disk budget next.
    if retained_wal_gb > 20:
        return "snapshot or detach lagging follower"
    # Optional-follower backlog only shapes background traffic.
    if optional_backlog_mb > 512:
        return "cap catch-up stream"
    return "normal"
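Two hypothetical invocations show the escalation order during the burst:

# New York recovering, local quorum healthy: shape background traffic only.
print(admission_decision(quorum_durable_lag_ms=6, optional_backlog_mb=700, retained_wal_gb=3))
# -> cap catch-up stream

# The quorum follower itself slipping: slow foreground writes first.
print(admission_decision(quorum_durable_lag_ms=55, optional_backlog_mb=700, retained_wal_gb=3))
# -> throttle foreground writes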
The exact thresholds are workload-specific, but the mechanism matters more than the numbers. Backpressure is not one giant brake pedal. It is a policy layer that answers three different questions: which follower gets replication bandwidth first, how much backlog is allowed to build for each one, and at what point the leader must slow clients because the majority path itself is falling behind.
Concept 3: Backpressure is a durability and failover policy, not just a throughput optimization
The easiest mistake to make in Harbor Point's incident review would be to judge the controller only by peak write throughput. A looser controller often looks better for the first few minutes because it lets the API keep accepting work while hidden buffers expand. The bill arrives later. Retained WAL grows because lagging followers pin old segments. Snapshot size grows because the follower is now too far behind for log shipping to be efficient. Lease-based reads from 036.md fall back more often because quorum acknowledgments arrive with less margin. If md-db-4 degrades next, the shard is suddenly operating with almost no safe runway.
That is why production teams usually define backpressure around "how much safety debt are we willing to carry" rather than "how many writes per second can we squeeze through right now." Harbor Point might adopt rules such as:
- keep quorum durable lag below 40 ms
- keep retained WAL due to one lagging follower below 20 GB
- switch to snapshot catch-up if a follower trails more than 250,000 entries
- shed or slow new writes before the quorum follower misses two renewal intervals in a row (sketched below)
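The last rule could be enforced with a small miss counter on the quorum follower's renewal acknowledgments. This is a sketch under assumed names and semantics, not a production lease implementation:

class RenewalWatch:
    def __init__(self, renewal_interval_ms: float, max_consecutive_misses: int = 2):
        self.interval_ms = renewal_interval_ms
        self.limit = max_consecutive_misses
        self.misses = 0

    def on_quorum_ack(self, ack_delay_ms: float) -> str:
        # An acknowledgment slower than the renewal interval counts as a miss.
        if ack_delay_ms > self.interval_ms:
            self.misses += 1
        else:
            self.misses = 0
        # Shed one miss *before* the limit so the lease never actually lapses.
        if self.misses >= self.limit - 1:
            return "shed or slow new writes"
        return "normal"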
Those rules force the system to surface pain earlier, but in a controlled place. Application latency goes up before disks fill. A lagging replica is snapshot-reseeded before it holds the whole shard hostage through retained-log pressure. Lease reads fall back before the time-based proof becomes unsafe. In other words, backpressure chooses a graceful failure mode instead of an accidental one.
This is also where the next lesson begins. Flow control can keep replicas within a bounded distance while the live write stream is still the main source of truth. Once a replica has been offline long enough, or state has diverged beyond what the retained log can cheaply repair, the problem stops being "pace the stream" and becomes "reconcile histories." That is the bridge to 038.md, where read repair and anti-entropy take over after ordinary replication backpressure is no longer enough.
Troubleshooting
- Issue: One recovering follower causes leader memory growth and exploding WAL retention even though quorum commits still succeed.
- Why it happens: The leader is letting a lagging replica accumulate unbounded in-flight data or pin unlimited log segments while trying to catch up incrementally.
- Clarification / Fix: Add per-follower byte windows, monitor retained-log pressure explicitly, and switch the follower to snapshot catch-up or temporary detachment once the backlog crosses a bounded threshold.
- Issue: Write latency and lease-read fallback spike together during peak load.
- Why it happens: The quorum follower's acknowledgments are being delayed by the same queues, workers, or links used for bulk catch-up traffic, so the system is losing both commit margin and read-lease margin at once.
- Clarification / Fix: Prioritize heartbeats and quorum replication traffic over optional catch-up streams, and treat quorum durable lag as the signal for foreground throttling.
- Issue: Increasing batch size or send window makes replication graphs look smoother, but incidents get worse when a follower slows down.
- Why it happens: Larger buffers hide the bottleneck temporarily while increasing the amount of memory, retained log, and recovery work that accumulates behind it.
- Clarification / Fix: Tune window sizes from bandwidth-delay product and follower flush behavior, not from "largest burst we can buffer" (see the sketch below). Bigger queues are not the same as more capacity.
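As a rough illustration of sizing from bandwidth-delay product, with assumed link numbers rather than Harbor Point's measured values:

def window_bytes(bandwidth_mb_per_s: float, rtt_ms: float, safety_factor: float = 2.0) -> int:
    # Bandwidth-delay product: the bytes that can usefully be "in the pipe".
    bdp = bandwidth_mb_per_s * 1024 * 1024 * (rtt_ms / 1000.0)
    # A small multiple absorbs ack jitter; much beyond that is hidden backlog.
    return int(bdp * safety_factor)

print(window_bytes(bandwidth_mb_per_s=1000, rtt_ms=0.5))  # local follower: ~1 MB
print(window_bytes(bandwidth_mb_per_s=50, rtt_ms=70))     # WAN follower: ~7 MB

Note that a deliberate catch-up cap, like ny-db-3's 512 KB, can sit well below the link's BDP: that is the policy trading recovery speed for bounded leader memory, which is different from a window that is simply mis-sized.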
Advanced Connections
Connection 1: Lease-based reads in 036.md depend on the same replication timing that backpressure protects
The previous lesson treated safe reads as a lease and clock-uncertainty problem. This lesson adds the missing operational detail: those leases remain useful only if quorum followers keep acknowledging heartbeats and log progress on time. A backpressure policy that protects quorum traffic is therefore preserving read safety as well as write throughput.
Connection 2: 038.md begins where backpressure stops being enough
Flow control is the live-stream discipline that keeps replicas from drifting too far apart during normal operation and transient trouble. Read repair and anti-entropy are the repair mechanisms for the cases where drift already happened: missed writes, stale replicas rejoining after a long outage, or histories that now need explicit reconciliation instead of more pacing.
Resources
Optional Deepening Resources
- [PAPER] In Search of an Understandable Consensus Algorithm (Extended Version)
- Focus: Read the sections on nextIndex, matchIndex, and snapshot installation to see how a leader bounds catch-up work for lagging followers.
- [DOC] PostgreSQL Documentation: Replication Configuration
- Focus: Look at replication slots, WAL retention, sender limits, and how lagging standbys can push resource pressure back onto the primary.
- [DOC] CockroachDB Architecture: Replication Layer
- Focus: Compare Harbor Point's quorum-first policy with a production Raft-based system's handling of follower lag, snapshots, and replica recovery.
- [DOC] etcd Tuning
- Focus: Connect heartbeat and election timing with the backpressure lesson's point that overloaded replication paths eventually shape failover and read behavior too.
Key Insights
- Replication lag is a staged pipeline, not a single scalar - You only choose the right pressure mechanism after you know whether the bottleneck is sender backlog, follower fsync, apply lag, or retained-log pressure.
- Backpressure should protect the majority path before it protects every replica equally - A recovering optional follower may be bandwidth-limited or snapshot-reseeded; a quorum-critical follower should trigger slower writes much sooner.
- The real product is bounded safety debt - Slowing clients early is often cheaper than carrying hidden backlog until leases collapse, disks fill, or failover is no longer credible.