Replication Flow Control and Backpressure

The core idea: A replicated leader must pace writes to the rate at which its critical followers can durably absorb the log; otherwise backlog turns one slow replica into cluster-wide latency, unstable failover, or exhausted storage.

Core Insight

Harbor Point's Madrid leader, md-db-2, is carrying shard 184 through market open. At 09:30, reservation writes spike, local follower md-db-4 is still keeping up, and recovering replica ny-db-3 is replaying a backlog across the Atlantic. The tempting response is simple: keep accepting writes and push harder so New York catches up while the burst is still in flight.

That instinct is how leaders overload themselves. If md-db-2 keeps appending new entries, buffering unsent log data for ny-db-3, and sharing the same replication workers for heartbeats and catch-up traffic, the lagging follower stops being a local problem. Sender queues expand. Retained WAL grows because old log segments cannot be discarded. Heartbeats to healthy followers compete with bulk traffic. Lease-read margin starts collapsing because quorum acknowledgments arrive too close to expiry.

Replication backpressure is the mechanism that forces the leader to admit the uncomfortable truth early: the cluster can only move safely at the rate its critical replication path can absorb. Sometimes that means capping a recovering follower's in-flight bytes. Sometimes it means snapshotting a lagger instead of streaming an endless backlog. Sometimes it means slowing client writes before a quorum-critical follower falls so far behind that failover and safe reads become fragile.

The misconception to remove is that lag hurts only the slow replica. In production, unmanaged lag changes the behavior of the whole shard. It turns a catch-up problem into write latency, read fallback, WAL pressure, and weaker recovery options.

Lag Has Stages

Harbor Point's shard 184 has three important roles in this incident. md-db-2 is the active leader in term 42, md-db-4 is the local quorum follower, and ny-db-3 is a recovering remote replica. At 09:30:00, the shard receives about 18,000 reservation writes per second. Each write becomes a log entry on the leader, but that entry still moves through several distinct stages before the cluster can forget about it.

For each follower, the leader cares about at least four positions:

leader last index   -> newest entry accepted locally
sent index          -> newest entry put on the wire to that follower
durable index       -> newest entry the follower has fsynced
applied index       -> newest entry visible in the follower's state machine

Those positions answer different operational questions. If sent index trails far behind leader last index, the bottleneck is usually sender scheduling, TCP congestion, or an in-flight window that is already full. If durable index trails far behind sent index, the follower is receiving data but cannot flush it fast enough. If applied index lags while durable index is close, commit safety may still be fine even though follower reads are stale.

A single "replication lag = 2.4 seconds" metric hides those differences and makes the wrong fix look plausible.
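To make the staged view concrete, here is a minimal sketch of a classifier that reports which stage is the bottleneck for one follower. The `FollowerProgress` type, the `classify_lag` function, the `slack` threshold, and the sample positions are all illustrative assumptions, not part of any particular replication library.

```python
from dataclasses import dataclass


@dataclass
class FollowerProgress:
    """Per-follower positions as tracked by the leader (hypothetical names)."""
    sent_index: int
    durable_index: int
    applied_index: int


def classify_lag(leader_last_index: int, f: FollowerProgress, slack: int = 100) -> str:
    """Return the first stage whose gap exceeds `slack` entries.

    `slack` is an illustrative threshold, not a recommended value.
    """
    if leader_last_index - f.sent_index > slack:
        return "send-limited"    # sender scheduling, network, or a full window
    if f.sent_index - f.durable_index > slack:
        return "flush-limited"   # follower receives data but fsync is slow
    if f.durable_index - f.applied_index > slack:
        return "apply-limited"   # durable, but follower reads are stale
    return "healthy"


# Made-up positions: almost everything has been sent, but little is durable.
follower = FollowerProgress(sent_index=990, durable_index=400, applied_index=390)
print(classify_lag(1000, follower))  # flush-limited
```

Checking the stages in pipeline order means the sketch reports the earliest bottleneck; a real system would likely expose all three gaps rather than a single label.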

At Harbor Point during the burst, the state might look like this:

leader last index:   813620
md-db-4 durable:     813612   (healthy quorum follower)
ny-db-3 durable:     809940   (recovering over WAN)
ny-db-3 applied:     809100   (replay still catching up)

The production danger is not only that New York is behind. It is that the leader must retain every log segment holding entries after index 809940 until ny-db-3 catches up or is re-seeded by snapshot. If the sender also keeps buffering new entries for New York without a cap, backlog grows in two places at once: leader memory and retained WAL on disk. Flow control begins with admitting that "slow replica" is not one condition. The system needs to know whether it is limited by network, follower disk, replay throughput, or retention window so it can apply pressure at the right boundary.
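The retention effect can be worked through with the indexes above. The sketch below assumes, for simplicity, that the slowest follower's durable index is the only thing pinning old log; real systems may retain log for other reasons too. The function names are hypothetical.

```python
def retention_floor(durable_by_follower: dict[str, int]) -> int:
    """Lowest durable index among followers still catching up incrementally."""
    return min(durable_by_follower.values())


def pinned_entries(leader_last_index: int, durable_by_follower: dict[str, int]) -> int:
    """Entries the leader cannot discard yet because some follower still needs them."""
    return leader_last_index - retention_floor(durable_by_follower)


durable = {"md-db-4": 813612, "ny-db-3": 809940}
print(pinned_entries(813620, durable))               # 3680, pinned by ny-db-3

# If ny-db-3 is re-seeded by snapshot, it no longer pins the old log.
print(pinned_entries(813620, {"md-db-4": 813612}))   # 8
```

The gap between the two results is the recovery slack that a single lagging follower consumes on the leader's disk.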

The trade-off is straightforward: richer per-stage accounting costs more instrumentation and more state in the replication layer, but without it teams tune blind. In Harbor Point's case, a dashboard that separates inflight_bytes, durable_lag_entries, apply_lag_entries, and retained_wal_bytes is not observability decoration. It is the map that tells operators whether to pace writers, switch a follower to snapshot catch-up, or stop serving local follower reads.

Pacing The Replication Stream

The naive leader policy is "send every follower everything as fast as possible." Real systems use more selective controls because not every follower matters equally at every moment. Harbor Point does not want ny-db-3 to compete with md-db-4 for the same replication budget while New York is recovering. The local quorum follower is what keeps commits fast and lease renewals credible. The remote recovering follower matters for fault tolerance and failback, but it should not be allowed to crowd out the majority path.

One practical design is a per-follower sliding window plus a separate priority lane for heartbeats and quorum-critical append traffic:

client writes
   -> leader appends locally
   -> replicate to quorum followers on priority lane
   -> replicate to lagging followers within per-follower byte window
   -> if follower backlog exceeds catch-up threshold, stop incremental stream and snapshot

In Harbor Point's policy, md-db-4 may have up to 4 MB of in-flight log data because it is on the fast local path and usually acknowledges within a few milliseconds. ny-db-3 may be capped at 512 KB of in-flight data while recovering over the WAN. That smaller window is not punitive. It prevents the leader from accumulating gigabytes of unsent or unacknowledged data for a follower that is already known to be slow. If ny-db-3 falls beyond the retained log window, the leader stops pretending incremental catch-up is efficient and sends a snapshot instead.
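A per-follower byte window and the snapshot fallback might be sketched as follows. `FollowerWindow`, `catch_up_mode`, and `first_retained_index` are hypothetical names introduced for illustration; only the 4 MB and 512 KB caps come from the policy described above.

```python
from dataclasses import dataclass


@dataclass
class FollowerWindow:
    """Byte-based in-flight window for one follower (hypothetical structure)."""
    window_bytes: int
    in_flight_bytes: int = 0

    def try_send(self, entry_bytes: int) -> bool:
        """Reserve window space for an entry; refuse if the window would overflow."""
        if self.in_flight_bytes + entry_bytes > self.window_bytes:
            return False
        self.in_flight_bytes += entry_bytes
        return True

    def on_ack(self, acked_bytes: int) -> None:
        """Release window space when the follower acknowledges durable bytes."""
        self.in_flight_bytes = max(0, self.in_flight_bytes - acked_bytes)


def catch_up_mode(follower_next_index: int, first_retained_index: int) -> str:
    """Use a snapshot once the follower needs entries the leader no longer retains."""
    return "snapshot" if follower_next_index < first_retained_index else "incremental"


KB, MB = 1024, 1024 * 1024
windows = {"md-db-4": FollowerWindow(4 * MB), "ny-db-3": FollowerWindow(512 * KB)}
```

When `try_send` returns False the sender simply stops reading log for that follower until an acknowledgment arrives, so the backlog stays on disk in the WAL instead of accumulating in leader memory.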

The more important distinction is when backpressure crosses from "shape background traffic" to "slow foreground writes." If only ny-db-3 is behind, Harbor Point may continue committing writes on md-db-2 plus md-db-4 while capping New York's catch-up stream. But if md-db-4 starts acknowledging too slowly, the safety path changes. Now the shard is not merely losing recovery slack; it is losing the follower that makes quorum commits and the lease-read fast path practical. At that point the leader should reduce write admission, shrink batch sizes, or even reject new work rather than let majority lag drift indefinitely upward.

A simplified controller looks like this:

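def admission_decision(quorum_durable_lag_ms, optional_backlog_mb, retained_wal_gb):
    """Return the first action whose threshold is exceeded, most urgent first."""
    # Quorum follower is falling behind: commits and lease margin are at risk.
    if quorum_durable_lag_ms > 40:
        return "throttle foreground writes"
    # A lagging follower is pinning too much log on the leader's disk.
    if retained_wal_gb > 20:
        return "snapshot or detach lagging follower"
    # Optional catch-up traffic is queuing up in leader memory.
    if optional_backlog_mb > 512:
        return "cap catch-up stream"
    return "normal"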

The exact thresholds are workload-specific, but the mechanism matters more than the numbers. Backpressure is not one giant brake pedal. It is a policy layer that answers three different questions: which follower gets replication bandwidth first, how much backlog is allowed to build for each one, and at what point the leader must slow clients because the majority path itself is falling behind.
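One way to turn a "throttle foreground writes" decision into an actual admitted rate is an additive-increase, multiplicative-decrease loop. This is a technique borrowed from congestion control rather than something the lesson prescribes, and every parameter below (`floor`, `ceiling`, `step`, the halving factor) is an illustrative assumption; only the 40 ms limit echoes the controller above.

```python
def next_admit_rate(current_rate: float, quorum_durable_lag_ms: float,
                    lag_limit_ms: float = 40.0, floor: float = 1_000.0,
                    ceiling: float = 20_000.0, step: float = 500.0) -> float:
    """Adjust admitted writes per second from the latest quorum durable lag sample."""
    if quorum_durable_lag_ms > lag_limit_ms:
        # Back off quickly while the quorum path is behind, but never below the floor.
        return max(floor, current_rate * 0.5)
    # Recover slowly once the quorum path is healthy again.
    return min(ceiling, current_rate + step)


rate = 18_000.0
for lag_ms in (55, 48, 30, 12):
    rate = next_admit_rate(rate, lag_ms)
    print(lag_ms, rate)
```

The asymmetry is deliberate: cutting fast and restoring slowly keeps the leader from oscillating around the threshold while the quorum follower is still draining its backlog.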

Backpressure As Safety Policy

The easiest mistake in Harbor Point's incident review would be to judge the controller only by peak write throughput. A looser controller often looks better for the first few minutes because it lets the API keep accepting work while hidden buffers expand. The bill arrives later. Retained WAL grows because lagging followers pin old segments. Snapshot size grows because the follower is now too far behind for incremental log shipping to be efficient. Lease-based reads (covered in the earlier lesson, Clocks, Leases, and Safe Reads) fall back more often because quorum acknowledgments arrive with less margin. If md-db-4 degrades next, the shard is suddenly operating with almost no safe runway.

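That is why production teams usually define backpressure around "how much safety debt are we willing to carry" rather than "how many writes per second can we squeeze through right now." Harbor Point might adopt rules that bound each kind of debt: a ceiling on quorum durable lag before foreground writes are throttled, a ceiling on retained WAL before a lagging follower is snapshotted or detached, and a lease-read fallback that triggers while acknowledgment margin is still comfortable.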

Those rules force the system to surface pain earlier, but in a controlled place. Application latency goes up before disks fill. A lagging replica is snapshot-reseeded before it holds the whole shard hostage through retained-log pressure. Lease reads fall back before the time-based proof becomes unsafe. In other words, backpressure chooses a graceful failure mode instead of an accidental one.

Flow control can keep replicas within a bounded distance while the live write stream is still the main source of truth. Once a replica has been offline long enough, or state has diverged beyond what the retained log can cheaply repair, the problem stops being "pace the stream" and becomes "recover to a known state." That is why snapshot and recovery semantics matter: ordinary replication control cannot compensate for a missing restore path.

Operational Failure Modes

One recovering follower causes leader memory growth and exploding WAL retention even though quorum commits still succeed. The leader is letting a lagging replica accumulate unbounded in-flight data or pin unlimited log segments while trying to catch up incrementally. Add per-follower byte windows, monitor retained-log pressure directly, and switch the follower to snapshot catch-up or temporary detachment once the backlog crosses a bounded threshold.

Write latency and lease-read fallback spike together during peak load. The quorum follower's acknowledgments are being delayed by the same queues, workers, or links used for bulk catch-up traffic, so the system is losing both commit margin and read-lease margin at once. Prioritize heartbeats and quorum replication traffic over optional catch-up streams, and treat quorum durable lag as the signal for foreground throttling.
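The prioritization in that fix can be sketched with a strict-priority send queue. The lane constants and the `ReplicationSendQueue` class are hypothetical, and strict priority is only one possible policy: under sustained foreground load it can starve catch-up traffic, so a real scheduler would probably reserve a small share for the lowest lane.

```python
import heapq
import itertools
from typing import Optional

# Lower number = higher priority. Lane names are illustrative.
HEARTBEAT, QUORUM_APPEND, CATCH_UP = 0, 1, 2


class ReplicationSendQueue:
    """Send queue that always drains higher-priority lanes first."""

    def __init__(self) -> None:
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a lane

    def enqueue(self, lane: int, message: str) -> None:
        heapq.heappush(self._heap, (lane, next(self._seq), message))

    def next_message(self) -> Optional[str]:
        """Pop the most urgent message, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


q = ReplicationSendQueue()
q.enqueue(CATCH_UP, "ny-db-3 chunk 1")
q.enqueue(CATCH_UP, "ny-db-3 chunk 2")
q.enqueue(HEARTBEAT, "heartbeat md-db-4")
q.enqueue(QUORUM_APPEND, "append md-db-4")
while (msg := q.next_message()) is not None:
    print(msg)
```

Even though the catch-up chunks were queued first, the heartbeat and the quorum append leave the queue ahead of them, which is the property that protects commit and lease margin.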

Increasing batch size or send window makes replication graphs look smoother, but incidents get worse when a follower slows down. Larger buffers hide the bottleneck temporarily while increasing the amount of memory, retained log, and recovery work that accumulates behind it. Tune window sizes from bandwidth-delay product and follower flush behavior, not from "largest burst we can buffer." Bigger queues are not the same as more capacity.
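As a worked example of sizing from bandwidth-delay product, the helper below multiplies link bandwidth by round-trip time. The link figures are assumptions chosen for illustration; the lesson does not state Harbor Point's actual bandwidth or latency.

```python
def bdp_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """Bandwidth-delay product: bytes that can be in flight during one round trip."""
    return bandwidth_bits_per_s * rtt_ms // (8 * 1000)


# Assumed link figures for illustration only.
wan = bdp_bytes(50_000_000, 80)        # 50 Mbit/s, 80 ms RTT
lan = bdp_bytes(10_000_000_000, 1)     # 10 Gbit/s, 1 ms RTT
print(wan, lan)  # 500000 1250000
```

Under these assumed figures a window of roughly half a megabyte already keeps the WAN link busy, so a larger window would add buffered backlog without adding throughput. Follower flush behavior still has to be checked separately, since a slow disk can make even a BDP-sized window too generous.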

Key Takeaways

  1. Replication lag is a staged pipeline, not a single scalar; sender backlog, durable lag, apply lag, and retained-log pressure need different responses.
  2. Backpressure should protect the majority path before it protects every replica equally: optional laggers can be capped or snapshotted, while quorum lag should slow foreground writes sooner.
  3. The real product is bounded safety debt. Slowing clients early is often cheaper than carrying hidden backlog until leases collapse, disks fill, or failover is no longer credible.