Day 417: Replication Primitives: Log Shipping and Apply

LESSON: Consistency and Replication · 018 · 30 min · advanced

The core idea: A replica becomes trustworthy when it replays the same durable log the primary would use for crash recovery, in the same order, while keeping separate track of what has been received, flushed, and applied.

Today's "Aha!" Moment

In ../26/16.md, Harbor Point chose a page-oriented MVCC database because its reservation workflow depended on ordered WAL, predictable B-tree access, and bounded recovery. The next step is obvious from an operations perspective: move trader dashboards and disaster-recovery failover onto a standby so the primary is not the only copy of the truth. The tempting mental model is "the primary sends changed rows to the replica." That sounds intuitive, but it is not how reliable database replication begins.

What the primary actually knows how to produce durably is its write-ahead log. When a reservation is approved, the engine does not only think "one row changed." It emits a sequence of WAL records that describe index updates, heap changes, visibility metadata, and finally the commit record that makes the transaction durable. Log shipping works because the replica can consume that exact ordered history and reconstruct the same state transition the primary would reconstruct during crash recovery. Replication is therefore an extension of recovery, not a separate magic copy mechanism.

That shift matters because "copied" is not one state. A WAL record may have left the primary, arrived on the standby, been written to disk on the standby, or already been replayed into visible pages. Those are different milestones with different operational meanings. Harbor Point can have almost no network lag and still show stale dashboards if apply is behind. It can also have WAL safely stored on the standby before those changes are query-visible there. This lesson is about those primitives: the log, the shipping path, the apply path, and the watermarks that describe where the replica really is.

Why This Matters

Harbor Point's production problem is concrete. During the 09:30 market open, the primary handles short approval transactions that update issuer_exposure, insert an open reservation, and write index entries for the trader dashboard. Product wants that dashboard to read from a standby. Operations also wants to promote the standby quickly if the primary host dies. Both requests sound simple until someone asks the question that matters: what exactly does "caught up" mean?

If the team answers that question loosely, incidents follow. A standby may have received the newest WAL bytes but not replayed them, so the dashboard misses reservations that definitely committed on the primary. A disconnected standby may fall so far behind that the primary has already discarded the WAL segments it needs. A failover runbook may assume that a replica with low transport lag is promotion-ready when the actual replay position is still behind the business events support is about to inspect.

Once the team thinks in terms of log shipping and apply, the system becomes easier to reason about. The WAL defines the only valid order of state transitions. Shipping tells you how much of that history has reached the standby. Apply tells you how much of that history has become part of the standby's visible state. The next lesson, 02.md, will ask when a client commit should wait for a replica acknowledgment. That question only makes sense after the mechanics of shipping and apply are clear.

Learning Objectives

By the end of this session, you will be able to:

  1. Explain why log shipping starts from the recovery log instead of table diffs - Show how WAL preserves commit order and engine-specific state transitions.
  2. Distinguish receipt, durability, and replay on a replica - Use the right watermark when reasoning about stale reads, failover readiness, and lag.
  3. Analyze the operational trade-offs of log shipping - Identify where backlog, WAL retention, and apply conflicts turn a healthy replica into an unreliable one.

Core Concepts Explained

Concept 1: Replication is remote crash recovery driven by WAL order

Harbor Point's approval path already gave the primary a durable source of truth before replication existed. When a trader opens a reservation for issuer MUNI-77, the engine writes WAL records for the summary-row update in issuer_exposure, the new row in reservations, and the affected B-tree entries, then emits a commit record. That WAL is what lets the primary recover after a crash. Replication reuses the same mechanism rather than inventing a second description of the transaction.

That design solves a hard correctness problem. The database state is not just "rows." It includes page images, visibility information, index maintenance, and a very specific commit order. If Harbor Point tried to ship only logical row deltas at this layer, the standby would need a second mechanism to preserve every storage-engine invariant the primary already encoded in WAL. By shipping the durable log, the standby gets the exact prefix of history the primary trusted locally.

For one Harbor Point approval, the WAL prefix might look conceptually like this:

LSN 8A/10  UPDATE issuer_exposure SET reserved_notional = reserved_notional + 500000
LSN 8A/28  INSERT reservations(id='R-88421', issuer='MUNI-77', status='open', ...)
LSN 8A/40  INSERT idx_reservations_open_issuer_time(...)
LSN 8A/58  COMMIT tx=88421

The replica cannot safely expose the reservation after replaying only the row insert if the index change or commit record is missing. It must replay the WAL in order and stop at the last complete durable prefix. That is why log shipping is attractive operationally: the primary and standby share one notion of history. It is also why physical log shipping is tightly coupled to one engine family and storage layout. You gain exactness and predictable recovery semantics, but the replica is no longer a loosely interpreted consumer of events. It is another copy of the same database system reconstructing the same state.
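The "last complete durable prefix" rule can be made concrete with a small sketch. This is a toy replay loop, not a real engine: the record shape (`lsn`, `kind`, `tx` fields) is illustrative, but the rule it encodes is the one above, that a transaction's changes become visible only after its commit record replays, in strict log order.

```python
# Toy replay loop (illustrative record shapes, not a real engine's format):
# apply WAL strictly in order, and make a transaction's changes visible
# only once its commit record has been replayed.

def replay(wal_records):
    pending = {}              # tx id -> changes replayed but not yet committed
    visible = []              # what hot-standby queries may see
    last_lsn = None
    for rec in wal_records:   # strict log order; never skip ahead
        if rec["kind"] == "change":
            pending.setdefault(rec["tx"], []).append(rec["what"])
        elif rec["kind"] == "commit":
            visible.extend(pending.pop(rec["tx"], []))
        last_lsn = rec["lsn"]
    return visible, last_lsn

# The MUNI-77 approval from the text, cut off before its commit record:
prefix = [
    {"lsn": "8A/10", "kind": "change", "tx": 88421, "what": "update issuer_exposure"},
    {"lsn": "8A/28", "kind": "change", "tx": 88421, "what": "insert reservations R-88421"},
    {"lsn": "8A/40", "kind": "change", "tx": 88421, "what": "insert index entry"},
]
visible, last_lsn = replay(prefix)   # visible == []: no commit record yet
```

Truncating the stream anywhere before the commit at LSN 8A/58 exposes none of the three changes; replaying through the commit exposes all of them at once. That all-or-nothing behavior at the commit boundary is exactly what shipping the recovery log buys the standby.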

An ASCII view of the normal path makes the coupling clearer:

client tx
   |
   v
primary generates WAL ----> local fsync ----> commit visible on primary
   |
   +---- WAL sender ----> standby stores WAL ----> standby replays WAL ----> query visible on standby

The arrow from "local fsync" to "standby stores WAL" is intentionally separate from the arrow to "standby replays WAL." Shipping and apply are different phases. That distinction is the core primitive for the rest of the month.

Concept 2: A replica has multiple positions, not one "caught up" flag

Harbor Point's operations dashboard should never display a single boolean called replica_healthy. A useful standby exposes at least three meaningful positions. One position tells you how much WAL the standby has received over the network. Another tells you how much of that WAL is durable on the standby's own disk. A third tells you how much has been replayed into data pages so hot-standby queries can see it.

Those positions diverge in ordinary production conditions. Suppose the primary commits reservation R-88421 at LSN 8A/58. The standby might receive that record almost immediately because the network is fast, but replay could still lag because the standby is saturating I/O or because a long-running read query is delaying page changes. In that moment Harbor Point has very low transport lag and very real read staleness.

The operational meaning of each watermark is different:

  1. Received WAL - How much history has crossed the network. It answers "is the shipping path keeping up?" but says nothing about durability or visibility on the standby.
  2. Flushed WAL - How much shipped history is durable on the standby's own disk. It answers "how much would survive a standby crash?" and bounds what promotion can recover.
  3. Replayed WAL - How much history has been applied into data pages. It answers "how stale are hot-standby reads?" because queries only see the replayed prefix.

That separation explains a common confusion around failover. A standby can be safer than its visible query state suggests. If it has flushed WAL beyond its current replay position, promotion can continue recovery from the durable local prefix before opening for writes. By contrast, read scaling depends on replay, not merely receipt or flush, because trader dashboards and support queries only see changes that have actually been applied.

Harbor Point therefore needs metrics and alerts keyed to the right question. "How stale are standby reads?" is a replay question. "How much WAL would the standby lose if it crashed right now?" is a flush question. "Is the network path to the standby keeping up?" is a receive question. Treating those as one number hides the real failure mode.
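Treating each watermark as a byte offset makes the three questions computable. This sketch uses the common receive/flush/replay terminology, but the structure and field names are hypothetical, not any real system's monitoring view:

```python
# Sketch: three standby positions as byte offsets, one lag metric per
# operational question. Field names are illustrative, not a real view.

from dataclasses import dataclass

@dataclass
class StandbyPositions:
    primary_lsn: int   # newest WAL byte the primary has written
    receive_lsn: int   # received over the network by the standby
    flush_lsn: int     # durable on the standby's own disk
    replay_lsn: int    # applied into pages, visible to standby queries

def lag_report(p: StandbyPositions) -> dict:
    return {
        # "Is the network path keeping up?" -- a receive question
        "transport_lag": p.primary_lsn - p.receive_lsn,
        # "What could the standby lose if it crashed?" -- a flush question
        "durability_gap": p.receive_lsn - p.flush_lsn,
        # "How stale are standby reads?" -- a replay question
        "read_staleness": p.flush_lsn - p.replay_lsn,
    }

# Fast network, slow apply: tiny transport lag, very real read staleness.
report = lag_report(StandbyPositions(
    primary_lsn=1000, receive_lsn=990, flush_lsn=980, replay_lsn=600))
```

With those inputs the report shows a transport lag of 10 bytes alongside a read staleness of 380, which is the market-open scenario above: the network is healthy while the dashboard is stale.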

Concept 3: Log shipping stays healthy only if retention, apply speed, and conflicts are managed

Once Harbor Point starts relying on a standby, the dangerous failures are usually not mysterious consensus bugs. They are ordinary backlog problems at the replication primitives layer. If the standby disconnects for long enough, the primary may recycle old WAL segments before the standby consumes them. At that point the standby cannot resume from its last position and needs a new base backup or another seeding path. The database did not "partially replicate." It lost the continuous log prefix required to continue recovery.
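The retention failure mode can be simulated in a few lines. This sketch models the primary's retained WAL as a bounded window of segment numbers; the class and its segment counts are illustrative, but the rule is the one above: a standby can resume only if its restart position is still inside the retained window.

```python
# Sketch: why a long-disconnected standby cannot resume. The primary
# retains only a bounded window of recent WAL segments; once a segment
# is recycled, the continuous prefix a standby needs is gone.

from collections import deque

class Primary:
    def __init__(self, retained_segments=4):
        self.next_segment = 0
        self.retained = deque(maxlen=retained_segments)  # old ones recycled

    def write_segment(self):
        self.retained.append(self.next_segment)
        self.next_segment += 1

    def can_resume_from(self, standby_segment):
        # The standby needs a contiguous prefix starting at its position.
        return (standby_segment in self.retained
                or standby_segment >= self.next_segment)

p = Primary(retained_segments=4)
for _ in range(3):
    p.write_segment()
assert p.can_resume_from(0)        # segment 0 still retained: resume works
for _ in range(5):                 # primary keeps moving; old WAL recycled
    p.write_segment()
assert not p.can_resume_from(0)    # segment 0 gone: standby needs reseeding
```

Nothing the primary still holds can reconstruct the recycled segment, which is why the fix at that point is a new base backup rather than a longer wait.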

Apply lag is the second class of problem. Harbor Point wants to run trader dashboards on the standby, but read traffic is only safe when it does not prevent the standby from applying new WAL promptly. Depending on engine behavior, long-running reads can delay cleanup or conflict with replayed changes. Even without query conflicts, replay can fall behind simply because one standby is under-provisioned for the rate at which the primary generates WAL during market open. The result looks like an application bug because dashboards are stale, yet the root cause is replay throughput.

The third pressure is observability. Log shipping gives strong ordering guarantees precisely because it insists on a contiguous prefix of WAL. That means small gaps matter. If Harbor Point sees receive_lsn moving but replay_lsn stuck, the team should ask whether replay is blocked, not whether the application "forgot to refresh." If none of the positions move, the issue is earlier: WAL sender, network path, standby receiver, or retention. The lesson is not that replication is fragile. The lesson is that the right primitive-level signals make it diagnosable.
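The diagnostic questions above reduce to a small decision rule. This sketch compares two samples of the receive and replay positions to pick a first triage direction; the function and its inputs are illustrative, not a real monitoring API:

```python
# Sketch: first-pass triage from two samples of (receive_lsn, replay_lsn).
# Names and thresholds are illustrative.

def triage(before, after):
    receive_moving = after["receive_lsn"] > before["receive_lsn"]
    replay_moving = after["replay_lsn"] > before["replay_lsn"]
    if receive_moving and not replay_moving:
        # Shipping works; apply is blocked (I/O, conflict, long query).
        return "investigate replay"
    if not receive_moving and not replay_moving:
        # Earlier in the path: WAL sender, network, receiver, or retention.
        return "investigate shipping path"
    return "replication advancing"

verdict = triage({"receive_lsn": 100, "replay_lsn": 90},
                 {"receive_lsn": 180, "replay_lsn": 90})
```

Here `verdict` points at replay, which matches the guidance above: a moving receive position with a stuck replay position is an apply problem, not an application refresh problem.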

This is the trade-off of physical log shipping. It is operationally efficient, preserves exact engine semantics, and makes failover conceptually straightforward because every standby is reconstructing the same storage history. In exchange, replicas are tightly bound to the primary's engine version and physical layout, and every standby must keep up with a single ordered WAL stream instead of independently interpreting business events. That is acceptable for Harbor Point because the requirement is "another trustworthy copy of the same OLTP database," not "a differently shaped downstream system."

Troubleshooting

Issue: The standby reports that it is connected, but the trader dashboard still misses reservations that committed seconds ago.

Why it happens / is confusing: Connection health only proves that the shipping path exists. The standby may be receiving WAL while replay is behind due to I/O saturation, replay conflicts, or a long-running query.

Clarification / Fix: Compare receive, flush, and replay positions separately. If replay alone is lagging, reduce conflicting standby reads, provision more I/O, or move the dashboard to a replica tier that can keep up with market-open WAL volume.

Issue: A standby comes back after an outage and cannot catch up from its old position.

Why it happens / is confusing: The primary kept moving and eventually removed WAL segments older than the standby's restart point. Log shipping requires a continuous WAL prefix; missing segments are not reconstructible from the later log alone.

Clarification / Fix: Increase WAL retention, use replication slots or an equivalent retention mechanism carefully, and alert on lag before the standby falls off the retained history window.

Issue: Operations promotes a standby and discovers that some recently committed work is not yet visible to support queries.

Why it happens / is confusing: The standby may have durably stored WAL beyond its replay point. Promotion can continue recovery from that durable prefix, but hot-standby reads before promotion were only seeing the replayed prefix.

Clarification / Fix: Distinguish read freshness from failover safety in runbooks and dashboards. For user-facing reads, watch replay lag. For promotion analysis, also inspect what WAL has been durably flushed locally.

Advanced Connections

Connection 1: ../26/16.md chose an engine whose WAL could support replication without redefining correctness

The month 26 capstone argued that Harbor Point needed a page-oriented MVCC engine because ordered WAL, B-tree maintenance, and bounded recovery matched the reservation workload. This lesson extends that same decision over the network. If the local engine's commit order were not trustworthy, remote replicas would only reproduce the confusion on more machines.

Connection 2: 02.md and 03.md build on these watermarks in different ways

02.md asks which replica milestone a client commit should wait for: local durability only, remote flush, or something stricter. 03.md revisits the same idea through Raft, where the replicated log is logical and quorum-committed rather than a physical WAL stream. The common thread is that ordering comes first; the difference is what exactly is being ordered and what acknowledgment rule makes it durable.


Key Insights

  1. Log shipping works because replication reuses recovery - The replica consumes the same ordered durable history the primary would use after its own crash.
  2. Replica freshness has multiple milestones - Received WAL, flushed WAL, and replayed WAL answer different production questions and should never be collapsed into one status flag.
  3. Healthy replication depends on continuous history and timely apply - Retention gaps, under-provisioned standby I/O, and replay conflicts are the ordinary ways a replica stops being trustworthy.