LESSON
Day 417: Replication Primitives: Log Shipping and Apply
The core idea: A replica becomes trustworthy when it replays the same durable log the primary would use for crash recovery, in the same order, while keeping separate track of what has been received, flushed, and applied.
Today's "Aha!" Moment
In ../26/16.md, Harbor Point chose a page-oriented MVCC database because its reservation workflow depended on ordered WAL, predictable B-tree access, and bounded recovery. The next step is obvious from an operations perspective: move trader dashboards and disaster-recovery failover onto a standby so the primary is not the only copy of the truth. The tempting mental model is "the primary sends changed rows to the replica." That sounds intuitive, but it is not how reliable database replication begins.
What the primary actually knows how to produce durably is its write-ahead log. When a reservation is approved, the engine does not only think "one row changed." It emits a sequence of WAL records describing index updates, heap changes, and visibility metadata, and finally the commit record that makes the transaction durable. Log shipping works because the replica can consume that exact ordered history and reconstruct the same state transition the primary would reconstruct during crash recovery. Replication is therefore an extension of recovery, not a separate magic copy mechanism.
That shift matters because "copied" is not one state. A WAL record may have left the primary, arrived on the standby, been written to disk on the standby, or already been replayed into visible pages. Those are different milestones with different operational meanings. Harbor Point can have almost no network lag and still show stale dashboards if apply is behind. It can also have WAL safely stored on the standby before those changes are query-visible there. This lesson is about those primitives: the log, the shipping path, the apply path, and the watermarks that describe where the replica really is.
Why This Matters
Harbor Point's production problem is concrete. During the 09:30 market open, the primary handles short approval transactions that update issuer_exposure, insert an open reservation, and write index entries for the trader dashboard. Product wants that dashboard to read from a standby. Operations also wants to promote the standby quickly if the primary host dies. Both requests sound simple until someone asks the question that matters: what exactly does "caught up" mean?
If the team answers that question loosely, incidents follow. A standby may have received the newest WAL bytes but not replayed them, so the dashboard misses reservations that definitely committed on the primary. A disconnected standby may fall so far behind that the primary has already discarded the WAL segments it needs. A failover runbook may assume that a replica with low transport lag is promotion-ready when the actual replay position is still behind the business events support is about to inspect.
Once the team thinks in terms of log shipping and apply, the system becomes easier to reason about. The WAL defines the only valid order of state transitions. Shipping tells you how much of that history has reached the standby. Apply tells you how much of that history has become part of the standby's visible state. The next lesson, 02.md, will ask when a client commit should wait for a replica acknowledgment. That question only makes sense after the mechanics of shipping and apply are clear.
Learning Objectives
By the end of this session, you will be able to:
- Explain why log shipping starts from the recovery log instead of table diffs - Show how WAL preserves commit order and engine-specific state transitions.
- Distinguish receipt, durability, and replay on a replica - Use the right watermark when reasoning about stale reads, failover readiness, and lag.
- Analyze the operational trade-offs of log shipping - Identify where backlog, WAL retention, and apply conflicts turn a healthy replica into an unreliable one.
Core Concepts Explained
Concept 1: Replication is remote crash recovery driven by WAL order
Harbor Point's approval path already gave the primary a durable source of truth before replication existed. When a trader opens a reservation for issuer MUNI-77, the engine writes WAL records for the summary-row update in issuer_exposure, the new row in reservations, and the affected B-tree entries, then emits a commit record. That WAL is what lets the primary recover after a crash. Replication reuses the same mechanism rather than inventing a second description of the transaction.
That design solves a hard correctness problem. The database state is not just "rows." It includes page images, visibility information, index maintenance, and a very specific commit order. If Harbor Point tried to ship only logical row deltas at this layer, the standby would need a second mechanism to preserve every storage-engine invariant the primary already encoded in WAL. By shipping the durable log, the standby gets the exact prefix of history the primary trusted locally.
For one Harbor Point approval, the WAL prefix might look conceptually like this:
LSN 8A/10 UPDATE issuer_exposure SET reserved_notional = reserved_notional + 500000
LSN 8A/28 INSERT reservations(id='R-88421', issuer='MUNI-77', status='open', ...)
LSN 8A/40 INSERT idx_reservations_open_issuer_time(...)
LSN 8A/58 COMMIT tx=88421
The replica cannot safely expose the reservation after replaying only the row insert if the index change or commit record is missing. It must replay the WAL in order and stop at the last complete durable prefix. That is why log shipping is attractive operationally: the primary and standby share one notion of history. It is also why physical log shipping is tightly coupled to one engine family and storage layout. You gain exactness and predictable recovery semantics, but the replica is no longer a loosely interpreted consumer of events. It is another copy of the same database system reconstructing the same state.
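To make the "replay in order, stop at the last complete prefix" rule concrete, here is a minimal sketch in Python. It is not any engine's real replay code; the record shape, the simplified integer LSNs, and the is_commit flag are illustrative. The point it demonstrates is that a transaction's effects only become visible on the standby once its commit record has been replayed.

```python
# Minimal sketch of ordered WAL apply on a standby (illustrative record format).
# A transaction's changes become query-visible only after its COMMIT record
# is replayed; an incomplete tail is applied to pages but never exposed.

from dataclasses import dataclass

@dataclass
class WalRecord:
    lsn: int                 # byte position in the WAL stream (simplified)
    tx: int                  # transaction id
    action: str              # human-readable description of the change
    is_commit: bool = False

def replay(records, replay_lsn=0, visible_txs=None):
    """Apply records strictly in LSN order, starting after replay_lsn."""
    visible_txs = set() if visible_txs is None else visible_txs
    for rec in sorted(records, key=lambda r: r.lsn):
        if rec.lsn <= replay_lsn:
            continue                      # already applied before a restart
        # ... apply the heap/index/page change described by rec here ...
        replay_lsn = rec.lsn              # advance the replay watermark
        if rec.is_commit:
            visible_txs.add(rec.tx)       # only now are the tx effects readable
    return replay_lsn, visible_txs

# The Harbor Point approval from the example, with the commit record last.
prefix = [
    WalRecord(0x8A10, 88421, "UPDATE issuer_exposure ..."),
    WalRecord(0x8A28, 88421, "INSERT reservations ..."),
    WalRecord(0x8A40, 88421, "INSERT idx_reservations_open_issuer_time ..."),
    WalRecord(0x8A58, 88421, "COMMIT", is_commit=True),
]

lsn, visible = replay(prefix[:3])          # commit record not yet shipped
assert 88421 not in visible                # reservation must stay invisible
lsn, visible = replay(prefix, replay_lsn=lsn, visible_txs=visible)
assert 88421 in visible                    # visible only after the commit record
```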
An ASCII view of the normal path makes the coupling clearer:
client tx
    |
    v
primary generates WAL ----> local fsync ----> commit visible on primary
                                |
                                +---- WAL sender ----> standby stores WAL ----> standby replays WAL ----> query visible on standby
The arrow from "local fsync" to "standby stores WAL" is intentionally separate from the arrow to "standby replays WAL." Shipping and apply are different phases. That distinction is the core primitive for the rest of the month.
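A tiny state model makes the "different phases" point explicit. The sketch below uses hypothetical names, not a real engine API; it simply tracks the three standby positions separately and enforces the only ordering that can hold between them: applied <= flushed <= received.

```python
# Sketch of the three standby watermarks as distinct positions (hypothetical
# class, not an engine API). The invariant applied <= flushed <= received
# always holds, but the gaps between the positions can be large.

class ReplicaState:
    def __init__(self):
        self.received_lsn = 0   # WAL bytes that have arrived over the network
        self.flushed_lsn = 0    # WAL fsynced to the standby's own disk
        self.applied_lsn = 0    # WAL replayed into pages, visible to queries

    def receive(self, lsn):
        self.received_lsn = max(self.received_lsn, lsn)

    def flush(self, lsn):
        assert lsn <= self.received_lsn, "cannot flush WAL that never arrived"
        self.flushed_lsn = max(self.flushed_lsn, lsn)

    def apply(self, lsn):
        assert lsn <= self.flushed_lsn, "replay only advances over durable WAL"
        self.applied_lsn = max(self.applied_lsn, lsn)

standby = ReplicaState()
standby.receive(0x8A58)   # the commit record has arrived...
standby.flush(0x8A58)     # ...and is durable on the standby...
print(standby.applied_lsn < standby.received_lsn)  # ...but not yet query-visible
```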
Concept 2: A replica has multiple positions, not one "caught up" flag
Harbor Point's operations dashboard should never display a single boolean called replica_healthy. A useful standby exposes at least three meaningful positions. One position tells you how much WAL the standby has received over the network. Another tells you how much of that WAL is durable on the standby's own disk. A third tells you how much has been replayed into data pages so hot-standby queries can see it.
Those positions diverge in ordinary production conditions. Suppose the primary commits reservation R-88421 at LSN 8A/58. The standby might receive that record almost immediately because the network is fast, but replay could still lag because the standby is saturating I/O or because a long-running read query is delaying page changes. In that moment Harbor Point has very low transport lag and very real read staleness.
The operational meaning of each watermark is different:
- receive_lsn: the standby has the bytes in memory or on its ingress path, but not necessarily durably.
- flush_lsn: the standby has fsynced that WAL locally, so it can continue recovery from that prefix even after its own crash.
- replay_lsn: the standby has applied that prefix, so queries on the standby can observe the effects.
That separation explains a common confusion around failover. A standby can be safer than its visible query state suggests. If it has flushed WAL beyond its current replay position, promotion can continue recovery from the durable local prefix before opening for writes. By contrast, read scaling depends on replay, not merely receipt or flush, because trader dashboards and support queries only see changes that have actually been applied.
Harbor Point therefore needs metrics and alerts keyed to the right question. "How stale are standby reads?" is a replay question. "How much WAL would the standby lose if it crashed right now?" is a flush question. "Is the network path to the standby keeping up?" is a receive question. Treating those as one number hides the real failure mode.
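As a concrete illustration, the sketch below turns the watermarks into the three different lag numbers those questions call for. It assumes PostgreSQL-style LSN text such as '8A/58', where the two hex halves are the high and low 32 bits of a byte position; the column names mirror pg_stat_replication-style fields, but the sampled values are invented.

```python
# Sketch: convert PostgreSQL-style LSN strings into byte offsets and compute
# the three lags separately. The sample values are invented for illustration.

def lsn_to_bytes(lsn: str) -> int:
    """'8A/58' -> byte position from the high and low 32-bit hex halves."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

# One monitoring sample: the primary's current position plus standby watermarks.
sample = {
    "primary_lsn": "8A/58",   # latest WAL the primary has generated
    "receive_lsn": "8A/58",   # everything has arrived on the standby...
    "flush_lsn":   "8A/40",   # ...most of it is durable there...
    "replay_lsn":  "8A/10",   # ...but replay is still behind
}

primary = lsn_to_bytes(sample["primary_lsn"])
transport_lag  = primary - lsn_to_bytes(sample["receive_lsn"])  # network question
durability_gap = primary - lsn_to_bytes(sample["flush_lsn"])    # crash-safety question
read_staleness = primary - lsn_to_bytes(sample["replay_lsn"])   # dashboard question

print(transport_lag, durability_gap, read_staleness)  # 0, 24, 72 bytes behind
```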
Concept 3: Log shipping stays healthy only if retention, apply speed, and conflicts are managed
Once Harbor Point starts relying on a standby, the dangerous failures are usually not mysterious consensus bugs. They are ordinary backlog problems at the replication primitives layer. If the standby disconnects for long enough, the primary may recycle old WAL segments before the standby consumes them. At that point the standby cannot resume from its last position and needs a new base backup or another seeding path. The database did not "partially replicate." It lost the continuous log prefix required to continue recovery.
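A hedged sketch of that failure mode: given the oldest WAL position the primary still retains and the position where a reconnecting standby wants to resume, the check below decides whether catch-up is even possible. The function name, threshold, and positions are invented for illustration.

```python
# Sketch: can a reconnecting standby resume from retained WAL, or does it
# need a fresh base backup? Positions are byte offsets; names are illustrative.

def catchup_plan(oldest_retained_lsn: int, standby_resume_lsn: int,
                 warn_headroom_bytes: int = 64 * 1024 * 1024):
    if standby_resume_lsn < oldest_retained_lsn:
        # The continuous prefix is gone: later WAL cannot reconstruct it.
        return "reseed: take a new base backup, streaming cannot resume"
    headroom = standby_resume_lsn - oldest_retained_lsn
    if headroom < warn_headroom_bytes:
        return f"resume, but alert: only {headroom} bytes before falling off retention"
    return "resume streaming from the standby's last position"

print(catchup_plan(oldest_retained_lsn=900, standby_resume_lsn=400))       # reseed
print(catchup_plan(oldest_retained_lsn=100, standby_resume_lsn=400))       # resume + alert
print(catchup_plan(oldest_retained_lsn=0, standby_resume_lsn=10 * 1024**3))  # resume
```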
Apply lag is the second class of problem. Harbor Point wants to run trader dashboards on the standby, but read traffic is only safe when it does not prevent the standby from applying new WAL promptly. Depending on engine behavior, long-running reads can delay cleanup or conflict with replayed changes. Even without query conflicts, replay can fall behind simply because one standby is under-provisioned for the rate at which the primary generates WAL during market open. The result looks like an application bug because dashboards are stale, yet the root cause is replay throughput.
The third pressure is observability. Log shipping gives strong ordering guarantees precisely because it insists on a contiguous prefix of WAL. That means small gaps matter. If Harbor Point sees receive_lsn moving but replay_lsn stuck, the team should ask whether replay is blocked, not whether the application "forgot to refresh." If none of the positions move, the issue is earlier: WAL sender, network path, standby receiver, or retention. The lesson is not that replication is fragile. The lesson is that the right primitive-level signals make it diagnosable.
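That diagnostic logic can be written down directly. Here is a minimal sketch with invented names and no thresholds: it compares two samples of the standby's positions and names the phase that deserves the first look.

```python
# Sketch: given two samples of the standby watermarks taken a minute apart,
# name the phase to investigate first. Names and values are illustrative.

def diagnose(before: dict, after: dict, primary_advanced: bool = True) -> str:
    receive_moved = after["receive_lsn"] > before["receive_lsn"]
    replay_moved = after["replay_lsn"] > before["replay_lsn"]
    if not receive_moved and primary_advanced:
        return "shipping stalled: check WAL sender, network path, receiver, retention"
    if receive_moved and not replay_moved:
        return "replay blocked: check standby I/O, replay conflicts, long-running reads"
    return "positions advancing: compare lag sizes before changing anything"

before = {"receive_lsn": 0x8A10, "replay_lsn": 0x8A10}
after  = {"receive_lsn": 0x8B00, "replay_lsn": 0x8A10}
print(diagnose(before, after))   # replay blocked: check standby I/O, ...
```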
This is the trade-off of physical log shipping. It is operationally efficient, preserves exact engine semantics, and makes failover conceptually straightforward because every standby is reconstructing the same storage history. In exchange, replicas are tightly bound to the primary's engine version and physical layout, and every standby must keep up with a single ordered WAL stream instead of independently interpreting business events. That is acceptable for Harbor Point because the requirement is "another trustworthy copy of the same OLTP database," not "a differently shaped downstream system."
Troubleshooting
Issue: The standby reports that it is connected, but the trader dashboard still misses reservations that committed seconds ago.
Why it happens / is confusing: Connection health only proves that the shipping path exists. The standby may be receiving WAL while replay is behind due to I/O saturation, replay conflicts, or a long-running query.
Clarification / Fix: Compare receive, flush, and replay positions separately. If replay alone is lagging, reduce conflicting standby reads, provision more I/O, or move the dashboard to a replica tier that can keep up with market-open WAL volume.
Issue: A standby comes back after an outage and cannot catch up from its old position.
Why it happens / is confusing: The primary kept moving and eventually removed WAL segments older than the standby's restart point. Log shipping requires a continuous WAL prefix; missing segments are not reconstructible from the later log alone.
Clarification / Fix: Increase WAL retention, use replication slots or an equivalent retention mechanism carefully, and alert on lag before the standby falls off the retained history window.
Issue: Operations promotes a standby and discovers that some recently committed work is not yet visible to support queries.
Why it happens / is confusing: The standby may have durably stored WAL beyond its replay point. Promotion can continue recovery from that durable prefix, but hot-standby reads before promotion were only seeing the replayed prefix.
Clarification / Fix: Distinguish read freshness from failover safety in runbooks and dashboards. For user-facing reads, watch replay lag. For promotion analysis, also inspect what WAL has been durably flushed locally.
Advanced Connections
Connection 1: ../26/16.md chose an engine whose WAL could support replication without redefining correctness
The month 26 capstone argued that Harbor Point needed a page-oriented MVCC engine because ordered WAL, B-tree maintenance, and bounded recovery matched the reservation workload. This lesson extends that same decision over the network. If the local engine's commit order were not trustworthy, remote replicas would only reproduce the confusion on more machines.
Connection 2: 02.md and 03.md build on these watermarks in different ways
02.md asks which replica milestone a client commit should wait for: local durability only, remote flush, or something stricter. 03.md revisits the same idea through Raft, where the replicated log is logical and quorum-committed rather than a physical WAL stream. The common thread is that ordering comes first; the difference is what exactly is being ordered and what acknowledgment rule makes it durable.
Resources
Optional Deepening Resources
- [DOC] PostgreSQL Documentation: Log-Shipping Standby Servers
- Focus: The end-to-end mechanics of WAL shipping, standby recovery, and how streaming replication extends the same recovery machinery used after a crash.
- [DOC] PostgreSQL Documentation: pg_stat_replication and Replication Monitoring
- Focus: How real systems expose sent, write, flush, and replay positions so operators can separate transport lag from apply lag.
- [DOC] MySQL Reference Manual: Replication Threads
- Focus: The same primitive split in another engine family, where an I/O thread fetches changes and an applier thread replays them from a relay log.
- [BOOK] Designing Data-Intensive Applications
- Focus: The broader conceptual frame for replication logs, follower catch-up, and why a replica is really a state machine consuming an ordered history.
Key Insights
- Log shipping works because replication reuses recovery - The replica consumes the same ordered durable history the primary would use after its own crash.
- Replica freshness has multiple milestones - Received WAL, flushed WAL, and replayed WAL answer different production questions and should never be collapsed into one status flag.
- Healthy replication depends on continuous history and timely apply - Retention gaps, under-provisioned standby I/O, and replay conflicts are the ordinary ways a replica stops being trustworthy.