Day 429: Backup, Snapshots, and Point-in-Time Recovery

Consistency and Replication · 30 min · advanced

The core idea: A production-grade recovery system is a time machine built from two parts: a base image that is known to be restorable, and an unbroken log history that can replay the database forward to the exact moment you choose.

Today's "Aha!" Moment

In 12.md, Harbor Point learned that a global secondary index is not decorative metadata. It is maintained state with its own write path, lag surface, and repair story. That becomes much more serious the day recovery enters the picture. At 09:41, an operator runs the wrong maintenance job against bond_reservations, and thousands of open reservations for issuer CA-MUNI are flipped to released. The corresponding rows in gsi_open_by_issuer are deleted just as faithfully. The cluster is behaving correctly. The business result is disastrous.

The obvious reaction, "restore last night's backup," is not a real answer. Harbor Point would get the lost reservations back, but it would also erase every legitimate trade placed since the backup finished. A dump taken at 02:00 is a copy of history at 02:00, not a mechanism for reconstructing 09:40:59. On the other side, WAL archives without a base image are not enough either, because replay needs a known starting set of pages, indexes, and metadata.

Point-in-time recovery works because it combines those two ingredients into one contract. The base backup says, "starting from this exact on-disk state, all later changes can be reconstructed from log position X onward." The archived WAL says, "here is every durable change after that point, in the order the engine committed it." Recovery is therefore not "copy files and hope." It is controlled re-execution of database history until the last safe boundary before the bad event.

That framing also fixes a common misconception. Teams often talk about backups as if the only question is whether a copy exists. In production, the harder question is whether the copy, the archived log stream, and the stop point all line up tightly enough to produce the state the business actually needs. Harbor Point does not need "a backup." It needs a restore path that can land just before the erroneous release job and can prove that the restored state is internally consistent.

Why This Matters

Harbor Point's reservation service is busiest during the market open, which is also the worst time to discover that "restore" was never modeled beyond an S3 retention policy. The database contains the authoritative reservation rows, exposure summaries, and the index structures that make trader and compliance workflows fast enough to use. If a bad migration, operator mistake, storage fault, or software bug corrupts those structures at 09:41, the team has two business constraints at once: recover the right state, and do it fast enough that trading does not stay frozen for hours.

Without a real PITR design, the team is trapped between two bad options. Restoring an old full backup loses too much legitimate work. Keeping only logical exports gives a readable copy of some tables, but not the engine state required to bring the whole cluster back consistently. Relying on "the replicas probably have the old data" is worse, because a faithfully replicated mistake spreads the corruption to every healthy standby.

Once the recovery path is designed around snapshots plus WAL, the problem becomes precise. Harbor Point can choose a base backup from 02:00, restore it onto new hosts, replay WAL until just before the destructive maintenance transaction, and then decide what to do about derived systems that live outside the core cluster. The next lesson, 14.md, naturally follows from that workflow: once recovery is decomposed into snapshot age, archived-log completeness, replay speed, and validation, those stages become observable production commitments rather than folklore.

Learning Objectives

By the end of this session, you will be able to:

  1. Explain why a physical base backup must be anchored to log positions - Distinguish a restorable snapshot from a generic filesystem copy or logical export.
  2. Trace how point-in-time recovery reconstructs the database to a chosen boundary - Follow base restore, WAL replay, recovery target selection, and timeline creation in Harbor Point's cluster.
  3. Evaluate recovery trade-offs for real production systems - Decide what must be restored exactly, what can be rebuilt, and how those choices shape RPO, RTO, and operational complexity.

Core Concepts Explained

Concept 1: A restorable base backup is a crash-consistent starting point, not just a copy of files

Harbor Point takes nightly physical backups of the reservation cluster while trading is still possible in some regions. That means a backup can capture data files while the database is actively mutating pages, splitting B-tree nodes, and advancing transaction visibility metadata. If the team simply copied the data directory with no coordination, the result could contain page P1 from before transaction T88421 and index page I7 from after T88421, which is not a state the database ever considered valid.

That is why practical base backups are tied to the engine's recovery model. Either the database exposes a safe backup mode, or a tool such as pg_basebackup coordinates the process: it records the WAL position where the backup starts and guarantees that recovery can use WAL to bridge the fuzzy parts of the snapshot. The copied files do not need to represent one perfectly frozen instant by themselves. They need to represent an on-disk state that becomes consistent once replay begins from the recorded log boundary.

For Harbor Point, the mental model looks like this:

base backup files
    +
backup start/end WAL positions
    +
all WAL generated after the backup began
    =
restorable foundation

This is why a logical dump solves a different problem. A dump can re-create table contents, often portably, but it does not preserve page layout, visibility maps, free space maps, or the exact index structure the engine needs for fast recovery. PITR is about recovering the database as a storage system, not merely as a collection of rows.

The trade-off is operational: physical base backups are larger, engine-specific, and require disciplined handling of backup manifests and WAL retention. The payoff is that Harbor Point gets a starting image the engine itself knows how to trust during crash recovery and replay.
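
What that coordination produces is inspectable. A minimal sketch, assuming a PostgreSQL-style engine and hypothetical host names and paths: pg_basebackup streams a coordinated copy, the backup_label file records the WAL position replay must start from, and (on PostgreSQL 13 and later) pg_verifybackup checks the copied files against the backup manifest before anyone has to trust them during an incident.

# Coordinated physical base backup from the primary (hypothetical host and paths).
pg_basebackup --host=reservations-primary --username=backup \
              --pgdata=/backups/base_2026-04-01 \
              --wal-method=stream --checkpoint=fast --progress

# The backup_label file records the WAL boundary that recovery must replay from.
grep 'START WAL LOCATION' /backups/base_2026-04-01/backup_label

# PostgreSQL 13+: verify the copied files against the backup manifest.
pg_verifybackup /backups/base_2026-04-01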

Concept 2: Point-in-time recovery is ordered log replay that stops at a business-safe boundary

Suppose Harbor Point restores the 02:00 base backup after the bad 09:41 maintenance job. The cluster does not become useful the moment the files are copied back. The real work is replaying every committed change between the backup and the chosen recovery target. WAL is what makes that possible because it records the same low-level state transitions the primary relied on for durability in the first place.

The recovery flow is conceptually simple:

02:00 base backup at timeline T1
        |
        v
restore files onto new hosts
        |
        v
fetch WAL segments from archive
        |
        v
replay records in commit order
        |
        v
stop at target_time / target_lsn / restore point
        |
        v
promote restored cluster onto new timeline T2

In Harbor Point's incident, the operator knows the destructive job committed at 09:41:23 UTC. Recovery can therefore be configured to stop just before that commit:

restore_command = 'cp /wal-archive/%f %p'
recovery_target_time = '2026-04-01 09:41:23+00'
recovery_target_inclusive = false
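# Assuming PostgreSQL 12 or newer: these settings live in postgresql.conf, and an
# empty recovery.signal file in the data directory switches the server into
# targeted recovery. By default recovery pauses at the target
# (recovery_target_action = 'pause') so the state can be inspected before promotion.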

The details matter. Recovery targets are evaluated at commit boundaries, not at the moment a transaction first started doing work. A long-running transaction that began before 09:41:23 but committed after that instant is excluded regardless; recovery_target_inclusive = false additionally excludes a transaction whose commit lands exactly on the target timestamp. The team also has to respect the backup window itself: a base backup cannot recover to a time before that backup completed, because the files on disk were still being copied then.
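
When the exact boundary is in doubt, the archived WAL can settle it before anything is promoted. A hedged sketch, with illustrative segment names, that uses pg_waldump to locate the destructive transaction's commit record in the archive:

# Scan the relevant archived segments for commit records (segment names are illustrative).
pg_waldump --path=/wal-archive --rmgr=Transaction \
           000000010000004B00000021 000000010000004B00000025 | grep COMMIT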

This makes PITR precise, but not effortless. The archive must contain every WAL segment from the base backup forward, in the correct timeline, without corruption or accidental garbage collection. Harbor Point is trading storage cost and archive-management discipline for the ability to land on a very narrow safe point instead of choosing between "yesterday" and "now."

Concept 3: Recovery design is really about scope, authority, and rehearsal

The hardest production question is rarely "can the database replay WAL?" The harder question is "what exactly must be correct when the restored system goes live?" Harbor Point's base tables are authoritative. The gsi_open_by_issuer structure from 12.md is also inside the database, so a physical backup and WAL replay restore it automatically if its mutations were committed transactionally with the base rows. That is the easy case.

Now imagine Harbor Point also feeds an external search index and a risk dashboard cache from change data capture. A PITR restore of the core database does not automatically rewind those external systems to the same cutover point. The team must decide whether they are authoritative enough to require their own aligned restore, or derived enough to be rebuilt after the database comes back. That design decision changes restore steps, downtime, and the amount of post-recovery validation required.

This is where RPO and RTO become concrete engineering knobs instead of executive slogans. A shorter base-backup interval can reduce replay volume, but costs more I/O and storage. Longer WAL retention improves recovery flexibility, but raises archive spend and operational surface area. Rebuilding derived structures after restore can shrink backup size, but extends the time before the platform is truly healthy. Harbor Point needs these choices documented before the incident, not improvised during it.
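
To make those knobs concrete with purely illustrative numbers: if the 02:00 base backup is restored after the 09:41 incident and the cluster generated roughly 40 GB of WAL in the intervening hours, a sustained replay rate of about 80 MB/s implies roughly eight to nine minutes of replay on top of file restore, validation, and cutover. Halving the backup interval roughly halves that replay volume while doubling nightly backup I/O; the specific figures are assumptions, but this arithmetic is the knob the team is actually turning when it negotiates RPO and RTO.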

That is also why restore drills matter more than backup success logs. A successful upload proves only that bytes were copied somewhere. A successful drill proves that the snapshot, WAL archive, target-selection procedure, application cutover steps, and data validation checks actually compose into a usable recovery path. The observability lesson in 14.md starts from exactly this point: once recovery is treated as a staged mechanism, each stage needs metrics, alerts, and an SLO of its own.
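
A drill can be scripted end to end. The sketch below is an outline under stated assumptions, not a production runbook: the paths, archive file name, database name, and validation query are all hypothetical, and it reuses the incident timestamps from the recovery example above.

# 1. Lay down the base backup on a scratch host (hypothetical archive path).
mkdir -p /var/lib/postgresql/restore
tar -xf /backups/base_2026-04-01.tar -C /var/lib/postgresql/restore

# 2. Point recovery at the WAL archive and the chosen stop point.
cat >> /var/lib/postgresql/restore/postgresql.conf <<'EOF'
restore_command = 'cp /wal-archive/%f %p'
recovery_target_time = '2026-04-01 09:41:23+00'
recovery_target_inclusive = false
recovery_target_action = 'promote'
EOF
touch /var/lib/postgresql/restore/recovery.signal

# 3. Start the instance and let it replay to the target.
pg_ctl start -D /var/lib/postgresql/restore -l /tmp/restore_drill.log

# 4. Validate a business invariant: the reservations the bad job released
#    should still be open in the restored state (illustrative schema).
psql -d reservations -c "SELECT count(*) FROM bond_reservations WHERE issuer = 'CA-MUNI' AND status = 'open';"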

Troubleshooting

Issue: The restored cluster starts, but several legitimate reservations from just before the incident are missing.

Why it happens / is confusing: The team targeted the wrong stop point. Time-based recovery may have used the wrong timezone, the chosen timestamp may fall earlier than the commits of legitimate work, or recovery_target_inclusive may have excluded a transaction the operators meant to keep.

Clarification / Fix: Prefer named restore points or exact LSN targets when the workflow allows it, record incident times in UTC, and verify the commit boundary of the destructive transaction before promoting the restored cluster.
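
For planned risky work, a named restore point removes the timezone and boundary guesswork entirely. A small sketch, assuming the maintenance runbook is amended to create the marker up front:

-- Run on the primary immediately before the risky maintenance job.
SELECT pg_create_restore_point('before_reservation_maintenance');

During recovery, recovery_target_name = 'before_reservation_maintenance' (or recovery_target_lsn with an exact log position) then replaces the time-based target.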

Issue: Recovery fails midway with an error about a missing WAL segment or timeline history file.

Why it happens / is confusing: The base backup itself may be valid, but PITR requires an unbroken archive chain from the backup's recorded start point through the target. Aggressive WAL cleanup, broken archive commands, or partial object-store uploads can leave a gap that only appears during restore.

Clarification / Fix: Treat WAL archival as a durability pipeline, not a background convenience. Validate archive completeness continuously, keep retention aligned with the maximum recovery window, and test restores from the actual archive rather than from local happy-path files.
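
One concrete piece of that discipline is the classic archiving example from PostgreSQL's own documentation, which refuses to overwrite a segment that already exists, so archival problems surface as errors instead of silent gaps (archive path hypothetical):

archive_mode = on
archive_command = 'test ! -f /wal-archive/%f && cp %p /wal-archive/%f'

Most production deployments ship segments to object storage through a dedicated archiving tool instead of cp, but the property to preserve is the same: the command must fail loudly rather than drop or overwrite a segment, because a gap otherwise stays invisible until restore.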

Issue: The database is transactionally consistent after PITR, but downstream search or analytics results disagree with the restored state.

Why it happens / is confusing: PITR repaired the authoritative database, but asynchronous derived systems were neither restored to the same point nor rebuilt from the restored source of truth.

Clarification / Fix: Classify every dependent system ahead of time as authoritative or rebuildable. After recovery, either restore the authoritative dependents to the same cutover point or replay/rebuild the derived ones before declaring the platform healthy.

Advanced Connections

Connection 1: 12.md made global indexes a correctness problem; PITR turns them into a recovery-scope problem

Once Harbor Point models gsi_open_by_issuer as maintained state, disaster recovery has to answer whether that state is restored in lockstep with the base table or regenerated after the fact. The correctness story from the write path becomes a scope decision in the recovery path.

Connection 2: 14.md turns recovery stages into observable promises

Backup age, WAL archive lag, replay throughput, restore duration, and post-restore validation success are all measurable. Observability is not an add-on after PITR is designed; it is the way Harbor Point proves the recovery design is still intact before the next incident forces the issue.

Key Insights

  1. A backup becomes useful only when it is paired with the log boundary that makes it restorable - Physical copies without replay metadata are just incomplete raw material.
  2. PITR is replay of committed history, not selective row repair - The engine rebuilds a valid state by following log order and stopping at an explicit target.
  3. Recovery scope is an architectural decision - Teams must decide in advance which structures are authoritative, which are rebuildable, and how that choice changes restore time and validation.