LESSON
Day 448: Monthly Capstone: Primary Region Outage Runbook
The core idea: A primary-region outage runbook is a commit-order recovery protocol. You freeze new writes, prove which WAL prefix is durable in the failover region, fence the old primary, and only then promote, reroute, and decide which background jobs may resume.
Today's "Aha!" Moment
In 046.md, Harbor Point treated online reindexing as a resumable state machine: choose a snapshot frontier, backfill a shadow index, catch up with live writes, validate parity, and cut over only when the catalog says the new structure is trustworthy. At 09:33 on a volatile Monday, that exact rebuild is still in progress on the reservations cluster when the primary region, iad, disappears behind a power and network event. The trading desk still needs to open reservations, compliance still needs same-day visibility, and the index job is stuck somewhere between catch-up and validation.
The tempting mistake is to frame the problem as "bring up the replica quickly." That is not the real task. The real task is to decide what database state is authoritative. Which commits acknowledged in iad are definitely durable in phx? Did the reindex job publish a new catalog state, or was it still only writing into a shadow structure? Are application workers, cron jobs, and queue consumers guaranteed to stop sending writes to the old primary if it comes back half-alive? A runbook exists so those questions have deterministic answers before humans improvise under stress.
That is why outage handling belongs inside database internals, not outside them. Failover is not only an SRE procedure and not only a routing change. It is the moment when WAL durability, replication mode, MVCC visibility, job metadata, and process-level fencing all have to agree on one timeline. If they do not, Harbor Point can reopen the system quickly and still lose reservation writes, double-apply side effects, or let an invalid index become visible after promotion.
Why This Matters
Harbor Point's reservation system is the kind of database where "mostly correct" is operationally useless. A missed reservation can leave issuer exposure understated. A duplicated reservation can over-reserve headroom and block trading that should succeed. A failover that promotes a replica without knowing the state of the in-flight reindex from 046.md can create a subtler failure: queries keep running, but one region believes a shadow index is ready while another never durably recorded the cutover.
Production runbooks are supposed to turn scary moments into bounded decisions. In practice, many organizations still keep disaster recovery as a loose checklist: promote the most up-to-date replica, switch DNS, watch dashboards, and clean up later. That is not enough for a transactional database. The hard parts happen before traffic is reopened: stop every remaining write path, identify the highest safe WAL position, choose whether the organization will accept any data loss beyond that point, and preserve job and schema metadata so the new primary does not inherit an ambiguous half-state.
When the runbook is mechanistic, the recovery time objective (RTO) and recovery point objective (RPO) stop being slogans. The team knows which metrics gate promotion, what gets paused, what can resume, and which writes need reconciliation after the incident. When the runbook is vague, speed becomes a liability. The cluster comes back faster, but the business learns hours later that the recovered timeline was never made authoritative.
Learning Objectives
By the end of this session, you will be able to:
- Determine whether a failover target is safe to promote - Use WAL durability, replay position, fencing state, and replicated job metadata to decide what commit prefix can become authoritative.
- Walk through a concrete primary-region outage runbook - Explain the sequence of write freeze, fencing, promotion, traffic reopening, and maintenance-job handling for Harbor Point's reservation database.
- Plan post-failover reconciliation - Identify which requests, side effects, and database nodes must be reconciled before the incident can truly be considered closed.
Core Concepts Explained
Concept 1: Promotion means choosing an authoritative commit prefix
Harbor Point runs the reservations primary in iad and streams WAL to a warm standby in phx. During normal operation, the desk accepts a little cross-region replication lag because fully synchronous remote commits would slow the hot path too much during the 09:30 market-open burst. That design choice is reasonable, but it changes what failover can promise. Once iad disappears, phx is not automatically allowed to become the new primary for "everything the old primary ever acknowledged." It is only allowed to become primary for the commit prefix it can prove is durably present.
That proof starts with WAL, not with query latency and not with a vague "replica looks healthy" dashboard. Harbor Point needs the highest WAL location that is definitely replayed or at least durably received in phx, depending on the failover policy. It also needs to know whether acknowledgments in iad were local-only or required remote durability. If client commits were acknowledged before phx had them, then some acknowledged transactions may now be outside the safe promotion frontier. The runbook has to say whether the business accepts that bounded loss or whether writes stay offline until archive recovery or another replica can close the gap.
The mental model looks like this:
client commit
      |
      v
iad primary WAL append -----> WAL archive
      |
      +-----------------------> phx standby receive -> replay

safe promotion frontier = highest commit prefix that Harbor Point
                          can prove is durable on the region it is about to promote
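In practice, the frontier check is a couple of queries against the candidate standby. The sketch below assumes psycopg2 and a reachable phx node; the DSN value and the exact policy choice between the received and replayed positions are illustrative, not Harbor Point's actual tooling.

```python
# Sketch: establish the safe promotion frontier on the phx standby.
# Assumes psycopg2 and a reachable hot standby; PHX_DSN is hypothetical.
import psycopg2

PHX_DSN = "host=phx-standby.example dbname=reservations user=failover"

def promotion_frontier(dsn: str) -> dict:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # Refuse to treat a node that is not in recovery as a standby.
        cur.execute("SELECT pg_is_in_recovery()")
        if not cur.fetchone()[0]:
            raise RuntimeError("node is not a standby; wrong promotion target?")

        # Highest WAL durably received vs. highest WAL actually replayed.
        cur.execute("""
            SELECT pg_last_wal_receive_lsn(),
                   pg_last_wal_replay_lsn(),
                   pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                                   pg_last_wal_replay_lsn())
        """)
        received, replayed, replay_gap_bytes = cur.fetchone()

    return {
        "received_lsn": received,    # durably on phx disk, not yet necessarily applied
        "replayed_lsn": replayed,    # conservative frontier: already applied locally
        "replay_gap_bytes": replay_gap_bytes,
    }

if __name__ == "__main__":
    print(promotion_frontier(PHX_DSN))
```

The gap between the two values is the replay debt that promotion has to clear before the new primary can open, and the incident log should record whichever position the team chooses as the frontier.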
The same rule applies to metadata, not only to user rows. The in-flight reindex from 046.md has state in job tables and catalog records: snapshot frontier, completed spans, validation progress, and whether cutover was ever committed. If that metadata is not present on phx, operators cannot assume the new primary "knows" where the maintenance job was. Promotion therefore chooses an authoritative prefix for the entire database state machine, including background work.
This is the first trade-off the runbook makes explicit. Asynchronous cross-region replication keeps the hot path faster, but it means the outage procedure may involve a data-loss boundary that humans have to acknowledge. Tighter durability guarantees reduce that ambiguity, but they spend latency on every healthy-day commit. A production runbook is where the system finally admits which side of that trade-off it chose.
Concept 2: The runbook is a sequence of authority transfers, not a bag of recovery tricks
Once iad is considered unhealthy, Harbor Point should not improvise from scratch. The database runbook needs ordered phases, because each phase answers one authority question before the next one starts.
1. Freeze writes and background mutators
2. Fence the old primary region
3. Verify the candidate standby and choose promotion LSN
4. Promote with a new epoch/timeline
5. Reopen traffic in stages
6. Resume or abort unfinished maintenance jobs explicitly
Freezing writes means more than setting the application to read-only mode. Harbor Point has API servers, asynchronous queue consumers that emit reservation amendments, and schema-maintenance workers left over from the reindex job. If any of them can still reach iad or cache the old writer endpoint, the system does not yet have one writer. The first phase therefore disables writer feature flags, pauses job schedulers, and stops any process that could emit new WAL before the database team has proved which region is in charge.
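One way to keep this phase auditable is to drive it from a single function that records every mutator it stopped. The sketch below is illustrative only: the flag service, scheduler, and consumer objects are hypothetical stand-ins for whatever control plane Harbor Point actually runs.

```python
# Sketch: phase 1, freeze every path that can emit WAL.
# FlagService, JobScheduler, and the consumer objects are hypothetical interfaces.
import logging
import time

log = logging.getLogger("runbook.freeze")

def freeze_writes(flags, scheduler, consumers) -> list[str]:
    """Stop mutators in a fixed order and return what was stopped, for the incident log."""
    stopped = []

    # 1. Flip writer feature flags so API servers reject new mutations.
    flags.set("reservations.writes_enabled", False)
    stopped.append("api_writer_flag")

    # 2. Pause schedulers so cron-style jobs (including reindex phases) cannot start new work.
    scheduler.pause_all(reason="primary-region outage, day-448 runbook")
    stopped.append("job_scheduler")

    # 3. Stop queue consumers that emit reservation amendments.
    for consumer in consumers:
        consumer.stop(timeout_seconds=30)
        stopped.append(f"consumer:{consumer.name}")

    log.info("write freeze complete at %s: %s", time.strftime("%H:%M:%S"), stopped)
    return stopped
```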
Fencing is the next non-negotiable step. If iad comes back with stale leader state while phx is already accepting writes, Harbor Point has a split-brain event rather than a recovery. Fencing can involve load-balancer removal, lease revocation in service discovery, credential revocation for automation, or even forceful node shutdown if a host is reachable but unhealthy. The exact mechanism depends on the platform, but the invariant is stable: the old primary must be unable to accept writes before the new one is opened.
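The mechanism varies, but the shape is usually a writer epoch (or term) that only one region may hold at a time. A minimal sketch, assuming a hypothetical strongly consistent lease store with a compare-and-swap operation:

```python
# Sketch: phase 2, fence the old primary by bumping the writer epoch.
# lease_store is a hypothetical strongly consistent store (etcd/Consul-style);
# the key name and method signature are assumptions for illustration.
class FencingError(RuntimeError):
    pass

def fence_old_primary(lease_store, old_epoch: int) -> int:
    """Revoke iad's writer lease and mint a new epoch that only phx will hold."""
    new_epoch = old_epoch + 1

    # Compare-and-swap: succeeds only if nobody else already advanced the epoch.
    ok = lease_store.compare_and_swap(
        key="reservations/writer_epoch",
        expected=old_epoch,
        new=new_epoch,
    )
    if not ok:
        raise FencingError("epoch already advanced; another failover may be in progress")

    # Every write path must check its epoch against this value before emitting WAL,
    # so a half-alive iad holding the old epoch is rejected even if it becomes reachable.
    return new_epoch
```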
Only after fencing does Harbor Point validate phx as the failover target. Operators check replay position, replication health, disk pressure, and whether the replicated metadata for the in-flight reindex is coherent. A practical checklist for the reservation cluster looks like this:
- highest replayed WAL location in phx
- highest archived WAL location available to phx
- current job record for idx_reservations_by_settlement rebuild
- catalog flag showing whether the new index is visible to the planner
- replication lag on read replicas that will follow phx after promotion
- lease/epoch store ready to issue a new writer term
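A sketch of automating the metadata half of that checklist follows. The reindex_jobs table is a hypothetical stand-in for the job-state table introduced in 046.md; the pg_index lookup, by contrast, uses real PostgreSQL catalog columns.

```python
# Sketch: validate that phx carries coherent metadata for the in-flight reindex.
# Assumes psycopg2; reindex_jobs is a hypothetical job-state table from 046.md.
import psycopg2

def check_reindex_metadata(dsn: str, index_name: str = "idx_reservations_by_settlement") -> dict:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # Hypothetical job table: last durably recorded phase for the rebuild.
        cur.execute(
            "SELECT phase, updated_at FROM reindex_jobs WHERE index_name = %s",
            (index_name,),
        )
        job_row = cur.fetchone()  # e.g. ('validation', timestamp), or None if never replicated

        # Real catalog columns: is the shadow index marked valid/ready for the planner?
        cur.execute(
            """
            SELECT i.indisvalid, i.indisready
            FROM pg_index i
            JOIN pg_class c ON c.oid = i.indexrelid
            WHERE c.relname = %s
            """,
            (index_name,),
        )
        catalog_row = cur.fetchone()

    return {
        "job_phase": job_row[0] if job_row else None,
        "planner_visible": bool(catalog_row and catalog_row[0]),
        "index_ready": bool(catalog_row and catalog_row[1]),
    }
```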
Promotion itself should create a new epoch or timeline that every write path can recognize. The database becomes writable in phx, service discovery points writer traffic there, and only leader-scoped background workers start up. This is where 046.md matters again. If the reindex metadata on phx says "backfill complete, validation incomplete, cutover not committed," the job can resume from validation under the new primary. If the metadata says the cutover was already committed and replicated, the index is authoritative and the job should not rerun old phases. If the metadata is missing or contradictory, the runbook must force the safer branch: keep the shadow index non-public and rebuild or revalidate instead of guessing.
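On PostgreSQL, the promotion step itself can be a single call to pg_promote(); the epoch bookkeeping around it is the part Harbor Point has to supply. In the sketch below, the runbook_events table is a hypothetical audit table, not a built-in.

```python
# Sketch: phase 4, promote phx and bind the new writer epoch to the new timeline.
# pg_promote() is real PostgreSQL (v12+); runbook_events is a hypothetical audit table.
import psycopg2

def promote_phx(dsn: str, new_epoch: int) -> None:
    with psycopg2.connect(dsn) as conn:
        conn.autocommit = True
        with conn.cursor() as cur:
            # Block until promotion finishes (or the default 60-second wait elapses).
            cur.execute("SELECT pg_promote(wait => true)")
            if not cur.fetchone()[0]:
                raise RuntimeError("pg_promote timed out; do not reopen writes")

            # Record the epoch/timeline pairing so every later phase can be audited.
            cur.execute("SELECT timeline_id FROM pg_control_checkpoint()")
            timeline = cur.fetchone()[0]
            cur.execute(
                "INSERT INTO runbook_events (event, epoch, timeline) VALUES (%s, %s, %s)",
                ("promotion", new_epoch, timeline),
            )
```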
The production trade-off is between automation speed and state richness. Fully automatic failover only works if every phase marker that matters to correctness is persisted and replicated: leader epoch, schema-job state, durable replay position, and service-level write permissions. If those markers do not exist, a human can still recover the system, but the runbook becomes slower because humans must reconstruct state from logs and side channels.
Concept 3: Recovery is incomplete until Harbor Point reconciles side effects and quarantines the old primary
After phx is promoted and writes reopen, the incident is not over. Harbor Point still has a gray window around the outage boundary where three classes of work may disagree: client requests near the failover moment, asynchronous side effects triggered from committed transactions, and the old iad region if it comes back with unreplicated WAL.
The reservation service therefore needs an explicit reconciliation pass. Every write request already carries a durable request_id, and reservation creation uses idempotent inserts keyed by reservation_id. That design is what makes the runbook survivable. Once phx is primary, Harbor Point can compare the last minute of API request logs and outbox events against committed rows on the new timeline. Requests that reached the API but never became durable can be retried safely because the idempotency key prevents duplicates. Side effects such as downstream risk notifications are rebuilt from the transactional outbox on the promoted primary, not replayed blindly from application memory.
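Under those assumptions, the replay step is short. The table and column names below are illustrative, but the ON CONFLICT clause is the real PostgreSQL mechanism that makes retrying a maybe-committed request safe.

```python
# Sketch: replay requests from the gray window against the promoted primary.
# Table and column names are illustrative; ON CONFLICT provides the idempotency.
import psycopg2

def replay_gray_window(dsn: str, candidate_requests: list[dict]) -> int:
    """Re-apply reservation requests that may or may not have committed before the outage."""
    applied = 0
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            for req in candidate_requests:
                # Idempotent insert: a request that already committed on the
                # authoritative timeline is silently skipped, never duplicated.
                cur.execute(
                    """
                    INSERT INTO reservations (reservation_id, request_id, issuer, amount)
                    VALUES (%(reservation_id)s, %(request_id)s, %(issuer)s, %(amount)s)
                    ON CONFLICT (reservation_id) DO NOTHING
                    """,
                    req,
                )
                applied += cur.rowcount  # 1 if newly inserted, 0 if it was already durable
        conn.commit()
    return applied
```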
The in-flight reindex also needs a post-failover decision. Suppose the job had finished backfill in iad, but validation was still running when the outage started. On phx, the safe response is not "continue as if nothing happened." The safe response is "resume from the last replicated phase marker, then rerun any proof step whose result might have depended on the missing prefix." Validation is cheap compared with letting an index become visible based on a proof that lived only in the dead region's memory or logs.
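The decision itself can be made mechanical once the replicated phase marker is the only input. The phase names below mirror the 046.md state machine, but the exact values and the shape of the marker are assumptions about how that job records progress.

```python
# Sketch: decide how the reindex job resumes on the promoted primary.
# Phase names are assumed to mirror the 046.md state machine.
from typing import Optional

def reindex_resume_plan(replicated_phase: Optional[str], cutover_committed: bool) -> str:
    if cutover_committed:
        # The cutover itself made it into the replicated prefix: the index is authoritative.
        return "keep index public; do not rerun earlier phases"
    if replicated_phase in ("backfill", "catchup", "validation"):
        # Progress markers survived, but any proof that lived only in iad must be redone.
        return f"resume from '{replicated_phase}', then rerun validation before cutover"
    # Missing or contradictory metadata: take the safer branch.
    return "keep shadow index non-public; rebuild or revalidate from scratch"
```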
The biggest operational mistake is how teams treat the old primary when it returns. iad must not simply reconnect and "catch up somehow." It may contain writes from a divergent timeline that were never accepted into phx. Mature runbooks treat the returned primary as contaminated until it has been rewound or re-seeded from the new authority. In PostgreSQL terms that could mean pg_rewind when histories are compatible or a fresh base backup when they are not. In managed systems it often means discarding the old volume and rebuilding a replica from the promoted region. The important principle is the same: there is one authoritative timeline after failover, and the old primary has to rejoin that timeline as a follower, not as a peer with partially valid history.
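A sketch of the reincorporation step for a self-managed PostgreSQL deployment follows; the paths and connection strings are placeholders, and the fallback branch deliberately stops rather than guessing.

```python
# Sketch: bring the returned iad node back as a follower of the new timeline.
# Placeholder paths and conninfo; the pg_rewind flags themselves are real.
# pg_rewind requires the target to be cleanly shut down and wal_log_hints
# (or data checksums) enabled before the incident.
import subprocess

def reincorporate_old_primary(pgdata: str, new_primary_conninfo: str) -> None:
    """Rewind the contaminated iad data directory against the promoted phx primary."""
    try:
        subprocess.run(
            [
                "pg_rewind",
                f"--target-pgdata={pgdata}",
                f"--source-server={new_primary_conninfo}",
            ],
            check=True,
        )
    except subprocess.CalledProcessError:
        # Histories too divergent (or needed WAL already recycled): pg_rewind cannot help.
        # The safe fallback is a full re-seed: empty the data directory, take a fresh
        # base backup from phx (pg_basebackup -D ... -R), then restart as a replica.
        raise RuntimeError("pg_rewind failed; re-seed iad from the promoted primary")
```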
This final phase exposes the deeper trade-off in disaster recovery design. Faster failover is only safe when application requests are idempotent, side effects are tied to committed database state, job progress is durable, and node reincorporation is automated. Teams that skip those investments can still write a runbook, but it will be slower, more manual, and less certain right when certainty matters most.
Troubleshooting
Issue: The standby is only a second or two behind, so the team wants to promote immediately.
Why it happens / is confusing: Lag dashboards often summarize transport delay; they say nothing about the exact WAL prefix that is durably replayed or about the state of catalog and job metadata. "Almost caught up" is not a correctness condition.
Clarification / Fix: Promote only after selecting an explicit safe frontier. Record the chosen WAL position in the incident log, confirm fencing first, and verify that the metadata for unfinished schema or index work is present on the candidate.
Issue: The old primary comes back after phx is live, and someone suggests re-enabling it as a second writer temporarily for capacity.
Why it happens / is confusing: The node may look healthy again, so operators focus on CPU and reachability instead of timeline divergence.
Clarification / Fix: Treat the returned node as read-disabled until it has been rewound or rebuilt from the new primary. A recovered old primary without timeline repair is a split-brain risk, not spare capacity.
Issue: After promotion, the reservation API is healthy but dashboard queries show unstable plans around the rebuilt index.
Why it happens / is confusing: The new primary may have the shadow index files and job metadata, but not the same planner statistics, cache warmth, or completed validation proof the old region had in memory.
Clarification / Fix: Resume the index job from the last replicated phase marker, rerun validation if there is any doubt, and delay planner visibility for the new index until the proof is durable on the promoted region.
Advanced Connections
Connection 1: 046.md gave the outage runbook something precise to preserve
Online reindexing became manageable only when Harbor Point represented it as durable phases instead of a background blur. The outage runbook uses that same structure. Failover is no longer "what happens if the job crashes," but "which job phase is durably part of the authoritative WAL prefix and which proof steps must be repeated on the new leader."
Connection 2: ../consistency-and-replication/018.md generalizes this capstone into steady-state policy
This lesson is a one-incident view of a larger design question: what durability and visibility guarantees should a replicated database offer every day? The next replication-focused material takes the same questions about commit order, promotion safety, and remote lag and turns them into normal operating policy instead of emergency procedure.
Resources
Optional Deepening Resources
- [DOC] PostgreSQL Documentation: Warm Standby, Failover, and Replication
- Focus: Read the sections on promotion, streaming replication, and failover triggers to connect the lesson's "safe prefix" idea to a concrete WAL-based system.
- [DOC] PostgreSQL Documentation: Continuous Archiving and Point-in-Time Recovery (PITR)
- Focus: Study how archived WAL extends recovery beyond a single live standby and why archive completeness matters when the promoted replica is missing the last acknowledged commits.
- [DOC] PostgreSQL Documentation: pg_rewind
- Focus: See why the old primary cannot simply reconnect after failover and how a divergent node is brought back under the new authoritative timeline.
- [DOC] Patroni Documentation: Replication Modes
- Focus: Compare asynchronous, synchronous, and quorum-style replication choices and map them to the RPO and failover trade-offs discussed in this runbook.
Key Insights
- Failover promotes a commit prefix, not a server - The new primary is authoritative only for the WAL and metadata it can prove are durable in the failover region.
- Fencing is part of correctness, not an operations afterthought - If the old primary can still accept writes, the system does not have a runbook; it has two competing timelines.
- Recovery finishes after reconciliation, not after traffic returns - Request idempotency, transactional outboxes, resumable maintenance jobs, and safe old-primary rebuilds are what turn a fast failover into a correct one.