LESSON
Day 408: Recovery Correctness and Idempotent Redo
The core idea: Once the WAL-first rule from 07.md guarantees the log is durable before any dirty page it describes reaches disk, crash recovery is correct only if redo can be repeated safely after any number of restarts and undo can resume without losing track of unfinished work.
Today's "Aha!" Moment
At Harbor Point, trader R1 committed a 0.5M reservation for MUNI-77 and received success from the database. Trader R2 started a second reservation against the same issuer, dirtied a heap page and an index page, and then the primary lost power before R2 committed. When the node comes back, the storage engine is staring at a mixed world: some committed effects exist only in WAL, some uncommitted effects may already have reached disk because the engine uses steal, and the crash might happen again in the middle of recovery itself.
That last part is the mental shift. Recovery is not a one-shot cleanup script that gets to run in perfect conditions. It is another crashable execution path. If redo simply "replays everything after the checkpoint," the same log record could be applied twice after repeated restarts and a reservation could be counted twice. If undo only keeps in-memory progress, the second crash can strand half-undone work and force the engine to guess which effects still need to be reversed.
ARIES-style recovery solves this by turning recovery into a set of durable invariants. Redo repeats history, but it is idempotent because each page says how far into the WAL it already reflects through pageLSN. Undo is bounded because compensation log records, or CLRs, make the progress of rollback itself durable. The lesson is not "recovery is complicated." It is that recovery is only trustworthy when every recovery action can survive being interrupted and started again.
The common misconception is that redo means "blindly write every logged change again." Real engines do not do that. They use WAL ordering from 07.md, page-local LSN checks, and undo metadata so restart after restart converges on the same committed state instead of drifting further away from it.
Why This Matters
Harbor Point opens into a burst of reservations, checkpoint pressure, and dirty pages. In that environment, the engine cannot afford a force policy that flushes every touched page on every commit, so it relies on steal/no-force: committed changes may still be only in WAL at crash time, and uncommitted changes may already be on disk. That design keeps normal-path latency reasonable, but it means crash recovery is not optional bookkeeping. It is the mechanism that decides whether the post-crash book matches the promises already made to traders.
If recovery correctness is weak, the failure modes are severe and confusing. A committed reservation may disappear because redo skipped too much. An aborted reservation may survive because undo lost its place. Worse, the first recovery attempt can appear to work, while the second crash during recovery exposes double-applied updates or an endless rollback loop. Those are production failures that often surface only under the worst operational timing.
When recovery is designed correctly, Harbor Point can tolerate ugly restart sequences without inventing new history. Winners stay committed, losers are removed, and a second crash during restart only lengthens recovery time; it does not change the final answer. That is the real production benefit of idempotent redo.
Learning Objectives
By the end of this session, you will be able to:
- Explain what recovery correctness guarantees after a crash - Distinguish committed effects that must survive from uncommitted effects that must be removed in a steal/no-force engine.
- Trace why redo is idempotent instead of merely repeatable - Use LSN and pageLSN to decide when a redo record should be applied or skipped after one or more crashes.
- Reason about bounded undo in production terms - Explain why CLRs and undo chains are necessary when recovery itself can be interrupted.
Core Concepts Explained
Concept 1: Recovery correctness starts by reconstructing which work is durable and which work is merely present
At the moment of the Harbor Point crash, the disk is allowed to be inconsistent in a very specific way. The WAL may contain R1's update and commit record even if the heap page for the reservation is still old on disk. Meanwhile, the buffer manager may have flushed one of R2's dirty pages early to relieve memory pressure even though R2 never committed. That combination sounds alarming until you remember that steal/no-force is deliberately shifting complexity away from the commit path and into recovery.
The first recovery question is therefore not "which bytes are on disk?" but "which transactions had reached a durable commit point, and which pages might still need work?" In ARIES terms, analysis walks the WAL from the last checkpoint, rebuilds the transaction table, and reconstructs a dirty page table. For Harbor Point, analysis concludes that R1 is a winner because its commit record is durable, while R2 is a loser because the crash happened first.
You can picture the pre-crash history like this:
LSN 9810240 UPDATE heap page 842 reserve 0.5M MUNI-77 for R1
LSN 9810248 UPDATE index page 119 add open-reservation entry for R1
LSN 9810312 COMMIT R1
LSN 9810400 UPDATE heap page 842 provisional reserve 0.3M MUNI-77 for R2
LSN 9810416 UPDATE index page 119 provisional entry for R2
CRASH
Recovery correctness means the post-restart database must contain both of R1's effects and none of R2's effects, regardless of which dirty pages happened to reach disk before the crash. That is why analysis needs both the durable transaction state and the earliest LSN from which redo must start. Without that bookkeeping, the engine would either redo too little and lose committed work or redo too much without a plan to clean up losers afterward.
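The analysis bookkeeping described above can be sketched as a single pass over the log. This is a simplified illustration, not a real engine's format: the dictionary-shaped records, the field names, and the assumption that the scan begins at the last checkpoint are all stand-ins chosen to make the winner/loser split and the redo start point concrete.

```python
# Sketch of an ARIES-style analysis pass over the post-checkpoint WAL.
# Record shapes (type, lsn, txn, page) are illustrative, not a real format.

def analysis(log_records):
    """Classify transactions and find where redo must start."""
    txn_table = {}    # txn -> last seen status ("active" or "committed")
    dirty_pages = {}  # page -> recLSN: earliest LSN that may not be on disk
    for rec in log_records:
        if rec["type"] == "UPDATE":
            txn_table[rec["txn"]] = "active"
            # First post-checkpoint update to a page fixes its recLSN.
            dirty_pages.setdefault(rec["page"], rec["lsn"])
        elif rec["type"] == "COMMIT":
            txn_table[rec["txn"]] = "committed"
    winners = {t for t, s in txn_table.items() if s == "committed"}
    losers = set(txn_table) - winners
    # Redo must start at the oldest recLSN among potentially dirty pages.
    redo_start = min(dirty_pages.values()) if dirty_pages else None
    return winners, losers, redo_start

# Harbor Point's pre-crash history from the listing above:
log = [
    {"type": "UPDATE", "lsn": 9810240, "txn": "R1", "page": 842},
    {"type": "UPDATE", "lsn": 9810248, "txn": "R1", "page": 119},
    {"type": "COMMIT", "lsn": 9810312, "txn": "R1"},
    {"type": "UPDATE", "lsn": 9810400, "txn": "R2", "page": 842},
    {"type": "UPDATE", "lsn": 9810416, "txn": "R2", "page": 119},
]
winners, losers, redo_start = analysis(log)
assert winners == {"R1"} and losers == {"R2"} and redo_start == 9810240
```

Note that analysis never reads a data page: the winner/loser verdict and the redo start point come entirely from durable WAL, which is why they survive a second crash unchanged.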
The trade-off is straightforward: steal/no-force improves steady-state throughput and commit latency, but it forces the engine to maintain enough recovery metadata to distinguish "durably committed," "written early," and "still needs undo." Fast commit is purchased with a more disciplined restart path.
Concept 2: Idempotent redo works because each page can prove whether it already absorbed a log record
Redo in ARIES repeats history, including updates from loser transactions, before undo removes the losers. That sounds wasteful until you look at why it works. The redo pass is not trying to reason transaction by transaction about the final answer. It is trying to restore pages to the exact state implied by the WAL prefix that existed at crash time. Only after that stable page state is rebuilt does undo walk backward through loser transactions.
The reason redo can be repeated safely is the pageLSN contract from 07.md. For every redo record targeting a page, recovery compares the record's LSN with the pageLSN stored on the page image currently on disk. If the page already has pageLSN >= redo_lsn, that change is already present and redo skips it. If the page is behind, recovery reapplies the change and advances the page's pageLSN.
For Harbor Point's heap page 842, the decision is local and mechanical:
redo record: LSN 9810400 -> provisional reserve for R2 on page 842
if disk.pageLSN >= 9810400:
skip
else:
apply update
set pageLSN = 9810400
Now imagine the database crashes again halfway through recovery. On the first restart, redo already restored page 842 and flushed it with pageLSN = 9810400, but recovery had not yet reached page 119. On the second restart, the engine reads the same WAL again. Page 842 now proves that LSN 9810400 is already reflected, so the record is skipped. Page 119 may still need its redo. That is idempotence in the storage-engine sense: recovery can repeat the same scan and converge instead of amplifying the change.
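That convergence can be demonstrated with a toy redo loop. The page representation and crash simulation below are hypothetical, chosen only to show the pageLSN skip rule; a real engine compares against the LSN stored in the on-disk page header.

```python
# Toy demonstration that the pageLSN check makes redo idempotent.
# Page state and record shapes are illustrative, not a real engine's layout.

def redo(pages, log_records):
    """Apply each redo record at most once per page, using pageLSN."""
    for rec in log_records:
        page = pages[rec["page"]]
        if page["pageLSN"] >= rec["lsn"]:
            continue                          # already reflected: skip
        page["data"].append(rec["change"])    # reapply the logged change
        page["pageLSN"] = rec["lsn"]          # advance the page's proof

log = [
    {"lsn": 9810400, "page": 842, "change": "provisional reserve R2"},
    {"lsn": 9810416, "page": 119, "change": "provisional entry R2"},
]

# Simulate the state after a crash *during* the first recovery: page 842 was
# restored and flushed with pageLSN 9810400, but page 119 was never reached.
pages = {
    842: {"pageLSN": 9810400, "data": ["provisional reserve R2"]},
    119: {"pageLSN": 9810248, "data": []},
}

# The second restart re-scans the same WAL: page 842 is skipped, page 119
# catches up. A third scan changes nothing, because redo has converged.
redo(pages, log)
redo(pages, log)
assert pages[842]["data"] == ["provisional reserve R2"]
assert pages[119]["data"] == ["provisional entry R2"]
```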
This page-local rule is why WAL correctness has to cover every page family, not just the heap. If Harbor Point logs the heap insertion for R1 correctly but misses the index-page update or fails to maintain pageLSN on a split index page, a repeated restart can leave the base table and index disagreeing about which reservations exist. Idempotent redo is not a slogan; it is a per-page proof obligation.
Concept 3: Undo must be bounded so recovery can make forward progress even when rollback is interrupted
After redo, Harbor Point's pages reflect the full pre-crash history, including R2's provisional reservation and index entry. Recovery is still not complete, because losers must be undone. The naive rollback plan would be "walk backward through R2's log records, reverse each change, and hope the machine stays up." That plan fails under repeated crashes because the engine can lose track of which undo actions were already completed.
CLRs solve that by logging the fact that an undo step happened. When recovery removes R2's index entry from page 119, it emits a CLR describing that compensation and pointing to the next older log record that still needs undo. If the machine crashes immediately afterward, the next recovery run sees the CLR in WAL and knows not to re-undo the index entry blindly; it resumes from the recorded undoNextLSN. In other words, rollback progress becomes durable just like forward progress.
For Harbor Point, a bounded undo chain might look like this:
LSN 9810508 CLR for undo of R2 index entry on page 119, undoNextLSN=9810400
LSN 9810520 CLR for undo of R2 heap reservation on page 842, undoNextLSN=null
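The resume rule can be sketched in a few lines. This is a minimal illustration under stated assumptions: the record shapes, the prevLSN chaining, and the alloc_lsn callback are hypothetical, and a real engine would also apply the compensating change to the page before logging the CLR.

```python
# Sketch of bounded undo: every undo action appends a CLR whose undoNextLSN
# points at the next older record still needing undo. Record shapes and the
# alloc_lsn callback are hypothetical, chosen only to show the resume rule.

def undo_loser(loser_records, wal, alloc_lsn):
    """Roll back one loser, resuming from any CLRs already durable in WAL."""
    clrs = [r for r in wal if r["type"] == "CLR"]
    # Resume point: the newest durable CLR records what is left to undo;
    # with no CLRs, start from the loser's newest forward record.
    frontier = clrs[-1]["undoNextLSN"] if clrs else loser_records[-1]["lsn"]
    by_lsn = {r["lsn"]: r for r in loser_records}
    while frontier is not None:
        rec = by_lsn[frontier]
        wal.append({"type": "CLR", "lsn": alloc_lsn(),
                    "undoes": rec["lsn"], "undoNextLSN": rec["prevLSN"]})
        frontier = rec["prevLSN"]

# R2's forward records, newest last; prevLSN chains them per transaction.
r2 = [
    {"lsn": 9810400, "prevLSN": None},     # heap reservation on page 842
    {"lsn": 9810416, "prevLSN": 9810400},  # index entry on page 119
]

# Simulate a crash after the first CLR became durable: the next recovery run
# picks up from undoNextLSN=9810400 instead of re-undoing the index entry.
wal = [{"type": "CLR", "lsn": 9810508, "undoes": 9810416,
        "undoNextLSN": 9810400}]
lsns = iter([9810520])
undo_loser(r2, wal, lambda: next(lsns))
assert [c["undoes"] for c in wal] == [9810416, 9810400]
```

Because the frontier is read back from durable CLRs rather than from memory, every restart strictly shrinks the remaining undo set, which is exactly the bounded-progress guarantee the prose describes.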
The operational payoff is subtle but important. During a bad morning, Harbor Point may crash once from the original power event and again because recovery saturates I/O on a fragile node. With CLRs, each restart continues shrinking the remaining undo set. Without them, the same loser transaction can force recovery to revisit the same rollback work again and again, stretching restart time unpredictably or even leaving operators unsure whether progress is real.
The trade-off is more WAL traffic and more recovery metadata. Every undo action now generates its own durable record. Production engines accept that cost because "rollback eventually finishes, even if recovery crashes" is a far stronger guarantee than "rollback is simple when nothing else goes wrong." The same discipline becomes even more important once a single transaction updates both data pages and secondary index pages, which is where 09.md goes next.
Troubleshooting
Issue: A committed change is missing after crash recovery, even though the application received commit success.
Why it happens / is confusing: Engineers often inspect the data file and assume the page should have been flushed before commit. In a no-force engine, the durable guarantee comes from WAL, so the page image may legitimately be stale at crash time.
Clarification / Fix: Verify that the commit record and required redo records were present in durable WAL and that recovery started redo early enough. Missing committed effects usually point to a bad redo start point or an incorrect page-level skip check, not to the lack of an immediate data-page flush.
Issue: Crash drills appear to "replay the same WAL twice," and the team worries that redo is duplicating rows.
Why it happens / is confusing: Recovery may scan the same WAL records on every restart, especially if the first recovery itself crashes. That is normal. The correctness question is whether page-level checks make repeated scans harmless.
Clarification / Fix: Inspect pageLSN behavior on the affected heap and index pages. If pageLSN advances when redo applies and later restarts skip records whose LSN is already reflected, repeated redo is working as designed.
Issue: Recovery time grows sharply when the crash happened during rollback of a large transaction.
Why it happens / is confusing: Undo is doing real work and can itself be interrupted. If compensation records are missing or the undo chain is hard to resume, every restart reopens old rollback work instead of continuing from the remaining frontier.
Clarification / Fix: Check whether CLRs are being generated for undo actions and whether undoNextLSN chains are intact. Bounded undo depends on making rollback progress durable.
Issue: The base table is correct after restart, but a secondary index still contains stale entries.
Why it happens / is confusing: Teams sometimes validate WAL and redo behavior on heap pages only. Secondary indexes have their own page images, split records, and pageLSN values, so they need the same recovery discipline.
Clarification / Fix: Audit WAL coverage for every page mutation involved in index maintenance, including structural changes such as page splits. Recovery correctness is only as strong as the weakest page type it logs.
Advanced Connections
Connection 1: The WAL ordering rule from 07.md is the prerequisite for idempotent redo
Yesterday's lesson established that a dirty page cannot legally reach disk before the WAL that explains it is durable. Today's lesson depends on that rule. pageLSN only means something if the log prefix up to that LSN is guaranteed to exist on stable storage. Without WAL-first ordering, redo would compare pages to a timeline that might be missing the records needed to reconstruct them.
Connection 2: Database redo and replicated-log replay share the same need for repeatable apply
State-machine replication systems such as Raft also replay a log after restart, but they usually reason in terms of a single applied index for the whole state machine. Storage engines need a more granular form because persistence is page-oriented and partially flushed. pageLSN is the page-local analogue of "last applied log index": it lets Harbor Point skip already-applied history on one page while continuing replay on another page that is still behind.
Resources
Optional Deepening Resources
- [PAPER] ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging
- Focus: The original description of analysis, redo, undo, compensation log records, and why repeating history makes recovery correct.
- [DOC] PostgreSQL Documentation: Reliability and the Write-Ahead Log
- Focus: How a production engine relies on WAL durability first and later replays changes after a crash.
- [DOC] PostgreSQL Documentation: Database Page Layout
- Focus: The page-header fields, including the page LSN, that let recovery decide whether a page is already up to date.
- [DOC] MySQL Reference Manual: InnoDB Recovery
- Focus: A second production implementation of redo and undo recovery, useful for comparing the same invariants across engines.
Key Insights
- Recovery correctness is about committed history, not raw disk contents - In a steal/no-force engine, pages on disk can contain too little or too much until recovery reconciles them with WAL.
- Redo is idempotent because the page participates in the proof - pageLSN lets recovery ask a local yes-or-no question for each page instead of guessing globally whether a log record already took effect.
- Undo must log its own progress - CLRs turn rollback from a fragile in-memory procedure into a crash-safe process that eventually finishes.