Crash Recovery Walkthrough: Analysis, Redo, Undo

LESSON

Database Engine Internals and Implementation

Lesson 021 · 30 min · Advanced

Day 398: Crash Recovery Walkthrough: Analysis, Redo, Undo

The core idea: ARIES recovery does not "restore the last checkpoint." It rebuilds recovery state during analysis, repeats history during redo, and then removes uncommitted work during undo so the final database matches exactly what committed log records allow.

Today's "Aha!" Moment

In 13.md, Harbor Point learned that a checkpoint is only a recovery boundary plus a picture of which pages were dirty. That picture matters because, at 09:47, the primary database loses power in the middle of the market-open burst. One transaction has already committed a municipal-bond quote update, another transaction has inserted a secondary-index entry but has not committed yet, and several dirty pages are still only partially reflected on disk. After restart, the engine cannot trust memory, but it also cannot afford to guess.

The tempting mental model is "load the checkpoint and replay committed transactions." ARIES is stricter than that because the disk image after a crash is not a clean snapshot. Some committed updates may never have reached the data files; some uncommitted updates may already be present on disk because steal/no-force buffering allowed the engine to flush pages before commit. Recovery therefore has to answer three separate questions in order. Which transactions were still active when the crash happened? Which logged actions might be missing from disk? Which of those actions belong to losers and must be backed out?

That is why the three phases exist as a chain, not as optional extras. Analysis reconstructs the transaction table and dirty page table that memory lost. Redo repeats history from the earliest relevant recLSN, even for transactions that will later be undone, because recovery first needs the on-disk pages to reflect the exact pre-crash state. Undo then walks the loser transactions backward and logs compensation log records so recovery itself remains crash-safe. Once that sequence clicks, the transition to 15.md becomes natural: commit latency and fsync policy decide how much of this work is waiting for recovery after a crash.

Why This Matters

Harbor Point's on-call engineer does not get paged with the abstract complaint "recovery theory is confusing." They get paged because a database restarted, the process came back, and the trading desk wants to know whether quote corrections, risk flags, and settlement adjustments are actually consistent. If the engineer cannot explain what analysis, redo, and undo are doing, they also cannot tell whether a long recovery is expected, whether a repeated redo pass is harmless, or whether the system is stuck behind a pathological loser transaction.

The production stakes are high because steal/no-force buffering makes crash recovery the price of normal write performance. You let dirty pages reach disk before commit, and you let commit complete before data pages are forced, because that keeps the foreground path fast. In return, restart logic must be able to reconstruct the exact state implied by the WAL. If recovery is misunderstood, teams misread symptoms. They may blame the buffer pool when redo is the bottleneck, or blame redo when the real problem is an enormous undo chain created by one long-running transaction.

Framed that way, crash recovery is not a postscript to WAL design. It is the mechanism that makes the whole design safe. Checkpoints from the previous lesson limit how far recovery must search. The recovery phases in this lesson explain how the engine uses that checkpoint metadata. The next lesson will tighten the loop by showing how fsync and group commit change the durability boundary that recovery has to respect.

Learning Objectives

By the end of this session, you will be able to:

  1. Explain what analysis reconstructs after a crash - Describe how checkpoint metadata and WAL records rebuild the dirty page table and transaction table.
  2. Trace why redo repeats history instead of replaying only committed work - Use recLSN and page_lsn to decide which logged changes must be applied again.
  3. Explain how undo safely removes loser transactions - Show why compensation log records and backward traversal are necessary for crash-safe rollback.

Core Concepts Explained

Concept 1: Analysis rebuilds the recovery bookkeeping that disappeared with memory

Harbor Point's last completed checkpoint was taken two minutes before the outage. At that moment, the checkpoint recorded a fuzzy snapshot of active transactions and dirty pages, but transactions kept running afterward. When the database restarts, analysis begins from that checkpoint because it is the newest durable summary of the engine's in-flight state.

Assume the end-checkpoint record says the dirty page table contains P17 -> recLSN 880 and P21 -> recLSN 892, while the transaction table says T41 is active with lastLSN = 905. The WAL after the checkpoint then contains these records:

LSN 910  T41 update P21
LSN 918  T42 begin
LSN 924  T42 update P44
LSN 930  T41 commit
LSN 934  T41 end
LSN 938  T42 update P17
LSN 940  crash

Analysis walks forward through those log records and recreates the runtime tables the buffer manager and lock manager had before the crash. T41 moves from active to committed and then disappears once the end record is seen. T42 is discovered as an active loser because it has updates but no commit record. P44 is added to the dirty page table with recLSN 924, while P17 stays dirty with its older recLSN 880 because that older value still bounds redo.
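The forward walk above can be sketched in code. This is a hypothetical, simplified model of the analysis pass; the record shapes and field names are illustrative, not taken from any real engine.

```python
# Checkpoint state, as described above: page -> recLSN, txn -> status/lastLSN.
CHECKPOINT_DPT = {"P17": 880, "P21": 892}
CHECKPOINT_TT = {"T41": {"status": "active", "lastLSN": 905}}

# WAL records after the checkpoint: (lsn, txn, kind, page).
LOG_AFTER_CHECKPOINT = [
    (910, "T41", "update", "P21"),
    (918, "T42", "begin", None),
    (924, "T42", "update", "P44"),
    (930, "T41", "commit", None),
    (934, "T41", "end", None),
    (938, "T42", "update", "P17"),
]

def analysis(dpt, tt, log):
    dpt = dict(dpt)
    tt = {t: dict(info) for t, info in tt.items()}
    for lsn, txn, kind, page in log:
        if kind == "begin":
            tt[txn] = {"status": "active", "lastLSN": lsn}
        elif kind == "update":
            tt[txn]["lastLSN"] = lsn
            # Only the FIRST LSN that may have dirtied the page matters:
            # an existing, older recLSN still bounds redo, so keep it.
            dpt.setdefault(page, lsn)
        elif kind == "commit":
            tt[txn]["status"] = "committed"
        elif kind == "end":
            del tt[txn]  # transaction fully finished; drop it from the table
    losers = [t for t, info in tt.items() if info["status"] == "active"]
    return dpt, tt, losers

dpt, tt, losers = analysis(CHECKPOINT_DPT, CHECKPOINT_TT, LOG_AFTER_CHECKPOINT)
print(dpt)     # P17 keeps recLSN 880; P44 enters at 924
print(losers)  # T42 is the only loser
```

Note how `setdefault` encodes the rule from the text: P17's update at LSN 938 does not overwrite its older recLSN 880, because that older value still bounds where redo must start.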

The important point is that analysis is not yet fixing data pages. It is reconstructing the questions recovery must answer next: where redo should start, which pages might be stale on disk, and which transactions will eventually need undo. The trade-off is extra log metadata and checkpoint complexity in exchange for bounded restart logic. Without that bookkeeping, the engine would have to make much weaker guesses after every crash.

Concept 2: Redo repeats history from the oldest relevant dirty page, including loser updates

Once analysis has rebuilt the dirty page table, redo starts from the smallest recLSN still present. In Harbor Point's case that is 880 because page P17 has been dirty since before the checkpoint. This often surprises people: redo does not start at the first loser transaction, and it does not start at the checkpoint LSN blindly. It starts at the earliest log record that could still matter for any page that may be stale on disk.

Redo then scans forward and asks, record by record, whether the logged effect is already reflected on the target page. ARIES uses the page's durable page_lsn as the test. If the on-disk page_lsn for P21 is already 910 or newer, the engine can skip reapplying T41's update because that page image already contains the change. If P44 is still at page_lsn 900, redo must reapply the LSN 924 update even though T42 never committed. At this stage recovery is trying to reconstruct the exact state of the database at the instant before the crash, not the final committed state after rollback.
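The per-record redo test can be sketched as follows. This is a simplified illustration under assumed on-disk page_lsn values consistent with the narrative above, not a real engine's redo loop.

```python
# Dirty page table from analysis (page -> recLSN) and the durable page_lsn
# of each page as found on disk after the crash (values assumed for the sketch).
dirty_page_table = {"P17": 880, "P21": 892, "P44": 924}
on_disk_page_lsn = {"P17": 870, "P21": 910, "P44": 900}

# Update records in forward order: (lsn, page).
log = [
    (910, "P21"),  # T41 update
    (924, "P44"),  # T42 update -- a loser, but still redone
    (938, "P17"),  # T42 update
]

redo_start = min(dirty_page_table.values())  # 880: the oldest recLSN

applied = []
for lsn, page in log:
    if lsn < redo_start:
        continue                          # cannot matter for any dirty page
    rec_lsn = dirty_page_table.get(page)
    if rec_lsn is None or lsn < rec_lsn:
        continue                          # page was flushed after this update
    if on_disk_page_lsn[page] >= lsn:
        continue                          # effect already on the page image
    applied.append(lsn)                   # reapply the change...
    on_disk_page_lsn[page] = lsn          # ...and advance the page_lsn

print(applied)  # LSN 910 is skipped (P21 already at page_lsn 910)
```

With these assumed values, redo skips LSN 910 because P21's page image already contains it, but reapplies 924 and 938 even though T42 is a loser, exactly as the text describes.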

That "repeat history" rule is what makes the later undo phase reliable. Suppose T42 inserted a row into a secondary index and also changed the base table. If redo applied only committed work, the engine might reach a physically inconsistent state in which one page reflects the loser transaction and the other does not. Repeating history first ensures that every page is brought to the same logical moment in time before rollback begins.

The cost is that recovery may intentionally redo work it already knows will be undone. That sounds wasteful until you compare it with the alternative: bespoke reasoning about every possible partially flushed page combination. ARIES pays some extra scan and replay work so correctness reduces to a small set of idempotent tests based on recLSN and page_lsn.

Concept 3: Undo walks loser transactions backward and logs the rollback itself

After redo, Harbor Point's pages now reflect the exact pre-crash history, including T42's uncommitted changes. Undo can finally remove that work. Recovery takes the loser transactions from the analysis phase and processes them in reverse log order, typically by keeping a priority queue keyed by each transaction's lastLSN. In the example above, T42 begins undo at LSN 938, then continues to LSN 924, and stops when it reaches the begin record.

For each undoable record, ARIES does two things. It applies the inverse action to the page, and it writes a compensation log record (CLR) describing that rollback step. A simplified sketch looks like this:

redo finished
undo T42 at LSN 938 -> write CLR 946 (undoNextLSN = 924)
undo T42 at LSN 924 -> write CLR 952 (undoNextLSN = 918)
write T42 end at 956

The CLR is not bookkeeping decoration. It is what makes recovery itself crash-safe. If the database crashes again after writing CLR 946 but before finishing T42, the next restart will see that part of the undo already happened and will continue from undoNextLSN = 924 rather than undoing the same action twice. In other words, redo can safely replay CLRs too, because they are just more durable evidence about the state transition history.
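The backward walk with CLRs can be sketched in code. The LSN allocator and record shapes below are simplified stand-ins chosen to reproduce the example's numbers; a real engine derives LSNs from byte offsets and applies real inverse changes.

```python
BEGIN_LSN = 918                        # T42's begin record: undo stops here
prev_lsn = {938: 924, 924: BEGIN_LSN}  # backward chain of T42's updates

def undo_loser(txn, last_lsn, lsn_alloc):
    """Roll back one loser transaction, logging a CLR per undone update."""
    wal = []
    lsn = last_lsn
    while lsn != BEGIN_LSN:
        undo_next = prev_lsn[lsn]
        # Apply the inverse change to the page (elided here), then durably
        # log a CLR. If a second crash happens now, the next recovery pass
        # resumes from undoNextLSN rather than undoing this step again.
        wal.append(("CLR", next(lsn_alloc), txn, {"undoNextLSN": undo_next}))
        lsn = undo_next
    wal.append(("END", next(lsn_alloc), txn))
    return wal

records = undo_loser("T42", last_lsn=938, lsn_alloc=iter([946, 952, 956]))
for rec in records:
    print(rec)
# ('CLR', 946, 'T42', {'undoNextLSN': 924})
# ('CLR', 952, 'T42', {'undoNextLSN': 918})
# ('END', 956, 'T42')
```

Resuming after a mid-undo crash is the same loop started at the last CLR's undoNextLSN instead of the transaction's lastLSN, which is why the CLR chain makes rollback idempotent.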

Operationally, undo is where long-running loser transactions become painful. A single transaction that touched thousands of hot pages can dominate restart time even when redo is short. That is why production teams watch not only WAL volume and checkpoint age, but also the size and age of active transactions. The trade-off behind steal/no-force buffering shows up here in full: faster commits during normal operation mean recovery must sometimes pay the rollback bill later.

Troubleshooting

Issue: Recovery logs show redo applying records from a transaction that never committed, and the team assumes the engine is corrupting data.

Why it happens / is confusing: Redo's job is to repeat history up to the crash point, not to produce the final committed state by itself. Loser updates may need to be restored on disk temporarily so undo can remove them from a physically consistent page set.

Clarification / Fix: Check whether undo is still in progress and whether the suspect records belong to transactions marked as losers during analysis. If so, the behavior is expected.

Issue: Restart time is much longer than checkpoint frequency suggests.

Why it happens / is confusing: A recent checkpoint only limits where analysis begins. Redo still starts from the oldest dirty recLSN, and undo can still be dominated by one large active transaction that survived until crash time.

Clarification / Fix: Inspect the dirty page table age and the loser transaction list separately. If redo is short but undo is huge, the operational fix is transaction design and timeout discipline, not merely more frequent checkpoints.

Issue: The database crashes again during recovery, and operators worry that rollback work will be duplicated or skipped.

Why it happens / is confusing: Recovery is doing real writes, so a second crash feels like it invalidates the whole process.

Clarification / Fix: Verify that compensation log records are being written and flushed according to the engine's WAL rules. CLRs are what let the next recovery pass resume undo safely instead of starting from scratch.

Advanced Connections

Connection 1: ARIES redo ↔ replicated log catch-up

A Raft follower catching up after downtime also replays a durable log to reconstruct state it missed while offline. The difference is that ARIES can skip some records using each page's page_lsn, while a follower usually reasons in terms of log index and committed prefix. Both systems rely on durable history plus idempotent replay tests rather than trusting volatile memory.

Connection 2: ARIES undo ↔ filesystem journaling and repair logs

Filesystems such as ext3 and ext4 journal metadata so crash recovery can safely replay or complete interrupted operations, but ARIES goes further by tracking loser transactions explicitly and logging compensation steps. The common pattern is that repair work must itself be durable; otherwise a crash during repair creates a second, harder-to-explain inconsistency.


Key Insights

  1. Analysis recreates recovery state, not user data - It rebuilds the dirty page table and transaction table so the later phases know where to start and what must be undone.
  2. Redo repeats history before it enforces commit status - Replaying loser updates is necessary because the disk image after a crash is a partial, page-by-page snapshot.
  3. Undo must be logged to be trustworthy - Compensation log records make rollback idempotent and allow recovery to survive a second crash mid-rollback.
PREVIOUS: Checkpointing and Dirty Page Control · NEXT: fsync, Group Commit, and Durable Latency
