Write-Ahead Logging and Crash Recovery Mechanics

LESSON

Database Engine Internals and Implementation

Lesson 005 · 30 min · Intermediate

Day 277: Write-Ahead Logging and Crash Recovery Mechanics

The core idea: a storage engine can acknowledge a write before all affected pages are in their final on-disk place, as long as the change is first recorded durably in a write-ahead log.


Today's "Aha!" Moment

The insight: Durability does not require "write the whole database page immediately." It requires a durable record of intent before the engine is allowed to flush data pages lazily or recover after a crash.

Why this matters: This is the missing piece after LSM trees and B-Trees. Both kinds of engines often keep recent work in memory first. Without a durable log, a crash would erase acknowledged writes.

Concrete anchor: A client updates an account balance and receives "commit succeeded." One second later the machine crashes. The question is simple and brutal: after restart, how does the engine know which committed writes must still exist?

The practical sentence to remember:
WAL makes commit durable first; data pages can catch up later.


Why This Matters

The problem: Writing full data pages to disk on every change is too expensive. Engines buffer data in memory, reorder writes, batch page flushes, and optimize layout over time. That improves performance, but it creates a recovery problem.

Without WAL:

    • Every commit must wait for all affected data pages to reach their final on-disk locations, or
    • The engine acknowledges early and silently loses committed writes on a crash.

With WAL:

    • Commit waits only for a small sequential log append plus a sync.
    • Data pages can be flushed lazily, batched, and reordered, because the log can redo anything that was lost.

Operational payoff: WAL is what lets engines be both fast and durable instead of forcing "sync every final page before every commit."


Learning Objectives

By the end of this lesson, you should be able to:

  1. Explain why write-ahead logging exists as the durability bridge between in-memory work and later page flushes.
  2. Describe the commit and recovery flow across log records, fsync, checkpoints, redo, and sometimes undo.
  3. Reason about operational trade-offs such as commit latency, recovery time, log growth, and checkpoint policy.

Core Concepts Explained

Concept 1: Why WAL Exists

Concrete example / mini-scenario: A database updates several records in memory, marks their pages dirty, and returns success to the client. The OS crashes before those pages are written back.

Intuition: Memory is fast, but memory is not durable. If the engine wants to acknowledge work before final page writes happen, it needs a smaller, append-friendly durable record that says what must survive the crash.

How it works mechanically:

  1. The engine changes in-memory state first, often in a buffer pool or memtable.
  2. It creates log records describing the change.
  3. Those log records are appended to a sequential log.
  4. Before commit is acknowledged, the relevant log records must be made durable.
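The ordering above can be sketched in a few lines. This is a toy illustration, not any particular engine's design: the record format, the `WriteAheadLog` class, and the `pages` dict standing in for a buffer pool are all assumptions made for the example.

```python
import os
import tempfile

# Minimal write-ahead log sketch (illustrative assumptions throughout).
class WriteAheadLog:
    def __init__(self, path):
        self.f = open(path, "ab")

    def append(self, record: str):
        # Steps 2-3: describe the change as a record, append it sequentially.
        self.f.write((record + "\n").encode())

    def force(self):
        # Step 4: push the log tail to stable storage before commit ack.
        self.f.flush()
        os.fsync(self.f.fileno())

pages = {}  # stand-in for the in-memory buffer pool / memtable
wal = WriteAheadLog(os.path.join(tempfile.mkdtemp(), "wal.log"))

def set_key(key, value):
    pages[key] = value                # step 1: mutate in-memory state
    wal.append(f"SET {key}={value}")  # steps 2-3: log the intent

set_key("balance:42", "100")
wal.force()  # only after this fsync may the client hear "committed"
# the dirty entry in `pages` can reach the data files much later
```

Note that the expensive call on the commit path is the `fsync`, not the write of any data page; that is exactly the cost shift WAL is designed to make.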

That ordering is the rule behind the name:

    • The log record describing a change must be durable before the changed data page is written in place, and before the commit is acknowledged. The log is written ahead of the data.

Why append logs are attractive:

    • Appends are sequential I/O, which is fast on both disks and SSDs.
    • Records from many transactions can be batched into a single sync (group commit).
    • A simple append-only format is cheap to write and easy to scan at recovery.

Connection to the previous lesson: LSM trees postpone layout work and rely on in-memory state plus later flushes and compaction. WAL is the protection that makes those deferred writes survivable.


Concept 2: Commit, Checkpoints, and Crash Recovery

Concrete example / mini-scenario: Transaction T1 commits. Some of its dirty pages are still only in memory. The process crashes before the next checkpoint.

Intuition: Recovery works because the log knows more than the final data files at crash time.

Typical commit path:

  1. A transaction modifies in-memory state.
  2. The engine appends log records describing those modifications.
  3. On commit, the commit record and required log tail are forced to stable storage.
  4. Only then can the client hear "committed."
  5. Dirty data pages may be flushed much later.

Typical recovery path:

  1. Find the most recent checkpoint in the log.
  2. Redo: replay log records from that point forward so every committed change is reflected again.
  3. Undo (in engines that need it): roll back changes from transactions that never committed.

Why checkpoints matter:

    • They record a known-good position so recovery does not replay the entire log from the beginning.
    • They bound restart time and allow old log segments to be truncated or archived.

Useful distinction:

    • Redo guarantees that committed work survives the crash.
    • Undo guarantees that uncommitted work does not leak into the recovered state.
Mental model:
Think of the WAL as the legally binding journal.
The data pages are the neatly organized filing cabinets.
After a fire, you rebuild the cabinets from the journal.
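A redo-only pass over that "journal" can be sketched with a toy log format. The `SET key=value` records and the bare `CHECKPOINT` marker are assumptions made for the example; real engines use binary records addressed by log sequence numbers.

```python
# Redo-style recovery sketch over a toy text log (illustrative only).
def recover(log_lines):
    # Start replay at the last checkpoint: everything before it is
    # assumed to already be reflected in the data files.
    start = 0
    for i, line in enumerate(log_lines):
        if line == "CHECKPOINT":
            start = i + 1
    pages = {}
    for line in log_lines[start:]:
        if line.startswith("SET "):
            key, value = line[4:].split("=", 1)
            pages[key] = value  # redo: reapply the logged intent
    return pages

log = ["SET a=1", "CHECKPOINT", "SET a=2", "SET b=9"]
print(recover(log))  # {'a': '2', 'b': '9'}
```

The crash timing never matters here: replaying the same records twice produces the same state, which is why redo records are designed to be idempotent.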


Concept 3: The Real Trade-offs in Production

Concrete example / mini-scenario: A team wants lower commit latency, shorter restart time, and minimal storage overhead at the same time.

Intuition: WAL is not "free durability." It shifts cost into log throughput, sync policy, checkpointing, and recovery engineering.

Main trade-offs:

  1. Commit latency vs durability policy

    • Synchronous log flush on commit gives stronger durability.
    • Looser policies improve latency but widen the window of acknowledged data loss on crash.
  2. Checkpoint frequency vs background I/O

    • Frequent checkpoints shorten crash recovery.
    • But they increase background write pressure.
  3. Log growth vs operational simplicity

    • Long log tails preserve history for recovery and replication.
    • But they consume disk and can slow restart if not managed.
  4. Foreground throughput vs restart time

    • Letting lots of dirty state accumulate can help steady-state throughput.
    • But it makes recovery more expensive when failure happens.
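Trade-offs 2 and 4 can be put into back-of-envelope numbers. The figures below are illustrative assumptions, not measurements from any real system:

```python
# Worst-case replay time after a crash, as a simple model:
# the crash lands just before a checkpoint, so the whole interval's
# log tail must be replayed at the engine's replay bandwidth.
def recovery_time_s(write_mb_per_s, checkpoint_interval_s, replay_mb_per_s):
    log_tail_mb = write_mb_per_s * checkpoint_interval_s
    return log_tail_mb / replay_mb_per_s

# 50 MB/s of log, a checkpoint every 5 minutes, replay at 200 MB/s:
print(recovery_time_s(50, 300, 600 - 400))  # 75.0 seconds of replay
```

Halving the checkpoint interval halves the worst-case replay time, at the cost of doubling checkpoint-driven background writes; that is the knob this trade-off turns.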

Where teams often go wrong:

    • Loosening the sync policy for latency without agreeing on how much acknowledged data loss is acceptable.
    • Treating recovery time as something to discover during the first real outage instead of a tested, budgeted target.
    • Letting log retention grow unbounded because nothing obviously breaks until the disk fills or a restart stalls.
The systems view: WAL is the first half of durability. The second half is whether recovery remains fast and correct when the log is large and the crash timing is inconvenient.


Troubleshooting

Issue: The application received success, but the last few writes disappeared after a crash.

Why it happens: The system likely acknowledged commit before the log tail was durably forced, or relied on a weaker sync policy than the team assumed.

Clarification / Fix: Check commit durability settings, fsync behavior, storage guarantees, and whether the application confused "replicated eventually" with "durable locally now."

Issue: Restart after crash takes far too long.

Why it happens: The engine may need to scan and replay a large log tail because checkpoints were infrequent or dirty state had grown too large.

Clarification / Fix: Review checkpoint cadence, log retention, and recovery metrics. Recovery-time objectives should be treated as a first-class performance target.

Issue: Storage usage for logs keeps growing.

Why it happens: Logs may be pinned by checkpoints, replicas, snapshots, or archival policies.

Clarification / Fix: Inspect retention dependencies instead of deleting log files blindly. A WAL that looks old may still be required by replication or recovery guarantees.

Issue: Write throughput drops during bursts.

Why it happens: The log device, fsync path, or group-commit window may be the real bottleneck, not the data pages themselves.

Clarification / Fix: Measure log append bandwidth, fsync latency, queue depth, and commit batching. The hot path is often the WAL device.
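The group-commit idea behind that batching can be sketched as follows. The class name, locking structure, and record format are assumptions for illustration; real engines track durability with log sequence numbers and use condition variables so appends can proceed while a flush is in flight.

```python
import os
import tempfile
import threading

# Group-commit sketch: committers piggyback on one fsync instead of
# paying one fsync each (illustrative, not a real engine's design).
class GroupCommitLog:
    def __init__(self, path):
        self.f = open(path, "ab")
        self.lock = threading.Lock()
        self.durable_upto = 0  # byte offset known to be on stable storage

    def append(self, record: bytes) -> int:
        with self.lock:
            self.f.write(record)
            return self.f.tell()  # this commit's end offset in the log

    def wait_durable(self, offset: int):
        with self.lock:
            if self.durable_upto >= offset:
                return  # an earlier fsync already covered this record
            self.f.flush()
            os.fsync(self.f.fileno())          # one fsync covers every
            self.durable_upto = self.f.tell()  # record appended so far

log = GroupCommitLog(os.path.join(tempfile.mkdtemp(), "wal.log"))
o1 = log.append(b"T1 commit\n")
o2 = log.append(b"T2 commit\n")
log.wait_durable(o2)  # one fsync makes both T1 and T2 durable
log.wait_durable(o1)  # no-op: T1's record was covered by that fsync
```

Under a burst, the later `wait_durable` calls return without touching the device at all, which is why commit throughput can far exceed the device's fsync rate.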


Advanced Connections

Connection 1: WAL <-> Buffer Pools and Dirty Pages

The parallel: Buffer pools let engines delay page writes. WAL is what makes that delay safe.

Why this matters: Once you understand WAL, deferred flushing stops looking risky and starts looking like a controlled performance technique.

Connection 2: WAL <-> Replication and Distributed Logs

The parallel: Many replicated systems also rely on ordered durable logs before state-machine application. Local crash recovery and distributed replication are different problems, but they often share the same pattern: durable ordered intent first, materialized state second.

Why this matters: This is why database internals, consensus logs, and streaming systems feel related. They all separate "record the decision durably" from "fully materialize the effect everywhere."



Key Insights

  1. Durability does not mean "flush final pages immediately" - it means there is a durable history that recovery can trust.
  2. WAL separates commit from page flush - that separation is what makes modern engines both fast and correct.
  3. Crash recovery is part of the design, not an afterthought - checkpointing, redo cost, and log retention are all part of the same durability story.
