Write-Ahead Logging and Crash Recovery Mechanics

LESSON

Database Engine Internals and Implementation

Lesson 005 · 30 min · Intermediate

Day 277: Write-Ahead Logging and Crash Recovery Mechanics

The core idea: a storage engine can acknowledge a write before all affected pages are in their final on-disk place, as long as the change is first recorded durably in a write-ahead log.


Today's "Aha!" Moment

The insight: Durability does not require "write the whole database page immediately." It requires a durable record of intent before the engine is allowed to flush data pages lazily or recover after a crash.

Why this matters: This is the missing piece after LSM trees and B-Trees. Both kinds of engines often keep recent work in memory first. Without a durable log, a crash would erase acknowledged writes.

Concrete anchor: A client updates an account balance and receives "commit succeeded." One second later the machine crashes. The question is simple and brutal: after restart, how does the engine know which committed writes must still exist?

The practical sentence to remember:
WAL makes commit durable first; data pages can catch up later.


Why This Matters

The problem: Writing full data pages to disk on every change is too expensive. Engines buffer data in memory, reorder writes, batch page flushes, and optimize layout over time. That improves performance, but it creates a recovery problem.

Without WAL:

    • Every commit must wait for all affected data pages to reach their final on-disk locations, or
    • The engine acknowledges early and silently loses committed writes on a crash.

With WAL:

    • Commit waits only for a small sequential log append plus a sync.
    • Data pages can be flushed lazily, batched, and reordered, because the log can redo anything that was lost.

Operational payoff: WAL is what lets engines be both fast and durable instead of forcing "sync every final page before every commit."


Learning Objectives

By the end of this lesson, you should be able to:

  1. Explain why write-ahead logging exists as the durability bridge between in-memory work and later page flushes.
  2. Describe the commit and recovery flow across log records, fsync, checkpoints, redo, and sometimes undo.
  3. Reason about operational trade-offs such as commit latency, recovery time, log growth, and checkpoint policy.

Core Concepts Explained

Concept 1: Why WAL Exists

Concrete example / mini-scenario: A database updates several records in memory, marks their pages dirty, and returns success to the client. The OS crashes before those pages are written back.

Intuition: Memory is fast, but memory is not durable. If the engine wants to acknowledge work before final page writes happen, it needs a smaller, append-friendly durable record that says what must survive the crash.

How it works mechanically:

  1. The engine changes in-memory state first, often in a buffer pool or memtable.
  2. It creates log records describing the change.
  3. Those log records are appended to a sequential log.
  4. Before commit is acknowledged, the relevant log records must be made durable.
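The ordering above can be sketched in a few lines. This is a toy illustration, not any particular engine's design: the record format, the `WriteAheadLog` class, and the `pages` dict standing in for a buffer pool are all assumptions made for the example.

```python
import os
import tempfile

# Minimal write-ahead log sketch (illustrative assumptions throughout).
class WriteAheadLog:
    def __init__(self, path):
        self.f = open(path, "ab")

    def append(self, record: str):
        # Steps 2-3: describe the change as a record, append it sequentially.
        self.f.write((record + "\n").encode())

    def force(self):
        # Step 4: push the log tail to stable storage before commit ack.
        self.f.flush()
        os.fsync(self.f.fileno())

pages = {}  # stand-in for the in-memory buffer pool / memtable
wal = WriteAheadLog(os.path.join(tempfile.mkdtemp(), "wal.log"))

def set_key(key, value):
    pages[key] = value                # step 1: mutate in-memory state
    wal.append(f"SET {key}={value}")  # steps 2-3: log the intent

set_key("balance:42", "100")
wal.force()  # only after this fsync may the client hear "committed"
# the dirty entry in `pages` can reach the data files much later
```

Note that the expensive call on the commit path is the `fsync`, not the write of any data page; that is exactly the cost shift WAL is designed to make.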

That ordering is the rule behind the name:

    • The log record describing a change must be durable before the changed data page is written in place, and before the commit is acknowledged. The log is written ahead of the data.

Why append logs are attractive:

    • Appends are sequential I/O, which is fast on both disks and SSDs.
    • Records from many transactions can be batched into a single sync (group commit).
    • A simple append-only format is cheap to write and easy to scan at recovery.

Connection to the previous lesson: LSM trees postpone layout work and rely on in-memory state plus later flushes and compaction. WAL is the protection that makes those deferred writes survivable.


Concept 2: Commit, Checkpoints, and Crash Recovery

Concrete example / mini-scenario: Transaction T1 commits. Some of its dirty pages are still only in memory. The process crashes before the next checkpoint.

Intuition: Recovery works because the log knows more than the final data files at crash time.

Typical commit path:

  1. A transaction modifies in-memory state.
  2. The engine appends log records describing those modifications.
  3. On commit, the commit record and required log tail are forced to stable storage.
  4. Only then can the client hear "committed."
  5. Dirty data pages may be flushed much later.

Typical recovery path:

  1. Find the most recent checkpoint in the log.
  2. Redo: replay log records from that point forward so every committed change is reflected again.
  3. Undo (in engines that need it): roll back changes from transactions that never committed.

Why checkpoints matter:

    • They record a known-good position so recovery does not replay the entire log from the beginning.
    • They bound restart time and allow old log segments to be truncated or archived.

Useful distinction:

    • Redo guarantees that committed work survives the crash.
    • Undo guarantees that uncommitted work does not leak into the recovered state.
Mental model:
Think of the WAL as the legally binding journal.
The data pages are the neatly organized filing cabinets.
After a fire, you rebuild the cabinets from the journal.
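A redo-only pass over that "journal" can be sketched with a toy log format. The `SET key=value` records and the bare `CHECKPOINT` marker are assumptions made for the example; real engines use binary records addressed by log sequence numbers.

```python
# Redo-style recovery sketch over a toy text log (illustrative only).
def recover(log_lines):
    # Start replay at the last checkpoint: everything before it is
    # assumed to already be reflected in the data files.
    start = 0
    for i, line in enumerate(log_lines):
        if line == "CHECKPOINT":
            start = i + 1
    pages = {}
    for line in log_lines[start:]:
        if line.startswith("SET "):
            key, value = line[4:].split("=", 1)
            pages[key] = value  # redo: reapply the logged intent
    return pages

log = ["SET a=1", "CHECKPOINT", "SET a=2", "SET b=9"]
print(recover(log))  # {'a': '2', 'b': '9'}
```

The crash timing never matters here: replaying the same records twice produces the same state, which is why redo records are designed to be idempotent.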


Concept 3: The Real Trade-offs in Production

Concrete example / mini-scenario: A team wants lower commit latency, shorter restart time, and minimal storage overhead at the same time.

Intuition: WAL is not "free durability." It shifts cost into log throughput, sync policy, checkpointing, and recovery engineering.

Main trade-offs:

  1. Commit latency vs durability policy

    • Synchronous log flush on commit gives stronger durability.
    • Looser policies improve latency but widen the window of acknowledged data loss on crash.
  2. Checkpoint frequency vs background I/O

    • Frequent checkpoints shorten crash recovery.
    • But they increase background write pressure.
  3. Log growth vs operational simplicity

    • Long log tails preserve history for recovery and replication.
    • But they consume disk and can slow restart if not managed.
  4. Foreground throughput vs restart time

    • Letting lots of dirty state accumulate can help steady-state throughput.
    • But it makes recovery more expensive when failure happens.
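Trade-offs 2 and 4 can be put into back-of-envelope numbers. The figures below are illustrative assumptions, not measurements from any real system:

```python
# Worst-case replay time after a crash, as a simple model:
# the crash lands just before a checkpoint, so the whole interval's
# log tail must be replayed at the engine's replay bandwidth.
def recovery_time_s(write_mb_per_s, checkpoint_interval_s, replay_mb_per_s):
    log_tail_mb = write_mb_per_s * checkpoint_interval_s
    return log_tail_mb / replay_mb_per_s

# 50 MB/s of log, a checkpoint every 5 minutes, replay at 200 MB/s:
print(recovery_time_s(50, 300, 600 - 400))  # 75.0 seconds of replay
```

Halving the checkpoint interval halves the worst-case replay time, at the cost of doubling checkpoint-driven background writes; that is the knob this trade-off turns.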

Where teams often go wrong:

    • Loosening the sync policy for latency without agreeing on how much acknowledged data loss is acceptable.
    • Treating recovery time as something to discover during the first real outage instead of a tested, budgeted target.
    • Letting log retention grow unbounded because nothing obviously breaks until the disk fills or a restart stalls.
The systems view: WAL is the first half of durability. The second half is whether recovery remains fast and correct when the log is large and the crash timing is inconvenient.


Troubleshooting

Issue: The application received success, but the last few writes disappeared after a crash.

Why it happens: The system likely acknowledged commit before the log tail was durably forced, or relied on a weaker sync policy than the team assumed.

Clarification / Fix: Check commit durability settings, fsync behavior, storage guarantees, and whether the application confused "replicated eventually" with "durable locally now."

Issue: Restart after crash takes far too long.

Why it happens: The engine may need to scan and replay a large log tail because checkpoints were infrequent or dirty state had grown too large.

Clarification / Fix: Review checkpoint cadence, log retention, and recovery metrics. Recovery-time objectives should be treated as a first-class performance target.

Issue: Storage usage for logs keeps growing.

Why it happens: Logs may be pinned by checkpoints, replicas, snapshots, or archival policies.

Clarification / Fix: Inspect retention dependencies instead of deleting log files blindly. A WAL that looks old may still be required by replication or recovery guarantees.

Issue: Write throughput drops during bursts.

Why it happens: The log device, fsync path, or group-commit window may be the real bottleneck, not the data pages themselves.

Clarification / Fix: Measure log append bandwidth, fsync latency, queue depth, and commit batching. The hot path is often the WAL device.
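The group-commit idea behind that batching can be sketched as follows. The class name, locking structure, and record format are assumptions for illustration; real engines track durability with log sequence numbers and use condition variables so appends can proceed while a flush is in flight.

```python
import os
import tempfile
import threading

# Group-commit sketch: committers piggyback on one fsync instead of
# paying one fsync each (illustrative, not a real engine's design).
class GroupCommitLog:
    def __init__(self, path):
        self.f = open(path, "ab")
        self.lock = threading.Lock()
        self.durable_upto = 0  # byte offset known to be on stable storage

    def append(self, record: bytes) -> int:
        with self.lock:
            self.f.write(record)
            return self.f.tell()  # this commit's end offset in the log

    def wait_durable(self, offset: int):
        with self.lock:
            if self.durable_upto >= offset:
                return  # an earlier fsync already covered this record
            self.f.flush()
            os.fsync(self.f.fileno())          # one fsync covers every
            self.durable_upto = self.f.tell()  # record appended so far

log = GroupCommitLog(os.path.join(tempfile.mkdtemp(), "wal.log"))
o1 = log.append(b"T1 commit\n")
o2 = log.append(b"T2 commit\n")
log.wait_durable(o2)  # one fsync makes both T1 and T2 durable
log.wait_durable(o1)  # no-op: T1's record was covered by that fsync
```

Under a burst, the later `wait_durable` calls return without touching the device at all, which is why commit throughput can far exceed the device's fsync rate.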


Advanced Connections

Connection 1: WAL <-> Buffer Pools and Dirty Pages

The parallel: Buffer pools let engines delay page writes. WAL is what makes that delay safe.

Why this matters: Once you understand WAL, deferred flushing stops looking risky and starts looking like a controlled performance technique.

Connection 2: WAL <-> Replication and Distributed Logs

The parallel: Many replicated systems also rely on ordered durable logs before state-machine application. Local crash recovery and distributed replication are different problems, but they often share the same pattern: durable ordered intent first, materialized state second.

Why this matters: This is why database internals, consensus logs, and streaming systems feel related. They all separate "record the decision durably" from "fully materialize the effect everywhere."



Key Insights

  1. Durability does not mean "flush final pages immediately" - it means there is a durable history that recovery can trust.
  2. WAL separates commit from page flush - that separation is what makes modern engines both fast and correct.
  3. Crash recovery is part of the design, not an afterthought - checkpointing, redo cost, and log retention are all part of the same durability story.
