LESSON
Day 389: Write Path Fundamentals and Append-Only Logging
The core idea: A database makes commits cheap and crash-safe by separating "record the change durably" from "rewrite every affected data page now," using an append-only log as the authoritative history during the gap.
Today's "Aha!" Moment
In 04.md, the buffer pool could not evict a dirty page whenever it pleased, because recovery rules still constrained writeback. That constraint comes from the write path. When Harbor Point ingests a new municipal-bond quote at 09:30, the engine does not treat the final heap page or B-tree page as the first place where durability happens. Those pages are just the current in-memory working set. The first durability boundary is the append-only log.
That is the shift that makes database write paths easier to reason about. A logical update usually fans out into several physical changes: a heap tuple changes, one or more index pages change, transaction state changes, and buffer metadata changes. If the engine waited for all of those pages to be rewritten in place before every commit returned, write latency would balloon under random I/O and concurrent writers would serialize on page flushes. Instead, the engine writes compact redo information to a sequential log, makes that log durable, and lets dirty data pages catch up later.
The common misconception is that COMMIT means "the new row bytes are already in their final data files." In most production engines, that is not what success means. Success means the engine has persisted enough ordered log information that, after a crash, it can reconstruct the committed result even if the data files still hold older page versions. Once that clicks, checkpoint behavior, recovery time, and the next lesson on 06.md all become easier to connect.
Why This Matters
Harbor Point's quote ingester updates the municipal_quotes heap and its issuer/time indexes continuously through the market-open burst. One quote revision can dirty multiple pages even though the client issued one logical statement. If each commit forced every affected page to disk immediately, storage would see a storm of small random writes, the hottest pages would ping-pong between memory and storage, and throughput would fall long before the database had exhausted CPU.
Append-only logging changes that shape. The commit path can batch many small logical updates into a mostly sequential stream, and the background writer or checkpoint process can flush data pages later in larger, more controlled bursts. That design is why production teams watch WAL fsync latency, checkpoint distance, dirty-page age, and recovery lag instead of only staring at query time. The write path is not just "how data gets stored"; it is the contract that determines commit latency, crash recovery behavior, and how much write amplification the engine is willing to absorb.
Learning Objectives
By the end of this session, you will be able to:
- Explain why database commits are split across log and data files - Describe why append-only logging exists and what problem it solves for concurrent writes.
- Trace the mechanics of a WAL-backed commit - Follow a write from buffer-pool modification through log records, flush ordering, acknowledgment, and later page persistence.
- Evaluate production trade-offs in the write path - Reason about group commit, checkpoint pressure, recovery time, and why physical file layout changes write amplification.
Core Concepts Explained
Concept 1: One logical write becomes several coordinated physical changes
Return to Harbor Point's recurring workload. A new dealer quote arrives for a bond already on the newest heap page. The engine updates the tuple version or inserts a new one, adjusts one or more index entries, and marks those pages dirty in the buffer pool. None of that is durable yet. At this point the engine has only changed memory.
What matters is that the logical statement does not map to one physical write. The heap page and the B-tree leaf probably live in different parts of the data files, and future readers may need both to agree about which quote version is visible. The write path therefore needs a single ordered history describing those page changes, or recovery after a crash would have no authoritative story for what happened.
That history is the write-ahead log. As the engine performs the in-memory update, it also generates redo records that describe enough detail to replay the change later. Each record is assigned a monotonically increasing log sequence number, often shortened to LSN. The modified data pages remember the most recent LSN that affected them, usually in a page header field such as page_lsn.
The normal path looks like this:
client issues UPDATE
-> executor changes heap/index pages in buffer pool
-> log manager appends redo records to WAL buffer
-> commit record is appended
-> WAL is flushed to durable storage
-> client receives COMMIT success
-> dirty data pages are written later by cleaners/checkpoints
The benefit is that commit latency is tied to a mostly sequential log write instead of waiting for every random data-page write to finish. The cost is double work: the engine eventually writes both the log records and the data pages. That extra write amplification is intentional. The system is buying ordering and recoverability, not merely speed.
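The normal path above can be sketched as a toy model. Everything here is illustrative, not any real engine's API: a single in-memory WAL buffer stands in for the log manager, and list extension stands in for fsync.

```python
# Toy model of a WAL-backed commit path. All names and structures are
# illustrative; real engines differ in record formats, locking, and
# flush mechanics.

class Engine:
    def __init__(self):
        self.next_lsn = 1          # monotonically increasing log sequence number
        self.wal_buffer = []       # redo records not yet durable
        self.durable_wal = []      # records that have "reached disk"
        self.pages = {}            # page_id -> {"data", "page_lsn", "dirty"}

    def log(self, record):
        lsn = self.next_lsn
        self.next_lsn += 1
        self.wal_buffer.append((lsn, record))
        return lsn

    def update_page(self, page_id, new_data):
        # 1. Change the page in the buffer pool (memory only).
        page = self.pages.setdefault(
            page_id, {"data": None, "page_lsn": 0, "dirty": False})
        page["data"] = new_data
        # 2. Append a redo record and stamp the page with its LSN.
        lsn = self.log(("redo", page_id, new_data))
        page["page_lsn"] = lsn
        page["dirty"] = True

    def commit(self):
        # 3. Append the commit record, then make the WAL durable.
        self.log(("commit",))
        self.durable_wal.extend(self.wal_buffer)   # stands in for fsync()
        self.wal_buffer.clear()
        # 4. Only now may the client see success. Dirty data pages are
        #    still memory-only and will be written back later.
        return "COMMIT"

engine = Engine()
engine.update_page("heap_p7", "quote v2")
engine.update_page("idx_leaf_3", "ptr -> quote v2")
assert engine.commit() == "COMMIT"
assert all(p["dirty"] for p in engine.pages.values())  # data pages still lag
```

Note the ordering baked into `commit`: the commit record joins the log, the log becomes durable, and only then does the client hear success, while the heap and index pages remain dirty in memory.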
Concept 2: The write-ahead rule is what makes the append-only log authoritative
The previous lesson established that dirty pages cannot be evicted safely without checking recovery state. Here is the exact rule behind that behavior: before a dirty page reaches durable storage, the WAL records describing every change on that page must already be durable. In shorthand, durable_wal_lsn >= page_lsn must hold before the data page can be flushed.
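The invariant reduces to a one-line check that a page cleaner would run before flushing. A minimal sketch, with hypothetical names:

```python
def can_flush(page_lsn: int, durable_wal_lsn: int) -> bool:
    """Write-ahead rule: a dirty page may reach durable storage only
    after every WAL record describing its changes is already durable."""
    return durable_wal_lsn >= page_lsn

# A page stamped with LSN 120 cannot be written while durable WAL has
# only reached LSN 100; the log must lead the data pages.
assert not can_flush(page_lsn=120, durable_wal_lsn=100)
assert can_flush(page_lsn=120, durable_wal_lsn=120)
```

When the check fails, a real cleaner either forces the log forward first or picks a different victim page, which is exactly the eviction constraint the previous lesson observed.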
That rule is what turns the log into the source of truth during crashes. Suppose Harbor Point acknowledges a commit for a quote at 09:31:17, flushes the WAL, and then loses power before the heap page or index leaf is written back. On restart, the database reads the last checkpoint, scans forward through the WAL, and reapplies the logged changes to any page whose on-disk page_lsn is behind. The committed quote reappears because the log survived, even though the main data files lagged at the moment of the crash.
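Restart recovery in this picture is a forward scan from the checkpoint, reapplying any logged change whose LSN is ahead of the on-disk page. A hedged sketch, with the record format invented for illustration:

```python
def redo(durable_wal, on_disk_pages, checkpoint_lsn):
    """Replay durable WAL forward from the last checkpoint, skipping
    pages that already reflect a record (page_lsn >= record's LSN)."""
    for lsn, record in durable_wal:
        if lsn <= checkpoint_lsn or record[0] != "redo":
            continue
        _, page_id, new_data = record
        page = on_disk_pages.setdefault(page_id, {"data": None, "page_lsn": 0})
        if page["page_lsn"] < lsn:      # this page lagged behind the log
            page["data"] = new_data
            page["page_lsn"] = lsn
    return on_disk_pages

# The heap page never reached disk before the crash; replay restores it.
wal = [(1, ("redo", "heap_p7", "quote v2")), (2, ("commit",))]
pages = redo(wal, {"heap_p7": {"data": "quote v1", "page_lsn": 0}},
             checkpoint_lsn=0)
assert pages["heap_p7"]["data"] == "quote v2"
```

The `page_lsn < lsn` guard is what makes replay idempotent: pages that were already written back before the crash are skipped rather than reapplied.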
This is why many engines use a no-force commit policy: they do not force every dirty data page to disk before returning success. They only force the log records that make recovery possible. Some engines also allow dirty pages to be written before transaction end under tighter recovery rules, but the universal invariant stays the same: the log must lead the data pages, never the other way around.
An append-only layout helps here for two reasons. First, sequential appends are much friendlier to storage devices than scattering tiny overwrites across many pages. Second, ordered log addresses make it easy to define recovery progress. If a page says page_lsn = 8F/19A24010, the system can ask a precise question: has durable WAL reached that point yet?
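The 8F/19A24010 notation follows PostgreSQL's convention of printing a 64-bit LSN as two hexadecimal halves separated by a slash. Once parsed back into an integer, the recovery question is plain ordering:

```python
def parse_lsn(text: str) -> int:
    """Parse a PostgreSQL-style 'high/low' hex LSN into a 64-bit int."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)

page_lsn = parse_lsn("8F/19A24010")
durable_wal_lsn = parse_lsn("8F/19A30000")

# Has durable WAL reached the point recorded on the page?
assert durable_wal_lsn >= page_lsn
```

Engines that print LSNs as a single integer byte offset (InnoDB, for example) answer the same question with the same comparison; only the display format differs.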
The trade-off is that recovery now depends on the health and sizing of the WAL subsystem. If log devices saturate or log buffers back up, commit latency rises even when the rest of the engine looks idle. The log is not "just metadata." It is on the hot path of every durable write.
Concept 3: Real write-path tuning is about shaping flushes, not eliminating them
At Harbor Point, market-open traffic produces bursts of tiny quote revisions. The engine benefits if it can batch many commit records into one fsync, a technique usually called group commit. That can improve throughput dramatically because 200 transactions may share one durable flush instead of 200 separate sync operations. The trade-off is visible in tail latency: transactions at the front of a batch may wait a little longer so the whole batch can amortize the flush.
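That batching behavior can be modeled with a few lines. This is a deliberately simplified simulation: the window and fsync cost are made-up numbers, arrival times are assumed sorted, and queuing effects are ignored.

```python
def group_commit(commit_arrivals, fsync_cost, window):
    """Batch commits arriving within `window` of a batch leader into one
    fsync. Returns (fsync_count, per-transaction commit latencies).
    Times are arbitrary units; arrivals must be sorted ascending."""
    fsyncs = 0
    latencies = []
    i = 0
    while i < len(commit_arrivals):
        leader = commit_arrivals[i]
        # Everyone arriving before the batch's flush begins joins it.
        batch = [t for t in commit_arrivals[i:] if t <= leader + window]
        flush_done = leader + window + fsync_cost
        latencies += [flush_done - t for t in batch]  # early arrivals wait longest
        fsyncs += 1
        i += len(batch)
    return fsyncs, latencies

# Five commits close together share one flush instead of paying for five.
fsyncs, lats = group_commit([0.0, 0.1, 0.2, 0.3, 0.4],
                            fsync_cost=1.0, window=0.5)
assert fsyncs == 1
assert max(lats) > min(lats)   # the batch leader absorbs extra tail latency
```

The assertion at the end captures the trade-off from the text: total sync work drops by 5x, but the transaction at the front of the batch waits longer than the one that arrived last.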
The second production lever is checkpointing. Dirty pages cannot stay in memory forever, because restart time would grow without bound and the buffer pool would eventually run out of reclaimable frames. A checkpoint gives recovery a newer starting point and pushes part of the dirty working set back into the main data files. But the frequency matters. Very aggressive checkpoints create bursts of random writes and can compete with foreground work. Very relaxed checkpoints keep foreground writes cheap for a while but increase restart time and can cause painful cleaning spikes later.
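The restart-time side of that trade-off is roughly back-of-envelope arithmetic: a crash just before the next checkpoint forces recovery to replay everything logged since the last one. The rates below are assumptions for illustration, not measurements.

```python
def estimated_recovery_seconds(wal_rate_mb_s, checkpoint_interval_s,
                               replay_rate_mb_s):
    """Rough worst-case replay time if a crash lands just before the
    next checkpoint: all WAL since the previous one must be reapplied.
    Real replay speed varies with page residency and random-read cost."""
    wal_to_replay_mb = wal_rate_mb_s * checkpoint_interval_s
    return wal_to_replay_mb / replay_rate_mb_s

# 50 MB/s of WAL, 10-minute checkpoints, 100 MB/s replay -> ~5 min restart.
assert estimated_recovery_seconds(50, 600, 100) == 300.0
```

Halving the checkpoint interval roughly halves the worst-case restart time in this model, at the cost of flushing the dirty working set twice as often, which is the cadence tension described above.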
This is where the earlier lessons in month 25 start to combine. Wider rows from 03.md mean one quote revision may dirty more page bytes. The buffer-pool rules from 04.md decide whether foreground queries stall waiting for safe victims. The next question, which 06.md takes up directly, is where those dirty pages live in the first place. A heap file, a sorted file, and a hash-organized file create very different write amplification, page-split behavior, and locality even when they all rely on the same append-only WAL discipline.
In other words, append-only logging solves the ordering problem, not the whole write problem. It tells the engine how to make many page changes durable as one history. It does not decide how many pages each workload dirties, how often those pages must be reorganized, or which data structure should absorb the next update. Those are storage-layout questions layered on top of the write path, not substitutes for it.
Troubleshooting
Issue: Commit latency spikes during write-heavy periods even though CPU and query-planner metrics look normal.
Why it happens / is confusing: The bottleneck is often WAL flush latency, saturated log bandwidth, or a collapse in group-commit efficiency rather than executor work. A database can look "idle" in CPU terms while transactions queue behind durable log sync.
Clarification / Fix: Check WAL bytes per second, fsync time, log-buffer waits, and whether checkpoint I/O is contending with the log device. If the log path is the limiter, tuning queries alone will not help.
Issue: After a crash, the main data files appear older than the most recently acknowledged commits.
Why it happens / is confusing: In a no-force design, that lag is expected. The data files are allowed to trail the durable commit point because restart recovery is supposed to replay WAL and close the gap.
Clarification / Fix: Verify that recovery completed, inspect replay LSN progress, and compare page LSNs with durable WAL position before concluding data was lost or corrupted.
Issue: Checkpoints cause bursty write stalls and sudden buffer-pool pressure.
Why it happens / is confusing: Too many dirty pages may have accumulated, forcing cleaners to flush a large backlog under time pressure. Foreground misses then compete for safe victims at the same moment the engine is trying to shrink recovery distance.
Clarification / Fix: Tune checkpoint cadence and dirty-page cleaning so writeback is smoother. The goal is not "fewest checkpoints possible"; it is a steady balance between restart time, flush bandwidth, and foreground latency.
Advanced Connections
Connection 1: Write-ahead logging <-> filesystem journaling
Both systems separate "record the intended change durably" from "apply the final structure update everywhere." A filesystem journal logs metadata changes before updating home locations; a database WAL logs page modifications before those pages are flushed. The similarity matters operationally because both systems trade extra write volume for predictable crash recovery.
Connection 2: Write-ahead logging <-> replication and change streams
An append-only commit history is useful for more than restart recovery. PostgreSQL physical replication ships WAL records to standbys, and many CDC pipelines read a database's ordered change stream to feed downstream systems. The same ordering guarantee that protects Harbor Point during a power loss also lets external replicas and consumers follow the database in the right sequence.
Resources
Optional Deepening Resources
- [DOC] PostgreSQL Documentation: Write-Ahead Logging (WAL)
- Focus: Trace the no-force commit model and the ordering rule between WAL and data-page flushes.
- [DOC] MySQL 8.0 Reference Manual: InnoDB Redo Log
- Focus: Compare redo logging, checkpointing, and log-flush behavior in another production storage engine.
- [PAPER] ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging
- Focus: Study the recovery model that explains why WAL ordering rules exist and why append-only logs support fast normal-path commits.
Key Insights
- A commit is usually a log-durability event, not an immediate data-file rewrite - That split is what keeps random data-page writes from dominating every transaction.
- The write-ahead rule is the recovery contract - Data pages may lag, but they are never allowed to outrun the durable log records needed to reconstruct them.
- Write-path tuning is mostly about shaping flush timing and amplification - Group commit, checkpoint policy, and physical file organization determine whether WAL remains an efficient abstraction or becomes the next bottleneck.