Day 407: WAL Internals: LSN, pageLSN, and Flush Ordering

Database Engine Internals and Implementation · 30 min · advanced

The core idea: Once the validation logic from 06.md decides which transaction is allowed to commit, write-ahead logging makes that decision crash-safe by giving every log record an LSN, stamping dirty pages with a pageLSN, and refusing to flush data pages ahead of the WAL that explains them.

Today's "Aha!" Moment

At Harbor Point, the risk engine has already solved the logical conflict from 06.md: trader R1 won validation for a 0.5M MUNI-77 reservation, and trader R2 was forced to retry. That tells the database which history is legal. It does not yet tell the database what happens if the machine loses power three milliseconds later.

That is the job of WAL internals. The update to MUNI-77 is first described as a log record with a monotonically increasing log sequence number, or LSN. When the buffer manager applies that change to an in-memory page, it stamps the page with the same value as pageLSN, meaning "this page now reflects all log records up through here." From then on, the storage engine has an uncompromising rule: the page is not allowed to reach disk until the WAL is durable at least through that pageLSN.

The key realization is that the data file is not the primary source of truth at commit time. The WAL is. Harbor Point can acknowledge R1 before the modified page itself is written to the table file, because the durable promise is "the log contains enough information to redo this change after a crash," not "every touched page has already been flushed."

That is why LSNs and flush ordering matter in production. If the ordering is wrong, recovery may find a page whose contents have no durable explanation, or it may lose a committed trade that the application was told was safe. The next lesson, 08.md, depends on this contract: redo can only be idempotent if the engine knows exactly which log records a page has already absorbed.

Why This Matters

Harbor Point's market-open burst is the worst time to discover that "commit succeeded" and "page reached disk" are not the same event. During the burst, hundreds of reservations dirty pages faster than the data files can be written. If the database required every modified page to flush before acknowledging a commit, commit latency would be dominated by random data-page I/O and the desk would stall under load.

Real engines therefore use a steal/no-force design. "No-force" means a commit does not force every dirty page to disk. "Steal" means the buffer manager is still allowed to flush a dirty page before its transaction commits, which helps memory pressure but creates a recovery obligation. WAL is the mechanism that makes both choices safe. The log must become durable before a dirty page hits disk, and a transaction's commit record must become durable before the client hears "committed."

The production relevance is immediate. If the WAL device falls behind, Harbor Point may see rising commit latency even while the data device looks underutilized, because the checkpoint and page writers are blocked by the log-first rule. If the engine mis-orders those writes, the failure mode is worse than slowness: crash recovery may reconstruct an impossible state. Understanding LSNs and pageLSN is how you separate healthy buffering from correctness debt.

Learning Objectives

By the end of this session, you will be able to:

  1. Explain what LSN and pageLSN each represent - Describe how the engine maps an in-memory page version back to the WAL timeline.
  2. Trace the flush-ordering path for a committing write - Follow Harbor Point's reservation from log record creation to commit acknowledgment and later page flush.
  3. Reason about production trade-offs - Explain why databases usually fsync WAL before acknowledging commit instead of forcing data pages, and what operational bottlenecks that creates.

Core Concepts Explained

Concept 1: LSN is the durable timeline, and pageLSN says how far a page has caught up

When R1 inserts the 0.5M reservation for MUNI-77, the engine does not begin by overwriting the table file. It first appends an update record to the WAL buffer. That record is assigned an LSN, often a byte position or logical address in the WAL stream. The exact encoding varies by engine, but the contract is the same: higher LSN means later in the durable history.

Now the buffer manager modifies the in-memory page that stores the reservation row. The important side effect is not just the new tuple; it is the page header update to pageLSN = update_lsn. pageLSN is a compact statement about page state: "this page already includes every logged change up through this LSN." Without that stamp, recovery would have no reliable way to compare page contents with redo records.

For Harbor Point, that relationship looks like this:

WAL stream:
  ... -> LSN 9810240: insert reservation R1 for MUNI-77
      -> LSN 9810312: commit transaction R1

Buffer page 842:
  rows: [..., reservation R1]
  pageLSN: 9810240

This explains why pageLSN belongs on the page, not just in a transaction table. Recovery works page by page. When the engine later scans redo, it can ask a local question for page 842: "Is this log record newer than the page image already on disk?" If the page on disk has pageLSN >= redo_lsn, the change is already reflected and redo can skip it safely. That is the bridge to 08.md.
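That local comparison can be sketched as a tiny predicate. This is a toy model, not any engine's actual API: the `Page` class and `redo_if_needed` names are illustrative, and rows are simplified to strings.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    page_id: int
    page_lsn: int                   # highest LSN whose effects this page already contains
    rows: list = field(default_factory=list)

def redo_if_needed(page: Page, record_lsn: int, payload: str) -> bool:
    """Apply a redo record only if the page has not absorbed it yet."""
    if page.page_lsn >= record_lsn:
        return False                # page already reflects this change: skip safely
    page.rows.append(payload)       # re-apply the logged change
    page.page_lsn = record_lsn      # advance the page's position on the WAL timeline
    return True

# Page 842 as recovery might find it on disk, already stamped at 9810240.
page_842 = Page(page_id=842, page_lsn=9810240, rows=["reservation R1"])
first = redo_if_needed(page_842, 9810240, "reservation R1")   # skipped: already absorbed
second = redo_if_needed(page_842, 9810500, "reservation R3")  # applied: page was behind
```

Because the decision depends only on the page header and the record's LSN, the same log can be replayed any number of times without double-applying a change, which is exactly the idempotence 08.md builds on.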

The trade-off is tighter coupling between components that would otherwise look separate. The log manager assigns LSNs, the buffer manager mutates pages, and the recovery subsystem trusts pageLSN to tell the truth. Once those contracts drift apart, no amount of throughput tuning can restore correctness.

Concept 2: Flush ordering is what lets the engine acknowledge commit before the data page is written

The phrase "write-ahead logging" is easy to misread. It does not mean "flush the log before touching memory." Harbor Point's page is updated in memory immediately after the update record is generated. The actual rule is narrower and more useful: before a dirty page is written to persistent storage, the WAL must be durable through that page's pageLSN. And before a transaction is acknowledged as committed, the WAL must be durable through the commit record's LSN.

For R1, the normal path is:

  1. The engine appends an update log record for the reservation and gets LSN = 9810240.
  2. It inserts the row on buffer page 842 and sets pageLSN = 9810240.
  3. At commit, it appends a commit record, say LSN = 9810312.
  4. The WAL writer flushes the log to durable storage through 9810312.
  5. Only after that flush succeeds does Harbor Point return success to the trader.
  6. The modified table page may still remain dirty in memory and reach disk later.

That last step is the non-obvious one. The database is free to acknowledge R1 before page 842 is flushed because redo can recreate the page from WAL after a crash. The durable contract is attached to the log, not to immediate data-file persistence. This is why commit latency in many engines tracks WAL fsync latency much more closely than data-file write latency.
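The commit-side half of that ordering can be sketched as follows. `Wal` is a toy in-memory log, `flush_through` stands in for whatever fsync-like durability call the platform provides, and the LSN arithmetic is illustrative.

```python
class Wal:
    def __init__(self):
        self.records = []               # (lsn, body) appended in order
        self.next_lsn = 9810240
        self.flushed_lsn = 0            # durable frontier of the log

    def append(self, body: str) -> int:
        lsn = self.next_lsn
        self.records.append((lsn, body))
        self.next_lsn += len(body) + 16  # LSN modeled as a byte-position-like address
        return lsn

    def flush_through(self, lsn: int) -> None:
        """Stand-in for fsync: everything up to `lsn` is now durable."""
        self.flushed_lsn = max(self.flushed_lsn, lsn)

def commit(wal: Wal, txn: str) -> str:
    commit_lsn = wal.append(f"commit {txn}")
    wal.flush_through(commit_lsn)       # WAL durable through the commit record...
    return "committed"                  # ...and only then is the client acknowledged

wal = Wal()
update_lsn = wal.append("insert reservation R1 for MUNI-77")
status = commit(wal, "R1")              # page 842 may still be dirty in memory here
```

Note what the sketch never does: it does not write the data page at all. The acknowledgment depends solely on the log's durable frontier reaching the commit record.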

The same ordering rule also permits steal. Suppose memory pressure forces page 842 out of the buffer pool before R1 commits. The engine may flush that dirty page early, but only if the WAL is already durable through pageLSN = 9810240. If the process crashes after the flush but before the commit record is durable, recovery can still undo or ignore the uncommitted effects because the page never outran the log that explains how it got there.

The performance trade-off is deliberate. Sequential WAL writes and group commit are usually cheaper than forcing many scattered data pages on every commit. In exchange, the engine accepts a more sophisticated recovery path and stronger coupling between dirty-page management and WAL flush progress.
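The page-side half of the rule is a guard in the buffer manager's flush path. This minimal sketch assumes a module-level durable frontier and a hypothetical `force_wal` helper; real engines track the frontier inside the log manager.

```python
durable_wal_lsn = 9810000               # WAL durable frontier (illustrative number)

def force_wal(lsn: int) -> None:
    """Stand-in for flushing the log through `lsn` (e.g. an fsync of the WAL file)."""
    global durable_wal_lsn
    durable_wal_lsn = max(durable_wal_lsn, lsn)

def flush_dirty_page(page_lsn: int) -> str:
    """Steal: a dirty page may be written early, but never ahead of its WAL."""
    if durable_wal_lsn < page_lsn:
        force_wal(page_lsn)             # log first...
    return "page written"               # ...data later

result = flush_dirty_page(9810240)      # forces WAL through 9810240, then writes
```

The guard makes "steal" safe: evicting page 842 before R1 commits is allowed precisely because the write is blocked until the explaining log records are durable.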

Concept 3: Production bottlenecks often show up as an LSN gap, not as obvious disk saturation

At 09:32, Harbor Point's dashboards show a confusing picture: the data volume is only half busy, but commit latency jumps from 4 ms to 35 ms, dirty pages accumulate, and checkpoints take longer to finish. The missing variable is WAL flush progress. If the current end of log advances much faster than the durable flushed LSN, the system starts to accumulate dirty pages whose pageLSN cannot legally reach disk yet.

That creates a chain reaction. Page cleaners cannot evict some dirty buffers because their explaining WAL is not durable. Checkpoints drag on because too many pages remain pinned behind the WAL frontier. Backpressure reaches transactions as longer commit waits, not because every commit flushes its own data pages, but because commit still needs the log device to certify the durable order. In other words, WAL can become the critical path even when the table-space device looks fine.

This is why production observability has to include LSN-centric signals: current insert LSN, flushed LSN, checkpoint redo LSN, dirty-page backlog, and recovery distance implied by the outstanding WAL. Without them, Harbor Point might misdiagnose the problem as a generic storage slowdown and spend days tuning page flushers when the real bottleneck is a saturated log device or an fsync configuration change.
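When LSNs are exposed as byte-position-like addresses, those signals reduce to simple arithmetic. The field names and sample values below are hypothetical, not any specific engine's catalog views.

```python
def wal_health(insert_lsn: int, flushed_lsn: int, checkpoint_redo_lsn: int) -> dict:
    """Derive the gap metrics an operator would alert on."""
    return {
        # log generated but not yet durable: this is commit-latency pressure
        "unflushed_bytes": insert_lsn - flushed_lsn,
        # log since the last checkpoint: this bounds crash-recovery work
        "recovery_distance_bytes": insert_lsn - checkpoint_redo_lsn,
    }

m = wal_health(insert_lsn=9_950_000, flushed_lsn=9_810_312,
               checkpoint_redo_lsn=9_600_000)
# A widening unflushed gap alongside an idle data volume points at the log
# device, not the table-space disks.
```

Trending these two numbers over time is usually enough to distinguish a saturated log device (unflushed gap grows) from checkpoint lag (recovery distance grows).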

The design trade-off is worth it because the alternative is much worse. If Harbor Point forced every data page at commit, the desk would pay higher steady-state latency. If it relaxed flush ordering instead, the database could expose pages whose contents have no durable explanation. WAL chooses a middle path: log first, data later, with exact ordering rules. The next lesson uses that same path to explain why redo can be repeated safely after multiple crashes.

Troubleshooting

Issue: Commit latency rises sharply, but the table-space disks are not saturated.

Why it happens / is confusing: Commit acknowledgment waits on WAL durability, not on every data page flush. A slow log device or more expensive fsync path can dominate latency while the main data volume still looks comfortable.

Clarification / Fix: Check WAL flush time, group-commit behavior, and the gap between inserted and flushed LSN. Treat the log device as a first-class performance bottleneck.

Issue: Dirty pages pile up and checkpoints stall even though the buffer pool is large.

Why it happens / is confusing: Buffer size helps only until dirty pages need eviction. A page with pageLSN greater than the durable WAL frontier cannot legally be written, so page cleaners stall behind WAL progress.

Clarification / Fix: Inspect checkpoint metrics together with WAL throughput. If the LSN gap is widening, reducing checkpoint aggressiveness alone will not solve the root cause.

Issue: Engineers assume a committed transaction must already exist in the data file on disk.

Why it happens / is confusing: "Committed" sounds like "fully materialized everywhere," but in a no-force system it really means the commit record is durable in WAL and the engine can redo the change later.

Clarification / Fix: Separate durability of intent from persistence of page images. The log makes the transaction durable first; data pages catch up afterward.

Issue: After enabling asynchronous commit, a small number of acknowledged writes disappear after crash tests.

Why it happens / is confusing: Async commit weakens the usual rule: the client is acknowledged before the commit record is guaranteed durable. That can be an acceptable trade for some workloads, but it changes the durability contract.

Clarification / Fix: Use sync commit for truly durable operations, and make the weaker guarantee explicit in API semantics and operational docs.

Advanced Connections

Connection 1: Optimistic commit logic in 06.md needs WAL to make the winner durable

The previous lesson decided whether Harbor Point was allowed to commit R1 and reject R2. WAL picks up immediately after that decision. Validation answers "which history is legal?"; LSN assignment and flush ordering answer "which legal history survives power loss?" Production systems need both layers, because a transaction that is serializable but not durably ordered is still not safe.

Connection 2: pageLSN is the hook recovery uses in 08.md for idempotent redo

ARIES-style recovery and engines influenced by it replay log records by comparing each redo record's LSN with the page's stored pageLSN. That local comparison is what makes redo repeatable after repeated crashes: if the page already includes the change, skip it; if not, apply it. Today's flush-ordering rules are therefore the prerequisite for tomorrow's recovery correctness story.

Key Insights

  1. LSN is the storage engine's durable ordering primitive - It gives updates, commits, and later recovery work a single comparable timeline.
  2. pageLSN turns a page image into a statement about WAL history - Recovery can decide whether redo is needed only because each page says how far into the log it already reflects.
  3. Flush ordering makes steal/no-force practical - Commit durability comes from WAL fsync, while delayed page flushes preserve throughput and shift complexity into recovery.