Storage Engines: Log-Structured vs Page-Oriented

LESSON

Data Architecture and Platforms

015 · 30 min · advanced

Day 463: Storage Engines: Log-Structured vs Page-Oriented

The core idea: Page-oriented engines keep durable state in mutable pages and use a log to make those page updates crash-safe, while log-structured engines treat append-only files as the durable state and push cleanup into background merge or compaction work.

Today's "Aha!" Moment

In Systems of Record vs Derived Data, PayLedger established PostgreSQL as the place where payout state becomes official and Elasticsearch as a derived search surface. That architectural split explains who is allowed to decide the truth. This lesson explains why those two systems behave so differently under the same payroll burst.

At payroll cutoff, PayLedger writes thousands of small status transitions such as approved -> processing -> settled while support agents search for problem payouts in parallel. PostgreSQL starts showing checkpoint pressure and dirty-buffer churn. Elasticsearch keeps up with ingest at first, then query latency rises while merges catch up. Both systems are persisting the same business story, but they are paying the I/O bill in different places because their storage engines are built around different assumptions.

The useful mental shift is that "log-structured" does not mean "has a log" and "page-oriented" does not mean "never appends." PostgreSQL also appends to a write-ahead log. LSM trees and Lucene segments also cache data in memory before flushing. The real question is which structure is treated as the durable truth after a crash: mutable pages that are repaired from a journal, or immutable append files that are periodically compacted into a cleaner shape.

Why This Matters

Teams often choose a database by feature checklist or benchmark headline, then discover the real trade-off only when production traffic settles into a pattern. A page-oriented engine may feel stable for mixed reads and point updates until random writes, page splits, and background cleanup collide with a burst window. A log-structured engine may absorb writes beautifully until compaction debt turns a cheap ingest path into expensive read amplification and disk churn.

PayLedger cannot afford to treat these effects as implementation trivia. The payout ledger has to keep small transactional reads and writes predictable because settlement workflows, operator tools, and audit checks all depend on it. The search index can tolerate seconds of lag, but it must survive heavy append traffic without turning every support query into a segment fan-out problem. Once the storage-engine model is explicit, symptoms such as checkpoint spikes, tombstone buildup, or compaction stalls stop looking mysterious and start looking like expected consequences of the chosen write path.

That is the production value of this lesson: you stop asking "which database is faster?" and start asking "where does this engine place write amplification, read amplification, space amplification, and recovery work for my workload?"

Learning Objectives

By the end of this session, you will be able to:

  1. Explain the physical write path in each engine family - Trace how page-oriented and log-structured systems turn an application update into durable bytes.
  2. Compare the operational trade-offs explicitly - Relate latency, amplification, and maintenance work to the underlying storage layout rather than treating them as isolated tuning issues.
  3. Apply engine-family reasoning to architecture choices - Decide why an authoritative OLTP store, a search index, or an event-heavy projection may prefer different storage behavior.

Core Concepts Explained

Concept 1: Page-oriented engines optimize around mutable pages plus a write-ahead journal

For the authoritative payout ledger, PayLedger uses PostgreSQL because the workload is dominated by short transactions, point lookups by payout ID, constrained state transitions, and range queries such as "all payouts for employer X created today." A page-oriented engine fits that shape by organizing data into fixed-size pages and using page-aware index structures, usually B-trees, so reads can navigate a relatively compact path to the needed tuples.

The crucial mechanism is that the durable database is the page set, not the log. When a payout changes from processing to settled, PostgreSQL records the change in the write-ahead log first, updates in-memory copies of the affected heap and index pages, and flushes those dirty pages later during checkpoints or background writes. The log exists so that if the node crashes after acknowledging the commit, recovery can replay the change and bring the pages back to a consistent state.

That means page-oriented does not literally imply "overwrite one row in place." PostgreSQL's MVCC model often writes a new tuple version and marks older versions dead, but those versions still live on heap pages and are managed by page-level structures, free space maps, vacuum, and page splits in indexes. The page is still the unit the engine must cache, dirty, flush, and eventually reclaim.
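
The versioning idea above can be sketched as a toy model. This is illustrative only, not PostgreSQL's actual tuple format: the dict fields, the `payout_42` key, and the transaction IDs are all invented for the example. The point is that an update appends a new version to the page and marks the old one dead, and space comes back only when a later vacuum pass runs.

```python
# Toy sketch of MVCC on a heap page: an update writes a new tuple
# version and marks the old one dead; vacuum reclaims dead versions
# later. Illustrative names only, not PostgreSQL internals.
page = []  # each tuple version is a dict: key, value, xmin, dead

def update(page, key, value, txid):
    for t in page:
        if t["key"] == key and not t["dead"]:
            t["dead"] = True  # old version marked dead, not erased in place
    page.append({"key": key, "value": value, "xmin": txid, "dead": False})

def vacuum(page):
    # Reclamation is a separate maintenance step, not part of the update.
    page[:] = [t for t in page if not t["dead"]]

update(page, "payout_42", "processing", txid=100)
update(page, "payout_42", "settled", txid=101)
print(len(page))   # 2 versions occupy the page before vacuum
vacuum(page)
print(len(page))   # 1 version remains after reclamation
```

This is why a heavy update burst can grow the physical footprint even though the logical row count is unchanged: the page accumulates dead versions until cleanup catches up.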

transaction commit
      |
      v
 append WAL record
      |
      v
update heap/index pages in memory
      |
      v
flush dirty pages later
      |
      v
replay WAL on crash if needed

This design pays part of the cost early so the read path stays comparatively short. If the working set is warm and the index layout is healthy, a point lookup or ordered range scan does not need to consult a long chain of immutable files. The trade-off is that hot keys, random writes, page fragmentation, and cleanup tasks such as vacuum and checkpoint flushing can push latency into the write path.
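
The write path in the diagram can be condensed into a minimal sketch, assuming a toy key-value store rather than PostgreSQL's real buffer manager. Class and field names (`PageStore`, `wal`, `dirty`) are invented for the example; real WAL recycling is tied to checkpoint progress rather than cleared wholesale. What the sketch preserves is the ordering guarantee: the journal record lands before any page changes, so a crash before the flush loses nothing.

```python
# Toy sketch of a page-oriented write path: the durable state is the
# page set, and the WAL exists to repair pages after a crash.
class PageStore:
    def __init__(self):
        self.wal = []     # append-only journal, written first
        self.pages = {}   # durable page set: page_id -> {key: value}
        self.dirty = {}   # in-memory page copies not yet flushed

    def commit(self, page_id, key, value):
        # 1. Append the WAL record before touching any page (write-ahead rule).
        self.wal.append((page_id, key, value))
        # 2. Update the in-memory copy of the page; commit is acknowledged
        #    now, even though the durable page is still stale.
        page = self.dirty.setdefault(page_id, dict(self.pages.get(page_id, {})))
        page[key] = value

    def checkpoint(self):
        # 3. Flush dirty pages later; only then can the WAL prefix be recycled.
        for pid, p in self.dirty.items():
            self.pages[pid] = dict(p)
        self.dirty.clear()
        self.wal.clear()

    def crash_and_recover(self):
        # A crash drops in-memory state; recovery replays the WAL into pages.
        self.dirty.clear()
        for page_id, key, value in self.wal:
            self.pages.setdefault(page_id, {})[key] = value
        self.wal.clear()

store = PageStore()
store.commit("heap_7", "payout_42", "processing")
store.commit("heap_7", "payout_42", "settled")
store.crash_and_recover()  # crash before any checkpoint: no committed data lost
print(store.pages["heap_7"]["payout_42"])  # -> settled
```

Note where the latency lives: the commit path is just an append plus a memory write, while the expensive page flushing is batched into checkpoints, which is exactly why checkpoint timing shows up in commit-latency spikes.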

Concept 2: Log-structured engines turn writes into append now and organization later

The search side of PayLedger has a different job. Support agents need to filter payouts by employer, bank, failure code, and settlement date across a large corpus, and the index is constantly ingesting fresh changes from the payout stream. For that pattern, a log-structured design is attractive because it converts many small writes into mostly sequential appends, then defers structural cleanup to background work.

Broadly, the engine accepts an update into an in-memory structure and durability log, flushes immutable files such as SSTables or Lucene segments, and later merges those files so reads do not have to consult an unbounded number of versions. The exact implementation differs: RocksDB uses leveled or tiered compaction over sorted tables, while Lucene builds immutable segments and periodically merges them. The shared idea is that the append path is cheap up front and the engine pays the organization cost later.

An update is therefore usually represented as a new version, not a direct overwrite of the old one. Deletes become tombstones. Secondary structures are rebuilt through flush and merge rather than page edits. Reads may check multiple files, bloom filters, or posting lists before they know which version is current.

write request
    |
    v
WAL + memtable / indexing buffer
    |
    v
flush immutable SSTable / segment
    |
    v
background compaction or merge
    |
    v
fewer files, lower lookup fan-out

This is why log-structured engines often excel at sustained ingest, bulk mutation, or event-heavy pipelines. They are converting random write pressure into append-friendly I/O. The bill shows up elsewhere: compaction can rewrite bytes multiple times, tombstones can hurt scans until merged away, and reads can become more expensive when too many files or segments overlap. The engine is not avoiding work. It is shifting work in time.
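
The same flow can be sketched for the log-structured side. This is a toy model, not RocksDB or Lucene: the two-entry memtable limit, the `payout_*` keys, and the single-file compaction are all simplifications chosen to make flush, tombstones, read fan-out, and merge visible in a few lines.

```python
# Toy sketch of a log-structured write path: append now, organize later.
TOMBSTONE = object()

class LSMStore:
    def __init__(self, memtable_limit=2):
        self.memtable = {}   # in-memory buffer (WAL-backed in real engines)
        self.sstables = []   # immutable sorted files, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)  # a delete is a tombstone write, not a page edit

    def flush(self):
        # Seal the memtable into an immutable sorted file.
        self.sstables.insert(0, dict(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        # Reads fan out: memtable first, then every file, newest wins.
        for source in [self.memtable] + self.sstables:
            if key in source:
                value = source[key]
                return None if value is TOMBSTONE else value
        return None

    def compact(self):
        # Background merge: keep only the newest version, drop tombstones.
        merged = {}
        for table in reversed(self.sstables):  # oldest first, newer overwrites
            merged.update(table)
        self.sstables = [{k: v for k, v in merged.items() if v is not TOMBSTONE}]

store = LSMStore()
store.put("payout_42", "processing")
store.put("payout_42", "settled")   # new version; old one still on disk
store.put("payout_43", "failed")    # triggers a flush
store.delete("payout_43")
store.flush()                       # second immutable file, holding a tombstone
store.compact()                     # two files collapse into one clean file
print(store.get("payout_42"), store.get("payout_43"), len(store.sstables))
```

Before `compact()`, a lookup for `payout_43` must consult two files before the tombstone settles the answer; after it, both the tombstone and the superseded version are gone. That is the deferred cost the lesson describes: the bytes are rewritten later so that reads stop paying the fan-out tax.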

Concept 3: Choosing an engine family means choosing where amplification and maintenance will live

By the time PayLedger reaches payroll cutoff, the difference becomes operationally visible. PostgreSQL pays for durable transactional behavior with WAL volume, dirty-page management, vacuum pressure, and index maintenance. Elasticsearch pays with refresh, merge scheduling, temporary disk growth, and query fan-out across segments until merges finish. Both systems are healthy only if the team budgets for the maintenance work implied by the storage layout.

The useful comparison is not "page-oriented for reads, log-structured for writes" because that slogan is too shallow. A better framing is to ask where each engine places its costs: page-oriented engines pay early in the write path (WAL, dirty pages, index maintenance) to keep point reads and range scans short, while log-structured engines pay late in background work (compaction, merges, tombstone cleanup) to keep ingest cheap.

For PayLedger, that leads to a coherent role assignment. PostgreSQL remains the system of record because the payout workflow needs strong transactional semantics and predictable point reads on authoritative state. Elasticsearch remains a derived store because its segment-oriented design is well suited to search-style indexing and replayable rebuilds from the committed source stream. If the team tried to push every mutable ledger decision into a heavily compacting log-structured store, it would need to manage different read-path and compaction risks. If it forced operational search into only a page-oriented transactional store, it would pay for filter-heavy text and faceted queries in a structure not optimized for that access pattern.
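
A rough back-of-envelope calculation makes the "relocated cost" concrete. Every number here is an assumption for illustration: the 100-byte row, the 8 KiB page, the two pages touched, and the five-level, ratio-10 LSM are stand-ins, and the leveled-compaction estimate (roughly one rewrite per level, each costing about the size ratio) is a common rule of thumb, not a measurement of any real system.

```python
# Back-of-envelope write-amplification estimates. Rules of thumb only;
# real engines vary widely with tuning, batching, and workload shape.
row_bytes = 100    # one payout status transition (assumed size)
page_bytes = 8192  # a typical B-tree/heap page size

# Page-oriented worst case: a tiny row update eventually dirties whole
# pages (heap page plus affected index leaf), plus its WAL record.
pages_touched = 2
btree_write_amp = (pages_touched * page_bytes + row_bytes) / row_bytes

# Log-structured with leveled compaction: each record is rewritten
# roughly once per level, at a cost near the level size ratio.
levels = 5
size_ratio = 10
lsm_write_amp = levels * size_ratio

print(f"B-tree worst-case write amplification ~ {btree_write_amp:.0f}x")
print(f"Leveled LSM write amplification ~ {lsm_write_amp}x")
```

The absolute numbers matter less than their shape: the page-oriented figure is a worst case that batching and checkpointing amortize down, and it is paid synchronously near the write; the LSM figure is paid asynchronously, hours later, as compaction I/O. Same business event, different place on the bill.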

The next lesson, Module Capstone: Data Model and Storage Baseline, asks you to make those trade-offs explicit across a full architecture. The follow-on deep dives into B-Tree Path and Maintenance Costs and LSM compaction will then unpack the two cost profiles in more detail.

Troubleshooting

Issue: PostgreSQL commit latency spikes during batch settlement even though CPU is not saturated.

Why it happens / is confusing: In a page-oriented engine, the critical path includes WAL pressure, dirty-buffer accumulation, checkpoint behavior, and index-page churn. The problem can look like "the database is randomly slow" when the real cause is that page maintenance work is bunching up under write bursts.

Clarification / Fix: Inspect WAL generation rate, checkpoint timing, buffer write activity, and table or index bloat. If the workload is truly write-heavy and append-dominant, separate that flow from the authoritative ledger instead of forcing every pattern through the same page-oriented path.

Issue: Search ingest looks healthy, but query latency and disk usage climb for hours afterward.

Why it happens / is confusing: A log-structured index can accept writes cheaply up front while compaction or segment merging falls behind. Users experience the deferred cost as slower reads, temporary space amplification, and bursty background I/O rather than obvious write failures.

Clarification / Fix: Check segment counts, merge backlog, tombstone accumulation, and disk headroom. If the system is always recovering from compaction debt, the ingest rate, retention policy, or shard sizing is out of line with the merge budget.

Issue: Deleting or updating many records does not reclaim space quickly in either engine family.

Why it happens / is confusing: Both families keep historical artifacts around for correctness and concurrency reasons. Page-oriented engines may need vacuum or page reuse cycles before dead tuples disappear from the physical footprint. Log-structured engines may keep tombstones and superseded versions until compaction rewrites the overlapping files.

Clarification / Fix: Treat space reclamation as a maintenance pipeline, not an instant side effect of DELETE or UPDATE. Monitor bloat and compaction debt explicitly instead of assuming the filesystem view will track logical row counts in real time.

Advanced Connections

Connection 1: Systems of Record vs Derived Data explains the authority boundary; storage engines explain the cost profile underneath it

The previous lesson decided that PostgreSQL is allowed to settle payout truth and Elasticsearch is allowed to project it. This lesson shows why that division is not arbitrary. The system of record benefits from page-oriented transactional locality, while the derived search layer benefits from log-structured indexing and rebuildability. Authority and storage mechanics are different axes, but production systems work best when they reinforce each other.

Connection 2: Recovery, caching, and maintenance policy all depend on the same physical model

Checkpoint tuning, buffer cache hit rate, and vacuum settings make sense only if you understand a page-oriented engine's dependence on mutable pages. Bloom filters, compaction strategy, segment merges, and read amplification make sense only if you understand a log-structured engine's dependence on immutable files plus deferred cleanup. The operational knobs are not arbitrary vendor trivia; they are consequences of the write path.

Key Insights

  1. The key distinction is where durable truth lives after a crash - Page-oriented engines recover mutable pages from a journal, while log-structured engines recover append-built files and then clean them up with merges or compaction.
  2. Write cost is never eliminated, only relocated - Page-oriented systems pay more during page maintenance and checkpoints; log-structured systems pay more during read fan-out, merge work, and compaction debt.
  3. Engine choice should follow workload role, not brand preference - An authoritative ledger, a search index, and a replayable projection may all store the same business entities while needing different physical write paths.