LESSON
Day 400: Monthly Capstone: Design a Minimal Storage Engine
The core idea: A minimal storage engine is not a database with features removed at random. It is a narrow workload contract where page layout, indexing, WAL, checkpointing, and recovery all protect the same explicit invariants.
Today's "Aha!" Moment
In 15.md, Harbor Point's municipal-bond trading database moved a commit across the durable boundary only after the relevant WAL records were flushed. That solved one critical question: when is an acknowledged commit real? The capstone question is larger. If Harbor Point had to build a small storage engine for one production job, which pieces from the month are truly mandatory, and which ones can be postponed without lying about correctness?
Assume the desk needs a local engine called quotevault. It stores every quote correction for thinly traded bonds so operators can answer two questions under stress: "what is the latest committed state for quote Q12345?" and "show me every correction for CUSIP 9128285M8 in the last fifteen minutes." The workload is specific: bursty single-row upserts, frequent point reads, short range scans, one writer process, many readers, and a hard rule that acknowledged corrections must survive a power loss. Recovery has to finish in under a minute because the trading desk cannot wait for a manual rebuild at market open.
Once those constraints are explicit, the design stops being a vague "storage engine architecture" discussion. A heap file with slotted pages gives stable record identifiers. B-tree indexes give predictable point lookups and range scans. WAL plus group commit gives durable acknowledgements. Checkpoints keep recovery bounded. The important shift is that "minimal" does not mean "ignore failure cases." It means "promise only the capabilities you can defend mechanistically."
That also explains what this engine will not do. quotevault will support single-operation atomic writes, not full multi-statement transactions. It will allow one mutation thread, not arbitrary concurrent writers. Those are not omissions by accident. They are deliberate scope boundaries, and they lead directly into ../26/01.md, where the curriculum starts adding ACID and isolation on top of a durable storage substrate.
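To make that contract concrete before any code is written, it can help to capture it as data rather than prose. The sketch below is illustrative only (the field names and values are assumptions restating this lesson, not an interface quotevault must expose), but it forces the promised capabilities and the deferred ones onto the same page:

from dataclasses import dataclass

@dataclass(frozen=True)
class EngineContract:
    """Hypothetical capture of quotevault's workload contract."""
    point_read_key: str = "quote_id"
    range_scan_key: tuple = ("cusip", "event_ts")
    writers: int = 1                       # single mutation thread, many readers
    durability: str = "ack only after WAL flush (group commit)"
    recovery_budget_seconds: int = 60      # restart must finish before market open
    deferred: tuple = (
        "multi-statement transactions",
        "concurrent writers",
        "isolation beyond single-operation atomicity",
    )

QUOTEVAULT = EngineContract()

Anything not listed in such a contract is, by definition, out of scope for the minimal engine.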
Why This Matters
Many production teams end up building small storage engines without calling them that. An embedded metadata store, a workflow ledger inside a service, a local cache that must survive restarts, or an audit trail with indexed replay all force the same design questions. Which bytes are authoritative? Which accesses must be fast? When may a write be acknowledged? What has to happen after a crash before the service is trustworthy again?
Harbor Point currently has a weaker setup: append-only log files plus periodic snapshots. That is fine until the desk needs both durability and indexed reads during the same volatile morning. A raw log is excellent for sequential append, but point lookups turn into scans, snapshots lag behind committed state, and crash repair becomes a manual argument about which files were copied before the outage. That is exactly the kind of system that looks lightweight until the first stressful failure window.
Designing a minimal storage engine fixes that by turning the month's isolated mechanisms into one contract. 02.md and 03.md decide how rows fit on pages. 04.md decides how hot pages stay in memory. 07.md and 08.md give the read path its structure. 13.md, 14.md, and 15.md decide what the engine may claim after a crash. If those pieces are designed together, the desk gets predictable reads, honest commit semantics, and bounded restart behavior. If they are bolted together late, the engine becomes a pile of fast paths with no trustworthy failure story.
Learning Objectives
By the end of this session, you will be able to:
- Define a minimal storage-engine contract from a real workload - Translate read patterns, write patterns, and crash requirements into a narrow but defensible engine scope.
- Trace the chosen engine's write, read, and recovery paths - Explain how heap pages, B-tree indexes, WAL, group commit, and checkpoints cooperate in one coherent design.
- Explain which capabilities are intentionally deferred - Show why full transactions, multiple concurrent writers, and richer isolation belong in the next module rather than in this minimal engine.
Core Concepts Explained
Concept 1: Start from the workload, then pick the smallest engine shape that satisfies it
Harbor Point's quotevault workload is demanding but not unlimited. Most operations are quote corrections arriving from a single ingest service. Each correction replaces the latest state for one quote_id and carries a variable-length payload such as dealer notes and provenance metadata. Reads are mostly of two kinds: exact lookup by quote_id during investigations, and recent-range scans by (cusip, event_ts) for one bond during a fast market. Deletes are rare because the store is an audit trail, not a mutable order book.
That workload rules out several tempting designs. A flat append-only log is simple, but point lookups degrade into repeated scans unless a separate index is built anyway. An LSM tree would make the write path even more sequential, but it would also introduce compaction scheduling, stale SSTable overlap, and more complex secondary-index maintenance. For quotevault, Harbor Point values predictable read latency and a simple recovery story more than maximum ingest throughput. The dataset is measured in tens of gigabytes, not petabytes, so a B-tree-based engine is the better minimal choice.
The resulting physical design is intentionally small:
WAL file
-> ordered redo records for every committed mutation
Heap file (8 KiB slotted pages)
-> authoritative row storage
-> each row has a stable RID = (page_id, slot_id)
Primary B-tree on quote_id
-> quote_id -> RID
Secondary B-tree on (cusip, event_ts, quote_id)
-> recent range scans per bond
The heap page layout uses the machinery from 02.md and 03.md: a page header, a slot directory, and variable-length records packed from the end of the page backward. That choice matters because dealer notes are not fixed-size, and stable slot identifiers let indexes point to records without rewriting every reference when the page compacts locally. A minimal engine still needs this discipline; otherwise even simple updates turn into pointer chaos.
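As a minimal sketch of that discipline (the 8 KiB page size comes from the design above; the header and slot widths are assumptions for illustration), a slotted page can be modeled as a slot directory growing from the front and record bytes packed from the end:

PAGE_SIZE = 8192      # 8 KiB, as in the quotevault design
HEADER_SIZE = 16      # assumed: page_lsn, slot count, free-space pointer, etc.
SLOT_SIZE = 4         # assumed: 2-byte offset + 2-byte length per slot entry

class SlottedPage:
    """Minimal sketch: slot directory at the front, records packed backward."""
    def __init__(self, page_id):
        self.page_id = page_id
        self.data = bytearray(PAGE_SIZE)
        self.slots = []                 # slot_id -> (offset, length)
        self.free_end = PAGE_SIZE       # record bytes are packed backward from here

    def free_space(self):
        used_front = HEADER_SIZE + SLOT_SIZE * len(self.slots)
        return self.free_end - used_front

    def insert(self, record: bytes):
        if len(record) + SLOT_SIZE > self.free_space():
            return None                             # page full; caller picks another page
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append((self.free_end, len(record)))
        return (self.page_id, len(self.slots) - 1)  # stable RID = (page_id, slot_id)

    def read(self, slot_id: int) -> bytes:
        offset, length = self.slots[slot_id]
        return bytes(self.data[offset:offset + length])

Because the indexes store the RID rather than a raw byte offset, the page can compact its record area locally without invalidating any index entry.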
The trade-off is clear. Harbor Point avoids LSM compaction debt and gets stronger read locality for hot index paths, but it accepts B-tree page splits, random dirty-page writes, and the need to maintain two indexes on every mutation. Minimality here is not "fewest files." It is "fewest mechanisms that still satisfy the actual query contract."
Concept 2: The write path must treat durability and visibility as one boundary
Because quotevault has one mutation thread, Harbor Point can simplify the write path aggressively without pretending recovery is optional. Each incoming correction becomes a mini-transaction: update the heap row, update the two B-tree indexes, append corresponding WAL records, write a commit record, and acknowledge success only after the commit LSN becomes durable. Multiple client requests may still share one flush through group commit, but the engine never exposes a correction as committed before that flush boundary moves.
The write path looks like this:
client correction
-> writer thread builds new heap image in buffer pool
-> update primary and secondary B-tree pages in buffer pool
-> append redo records to WAL buffer
-> assign commit_lsn
-> group commit flush makes durable_lsn >= commit_lsn
-> acknowledge correction
This is the direct continuation of 15.md. The desk does not care whether the heap page reached disk at acknowledgement time. It cares whether the correction will survive a crash. WAL plus group commit answers that question honestly. Checkpoints and background page flush handle the slower task of moving dirty heap and index pages to their long-term files.
The key design simplification is that background flush is allowed to write only pages whose page_lsn is at or below the durable boundary. In other words, the engine never writes uncommitted page images to disk. That means recovery can stay redo-only for this minimal scope: after a crash, the engine ignores any trailing WAL without a durable commit record and reapplies committed records whose LSNs are newer than the page image on disk. Harbor Point gives up concurrent writers and long-running write transactions to earn that simplicity.
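A compressed sketch of that boundary looks roughly like the following (the class and function names are illustrative, and the flush is a stand-in for a real write-plus-fsync on the log file): a correction is acknowledged only once a flush, possibly shared with other pending corrections, has carried durable_lsn past its commit_lsn, and the background flusher refuses any page whose page_lsn is still ahead of the durable boundary.

import threading

class WriteAheadLog:
    """Minimal sketch of WAL append plus group commit for a single-writer engine."""
    def __init__(self):
        self.lock = threading.Lock()
        self.flushed = threading.Condition(self.lock)
        self.buffer = []            # redo records not yet on disk
        self.next_lsn = 1
        self.durable_lsn = 0

    def append(self, record) -> int:
        with self.lock:
            lsn = self.next_lsn
            self.next_lsn += 1
            self.buffer.append((lsn, record))
            return lsn

    def commit(self, commit_lsn: int) -> None:
        with self.flushed:
            if self.durable_lsn < commit_lsn:
                self._flush_locked()
            # only now may the caller acknowledge the correction

    def _flush_locked(self) -> None:
        if self.buffer:
            # stand-in for write() + fsync(); one flush can cover many pending
            # corrections, which is the group-commit effect
            self.durable_lsn = self.buffer[-1][0]
            self.buffer.clear()
        self.flushed.notify_all()

def can_flush_page(page_lsn: int, wal: WriteAheadLog) -> bool:
    # background flush rule: never write a page image ahead of the durable boundary
    return page_lsn <= wal.durable_lsn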
Operationally, that is a real trade-off, not a classroom shortcut. One writer thread keeps latch design simple and makes durability bugs easier to reason about. It also caps write throughput and means a slow correction can hold up the whole mutation stream. For a narrow audit store inside one service, that is acceptable. For a general-purpose DBMS with many overlapping writers, it would not be.
Concept 3: Checkpointing, recovery, and observability decide whether the design is publishable
The engine becomes production-ready only when Harbor Point can restart it after a crash and explain exactly what it will do. quotevault therefore takes a fuzzy checkpoint every few seconds. The checkpoint records the redo start LSN and the list of dirty pages that had not yet reached disk. On restart, recovery finds the latest completed checkpoint, scans WAL forward from that redo point, and reapplies committed heap and index changes whenever the on-disk page LSN is older than the record being replayed.
That recovery story is intentionally narrower than ARIES in 14.md. There is no general undo phase because the engine never flushes uncommitted page state. There is no MVCC visibility check because each key has only one committed current version. There is no lock table reconstruction because the writer thread serialized mutations in the first place. The restart sequence is therefore short enough to explain on one whiteboard, which is exactly what you want from a minimal engine.
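Under those simplifications, restart can be sketched as a single redo pass. The helpers below (checkpoint, wal_records, pages, and their attributes) are assumed for illustration, not quotevault's real interface:

def recover(checkpoint, wal_records, pages):
    """Redo-only recovery sketch.

    checkpoint supplies redo_lsn; wal_records is the in-order list of records
    read from the log; pages maps page_id -> page object with a page_lsn."""
    # Pass 1: collect commit records so trailing, unacknowledged work is ignored.
    committed = {rec.txn_id for rec in wal_records if rec.kind == "commit"}

    # Pass 2: reapply committed heap and index changes from the redo point forward.
    for rec in wal_records:
        if rec.kind == "commit" or rec.lsn < checkpoint.redo_lsn:
            continue
        if rec.txn_id not in committed:
            continue                      # trailing WAL without a durable commit record
        page = pages[rec.page_id]
        if page.page_lsn < rec.lsn:       # on-disk image is older than this record
            page.apply(rec)               # idempotent redo of the logged change
            page.page_lsn = rec.lsn
    # No undo phase: uncommitted page images were never flushed to disk.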
The design also needs a short list of metrics that prove its assumptions are still true:
- durable commit latency and group size, to confirm 15.md's flush sharing is working
- dirty-page backlog, to catch checkpoints that are falling behind
- B-tree split rate on both indexes, to detect page layout pressure
- buffer-pool hit ratio for upper B-tree levels and hot heap pages, to confirm the read path stays memory-resident
- recovery replay volume since the last checkpoint, to bound restart time before an outage tests it for real
Those metrics matter because storage engines usually fail first by violating one of their own hidden assumptions. If the split rate spikes, the supposedly stable index shape is no longer stable. If replay volume grows without bound, the checkpoint policy is lying about restart time. If point lookups start missing cache residency, the chosen B-tree design may still be correct but is no longer cheap enough.
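One lightweight way to keep those assumptions honest is a periodic self-check that turns each metric into an alarm. The metric names and thresholds below are illustrative, not prescribed by this lesson:

def health_check(metrics: dict) -> list[str]:
    """Sketch: flag any engine assumption the observed metrics no longer support."""
    alarms = []
    if metrics["group_commit_p99_ms"] > 10:            # durable commit latency drifting
        alarms.append("commit latency: check flush contention and group size")
    if metrics["dirty_page_backlog"] > 50_000:          # checkpoints falling behind
        alarms.append("checkpoint backlog growing")
    if metrics["btree_splits_per_min"] > 1_000:         # page layout pressure
        alarms.append("index shape no longer stable")
    if metrics["upper_level_hit_ratio"] < 0.99:         # read path leaving memory
        alarms.append("hot index levels missing the buffer pool")
    if metrics["wal_bytes_since_checkpoint"] > 512 * 2**20:
        alarms.append("replay volume threatens the one-minute recovery budget")
    return alarms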
This is also where the boundary with the next module becomes explicit. quotevault can promise durable single-operation updates and reliable indexed reads on one node. It cannot promise serializable multi-statement transactions, deadlock handling, or reader-writer isolation anomalies, because those require a transaction manager layered over the storage engine. That is the handoff into ../26/01.md: first make the bytes durable and recoverable, then decide how multiple transactions are allowed to observe and modify them.
Troubleshooting
Issue: After a crash, the heap contains the latest quote correction but the secondary (cusip, event_ts, quote_id) index misses it during range scans.
Why it happens / is confusing: The team treated heap persistence and index persistence as separate concerns, but recovery correctness depends on replaying both as one committed mini-transaction. A crash that lands between those writes exposes any mismatch immediately.
Clarification / Fix: Log heap and index page changes under the same commit boundary, and replay them from WAL together. Never assume an index can be "mostly current" in a store that promises crash-safe query answers.
Issue: Commit latency becomes erratic during the market open even though the WAL device is normally fast.
Why it happens / is confusing: The expensive step is often the flush barrier, not the number of log bytes. Checkpoint writeback, index-page churn, or OS cache pressure can contend for the same storage path and make group commit look unreliable.
Clarification / Fix: Track durable-latency percentiles alongside checkpoint volume and dirty-page backlog. If they rise together, smooth checkpoint I/O or separate WAL and data-page traffic before blaming the commit algorithm itself.
Issue: The engine is correct, but one slow write blocks every other incoming correction.
Why it happens / is confusing: That is the cost of the single-writer simplification. Correctness is easier to reason about because mutations are serialized, but throughput and tail latency now depend on the longest operation in the queue.
Clarification / Fix: Keep the mutation path narrow, reject expensive per-write work, and be honest about the scale limit. If Harbor Point needs many concurrent writers, the right fix is not a larger queue. It is to move into the transaction and concurrency-control machinery introduced next month.
Advanced Connections
Connection 1: Minimal storage-engine design ↔ SQLite's pager and B-tree layering
SQLite is useful here because it shows how far a small engine can go when the boundaries stay disciplined. A pager layer decides which pages are authoritative and durable, while B-tree structures provide logical access paths on top. quotevault is not SQLite, but the architectural lesson is similar: a storage engine becomes understandable when the page manager, WAL contract, and index structures are clearly separated.
Connection 2: Durable storage substrate ↔ transaction management
Month 25 focused on where bytes live, when they become durable, and how the engine recovers them. Month 26 starts asking a different question: if two transactions overlap, which versions are visible, which schedules are legal, and how do locks or timestamps preserve those rules? The second layer depends completely on the first. Isolation without a trustworthy storage substrate is just a more complicated way to lose data.
Resources
Optional Deepening Resources
- [DOC] SQLite Architecture
- Focus: See how a compact production engine separates pager, B-tree, and recovery responsibilities.
- [DOC] PostgreSQL Documentation: Database Page Layout
- Focus: Compare the slotted-page design in this lesson with a mature heap-page implementation.
- [DOC] PostgreSQL Documentation: Write Ahead Logging (WAL)
- Focus: Relate commit, checkpoint, and recovery boundaries in this capstone to a production database.
- [PAPER] ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging
- Focus: See what extra machinery appears once you move beyond the simplified redo-only recovery boundary used in quotevault.
Key Insights
- Minimal means narrow scope, not weak correctness - A small engine is credible only when its invariants are explicit and its omitted features are intentional.
- The write path defines the truth boundary - Heap updates, index maintenance, WAL, and acknowledgement must share one durability contract or the engine will lie after a crash.
- Recovery is part of the architecture, not an appendix - A storage design is only complete when restart steps, checkpoint policy, and replay cost are all predictable.