LESSON
Day 393: LSM Trees: Memtables, SSTables, and Flush
The core idea: An LSM tree keeps the foreground write path short by admitting updates into a sorted in-memory memtable and deferring on-disk reordering to flush and compaction, so the critical correctness boundary is the handoff from WAL to memtable to immutable SSTable.
Today's "Aha!" Moment
In 08.md, Harbor Point's municipal-bond quote index paid for ordering immediately. Every insert that hit a full B-tree leaf risked a split, parent updates, and more latch pressure on the hot range. That works well when the main job is point lookup on a stable index, but Harbor Point also runs an audit store that must ingest every quote revision from every venue during market shocks. On a Federal Reserve announcement, that stream is dominated by writes, and the last thing that workload can afford is root-to-leaf page surgery on every insert.
The LSM tree changes when sorting work happens. A new quote is first appended to the write-ahead log and inserted into a memtable, which is already ordered in memory. When that memtable fills, the engine freezes it, creates a fresh mutable memtable for new writes, and flushes the frozen snapshot into an immutable SSTable on disk. Each individual structure is sorted, but the store as a whole is only partially merged until later compaction.
That is the mental shift that matters: an LSM tree does not eliminate maintenance; it reschedules maintenance. The foreground write path becomes cheap because it avoids rewriting deep page structure, while background flush and later compaction take on the cost of restoring durable sorted state. The next lesson on 10.md matters because once you accept that deferral, the central operational question becomes how much deferred work is accumulating.
A common misconception is that LSM writes are "just append-only." They are append-friendly on the critical path, but they still depend on ordered in-memory state, version visibility rules, immutable sorted files, manifest updates, and crash-safe installation of new files. If any of those handoffs are wrong, fast writes merely produce durable corruption faster.
Why This Matters
Harbor Point uses its audit store for two jobs that do not tolerate guesswork. Compliance staff must be able to replay the exact quote history for a bond after the market closes, and production services must keep accepting bursts of quote updates without turning every spike into a page-split storm. An LSM tree is attractive here because the cost of "accept this write now" is mostly a log append plus an in-memory insert, not a structural rewrite of several on-disk pages.
That operational benefit is easy to misread if you treat the engine as a black box. Seeing many small SSTables on disk can look like fragmentation when it is actually normal flush behavior. Seeing acknowledged writes survive a crash even though they never reached an SSTable can look mysterious until you remember WAL replay. Seeing read latency climb after an ingest burst can look like random I/O trouble when the real issue is that flushes created more overlapping files than the read path can cheaply consult.
Once you understand memtables, SSTables, and flush, those symptoms become mechanically legible. You can tell whether the system is healthy but busy, briefly behind on background work, or approaching a real stall condition.
Learning Objectives
By the end of this session, you will be able to:
- Explain the LSM admission path - Trace how a write moves through the WAL and mutable memtable before it is acknowledged to the client.
- Describe flush mechanics precisely - Explain how a full memtable becomes an immutable SSTable without blocking all new writes.
- Reason about production trade-offs - Connect flush backlog, read amplification, recovery behavior, and file growth to the design choices in an LSM engine.
Core Concepts Explained
Concept 1: The WAL and memtable form the write admission contract
When Harbor Point receives a new quote update for CUSIP 13042AB7, the storage engine does not start by searching for the correct disk page to rewrite. Instead, it assigns the update a sequence number, appends a record to the write-ahead log, and inserts the entry into the current mutable memtable. The memtable is usually implemented as an ordered in-memory structure such as a skip list or balanced tree, so new entries can be inserted quickly while still being readable in key order.
The important invariant is not "the value exists somewhere in memory." It is that every acknowledged write has crossed a durability boundary and a visibility boundary. The durability boundary is the WAL append, because replaying the log after a crash can reconstruct the memtable state that had not yet reached disk. The visibility boundary is the memtable insert, because reads must be able to see the newest version before any flush happens. Many engines therefore treat the write as committed only after both steps are complete according to the configured sync policy.
For Harbor Point, the sequence is intentionally short:
client write
     |
     v
append WAL record ---------------> durable recovery source
     |
     v
insert into mutable memtable ----> newest visible version
     |
     v
acknowledge write
This is why the foreground write path can stay smooth under bursty ingest. No parent pages are split, and no existing disk blocks are rewritten just to keep the key space ordered immediately. The trade-off is that the engine is accumulating future work. A larger memtable reduces how often that work is triggered, but it also increases peak flush time and the amount of WAL that may need to be replayed after a crash.
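To make the handoff concrete, here is a minimal sketch of that admission path in Python. Everything in it is illustrative: QuoteStore, the record framing, and the fsync-on-every-write policy are assumptions made for the sketch, the dict stands in for a skip list, and FLUSH_THRESHOLD_BYTES stands in for whatever size limit a real engine configures.

import os

FLUSH_THRESHOLD_BYTES = 64 * 1024 * 1024  # illustrative memtable size limit, not a real default

class QuoteStore:
    def __init__(self, wal_path):
        self.wal = open(wal_path, "ab")  # append-only write-ahead log
        self.memtable = {}               # key -> (seq, value); stand-in for a skip list
        self.memtable_bytes = 0
        self.next_seq = 1

    def put(self, key: bytes, value: bytes) -> int:
        seq = self.next_seq
        self.next_seq += 1

        # 1. Durability boundary: append the record to the WAL and sync it
        #    (here, fsync on every write; real engines make this policy configurable).
        record = seq.to_bytes(8, "big") + len(key).to_bytes(4, "big") + key + value
        self.wal.write(len(record).to_bytes(4, "big") + record)
        self.wal.flush()
        os.fsync(self.wal.fileno())

        # 2. Visibility boundary: insert into the mutable memtable so reads can
        #    see the newest version before any flush happens.
        self.memtable[key] = (seq, value)
        self.memtable_bytes += len(key) + len(value)

        # 3. Only now is the write acknowledged. A full memtable schedules
        #    deferred background work; it does not block this foreground path.
        if self.memtable_bytes >= FLUSH_THRESHOLD_BYTES:
            self.schedule_flush()
        return seq

    def schedule_flush(self):
        # Freeze the current memtable, start a fresh mutable one, and hand the
        # frozen snapshot to a background flush worker (Concept 2).
        pass

The only thing to take from the sketch is the ordering: durable log append first, visible memtable insert second, acknowledgment last, and a flush that is scheduled rather than performed inline.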
Concept 2: Flushing freezes one sorted snapshot and installs it as an SSTable
Eventually Harbor Point's mutable memtable reaches its configured size limit. At that point the engine does not stop accepting writes while it dumps everything synchronously to disk. It marks the current memtable immutable, allocates a fresh mutable memtable, and lets new writes continue on the new structure while a background task drains the frozen one.
Flush is possible because the immutable memtable is already sorted. The background worker can iterate it in key order and write an SSTable sequentially: data blocks first, then the block index and metadata, and finally the footer that lets the engine reopen the file later. The result is an immutable sorted-string table: a file whose entries are ordered by internal key and whose layout is optimized for sequential creation and block-based reads rather than in-place edits.
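As a sketch of that sequential layout, here is a deliberately simplified file format: one data section of length-prefixed entries, a JSON key-to-offset index, and a footer that records where the index starts. No real engine encodes its blocks this way; only the write pattern is the point.

import json
import os
import struct

def write_sstable(path: str, frozen_memtable: dict) -> None:
    """Write a frozen memtable (key -> (seq, value)) as an immutable sorted file."""
    index = {}  # key (hex) -> byte offset of its entry in the data section
    with open(path, "wb") as f:
        # Data blocks: entries emitted strictly in key order, written sequentially.
        for key in sorted(frozen_memtable):
            seq, value = frozen_memtable[key]
            index[key.hex()] = f.tell()
            f.write(struct.pack(">IQI", len(key), seq, len(value)))
            f.write(key + value)

        # Block index: lets a reader jump to a key's entry without scanning the data.
        index_offset = f.tell()
        f.write(json.dumps(index).encode())

        # Footer: a fixed-size trailer recording where the index starts, so the
        # engine can reopen the file later.
        f.write(struct.pack(">Q", index_offset))

        # Make the file durable before anything is allowed to reference it.
        f.flush()
        os.fsync(f.fileno())

Note the shape of the work: one forward pass over already-sorted data, no seeks back into earlier blocks, and a file that is never edited in place afterwards.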
One more correctness step matters before the file becomes part of the database state. The engine must make the SSTable durable and then install metadata that says, in effect, "this file now belongs to the current version of the store." In LevelDB- and RocksDB-style systems, that metadata lives in a manifest or version set. Only after the new file is durably referenced can the old WAL segment for that memtable be considered reclaimable. If the process crashes before the manifest update, recovery should ignore the half-installed file and rebuild state from the WAL instead.
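A sketch of that installation ordering, using a hypothetical append-only manifest file; only the sequence matters here, durable data file first, durable manifest record second, WAL reclamation last.

import os

def install_sstable(manifest_path: str, sstable_path: str, wal_segment_path: str) -> None:
    # 1. The SSTable itself is already durable (written and fsynced by the flush worker).

    # 2. Durably record that the new file belongs to the current version of the store.
    with open(manifest_path, "ab") as manifest:
        manifest.write(f"ADD {os.path.basename(sstable_path)}\n".encode())
        manifest.flush()
        os.fsync(manifest.fileno())

    # 3. Only now is the WAL segment that covered this memtable reclaimable. A crash
    #    before step 2 leaves an orphan file that recovery ignores; the same data is
    #    rebuilt from this log instead.
    os.remove(wal_segment_path)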
That ordering is what makes flush more than a background serialization job. It is a state transition from "this data exists as log plus memory" to "this data exists as a durable sorted run on disk." Larger flushes mean fewer files and better write efficiency, but they also create longer background tasks, more bytes to checkpoint into one file, and more risk of temporary stall if background bandwidth is insufficient.
Concept 3: Flush makes writes cheap now by making reads and background work harder later
Suppose Harbor Point asks for the latest quote history for 13042AB7 just after a burst. The newest version may still be in the mutable memtable. A slightly older version may sit in an immutable memtable waiting to flush. Older versions may already be in several SSTables created by previous flushes. The read path therefore has to search these structures in recency order and stop when it finds the newest visible version or tombstone for the key.
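A sketch of that recency-ordered lookup, with each structure modeled as a plain mapping for brevity (a real SSTable lookup would go through its block index) and a TOMBSTONE sentinel standing in for a delete marker.

TOMBSTONE = object()  # sentinel standing in for a delete marker

def get(key, memtable, immutable_memtables, sstables):
    """Return the newest visible value for key, or None if it is absent or deleted."""
    # Consult structures strictly newest-first and stop at the first version found.
    for source in (memtable, *immutable_memtables, *sstables):
        if key in source:
            _seq, value = source[key]
            return None if value is TOMBSTONE else value
    return None  # no structure holds a version of this key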
This is the first major LSM trade-off. Each memtable and each SSTable is internally sorted, but the database is not yet globally consolidated after every write. Flush creates another sorted run; it does not eliminate overlap between runs. That deferred merge strategy is exactly why writes stay cheap, and it is exactly why read amplification appears if too many recent SSTables accumulate. The next lesson on 10.md takes that deferred-work problem head-on through compaction strategy.
The same trade-off shows up in operations. If Harbor Point's flush thread keeps pace, the engine absorbs bursts gracefully and the write path remains short. If flush falls behind, immutable memtables pile up, level-0 files multiply, and the engine may throttle or stall foreground writes to keep the system recoverable. None of that is accidental behavior. An LSM engine is explicitly choosing to protect write throughput most of the time by accepting a more complex read path and a background-maintenance control problem.
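A sketch of the kind of backpressure check that turns flush debt into throttling or a stall; the names and thresholds here are illustrative assumptions, not any engine's defaults.

MAX_IMMUTABLE_MEMTABLES = 4   # illustrative limits, not real defaults
L0_SLOWDOWN_FILES = 8
L0_STOP_FILES = 16

def admission_decision(num_immutable_memtables: int, num_l0_files: int) -> str:
    """Decide whether a foreground write proceeds, is throttled, or must wait."""
    if num_immutable_memtables >= MAX_IMMUTABLE_MEMTABLES or num_l0_files >= L0_STOP_FILES:
        return "stall"      # stop admitting writes until flush and compaction catch up
    if num_l0_files >= L0_SLOWDOWN_FILES:
        return "throttle"   # slow writes down so background work can drain the backlog
    return "proceed"        # deferred work is within budget; keep the write path short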
Seen from that angle, memtables and SSTables are not just data structures. They are a contract about where order exists right now, where recovery will rebuild it if the process dies, and where operational debt is being stored until the background pipeline catches up.
Troubleshooting
Issue: Harbor Point sees bursty write latency even though memtable inserts are cheap and CPU usage is moderate.
Why it happens / is confusing: The foreground path is usually fast, so teams assume the problem cannot be inside the storage engine. In practice, stalls often come from the background side: too many immutable memtables, slow WAL syncs, or too many freshly flushed level-0 SSTables forcing write throttling.
Clarification / Fix: Inspect flush queue depth, pending bytes, WAL fsync latency, and write-stall counters together. A short write path is only sustainable when the background pipeline keeps consuming the work it creates.
Issue: After a crash, recently acknowledged quotes reappear even though no SSTable containing them had been flushed before the restart.
Why it happens / is confusing: Teams sometimes assume "durable" and "already present in an SSTable" mean the same thing. In an LSM engine, those are different milestones.
Clarification / Fix: Verify recovery by replaying the WAL into a rebuilt memtable. The flush boundary controls on-disk file creation, but the WAL controls whether acknowledged writes survive process failure.
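A sketch of that replay check, assuming the same illustrative WAL record framing used in the admission sketch above (a length prefix, then sequence number, key length, key, and value).

def replay_wal(wal_path: str) -> dict:
    """Rebuild memtable state from acknowledged writes that never reached an SSTable."""
    memtable = {}
    with open(wal_path, "rb") as wal:
        while True:
            header = wal.read(4)
            if len(header) < 4:
                break  # clean end of log (or a torn final length prefix to discard)
            record = wal.read(int.from_bytes(header, "big"))
            seq = int.from_bytes(record[:8], "big")
            key_len = int.from_bytes(record[8:12], "big")
            key = record[12:12 + key_len]
            value = record[12 + key_len:]
            memtable[key] = (seq, value)  # later records overwrite earlier versions of a key
    return memtable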
Issue: Read latency rises after an ingest spike, especially for keys that were updated repeatedly.
Why it happens / is confusing: The write path looked healthy during the spike, so it is tempting to blame storage hardware later. Often the real cause is that the read path must consult more mutable state and more recently flushed SSTables before it finds the newest version.
Clarification / Fix: Measure memtable hit rate, number of immutable memtables, and level-0 file count. The symptom is usually deferred maintenance surfacing as read amplification, not random slowdown.
Advanced Connections
Connection 1: LSM flush ↔ crash recovery protocol
An LSM flush plays the same role for sorted runs that page flushing plays in a page-oriented engine: it moves data from a transient in-memory form into a durable on-disk form. The recovery difference is that the unit of installation is an immutable file plus manifest metadata, not a rewritten page image. Systems like LevelDB, RocksDB, Bigtable tablets, and Cassandra's SSTable-based storage all depend on that log-plus-install ordering.
Connection 2: LSM flush ↔ B-tree structural maintenance
Harbor Point's comparison between 08.md and this lesson is the core design fork. B-trees preserve global search order immediately and therefore pay maintenance on the foreground write path through page splits and merges. LSM trees preserve only local sortedness immediately and postpone global cleanup to flush and compaction. The same product requirement, "store every quote update durably," leads to very different latency profiles depending on where that maintenance cost is scheduled.
Resources
Optional Deepening Resources
- [PAPER] Bigtable: A Distributed Storage System for Structured Data
- Focus: Read the sections on memtables, SSTables, and minor compaction to see the classic production shape of the LSM write path.
- [DOC] LevelDB Implementation Notes
- Focus: Follow the concrete sequence from log write to memtable, immutable memtable, and version-set installation in a compact engine implementation.
- [DOC] RocksDB Wiki: MemTable
- Focus: Compare memtable implementations and notice how in-memory ordering choices affect insert cost, memory usage, and scan behavior.
- [DOC] RocksDB Wiki: Write Stalls
- Focus: Connect flush debt and level-0 pressure to the real production symptoms that appear when background work stops keeping up.
Key Insights
- The first durable home of an LSM write is the WAL, not the SSTable - Flush changes the physical representation later, but recovery correctness starts at log append plus memtable visibility.
- A memtable is already an ordered index, not a temporary bag of writes - The engine buys fast admission by keeping order in memory first and on disk second.
- Flush is where deferred maintenance becomes durable state - It preserves write throughput in the short term while creating the read and compaction costs that the next lesson must manage.