LSM Trees, SSTables, and Compaction Trade-offs


Database Engine Internals and Implementation

Lesson 004 · 30 min · Intermediate

Day 276: LSM Trees, SSTables, and Compaction Trade-offs

The core idea: an LSM tree makes writes cheap by delaying part of the maintenance work that a B-Tree pays immediately. That deferred work comes back later as compaction, read amplification, and operational tuning.


Today's "Aha!" Moment

The insight: A B-Tree keeps data organized during each write. An LSM tree accepts writes quickly first, then reorganizes data later in bigger background batches.

Why this matters: This is not a minor implementation detail. It changes where the system pays for ordering, deduplication, deletes, and space cleanup.

Concrete anchor: Imagine an event ingestion system receiving a continuous stream of updates. If every write had to maintain a large on-disk sorted structure in place, write throughput would suffer. An LSM design absorbs writes fast, then merges and cleans data in the background.

The practical sentence to remember:
LSM trees trade cheaper writes now for compaction work later.


Why This Matters

The problem: Many modern storage engines are built around write-heavy workloads: logs, metrics, key-value updates, event ingestion, user activity streams, and large distributed stores. These systems care a lot about write throughput, sequential I/O, and predictable ingestion under load.

Without this model:

  - Write-optimized engines look mysteriously fast in ingestion benchmarks, then mysteriously slow once reads, deletes, and background I/O enter the picture.
  - Compaction stalls, tombstone buildup, and disk-usage spikes look like bugs instead of predictable consequences of the design.

With this model:

  - You can predict where an LSM-based system will pay its costs: compaction I/O, read amplification, and temporary space overhead.
  - Tuning knobs like compaction strategy and write buffer size map onto a clear mental model instead of trial and error.

Operational payoff: Better engine choice, better tuning, and fewer surprises when a system that looked great in ingestion tests starts struggling in mixed read/write production traffic.


Learning Objectives

By the end of this lesson, you should be able to:

  1. Explain why LSM trees exist as a response to the cost of maintaining on-disk order during every write.
  2. Describe the write path and read path across memtables, SSTables, and immutable files.
  3. Reason about compaction trade-offs in terms of write amplification, read amplification, and space amplification.

Core Concepts Explained

Concept 1: Why LSM Trees Exist

Concrete example / mini-scenario: A telemetry platform ingests millions of small writes per second. The main requirement is to accept writes reliably and continuously, even when the working set is larger than memory.

Intuition: In a B-Tree, each write helps preserve the global sorted structure immediately. In an LSM design, writes first land in memory, then get flushed out as immutable sorted files. The system postpones some organization work so the write path stays cheaper.

How it works mechanically:

  1. New writes go into an in-memory structure, usually a sorted memtable.
  2. When that memtable fills, it is frozen and flushed to disk as an immutable sorted file called an SSTable.
  3. Over time, many SSTables accumulate, often across multiple levels or tiers.
  4. Background compaction merges them, removes obsolete versions, and restores a healthier read shape.
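
The four steps above can be sketched in a few lines. This is a minimal illustration, not any real engine's API; the names `LSMStore`, `SSTable`, and `MEMTABLE_LIMIT` are invented for the example, and the "disk" is just a Python list.

```python
MEMTABLE_LIMIT = 3  # tiny threshold so the flush step is easy to see

class SSTable:
    """An immutable, sorted run of key-value pairs, standing in for an on-disk file."""
    def __init__(self, items):
        self.items = sorted(items)  # frozen in sorted order at flush time

class LSMStore:
    def __init__(self):
        self.memtable = {}   # in-memory buffer absorbing new writes
        self.sstables = []   # flushed immutable runs, oldest first

    def put(self, key, value):
        self.memtable[key] = value            # step 1: the write lands in memory
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self._flush()                     # step 2: a full memtable is frozen

    def _flush(self):
        # the frozen memtable becomes an immutable sorted file (an SSTable)
        self.sstables.append(SSTable(self.memtable.items()))
        self.memtable = {}

store = LSMStore()
for i in range(7):
    store.put(f"k{i}", i)
# seven writes with a limit of three leave two SSTables plus one buffered key
```

Note that the foreground `put` never touches existing SSTables; steps 3 and 4 (accumulation and compaction) happen entirely outside the write path.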

Why this is attractive:

  - Writes become mostly sequential: appends to memory plus bulk flushes, which both disks and SSDs handle well.
  - The foreground write path stays short and predictable, even under heavy ingestion.
  - Immutable files simplify flushing, caching, and crash recovery.

The trade-off: You did not eliminate maintenance work. You moved it off the foreground write path and into later background merge work.

Mental model:
A B-Tree cleans the room after every new item.
An LSM tree puts items into labeled bins quickly, then schedules a later cleanup and reorganization session.


Concept 2: SSTables and the Read Path

Concrete example / mini-scenario: A user asks for the current value of a key after that key has been updated many times over the last hour.

Intuition: Because LSM trees accumulate immutable files over time, the read path may need to check several places before it knows which version is current.

How it works mechanically:

  1. The read first checks the memtable, which holds the newest writes.
  2. If the key is not there, it checks SSTables from newest to oldest and stops at the first hit, since newer files shadow older ones.
  3. Per-file indexes and Bloom filters let many SSTables be skipped without reading their contents.

Why immutability helps:

  - Flushed files are never modified in place, so they can be written sequentially, cached aggressively, and shared safely between concurrent readers.
  - Recovery is simpler: an immutable file is either completely present or not there at all.

Why reads get harder:

  - The current version of a key may live in any of several files, so a lookup can touch multiple places before it can answer.
  - Old versions and tombstones accumulate until compaction removes them, inflating the search space.

Practical consequence:
An LSM tree often wins when the workload is write-heavy, but mixed workloads must be tuned carefully so reads do not drown in too many files or too much stale history.
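
The newest-wins lookup order is the core of the read path. A minimal sketch, with SSTables modeled as plain dicts and all keys and values invented for illustration:

```python
# Read path sketch: check the memtable first, then SSTables newest to oldest.
memtable = {"k2": "v2-new"}           # most recent, not yet flushed
sstables = [                          # oldest first; each is an immutable run
    {"k1": "v1-old", "k2": "v2-old"},   # flushed an hour ago
    {"k1": "v1-new", "k3": "v3"},       # flushed recently
]

def get(key):
    if key in memtable:                 # the newest data lives in memory
        return memtable[key]
    for table in reversed(sstables):    # then the newest file back to the oldest
        if key in table:
            return table[key]           # first hit is the current version
    return None                         # the key was never written

assert get("k2") == "v2-new"   # the memtable shadows the stale SSTable copy
assert get("k1") == "v1-new"   # the newer SSTable shadows the older one
```

Every file that `get` has to probe before finding its answer is read amplification in miniature; real engines shorten the walk with Bloom filters and indexes rather than changing its order.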

Connection to the next lesson: If recent writes live in memory first, durability cannot depend on memory alone. That is why the write path is usually paired with a write-ahead log, which is the focus of the next lesson.


Concept 3: Compaction Is Deferred Maintenance

Concrete example / mini-scenario: Your system ingests data very quickly all morning. By afternoon, disk usage rises, background I/O increases, and read latency becomes erratic.

Intuition: Compaction is the bill you pay for earlier write efficiency.

What compaction does:

  - Merges overlapping SSTables into fewer, larger sorted files.
  - Keeps only the newest version of each key and drops data shadowed by tombstones.
  - Reclaims the disk space that obsolete versions were occupying.

Why compaction matters so much:

  - It is what bounds read amplification: without it, lookups would touch ever more files.
  - It is what reclaims space: deletes and overwrites only become free space when compaction runs.
  - It consumes real disk bandwidth and CPU, so it competes with foreground traffic.

The three amplifications to reason about:

  1. Write amplification: a logical write may be rewritten several times during compaction.
  2. Read amplification: a read may need to inspect several files or levels before finding the answer.
  3. Space amplification: disk usage may temporarily exceed live data size because old and new versions coexist.
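
A worked toy example makes the three ratios concrete. All numbers below are invented for illustration, not measured from any real engine:

```python
# Toy amplification arithmetic for one LSM workload snapshot.
logical_bytes_written = 100   # what the application asked to store
bytes_written_to_disk = 350   # the flush plus several compaction rewrites
files_checked_per_read = 4    # memtable plus three SSTable levels probed
live_data_bytes = 100         # current versions only
disk_bytes_used = 180         # live data plus not-yet-compacted history

write_amplification = bytes_written_to_disk / logical_bytes_written   # 3.5
read_amplification = files_checked_per_read                           # 4
space_amplification = disk_bytes_used / live_data_bytes               # 1.8
```

The tension is visible even in toy numbers: compacting more often shrinks the last two ratios but inflates the first, because every extra merge rewrites bytes that were already on disk.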

Important design tension:

Compacting aggressively keeps reads fast and space tight, but rewrites data more often and raises write amplification. Compacting lazily keeps write amplification low, but lets file counts and stale history grow, raising read and space amplification. No policy minimizes all three at once.

Common strategy families:

  - Leveled compaction: data is organized into levels with little overlap inside a level; it favors read and space amplification at the cost of more rewriting.
  - Tiered (size-tiered) compaction: similarly sized runs are merged once enough of them accumulate; it favors write throughput at the cost of more overlap for reads.

You do not need to memorize vendor defaults yet. What matters is seeing compaction as policy, not housekeeping trivia.
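
The merge-and-drop behavior that every compaction strategy shares can be sketched in a few lines. This is a simplification (real compaction streams sorted files instead of materializing a dict); `TOMBSTONE` and the runs are invented for the example:

```python
# Compaction sketch: merge sorted runs, keep the newest version of each
# key, and drop anything shadowed by a tombstone.
TOMBSTONE = object()  # marker written by a delete instead of removing data

def compact(runs):
    """Merge runs ordered oldest -> newest into one sorted run."""
    latest = {}
    for run in runs:                 # later runs overwrite earlier versions
        for key, value in run:
            latest[key] = value
    return sorted(                   # the output is a single sorted run
        (k, v) for k, v in latest.items() if v is not TOMBSTONE
    )

old_run = [("a", 1), ("b", 2), ("c", 3)]
new_run = [("a", 10), ("b", TOMBSTONE)]      # "a" updated, "b" deleted
assert compact([old_run, new_run]) == [("a", 10), ("c", 3)]
```

Notice that the delete of `"b"` only frees space here, inside `compact` - exactly the tombstone behavior discussed in Troubleshooting below.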


Troubleshooting

Issue: Writes benchmark well, but production read latency keeps drifting upward.

Why it happens: Too many SSTables, too much overlap, or too much stale history can increase read amplification.

Clarification / Fix: Check compaction health, file counts, per-level growth, Bloom filter effectiveness, and whether the workload now has more reads than the original design assumed.
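
Bloom filters are the main tool for keeping read amplification down: before opening an SSTable, the engine asks a small in-memory filter whether the key could be inside. A minimal sketch (the class, its sizes, and the SHA-256-based hashing are illustrative choices, not any engine's implementation):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: 'no' is definite, 'yes' means 'check the file'."""
    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                       # bit array packed into one int

    def _positions(self, key):
        # derive k bit positions per key from independent salted hashes
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
for key in ("k1", "k2", "k3"):
    bf.add(key)
assert bf.might_contain("k2")   # present keys always answer True
# absent keys usually answer False, letting the read skip this SSTable
```

If filter false-positive rates climb (for example, because filters were sized for a smaller key count), reads start opening files that do not contain the key, which shows up exactly as the drifting latency described in this issue.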

Issue: Deletes do not seem to free disk space.

Why it happens: In LSM systems, deletes usually become tombstones first. Space is reclaimed later by compaction, not at delete time.

Clarification / Fix: Look for compaction backlog, retention settings, or long-lived snapshots that prevent obsolete data from being dropped.

Issue: Throughput is fine until compaction starts, then latency spikes.

Why it happens: Background merge work competes with foreground reads and writes for disk bandwidth, CPU, and cache.

Clarification / Fix: Tune compaction concurrency, I/O throttling, level sizes, and write buffer settings. The real fix is often policy tuning, not just more hardware.


Advanced Connections

Connection 1: LSM Trees <-> B-Trees

The contrast: A B-Tree pays more structure-maintenance cost during writes so reads stay relatively direct. An LSM tree makes writes cheaper first, then pays later through compaction and more complicated reads.

Why this matters: Engine choice is not "modern vs old." It is about where you want to pay the cost.

Connection 2: LSM Trees <-> Log-Structured Systems

The parallel: Like logs and segment-based storage, LSM trees favor append-like writes and deferred cleanup. The cleanup phase is where old history is compacted away.

Why this matters: Once you see the pattern, storage engines, message logs, and even some cache designs start looking like variations of the same systems idea: cheap ingestion now, reconciliation later.



Key Insights

  1. LSM trees do not avoid maintenance - they delay and batch it.
  2. SSTables make writes and storage simpler in one dimension - but the read path and cleanup logic become more complex.
  3. Compaction is the real systems trade-off - it controls how you balance write cost, read cost, and space cost over time.

Previous: B-Tree Index Internals and Access Paths | Next: Write-Ahead Logging and Crash Recovery Mechanics
