LESSON
Day 276: LSM Trees, SSTables, and Compaction Trade-offs
The core idea: an LSM tree makes writes cheap by delaying part of the maintenance work that a B-Tree pays immediately. That deferred work comes back later as compaction, read amplification, and operational tuning.
Today's "Aha!" Moment
The insight: A B-Tree keeps data organized during each write. An LSM tree accepts writes quickly first, then reorganizes data later in bigger background batches.
Why this matters: This is not a minor implementation detail. It changes where the system pays for ordering, deduplication, deletes, and space cleanup.
Concrete anchor: Imagine an event ingestion system receiving a continuous stream of updates. If every write had to maintain a large on-disk sorted structure in place, write throughput would suffer. An LSM design absorbs writes fast, then merges and cleans data in the background.
The practical sentence to remember:
LSM trees trade cheaper writes now for compaction work later.
Why This Matters
The problem: Many modern storage engines are built around write-heavy workloads: logs, metrics, key-value updates, event ingestion, user activity streams, and large distributed stores. These systems care a lot about write throughput, sequential I/O, and predictable ingestion under load.
Without this model:
- Teams compare engines using only benchmark headlines.
- Fast writes look "free" until read latency and compaction debt appear.
- Deletes and updates seem simple until tombstones and stale versions accumulate.
With this model:
- You can explain why RocksDB-, LevelDB-, or Cassandra-style systems behave the way they do.
- You can predict when write-heavy workloads fit LSM well.
- You can recognize when compaction becomes the real bottleneck.
Operational payoff: Better engine choice, better tuning, and fewer surprises when a system that looked great in ingestion tests starts struggling in mixed read/write production traffic.
Learning Objectives
By the end of this lesson, you should be able to:
- Explain why LSM trees exist as a response to the cost of maintaining on-disk order during every write.
- Describe the write path and read path across memtables, immutable memtables awaiting flush, and on-disk SSTables.
- Reason about compaction trade-offs in terms of write amplification, read amplification, and space amplification.
Core Concepts Explained
Concept 1: Why LSM Trees Exist
Concrete example / mini-scenario: A telemetry platform ingests millions of small writes per second. The main requirement is to accept writes reliably and continuously, even when the working set is larger than memory.
Intuition: In a B-Tree, each write helps preserve the global sorted structure immediately. In an LSM design, writes first land in memory, then get flushed out as immutable sorted files. The system postpones some organization work so the write path stays cheaper.
How it works mechanically (a minimal sketch follows this list):
- New writes go into an in-memory structure called a memtable, typically kept in sorted order.
- When the memtable fills, it is frozen and flushed to disk as an immutable sorted file called an SSTable (Sorted String Table).
- Over time, many SSTables accumulate, often across multiple levels or tiers.
- Background compaction merges them, removes obsolete versions, and restores a healthier read shape.
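To make the flush mechanics concrete, here is a minimal Python sketch of the write path. Everything in it is illustrative: the dict-based memtable, the FLUSH_THRESHOLD constant, and the in-memory list standing in for on-disk files are assumptions of this toy model, not any real engine's design.

```python
# Toy write path: the memtable absorbs writes, then flushes to immutable
# sorted runs. All names here (FLUSH_THRESHOLD, ToyLSM) are illustrative.

FLUSH_THRESHOLD = 4  # tiny on purpose, so the example flushes quickly


class ToyLSM:
    def __init__(self):
        self.memtable = {}   # newest value per key, held in memory
        self.sstables = []   # immutable sorted runs, oldest first

    def put(self, key, value):
        self.memtable[key] = value          # cheap: no on-disk reorganization
        if len(self.memtable) >= FLUSH_THRESHOLD:
            self._flush()

    def _flush(self):
        # Freeze the memtable into an immutable sorted run (an "SSTable").
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}


engine = ToyLSM()
for i in range(10):
    engine.put(f"key{i % 5}", f"v{i}")
print(len(engine.sstables), "sorted runs on 'disk'")  # prints: 2 sorted runs on 'disk'
```

Notice that put never touches the existing runs. That untouched pile of runs is exactly the deferred work the rest of this lesson is about.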
Why this is attractive:
- Writes mostly become append-like or sequential work.
- The engine avoids a lot of small in-place on-disk mutations.
- High write throughput becomes easier to sustain.
The trade-off: You did not eliminate maintenance work. You moved it off the foreground write path and into later background merge work.
Mental model:
A B-Tree cleans the room after every new item.
An LSM tree puts items into labeled bins quickly, then schedules a later cleanup and reorganization session.
Concept 2: SSTables and the Read Path
Concrete example / mini-scenario: A user asks for the current value of a key after that key has been updated many times over the last hour.
Intuition: Because LSM trees accumulate immutable files over time, the read path may need to check several places before it knows which version is current.
How it works mechanically (sketched in code after this list):
- The engine checks the current memtable first because the newest writes may still be in memory.
- It may also check immutable memtables waiting to flush.
- Then it consults SSTables on disk, often across multiple levels.
- Because SSTables are sorted and immutable, the engine can use metadata such as indexes, fence pointers, and often Bloom filters to skip files that definitely do not contain the key.
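Here is a sketch of that lookup order, using the same toy shapes as the write-path sketch above: a dict memtable plus a list of sorted runs, newest last. The maybe_contains check is a stand-in assumption for Bloom filters and fence pointers, and TOMBSTONE is an illustrative delete marker, not a real API.

```python
# Toy read path: consult the newest data first, and skip runs that
# provably cannot contain the key. Illustrative model only.

TOMBSTONE = "<deleted>"  # stand-in marker a delete would leave behind

def maybe_contains(run, key):
    # Stand-in for Bloom filters / fence pointers: cheaply rule a run
    # out when the key falls outside its key range.
    return bool(run) and run[0][0] <= key <= run[-1][0]

def get(memtable, sstables, key):
    if key in memtable:                        # 1. newest writes: the memtable
        value = memtable[key]
        return None if value == TOMBSTONE else value
    for run in reversed(sstables):             # 2. sorted runs, newest first
        if not maybe_contains(run, key):
            continue                           # skip files that cannot hold the key
        for k, value in run:                   # a real engine binary-searches here
            if k == key:                       # first (newest) version found wins
                return None if value == TOMBSTONE else value
    return None                                # key was never written

memtable = {"b": "b-new"}
sstables = [[("a", "a1"), ("b", "b-old")],     # older run
            [("a", TOMBSTONE)]]                # newer run: "a" was deleted
print(get(memtable, sstables, "a"))  # None: the tombstone hides the old "a1"
print(get(memtable, sstables, "b"))  # b-new: the memtable shadows "b-old"
```

The property that matters is the ordering: newer sources are consulted first, so the first version found wins, and a tombstone found early correctly hides older live versions of the same key.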
Why immutability helps:
- SSTables are simple to read and safe to share.
- Background merges are easier because files are not being modified in place.
- Replication and checksumming are often simpler with immutable segments.
Why reads get harder:
- A key may exist in several files.
- Deletes are often represented as tombstones, not immediate erasure.
- Old versions may remain until compaction removes them.
Practical consequence:
An LSM tree often wins when the workload is write-heavy, but mixed workloads must be tuned carefully so reads do not drown in too many files or too much stale history.
Connection to the next lesson: If recent writes live in memory first, durability cannot depend on memory alone. That is why the write path is usually paired with a write-ahead log, which is the focus of the next lesson.
Concept 3: Compaction Is Deferred Maintenance
Concrete example / mini-scenario: Your system ingests data very quickly all morning. By afternoon, disk usage rises, background I/O increases, and read latency becomes erratic.
Intuition: Compaction is the bill you pay for earlier write efficiency.
What compaction does (a merge sketch follows this list):
- Merges SSTables into larger, cleaner structures.
- Drops overwritten versions that no longer matter.
- Removes tombstones when it is safe.
- Reduces the number of files the read path must touch.
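Below is a sketch of the core merge step, under the simplest possible assumptions: every older version is obsolete and every tombstone is safe to drop. Real engines must also respect snapshots, tombstone grace periods, and overlap with lower levels; the TOMBSTONE marker and run layout are illustrative.

```python
import heapq

# Toy compaction: k-way merge several sorted runs into one clean run,
# keeping only the newest version of each key and dropping tombstones.
# Assumes runs[i] is a sorted list of (key, value) and higher i is newer.

TOMBSTONE = "<deleted>"  # illustrative delete marker

def compact(runs):
    # Tag entries as (key, -run_index, value) so that, for equal keys,
    # heapq.merge yields the newest run's version first.
    tagged = [[(k, -i, v) for k, v in run] for i, run in enumerate(runs)]
    merged, last_key = [], object()
    for key, _, value in heapq.merge(*tagged):
        if key == last_key:
            continue                  # older version of an already-decided key
        last_key = key
        if value != TOMBSTONE:        # drop deletes (assumes it is safe by now)
            merged.append((key, value))
    return merged

old_run = [("a", "a1"), ("b", "b1"), ("c", "c1")]
new_run = [("a", "a2"), ("b", TOMBSTONE)]
print(compact([old_run, new_run]))    # [('a', 'a2'), ('c', 'c1')]
```

Two stale entries disappear in the example: the overwritten version of "a" and the deleted "b". That cleanup is compaction paying down the debt the write path deferred.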
Why compaction matters so much - without it:
- Reads get slower because too many files must be checked.
- Deletes do not reclaim space promptly.
- Stale versions accumulate.
The three amplifications to reason about (a worked example follows this list):
- Write amplification: a logical write may be rewritten several times during compaction.
- Read amplification: a read may need to inspect several files or levels before finding the answer.
- Space amplification: disk usage may temporarily exceed live data size because old and new versions coexist.
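Here is a back-of-the-envelope illustration with made-up numbers, just to show how the three quantities are typically reasoned about; the formulas are deliberate simplifications, not how any specific engine reports its metrics.

```python
# Amplification estimates with made-up numbers, for intuition only.

logical_gb_written = 1.0            # user data actually submitted
compaction_gb_rewritten = 9.0       # same data rewritten during merges
structures_probed_per_read = 4      # memtable plus runs consulted
live_data_gb = 10.0                 # current values only
disk_used_gb = 14.0                 # includes stale versions and tombstones

write_amp = (logical_gb_written + compaction_gb_rewritten) / logical_gb_written
read_amp = structures_probed_per_read
space_amp = disk_used_gb / live_data_gb

print(f"write amplification ~{write_amp:.0f}x")    # ~10x: each byte rewritten ~10 times
print(f"read amplification  ~{read_amp} probes")   # 4 places checked per lookup
print(f"space amplification ~{space_amp:.1f}x")    # 1.4x: 40% overhead of stale data
```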
Important design tension:
- More aggressive compaction can improve reads and space usage.
- But aggressive compaction increases background I/O and CPU cost.
- Less aggressive compaction preserves write throughput for a while, but tends to worsen reads and leave more stale data around.
Common strategy families (a trigger sketch follows this list):
- Size-tiered compaction: merge batches of similarly sized files; lower write cost, but overlapping key ranges linger, so reads and space can suffer.
- Leveled compaction: keep each level as a tighter, mostly non-overlapping structure; often improves reads and space usage, but rewrites more data, raising write cost.
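As a rough illustration of the size-tiered idea, here is a sketch of a trigger rule that merges any group of similarly sized runs. The bucketing scheme and the MIN_RUNS_TO_MERGE threshold are hypothetical, loosely inspired by size-tiered schemes rather than taken from any engine.

```python
import math
from collections import defaultdict

# Size-tiered trigger sketch: bucket runs by rough size, and merge any
# bucket that accumulates enough similarly sized runs. Numbers are made up.

MIN_RUNS_TO_MERGE = 4  # hypothetical threshold

def pick_merge_candidates(run_sizes_mb):
    buckets = defaultdict(list)
    for size in run_sizes_mb:
        # Bucket by order of magnitude so similarly sized runs group together.
        buckets[int(math.log2(max(size, 1)))].append(size)
    for sizes in buckets.values():
        if len(sizes) >= MIN_RUNS_TO_MERGE:
            return sizes              # merge these into one larger run
    return None                       # nothing to do yet; writes stay cheap

print(pick_merge_candidates([10, 12, 9, 11, 300]))  # the four ~10 MB runs
```

A leveled policy would replace this rule with per-level size targets and key-range-based file selection, which rewrites more data but keeps each level tight for reads.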
You do not need to memorize vendor defaults yet. What matters is seeing compaction as policy, not housekeeping trivia.
Troubleshooting
Issue: Writes benchmark well, but production read latency keeps drifting upward.
Why it happens: Too many SSTables, too much overlap, or too much stale history can increase read amplification.
Clarification / Fix: Check compaction health, file counts, per-level growth, Bloom filter effectiveness, and whether the workload now has more reads than the original design assumed.
Issue: Deletes do not seem to free disk space.
Why it happens: In LSM systems, deletes usually become tombstones first. Space is reclaimed later by compaction, not at delete time.
Clarification / Fix: Look for compaction backlog, retention settings, or long-lived snapshots that prevent obsolete data from being dropped.
Issue: Throughput is fine until compaction starts, then latency spikes.
Why it happens: Background merge work competes with foreground reads and writes for disk bandwidth, CPU, and cache.
Clarification / Fix: Tune compaction concurrency, I/O throttling, level sizes, and write buffer settings. The real fix is often policy tuning, not just more hardware.
Advanced Connections
Connection 1: LSM Trees <-> B-Trees
The contrast: A B-Tree pays more structure-maintenance cost during writes so reads stay relatively direct. An LSM tree makes writes cheaper first, then pays later through compaction and more complicated reads.
Why this matters: Engine choice is not "modern vs old." It is about where you want to pay the cost.
Connection 2: LSM Trees <-> Log-Structured Systems
The parallel: Like logs and segment-based storage, LSM trees favor append-like writes and deferred cleanup. The cleanup phase is where old history is compacted away.
Why this matters: Once you see the pattern, storage engines, message logs, and even some cache designs start looking like variations of the same systems idea: cheap ingestion now, reconciliation later.
Resources
- [BOOK] Designing Data-Intensive Applications by Martin Kleppmann - Official site
  Focus: clear explanations of storage engine trade-offs.
- [BOOK] Database Internals by Alex Petrov - Book site
  Focus: practical engine structures, including LSM-based systems.
- [DOC] RocksDB Wiki - Documentation
  Focus: real operational details of an LSM engine used in production.
- [PAPER] The Log-Structured Merge-Tree (LSM-Tree) - Reference
  Focus: the original design motivation and core mechanism.
Key Insights
- LSM trees do not avoid maintenance - they delay and batch it.
- SSTables make writes and storage simpler in one dimension - but the read path and cleanup logic become more complex.
- Compaction is the real systems trade-off - it controls how you balance write cost, read cost, and space cost over time.