LESSON
Day 394: LSM Compaction Strategies and Amplification
The core idea: Compaction is the policy that repays the disorder created by cheap LSM writes, and the choice of policy decides whether that bill lands mostly on write amplification, read amplification, or temporary extra space.
Today's "Aha!" Moment
In 09.md, Harbor Point's municipal-bond audit store got its fast write path by admitting updates into a memtable, flushing that memtable into new SSTables, and postponing full on-disk cleanup. After a trading spike, that delay becomes visible: the engine may have several overlapping level-0 files, each individually sorted but collectively messy. A point lookup for one bond now has to check more places, and disk still contains older versions that are no longer logically current.
Compaction is the mechanism that turns that pile of sorted runs back into a manageable shape. It reads overlapping SSTables in key order, resolves which version of each record is the newest visible one, carries forward tombstones when they are still needed, and writes replacement SSTables. That sounds like background housekeeping, but it is actually where the engine decides how aggressively it will trade extra rewriting for better reads and lower space usage.
The important shift is that "amplification" is not accidental overhead. It is the price of an LSM design. Write amplification is how many bytes the engine rewrites in compaction for each byte of logical update. Read amplification is how many files, blocks, or key ranges a lookup must consult before it can answer. Space amplification is how much extra disk remains occupied by obsolete versions and overlapping files. No compaction strategy minimizes all three at once.
That is why leveled and tiered compaction feel different in production even when they store the same logical dataset. Leveled compaction spends more background I/O to keep overlap low and reads predictable. Tiered compaction preserves write throughput by merging less often, but it allows more overlap and more temporary disk growth between merges. The next lesson, 11.md, matters because once overlap exists, the engine needs Bloom filters and fence pointers to avoid paying the full read cost on every lookup.
Why This Matters
Harbor Point's audit store has two conflicting jobs. During the trading day, it must absorb every quote revision for tens of thousands of bonds without turning bursts into foreground write stalls. At the end of the day, compliance analysts and downstream pricing services run point lookups and short range queries that expect stable latency. Flush alone solves only the first half of that problem. Compaction policy decides whether the system stays queryable after the burst.
This is also where teams misread capacity signals. If storage usage spikes, it may not mean the dataset suddenly grew; it may mean compaction has not yet collapsed overlapping runs and obsolete versions. If read latency climbs, it may not mean the SSDs are failing; it may mean the current strategy is allowing too many files per lookup. If writes stall, the culprit may not be WAL sync at all; it may be compaction debt forcing the engine to slow new writes before overlap becomes unbounded.
Once you can explain compaction mechanically, those symptoms stop looking random. You can reason about whether the engine is intentionally paying more write amplification for predictable reads, intentionally preserving ingest throughput at the cost of query fan-out, or simply falling behind on background maintenance.
Learning Objectives
By the end of this session, you will be able to:
- Explain why compaction exists after flush - Describe how overlapping SSTables, stale versions, and tombstones create deferred work in an LSM tree.
- Compare leveled and tiered compaction precisely - Relate each strategy to write amplification, read amplification, and space amplification.
- Diagnose production compaction pressure - Interpret write stalls, read-latency regressions, and disk growth as consequences of compaction debt and policy choices.
Core Concepts Explained
Concept 1: Compaction is the deferred merge contract created by flush
After the Federal Reserve announcement hits, Harbor Point flushes several memtables in quick succession. The result might look like this:
```
L0:  [A-D]  [B-F]  [E-H]  [G-K]
       \      |      |      /
        overlapping key ranges
```
Each SSTable is internally sorted, but level 0 is not globally tidy. If Harbor Point asks for the latest quote of CUSIP 13042AB7, the engine may have to inspect multiple files because the newest version could be in any overlapping run. Older versions and tombstones also remain on disk because the engine has not yet done the work of proving which bytes are safe to discard.
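To see what that fan-out costs, here is a minimal sketch of the lookup order, assuming a toy engine where each SSTable is a dict carrying a key range and its records; every name here is invented for illustration, not a real engine's API.

```python
# Toy LSM point lookup. Each "table" is a dict with a key range
# ("lo", "hi") and its records ("data"); names are illustrative only.

def lookup(key, memtable, levels):
    """levels[0] is ordered newest-first and may overlap; deeper
    levels are assumed non-overlapping by key range."""
    files_checked = 0

    if key in memtable:                      # freshest data wins
        return memtable[key], files_checked

    for table in levels[0]:                  # L0: every run is a candidate
        if table["lo"] <= key <= table["hi"]:
            files_checked += 1
            if key in table["data"]:
                return table["data"][key], files_checked

    for level in levels[1:]:                 # L1+: at most one candidate
        for table in level:
            if table["lo"] <= key <= table["hi"]:
                files_checked += 1
                if key in table["data"]:
                    return table["data"][key], files_checked
                break                        # no other file can hold the key
    return None, files_checked

# Two overlapping L0 runs both cover the key; the newer one answers first.
l0 = [{"lo": "A", "hi": "D", "data": {"B7": 101.2}},
      {"lo": "B", "hi": "F", "data": {"B7": 100.9}}]
print(lookup("B7", {}, [l0]))                # (101.2, 1)
```

The files_checked counter is exactly the read amplification the text describes: every extra overlapping run is one more file a lookup may have to touch.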
Compaction performs that proof by streaming multiple input SSTables together in key order. For each key, it compares sequence numbers or version metadata, keeps the newest visible record, and decides whether an older version or tombstone can finally be dropped. That last step is stricter than it first appears. A tombstone is only removable when the engine knows there is no older live version hiding in a lower level, no snapshot that still needs the deleted row, and no recovery or replication rule that requires preserving it longer.
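A minimal sketch of that merge loop follows, assuming each input run is a list of (key, seqno, value) triples sorted by key, with None standing in for a tombstone; the snapshot and bottom-level checks are deliberate simplifications of real visibility rules.

```python
import heapq

TOMBSTONE = None

def compact(runs, oldest_snapshot_seq, is_bottom_level):
    """Merge sorted runs of (key, seqno, value); keep the newest
    version of each key; drop tombstones only when provably safe."""
    # Order equal keys so the highest seqno (newest version) comes first.
    merged = heapq.merge(*runs, key=lambda r: (r[0], -r[1]))
    out, last_key = [], object()
    for key, seqno, value in merged:
        if key == last_key:
            continue              # an older version of a key already kept
        last_key = key
        if value is TOMBSTONE:
            # Droppable only at the bottom level (nothing older can hide
            # below) and only if every snapshot already sees the delete.
            if is_bottom_level and seqno <= oldest_snapshot_seq:
                continue
        out.append((key, seqno, value))
    return out

run_a = [("B7", 9, 101.2), ("C1", 7, TOMBSTONE)]
run_b = [("B7", 4, 100.9), ("D2", 5, 99.5)]
print(compact([run_a, run_b], oldest_snapshot_seq=8, is_bottom_level=True))
# [('B7', 9, 101.2), ('D2', 5, 99.5)]: the stale B7 and the C1 tombstone
# are both discarded, which is where space is actually reclaimed.
```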
This is where the three amplification terms become concrete rather than abstract vocabulary. If Harbor Point inserts 1 GB of logical updates and compaction later rewrites 8 GB while pushing those updates through the level hierarchy, the engine incurred high write amplification. If a point lookup touches one memtable, two Bloom filters, and six SSTables before it can answer, that is read amplification. If 40 GB of live quote history occupies 70 GB on disk because old versions and overlapping runs have not yet collapsed, that is space amplification. Compaction strategy is the policy that chooses where those costs will land.
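As a back-of-the-envelope check, the three metrics fall directly out of those hypothetical figures; the formulas below are the common definitions, though engines vary in exactly what they count.

```python
# Amplification arithmetic for the hypothetical Harbor Point figures.

logical_gb = 1.0                 # logical updates accepted from clients
rewritten_gb = 8.0               # bytes compaction rewrote moving data down
write_amp = rewritten_gb / logical_gb        # 8x rewritten per logical byte
# (definitions that also count the initial flush would report 9x)

structures_probed = 1 + 6        # one memtable plus six SSTables consulted
read_amp = structures_probed     # per-lookup fan-out before an answer

live_gb, disk_gb = 40.0, 70.0
space_amp = disk_gb / live_gb    # 1.75x disk per byte of live data

print(f"write amp ~{write_amp:.0f}x, read amp {read_amp} probes, "
      f"space amp {space_amp:.2f}x")
```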
Concept 2: Leveled and tiered compaction buy different amplification profiles
Leveled compaction tries to keep each level, except level 0, mostly non-overlapping by key range. When Harbor Point compacts from one level into the next, it takes an input file or key range from level N, finds all overlapping files in level N+1, merges them, and writes replacement files whose key ranges no longer overlap within the destination level. Reads benefit because a point lookup usually needs at most one candidate file per level after level 0. The cost is that the same data may be rewritten repeatedly as it moves through several levels, which increases write amplification.
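A sketch of that file-picking step, assuming files are plain (lo, hi) key ranges and levels below L0 stay sorted and non-overlapping; the structure is invented for illustration.

```python
def pick_leveled_inputs(input_file, next_level):
    """Given one file from level N, return every level-N+1 file whose
    key range overlaps it; all of them must be merged together so the
    destination level remains non-overlapping."""
    lo, hi = input_file
    return [f for f in next_level if not (f[1] < lo or f[0] > hi)]

level1_file = ("C", "F")
level2 = [("A", "B"), ("C", "D"), ("E", "G"), ("H", "K")]
print(pick_leveled_inputs(level1_file, level2))  # [('C', 'D'), ('E', 'G')]
# One moved file forced two destination files to be rewritten as well;
# repeated across levels, that fan-out is leveled compaction's write bill.
```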
Tiered compaction, sometimes called size-tiered or universal compaction depending on the engine, makes the opposite bet. It allows multiple sorted runs of similar size to coexist and waits until enough runs accumulate before merging them into a larger run. That means Harbor Point can absorb ingest bursts with less immediate rewriting, which lowers write amplification and often improves sustained write throughput. The downside is that reads must search more overlapping files, and temporary disk usage can grow substantially before the next large merge finishes.
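A sketch of a size-tiered trigger, with runs bucketed by similar size and merged once a bucket accumulates enough of them; the thresholds are made-up placeholders, not any engine's defaults.

```python
from collections import defaultdict

def pick_tiered_merge(run_sizes_mb, min_runs=4, size_ratio=2.0):
    """Bucket runs so each bucket spans roughly a 2x size range, then
    merge the smallest bucket that has collected min_runs runs."""
    buckets = defaultdict(list)
    for size in run_sizes_mb:
        bucket = 0
        while size > size_ratio ** (bucket + 1):
            bucket += 1
        buckets[bucket].append(size)
    for _, runs in sorted(buckets.items()):
        if len(runs) >= min_runs:
            return runs            # merge these into one larger run
    return None                    # not enough similar-sized runs yet

# Four ~64 MB flushes sit on disk untouched until the fourth arrives,
# which is exactly how tiered compaction keeps write amplification low.
print(pick_tiered_merge([60, 62, 64, 63, 500]))  # [60, 62, 64, 63]
```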
For Harbor Point, the trade-off is tied directly to workload shape. If daytime traffic is dominated by quote ingest and analysts rarely query the freshest data until after the market closes, a tiered strategy may be acceptable because it keeps writes cheap when they matter most. If pricing services need tight p99 point lookups throughout the day, leveled compaction is often worth the extra background I/O because it keeps the number of candidate SSTables per lookup under better control.
| Strategy | What it optimizes first | How it does it | Main bill it creates |
|---|---|---|---|
| Leveled | Lower read amplification and tighter space bounds | Repeatedly merge overlapping key ranges so lower levels stay mostly non-overlapping | Higher write amplification from frequent rewrites |
| Tiered | Lower write amplification and higher ingest throughput | Accumulate multiple runs, then merge in larger batches | Higher read amplification and more temporary disk growth |
Neither strategy is "the LSM way" in the abstract. Each is a policy choice about which amplification profile best matches the product's workload and SLOs.
Concept 3: Production tuning is mostly about controlling compaction debt
Compaction becomes operationally visible when the engine is generating new SSTables faster than it can merge old ones. At Harbor Point, this first shows up as growing pending_compaction_bytes, rising level-0 file counts, and read-latency drift on hot bond identifiers. If the debt continues to grow, the engine may deliberately throttle or stall foreground writes. That is not a bug; it is a safety mechanism that keeps level 0 from growing to the point where every read becomes a file-system treasure hunt.
This is also why manual compaction is not automatically a fix. If the team launches a broad manual compaction in the middle of the trading day, it may compete with foreground traffic for disk bandwidth and CPU, causing the very p99 regression it was meant to solve. Smarter tuning starts with the workload shape: memtable size, flush bandwidth, size ratios between levels, number of background compaction threads, compaction priority, and whether hot levels live on faster storage. The question is always the same: where should the engine spend I/O so that user-visible latency stays within budget?
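As a sketch of how that backpressure can be expressed, the rule below classifies write pressure from two compaction-debt signals; the names and thresholds are invented for illustration, though real engines expose analogous counters.

```python
def write_pressure(l0_files, pending_compaction_gb,
                   slowdown_l0=8, stop_l0=20,
                   slowdown_gb=64, stop_gb=256):
    """Classify foreground write pressure from compaction-debt signals.
    Thresholds here are illustrative, not any engine's defaults."""
    if l0_files >= stop_l0 or pending_compaction_gb >= stop_gb:
        return "stall"      # block writes until merges catch up
    if l0_files >= slowdown_l0 or pending_compaction_gb >= slowdown_gb:
        return "throttle"   # delay each write so merges gain ground
    return "ok"

# After the ingest burst: 14 overlapping L0 files, 90 GB pending merges.
print(write_pressure(14, 90))   # 'throttle': debt is growing but bounded
```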
Deletes make the same point in a less obvious way. Suppose Harbor Point drops a large batch of expired quote records after the retention window closes. Disk usage may stay high for hours because the delete only created tombstones; reclaiming space requires compactions that encounter those tombstones together with every older version they invalidate. Until that happens, the logical delete is complete but the physical bytes are still present.
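A minimal sketch of that eligibility check, assuming seqno-based snapshots and files represented as (lo, hi) key ranges; both are simplifications of the real rules.

```python
def can_drop_tombstone(tombstone_seq, oldest_snapshot_seq,
                       key, lower_levels):
    """A tombstone is discardable only when (a) every snapshot already
    sees the delete and (b) no lower level can still hold an older
    version the tombstone must suppress."""
    if oldest_snapshot_seq < tombstone_seq:
        return False                  # a snapshot predates the delete
    for level in lower_levels:
        for lo, hi in level:          # files as (lo, hi) key ranges
            if lo <= key <= hi:
                return False          # an older version may hide here
    return True

# Retention delete at seq 900, a snapshot pinned at seq 850, and the
# key's range still present in a lower level: the bytes stay for now.
print(can_drop_tombstone(900, 850, "Q7", [[("P", "R")]]))   # False
```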
The next lesson, 11.md, picks up exactly here. Even a well-tuned engine cannot keep overlap at zero all the time, especially near level 0 or under tiered policies. Bloom filters and fence pointers are the read-side tools that reduce how much of that overlap a lookup actually has to pay for.
Troubleshooting
Issue: Harbor Point starts throttling writes after an ingest burst even though WAL fsync latency still looks normal.
Why it happens / is confusing: The foreground write path is short, so teams often assume stalls must come from logging or locks. In LSM engines, a common cause is compaction debt: too many level-0 files or too many pending bytes waiting to be merged.
Clarification / Fix: Inspect level-0 file count, pending_compaction_bytes, background compaction throughput, and write-stall counters together. If compaction is the bottleneck, retune triggers, increase background bandwidth, or choose a strategy that matches the actual ingest pattern.
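A rough triage of that diagnosis might look like the sketch below; the metric names and thresholds are invented for illustration, not real counters.

```python
def stall_cause(wal_fsync_p99_ms, l0_files, pending_gb):
    """If fsync is healthy but compaction-debt signals are elevated,
    the stall is background-merge backpressure, not logging.
    All thresholds are illustrative."""
    if wal_fsync_p99_ms > 20:
        return "investigate WAL / device latency first"
    if l0_files > 8 or pending_gb > 64:
        return "compaction debt: retune triggers or add merge bandwidth"
    return "look elsewhere: locks, CPU saturation, flush bandwidth"

print(stall_cause(wal_fsync_p99_ms=2, l0_files=14, pending_gb=90))
# compaction debt: retune triggers or add merge bandwidth
```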
Issue: Disk usage stays elevated long after a large delete campaign.
Why it happens / is confusing: The delete is logically complete as soon as the tombstone is durable, but the old bytes are still present in older SSTables until compaction rewrites the affected key ranges.
Clarification / Fix: Check whether snapshots, retention windows, or lower-level overlaps still require preserving tombstones. Space returns only after compaction can prove that no older visible version remains that the tombstone must suppress.
Issue: Point lookups got slower after switching from leveled to tiered compaction.
Why it happens / is confusing: Write throughput often improves immediately, which makes the strategy change look successful at first. The delayed cost arrives on the read path because more overlapping runs survive between merge events.
Clarification / Fix: Measure files consulted per lookup, Bloom-filter usefulness, and read p95/p99. If the query SLO matters more than peak ingest, the engine may need leveled compaction or tighter tiered merge thresholds.
Advanced Connections
Connection 1: Compaction ↔ external merge sort
The I/O pattern in compaction looks like external merge sort because both stream multiple sorted runs and produce fewer sorted runs through mostly sequential reads and writes. The storage-engine version is harder because it must also honor version visibility, tombstones, snapshots, and crash-safe installation of replacement files.
Connection 2: Compaction ↔ MVCC garbage collection
Compaction is one of the places where an LSM engine performs version garbage collection. The engine cannot drop an older value simply because a newer value exists; it has to know that no active snapshot, replica, or lower-level file still makes the older version relevant. That is the same discipline MVCC systems apply when deciding which row versions are safe to reclaim.
Resources
Optional Deepening Resources
- [DOC] RocksDB Wiki: Compaction
- Focus: Read how RocksDB defines compaction triggers, files-by-level layout, and the metrics it exposes when background work falls behind.
- [DOC] RocksDB Wiki: Leveled Compaction
- Focus: Trace how overlap rules per level reduce read amplification while increasing the number of bytes rewritten.
- [DOC] RocksDB Wiki: Universal Compaction
- Focus: Compare a tiered-style policy that preserves write throughput by allowing more overlapping runs between merge events.
- [PAPER] Bigtable: A Distributed Storage System for Structured Data
- Focus: Review the sections on minor and major compaction to see the classic LSM trade-off between foreground ingest and background consolidation.
Key Insights
- Compaction is where an LSM engine repays deferred ordering work - Flush creates sorted runs cheaply, but compaction decides how expensive those deferred writes, reads, and bytes on disk will become.
- Leveled and tiered strategies optimize different pain points - Leveled spends more rewrite I/O to keep reads predictable, while tiered preserves ingest throughput by tolerating more overlap.
- Compaction debt eventually becomes user-visible - Write stalls, elevated disk usage, and slower lookups are all ways deferred merge work leaks into production behavior.