Day 466: LSM Compaction and Amplification

Data Architecture and Platforms · Lesson 018 · 30 min · Advanced

The core idea: An LSM tree keeps foreground writes cheap by buffering updates in memory and flushing them to immutable SSTables, but it repays that shortcut later through compaction work that sets the real write, read, and space amplification budgets.

Today's "Aha!" Moment

In 017.md, PayLedger kept PostgreSQL focused on authoritative payout rows because every extra B-tree index charged the write path immediately. That solves the transactional side of same-day payroll, but it leaves a different pressure untouched: the platform also stores every processor callback, retry decision, risk flag, and reconciliation event for later replay. During the cutoff hour, that event history arrives in bursts that would punish an in-place page structure if it had to keep every write perfectly arranged as it landed.

The useful shift is to stop calling an LSM engine "write optimized" as if the work disappears. An LSM tree is really a promise to write now and reorganize later. PayLedger can accept a new payout event by appending to a WAL, inserting into a memtable, and eventually flushing an immutable SSTable. The foreground request stays cheap because the engine is not seeking out old pages and editing them in place. The background system is not free, though. Compaction must later merge overlapping SSTables, discard overwritten versions or tombstones when it is safe, and restore a layout that reads can traverse without touching too many files.

That deferred maintenance is the trade-off. LSM engines are excellent when the workload is dominated by sustained writes, ordered scans, and "latest value by key" lookups over a disciplined key design. They become painful when teams celebrate low p50 write latency while compaction backlog, disk usage, and read amplification quietly climb underneath them. The common mistake is to think compaction is just housekeeping. In production it is the mechanism that determines whether yesterday's cheap writes are still affordable today.

Why This Matters

PayLedger now keeps a payout_event_history store for replayable operational history. PostgreSQL still owns the canonical payout row from 016.md and 017.md, but the event history sees far more writes because every settlement attempt, webhook retry, operator correction, and downstream reconciliation note is recorded separately. Incident response depends on being able to read that history by tenant, by payout, and by recent time window without turning the transactional database into a general-purpose event archive.

An LSM-based store fits because it turns random incoming writes into sequential log appends, in-memory sorting, and later bulk merges. That moves cost away from the request that accepts the event. The price is that the system now carries three budgets instead of one. Write amplification measures how many times bytes are rewritten during compaction. Read amplification measures how many memtables, SSTables, and tombstones a lookup has to cross before it finds the current answer. Space amplification measures how much extra disk is occupied by overlap, stale versions, and not-yet-compacted tombstones.
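
To make those three budgets concrete, here is a back-of-the-envelope sketch in Python. Every input is an illustrative assumption rather than a PayLedger measurement, and the "one rewrite per level" model for a leveled tree is a rough rule of thumb, not an exact formula.

# Rough amplification estimates for a leveled LSM tree. All inputs are
# illustrative assumptions; real engines report these numbers directly.

ingested_gb = 100          # logical bytes accepted from clients
levels = 5                 # levels below L0, each roughly 10x the one above
l0_files_during_burst = 4  # overlapping files in level 0 at the cutoff peak
stale_fraction = 0.30      # share of on-disk bytes that are overwritten or expired

# Write amplification: physical bytes written divided by logical bytes ingested.
# WAL + flush + roughly one rewrite per level is a common rough model.
write_amp = (ingested_gb * (1 + 1 + levels)) / ingested_gb

# Read amplification: structures a point lookup may have to consult.
read_amp = 1 + l0_files_during_burst + levels   # memtable + L0 files + one file per deeper level

# Space amplification: disk occupied relative to live data.
space_amp = 1 / (1 - stale_fraction)

print(f"write amp ~{write_amp:.0f}x, read amp ~{read_amp} structures, space amp ~{space_amp:.1f}x")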

If the team misses those budgets, the history store looks healthy right until the cutoff window gets busy. Writes still succeed, but compaction falls behind, cache efficiency drops, read latency widens, and disk alerts arrive hours after the ingest spike that caused them. Understanding compaction and amplification is what lets the team decide whether an LSM engine is the right fit, which compaction strategy to use, and which queries should stay off this store entirely. That is the bridge to 019.md, where index strategy and selectivity planning determine whether the read path stays narrow enough for the chosen engine.

Learning Objectives

By the end of this session, you will be able to:

  1. Explain why an LSM write path is fast at ingest time - Trace a payout event from WAL and memtable to immutable SSTables and describe what work is deferred.
  2. Quantify the three amplification budgets - Connect compaction strategy to write amplification, read amplification, and temporary disk growth in production.
  3. Choose when an LSM store fits the workload - Decide which PayLedger queries belong in the event-history store and which ones should remain in PostgreSQL or a different derived system.

Core Concepts Explained

Concept 1: Foreground writes are cheap because LSM engines delay structural cleanup

PayLedger keys its event-history store so the hottest read is mechanical: fetch recent events for one payout or one tenant-scoped slice of time. When processor northbank sends a webhook that payout p_842119 moved from submitted to failed, the storage engine does something much closer to log shipping than to B-tree page maintenance:

processor webhook
  -> WAL append for durability
  -> memtable insert/update
  -> memtable fills
  -> flush immutable SSTable into L0
  -> background compaction merges L0/L1/L2...

The critical detail is that the engine does not find an old disk page and rewrite it in place. Newer values are written to memory and then to new sorted files. Older versions can remain in earlier SSTables for a while. Deletes and TTL expiry usually behave the same way by writing tombstones instead of freeing space immediately. The store therefore accepts sustained ingest well because disk writes are mostly sequential appends and batched flushes.
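
A minimal sketch of that foreground path makes the deferral visible. This is a toy model, not any real engine's API: put() only appends and updates memory, and flush() writes a new immutable sorted run instead of editing older ones.

# Toy write path: append to a WAL, update an in-memory map, flush an immutable
# sorted run when the memtable fills. Class and field names are illustrative.

class TinyLSM:
    def __init__(self, flush_threshold=4):
        self.wal = []           # stands in for a durable append-only log
        self.memtable = {}      # newest value per key, held in memory
        self.sstables = []      # immutable sorted runs, newest last
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))      # sequential append, no page is sought out
        self.memtable[key] = value         # the overwrite happens in memory only
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.sstables.append(sorted(self.memtable.items()))  # new immutable file
        self.memtable.clear()
        self.wal.clear()                   # safe once the flushed run is durable

store = TinyLSM()
store.put(("p_842119", 1), "submitted")
store.put(("p_842119", 2), "failed")       # nothing rewrites the older version in place

Nothing in the sketch ever rewrites an existing run, which is exactly why reads, and eventually compaction, have to reconcile the overlapping versions later.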

That is the first half of the trade-off introduced in 017.md. A B-tree pays to preserve order on every write. An LSM tree accepts temporary disorder across multiple immutable runs and promises to clean it up later. That is why it works well for an append-heavy operational history, but it also means the correctness of the read path depends on merge rules, tombstone handling, and a background scheduler that never quite gets to stop.

Concept 2: Compaction is where write, read, and space amplification are negotiated

Imagine PayLedger during the final payroll hour of the day. Tens of thousands of events arrive per second, so level 0 fills with many new SSTables that overlap in key range. A lookup for the latest events on p_842119 may need to check the memtable, one or more recent SSTables, and possibly older levels before it knows which version is newest and whether a tombstone hides an older value. Bloom filters and per-file indexes help skip irrelevant files, but they do not remove overlap by themselves.
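
The sketch below shows the shape of that fan-out, faking Bloom filters with plain Python sets purely to illustrate the control flow; keys and values are illustrative. The lookup walks from newest to oldest, and every file it cannot skip adds to the read amplification.

# Lookup fan-out during a burst. Bloom filters are faked with plain sets here.

memtable = {("p_842119", 7): "failed"}
l0_runs = [                                          # overlapping recent flushes, oldest first
    ([(("p_842119", 5), "submitted")], {("p_842119", 5)}),
    ([(("p_842119", 6), "retried")], {("p_842119", 6)}),
]

def latest(key):
    if key in memtable:
        return memtable[key], 0                      # served from memory, zero files touched
    touched = 0
    for run, bloom in reversed(l0_runs):             # newest run first
        if key not in bloom:
            continue                                 # filter skip: the file is never opened
        touched += 1
        for k, v in run:                             # real engines binary-search a file index
            if k == key:
                return v, touched
    return None, touched                             # read amplification = files actually opened

print(latest(("p_842119", 7)))   # ('failed', 0)
print(latest(("p_842119", 5)))   # ('submitted', 1) after skipping the newer run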

Compaction is the process that restores order. The engine selects SSTables, merges them by key, keeps the newest surviving version, carries forward still-valid tombstones, and writes larger output files into lower levels. That is where the three amplification budgets come from:

  1. Write amplification - the same bytes are rewritten each time a merge carries them into a lower level, so physical writes can be several multiples of the logical ingest.
  2. Read amplification - until overlap is merged away, a lookup may have to consult the memtable and several SSTables before it is sure which version is newest.
  3. Space amplification - overlapping runs, stale versions, and not-yet-resolved tombstones keep occupying disk until a merge reclaims them.

The compaction strategy decides which bill is smallest. Leveled compaction keeps overlap low in lower levels, so reads and disk usage stay more predictable, but each record may be rewritten many times on its way down the tree. Size-tiered compaction absorbs bursts with less immediate rewrite cost, but it tolerates more overlapping SSTables, so read amplification and temporary disk growth are usually worse. There is no universal winner. PayLedger has to choose whether the event-history store should favor smoother reads during incident replay or cheaper ingest during the cutoff spike.
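
A rough, textbook-style comparison shows how the two strategies split the bill differently. The fanout and level counts below are illustrative assumptions, and real engines deviate from these rules of thumb.

# Back-of-the-envelope comparison of the two common compaction strategies.

fanout = 10          # size ratio between adjacent levels or tiers
levels = 4           # levels below L0 holding the bulk of the data
runs_per_tier = 4    # overlapping runs a size-tiered engine tolerates per tier

# Leveled: each byte is rewritten roughly `fanout` times per level it descends.
leveled_write_amp = fanout * levels              # expensive rewrites...
leveled_read_amp = levels + 1                    # ...but about one file per level to check

# Size-tiered: each byte is rewritten roughly once per tier it passes through.
tiered_write_amp = levels                        # cheap rewrites...
tiered_read_amp = runs_per_tier * levels         # ...but many overlapping runs to check

print(f"leveled:     ~{leveled_write_amp}x write amp, ~{leveled_read_amp} runs per lookup")
print(f"size-tiered: ~{tiered_write_amp}x write amp, ~{tiered_read_amp} runs per lookup")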

Deletes make the trade-off more visible. Suppose raw webhook payloads expire after 30 days. In an LSM engine, that policy often produces tombstones first and reclaimed disk later. Until compaction sees the tombstones and the shadowed data in the same merge, storage does not shrink and reads may still pay to step over deleted keys. TTL is therefore not a free cleanup primitive; it is a promise that compaction debt will be paid in the future.
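
A simplified merge sketch shows why the space comes back late: a tombstone can only be dropped when the merge also covers every older copy of the key, and until then it is carried forward. The rule is deliberately simplified; real engines apply more detailed safety checks.

# Simplified merge: keep the newest version per key; drop a tombstone only when
# the merge is known to cover every older copy of that key. Names are illustrative.

TOMBSTONE = "<tombstone>"

def merge(runs, covers_all_older_data):
    newest = {}
    for run in runs:                          # runs ordered newest to oldest
        for key, value in run:
            newest.setdefault(key, value)     # the first (newest) version wins
    merged = []
    for key in sorted(newest):
        value = newest[key]
        if value == TOMBSTONE and covers_all_older_data:
            continue                          # safe to reclaim the space at last
        merged.append((key, value))           # otherwise the tombstone is carried forward
    return merged

recent = [(("p_842119", "payload"), TOMBSTONE)]        # TTL expiry wrote a delete marker
older = [(("p_842119", "payload"), "<raw webhook>")]   # the bytes the marker shadows

print(merge([recent, older], covers_all_older_data=False))  # tombstone kept, no space freed
print(merge([recent, older], covers_all_older_data=True))   # both versions finally disappear: []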

Concept 3: Workload shape decides whether compaction debt stays bounded

For PayLedger, the event-history store is a good fit when the key structure matches the real queries. Fetching recent events for one payout, scanning a tenant's last few minutes of callbacks, or replaying one reconciliation stream are all LSM-friendly because they align with an ordered primary key and tolerate immutable-file compaction in the background. Maintaining one "latest event per payout" projection is also reasonable when it is keyed directly and updated as a small number of upserts.
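
A key-design sketch shows what "keyed for the real queries" means in practice. The field names and the newest-first ordering trick are illustrative choices, not a prescription for PayLedger's actual schema.

# Illustrative key layout: events cluster by tenant and payout and sort by time,
# so "recent events for one payout" is a short contiguous range scan.

from datetime import datetime, timezone

def event_key(tenant_id, payout_id, event_time, seq):
    # Partition on (tenant, payout); order newest-first inside the partition so
    # a "latest N events" read stops after a few rows.
    return (tenant_id, payout_id, -event_time.timestamp(), seq)

def latest_event_key(tenant_id, payout_id):
    # A separate keyed projection, maintained as one upsert per event, answers
    # "current state of this payout" without scanning the history.
    return ("latest", tenant_id, payout_id)

now = datetime(2024, 5, 17, 16, 3, tzinfo=timezone.utc)
print(event_key("acme-co", "p_842119", now, 7))
print(latest_event_key("acme-co", "p_842119"))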

The same store is a poor fit for exploratory support filters like "show every failed payout for tenant acme-co where processor code is R03, operator note contains 'duplicate', and the settlement window is larger than two hours." Those filters demand secondary indexing, high selectivity planning, or a search surface designed for combinatorial predicates. Forcing them into the LSM history store creates more indexes to maintain, more data to compact, and more unpredictable read amplification. That is not a failure of the engine; it is a workload mismatch.

This is the operational lesson. Choosing an LSM engine is not the end of storage design. Partition width, key order, TTL policy, update frequency, and secondary-index count all determine whether compaction remains background maintenance or turns into the dominant cost center. That is why the next lesson, 019.md, matters immediately after this one. Once the team accepts the LSM trade-off, it still has to choose index structures and predicate shapes that keep the read path selective instead of letting compaction debt leak into every query.

Troubleshooting

Issue: Write latency still looks fine, but reads suddenly slow down during the cutoff window.

Why it happens / is confusing: LSM engines can continue accepting writes while compaction falls behind. The front door looks healthy even as level-0 overlap, pending compaction bytes, and tombstones push up read amplification behind the scenes.

Clarification / Fix: Watch compaction backlog, SSTable counts per level, read amplification, and p95/p99 lookup latency together. If the backlog climbs during ingest spikes, raise compaction throughput, reduce overlapping flush pressure, or revisit whether the compaction strategy matches the burst shape.
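
One way to encode that advice is to alert on the back-door signals together instead of trusting write latency alone. The metric names and thresholds below are hypothetical placeholders, not any engine's real counters.

# Hypothetical health check that refuses to declare the store healthy on
# write latency alone; thresholds are illustrative.

def compaction_health(metrics):
    signals = {
        "backlog_growing": metrics["pending_compaction_gb"] > 50,
        "l0_overlap_high": metrics["l0_sstable_count"] > 8,
        "reads_widening": metrics["p99_read_ms"] > 4 * metrics["p50_read_ms"],
    }
    return [name for name, firing in signals.items() if firing]

cutoff_window = {
    "pending_compaction_gb": 120,   # backlog climbing during the ingest spike
    "l0_sstable_count": 14,
    "p50_read_ms": 3,
    "p99_read_ms": 45,
    "p50_write_ms": 1,              # the front door still looks healthy on its own
}
print(compaction_health(cutoff_window))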

Issue: Disk usage does not drop after TTL expiry or large delete jobs.

Why it happens / is confusing: Deletes in an LSM system usually write tombstones first. Space is reclaimed only when future compaction merges the tombstones with the shadowed data and the engine's safety rules allow the old bytes to disappear.

Clarification / Fix: Treat retention policy as a compaction workload, not as an instant free-space command. Align TTL windows, compaction scheduling, and tombstone-heavy deletes so disk forecasts include the temporary space amplification they create.
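
A disk forecast that treats retention as deferred compaction work might look like the arithmetic below; every number is an illustrative assumption.

# Forecast disk needs including data that is logically expired but not yet merged away.

live_data_gb = 800            # data still inside the 30-day window
expired_data_gb = 200         # past TTL but not yet compacted away
tombstone_overhead = 0.02     # the delete markers themselves take a little space
compaction_scratch = 0.10     # temporary copies written during large merges

forecast_gb = (live_data_gb
               + expired_data_gb                       # still on disk until merged
               + expired_data_gb * tombstone_overhead  # plus the markers shadowing it
               + live_data_gb * compaction_scratch)    # plus merge scratch space

print(f"provision for ~{forecast_gb:.0f} GB, not just the {live_data_gb} GB of live data")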

Issue: One tenant's queries time out even though cluster-wide ingest metrics look normal.

Why it happens / is confusing: Average write throughput can hide a bad key design. A tenant with very wide partitions or too many overlapping recent SSTables pays a local read-amplification penalty that global graphs smooth away.

Clarification / Fix: Inspect partition width, hottest key ranges, and per-query file touches for the affected tenant. If one partition is carrying too much history, rebucket the key, shorten retention in the hot store, or move exploratory queries into a store built for broader filtering.

Advanced Connections

Connection 1: 017.md and this lesson price opposite sides of the same storage-engine decision

The previous lesson showed why B-trees make writes expensive by preserving order immediately. This lesson shows the mirror image: LSM engines make writes cheaper up front by accepting immutable overlap and paying later through compaction. PayLedger needs both mental models because the canonical payout row and the replayable event history do not deserve the same physical cost profile.

Connection 2: 019.md turns engine choice into query-shape discipline

An LSM engine does not remove the need for good indexing and predicate design. It changes the costs of bad choices. Once PayLedger commits to an event-history store with compaction, the next practical question is which indexes and filters are selective enough to belong there at all. That is exactly the ground the next lesson covers.

Key Insights

  1. Cheap LSM writes are deferred work, not missing work - The write path is fast because ordering and cleanup move into memtable flushes and compaction.
  2. Compaction strategy is a production policy choice - It decides whether the system spends more on rewrite cost, on lookup fan-out, or on temporary disk overhead.
  3. LSM engines reward disciplined query shapes - They fit append-heavy histories and keyed range scans far better than broad exploratory filters with weak selectivity.