Compression and Encoding in Storage Engines


Database Engine Internals and Implementation

019 · 30 min · advanced

Day 396: Compression and Encoding in Storage Engines

The core idea: A storage engine does not try to maximize compression ratio in the abstract. It encodes bytes at the same granularity it must still search, cache, flush, and recover, then applies compression only in ways that preserve that access path.

Today's "Aha!" Moment

In the previous lesson, Harbor Point's municipal-bond audit store used Bloom filters and fence pointers to narrow a lookup down to one candidate SSTable block. That solved the "which block should I touch?" problem, but not the "what should the bytes inside that block look like?" problem. If every key and value were stored verbatim, Harbor Point would waste SSD bandwidth on repeated prefixes, repeated field names, and timestamps that differ by only a few bytes from the previous record.

The important shift is that encoding and compression are not the same thing. The engine first exploits structure it already knows about: keys are sorted, neighboring records share prefixes, sequence numbers often move in one direction, and some value fields repeat. Only after it has turned that structure into a denser block layout does it optionally run a general-purpose codec such as LZ4, Snappy, or Zstd over the block. Encoding preserves navigability; compression shrinks the remaining redundancy.

That is why "just compress the whole file harder" is usually the wrong instinct. Harbor Point serves mostly point lookups during the trading day. If a point lookup has to inflate a huge byte region or reconstruct too many keys before it can compare the target CUSIP, the store may save disk space while making p99 latency worse. Storage-engine compression is successful only when the block stays cheap to seek inside after the bytes become compact.

This also prepares the ground for the next lesson. Once the engine decides how records are represented inside blocks or pages, it still has to decide when modified in-memory pages are written back and how much dirty state can accumulate before writeback becomes dangerous. Compression changes the size and cost of that writeback path; checkpointing and dirty-page control decide when the bill is paid.

Why This Matters

Harbor Point stores quote history for thousands of bonds whose keys share long prefixes: desk, issuer family, CUSIP, and timestamp. The values are not random either. Venue codes repeat, price deltas between consecutive quotes are small, and tombstones or version metadata follow predictable layouts. A raw representation would force the engine to write more bytes during flush and compaction, keep fewer hot blocks in cache, and read more data per lookup than the logical workload actually needs.

Poorly chosen compression is not harmless optimization debt. If the engine uses a heavy codec on hot levels, point lookups spend too much CPU decompressing blocks. If restart points are too sparse, seeks degrade because the reader has to reconstruct many prefix-compressed keys before it can compare the target. If the encoding hides fields needed for sequence ordering or tombstone visibility, the engine risks returning the wrong version. In production, compression sits on the data path for reads, writes, and recovery, so the trade-offs show up everywhere.

Once you can describe the pipeline precisely, the tuning questions become concrete. You can ask whether block size is too large for point probes, whether bottommost levels deserve a denser codec than top levels, whether prefix compression is helping the actual key layout, and whether the buffer manager or compaction threads are spending too much time rebuilding compressed pages.

Learning Objectives

By the end of this session, you will be able to:

  1. Explain why storage engines separate encoding from compression - Relate sorted-order-aware layouts to later codec choices and lookup behavior.
  2. Trace how a compressed block is written and read - Follow the path from block builder to restart array to point lookup and decompression.
  3. Evaluate production trade-offs in block layout choices - Connect block size, restart interval, codec strength, and workload shape to read, write, and recovery costs.

Core Concepts Explained

Concept 1: Encoding comes first because the engine already knows structural facts about neighboring records

Harbor Point's SSTable block might contain keys like these in sorted order:

desk=muni|cusip=59333LAE2|ts=2026-03-31T14:32:10.123Z
desk=muni|cusip=59333LAE2|ts=2026-03-31T14:32:10.140Z
desk=muni|cusip=59333LAF0|ts=2026-03-31T14:32:11.005Z

Storing each key in full would repeat desk=muni|cusip= over and over, even though sorted order guarantees that adjacent keys often share a long prefix. A block encoder uses that fact directly. Instead of writing the full key every time, it can write shared_prefix_length, unshared_suffix_length, and the value length, followed by the unshared suffix bytes and the value. The first key in a restart interval is stored in full; later keys in the same interval are reconstructed by copying the shared prefix from the previous reconstructed key and appending the suffix.

LevelDB-style and RocksDB-style block formats make this idea concrete because they encode each entry as metadata plus bytes rather than as a raw record dump. Similar tricks apply to values. Sequence numbers, lengths, and other small integers are often written as varints so short numbers cost fewer bytes. If the value layout contains repetitive low-cardinality fields such as quote venue or trade side, the engine can map them to compact codes before any general-purpose compression runs.
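
A minimal sketch of that entry framing is shown below, assuming a LevelDB-like layout; the varint flavor, field order, and the example value are illustrative, not any specific engine's exact wire format.

def encode_varint(n):
    # Unsigned LEB128-style varint: 7 payload bits per byte, high bit = "more follows".
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def shared_prefix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def encode_entry(prev_key, key, value):
    # One entry relative to the previous key in the same restart interval:
    # varint(shared) varint(unshared) varint(value_len) suffix value
    shared = shared_prefix_len(prev_key, key)
    suffix = key[shared:]
    return (encode_varint(shared)
            + encode_varint(len(suffix))
            + encode_varint(len(value))
            + suffix
            + value)

k1 = b"desk=muni|cusip=59333LAE2|ts=2026-03-31T14:32:10.123Z"
k2 = b"desk=muni|cusip=59333LAE2|ts=2026-03-31T14:32:10.140Z"
entry = encode_entry(k1, k2, b"price=101.375|venue=V1")
print(len(k2), "raw key bytes ->", len(entry), "bytes for the entire encoded entry")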

The invariant is that encoding must preserve the logical comparator and the metadata needed for visibility. Harbor Point still needs to know which version of a quote is newest, whether a tombstone masks older entries, and where one key ends so the next can begin. Compression can remove redundancy, but it cannot remove ordering information or the fields required to resolve versions correctly.

Concept 2: Block compression is aligned with lookup granularity, not with the entire file

After the previous lesson's read-avoidance steps, Harbor Point usually arrives at one candidate data block. At that point the engine wants to do the minimum work required to answer the lookup:

Bloom/filter metadata says "maybe"
        ->
fence pointer selects one block handle
        ->
read one compressed block from disk
        ->
decompress that block only
        ->
binary-search restart points
        ->
reconstruct keys within one restart interval
        ->
compare target key and return value or tombstone

That sequence explains why storage engines compress per block or per page rather than as one opaque file blob. If Harbor Point had to decompress an entire SSTable to answer a point probe for a single bond, every lookup would inherit the cost of unrelated data. Block-local compression keeps the unit of decompression aligned with the unit chosen by the fence-pointer index and the block cache.

Restart points make this read path practical. Prefix compression saves space, but a binary search cannot jump into the middle of a block unless some entries are stored in full. The restart array provides those anchors. The reader binary-searches the restart offsets, decompresses or reconstructs only from the chosen restart point onward, and then scans a short local segment to find the exact key. Denser restart points make seeks cheaper but consume more metadata bytes; sparser restart points save space but force more key reconstruction work per lookup.
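
The sketch below walks that read path end to end, assuming a LevelDB-like block with a restart array. It is deliberately simplified: fixed-width lengths instead of varints, restart keys kept alongside the offsets purely for readability, and illustrative keys and prices.

import bisect
import struct

RESTART_INTERVAL = 2   # full key every 2 entries, kept tiny so the example is visible

def shared_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def build_block(entries):
    # entries: sorted list of (key, value) byte pairs.
    buf = bytearray()
    restarts = []        # byte offsets of entries stored with a full key
    restart_keys = []    # full keys at those offsets (kept separately here for clarity)
    prev_key = b""
    for i, (key, value) in enumerate(entries):
        if i % RESTART_INTERVAL == 0:
            shared = 0
            restarts.append(len(buf))
            restart_keys.append(key)
        else:
            shared = shared_len(prev_key, key)
        suffix = key[shared:]
        buf += struct.pack("<III", shared, len(suffix), len(value)) + suffix + value
        prev_key = key
    return bytes(buf), restarts, restart_keys

def seek(block, restarts, restart_keys, target):
    # Binary-search the restart array, then reconstruct keys forward within one interval.
    idx = max(bisect.bisect_right(restart_keys, target) - 1, 0)
    pos = restarts[idx]
    end = restarts[idx + 1] if idx + 1 < len(restarts) else len(block)
    prev_key = b""
    while pos < end:
        shared, unshared, vlen = struct.unpack_from("<III", block, pos)
        pos += 12
        key = prev_key[:shared] + block[pos:pos + unshared]
        pos += unshared
        value = block[pos:pos + vlen]
        pos += vlen
        if key == target:
            return value
        prev_key = key
    return None            # key not in this block

entries = [
    (b"desk=muni|cusip=59333LAE2|ts=2026-03-31T14:32:10.123Z", b"101.375"),
    (b"desk=muni|cusip=59333LAE2|ts=2026-03-31T14:32:10.140Z", b"101.380"),
    (b"desk=muni|cusip=59333LAF0|ts=2026-03-31T14:32:11.005Z", b"100.950"),
]
block, restarts, restart_keys = build_block(entries)
print(seek(block, restarts, restart_keys, entries[1][0]))   # b'101.380'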

This layered approach also explains why engines often stack two layers of density. First they apply semantic encodings such as prefix compression, delta encoding, or short integer encodings. Then they run a byte-level codec over the whole block. The first layer exploits domain structure the codec does not know about; the second layer squeezes the remaining repeated byte patterns. Together they reduce I/O without giving up fast in-block navigation.
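
A rough illustration of the two layers, using zlib as a stand-in for LZ4/Snappy/Zstd because it ships with Python's standard library; the generated keys and the resulting sizes are only indicative, not a benchmark.

import zlib

# ~200 illustrative quote keys that share long prefixes (sorted, like one block's worth).
keys = [
    f"desk=muni|cusip=59333LAE2|ts=2026-03-31T14:32:{10 + i // 10:02d}.{i:03d}Z".encode()
    for i in range(200)
]

raw = b"".join(keys)

# Layer 1: semantic encoding -- frame each key as (shared_len, suffix_len, suffix),
# one byte per length, which is enough for these short keys.
encoded = bytearray()
prev = b""
for k in keys:
    n = 0
    while n < min(len(prev), len(k)) and prev[n] == k[n]:
        n += 1
    encoded += bytes([n, len(k) - n]) + k[n:]
    prev = k

# Layer 2: byte-level codec over the already-dense block.
print("raw:", len(raw),
      "prefix-encoded:", len(encoded),
      "prefix-encoded + codec:", len(zlib.compress(bytes(encoded))))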

Concept 3: Codec choice, block size, and restart interval are operational budget decisions

Suppose Harbor Point enables Zstd on all SSTables because benchmarks show a better compression ratio than LZ4. The disk footprint drops, but during the trading day the hot levels now spend more CPU decompressing small blocks on every point lookup. The block cache also behaves differently: fewer bytes per block means more logical data can fit, but each miss is more expensive to inflate. Whether that is a win depends on the real workload, not on the compression ratio alone.

The same trade-off appears in block size. Larger blocks usually compress better because the codec sees more repeated material, and they reduce index overhead because fewer block handles are needed per file. They also make a miss more expensive because each point lookup pulls in more unrelated keys and values. Smaller blocks improve point-read locality and reduce over-read, but they create more index metadata and give the codec less context. Harbor Point's quote-probe workload often prefers smaller or medium blocks on hot data even if archival levels use larger ones.

Restart interval tuning is a third dial. If full keys appear every 16 entries, Harbor Point pays extra metadata overhead but reconstructs only a short prefix-compressed run during a seek. If full keys appear every 128 entries, the block shrinks, but CPU work per seek rises sharply because the reader may rebuild many keys before it can compare the target. That cost is easy to miss in synthetic throughput tests and obvious in p99 latency once production point probes dominate.
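
A back-of-the-envelope model of those dials looks like this; the per-key CPU cost and record size below are assumed round numbers, not measurements.

record_bytes = 120            # assumed average encoded entry size
key_rebuild_us = 0.2          # assumed CPU cost to reconstruct one prefix-compressed key

for block_kb in (4, 16, 64):
    for restart_interval in (16, 128):
        bytes_read = block_kb * 1024            # one data block read per point-lookup miss
        over_read = bytes_read - record_bytes   # bytes pulled in beyond the target record
        keys_rebuilt = restart_interval / 2     # expected reconstructions after the restart binary search
        print(f"block={block_kb:>2}KiB  restart_interval={restart_interval:>3}  "
              f"over_read={over_read:>6}B  rebuild_cpu~{keys_rebuilt * key_rebuild_us:.1f}us")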

These choices leak into the write path too. Every flush or compaction must rebuild encoded blocks, compute checksums, and emit the chosen compressed representation. If the engine keeps many modified pages or blocks in memory, later writeback has to perform that work at eviction or checkpoint time. That is the bridge to the next lesson on checkpointing and dirty-page control: storage layout determines not only how much data is written, but how expensive it is to turn dirty in-memory state back into durable bytes.

Troubleshooting

Issue: Harbor Point's disk footprint improved after enabling stronger compression, but point lookups became slower.

Why it happens / is confusing: Teams often assume fewer bytes on disk must imply faster reads. For point-heavy workloads, the real bottleneck may shift from SSD I/O to CPU decompression on hot blocks.

Clarification / Fix: Measure read CPU, block-cache hit rate, and per-level latency separately. A common production fix is to use a lighter codec on hot levels and reserve denser compression for colder or bottommost data.
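
One hypothetical shape of that fix, expressed as a per-level codec policy; the level split and codec names are illustrative, not any specific engine's configuration API.

# Hypothetical policy: cheap or no codec where blocks are hot and rewritten often,
# denser codec where data is cold and read rarely.
codec_by_level = {
    0: "none",             # freshly flushed, rewritten by compaction soon
    1: "lz4",              # hot point-lookup levels: cheap to decompress per probe
    2: "lz4",
    "bottommost": "zstd",  # coldest, largest data: pay CPU once for density
}
print(codec_by_level)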

Issue: Seeks inside one SSTable block got slower after a format change, even though the compression ratio barely moved.

Why it happens / is confusing: The problem may not be the outer codec at all. A sparser restart interval or more aggressive prefix compression can increase the amount of key reconstruction needed after each binary search step.

Clarification / Fix: Inspect restart interval, average comparisons per seek, and the number of keys reconstructed per lookup. In point-probe workloads, a modest increase in metadata can be worth a large drop in CPU per seek.

Issue: Compaction throughput dropped after dictionary or block compression changes.

Why it happens / is confusing: Compression work happens on writes too. Flush and compaction threads must re-encode every output block, so a heavier format can move the bottleneck from storage bandwidth to background CPU.

Clarification / Fix: Compare compaction bytes-written, compaction CPU, and write-stall counters before and after the change. If background work is falling behind, reduce codec cost on frequently rewritten levels or increase compaction parallelism only if CPU headroom exists.

Advanced Connections

Connection 1: Storage-engine encoding ↔ columnar encoding

Columnar systems such as Parquet also exploit repetition with dictionary encoding, delta encoding, and varints. The difference is access shape. A columnar file is optimized for scanning many values from one column; Harbor Point's storage engine is optimized for point lookups, short seeks, and overwrite-friendly compaction. The encoding families overlap, but the chosen unit of compression is different because the query path is different.
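
For contrast, here is a minimal dictionary-encoding sketch in that columnar spirit, applied to a repetitive low-cardinality column; the venue labels are made up.

# Dictionary-encode a low-cardinality column into small integer codes plus one
# dictionary entry per distinct value.
venues = ["VENUE_A", "VENUE_B", "VENUE_B", "VENUE_C", "VENUE_B", "VENUE_A"] * 100

dictionary = {}
codes = []
for v in venues:
    codes.append(dictionary.setdefault(v, len(dictionary)))

raw_bytes = sum(len(v) for v in venues)
encoded_bytes = len(codes) + sum(len(v) for v in dictionary)   # one byte per code here
print(f"raw={raw_bytes}B  dictionary-encoded~{encoded_bytes}B  dictionary={dictionary}")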

Connection 2: Compression layout ↔ checkpoint and dirty-page policy

Compression is not independent from writeback control. If rebuilding a page or block into its compressed form is CPU-heavy, a checkpoint that flushes many dirty pages at once can create bursts of latency and background pressure. Systems with compressed pages, log-structured flush buffers, or bottommost recompression all face the same operational question: how much transformed state can remain dirty before flushing becomes disruptive?

Key Insights

  1. Encoding is structural; compression is residual - The engine first uses sorted-order facts such as shared prefixes and small integers, then applies a general codec to what remains.
  2. The decompression unit must match the lookup unit - Block-local compression works because the reader can decompress one candidate block rather than an entire file.
  3. Compression settings are workload policy, not free savings - Block size, restart interval, and codec strength move cost between SSD I/O, CPU, cache residency, and background write throughput.