LESSON
Day 469: Compression, Checksums, and Corruption Handling
The core idea: Compression changes how bytes are laid out, checksums decide when those bytes can be trusted, and corruption handling determines whether the system recovers cleanly or serves nonsense.
Today's "Aha!" Moment
After 020.md, PayLedger stopped wasting database work on repeated dashboard reads. That solved one pressure, but it exposed another: the platform must retain 18 months of payout evidence for audits, chargeback disputes, and incident replay. Recent payroll runs stay on NVMe in an LSM-backed settlement-history service so support can drill into one employee's payout in seconds. Older runs are shipped to object storage. By March, the hot tier is growing fast enough that the team is considering aggressive compression everywhere.
The easy mistake is to treat compression as a storage-only knob. PayLedger tried a first pass that gzipped full segment files before upload and used a single file hash after the fact. Storage cost went down, but a restore drill later failed halfway through one archive, and an on-call engineer discovered that a single damaged byte forced the system to reject an entire 64 MiB file. Worse, the hot tier had no fast way to tell whether a failed read came from a bad block, a bad upload, or a buggy decompressor.
The important shift is that compression, checksums, and corruption handling are one design problem. Compression decides the unit of work for reads and writes. Checksums define the trust boundary at each unit. Corruption handling is the playbook for what happens when trust is lost. If those three decisions are made separately, the system usually gets a good benchmark and a bad incident.
Why This Matters
Production systems pay for bytes more than once: on SSD, over the network, during compaction, in snapshots, and during restore. Compression can cut those costs dramatically, but the trade-off is CPU time and more complicated failure behavior. Whole-file compression often maximizes ratio, yet it destroys locality for random reads and enlarges the blast radius of a single bad byte. Smaller compressed blocks preserve locality, but they cost more metadata and usually compress less efficiently.
Checksums matter because modern storage failures are often silent until a read or restore touches the wrong bytes. A torn write, flaky controller, partial multipart upload, bad RAM, or buggy codec upgrade may not announce itself as "corruption" in a friendly way. It may surface as an impossible record count, a decompression error during compaction, or a replay job that quietly skips data. A checksum turns that ambiguity into a binary statement: these are not the bytes we intended to store.
That is why corruption handling belongs in the lesson, not just compression and checksums. Detection without a repair path only upgrades silent corruption into loud downtime. PayLedger needs a concrete policy: retry from a replica, refetch from object storage, quarantine the bad segment, rebuild derived indexes, and page a human only when no clean copy remains. The previous lesson asked which copy is authoritative. This lesson asks whether the bytes in each copy can still be believed. The next lesson, 022.md, builds on that same trust model when hot state moves primarily into memory and durability shifts to logs and snapshots.
Core Walkthrough
The storage layout that fits the workload
PayLedger stores payout history as immutable segment files. Each segment covers one tenant and one payroll run, with records sorted by employee_id so support can seek into one employee's history without scanning the entire run. The hot tier keeps the last 14 days on local NVMe for fast investigation traffic. Nightly archival re-encodes older segments for object storage.
The team chooses two codecs because the workload has two very different pressures. Hot segments use LZ4 on 32 KiB blocks because random reads and compaction speed matter more than squeezing every byte. Cold archives use Zstandard because restore and compliance export jobs are batch-oriented and can afford more CPU in exchange for lower storage and transfer cost. That is the first trade-off: one codec for everything looks operationally neat, but it often forces the hot tier to pay cold-storage CPU costs or forces the archive tier to waste space.
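Before committing to those defaults, it is worth measuring both codecs on a sample of real segment bytes. The sketch below is a rough harness, not a benchmark suite; it assumes the python lz4 and zstandard packages, and the synthetic sample is only a stand-in for real payout records:

import time
import lz4.frame   # pip install lz4
import zstandard   # pip install zstandard

# Stand-in payload; ratios on synthetic data are only directionally useful.
sample = b"\n".join(b"employee=%06d run=2024-03 amount=1337" % i for i in range(10_000))

def measure(name, compress):
    start = time.perf_counter()
    out = compress(sample)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(out) / len(sample):.2%} of original, {elapsed * 1e3:.1f} ms")

measure("lz4", lz4.frame.compress)                               # hot tier: fast, lighter ratio
measure("zstd-12", zstandard.ZstdCompressor(level=12).compress)  # cold tier: slower, denser

On typical structured text, LZ4 gives up several points of ratio in exchange for a large speed advantage, which is exactly the hot-tier bargain described above.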
A simplified hot-segment layout looks like this:
segment manifest
-> block index
-> block 0: header(codec, compressed_len, raw_len, crc32c_compressed, crc32c_raw)
-> block 0 bytes
-> block 1: header(...)
-> block 1 bytes
-> ...
This layout is more verbose than "gzip the whole file," but it changes the system's behavior in production. A support read that needs one employee can seek to the relevant block, verify a small checksum, decompress only that block, and continue. If one block is damaged, the system can isolate the failure to that block instead of condemning the entire segment immediately.
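To make the block header concrete, here is one possible fixed-width encoding. The field order, widths, and codec-id mapping are illustrative assumptions, not PayLedger's actual on-disk format:

import struct

# codec id (1 byte), compressed_len and raw_len (4 bytes each),
# crc32c_compressed and crc32c_raw (4 bytes each) -> 17 bytes per block.
BLOCK_HEADER = struct.Struct("<BIIII")
CODEC_IDS = {"lz4": 1, "zstd": 2}   # assumed mapping

def pack_header(codec: str, compressed_len: int, raw_len: int,
                crc_compressed: int, crc_raw: int) -> bytes:
    return BLOCK_HEADER.pack(CODEC_IDS[codec], compressed_len, raw_len,
                             crc_compressed, crc_raw)

With a fixed 17-byte header, a reader can jump to a block's offset from the block index, learn exactly how many payload bytes to fetch, and know which checksums to verify before decompressing anything.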
Where checksums belong and what they prove
A checksum is only meaningful if it protects the bytes at a real trust boundary. PayLedger uses two checksum layers for each hot block:
- crc32c_compressed covers the exact bytes written to disk or uploaded over the network.
- crc32c_raw covers the decompressed payload that query code will actually parse.
That split is deliberate. The compressed checksum catches storage and transport damage before the codec is asked to interpret garbage. The raw checksum catches cases where decompression succeeds but the payload is still wrong because of memory corruption, truncation bugs, or an incompatible codec change. Neither checksum proves the business meaning of the records; they prove byte identity at two different boundaries.
The write path makes those boundaries explicit:
from dataclasses import dataclass
import crc32c      # pip install crc32c
import lz4.frame   # pip install lz4

@dataclass
class EncodedBlock:
    codec: str
    raw_len: int
    compressed_len: int
    raw_crc: int
    compressed_crc: int
    payload: bytes

def encode_block(records: list[bytes]) -> EncodedBlock:
    raw = b"".join(records)
    raw_crc = crc32c.crc32c(raw)                 # protects the decoded payload
    compressed = lz4.frame.compress(raw)
    compressed_crc = crc32c.crc32c(compressed)   # protects the stored bytes
    return EncodedBlock(
        codec="lz4",
        raw_len=len(raw),
        compressed_len=len(compressed),
        raw_crc=raw_crc,
        compressed_crc=compressed_crc,
        payload=compressed,
    )
The read path validates in the reverse order:
class CorruptBlock(Exception):
    """Signals a checksum mismatch; the block must not be served."""

def decode_block(block: EncodedBlock) -> bytes:
    # Verify the stored bytes before the codec is asked to interpret them.
    if crc32c.crc32c(block.payload) != block.compressed_crc:
        raise CorruptBlock("stored bytes changed")
    raw = lz4.frame.decompress(block.payload)
    # Verify the decoded payload before query code parses it.
    if crc32c.crc32c(raw) != block.raw_crc:
        raise CorruptBlock("decoded payload does not match original bytes")
    return raw
This is also why "the object store already gives us an ETag" is not enough. An object-store checksum proves something about the uploaded object as the service received it. It does not localize which 32 KiB block is damaged inside a larger segment, and it does not help the hot tier detect a page-level issue on local NVMe. Systems that care about recovery speed usually need checksums at more than one layer because corruption can be introduced after a higher-level hash was computed.
What corruption handling looks like in production
The failure policy is where many systems stay vague. PayLedger does not treat a checksum mismatch as a retryable parse error, and it never serves a partially trusted block. The response is staged, as the sketch after this list shows:
- Fail the local read of the suspect block immediately and increment a corruption metric with tenant_id, run_id, segment_id, and block offset.
- Try the same block from a clean replica or from the archived object copy if the local tier is suspect.
- If the alternate copy passes validation, return the reconstructed data and quarantine the local segment for background replacement.
- If every copy fails, mark the payroll run's history as degraded, stop serving unverifiable data, and raise a human-visible incident.
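A minimal sketch of that staged policy, reusing EncodedBlock, decode_block, and CorruptBlock from the read path above. The fetcher callables and the on_corrupt hook are hypothetical stand-ins for PayLedger's local, replica, and archive clients:

from typing import Callable, Iterable, Optional

class DegradedHistory(Exception):
    """No verifiable copy of the block remains anywhere."""

def read_block_trusted(
    fetchers: Iterable[tuple[str, Callable[[], Optional[EncodedBlock]]]],
    on_corrupt: Callable[[str], None],
) -> bytes:
    # fetchers is tried in order: ("local", ...), ("replica", ...), ("archive", ...).
    for source, fetch in fetchers:
        block = fetch()
        if block is None:
            continue                      # this tier has no copy
        try:
            return decode_block(block)    # both CRC layers from the read path
        except CorruptBlock:
            on_corrupt(source)            # metric + quarantine hook per tier
    raise DegradedHistory("stop serving this run and page a human")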
That policy turns corruption from a mystery into an operational branch. It also makes background scrubbing worth the effort. A low-priority scanner that reads blocks, verifies checksums, and compares manifest metadata can detect latent corruption before a support engineer hits the one damaged block during an incident. Scrubbing spends I/O and CPU, but the trade-off is lower surprise and faster repair when the platform is under stress.
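A scrubber can reuse exactly the same validation path as foreground reads. This sketch assumes the caller supplies an iterable of EncodedBlock values for one segment:

from typing import Iterable

def scrub_segment(blocks: Iterable[EncodedBlock]) -> list[int]:
    """Low-priority verification pass; returns indexes of damaged blocks."""
    damaged = []
    for i, block in enumerate(blocks):
        try:
            decode_block(block)           # same checks as the foreground read
        except CorruptBlock:
            damaged.append(i)             # candidates for targeted repair
    return damaged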
Compression settings also change the repair story. With whole-file compression, one bad byte can make a large archive unreadable and force a full refetch. With block compression, the system can often replace only the damaged block or rebuild one segment from the immutable event log. Better recovery granularity is frequently worth a slightly worse compression ratio, especially for user-facing analytical or support workloads that need partial reads.
Failure Modes and Misconceptions
"Compression and integrity are separate concerns." That is true only on slides. In a real engine, the compression unit determines the failure unit. Choosing whole-file compression means corruption is discovered and repaired at whole-file scope. Choosing block compression means corruption can be localized and retried at block scope.
"A checksum means the data is safe." A checksum only answers "do these bytes still match the bytes we intended to write?" It does not recreate missing data. Without a second clean copy or an authoritative source log, checksums help the system fail loudly, but they do not provide recovery.
"If decompression succeeds, the payload must be correct." Some corruptions show up as decoder errors, but others can yield plausible output or truncated output. That is why PayLedger keeps the raw checksum and explicit length in the block header instead of trusting the codec alone.
"Scrubbing can wait until a user hits the bad data." Delayed detection makes incidents longer and repair decisions riskier. Background verification is operational overhead, but the trade-off is catching latent corruption on a calm Tuesday instead of during payroll cutoff.
Connections
- 020.md established that every cache layer is a copy of state with its own freshness contract. This lesson extends that idea one layer down: every compressed copy also needs a trust contract.
- 019.md focused on storage layout and access paths. Compression choices sit directly on top of those access paths because block size and codec selection change read amplification and compaction cost.
- 022.md will reuse the same integrity logic for snapshot and log files in memory-first systems, where the hot path is in RAM but recovery still depends on trustworthy persisted bytes.
Resources
- [BOOK] Designing Data-Intensive Applications
- Focus: Read the sections on storage engines and data integrity together; the trade-off is not just compression ratio but also how failures are detected and repaired.
- [DOC] PostgreSQL Data Checksums
- Focus: See how a production database treats page-level corruption detection as an explicit operational feature rather than an implementation detail.
- [RFC] RFC 8878: Zstandard Compression and the application/zstd Media Type
- Focus: Pay attention to frame structure and integrity features so compression format details do not stay abstract.
- [DOC] RocksDB Compression
- Focus: Compare codec choices in an LSM engine where write amplification, compaction CPU, and read locality all matter.
- [DOC] Amazon S3 Checking Object Integrity
- Focus: Object-store checksums are useful, but note what they prove and what they do not localize inside a larger data layout.
Key Takeaways
- Compression changes failure granularity, not just storage cost. The unit you compress is usually the unit you can validate, retry, and repair.
- Checksums must sit at real trust boundaries. Protect stored bytes before decompression and the decoded payload before query code trusts it.
- Corruption handling needs a repair path. Detection without a clean replica, archive copy, or rebuild source only tells you when to stop serving data.
- Block-level designs often win for operational reasons. They sacrifice some ratio, but the trade-off buys locality, targeted recovery, and faster incident response.