Day 220: Snapshotting, Checkpointing, and Log Compaction
Logs are wonderful because they remember history. They are dangerous because they remember too much. Snapshotting, checkpointing, and compaction are the tools that let a system keep the value of history without paying forever to replay all of it.
Today's "Aha!" Moment
Once we start building systems around logs, one uncomfortable question appears quickly:
- if history is the source of truth, do we really need to replay the whole history every time?
In theory, maybe yes. In real systems, almost never.
That is why these three ideas exist, and the big aha is that they solve related but different problems:
- snapshotting says: "here is the current state as of some point in the log"
- checkpointing says: "here is enough durable progress for this running computation to resume safely"
- log compaction says: "here is how we keep a log useful without storing every obsolete version forever"
These ideas are often mentioned together because all three reduce recovery cost. But they are not interchangeable.
A snapshot is about reconstructing state faster.
A checkpoint is about restarting computation safely.
A compacted log is about retaining the latest meaningful history per key while discarding older superseded entries.
Once we see that distinction, a lot of design confusion disappears. We stop asking "which one should we use?" and start asking the better question:
- "which kind of recovery or retention problem am I actually solving?"
Why This Matters
Imagine a service that maintains account state from an append-only event log:
- credits
- debits
- freezes
- compliance flags
On day one, replay is easy. A few hundred events, no problem.
On day 500, replay from the beginning becomes painful:
- crash recovery takes too long
- new replicas are slow to catch up
- stateful stream processors take forever to resume
- storage fills with ancient superseded values
- operational incidents get worse because "restart it" is no longer cheap
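The core of that pain is one loop: rebuilding state means replaying every event ever written. A minimal sketch, assuming a simple dict-based event format (the names `apply_event` and `recover_from_zero` are illustrative, not from any particular system):

```python
def apply_event(state: dict, event: dict) -> dict:
    """Apply one account event to the in-memory account map."""
    account = state.setdefault(event["account"], {"balance": 0, "frozen": False})
    if event["type"] == "credit":
        account["balance"] += event["amount"]
    elif event["type"] == "debit":
        account["balance"] -= event["amount"]
    elif event["type"] == "freeze":
        account["frozen"] = True
    return state

def recover_from_zero(log: list) -> dict:
    """Rebuild all account state by replaying the entire log, every restart."""
    state: dict = {}
    for event in log:  # cost grows with total history, not current state size
        state = apply_event(state, event)
    return state

log = [
    {"account": "42", "type": "credit", "amount": 100},
    {"account": "42", "type": "debit", "amount": 30},
    {"account": "17", "type": "credit", "amount": 5},
]
state = recover_from_zero(log)
```

Correctness is fine here; the problem is that the loop's cost is proportional to all of history, which is exactly what the techniques below attack.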
That is where these techniques matter.
A snapshot might let us restore the current account state and replay only the tail of recent events. A checkpoint might let a stream processor restart from a durable state boundary instead of recomputing the world. Log compaction might let us retain the latest value for each account or key while shrinking old churn.
This matters because recovery speed is part of system design, not a postscript. The ability to restart quickly affects:
- failover time
- deployment safety
- autoscaling speed
- storage cost
- how much history is still practically usable
A system that is correct only if it can replay millions of entries from scratch is often a system that will feel brittle in production.
Learning Objectives
By the end of this session, you will be able to:
- Distinguish the three techniques clearly - Explain what snapshots, checkpoints, and log compaction each preserve and what they are not meant to do.
- Reason about recovery paths - Describe how systems rebuild state faster than "replay everything from zero."
- Choose the right reduction strategy - Match storage, replication, and stream-processing needs to the appropriate mechanism.
Core Concepts Explained
Concept 1: Snapshotting Trades Long Replay for a Durable State Summary
Concrete example / mini-scenario: A Raft-backed service keeps a replicated log of commands that mutate metadata. A new follower joins and needs to catch up.
Without snapshots, the follower might need to replay the entire command history from the first day the cluster existed.
With snapshots, the system can say:
- "here is the state machine as of log index 950000"
- "now replay only entries after that"
ASCII sketch:
old log: 1 2 3 4 ... 950000 950001 950002
snapshot: [state @ 950000]
replay: 950001 950002
That is the central value of snapshotting:
- reduce restart and catch-up cost
- bound how much old history must remain operationally hot
- allow replicated state machines to compact long prefixes of already-applied history
But a snapshot is not magic. It has to mean something precise:
- which point in history it represents
- which state it contains
- whether it is consistent
- what tail of the log must still be replayed afterward
So the trade-off is:
- faster recovery and smaller active history
- in exchange for snapshot creation cost, coordination, storage, and careful consistency rules
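The "state plus tail" recovery path above can be sketched in a few lines. This is a hedged toy model, not Raft's actual InstallSnapshot mechanics: log entries are simple (key, value) commands, and `last_applied` marks the log index the snapshot covers.

```python
def take_snapshot(state: dict, last_applied: int) -> dict:
    """Pair a copy of the state machine with the log index it covers."""
    return {"state": dict(state), "last_applied": last_applied}

def recover_with_snapshot(snapshot: dict, log: list) -> dict:
    """Install the snapshot, then replay only the tail after its index."""
    state = dict(snapshot["state"])
    for key, value in log[snapshot["last_applied"]:]:
        state[key] = value  # each entry is a simple (key, value) command
    return state

log = [("region", "us-east"), ("leader", "node-3"), ("region", "eu-west")]
snap = take_snapshot({"region": "us-east", "leader": "node-3"}, last_applied=2)
state = recover_with_snapshot(snap, log)  # replays only the third entry
```

Note that the snapshot must record exactly which index it represents; without `last_applied`, we would not know where safe tail replay begins.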
Concept 2: Checkpointing Is About Restarting Computation Safely, Not Just Saving State
This is where teams often blur terms.
In a stateful stream processor, we may care not only about current state, but also about the relationship between:
- operator state
- input offsets
- output side effects
A checkpoint says:
- "this running computation reached a durable recovery boundary"
If the system crashes after that, it can restart from that boundary instead of recomputing arbitrarily from the past.
That is why checkpointing is so important in stream processing:
- operator state must match consumed input progress
- otherwise the system may resume with duplicated or missing work
A useful mental picture is:
input position + operator state + recovery marker
The subtle but important difference from snapshots is:
- a snapshot often summarizes what state is true
- a checkpoint often preserves where a computation can resume consistently
Sometimes the mechanism looks similar on disk, but the design intent is different.
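That pairing of input position and operator state can be made concrete with a toy key-counting job. This is a minimal sketch under simplifying assumptions (the `Checkpoint` class and `run` function are illustrative, and real systems persist the checkpoint durably rather than holding it in memory):

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    input_offset: int     # how far into the input log we durably consumed
    operator_state: dict  # e.g. running counts per key, as of that offset

def run(records: list, ckpt: Checkpoint, checkpoint_every: int = 2):
    """Count keys, capturing state and offset together at each boundary."""
    state = dict(ckpt.operator_state)
    for offset in range(ckpt.input_offset, len(records)):
        key = records[offset]
        state[key] = state.get(key, 0) + 1
        if (offset + 1) % checkpoint_every == 0:
            # State and input offset are captured as one unit; that pairing
            # is what lets a restart neither duplicate nor skip work.
            ckpt = Checkpoint(offset + 1, dict(state))
    return state, ckpt

records = ["a", "a", "b", "a", "b"]
full_state, ckpt = run(records, Checkpoint(0, {}))
# Simulate a crash after the last checkpoint: resuming from ckpt instead
# of offset 0 reproduces the same result as the uninterrupted run.
resumed_state, _ = run(records, ckpt)
```

If the offset and the state were saved separately and got out of sync, the resumed run would double-count or drop records; saving them as one unit is the whole point.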
That distinction becomes critical in the next lesson on exactly-once and idempotency, because recovery is not only about data structure size. It is about whether resuming computation replays work safely.
Concept 3: Log Compaction Keeps the Log Useful by Forgetting Superseded Versions
Some logs are not valuable because every historical write matters forever. They are valuable because readers need the latest known value per key, plus enough ordering to reconstruct or bootstrap state.
That is where log compaction helps.
Example:
user:42 -> email=a@x
user:17 -> email=b@y
user:42 -> email=c@z
If the goal is to rebuild the latest key-value state, we do not need every obsolete version forever in the hottest storage path.
Compaction says:
- keep enough of the log structure to rebuild the latest state
- allow older superseded records for the same key to be discarded eventually
Important nuance:
- compaction is not the same as deleting all old history immediately
- compaction is not the same as a snapshot
- compaction is usually key-based, not arbitrary state summarization
That is why compacted logs are great for:
- changelog topics
- metadata replication
- restoring latest state stores
But they are the wrong tool when the application truly needs:
- every historical event forever
- full audit trails in the hot path
- exact event history for each mutation without loss of older versions
Useful comparison:
Technique Main promise
--------------- --------------------------------------------
Snapshot "Here is state as of some point in history."
Checkpoint "Here is a safe place to resume computation."
Log compaction "Here is a log that keeps latest useful values."
That distinction is the core of the lesson.
Troubleshooting
Issue: "A snapshot and a checkpoint are basically the same thing."
Why it happens / is confusing: Both can involve writing state to durable storage.
Clarification / Fix: Ask what problem is being solved. If the goal is fast state reconstruction, think snapshot. If the goal is safe restart of an ongoing computation with aligned progress, think checkpoint.
Issue: "Log compaction means deleting history, so it breaks replay."
Why it happens / is confusing: People imagine compaction as arbitrary truncation.
Clarification / Fix: Compaction preserves a log form, but removes superseded records by key over time. It still supports rebuilding latest state, but not every possible historical question.
Issue: "If we have snapshots, we no longer need the log."
Why it happens / is confusing: Snapshots look like a full replacement for old entries.
Clarification / Fix: Most systems still need the tail after the snapshot, and often need the log as the authoritative update stream. A snapshot usually accelerates recovery; it does not erase the role of the log.
Advanced Connections
Connection 1: Logs and Clocks <-> Recovery Boundaries
The parallel: Logs and clocks help us describe history; snapshots and checkpoints help us stop replaying more history than necessary. Together they define how a system remembers enough without redoing everything.
Connection 2: Checkpointing <-> Exactly-Once Processing
The parallel: Checkpoints become especially important when the system wants to resume a stateful computation without duplicating externally visible work. That is why they connect directly to exactly-once claims, deduplication, and idempotent sinks.
Resources
Optional Deepening Resources
- [DOC] Apache Kafka Documentation: Log Compaction
- [PAPER] In Search of an Understandable Consensus Algorithm (Raft)
- [DOC] Apache Flink Documentation: Stateful Stream Processing and Fault Tolerance
- [BOOK] Designing Data-Intensive Applications
Key Insights
- All three techniques reduce replay cost, but in different ways - Snapshotting summarizes state, checkpointing preserves safe computation progress, and compaction removes obsolete per-key history.
- Recovery speed is part of correctness in practice - A system that cannot restart or catch up fast enough becomes operationally fragile even if its theory is sound.
- History is valuable, but not all history must stay equally expensive - Good systems decide what to keep hot, what to summarize, and what to compact.
Knowledge Check (Test Questions)
1. Which technique is primarily about reconstructing current state faster from a long log?
- A) Snapshotting
- B) Vector clocks
- C) Deduplication
2. What makes checkpointing especially important in stateful stream processing?
- A) It proves that no duplicates can ever occur anywhere in the system.
- B) It aligns computation state with durable progress so the job can resume consistently.
- C) It replaces all input logs permanently.
3. What is the most accurate description of log compaction?
- A) A way to delete an entire log once it grows too large.
- B) A way to keep a log useful for latest-state reconstruction by removing older superseded records by key over time.
- C) A way to serialize snapshots into a queue.
Answers
1. A: A snapshot is a durable summary of state at some point in history, which reduces how much log must be replayed afterward.
2. B: Checkpointing is valuable because it records a safe recovery boundary for an ongoing computation, not just raw data state.
3. B: Log compaction keeps the log shape but removes obsolete versions by key so the latest useful state can still be reconstructed efficiently.