Day 220: Snapshotting, Checkpointing, and Log Compaction

Logs are wonderful because they remember history. They are dangerous because they remember too much. Snapshotting, checkpointing, and compaction are the tools that let a system keep the value of history without paying forever to replay all of it.


Today's "Aha!" Moment

Once we start building systems around logs, one uncomfortable question appears quickly: can we really afford to replay the entire log from the beginning every time a process restarts or a replica joins?

In theory, maybe yes. In real systems, almost never.

That is why these three ideas exist, and the big aha is that they solve related but different problems.

These ideas are often mentioned together because all three reduce recovery cost. But they are not interchangeable.

A snapshot is about reconstructing state faster.
A checkpoint is about restarting computation safely.
A compacted log is about retaining the latest meaningful history per key while discarding older superseded entries.

Once we see that distinction, a lot of design confusion disappears. We stop asking "which one should we use?" and start asking the better question: "what, exactly, are we trying to make cheap: reconstructing state, resuming computation, or storing per-key history?"


Why This Matters

Imagine a service that maintains account state from an append-only event log: every deposit, withdrawal, and correction is appended as an event, and current balances are whatever replaying those events produces.

On day one, replay is easy. A few hundred events, no problem.

On day 500, replay from the beginning becomes painful: restarts take minutes or hours, new replicas lag far behind, and every recovery pays for the full history again.

That is where these techniques matter.

A snapshot might let us restore the current account state and replay only the tail of recent events. A checkpoint might let a stream processor restart from a durable state boundary instead of recomputing the world. Log compaction might let us retain the latest value for each account or key while shrinking old churn.

This matters because recovery speed is part of system design, not a postscript. The ability to restart quickly affects deployment cadence, failover time, replica catch-up, and how confidently operators can restart anything at all.

A system that is correct only if it can replay millions of entries from scratch is often a system that will feel brittle in production.


Learning Objectives

By the end of this session, you will be able to:

  1. Distinguish the three techniques clearly - Explain what snapshots, checkpoints, and log compaction each preserve and what they are not meant to do.
  2. Reason about recovery paths - Describe how systems rebuild state faster than "replay everything from zero."
  3. Choose the right reduction strategy - Match storage, replication, and stream-processing needs to the appropriate mechanism.

Core Concepts Explained

Concept 1: Snapshotting Trades Long Replay for a Durable State Summary

Concrete example / mini-scenario: A Raft-backed service keeps a replicated log of commands that mutate metadata. A new follower joins and needs to catch up.

Without snapshots, the follower might need to replay the entire command history from the first day the cluster existed.

With snapshots, the system can say: "here is the full state as of log index N; install it, then replay only the entries after N."

ASCII sketch:

old log:   1 2 3 4 ... 950000 950001 950002
snapshot: [state @ 950000]
replay:                       950001 950002

That is the central value of snapshotting: it turns an unbounded replay into a bounded one by summarizing everything before a known log position.

But a snapshot is not magic. It has to mean something precise: the complete state as of one specific log position, so that replaying the entries after that position yields exactly what full replay would have.

So the trade-off is: pay periodically in I/O and storage to materialize state, in exchange for cheap recovery and the ability to truncate old log entries.
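The catch-up path above can be sketched with a toy state machine. The command tuples, key names, and index arithmetic here are invented for illustration; the point is only the equivalence between full replay and snapshot-plus-tail:

```python
# Sketch: snapshot-based catch-up for a new replica. A snapshot is the
# complete state as of a log index; the follower installs it and then
# replays only the entries after that index.

def apply(state, cmd):
    key, value = cmd          # each command overwrites one key
    state[key] = value

# A long history of commands (toy format: (key, value) tuples).
log = [("k%d" % (i % 3), i) for i in range(1000)]

# The leader materializes a snapshot covering entries before index 998.
snapshot_index = 998
snapshot_state = {}
for cmd in log[:snapshot_index]:
    apply(snapshot_state, cmd)

# A new follower installs the snapshot, then replays only the tail.
follower = dict(snapshot_state)
for cmd in log[snapshot_index:]:
    apply(follower, cmd)

# Full replay from entry 0 produces the same state, far more slowly.
full = {}
for cmd in log:
    apply(full, cmd)
assert follower == full
```

The follower applies two tail entries plus one snapshot install instead of a thousand entries. The equivalence only holds because the snapshot is exact as of its index; a fuzzy or partial snapshot would break it.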

Concept 2: Checkpointing Is About Restarting Computation Safely, Not Just Saving State

This is where teams often blur terms.

In a stateful stream processor, we may care not only about current state, but also about the relationship between three things: how far we have read in the input, what state our operators have accumulated, and what output we have already produced.

A checkpoint says: "this operator state corresponds exactly to having consumed the input up to this position."

If the system crashes after that, it can restart from that boundary instead of recomputing arbitrarily from the past.

That is why checkpointing is so important in stream processing: it bounds how much input must be reprocessed after a failure, and it guarantees that the restored state and the restored read position agree with each other.

A useful mental picture is:

input position + operator state + recovery marker
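That picture can be made concrete with a toy streaming counter. The record shapes, offsets, and crash point below are invented for illustration; what matters is that the checkpoint binds operator state to the input offset it corresponds to:

```python
# Sketch: checkpoint-based restart for a streaming word counter.
# A checkpoint = (input offset, operator state) recorded together.

input_log = ["a", "b", "a", "c", "a", "b"]

def process(state, record):
    state[record] = state.get(record, 0) + 1

state = {}
checkpoint = None
for offset, rec in enumerate(input_log):
    process(state, rec)
    if offset == 2:
        # Durable checkpoint: state as of having consumed offsets 0..2.
        checkpoint = {"offset": offset + 1, "state": dict(state)}
    if offset == 3:
        break  # simulate a crash after processing offset 3

# Recovery: discard in-memory state, restore the checkpointed state,
# and resume reading from the recorded offset.
state = dict(checkpoint["state"])
for rec in input_log[checkpoint["offset"]:]:
    process(state, rec)

# The result matches an uninterrupted run over the whole log.
expected = {}
for rec in input_log:
    process(expected, rec)
assert state == expected
```

Note that the record at offset 3 is processed twice: once before the crash and once after recovery. Internal state stays correct because the restored state predates that record, but any externally visible output it produced would be duplicated, which is exactly the concern the next lesson on exactly-once and idempotency takes up.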

The subtle but important difference from snapshots is: a snapshot records what the state is, while a checkpoint also records where in the input that state came from, so an ongoing computation can resume consistently.

Sometimes the mechanism looks similar on disk, but the design intent is different.

That distinction becomes critical in the next lesson on exactly-once and idempotency, because recovery is not only about data structure size. It is about whether resuming computation replays work safely.

Concept 3: Log Compaction Keeps the Log Useful by Forgetting Superseded Versions

Some logs are not valuable because every historical write matters forever. They are valuable because readers need the latest known value per key, plus enough ordering to reconstruct or bootstrap state.

That is where log compaction helps.

Example:

user:42 -> email=a@x
user:17 -> email=b@y
user:42 -> email=c@z

If the goal is to rebuild the latest key-value state, we do not need every obsolete version forever in the hottest storage path.
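A minimal sketch of what compaction does to the log above (a single-pass toy, not a real segment-based compactor): keep only the latest record per key, preserving the relative order of the survivors.

```python
# Sketch: key-based log compaction. Superseded records are dropped;
# the latest record for each key survives in its original order.

log = [
    ("user:42", "email=a@x"),
    ("user:17", "email=b@y"),
    ("user:42", "email=c@z"),
]

def compact(log):
    latest = {}  # key -> index of its last occurrence in the log
    for i, (key, _) in enumerate(log):
        latest[key] = i
    return [entry for i, entry in enumerate(log) if latest[entry[0]] == i]

compacted = compact(log)
# user:42's older value is gone; both keys keep their latest value.
assert compacted == [("user:17", "email=b@y"), ("user:42", "email=c@z")]

# Rebuilding latest key-value state from the compacted log gives the
# same result as rebuilding it from the full log.
assert dict(log) == dict(compacted)
```

The second assertion is the whole point: latest-state reconstruction is preserved, while questions about user:42's previous email can no longer be answered from this log.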

Compaction says: "for each key, keep the most recent record and allow older, superseded records for that key to be discarded over time."

Important nuance: compaction works per key, not per time range. It removes superseded versions, not the log itself, and it typically runs in the background, so recently superseded records may linger until the next compaction pass.

That is why compacted logs are great for changelog topics, key-value bootstrap, and feeding caches or materialized views that only need the latest value per key.

But they are the wrong tool when the application truly needs a complete audit trail, every intermediate version, or time-travel queries over the full history.

Useful comparison:

Technique        Main promise
---------------  --------------------------------------------
Snapshot         "Here is state as of some point in history."
Checkpoint       "Here is a safe place to resume computation."
Log compaction   "Here is a log that keeps latest useful values."

That distinction is the core of the lesson.


Troubleshooting

Issue: "A snapshot and a checkpoint are basically the same thing."

Why it happens / is confusing: Both can involve writing state to durable storage.

Clarification / Fix: Ask what problem is being solved. If the goal is fast state reconstruction, think snapshot. If the goal is safe restart of an ongoing computation with aligned progress, think checkpoint.

Issue: "Log compaction means deleting history, so it breaks replay."

Why it happens / is confusing: People imagine compaction as arbitrary truncation.

Clarification / Fix: Compaction preserves a log form, but removes superseded records by key over time. It still supports rebuilding latest state, but not every possible historical question.

Issue: "If we have snapshots, we no longer need the log."

Why it happens / is confusing: Snapshots look like a full replacement for old entries.

Clarification / Fix: Most systems still need the tail after the snapshot, and often need the log as the authoritative update stream. A snapshot usually accelerates recovery; it does not erase the role of the log.


Advanced Connections

Connection 1: Logs and Clocks <-> Recovery Boundaries

The parallel: Logs and clocks help us describe history; snapshots and checkpoints help us stop replaying more history than necessary. Together they define how a system remembers enough without redoing everything.

Connection 2: Checkpointing <-> Exactly-Once Processing

The parallel: Checkpoints become especially important when the system wants to resume a stateful computation without duplicating externally visible work. That is why they connect directly to exactly-once claims, deduplication, and idempotent sinks.


Resources

Optional Deepening Resources


Key Insights

  1. All three techniques reduce replay cost, but in different ways - Snapshotting summarizes state, checkpointing preserves safe computation progress, and compaction removes obsolete per-key history.
  2. Recovery speed is part of correctness in practice - A system that cannot restart or catch up fast enough becomes operationally fragile even if its theory is sound.
  3. History is valuable, but not all history must stay equally expensive - Good systems decide what to keep hot, what to summarize, and what to compact.

Knowledge Check (Test Questions)

  1. Which technique is primarily about reconstructing current state faster from a long log?

    • A) Snapshotting
    • B) Vector clocks
    • C) Deduplication
  2. What makes checkpointing especially important in stateful stream processing?

    • A) It proves that no duplicates can ever occur anywhere in the system.
    • B) It aligns computation state with durable progress so the job can resume consistently.
    • C) It replaces all input logs permanently.
  3. What is the most accurate description of log compaction?

    • A) A way to delete an entire log once it grows too large.
    • B) A way to keep a log useful for latest-state reconstruction by removing older superseded records by key over time.
    • C) A way to serialize snapshots into a queue.

Answers

1. A: A snapshot is a durable summary of state at some point in history, which reduces how much log must be replayed afterward.

2. B: Checkpointing is valuable because it records a safe recovery boundary for an ongoing computation, not just raw data state.

3. B: Log compaction keeps the log shape but removes obsolete versions by key so the latest useful state can still be reconstructed efficiently.


