Day 220: Snapshotting, Checkpointing, and Log Compaction
Logs are wonderful because they remember history. They are dangerous because they remember too much. Snapshotting, checkpointing, and compaction are the tools that let a system keep the value of history without paying forever to replay all of it.
Today's "Aha!" Moment
Once we start building systems around logs, one uncomfortable question appears quickly:
- if history is the source of truth, do we really need to replay the whole history every time?
In theory, maybe yes. In real systems, almost never.
That is why these three ideas exist, and the big aha is that they solve related but different problems:
- snapshotting says: "here is the current state as of some point in the log"
- checkpointing says: "here is enough durable progress for this running computation to resume safely"
- log compaction says: "here is how we keep a log useful without storing every obsolete version forever"
These ideas are often mentioned together because all three reduce recovery cost. But they are not interchangeable.
A snapshot is about reconstructing state faster.
A checkpoint is about restarting computation safely.
A compacted log is about retaining the latest meaningful history per key while discarding older superseded entries.
Once we see that distinction, a lot of design confusion disappears. We stop asking "which one should we use?" and start asking the better question:
- "which kind of recovery or retention problem am I actually solving?"
Why This Matters
Imagine a service that maintains account state from an append-only event log:
- credits
- debits
- freezes
- compliance flags
On day one, replay is easy. A few hundred events, no problem.
On day 500, replay from the beginning becomes painful:
- crash recovery takes too long
- new replicas are slow to catch up
- stateful stream processors take forever to resume
- storage fills with ancient superseded values
- operational incidents get worse because "restart it" is no longer cheap
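The core of that pain is one loop: rebuilding state means replaying every event ever written. A minimal sketch, assuming a simple dict-based event format (the names `apply_event` and `recover_from_zero` are illustrative, not from any particular system):

```python
def apply_event(state: dict, event: dict) -> dict:
    """Apply one account event to the in-memory account map."""
    account = state.setdefault(event["account"], {"balance": 0, "frozen": False})
    if event["type"] == "credit":
        account["balance"] += event["amount"]
    elif event["type"] == "debit":
        account["balance"] -= event["amount"]
    elif event["type"] == "freeze":
        account["frozen"] = True
    return state

def recover_from_zero(log: list) -> dict:
    """Rebuild all account state by replaying the entire log, every restart."""
    state: dict = {}
    for event in log:  # cost grows with total history, not current state size
        state = apply_event(state, event)
    return state

log = [
    {"account": "42", "type": "credit", "amount": 100},
    {"account": "42", "type": "debit", "amount": 30},
    {"account": "17", "type": "credit", "amount": 5},
]
state = recover_from_zero(log)
```

Correctness is fine here; the problem is that the loop's cost is proportional to all of history, which is exactly what the techniques below attack.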
That is where these techniques matter.
A snapshot might let us restore the current account state and replay only the tail of recent events. A checkpoint might let a stream processor restart from a durable state boundary instead of recomputing the world. Log compaction might let us retain the latest value for each account or key while shrinking old churn.
This matters because recovery speed is part of system design, not a postscript. The ability to restart quickly affects:
- failover time
- deployment safety
- autoscaling speed
- storage cost
- how much history is still practically usable
A system that is correct only if it can replay millions of entries from scratch is often a system that will feel brittle in production.
Learning Objectives
By the end of this session, you will be able to:
- Distinguish the three techniques clearly - Explain what snapshots, checkpoints, and log compaction each preserve and what they are not meant to do.
- Reason about recovery paths - Describe how systems rebuild state faster than "replay everything from zero."
- Choose the right reduction strategy - Match storage, replication, and stream-processing needs to the appropriate mechanism.
Core Concepts Explained
Concept 1: Snapshotting Trades Long Replay for a Durable State Summary
Concrete example / mini-scenario: A Raft-backed service keeps a replicated log of commands that mutate metadata. A new follower joins and needs to catch up.
Without snapshots, the follower might need to replay the entire command history from the first day the cluster existed.
With snapshots, the system can say:
- "here is the state machine as of log index 950000"
- "now replay only entries after that"
ASCII sketch:
old log: 1 2 3 4 ... 950000 950001 950002
snapshot: [state @ 950000]
replay: 950001 950002
That is the central value of snapshotting:
- reduce restart and catch-up cost
- bound how much old history must remain operationally hot
- allow replicated state machines to compact long prefixes of already-applied history
But a snapshot is not magic. It has to mean something precise:
- which point in history it represents
- which state it contains
- whether it is consistent
- what tail of the log must still be replayed afterward
So the trade-off is:
- faster recovery and smaller active history
- in exchange for snapshot creation cost, coordination, storage, and careful consistency rules
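The "state plus tail" recovery path above can be sketched in a few lines. This is a hedged toy model, not Raft's actual InstallSnapshot mechanics: log entries are simple (key, value) commands, and `last_applied` marks the log index the snapshot covers.

```python
def take_snapshot(state: dict, last_applied: int) -> dict:
    """Pair a copy of the state machine with the log index it covers."""
    return {"state": dict(state), "last_applied": last_applied}

def recover_with_snapshot(snapshot: dict, log: list) -> dict:
    """Install the snapshot, then replay only the tail after its index."""
    state = dict(snapshot["state"])
    for key, value in log[snapshot["last_applied"]:]:
        state[key] = value  # each entry is a simple (key, value) command
    return state

log = [("region", "us-east"), ("leader", "node-3"), ("region", "eu-west")]
snap = take_snapshot({"region": "us-east", "leader": "node-3"}, last_applied=2)
state = recover_with_snapshot(snap, log)  # replays only the third entry
```

Note that the snapshot must record exactly which index it represents; without `last_applied`, we would not know where safe tail replay begins.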
Concept 2: Checkpointing Is About Restarting Computation Safely, Not Just Saving State
This is where teams often blur terms.
In a stateful stream processor, we may care not only about current state, but also about the relationship between:
- operator state
- input offsets
- output side effects
A checkpoint says:
- "this running computation reached a durable recovery boundary"
If the system crashes after that, it can restart from that boundary instead of recomputing arbitrarily from the past.
That is why checkpointing is so important in stream processing:
- operator state must match consumed input progress
- otherwise the system may resume with duplicated or missing work
A useful mental picture is:
input position + operator state + recovery marker
The subtle but important difference from snapshots is:
- a snapshot often summarizes what state is true
- a checkpoint often preserves where a computation can resume consistently
Sometimes the mechanism looks similar on disk, but the design intent is different.
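That pairing of input position and operator state can be made concrete with a toy key-counting job. This is a minimal sketch under simplifying assumptions (the `Checkpoint` class and `run` function are illustrative, and real systems persist the checkpoint durably rather than holding it in memory):

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    input_offset: int     # how far into the input log we durably consumed
    operator_state: dict  # e.g. running counts per key, as of that offset

def run(records: list, ckpt: Checkpoint, checkpoint_every: int = 2):
    """Count keys, capturing state and offset together at each boundary."""
    state = dict(ckpt.operator_state)
    for offset in range(ckpt.input_offset, len(records)):
        key = records[offset]
        state[key] = state.get(key, 0) + 1
        if (offset + 1) % checkpoint_every == 0:
            # State and input offset are captured as one unit; that pairing
            # is what lets a restart neither duplicate nor skip work.
            ckpt = Checkpoint(offset + 1, dict(state))
    return state, ckpt

records = ["a", "a", "b", "a", "b"]
full_state, ckpt = run(records, Checkpoint(0, {}))
# Simulate a crash after the last checkpoint: resuming from ckpt instead
# of offset 0 reproduces the same result as the uninterrupted run.
resumed_state, _ = run(records, ckpt)
```

If the offset and the state were saved separately and got out of sync, the resumed run would double-count or drop records; saving them as one unit is the whole point.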
That distinction becomes critical in the next lesson on exactly-once and idempotency, because recovery is not only about data structure size. It is about whether resuming computation replays work safely.
Concept 3: Log Compaction Keeps the Log Useful by Forgetting Superseded Versions
Some logs are not valuable because every historical write matters forever. They are valuable because readers need the latest known value per key, plus enough ordering to reconstruct or bootstrap state.
That is where log compaction helps.
Example:
user:42 -> email=a@x
user:17 -> email=b@y
user:42 -> email=c@z
If the goal is to rebuild the latest key-value state, we do not need every obsolete version forever in the hottest storage path.
Compaction says:
- keep enough of the log structure to rebuild the latest state
- allow older superseded records for the same key to be discarded eventually
Important nuance:
- compaction is not the same as deleting all old history immediately
- compaction is not the same as a snapshot
- compaction is usually key-based, not arbitrary state summarization
That is why compacted logs are great for:
- changelog topics
- metadata replication
- restoring latest state stores
But they are the wrong tool when the application truly needs:
- every historical event forever
- full audit trails in the hot path
- exact event history for each mutation without loss of older versions
Useful comparison:
Technique Main promise
--------------- --------------------------------------------
Snapshot "Here is state as of some point in history."
Checkpoint "Here is a safe place to resume computation."
Log compaction "Here is a log that keeps latest useful values."
That distinction is the core of the lesson.
Troubleshooting
Issue: "A snapshot and a checkpoint are basically the same thing."
Why it happens / is confusing: Both can involve writing state to durable storage.
Clarification / Fix: Ask what problem is being solved. If the goal is fast state reconstruction, think snapshot. If the goal is safe restart of an ongoing computation with aligned progress, think checkpoint.
Issue: "Log compaction means deleting history, so it breaks replay."
Why it happens / is confusing: People imagine compaction as arbitrary truncation.
Clarification / Fix: Compaction preserves a log form, but removes superseded records by key over time. It still supports rebuilding latest state, but not every possible historical question.
Issue: "If we have snapshots, we no longer need the log."
Why it happens / is confusing: Snapshots look like a full replacement for old entries.
Clarification / Fix: Most systems still need the tail after the snapshot, and often need the log as the authoritative update stream. A snapshot usually accelerates recovery; it does not erase the role of the log.
Advanced Connections
Connection 1: Logs and Clocks <-> Recovery Boundaries
The parallel: Logs and clocks help us describe history; snapshots and checkpoints help us stop replaying more history than necessary. Together they define how a system remembers enough without redoing everything.
Connection 2: Checkpointing <-> Exactly-Once Processing
The parallel: Checkpoints become especially important when the system wants to resume a stateful computation without duplicating externally visible work. That is why they connect directly to exactly-once claims, deduplication, and idempotent sinks.
Resources
Optional Deepening Resources
- [DOC] Apache Kafka Documentation: Log Compaction
- [PAPER] In Search of an Understandable Consensus Algorithm (Raft)
- [DOC] Apache Flink Documentation: Stateful Stream Processing and Fault Tolerance
- [BOOK] Designing Data-Intensive Applications
Key Insights
- All three techniques reduce replay cost, but in different ways - Snapshotting summarizes state, checkpointing preserves safe computation progress, and compaction removes obsolete per-key history.
- Recovery speed is part of correctness in practice - A system that cannot restart or catch up fast enough becomes operationally fragile even if its theory is sound.
- History is valuable, but not all history must stay equally expensive - Good systems decide what to keep hot, what to summarize, and what to compact.
Knowledge Check (Test Questions)
1. Which technique is primarily about reconstructing current state faster from a long log?
- A) Snapshotting
- B) Vector clocks
- C) Deduplication
2. What makes checkpointing especially important in stateful stream processing?
- A) It proves that no duplicates can ever occur anywhere in the system.
- B) It aligns computation state with durable progress so the job can resume consistently.
- C) It replaces all input logs permanently.
3. What is the most accurate description of log compaction?
- A) A way to delete an entire log once it grows too large.
- B) A way to keep a log useful for latest-state reconstruction by removing older superseded records by key over time.
- C) A way to serialize snapshots into a queue.
Answers
1. A: A snapshot is a durable summary of state at some point in history, which reduces how much log must be replayed afterward.
2. B: Checkpointing is valuable because it records a safe recovery boundary for an ongoing computation, not just raw data state.
3. B: Log compaction keeps the log shape but removes obsolete versions by key so the latest useful state can still be reconstructed efficiently.