Snapshotting, Checkpointing, and Log Compaction

Lesson 012, 30 min, intermediate

The core idea: Snapshotting, checkpointing, and log compaction all reduce the cost of replaying history, but each preserves a different recovery boundary, and the trade-off is between faster recovery, retained history, and stricter consistency rules.

Core Insight

A replicated service can start with a simple promise: every state change goes into an append-only log, and any replica can recover by replaying that log. At small scale, this feels clean. After months of membership changes, metadata updates, account events, or stream-processing records, "just replay the log" becomes a dangerous operational answer.

The log may still be the source of truth, but not every recovery path should begin at entry 1. New replicas need to catch up quickly. Crashed workers need to resume without duplicating work. Storage systems need to keep the latest useful state without paying forever for superseded values in the hot path.

Snapshotting, checkpointing, and log compaction are three ways to make history cheaper to use. They are easy to confuse because they all write something durable and all reduce replay. The difference is what each one preserves: a snapshot preserves state at a log point, a checkpoint preserves safe computational progress, and compaction preserves a log that can still rebuild latest state after older per-key versions are removed.

The Recovery Pressure

Consider a Raft-backed control-plane service that manages service discovery records, leases, and configuration. Every command is appended to a replicated log, then applied to an in-memory state machine. This gives a strong audit trail for the ordered decisions the cluster made.

Now a new follower joins after the cluster has processed 9 million commands. If it must receive and apply the entire log before it can serve, bootstrap may take minutes or hours. If an operator restarts several nodes during an incident, recovery time becomes part of the outage. A design that is theoretically correct but operationally unable to recover in time is not complete.

The system needs a way to say, "this is enough history for this recovery task." That boundary must be explicit. It must say what state is covered, which log entries are still needed, and what can be safely removed or replayed.

Snapshotting: State as of a Log Point

A snapshot is a durable summary of state at a specific point in the log. In a replicated state machine, that point is usually named by a log index and term, not by vague wall-clock time.

log:       1 2 3 4 ... 950000 950001 950002
snapshot: [state after applying 950000]
replay:                         950001 950002

The snapshot lets a recovering node restore the state machine directly, then replay only the tail after the snapshot. For a new follower, a leader can install a snapshot instead of sending every log entry from the beginning of time; this is also the only option once the prefix covered by the snapshot has been discarded from the leader's log.
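The restore-then-replay path can be sketched in a few lines of Python. This is a minimal illustration, not how any particular Raft library does it: the `Snapshot`, `Entry`, `apply`, and `recover` names are hypothetical, and the state machine is assumed to be a plain key-value map where each command sets one key.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Snapshot:
    last_index: int        # index of the last log entry folded into the snapshot
    last_term: int         # term of that entry
    state: Dict[str, str]  # full state-machine image as of last_index


@dataclass
class Entry:
    index: int
    term: int
    key: str
    value: str


def apply(state: Dict[str, str], entry: Entry) -> None:
    """Hypothetical state machine: each command sets one key."""
    state[entry.key] = entry.value


def recover(snapshot: Snapshot, log_tail: List[Entry]) -> Tuple[Dict[str, str], int]:
    """Restore from the snapshot, then replay only the entries after it."""
    state = dict(snapshot.state)  # copy so the snapshot itself stays untouched
    applied = snapshot.last_index
    for entry in log_tail:
        if entry.index <= applied:
            continue  # already covered by the snapshot
        if entry.index != applied + 1:
            raise ValueError(f"gap in log: expected {applied + 1}, got {entry.index}")
        apply(state, entry)
        applied = entry.index
    return state, applied


snap = Snapshot(last_index=950000, last_term=7, state={"svc/a": "10.0.0.1"})
tail = [
    Entry(950001, 7, "svc/b", "10.0.0.2"),
    Entry(950002, 7, "svc/a", "10.0.0.9"),
]
state, applied = recover(snap, tail)
print(applied, state)
# 950002 {'svc/a': '10.0.0.9', 'svc/b': '10.0.0.2'}
```

The gap check is the important part of the sketch: a snapshot is only a valid replacement for the prefix if the first replayed entry is exactly the one that follows it.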

The snapshot must be meaningful enough to replace the prefix it summarizes. That means the system must know:

- which log index and term the snapshot covers
- whether all commands up to that point were applied consistently
- what state belongs inside the snapshot
- which log entries after the snapshot remain necessary

The main trade-off is simple but serious: snapshots reduce replay and active log size, but they add snapshot creation cost, storage cost, transfer cost, and consistency rules. A torn snapshot, a snapshot taken at an ambiguous boundary, or a snapshot installed without the right log metadata can turn a recovery optimization into a correctness bug.
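One common way to avoid a torn snapshot on a local filesystem is to write to a temporary file, flush it to disk, and rename it over the final path. The sketch below assumes a JSON-serializable state and a single writer; `write_snapshot` and `read_snapshot` are hypothetical helper names, and keeping the index and term in the same file as the state is a simplification chosen so the metadata cannot drift from the image it describes.

```python
import json
import os
import tempfile


def write_snapshot(path: str, last_index: int, last_term: int, state: dict) -> None:
    """Write state plus log metadata so readers see the old file or the new one."""
    payload = {"last_index": last_index, "last_term": last_term, "state": state}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".snapshot-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())    # force file contents to stable storage
        os.replace(tmp_path, path)  # rename is atomic on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)     # do not leave a partial file behind
        raise


def read_snapshot(path: str) -> dict:
    """Load a snapshot and refuse it if the log metadata is missing."""
    with open(path) as f:
        snap = json.load(f)
    for field in ("last_index", "last_term", "state"):
        if field not in snap:
            raise ValueError(f"snapshot missing {field}")
    return snap


with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "snapshot.json")
    write_snapshot(p, 950000, 7, {"svc/a": "10.0.0.1"})
    loaded = read_snapshot(p)
    print(loaded["last_index"], loaded["last_term"])
# 950000 7
```

A production implementation would typically also fsync the containing directory and add a checksum; both are omitted here to keep the sketch short.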

Checkpointing: Safe Progress for a Computation

Checkpointing is related to snapshotting, but the intent is different. A checkpoint is a durable recovery boundary for a running computation. It often includes state, but the key question is not only "what state was true?" It is "where can this computation restart without losing or duplicating the work it has already made visible?"

In a stateful stream processor, a useful checkpoint ties together several pieces:

input offsets + operator state + output commit boundary

If those pieces drift apart, recovery becomes unsafe. A processor might restore old state while starting from a newer input offset, losing records. Or it might restore new state while replaying older input, applying records twice. The checkpoint is the durable claim that these pieces line up.
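A toy simulation makes the alignment requirement concrete. The sketch below is an assumption-laden illustration, not a stream-processor API: `Checkpoint` and `run` are hypothetical names, the operator just counts records per key, and the output commit boundary is left out so that only the pairing of input offset and operator state is shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Checkpoint:
    next_offset: int        # first input offset NOT yet reflected in counts
    counts: Dict[str, int]  # operator state as of next_offset


def run(
    records: List[str],
    start: Optional[Checkpoint],
    every: int,
    crash_at: Optional[int] = None,
) -> Tuple[Optional[Dict[str, int]], Optional[Checkpoint]]:
    """Count records per key, checkpointing after every `every` records.

    Returns (final_counts, last_checkpoint). If crash_at is set, the run stops
    before processing that offset and the in-memory counts are lost.
    """
    counts = dict(start.counts) if start else {}
    offset = start.next_offset if start else 0
    last = start
    while offset < len(records):
        if crash_at is not None and offset == crash_at:
            return None, last
        key = records[offset]
        counts[key] = counts.get(key, 0) + 1
        offset += 1
        if offset % every == 0:
            # Offset and state are captured together, as one unit.
            last = Checkpoint(next_offset=offset, counts=dict(counts))
    return counts, last


records = ["a", "b", "a", "c", "a", "b", "c"]
expected, _ = run(records, None, every=3)

# Crash before offset 5, then resume from the last consistent checkpoint.
_, ckpt = run(records, None, every=3, crash_at=5)
recovered, _ = run(records, ckpt, every=3)
print(ckpt.next_offset, recovered == expected)  # 3 True

# Same state paired with an older offset: records 1 and 2 are applied twice.
bad = Checkpoint(next_offset=ckpt.next_offset - 2, counts=ckpt.counts)
double_counted, _ = run(records, bad, every=3)
print(double_counted == expected)  # False
```

The second case is the "new state, older input" failure from the paragraph above: nothing crashes, but the result is silently wrong.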

This is why checkpointing leads directly into exactly-once semantics, idempotency, and deduplication. A system that crashes after updating local state but before committing output has to know whether replay will duplicate externally visible work. Checkpoints do not remove that problem, but they define the boundary where the system can reason about it.

The trade-off is frequency and cost. More frequent checkpoints reduce recovery work after a crash, but increase steady-state I/O, coordination, and latency pressure. Less frequent checkpoints make normal operation cheaper, but leave more work to replay and more ambiguity to handle after failure.
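The arithmetic behind this trade-off is simple enough to tabulate. The sketch below assumes checkpoints are taken every fixed number of records and that a crash loses everything since the last one; the function names and the sample numbers are illustrative only.

```python
def records_to_replay(crash_offset: int, interval: int) -> int:
    """Records processed since the last checkpoint, which must be redone."""
    return crash_offset % interval


def checkpoints_written(crash_offset: int, interval: int) -> int:
    """Checkpoints paid for during normal operation before the crash."""
    return crash_offset // interval


crash_offset = 1_234_567
for interval in (100, 10_000, 1_000_000):
    print(
        interval,
        checkpoints_written(crash_offset, interval),
        records_to_replay(crash_offset, interval),
    )
# 100 12345 67
# 10000 123 4567
# 1000000 1 234567
```

Shrinking the interval by a factor of 100 cuts the replay work by roughly the same factor, but multiplies the number of checkpoint writes by about 100.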

Log Compaction: Keeping Latest Useful Records

Log compaction solves a different problem. Some logs are useful because they contain the latest value for each key, not because every old value must stay in the hot recovery path forever.

user:42 -> email=a@example
user:17 -> email=b@example
user:42 -> email=c@example

If a consumer only needs to rebuild the current key-value state, the older user:42 record is superseded by the newer one and may eventually be removed. A compacted log can retain the latest meaningful record per key while allowing older versions to disappear over time.

That still leaves a log, not a single snapshot. New consumers can scan the compacted topic and reconstruct latest state. Existing consumers can continue reading updates. What changes is the retention promise: the log is no longer a complete event history for every mutation.
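The retention rule can be sketched as a pure function over a list of records. This is a simplified model rather than any broker's implementation: `Record`, `compact`, and `materialize` are hypothetical names, delete markers are ignored, and the whole log is compacted in one pass. Offsets are kept unchanged so that a consumer's position still refers to the same record.

```python
from typing import Dict, List, NamedTuple


class Record(NamedTuple):
    offset: int
    key: str
    value: str


def compact(log: List[Record]) -> List[Record]:
    """Keep only the highest-offset record per key, in offset order."""
    latest: Dict[str, Record] = {}
    for record in log:
        latest[record.key] = record  # later records overwrite earlier ones
    return sorted(latest.values(), key=lambda r: r.offset)


def materialize(log: List[Record]) -> Dict[str, str]:
    """Rebuild latest key-value state by scanning a log front to back."""
    state: Dict[str, str] = {}
    for record in log:
        state[record.key] = record.value
    return state


log = [
    Record(0, "user:42", "email=a@example"),
    Record(1, "user:17", "email=b@example"),
    Record(2, "user:42", "email=c@example"),
]
compacted = compact(log)
print([r.offset for r in compacted])                # [1, 2]
print(materialize(compacted) == materialize(log))   # True
```

The equality check states the compaction contract: latest state is identical before and after, while the record at offset 0 is no longer recoverable from the compacted log.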

The trade-off is between storage efficiency and historical fidelity. Compaction is appropriate for changelog topics, metadata replication, cache warmup, and state-store restoration. It is not appropriate when the hot log must answer full audit questions, replay every business event, or preserve every intermediate value.

Worked Example: Recovering an Account Service

Suppose an account service stores commands like credits, debits, freezes, and compliance flags.

With only an append-only log, a restarted node applies every command from the beginning. This is clear, but slow.

With snapshots, the node restores account state as of log index 950000, then replays commands after that point:

restore snapshot @ 950000
apply log entries 950001..current

With checkpoints, a stream processor that computes account risk can resume from a boundary where input offsets, operator state, and committed output agree. It is not merely loading an account-state file; it is restarting a computation at a safe point.

With log compaction, a changelog topic for account profile fields may keep the latest profile record per account key. A new materialized view can rebuild the latest profiles without reading every older profile update. But if the compliance team needs every historical profile value, that audit history must live somewhere else.

Choosing the Mechanism

Mechanism	Preserves	Best Use	Main Trade-off
Snapshotting	State as of a known log point	Fast replica bootstrap, crash recovery, state-machine log truncation	Snapshot consistency and transfer cost
Checkpointing	Safe progress for a computation	Stateful stream processing and resumable jobs	More coordination and I/O during normal operation
Log compaction	Latest useful records by key	Changelog topics, metadata replication, state-store rebuilds	Loses older superseded versions from the compacted log

The practical decision is not "which one is more advanced?" The decision is "what kind of replay am I trying to avoid, and what history must remain available afterward?"

If recovery needs a complete state image, use a snapshot. If a running computation needs a restart boundary, use a checkpoint. If a log only needs to preserve the latest per-key values in the hot path, compaction may be the right reduction.

Common Misreadings

A snapshot and a checkpoint are not always the same thing. Both may write state to storage, but a snapshot usually summarizes state at a point in history, while a checkpoint records a safe resume boundary for a computation.

Log compaction is not arbitrary deletion. It follows a retention rule, usually by key, so the compacted log can still rebuild latest state. It no longer promises to answer every historical question.

A snapshot does not eliminate the log. The system usually still needs the tail after the snapshot, and the log often remains the authoritative stream of changes.

Connections

The previous lesson showed why timestamps and logical clocks need explicit meaning in distributed histories. Snapshots and checkpoints need the same discipline: "as of index X" or "after offset Y" is useful only if the system can define what belongs before and after that boundary.

The next lesson on exactly-once semantics builds directly on checkpointing. Once a system can resume from a durable boundary, it still has to make repeated work safe at the boundaries where messages, state, and external side effects meet.

Resources

[PAPER] In Search of an Understandable Consensus Algorithm
- Focus: How Raft uses snapshots to compact replicated state-machine logs safely.
[DOC] Apache Kafka Documentation: Log Compaction
- Focus: The retention model for keeping latest records by key while preserving a consumable log.
[DOC] Apache Flink Documentation: Fault Tolerance
- Focus: How checkpoints connect input progress, operator state, and recovery.
[BOOK] Designing Data-Intensive Applications
- Focus: The broader relationship between logs, recovery, replication, and stream processing.

Key Takeaways

Snapshotting, checkpointing, and compaction all reduce replay cost, but they preserve different recovery boundaries.
A snapshot summarizes state at a known log point; a checkpoint records where a computation can resume safely.
Log compaction keeps latest useful records by key, which is powerful for rebuilding state but not a substitute for full audit history.
