Day 440: Raft Log Compaction and Snapshot Installation

Lesson 041 · Consistency and Replication · 30 min · Advanced

The core idea: Raft can discard old log entries only after it has durably captured the replicated state machine's state up through a specific boundary (lastIncludedIndex, lastIncludedTerm); snapshot installation is the protocol that lets a lagging follower rejoin from that proof point instead of replaying history the leader no longer stores.

Today's "Aha!" Moment

In 040.md, Harbor Point replaced the suspected voter ny-db-3 with learner sg-db-1 for shard 184. That solved the quorum problem, but it immediately created a new one: sg-db-1 starts far behind the leader. If catching up meant replaying every command ever committed for that shard, recovery time would grow without bound and the log would become an accidental archive instead of a coordination structure.

Raft's answer is not "throw old entries away when disk gets full." The safe answer is narrower. Once the state machine has durably captured the effects of a committed log prefix, the system can replace that prefix with a snapshot plus two pieces of log identity: the last included index and the term at that index. Those fields are what let the cluster keep its log-matching guarantees even after most of the old history has been compacted away.

That is why snapshot installation is a consensus mechanism, not just a storage optimization. When sg-db-1 asks for entries the leader no longer has, the leader is not sending a convenient backup. It is sending a compact proof of "the replicated state up through this exact log point." Once the learner accepts that proof, ordinary append replication can resume from the next index. In the next lesson, 042.md, that same mechanism will start to look different depending on where replicas live and which failure domains carry the snapshot traffic.


Why This Matters

Harbor Point's Maryland leader md-db-4 has committed entries through index 12880 for shard 184. During the failure-handling sequence from the previous lesson, it had already snapshotted the state machine through index 9240 and compacted away entries 1..9240. When learner sg-db-1 joins from an empty disk, it cannot ask for "the whole log," because that history no longer exists on the leader. If the only recovery path were replay, adding a new replica after a network incident would get slower every week the system stayed in service.

The production risk cuts both ways. Without compaction, logs grow until restart time, disk cost, and recovery bandwidth become unacceptable. With sloppy compaction, a leader can discard entries that are still needed to prove ordering, followers can install stale state, and a returning node can re-enter the cluster with the wrong configuration epoch. The feature that keeps recovery bounded is also sitting directly on the consensus safety boundary.

This is why mature Raft systems treat snapshots as part of normal operation rather than a once-in-a-while maintenance task. Snapshot cadence affects disk I/O, learner promotion time, leader bandwidth, and how much history a stalled follower can miss before it needs a full state transfer instead of ordinary replication.


Learning Objectives

By the end of this session, you will be able to:

  1. Explain what information a Raft snapshot must preserve - Show why compacting a prefix is safe only after the state machine and boundary metadata are durable.
  2. Trace snapshot installation from leader to lagging follower - Follow how a learner moves from "missing old entries" to "ready for normal AppendEntries replication."
  3. Evaluate production trade-offs around compaction policy - Reason about snapshot frequency, install cost, leader load, and failure recovery time in a real cluster.

Core Concepts Explained

Concept 1: Log compaction works only because the snapshot preserves the committed prefix's identity

Raft's log is both a command history and a proof structure. Harbor Point cannot just serialize the reservation shard's key-value state and delete the old prefix whenever it feels convenient, because the cluster still needs a way to agree on where the snapshot fits into the replicated history. That is why a real snapshot carries the application state plus metadata such as lastIncludedIndex, lastIncludedTerm, and the latest committed cluster configuration available at that point.

For shard 184, suppose md-db-4 has applied entries through 9240 to the state machine and written a durable snapshot for that exact prefix. Only then may it compact away entries 1..9240. The snapshot is saying: "Everything these commands would have produced is now encoded here, and the last command represented by this snapshot came from index 9240 in term 318." That boundary matters because later log entries still have to attach to a known, agreed-upon prefix.

before compaction
log: [1 ........................................ 9240][9241 .......... 12880]

after compaction
snapshot: state@9240, lastIncludedTerm=318, config_epoch=84
log:                         [9241 .............. 12880]

If the leader crashed after deleting the prefix but before the snapshot became durable, it would have destroyed committed history. If the snapshot omitted the last included term, a follower could not safely verify continuity at the truncation boundary. The invariant is strict: compaction is safe only when the discarded log prefix has been replaced by an equally durable and uniquely identified snapshot boundary.
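
To make the invariant concrete, here is a minimal Go sketch of the boundary metadata and the compaction guard. Every name in it (SnapshotMeta, raftLog, compactTo, the snapshotDurable flag) is an illustrative assumption, not the API of any particular Raft library.

package raftsketch

import "errors"

// SnapshotMeta identifies exactly where a snapshot attaches to the log.
type SnapshotMeta struct {
	LastIncludedIndex uint64 // index of the last command folded into the snapshot
	LastIncludedTerm  uint64 // term of the entry at that index
	ConfigEpoch       uint64 // latest committed cluster configuration at that point
}

type entry struct {
	Index, Term uint64
	Command     []byte
}

type raftLog struct {
	appliedIndex uint64       // highest index applied to the state machine
	snapMeta     SnapshotMeta // boundary of the current durable snapshot
	entries      []entry      // retained suffix, beginning after snapMeta.LastIncludedIndex
}

// compactTo discards entries up through meta.LastIncludedIndex, but only
// when the snapshot covering them is already durable. Deleting first would
// destroy committed history if the node crashed before the snapshot synced.
func (l *raftLog) compactTo(meta SnapshotMeta, snapshotDurable bool) error {
	if !snapshotDurable {
		return errors.New("refusing to compact: snapshot not yet durable")
	}
	if meta.LastIncludedIndex > l.appliedIndex {
		return errors.New("refusing to compact: boundary beyond applied index")
	}
	kept := make([]entry, 0, len(l.entries))
	for _, e := range l.entries {
		if e.Index > meta.LastIncludedIndex { // keep only the post-boundary suffix
			kept = append(kept, e)
		}
	}
	l.entries = kept
	l.snapMeta = meta
	return nil
}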

The trade-off is operational, not conceptual. Frequent snapshots keep replay windows short and make recovery cheaper, but they spend disk bandwidth and can stall foreground work if snapshot generation contends with writes. Infrequent snapshots are cheaper in the steady state, but they allow the retained log to grow and increase the chance that a recovering replica must replay an enormous tail before it becomes useful.
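
To put rough, purely hypothetical numbers on that trade-off: at an assumed 2,000 commits per second, snapshotting every ten minutes leaves a replay tail of up to 2,000 × 600 = 1,200,000 entries, while snapshotting every minute caps the tail near 120,000 entries in exchange for roughly ten times as many snapshot builds.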

Concept 2: Snapshot installation is the fallback replication path for followers that are too far behind

After sg-db-1 joins as a learner, the leader initially tries ordinary log replication. In Raft terms, it starts with a nextIndex for that follower and attempts to send AppendEntries. The problem is that sg-db-1 needs entries earlier than 9241, while the leader's compacted log now begins at 9241. At that point the leader has no sequence of retained entries that can bridge the gap, so it switches from append replication to InstallSnapshot.

The mechanism is deliberately careful. The leader streams the snapshot in chunks so it does not need one giant RPC, and each chunk is tied to the leader's current term and the snapshot boundary. The follower writes those chunks to temporary durable storage rather than exposing partially received state to the running state machine. Only after the final chunk arrives and term checks still succeed does the follower atomically install the snapshot, update its local snapshot metadata, and move its durable state forward.
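
Each chunk carries enough identity for the follower to validate it. The sketch below mirrors the InstallSnapshot arguments from Figure 13 of the Raft paper, continuing the Go sketch from Concept 1; only the Go spelling of the names is ours.

// InstallSnapshotArgs mirrors the chunked RPC described in the Raft paper.
type InstallSnapshotArgs struct {
	Term              uint64 // leader's term; a follower with a higher term rejects the chunk
	LeaderID          string // lets the follower redirect clients to the current leader
	LastIncludedIndex uint64 // the snapshot replaces all entries up through this index
	LastIncludedTerm  uint64 // term of the entry at LastIncludedIndex
	Offset            uint64 // byte offset of this chunk within the snapshot file
	Data              []byte // raw chunk bytes, written to temporary durable storage
	Done              bool   // true only on the final chunk
}

// InstallSnapshotReply returns the follower's term so a stale leader can
// notice it has been superseded and step down.
type InstallSnapshotReply struct {
	Term uint64
}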

Harbor Point's flow looks like this (a leader-side sketch follows the list):

1. sg-db-1 joins shard 184 as learner in configuration epoch 84
2. md-db-4 tries AppendEntries starting below 9241
3. md-db-4 detects that requested entries were compacted
4. md-db-4 sends InstallSnapshot for state through index 9240, term 318
5. sg-db-1 persists snapshot, installs it atomically, records epoch 84
6. md-db-4 sets nextIndex for sg-db-1 to 9241
7. ordinary AppendEntries resumes for 9241..12880 and beyond
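
A leader-side sketch of steps 2 through 6, continuing the same Go sketch. followerState, sendSnapshotChunks, and sendAppendEntries are assumed names rather than a specific library's API, and the RPC plumbing is stubbed out.

type followerState struct {
	nextIndex  uint64 // next log index the leader will try to send
	matchIndex uint64 // highest index known to be replicated on this follower
}

type leader struct {
	log *raftLog // the compacted log from the Concept 1 sketch
}

// replicateTo chooses between ordinary appends and the snapshot fallback.
func (ldr *leader) replicateTo(f *followerState) {
	if f.nextIndex <= ldr.log.snapMeta.LastIncludedIndex {
		// The entries this follower needs were compacted away, so no
		// AppendEntries call can bridge the gap: send the snapshot.
		ldr.sendSnapshotChunks(f, ldr.log.snapMeta)
		// Assuming the install was acknowledged, resume appends just
		// past the snapshot boundary (step 6).
		f.matchIndex = ldr.log.snapMeta.LastIncludedIndex
		f.nextIndex = f.matchIndex + 1
		return
	}
	ldr.sendAppendEntries(f, f.nextIndex)
}

// Transport details elided; stubs keep the sketch compilable.
func (ldr *leader) sendSnapshotChunks(f *followerState, meta SnapshotMeta) {}
func (ldr *leader) sendAppendEntries(f *followerState, next uint64)       {}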

One subtle rule is easy to miss. If the follower already has a log entry at lastIncludedIndex whose term matches the snapshot's lastIncludedTerm, it may retain the log suffix that follows that entry. If not, it must discard its conflicting log and trust the snapshot as the new base. That is how Raft preserves log matching even when the follower arrives with stale or divergent local history.
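
Continuing the sketch, the follower's retain-or-discard decision might look like this; loading the snapshot bytes into the state machine is elided.

// installSnapshot applies the rule above: keep the log suffix only if the
// entry at the boundary agrees with the snapshot on both index and term.
func (l *raftLog) installSnapshot(meta SnapshotMeta) {
	if meta.LastIncludedIndex > l.appliedIndex {
		l.appliedIndex = meta.LastIncludedIndex // state machine now reflects the snapshot
	}
	for _, e := range l.entries {
		if e.Index == meta.LastIncludedIndex && e.Term == meta.LastIncludedTerm {
			// The snapshot describes a prefix of our own log, so the
			// suffix after the boundary is still valid: retain it.
			var suffix []entry
			for _, s := range l.entries {
				if s.Index > meta.LastIncludedIndex {
					suffix = append(suffix, s)
				}
			}
			l.entries = suffix
			l.snapMeta = meta
			return
		}
	}
	// No matching entry at the boundary: local history is stale or
	// divergent, so discard it and adopt the snapshot as the new base.
	l.entries = nil
	l.snapMeta = meta
}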

This is also where the previous lesson's membership story meets actual bytes on disk. sg-db-1 cannot be promoted safely just because a new configuration was committed. It becomes a useful future voter only after snapshot installation plus trailing-log catch-up put it on the same committed state as the rest of the group.

Concept 3: Production snapshot policy is a balance between recovery speed and foreground cost

Harbor Point wants learner catch-up to be quick during trading hours, but it cannot afford for every snapshot to freeze the leader's storage engine. That forces a policy decision: how much retained log should the cluster keep, how often should each replica snapshot, and how much bandwidth may snapshot installation consume before it starts harming live traffic?

Too-aggressive compaction creates its own pathology. If the leader snapshots every few seconds and trims the log immediately, a slow learner may repeatedly fall behind the compaction horizon and get forced into another full snapshot transfer instead of cheaper incremental appends. Too-conservative compaction pushes the pain elsewhere: restart time grows, snapshot restores take longer because the remaining tail is larger, and disk pressure can turn routine recovery into an emergency.

Good implementations separate snapshot creation from the critical write path as much as possible. Storage engines often use copy-on-write checkpoints, background file copying, or throttled chunk streaming so that snapshot work does not monopolize the same I/O queues that commit latency depends on. The system still needs observability to know whether the policy is working. For Harbor Point, the useful signals are "applied index minus snapshot index," snapshot build duration, install throughput per follower, compaction pause time, and how often a recovering node needs a full snapshot instead of the log tail alone.
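
One way to make such a policy explicit is as a handful of knobs plus a trigger check, continuing the raftsketch code (add import "time" alongside the earlier import). Every name and threshold below is hypothetical, though production Raft libraries expose similar settings, such as a trailing-log count.

// SnapshotPolicy gathers the compaction knobs discussed above.
type SnapshotPolicy struct {
	MaxLogEntries   uint64        // snapshot once the applied-minus-snapshot gap exceeds this
	MinInterval     time.Duration // floor between snapshot builds, bounding foreground I/O cost
	TrailingEntries uint64        // entries kept behind the boundary so slightly lagging
	// followers can catch up with appends instead of a full install
}

// shouldSnapshot decides whether to start a snapshot build now.
func (p SnapshotPolicy) shouldSnapshot(appliedIndex, snapIndex uint64, lastBuild time.Time) bool {
	return appliedIndex-snapIndex >= p.MaxLogEntries &&
		time.Since(lastBuild) >= p.MinInterval
}

// compactionTarget picks the compaction boundary, deliberately trailing
// the applied index so a follower that is only a little behind never
// falls past the compaction horizon.
func (p SnapshotPolicy) compactionTarget(appliedIndex uint64) uint64 {
	if appliedIndex <= p.TrailingEntries {
		return 0 // nothing worth compacting yet
	}
	return appliedIndex - p.TrailingEntries
}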

This is where the lesson starts leaning toward 042.md. Once you understand the local mechanism, the next question is placement: if the only up-to-date leader for shard 184 is in Maryland and the replacement learner is in Singapore, the snapshot path now crosses an ocean. Replication topology and failure domains determine whether that recovery traffic is merely expensive or whether it shares the same weak links that caused the recovery in the first place.


Advanced Connections

Connection 1: 040.md committed the safe membership change; this lesson explains how the new member actually becomes current

The previous lesson established that Harbor Point could not promote sg-db-1 until the cluster committed a new configuration. This lesson fills in the missing operational step: safe reconfiguration still needs a bounded, durable way to transfer the committed state to the replacement member.

Connection 2: 042.md turns snapshot traffic into a topology question

Snapshot installation is the same protocol whether the learner is one rack away or on another continent, but its cost profile is not the same. The next lesson uses that fact to compare replication layouts by blast radius, recovery bandwidth, and which failure domains the catch-up path depends on.



Key Insights

  1. A snapshot replaces history only when it preserves the exact boundary of that history - lastIncludedIndex and lastIncludedTerm are what make truncation safe rather than approximate.
  2. InstallSnapshot is a replication protocol, not a backup convenience - It is the mechanism that lets a lagging follower re-enter the log at a known committed point.
  3. Compaction policy is a recovery policy - Every choice about snapshot cadence and retention changes how expensive failover, replacement, and long-tail lag will be in production.
