Raft Log Replication and Commit Semantics

LESSON

006 30 min intermediate

The core idea: Raft separates replication from commitment so followers can receive and repair log entries before the cluster exposes them as authoritative state.

Core Insight

Imagine a Raft leader accepts a client command to create a namespace. It appends the command locally and sends it to followers. One follower stores it quickly, another is slow, and two others are temporarily unreachable. The command now exists on more than one machine, but the cluster should not yet treat it as durable committed truth.

That gap is the key idea. Replication is a placement fact: where has the entry been stored? Commitment is a consensus fact: has the protocol gathered enough evidence that this entry belongs to the authoritative history future leaders must preserve?

Raft needs the distinction because leaders can fail. A leader may replicate an entry to some followers and then disappear before the cluster has enough evidence to commit it. A later leader may have to repair or overwrite an uncommitted suffix. If the old leader had already exposed that suffix to users, the system would have made unstable history visible.

The trade-off is that Raft accepts a little more bookkeeping and latency before applying commands, in exchange for a clearer safety boundary: only committed entries may affect the state machine.

AppendEntries Is a Consistency Check, Not Just Shipping

Raft's leader does not blindly copy log records to followers. Each append request carries information about the entry immediately before the new entries:

AppendEntries:
  term
  leaderId
  prevLogIndex
  prevLogTerm
  entries
  leaderCommit

The follower uses prevLogIndex and prevLogTerm to answer a simple question:

Do my log and the leader's log share the prefix where this append begins?

If the answer is yes, the follower can append the new entries after that point. If the answer is no, the follower rejects the request. The leader then backs up and tries an earlier prefix until it finds the place where the logs agree.
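The follower's side of that question can be sketched in a few lines. This is a minimal illustration, not a full handler: it assumes 1-based log indices with prev_log_index = 0 meaning "nothing precedes the new entries", and the names Entry, AppendEntries, and prefix_matches are hypothetical. It also leaves out the separate check that rejects requests from a stale term.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    term: int
    command: str


@dataclass
class AppendEntries:
    term: int
    leader_id: str
    prev_log_index: int  # 1-based index of the entry just before `entries`; 0 = start of log
    prev_log_term: int
    entries: list = field(default_factory=list)
    leader_commit: int = 0


def prefix_matches(log, req):
    """Return True if the follower's log contains the entry this request builds on."""
    if req.prev_log_index == 0:
        return True  # nothing precedes the new entries, so there is nothing to disagree on
    if req.prev_log_index > len(log):
        return False  # follower's log is too short to contain the previous entry
    # Same index and same term means the logs agree up to this point.
    return log[req.prev_log_index - 1].term == req.prev_log_term
```

A follower that gets False here rejects the request without touching its log, which is what prompts the leader to retry from an earlier point.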

The repair path looks like this:

leader:   [1][2][3][4][5][6]
follower: [1][2][3][X][Y]

shared prefix: [1][2][3]
divergent suffix on follower: [X][Y]
leader replaces suffix with: [4][5][6]

This is how Raft maintains the log matching property: if two logs contain an entry with the same index and term, then they share the same prefix up to that entry. The property is not a magic result of having a leader. It comes from this repeated discipline: prove the prefix, then extend it.
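The repair path above can be simulated end to end. This is a sketch under stated assumptions: entries are (term, label) tuples, the terms are invented for illustration (the shared prefix in term 1, the follower's stray suffix in term 2, the leader's suffix in term 3), and follower_append and repair are hypothetical helper names. The leader here backs up one index per rejection, which is the simplest strategy rather than the fastest one.

```python
def follower_append(log, prev_index, prev_term, entries):
    """Follower side: accept only if the prefix matches, then overwrite any conflicting suffix."""
    if prev_index > len(log):
        return False  # log too short to contain the previous entry
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False  # previous entry exists but comes from a different term
    for offset, entry in enumerate(entries):
        i = prev_index + offset  # 0-based position where this entry belongs
        if i < len(log) and log[i][0] != entry[0]:
            del log[i:]  # conflict: drop this entry and everything after it
        if i >= len(log):
            log.append(entry)
    return True


def repair(leader_log, follower_log):
    """Leader side: back next_index up until the follower accepts. Returns the attempt count."""
    next_index = len(leader_log) + 1
    attempts = 0
    while True:
        attempts += 1
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1][0] if prev_index > 0 else 0
        if follower_append(follower_log, prev_index, prev_term, leader_log[prev_index:]):
            return attempts
        next_index -= 1  # rejected: try an earlier prefix


leader = [(1, "1"), (1, "2"), (1, "3"), (3, "4"), (3, "5"), (3, "6")]
follower = [(1, "1"), (1, "2"), (1, "3"), (2, "X"), (2, "Y")]
attempts = repair(leader, follower)
```

With these assumed terms, the first three attempts are rejected and the fourth succeeds at the shared prefix [1][2][3], after which the follower's log equals the leader's.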

Replicated, Committed, and Applied

Raft has three different states that are easy to collapse if you move too quickly:

stored in a log      -> replicated somewhere
safe protocol truth  -> committed
visible behavior     -> applied to the state machine

A leader tracks follower progress with two useful pieces of state:

nextIndex: the next log index the leader will try to send to that follower
matchIndex: the highest log index the leader knows that follower has stored

Suppose the leader has entries through index 43:

leader log:    40 41 42 43

matchIndex:
  leader    -> 43
  follower1 -> 43
  follower2 -> 42
  follower3 -> 39
  follower4 -> 42

In a five-node cluster, a majority is three nodes. Here, four nodes (the leader, follower1, follower2, and follower4) have stored index 42, while only two have stored index 43. That makes index 42 the highest candidate for commit, but Raft adds one more restriction: a leader counts replicas only to commit entries from its own current term. Once a current-term entry is committed, all earlier entries are committed with it as part of the same prefix.

That current-term rule prevents an entry from an earlier term from being declared committed merely because it has reached a majority after a leadership change; until a current-term entry is committed on top of it, a later leader could still overwrite it. The practical result is:

do not apply merely because an entry exists
do not apply merely because the leader wrote it locally
apply after the commit index says the prefix is authoritative

This is the boundary that keeps the state machine from observing a log suffix that might still be repaired away.
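The commit rule described in this section can be sketched as a function over matchIndex. This is an illustrative sketch, not a reference implementation: advance_commit_index is a hypothetical name, the leader is assumed to record its own last index in the same map, and the terms assigned to entries 1 through 43 are made up because the example above does not state them.

```python
def advance_commit_index(log_terms, match_index, current_term, commit_index):
    """Return the new commit index.

    log_terms[i - 1] is the term of entry i.
    match_index maps node name -> highest index known stored there (leader included).
    """
    majority = len(match_index) // 2 + 1
    # Scan from the newest entry down to just above the current commit index.
    for n in range(len(log_terms), commit_index, -1):
        stored = sum(1 for m in match_index.values() if m >= n)
        # Both conditions are required: a majority holds entry n,
        # and entry n was created in the leader's current term.
        if stored >= majority and log_terms[n - 1] == current_term:
            return n
    return commit_index


match_index = {"leader": 43, "follower1": 43, "follower2": 42,
               "follower3": 39, "follower4": 42}

# Assumption: entries 42 and 43 were written in the current term (8).
current_term_case = advance_commit_index([7] * 41 + [8, 8], match_index, 8, 39)

# Assumption: entry 42 is from the older term 7; only entry 43 is from term 8.
old_term_case = advance_commit_index([7] * 42 + [8], match_index, 8, 39)
```

In the first case the commit index moves to 42. In the second it stays at 39, even though four nodes hold entry 42, because no current-term entry has reached a majority yet.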

Worked Example: The Entry That Exists but Must Not Run

Consider a five-node cluster in term 8. Leader L8 appends entry 50 with command charge(invoice-7). It stores the entry locally and sends it to follower F1, but then crashes before reaching a majority.

At that moment:

L8: entry 50 present
F1: entry 50 present
F2: entry 50 absent
F3: entry 50 absent
F4: entry 50 absent

The command exists on two machines. It is still not committed. A later leader in term 9 might not contain entry 50, and it may repair the cluster toward a different suffix. If charge(invoice-7) had already been applied to the state machine, the system would have exposed a result that the consensus log was not forced to keep.
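The arithmetic behind "present on two machines, still not committed" is small enough to check directly. The helper names majority and has_majority are hypothetical, and the check covers only the replica count, not the current-term rule.

```python
def majority(cluster_size):
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def has_majority(present, cluster_size):
    """present maps node name -> whether the entry is stored there."""
    return sum(1 for stored in present.values() if stored) >= majority(cluster_size)


entry_50 = {"L8": True, "F1": True, "F2": False, "F3": False, "F4": False}
entry_50_has_majority = has_majority(entry_50, 5)
```

Two replicas fall short of the three needed in a five-node cluster, so entry_50_has_majority is False and F2, F3, and F4 can still form a majority that has never seen entry 50.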

Now compare a safer case. A leader in term 9 appends entry 51, replicates it to a majority, and advances commitIndex to 51. Once that happens, the state machines can apply committed entries in order:

log present locally -> commitIndex advances -> lastApplied catches up

This ordering is why Raft separates:

the last log entry stored
the highest committed index
the highest applied index

The log can contain tentative future history. The state machine should only see committed history.
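The ordering of those three positions can be sketched as an apply loop. This is a minimal illustration with a hypothetical Node class; the applied list stands in for a real state machine, and log indices are 1-based to match the examples above.

```python
class Node:
    def __init__(self):
        self.log = []          # commands; entry i lives at log[i - 1]
        self.commit_index = 0  # highest index known to be committed
        self.last_applied = 0  # highest index handed to the state machine
        self.applied = []      # stand-in for the state machine

    def apply_committed(self):
        """Apply entries in log order, never going past commit_index."""
        while self.last_applied < self.commit_index:
            self.last_applied += 1
            self.applied.append(self.log[self.last_applied - 1])


node = Node()
node.log = ["create-ns", "set-quota", "charge(invoice-7)"]
node.commit_index = 2
node.apply_committed()
```

After the call, the first two commands have been applied and the third remains in the log as tentative history until commit_index advances.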

Operational Failure Modes

Several production symptoms come from confusing these states.

"The follower has the entry, so the client is safe" is too weak. One follower plus the leader is not enough in a five-node cluster: two of five is not a majority. Client-visible durability depends on commitment, not mere presence on some disks.

"Followers deleting entries means data loss" is also incomplete. If those entries were part of an uncommitted divergent suffix, deleting them is the repair mechanism that brings the follower back to the leader's authoritative prefix.

"The leader wrote it locally, so it can apply immediately" breaks the safety boundary. The leader is a coordinator, not a one-node source of truth. Applying before commitment can expose state that leadership change might later invalidate.

The trade-off is deliberate: waiting for commit may add latency and require retry behavior for clients, but it prevents unstable log suffixes from becoming externally visible system state.

Connections

The previous lesson introduced Raft's strong leader. This lesson shows what the leader actually coordinates: prefix checks, follower repair, replication progress, and commit advancement.

The next lesson on membership change builds directly on commit semantics. Reconfiguration is hard because it changes who counts in the majority while the log still needs a single authoritative committed history.

Resources

[PAPER] In Search of an Understandable Consensus Algorithm
- Focus: Read the log replication and safety sections while tracking append, commit, and apply as separate states.
[DOC] The Raft Consensus Algorithm
- Focus: Use the official resource hub for diagrams and references that make replication flow concrete.
[ARTICLE] The Secret Lives of Data: Raft
- Focus: Helpful visualization for leader append, follower catch-up, and commit progression.

Key Takeaways

Raft append requests prove a shared prefix before extending a follower's log; replication is not blind copying.
Replicated, committed, and applied are distinct states, and only committed entries should reach the state machine.
Commit semantics trade a little latency and bookkeeping for a clear safety boundary against exposing unstable log history.
