LESSON
Day 419: Raft for Database Replication
The core idea: In a Raft-backed database, a write is not considered committed because one primary decided it was durable. It is committed because a majority of the replicas for that data range have accepted the same log entry in the same position, which makes that history survive leader failure.
Today's "Aha!" Moment
In 02.md, Harbor Point had a familiar primary-standby question: when reservation R-88421 is approved, should the database return success after local WAL flush, after one standby flushes, or after some stricter remote milestone? Raft changes that question. The reservation range for issuer MUNI-77 is no longer "one primary plus followers." It is a three-replica Raft group, and commit means the group has chosen an ordered log entry by majority.
That sounds like a small wording change, but it changes the whole failure story. In leader-follower replication, teams often debate which standby acknowledgment should be required before success. In Raft, the log itself is the contract: the leader proposes an entry, followers accept it only if their existing log prefix matches, and the leader acknowledges the client only after a majority has that entry durably recorded. If the leader dies after that point, any future leader must carry that committed history forward. If the leader dies before that point, the entry may disappear and the client must retry.
The useful mental shift is this: Raft is not "synchronous replication with fancier elections." It turns replication into the mechanism that defines authority. Harbor Point adopts it because losing a committed reservation after failover is worse than paying an extra quorum round trip, but that safety comes with real database costs: every hot shard has a leader, every write waits for quorum I/O, and every read path has to be explicit about whether it needs fresh linearizable data or is allowed to trail behind.
Why This Matters
Harbor Point's reservation platform now has a harder requirement than the earlier primary-standby deployment could comfortably meet. The issuer_exposure record for MUNI-77 cannot be overbooked, and a reservation approval cannot vanish just because the current leader host failed before a replica promotion finished. A simple asynchronous primary is too weak for that contract. A synchronous primary is better, but the rule still sounds like an implementation setting: "wait for remote flush on one standby." Operations wants something stronger and easier to reason about during failover.
Raft gives Harbor Point that stronger contract for the data range itself. The leader for the MUNI-77 range may change, but the committed log history for that range cannot fork as long as a majority of replicas remain available. That makes post-failover reasoning cleaner: either the reservation reached quorum and will survive, or it did not and the client must retry. There is no third state where an old primary believed the write was committed but the new leader is free to forget it.
This matters operationally because it pushes coordination directly into the database write path. Approval latency now depends on quorum network and disk behavior, not only the old leader's local flush. Hot keys concentrate on the current leader for their Raft group. Lagging replicas may need snapshots instead of ordinary log catch-up. The next lesson, 04.md, generalizes that quorum idea and asks how read and write quorums behave when the system exposes them as tunable policy rather than fixing them inside Raft's majority rule.
Learning Objectives
By the end of this session, you will be able to:
- Explain how Raft changes database commit semantics - Trace why a write becomes durable when a majority agrees on one log position, not when one primary decides to acknowledge it.
- Analyze failover outcomes for committed and uncommitted entries - Predict which reservation updates survive leader loss and why some timed-out writes must be retried.
- Evaluate when Raft is worth the cost in a database system - Connect quorum commit, read freshness, replica catch-up, and hot-range leadership to real production trade-offs.
Core Concepts Explained
Concept 1: A database write becomes a Raft log entry before it becomes visible state
Harbor Point stores the MUNI-77 reservation range on three replicas: db-a, db-b, and db-c. At 09:30:00.120, the API asks to reserve another 500000 notional, which means incrementing issuer_exposure.reserved_notional and inserting reservation R-88421. In a Raft-backed database, the leader for that range does not apply those changes to its local state machine first and then "ship" them as an afterthought. It first turns the write into a log entry, such as "apply reservation delta for MUNI-77," with a term and an index.
The leader appends that entry to its local Raft log and sends AppendEntries RPCs to followers. Each follower checks the prefix information attached to the request: "Do I already have the same previous log index and term as the leader?" If the answer is yes, it appends the new entry durably to its own log. If the answer is no, the follower rejects the append and the leader backs up until the logs share a matching prefix again. That is the log-matching rule that prevents replicas from quietly drifting into different histories.
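A minimal Go sketch of that follower-side check, assuming a simplified in-memory log; Entry, AppendArgs, and handleAppend are illustrative names, not any real implementation's RPC format:

package main

import "fmt"

// Entry is a simplified Raft log entry: the term it was proposed in
// plus an opaque database command such as "reserve R-88421".
type Entry struct {
	Term    int
	Command string
}

// AppendArgs carries the prefix information the leader attaches to
// every AppendEntries request.
type AppendArgs struct {
	PrevIndex int     // index of the entry just before the new ones
	PrevTerm  int     // term of that entry on the leader's log
	Entries   []Entry // new entries to append
}

// handleAppend implements the log-matching check: accept the new
// entries only if this follower's log already agrees with the leader
// at PrevIndex. The log is treated as 1-indexed for readability.
func handleAppend(log []Entry, args AppendArgs) ([]Entry, bool) {
	if args.PrevIndex > len(log) {
		return log, false // gap: leader must back up and retry earlier
	}
	if args.PrevIndex > 0 && log[args.PrevIndex-1].Term != args.PrevTerm {
		return log, false // conflicting history at PrevIndex
	}
	// Prefix matches: drop any conflicting suffix, then append.
	log = append(log[:args.PrevIndex], args.Entries...)
	return log, true
}

func main() {
	followerLog := []Entry{{Term: 48, Command: "earlier write"}} // index 1
	args := AppendArgs{PrevIndex: 1, PrevTerm: 48,
		Entries: []Entry{{Term: 48, Command: "reserve R-88421 for MUNI-77"}}}
	newLog, ok := handleAppend(followerLog, args)
	fmt.Println(ok, len(newLog)) // true 2: the entry lands at index 2
}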
Only after the leader sees that a majority has stored the entry may it advance commitIndex for that log position. At that point the entry is committed by protocol. The database can then apply the command to the replicated state machine: update the exposure total, create the reservation row, and make the result visible to subsequent reads. Followers also wait for the leader's commit information before applying the entry locally. A follower may have the bytes for an entry in its log without yet exposing that entry to queries.
For Harbor Point's range, the write path looks like this:
client
  -> range router
  -> leader db-b (term 48)
       append log[1051] = reserve R-88421 for MUNI-77
       fsync local Raft log
       send AppendEntries(prev=1050/48, entry=1051) to db-a, db-c
  -> db-a fsyncs log[1051]
  -> majority reached: db-b + db-a
  -> leader sets commitIndex = 1051
  -> leader applies entry to storage engine state
  -> leader replies success to client
  -> followers apply once they learn commitIndex >= 1051
This is the first big difference from the previous lesson. In 02.md, synchronous replication meant "the primary waits for a standby milestone." Here, majority acknowledgment is not an optional durability enhancement layered on top of the write. It is what makes the write committed at all. The trade-off is straightforward: Harbor Point gets a failover-safe history for that range, but every write now pays for quorum coordination and cannot outrun the health of the majority.
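To make "majority acknowledgment is the commit" concrete, here is a sketch of how a leader can derive its commit index from per-replica acknowledgments. The matchIndex bookkeeping follows the Raft paper; the code itself is illustrative:

package main

import (
	"fmt"
	"sort"
)

// advanceCommitIndex returns the highest log index that a majority of
// replicas (counting the leader itself) have durably stored.
// matchIndex[i] is the last index known to be on replica i's disk.
// Real Raft adds one more condition before committing: the entry at
// this index must be from the leader's current term.
func advanceCommitIndex(matchIndex []int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
	// With indexes sorted descending, the value at the median
	// position is stored on at least a majority of replicas.
	return sorted[len(sorted)/2]
}

func main() {
	// db-b (leader) and db-a have entry 1051; db-c is still at 1050.
	matchIndex := []int{1051, 1051, 1050}
	fmt.Println(advanceCommitIndex(matchIndex)) // 1051: R-88421 commits
}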
Concept 2: Leader changes are safe because committed and uncommitted entries are treated differently
The hardest production confusion in Raft-backed databases is assuming that "replicated somewhere" means "safe forever." Harbor Point needs the sharper distinction. Suppose leader db-b appends reservation R-88421 locally and sends it to followers, but crashes before anything else happens. If no follower durably stored the entry, the write never reached quorum. db-a and db-c can elect a new leader whose log does not include R-88421, and the old entry disappears. That is not data loss according to the protocol, because the system never promised that entry was committed.
Now change the scenario slightly. db-b appends R-88421, db-a durably appends it too, quorum is reached, and db-b crashes before the client sees the response. Harbor Point now has a different problem: the client may time out and be unsure whether the reservation succeeded, but the cluster itself is not unsure. Because a majority stored the entry, any future leader must contain it. db-c cannot win leadership with an older log if db-a is available, and if only db-c survives, the cluster will stop making progress rather than elect a leader that could forget a committed command.
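The difference between those two outcomes is entirely about when the crash lands relative to quorum. In the same notation as the earlier write path:

Scenario A: crash before quorum
  db-b appends log[1051] locally, then crashes
  no follower stored log[1051]
  -> db-a and db-c elect a new leader without the entry
  -> R-88421 was never committed; the client must retry

Scenario B: crash after quorum
  db-b appends log[1051], db-a fsyncs log[1051]
  db-b crashes before replying to the client
  -> a majority (db-b + db-a) stored the entry: it is committed
  -> any electable leader's log includes log[1051]
  -> the client sees a timeout, but R-88421 survives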
This is where Raft's election restriction matters to database behavior. Candidates include their last log index and term when requesting votes. Replicas refuse to vote for a candidate whose log is less up to date than their own. That rule ensures the new leader contains every committed entry. The database can therefore expose a clean invariant to Harbor Point: once a reservation update is committed, failover may delay availability, but it must not rewrite history.
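The comparison itself is small enough to write out. A Go sketch of the voter's check, with illustrative names:

package main

import "fmt"

// candidateLogOK reports whether a voter may grant its vote: the
// candidate's log must be at least as up to date as the voter's.
// Raft compares the term of the last entry first, then the index.
func candidateLogOK(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

func main() {
	// db-a holds committed entry 1051 from term 48; db-c stops at 1050.
	// db-a refuses to vote for db-c, so db-c cannot win leadership.
	fmt.Println(candidateLogOK(48, 1050, 48, 1051)) // false
}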
The client-side implication is equally important. A timeout after leader failure is not enough to tell Harbor Point whether R-88421 exists. The right question is not "Did the old leader start the write?" but "Did the write reach quorum before the old leader disappeared?" Production systems handle that ambiguity with idempotent request identifiers, deduplication on retry, and observability around proposal success versus application success. Raft protects the history once the cluster commits an entry; it does not magically remove the need for careful client retry semantics.
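A minimal sketch of that dedup pattern, assuming the request ID travels inside the committed log entry so every replica deduplicates identically; the types and names are illustrative:

package main

import "fmt"

// stateMachine is a toy replicated state machine: applied request IDs
// plus the running reserved notional per issuer.
type stateMachine struct {
	applied  map[string]bool // request IDs already applied
	reserved map[string]int  // issuer -> reserved notional
}

// applyReserve runs once per committed log entry. Because the request
// ID is part of the entry, a retry after a timeout cannot
// double-reserve: every replica sees the same duplicate.
func (s *stateMachine) applyReserve(requestID, issuer string, notional int) string {
	if s.applied[requestID] {
		return "duplicate: already committed earlier"
	}
	s.applied[requestID] = true
	s.reserved[issuer] += notional
	return "applied"
}

func main() {
	sm := &stateMachine{applied: map[string]bool{}, reserved: map[string]int{}}
	fmt.Println(sm.applyReserve("R-88421", "MUNI-77", 500000)) // applied
	// The client timed out and retried with the same request ID:
	fmt.Println(sm.applyReserve("R-88421", "MUNI-77", 500000)) // duplicate
	fmt.Println(sm.reserved["MUNI-77"])                        // 500000, not 1000000
}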
Concept 3: Raft makes databases easier to reason about, but not cheaper to operate
Raft is especially attractive in databases because it gives each shard or key range its own small consensus group. Harbor Point does not need one monolithic cluster-wide log for every table. It can split the keyspace so MUNI-77, ENERGY-12, and other issuer ranges each live in separate Raft groups. That spreads write throughput across many leaders and makes rebalancing possible as the dataset grows. Systems such as CockroachDB and TiKV build their storage layer around exactly this idea: ranges or regions are the unit of replication, leadership, and repair.
The cost is that every range now has consensus behavior that must be operated deliberately. A hot issuer creates a hot Raft leader, because writes for that range must flow through one replica at a time. A slow follower may not block every write if the other two replicas still form a majority, but it can increase snapshot traffic and recovery work. If a follower falls too far behind, catching up from the log may be slower than shipping a snapshot of the current state and then replaying the remaining entries. Rebalancing a range to another node is therefore not just copying bytes; it is copying state while preserving the log's authoritative order.
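A sketch of that repair decision, assuming the leader tracks each follower's next needed index and its own oldest retained entry; the threshold and names are invented for illustration:

package main

import "fmt"

// catchUpPlan decides how to repair a lagging follower.
// firstRetainedIndex is the oldest log entry the leader still has
// after compaction; followerNext is the next index the follower needs.
func catchUpPlan(followerNext, firstRetainedIndex, leaderLastIndex int) string {
	if followerNext < firstRetainedIndex {
		// The entries the follower needs are already compacted away:
		// the only option is a snapshot, then the remaining log tail.
		return "send snapshot, then log from snapshot index"
	}
	lag := leaderLastIndex - followerNext
	if lag > 100000 { // illustrative threshold, not a real default
		return "snapshot may be cheaper than replaying the backlog"
	}
	return "ordinary AppendEntries catch-up"
}

func main() {
	fmt.Println(catchUpPlan(900, 1000, 1051))  // snapshot required
	fmt.Println(catchUpPlan(1040, 1000, 1051)) // ordinary catch-up
}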
Read behavior also stops being trivial. Harbor Point's approval API needs linearizable answers when it asks "what is the current reserved notional for MUNI-77?" Many Raft-backed databases serve those reads from the leader or leaseholder, because that node already has the freshest committed prefix. Replica reads can be cheaper or geographically closer, but then the system must either prove freshness through a mechanism such as ReadIndex or accept staleness. That is the bridge to 04.md: once the database is defined by quorum commit, read semantics become a first-class design choice rather than a side effect.
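A simplified sketch of the ReadIndex idea in Go: the follower asks the leader how far committed state reaches, waits until its own applied index catches up, and only then answers locally. All names here are illustrative; real implementations add leader leases and heartbeat confirmation:

package main

import "fmt"

// replica is a toy follower: its applied index and a local state view.
type replica struct {
	appliedIndex int
	reserved     map[string]int
}

// readIndexFromLeader stands in for an RPC that returns the leader's
// commitIndex after the leader has confirmed it is still the leader.
func readIndexFromLeader() int { return 1051 }

// linearizableRead serves a follower read only once the follower has
// applied everything that was committed when the read began.
func (r *replica) linearizableRead(issuer string) (int, bool) {
	readIndex := readIndexFromLeader()
	if r.appliedIndex < readIndex {
		// Not fresh enough yet: wait for apply to catch up, or
		// redirect this read to the leader/leaseholder instead.
		return 0, false
	}
	return r.reserved[issuer], true
}

func main() {
	r := &replica{appliedIndex: 1051, reserved: map[string]int{"MUNI-77": 500000}}
	if v, ok := r.linearizableRead("MUNI-77"); ok {
		fmt.Println("fresh read:", v)
	}
}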
So the trade-off is not "Raft is safer, therefore use it everywhere." The real trade-off is narrower. Use Raft when a shard's state must survive leader loss without ambiguity and when paying quorum cost is acceptable. Do not use it casually for workloads that mostly need local durability, loose replica freshness, or extreme write fan-out across wide geographies. Harbor Point accepts the cost for exposure control because a duplicated or missing reservation is more expensive than a few extra milliseconds of coordinated commit latency.
Troubleshooting
Issue: Reservation writes slowed down after one replica moved onto a busier node, even though the leader's own disk metrics still look healthy.
Why it happens / is confusing: The leader is not waiting only on itself. Commit needs a majority of replicas for that range, so follower fsync time and network delay are part of write latency as soon as the local node stops being the only bottleneck.
Clarification / Fix: Measure proposal latency, follower append latency, and quorum commit latency separately. If one follower is consistently slow, Harbor Point may need to move that replica, rebalance the range, or temporarily accept reduced fault tolerance while repair work finishes.
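As a rough sketch of what measuring the stages separately looks like, the timestamps below are invented for illustration; real systems export these as per-range histograms:

package main

import (
	"fmt"
	"time"
)

// writeTimings separates the stages that "write latency" lumps together.
type writeTimings struct {
	proposed   time.Time // leader accepted the proposal
	localFsync time.Time // leader's own log write finished
	quorumAck  time.Time // a majority durably stored the entry
	applied    time.Time // state machine applied the entry
}

func (t writeTimings) report() {
	fmt.Println("local fsync:  ", t.localFsync.Sub(t.proposed))
	fmt.Println("quorum commit:", t.quorumAck.Sub(t.proposed))
	fmt.Println("apply:        ", t.applied.Sub(t.proposed))
}

func main() {
	start := time.Now()
	t := writeTimings{
		proposed:   start,
		localFsync: start.Add(2 * time.Millisecond),
		quorumAck:  start.Add(9 * time.Millisecond), // slow follower shows here
		applied:    start.Add(10 * time.Millisecond),
	}
	t.report() // a gap between fsync and quorum points at followers or network
}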
Issue: After a leader crash, the client retried R-88421 because the original request timed out, and support is unsure whether one or two reservations were created.
Why it happens / is confusing: A client timeout does not reveal whether the old leader failed before or after quorum commit. The cluster knows the answer; the client often does not.
Clarification / Fix: Make reservation requests idempotent with a stable request ID stored in the replicated state machine. On retry, the new leader can answer whether R-88421 already committed instead of applying it twice.
Issue: A follower appears almost caught up, but the database still sends some reads to the leader instead of serving them locally.
Why it happens / is confusing: Having most log entries is not the same as proving a linearizable read. The follower may still lack the latest committed prefix or may not have a freshness proof tied to the current leader term.
Clarification / Fix: Distinguish follower-read mode from leader/leaseholder reads in dashboards and runbooks. If the request needs current exposure values, route it through the linearizable path rather than trusting approximate replica lag.
Issue: A recovering replica starts consuming large amounts of bandwidth and disk I/O instead of quietly replaying the Raft log.
Why it happens / is confusing: Once a replica falls too far behind, the log entries it needs may already have been compacted away, or replaying the long backlog may cost more than copying current state. The database will ship a snapshot and then catch up from a later log position.
Clarification / Fix: Watch snapshot send/receive metrics, tune log retention relative to repair time, and keep replica placement stable enough that brief outages do not force repeated full-range snapshots.
Advanced Connections
Connection 1: 02.md asked which remote milestone to wait for; Raft makes quorum acceptance the milestone
The previous lesson treated replica acknowledgment as a policy choice on top of leader-follower replication. Raft collapses that debate into the protocol itself for a shard: a write is committed when a majority has accepted the log entry and the leader advances commit index. The operator still chooses replication factor and placement, but not an ad hoc definition of commit for each write.
Connection 2: 04.md and 05.md widen the scope from one Raft group to many replicated partitions
04.md will compare Raft's fixed majority semantics with systems that expose tunable read and write quorums directly. 05.md then asks how the keyspace should be partitioned so those quorum costs land on sensible shards. In practice, databases scale Raft by running many small consensus groups, so sharding strategy determines where leadership, quorum latency, and hot spots show up.
Resources
Optional Deepening Resources
- [PAPER] In Search of an Understandable Consensus Algorithm (Extended Version)
- Focus: The original Raft paper's treatment of leader election, log replication, commitment, and why committed entries survive leadership change.
- [DOC] CockroachDB Architecture: Replication Layer
- Focus: A production database implementation of per-range Raft groups, leaseholders, snapshots, and quorum writes.
- [DOC] TiKV Architecture: Storage
- Focus: How a real distributed KV store turns each region into a Raft group and combines replicated logs with range-based sharding.
- [DOC] etcd API Guarantees
- Focus: How a Raft-backed system explains completed operations, linearizable reads, and the trade-off between fresh and stale read modes.
Key Insights
- Raft commit is a quorum fact, not a primary's opinion - A database write becomes durable for the cluster only when a majority has accepted the same log entry in the same position.
- Leader failure does not make all in-flight writes equal - Committed entries must survive leadership change, while uncommitted entries may disappear and require client retry.
- Raft simplifies failover reasoning by charging coordination upfront - The price of cleaner safety semantics is quorum latency, leader hot spots, explicit read freshness rules, and replica repair work.