LESSON
Day 472: Leader-Based Replication Mechanics
The core idea: In leader-based replication, the leader is not just the node that accepts writes first. It is the process that assigns one durable log order and proves that order has reached enough replicas before a client is told "committed."
Today's "Aha!" Moment
In 042.md, Harbor Point decided that reservation shard 184 would keep its voting quorum close to East Coast users: leader md-db-4, synchronous follower md-db-2, and an asynchronous disaster-recovery replica sg-db-1. That topology choice solved the placement question, but it immediately created a harder mechanical one. When a customer reserves cabin C14, what exact event lets Harbor Point say the booking is real rather than merely written on one machine?
The non-obvious answer is that leader-based replication is a commit protocol disguised as a deployment pattern. The leader does not make the system safe by being "primary." It makes the system safe by assigning a single log position, tracking which followers have durably stored that position, and only then advancing the commit point. Followers are not passive copies. Their acknowledgements define whether the next leader can prove the reservation existed after a crash.
That distinction corrects a common misconception. Teams often think single-leader replication removes consistency problems because there is only one writer. It removes one class of problem, namely concurrent write conflict inside one shard, but it leaves several others fully alive: follower lag, stale reads, unsafe failover, and acknowledgements that outrun durability. Those are mechanical issues, not abstract theory, and they show up first during incidents when the leader dies at the worst possible moment.
This lesson stays with Harbor Point's reservation path all the way through the write, the follower acks, and the failover boundary. The next lesson, 044.md, can then contrast that single ordered log with multi-leader systems, where conflict classes exist precisely because there is no longer one sequencer for every write.
Why This Matters
Harbor Point's reservation API handles bursts when a ferry operator opens a new weekend schedule. During those bursts, latency pressure pushes the team toward faster acknowledgements, while durability pressure pushes them toward waiting for more replicas. That trade-off is the center of leader-based replication mechanics: acknowledge too early and a crash can erase an "accepted" reservation; wait for every lagging follower and the write path becomes hostage to the slowest replica.
Production incidents around primary-replica databases usually come from a fuzzy answer to one of three questions: when is a log entry durable, when is it visible to readers, and which replicas are safe to promote after the leader disappears? If different teams answer those questions differently, the system behaves coherently only on the happy path. The failure path becomes an argument about whether the database "really committed" the write.
Once the mechanism is explicit, the operational consequences become measurable. Harbor Point can decide whether 201 Created means "written to leader memory," "fsynced on the leader," or "stored on a synchronous quorum." It can publish follower-lag SLOs that mean something. It can also explain why Singapore is useful for disaster recovery without pretending it participates in every East Coast commit. Leader-based replication matters because it turns those promises into concrete storage and networking behavior.
Learning Objectives
By the end of this session, you will be able to:
- Explain what the leader actually owns - Describe how ordering, commit index advancement, and follower acknowledgements produce a safe write.
- Trace a write through the replication pipeline - Follow one reservation from client request to follower persistence, apply, and possible failover.
- Evaluate the operational trade-off in acknowledgement policy - Reason about latency, durability, and promotion safety using observable replication state.
Core Concepts Explained
Concept 1: The leader is the shard's sequencer, not just its write endpoint
Suppose a client sends POST /reservations for cabin C14 to md-db-4. The leader first validates that the cabin is still free, then turns the change into a log entry such as "term 27, index 981244, reserve C14 for customer 5521." That index assignment is the real beginning of the replicated write. From that point on, every replica has to agree not only on the content of the entry but also on where it sits in the log relative to everything before and after it.
The mechanical flow looks like this:
client -> md-db-4 (leader): reserve C14
md-db-4: append log entry [term 27, index 981244]
md-db-4 -> md-db-2: replicate entry 981244
md-db-4 -> sg-db-1: send entry 981244 asynchronously
md-db-2 -> md-db-4: durable ack for 981244
md-db-4: advance commit index to 981244
md-db-4 -> client: 201 Created
Two details matter. First, the leader may append an entry locally before it is committed. Local append means "candidate state exists," not "client-visible truth." Second, the leader should apply the reservation to the queryable state machine only after the commit index includes that entry. This separation between appended and committed state is what lets the system discard speculative entries from a failed leader without violating the log's ordering rules.
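To make that separation concrete, here is a minimal Python sketch, using hypothetical names rather than any specific product's API, of leader state that keeps appended entries distinct from committed and applied ones:
class LeaderLog:
    def __init__(self):
        self.entries = []        # (term, index, payload) in append order
        self.commit_index = 0    # highest index proven durable on enough replicas
        self.applied_index = 0   # highest index applied to queryable state

    def append(self, term, payload):
        index = self.entries[-1][1] + 1 if self.entries else 1
        self.entries.append((term, index, payload))
        return index             # candidate state exists, not client-visible truth

    def apply_committed(self, apply_fn):
        for term, index, payload in self.entries:
            if self.applied_index < index <= self.commit_index:
                apply_fn(payload)             # e.g. mark cabin C14 as reserved
                self.applied_index = index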
The trade-off shows up immediately in acknowledgement policy. If Harbor Point returns success after local disk flush only, write latency stays low, but a leader crash can still lose the reservation if no promotable follower has the entry. If Harbor Point waits for md-db-2 to durably acknowledge before responding, failover is safer, but the client now pays synchronous replication cost on every write. Leader-based replication is therefore not merely a routing pattern. It is an explicit bargain between durability and latency.
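A sketch of that bargain, with hypothetical helper methods standing in for the real replication calls:
def handle_write(leader, sync_follower, entry, policy="sync-quorum"):
    leader.append_and_fsync(entry)               # durable on the leader's own disk
    if policy == "local-fsync":
        return "201 Created"                     # fast, but lost if the leader dies alone
    leader.send_entry(sync_follower, entry)      # synchronous path through md-db-2
    sync_follower.wait_durable_ack(entry.index)  # network round trip plus follower fsync
    leader.advance_commit_index(entry.index)
    return "201 Created"                         # survives loss of the leader alone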
Concept 2: Follower progress is tracked continuously, because lag decides both throughput and failover safety
Once md-db-4 accepts more traffic, it cannot treat all followers as equally current. It needs per-follower replication state such as "next log index to send," "highest replicated index," and "highest committed index known to that follower." In Raft terms these are often represented as nextIndex and matchIndex; in database products the names differ, but the mechanism is similar. The leader is always asking: how far behind is each replica, and which replicas are caught up enough to matter for commit or promotion?
That state drives ordinary throughput as much as it drives recovery. A synchronous follower like md-db-2 must keep pace because Harbor Point's commit point depends on it. An asynchronous follower like sg-db-1 can lag for seconds or minutes without breaking local write availability, but that lag directly weakens the disaster-recovery position. The same leader that batches East Coast commits also has to decide when Singapore needs a snapshot instead of another long stream of incremental entries, which is exactly where 041.md becomes operationally relevant.
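One plausible shape for that snapshot-versus-stream decision, assuming the leader knows where its retained log now begins (the helper names are illustrative):
def catch_up(follower, next_index, log_start_index):
    if next_index[follower] < log_start_index:
        # the entries this follower needs were already compacted away
        send_snapshot(follower)                  # e.g. sg-db-1 after minutes of lag
    else:
        send_entries(follower, from_index=next_index[follower])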
A simplified replication loop looks like this:
for follower in followers:
    # ship everything the follower has not yet confirmed
    send_entries(follower, from_index=next_index[follower])
    if follower.acked_index > match_index[follower]:
        # the follower durably stored more of the log; record its progress
        match_index[follower] = follower.acked_index
        next_index[follower] = follower.acked_index + 1
# commit can only advance to an index that enough replicas have stored
commit_index = quorum_commit(match_index, local_index)
apply_entries_through(commit_index)
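The quorum_commit call is where the safety argument lives. For a Raft-style majority rule, one sketch looks like this (simplified; real implementations also restrict commitment to entries from the current term):
def quorum_commit(match_index, local_index):
    # every value here is an index durably stored on one node
    positions = sorted(list(match_index.values()) + [local_index], reverse=True)
    majority = len(positions) // 2 + 1
    return positions[majority - 1]               # highest index a majority has stored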
The production lesson is that lag is not just a dashboard number. It determines whether a follower can take over, whether a read replica can answer fresh-enough queries, and whether backpressure is needed on the leader. If Harbor Point ignores sg-db-1 until disaster strikes, it may discover that the remote replica is so far behind that promotion meets the letter of "replica exists" but not the spirit of recovery readiness.
Concept 3: Read visibility and failover correctness come from the same commit boundary
Harbor Point initially focused on write acknowledgements, but the same mechanics govern reads. If, immediately after the C14 reservation write, the load balancer routes a read to sg-db-1, that client may still see the cabin as available. Nothing is "wrong" with replication there; the replica is simply asynchronous. The real mistake would be promising read-after-write semantics without a read barrier, leader routing, or an explicit "read after LSN/index" token.
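A sketch of such a token check, against a hypothetical replica API:
import time

def read_reservation(replica, cabin_id, min_applied_index, timeout_s=0.2):
    # the client carries the commit index of its own write as a session token
    deadline = time.monotonic() + timeout_s
    while replica.applied_index() < min_applied_index:
        if time.monotonic() > deadline:
            return route_to_leader(cabin_id)     # replica too stale for this session
        time.sleep(0.01)
    return replica.get(cabin_id)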
Failover uses the same boundary from the opposite direction. Imagine md-db-4 crashes after sending entry 981244 to md-db-2 but before Singapore receives it. If md-db-2 has the entry durably and the election rules require any new leader to have the most up-to-date log, the reservation survives. If an operator promotes a lagging node that never stored 981244, the write can vanish even though the original leader believed replication was proceeding normally. Safe failover is therefore not a separate feature layered on top of replication. It is the continuation of the same log-ordering and commit rules under leader loss.
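Raft expresses that "most up-to-date" rule as a comparison on each node's last log entry, first by term and then by index; sketched:
def candidate_is_at_least_as_current(candidate_last, voter_last):
    # each argument is the (term, index) of a node's last log entry
    if candidate_last[0] != voter_last[0]:
        return candidate_last[0] > voter_last[0]  # higher last term wins
    return candidate_last[1] >= voter_last[1]     # same term: the longer log wins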
This is also why leader fencing matters during partitions. A stale leader that still accepts writes after losing quorum can create two incompatible histories, even in a "single leader" system. Production systems prevent that with terms, epochs, leases, or external fencing tokens that make old leadership impossible to use once the cluster has moved on. The trade-off here is operational strictness versus convenience: the more aggressively Harbor Point fences stale leaders and restricts follower reads, the clearer its correctness envelope becomes, but the more carefully it must design around partitions and replica lag.
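A minimal fencing sketch, assuming the storage layer persists the highest term it has ever seen and refuses anything older:
class StaleLeaderError(Exception):
    pass

class FencedStore:
    def __init__(self):
        self.highest_term_seen = 0

    def write(self, term, entry):
        if term < self.highest_term_seen:
            raise StaleLeaderError(term)          # old leadership can no longer act
        self.highest_term_seen = term             # a newer term fences all older ones
        self._persist(entry)

    def _persist(self, entry):
        ...                                       # stand-in for the real durable write path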
Troubleshooting
- Issue: A reservation returns success, but disappears after a failover test.
- Why it happens: The leader acknowledged after local persistence or after sending the entry, not after a durable synchronous quorum stored it.
- Clarification / Fix: Define the commit point precisely in configuration and documentation. Make client success depend on the same durability condition that promotion rules assume.
- Issue: A promoted follower comes up read/write, then immediately needs rollback or reconciliation.
- Why it happens: The promotion path ignored replication freshness and treated "replica is alive" as equivalent to "replica has the committed log."
- Clarification / Fix: Gate promotion on replicated index, last durable log position, and term/epoch freshness. Surface those values directly in operator tooling instead of relying on intuition.
- Issue: Users observe stale reads right after creating or updating a reservation.
- Why it happens: Reads were sent to asynchronous followers without a session guarantee or commit-position check.
- Clarification / Fix: Route read-after-write traffic to the leader, or carry a commit token so replicas wait until they have applied at least that log position before answering.
- Issue: The leader becomes a throughput bottleneck even though follower nodes look idle.
- Why it happens: One process owns log ordering, batching, flow control, and synchronous acknowledgement handling for the shard.
- Clarification / Fix: Measure batching efficiency, fsync cost, and per-follower lag separately. Scale by partitioning the workload or changing the acknowledgement set, not by assuming replicas remove the leader's sequencing burden.
Advanced Connections
Connection 1: 042.md chose the topology; this lesson explains how that placement turns into a commit rule
The previous lesson established that Harbor Point wanted a local voting quorum and a remote asynchronous copy. This lesson makes the operational consequence explicit: East Coast durability depends on the leader plus synchronous follower, while Singapore trails behind as a recovery asset rather than a commit participant.
Connection 2: 044.md changes the problem by removing the single sequencer
Leader-based replication works because one node assigns a canonical order for writes on a shard. Multi-leader systems gain locality and write availability in more places, but they lose that single ordering source and must classify, detect, and resolve conflicts explicitly.
Resources
- [BOOK] Designing Data-Intensive Applications
- Focus: Read the leader-based replication chapter with attention to the difference between write acknowledgement, follower lag, and failover safety.
- [PAPER] In Search of an Understandable Consensus Algorithm (Raft)
- Focus: Pay attention to log matching, leader completeness, and why committed entries survive leader change.
- [DOC] PostgreSQL Documentation: Replication
- Focus: Map synchronous standby settings and WAL sender state to the abstract commit-point discussion in this lesson.
- [DOC] MongoDB Replica Set Elections
- Focus: Notice how election safety depends on replication freshness, not merely on which node is reachable first.
Key Insights
- A leader is valuable because it assigns one durable order - Without a shared commit order, "replicated" does not yet mean "safe to expose."
- Follower lag is a correctness signal, not only a performance metric - The same lag number affects read freshness, promotion safety, and disaster-recovery readiness.
- Acknowledgement policy is the core trade-off - Faster success responses reduce latency, but every step away from quorum durability increases failover risk.