LESSON
Day 439: Failure Detectors and Membership Changes
The core idea: Failure detectors turn missing communication into suspicion, and membership changes turn that suspicion into a committed cluster configuration; treating them as a single step is how a failover turns into split-brain or unnecessary data movement.
Today's "Aha!" Moment
In 039.md, Harbor Point used Merkle trees to compare replicas cheaply, but that only worked because every participant agreed which shard members were supposed to own the data. This lesson starts one layer earlier. At market open, an Atlantic congestion burst leaves ny-db-3 unable to answer heartbeats for shard 184. The leader can observe silence, but silence does not reveal whether New York crashed, whether packets were delayed, or whether the process is alive but paused. That uncertainty is the raw material of every real failure detector.
The important correction is that a detector does not output truth. It outputs suspicion strong enough to drive policy. If Harbor Point ejects ny-db-3 the moment one timeout fires, a merely slow node can be replaced, its range ownership can be reassigned, and repair can begin against the wrong topology. If Harbor Point waits too long, a genuinely dead voter keeps counting in quorum calculations and write availability collapses. Fast failover and false-positive resistance trade against each other from the start.
Membership changes solve a different problem: they turn local suspicion into an authoritative cluster view with an epoch number, a quorum rule, and an explicit catch-up path for any replacement node. A node is not "out of the cluster" because one peer feels nervous. It is out only after the reconfiguration is committed. That separation is what keeps returning nodes from serving stale traffic and sets up the next lesson, 041.md, where newly added or badly lagging members catch up through log compaction and snapshot installation.
Why This Matters
Harbor Point runs each reservation shard as a three-voter Raft group. For shard 184, the current voters are md-db-2, md-db-4, and ny-db-3; a warm standby sg-db-1 can join if a voter must be replaced. A 1.8-second WAN stall is long enough to trip a heartbeat timeout even though ny-db-3 may still be alive and may still hold the freshest durable copy for some entries. If the automation removes New York too early, quorum math changes in the middle of trading, Merkle-based repair work from the previous lesson becomes invalid because ownership just moved, and the leader suddenly has to stream state to a replacement node while clients are already contending for bandwidth.
Teams that understand this layer stop asking "is the node dead?" as if the system can ever know that directly. The better operational question is: what evidence do we have, how much delay can the product tolerate, and what configuration change can we commit without breaking quorum intersection? That framing matters because the same decision affects leader election, write availability, lease safety, repair scheduling, and how much surprise the cluster will absorb when the supposedly dead node comes back.
In production, failure detectors and membership changes are not just control-plane details. They decide whether a brief network hiccup becomes a harmless suspicion event, a noisy failover, or a full-blown split-brain incident with stale reads and duplicated rebalancing work.
Learning Objectives
By the end of this session, you will be able to:
- Explain what a practical failure detector actually knows - Distinguish observed communication failure from an authoritative claim that a node has crashed.
- Trace a safe membership change from suspicion to committed configuration - Show how epochs, learners, and quorum intersection protect the cluster during replacement.
- Evaluate the operational trade-offs around detector tuning and reconfiguration - Reason about false positives, failover speed, fencing, and catch-up cost in a production replica group.
Core Concepts Explained
Concept 1: Failure detectors observe missing communication, not machine death
Harbor Point's leader expects regular heartbeats from ny-db-3. When four heartbeat intervals pass with no acknowledgment, the leader begins to suspect the replica. That sounds simple until you remember the environment is asynchronous enough that a paused process, a congested router, and a power-off event can all look identical from the outside for a while. A timeout is evidence about communication, not proof about process state.
That is why the formal language around failure detectors talks about completeness and accuracy rather than certainty. A useful detector should eventually suspect nodes that have really crashed, but any practical timeout-based implementation can also suspect healthy nodes when the network or the local machine behaves badly enough. Production systems hide the mathematics behind friendlier tools such as fixed heartbeat thresholds, phi-accrual suspicion scores, or SWIM-style gossip probes, but the mechanism is the same: convert missing or delayed messages into a level of suspicion.
For Harbor Point, the tuning problem is concrete. A 500 ms timeout gives fast reaction when a voter actually dies, but it also treats a brief transatlantic stall as failure and can trigger an unnecessary election. A 3-second timeout reduces false positives, but every real failure now blocks quorum decisions for much longer. The correct setting depends less on average latency than on tail latency, garbage-collection pauses, packet loss bursts, and the business cost of a delayed failover.
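To make those knobs concrete, here is a minimal leader-side sketch in Go. The HeartbeatTracker type, its field names, and the missed-interval threshold are illustrative assumptions, not a prescribed implementation; the only point is that the output is suspicion, not a verdict.

```go
package detector

import "time"

// HeartbeatTracker is an illustrative leader-side suspicion tracker.
// The interval length and the number of missed intervals are the tuning
// knobs discussed above; nothing here declares a node dead.
type HeartbeatTracker struct {
	interval    time.Duration // expected gap between acknowledgments
	missedLimit int           // e.g. 4 missed intervals => suspect
	lastAck     map[string]time.Time
}

func NewHeartbeatTracker(interval time.Duration, missedLimit int) *HeartbeatTracker {
	return &HeartbeatTracker{
		interval:    interval,
		missedLimit: missedLimit,
		lastAck:     make(map[string]time.Time),
	}
}

// RecordAck is called whenever a heartbeat acknowledgment arrives.
func (t *HeartbeatTracker) RecordAck(node string, now time.Time) {
	t.lastAck[node] = now
}

// Suspect reports whether the node has been silent long enough to be
// suspected. The return value is evidence about communication, not a
// statement that the process has crashed, so it should feed a proposal
// or extra probing rather than an immediate eviction.
func (t *HeartbeatTracker) Suspect(node string, now time.Time) bool {
	last, ok := t.lastAck[node]
	if !ok {
		return false // never heard from it; joining nodes are handled separately
	}
	return now.Sub(last) > time.Duration(t.missedLimit)*t.interval
}
```

With interval set to 500 ms and missedLimit to 4, the 1.8-second WAN stall from the scenario falls just under the 2-second suspicion threshold; tightening either knob turns the same stall into a suspected failure, which is the trade-off the tuning discussion is about.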
The operational rule is to keep detector output separate from authority. Suspicion may trigger extra probes, lease reevaluation, or a reconfiguration proposal, but it should not by itself authorize a new node to vote or serve writes. That boundary is what membership change machinery is for.
Concept 2: Membership changes must be committed like any other replicated state change
Suppose Harbor Point decides ny-db-3 should be replaced by sg-db-1. The unsafe shortcut would be to let the leader locally stop counting New York and start counting Singapore immediately. That creates a window where different nodes can believe in different quorums, which is exactly how split-brain behavior appears even when every individual node thinks it is being sensible.
Safe systems instead attach membership to the replicated log or another authoritative configuration store. One common pattern is "add learner, catch up, then promote." Another is Raft joint consensus, where the cluster briefly enters a configuration whose commits must satisfy overlapping quorum rules from both the old and new membership sets. The invariant underneath both approaches is the same: any quorum that could commit before the change must intersect with any quorum that can commit after the change.
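The intersection invariant can be checked mechanically. The sketch below uses hypothetical helper names such as canDisjointQuorums; it asks whether some majority of the old voter set and some majority of the new voter set could be chosen with no member in common. If they can, switching directly between the two configurations is unsafe, because the two groups could commit conflicting entries independently.

```go
package membership

// quorum returns the majority size for a voter set.
func quorum(voters []string) int { return len(voters)/2 + 1 }

// canDisjointQuorums reports whether some majority of old and some
// majority of new can be chosen with no voter in common. If it returns
// true, moving directly from old to new violates quorum intersection.
func canDisjointQuorums(oldVoters, newVoters []string) bool {
	oldSet := toSet(oldVoters)
	newSet := toSet(newVoters)

	shared := 0
	for v := range oldSet {
		if newSet[v] {
			shared++
		}
	}
	oldOnly := len(oldSet) - shared
	newOnly := len(newSet) - shared

	// Fill each quorum from its exclusive members first, then count how
	// many shared voters each still needs; if both demands fit inside
	// the shared pool without overlapping, disjoint quorums exist.
	needOld := maxInt(0, quorum(oldVoters)-oldOnly)
	needNew := maxInt(0, quorum(newVoters)-newOnly)
	return needOld+needNew <= shared
}

func toSet(voters []string) map[string]bool {
	s := make(map[string]bool, len(voters))
	for _, v := range voters {
		s[v] = true
	}
	return s
}

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}
```

For shard 184, swapping ny-db-3 for sg-db-1 in a single step fails this check: {md-db-4, ny-db-3} and {md-db-2, sg-db-1} are disjoint majorities. That is exactly why the sequence below moves through a learner and a transitional configuration instead of jumping straight to the new voter set.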
For Harbor Point, the sequence looks like this:
epoch 81: voters = {md-db-2, md-db-4, ny-db-3}
timeout on ny-db-3 -> suspect locally, but epoch 81 is still authoritative
epoch 82: learner = sg-db-1 added and begins catch-up
epoch 83: joint or equivalent transitional config is committed
epoch 84: voters = {md-db-2, md-db-4, sg-db-1}
Only after the new epoch is committed do client routing, repair scheduling, and quorum math switch over. That matters directly to the previous lesson. A Merkle tree comparison is meaningful only when both sides agree they are still responsible for the same ranges under the same membership epoch. Reconfiguration is therefore not just about availability; it is part of keeping every later consistency mechanism honest.
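A sketch of what "membership is replicated state" means in code, using illustrative Config and ConfChange shapes of this lesson's own rather than any particular library's API. The joint configuration at epoch 83 is deliberately left out; real Raft implementations model it explicitly.

```go
package membership

// Config is the committed cluster view for one shard. Only an applied,
// committed Config is allowed to change quorum math, routing, or repair
// scheduling; local suspicion never edits it directly.
type Config struct {
	Epoch    uint64
	Voters   []string
	Learners []string
}

// ConfChange is one step of the add-learner / promote / remove flow,
// carried in the replicated log and committed like any other entry.
// A voter swap would additionally pass through a joint or transitional
// configuration (epoch 83 in the sequence above), which this sketch omits.
type ConfChange struct {
	AddLearner   string
	PromoteVoter string
	RemoveVoter  string
}

// Apply produces the next committed configuration under a new epoch.
func (c Config) Apply(ch ConfChange) Config {
	next := Config{
		Epoch:    c.Epoch + 1,
		Voters:   append([]string(nil), c.Voters...),
		Learners: append([]string(nil), c.Learners...),
	}
	if ch.AddLearner != "" {
		next.Learners = append(next.Learners, ch.AddLearner)
	}
	if ch.PromoteVoter != "" {
		next.Learners = removeFrom(next.Learners, ch.PromoteVoter)
		next.Voters = append(next.Voters, ch.PromoteVoter)
	}
	if ch.RemoveVoter != "" {
		next.Voters = removeFrom(next.Voters, ch.RemoveVoter)
	}
	return next
}

func removeFrom(list []string, target string) []string {
	out := make([]string, 0, len(list))
	for _, v := range list {
		if v != target {
			out = append(out, v)
		}
	}
	return out
}
```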
Concept 3: Returning and replacement nodes are dangerous until they are fenced and caught up
Now imagine ny-db-3 was only slow, not dead, and comes back after epoch 84 has committed. Its local disk may still contain shard data, and it may still believe it belongs to the replica set. If Harbor Point lets that node continue serving reads or participating in write coordination, an old configuration has just leaked back into the live system. This is why returning members need fencing: every RPC, lease, and client route must be validated against the current term and configuration epoch, and removed nodes must refuse service or shut themselves down.
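Fencing is easiest to see as a check at the top of every data-plane handler. This is a sketch under assumed names (Request, NodeState, Admit); the essential behavior is that a removed node refuses service and a stale epoch or term is rejected rather than silently served.

```go
package fencing

import "errors"

// ErrStaleEpoch is returned to callers whose view of the cluster is
// behind the committed configuration; they must refresh before retrying.
var ErrStaleEpoch = errors.New("request carries a stale configuration epoch")

// Request is the epoch- and term-bearing envelope every data-plane RPC
// is assumed to carry in this sketch.
type Request struct {
	Epoch uint64 // configuration epoch the sender believes is current
	Term  uint64 // leader term the sender believes is current
}

// NodeState is the local node's committed view plus its own membership.
type NodeState struct {
	CommittedEpoch uint64
	CurrentTerm    uint64
	IsMember       bool // false once a committed config removed this node
}

// Admit decides whether a request may touch data. A node that has been
// removed refuses service outright instead of answering from stale state.
func (n NodeState) Admit(r Request) error {
	if !n.IsMember {
		return errors.New("node removed from configuration; refusing service")
	}
	if r.Epoch < n.CommittedEpoch || r.Term < n.CurrentTerm {
		return ErrStaleEpoch
	}
	return nil
}
```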
Node identity also needs a generation boundary. Gossip-based systems often use incarnation numbers so that old membership messages cannot resurrect a superseded process. Consensus systems use term numbers and committed membership state for the same reason. The underlying problem is identical: once a node has been removed, "same hostname" is not enough to trust it again.
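A compact illustration of the generation boundary, loosely modeled on SWIM-style precedence; the exact status ordering in real gossip protocols is richer, but the rule that a lower incarnation can never override a higher one is the part that prevents resurrection.

```go
package gossip

// MemberStatus orders gossip claims about one node. A claim only wins if
// it carries a newer incarnation, or an equal incarnation with a stronger
// status, so an old "alive" message cannot resurrect a node that has
// since been suspected or removed.
type MemberStatus struct {
	Incarnation uint64
	Status      int // illustrative ordering: 0 = alive, 1 = suspect, 2 = removed
}

// Supersedes reports whether the incoming claim should replace the one
// we already hold for the same node.
func Supersedes(incoming, current MemberStatus) bool {
	if incoming.Incarnation != current.Incarnation {
		return incoming.Incarnation > current.Incarnation
	}
	return incoming.Status > current.Status
}
```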
Catch-up is the second half of the story. Harbor Point does not want sg-db-1 to become a full voter while it is missing committed history, because that would increase quorum size without increasing useful fault tolerance. If the gap is small, the leader can stream the missing log entries. If the gap is large or the replacement starts from an empty disk, shipping a snapshot is cheaper and safer than replaying an unbounded log. That handoff from membership change to state transfer is exactly why 041.md follows this lesson.
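The catch-up decision can be framed as a small policy function. The thresholds and names here are assumptions for illustration; the real cost model depends on log entry size, snapshot size, and available bandwidth.

```go
package catchup

// Plan describes how a learner will be brought up to date in this sketch.
type Plan int

const (
	StreamLogEntries Plan = iota // replay the missing committed entries
	SendSnapshot                 // ship a compacted snapshot, then the log tail
)

// ChooseCatchUp is an illustrative policy: if the leader has already
// compacted the entries the learner is missing, or the gap is large
// enough that replay would cost more than a snapshot, ship the snapshot.
func ChooseCatchUp(learnerIndex, leaderFirstIndex, leaderLastIndex, maxReplayEntries uint64) Plan {
	if learnerIndex+1 < leaderFirstIndex {
		return SendSnapshot // the needed entries no longer exist in the log
	}
	if leaderLastIndex-learnerIndex > maxReplayEntries {
		return SendSnapshot // replay would take too long or cost too much
	}
	return StreamLogEntries
}
```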
The trade-off is between fast automation and cluster stability. Aggressive auto-remove and auto-add policies minimize operator involvement, but they can cause flapping during intermittent network loss and force repeated full-state catch-up. Conservative policies reduce churn and leader load, but they keep suspected nodes in quorum longer and slow down real failovers. Good systems make that policy explicit instead of pretending the detector answered the whole question.
Troubleshooting
- Issue: The cluster keeps removing and re-adding the same replica during WAN jitter.
- Why it happens: Detector thresholds are tight enough that transient delay looks like failure, and the automation treats first suspicion as permission to reconfigure.
- Clarification / Fix: Add a suspicion grace period, use extra probes or a stronger evidence threshold before eviction, and require a stable interval before reversing membership again (see the sketch after this list).
- Issue: Adding a replacement node causes write latency and leader CPU to spike.
- Why it happens: The new member is being counted as a voter before it has caught up, so the leader is doing quorum-critical replication and full-state transfer at the same time.
- Clarification / Fix: Add the node as a learner or non-voter first, let it catch up through log replay or snapshot, and promote it only after the cluster verifies it is current enough to vote safely.
- Issue: A removed node comes back and serves stale reads through an old client route.
- Why it happens: The data plane is not checking the current configuration epoch or leader term, so an obsolete member still looks superficially healthy.
- Clarification / Fix: Fence removed nodes with epoch-aware routing and RPC validation, and quarantine them until they rejoin through the normal membership workflow.
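One way to encode the grace period and stability requirement from the first fix is a small hysteresis policy consulted by the automation, sketched below with illustrative names and fields.

```go
package detector

import "time"

// EvictionPolicy adds hysteresis between suspicion and reconfiguration:
// a node must stay suspect for a grace period before eviction is
// proposed, and no new change is proposed until the membership has been
// stable for a minimum interval. The field names and defaults are
// illustrative assumptions.
type EvictionPolicy struct {
	GracePeriod     time.Duration // how long suspicion must persist
	MinStableWindow time.Duration // how long since the last config change
}

// ShouldProposeEviction is consulted by the automation, never by the
// detector itself; committing the change still goes through the log.
func (p EvictionPolicy) ShouldProposeEviction(suspectSince, lastConfigChange, now time.Time) bool {
	if now.Sub(suspectSince) < p.GracePeriod {
		return false // transient jitter: keep probing instead of evicting
	}
	if now.Sub(lastConfigChange) < p.MinStableWindow {
		return false // avoid flapping right after the previous change
	}
	return true
}
```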
Advanced Connections
Connection 1: 039.md assumed stable ownership; this lesson explains how the system decides who owns a shard after a suspected failure
Merkle trees localize divergence only if both replicas are comparing the same logical range under the same repair epoch. Failure detectors and membership changes provide that precondition. If topology is still in dispute, anti-entropy can waste work or compare the wrong peers entirely.
Connection 2: 041.md is the catch-up mechanism that makes safe membership changes practical
Replacing ny-db-3 with sg-db-1 is not finished when the configuration commits. The new member still needs the cluster's state. When the missing history is too large for ordinary log replay, snapshot installation becomes the bridge between safe reconfiguration and full usefulness.
Resources
Optional Deepening Resources
- [PAPER] In Search of an Understandable Consensus Algorithm (Extended Version)
- Focus: Read the cluster membership section and notice how joint consensus preserves quorum intersection during reconfiguration.
- [DOC] Design of runtime reconfiguration
- Focus: Compare the lesson's "suspicion versus committed configuration" split with etcd's two-phase member change design.
- [DOC] Consul Gossip
- Focus: See how a production system separates gossip-based failure detection from the control-plane state that must remain authoritative.
- [PAPER] SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol
- Focus: Pay attention to how scalable suspicion dissemination differs from committing a new quorum-bearing membership.
Key Insights
- Suspicion is not membership - A timeout only says communication is missing; it does not by itself change who may vote or own data.
- Safe reconfiguration preserves quorum intersection - Add, catch up, and promote in a way that guarantees old and new commit quorums still overlap.
- Rejoining nodes must be fenced before they are useful - Old epochs, stale leases, and missing history turn a returning node into a risk until the cluster catches it up explicitly.