Membership Changes and Replica Set Evolution

LESSON

Consistency and Replication

017 30 min advanced

Membership Changes and Replica Set Evolution

The core idea: Failure detectors turn missing communication into suspicion, and membership changes turn that suspicion into a committed cluster configuration; treating those as the same step is how failovers become split-brain or unnecessary data movement.

Core Insight

During market open, Harbor Point's shard 184 is running as a three-voter group: md-db-2, md-db-4, and ny-db-3. Then New York stops answering heartbeats during an Atlantic congestion burst. From Madrid, the silence is real. The interpretation is not. ny-db-3 may have crashed, packets may be delayed, or the process may be alive but paused long enough to look dead.

That uncertainty is the raw material of every practical failure detector. A detector does not output truth. It outputs suspicion strong enough to trigger policy: probe again, mark a replica unhealthy, start failover preparation, or propose a membership change. If Harbor Point ejects ny-db-3 the moment one timeout fires, a merely slow node can be replaced and repair can begin against the wrong topology. If Harbor Point waits too long, a genuinely dead voter keeps counting in quorum calculations and write availability collapses.

Membership changes solve a different problem. They turn local suspicion into an authoritative cluster view with an epoch number, a quorum rule, and an explicit catch-up path for any replacement node. A node is not "out of the cluster" because one peer is nervous. It is out only after the reconfiguration is committed.

The important design move is to keep those steps separate. Failure detection decides what the system suspects. Reconfiguration decides what the cluster is allowed to believe. Fencing and catch-up decide when old or replacement members may safely serve traffic again.

Suspicion Is Evidence, Not Authority

Harbor Point's leader expects regular heartbeats from ny-db-3. When four heartbeat intervals pass with no acknowledgment, the replica becomes suspicious. That sounds simple until the team remembers that a paused process, a congested router, a garbage-collection pause, and a power-off event can all look identical from the outside for a while. A timeout is evidence about communication, not proof about machine death.

That is why failure-detector design is usually described in terms of completeness and accuracy. A useful detector should eventually suspect nodes that really crashed, but a practical timeout-based detector can also suspect healthy nodes when the network or host behaves badly enough. Production systems hide that theory behind fixed heartbeat thresholds, phi-accrual suspicion scores, or gossip probes, but the mechanism is the same: delayed messages become a suspicion level.

For Harbor Point, the tuning problem is concrete. A 500 ms timeout reacts quickly when a voter actually dies, but it may treat a brief transatlantic stall as failure and trigger an unnecessary election. A 3 s timeout reduces false positives, but a real failure now blocks quorum decisions for longer. The trade-off is fast recovery versus false-positive resistance, and the right point depends on tail latency, packet loss bursts, host pauses, and the business cost of delayed writes.

The operational rule is to keep detector output separate from authority. Suspicion may trigger extra probes, lease reevaluation, or a reconfiguration proposal, but it should not by itself authorize a new node to vote, own data, or serve writes. That boundary is what membership machinery is for.

Reconfiguration Is Replicated State

Suppose Harbor Point decides ny-db-3 should be replaced by warm standby sg-db-1. The unsafe shortcut would be to let the leader locally stop counting New York and start counting Singapore immediately. That creates a window where different machines believe in different quorums, which is exactly how split-brain behavior appears even when each individual node thinks it is being sensible.

Safe systems instead attach membership to the replicated log or another authoritative configuration store. One common pattern is "add learner, catch up, then promote." Another is Raft joint consensus, where the cluster briefly enters a configuration whose commits must satisfy overlapping quorum rules from both the old and new membership sets. The invariant underneath both approaches is the same: any quorum that could commit before the change must intersect with any quorum that can commit after the change.

For Harbor Point, the sequence looks like this:

epoch 81: voters   = {md-db-2, md-db-4, ny-db-3}
timeout on ny-db-3 -> suspect locally, but epoch 81 is still authoritative
epoch 82: learner  = sg-db-1 added and begins catch-up
epoch 83: joint or equivalent transitional config is committed
epoch 84: voters   = {md-db-2, md-db-4, sg-db-1}

Only after the new epoch is committed do client routing, repair scheduling, and quorum math switch over. That matters to every nearby mechanism. Lease reads need to know which term and configuration are current. Backpressure policies need to know which follower is optional and which one is quorum-critical. Repair work is meaningful only when replicas agree on the membership epoch for the data they are comparing.

The practical lesson is that reconfiguration is not just an availability trick. It is replicated state. If it is not committed with the same discipline as user data, the control plane can undermine the safety of the data plane.

Returning Nodes Need Fencing And Catch-Up

Now imagine ny-db-3 was only slow, not dead, and comes back after epoch 84 has committed. Its local disk may still contain shard data, and it may still believe it belongs to the replica set. If Harbor Point lets that node continue serving reads or participating in write coordination, an old configuration has just leaked back into the live system. This is why returning members need fencing: every RPC, lease, and client route must be validated against the current term and configuration epoch, and removed nodes must refuse service or shut themselves down.

Node identity also needs a generation boundary. Gossip-based systems often use incarnation numbers so old membership messages cannot resurrect a superseded process. Consensus systems use terms and committed membership state for the same reason. The underlying problem is identical: once a node has been removed, "same hostname" is not enough to trust it again.

Catch-up is the second half of the story. Harbor Point does not want sg-db-1 to become a full voter while it is missing committed history, because that would increase quorum size without increasing useful fault tolerance. If the gap is small, the leader can stream missing log entries. If the gap is large or the replacement starts from an empty disk, sending a snapshot is cheaper and safer than replaying an unbounded log.

The trade-off is between fast automation and cluster stability. Aggressive auto-remove and auto-add policies minimize operator involvement, but they can cause flapping during intermittent network loss and repeated full-state catch-up. Conservative policies reduce churn and leader load, but they keep suspected nodes in quorum longer and slow down real failovers. Good systems make that policy explicit instead of pretending the detector answered the whole question.

Operational Failure Modes

The cluster keeps removing and re-adding the same replica during WAN jitter. Detector thresholds are tight enough that transient delay looks like failure, and the automation treats first suspicion as permission to reconfigure. Add a suspicion grace period, use extra probes or a stronger evidence threshold before eviction, and require a stable interval before reversing membership again.

Adding a replacement node causes write latency and leader CPU to spike. The new member is being counted as a voter before it has caught up, so the leader is doing quorum-critical replication and full-state transfer at the same time. Add the node as a learner or non-voter first, let it catch up through log replay or snapshot, and promote it only after the cluster verifies it is current enough to vote safely.

A removed node comes back and serves stale reads through an old client route. The data plane is not checking the current configuration epoch or leader term, so an obsolete member still looks superficially healthy. Fence removed nodes with epoch-aware routing and RPC validation, and quarantine them until they rejoin through the normal membership workflow.

Connections

Resources

Key Takeaways

  1. Suspicion is not membership. A timeout says communication is missing; it does not by itself change who may vote, own data, or serve traffic.
  2. Safe reconfiguration is replicated state: add, catch up, and promote in a way that preserves quorum intersection across old and new configurations.
  3. Rejoining nodes need fencing and catch-up before they are useful, because old epochs, stale leases, and missing history can leak obsolete authority into the live system.
PREVIOUS Clocks, Leases, and Safe Reads NEXT Replication Flow Control and Backpressure