Day 471: Replication Topologies and Failure Domains

LESSON 042 · Consistency and Replication · 30 min · advanced

The core idea: A replication topology is a statement about which racks, zones, regions, and network paths must stay healthy for normal writes, failover, and recovery; if those paths overlap the same failure domain, "three replicas" can still behave like one fragile system.


Today's "Aha!" Moment

In 041.md, Harbor Point used snapshot installation to bring the learner replica sg-db-1 back into sync for reservation shard 184. The transfer succeeded, but it exposed the deeper question the team had avoided: why was the recovery path forced to cross the same transoceanic link that had already been unstable during the incident? The real design problem was not replica count. It was topology.

That reframing matters because a topology is not just a picture of where copies live. It defines the steady-state commit path, the leader election path, and the repair path. A replica in Singapore is not merely "far away." If it votes, normal writes may wait on it. If it does not vote, disaster recovery semantics change. If it is the only intact copy after a regional event, every catch-up stream depends on the same long-haul network that just failed.

Teams usually get this wrong by counting servers instead of counting independent failure domains. Three replicas that share the same control plane, network fabric, or metro fiber cut can fail together. Three replicas spread across genuinely separate regions can survive more correlated damage, but then latency, quorum behavior, and operator runbooks all change. This lesson makes those dependencies explicit so that the next lesson, 043.md, can talk about leader-based replication mechanics on top of a topology that actually fits the product's risk model.


Why This Matters

Harbor Point's reservation API serves mostly East Coast traffic, so the natural impulse is to keep quorum local: leader md-db-4 in Maryland, one follower in another Maryland availability zone, and a third nearby replica in New York. That layout gives low write latency and survives machine or zone loss well. But it does not buy zero-data-loss regional recovery if the whole East Coast footprint disappears, and it still assumes Maryland and New York do not share the same dominant network or control-plane risks.

The opposite extreme is to stretch the quorum across Maryland, New York, and Singapore so any one site can disappear without losing a committed copy. That improves region-level durability, but now every write, election timeout, leader lease, and snapshot catch-up is entangled with WAN jitter. A transient Singapore path issue is no longer only a disaster-recovery concern; it becomes part of normal quorum behavior.

Production systems live between those extremes. Harbor Point has to decide whether the reservation product values low-latency local commits with a non-zero remote recovery point objective (RPO), or whether it needs geographically durable commits badly enough to pay transoceanic latency on the write path. Topology is where that business decision becomes concrete infrastructure behavior.


Learning Objectives

By the end of this session, you will be able to:

  1. Explain what a replication topology really encodes - Distinguish replica count from the failure domains that actually determine survivability.
  2. Trace how topology changes commit, election, and repair paths - Show how placement decisions alter both steady-state performance and failure recovery.
  3. Evaluate common topology patterns against product requirements - Choose between local quorum, stretched quorum, and relayed replication using concrete latency and durability trade-offs.

Core Concepts Explained

Concept 1: Topology is a mapping from replicas to correlated failure domains

Harbor Point initially described shard 184 as "triple replicated," but that sentence hid the real engineering question. Triple replicated across what? Three processes on one rack, three availability zones in one region, or three regions on different network backbones are all very different systems even when the replica count is identical. A topology becomes meaningful only when each replica is attached to a failure domain model.

For this lesson, the relevant domains are hierarchical. A single machine can fail by itself, but so can an entire rack because of top-of-rack switching, an availability zone because of power or networking loss, a region because of control-plane or backbone events, or a wide-area path because the long-haul network is congested or cut. The key production insight is that these failures are not independent by default. Two replicas in different zones may still share the same DNS dependency, storage control plane, or metro fiber route.

Harbor Point's placement discussion therefore starts with the failure tree, not the replica roles:

reservation shard 184
├── Maryland region
│   ├── az-a: md-db-4
│   └── az-b: md-db-2
├── New York region
│   └── az-a: ny-db-3
└── Singapore region
    └── az-a: sg-db-1

Once the topology is drawn this way, the trade-off becomes explicit. Keeping two or three voting replicas inside Maryland gives strong protection against host or zone loss, but a Maryland control-plane event still threatens the quorum. Spreading votes across Maryland, New York, and Singapore improves region diversity, but the quorum now depends on WAN links and remote timing behavior that were irrelevant in the single-region design. "More spread out" is not automatically better; it simply shifts which failures are cheap and which failures are on the normal path.
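
That audit can be made mechanical, as the short sketch below shows. It is a minimal Python illustration written for this lesson, not Harbor Point tooling: the domain table is copied from the failure tree above, and the check deliberately treats the listed domains as independent, which is exactly the assumption the previous paragraph says to question.

# Minimal sketch: does a voting set keep a majority after losing any
# single failure domain at a given level? The domain table comes from
# the failure tree above; the helper itself is hypothetical.

REPLICA_DOMAINS = {
    "md-db-4": {"region": "maryland", "zone": "md-az-a"},
    "md-db-2": {"region": "maryland", "zone": "md-az-b"},
    "ny-db-3": {"region": "newyork", "zone": "ny-az-a"},
    "sg-db-1": {"region": "singapore", "zone": "sg-az-a"},
}

def survives_loss_of_one(voters, level):
    """True if losing any one domain at `level` still leaves a majority."""
    need = len(voters) // 2 + 1
    for lost in {REPLICA_DOMAINS[v][level] for v in voters}:
        alive = [v for v in voters if REPLICA_DOMAINS[v][level] != lost]
        if len(alive) < need:
            return False
    return True

local = ["md-db-4", "md-db-2", "ny-db-3"]       # East Coast quorum
stretched = ["md-db-4", "ny-db-3", "sg-db-1"]   # one voter per region

print(survives_loss_of_one(local, "zone"))       # True: any single zone can go
print(survives_loss_of_one(local, "region"))     # False: Maryland holds 2 of 3 votes
print(survives_loss_of_one(stretched, "region")) # True: votes span three regions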

Concept 2: Every topology creates three different traffic paths that must all be survivable

Engineers often judge a topology by the commit path alone, but Harbor Point learned from the previous lesson that the repair path matters just as much. A useful way to reason about topology is to ask three questions for the same shard: where does a write have to travel before it is durable enough to acknowledge, which nodes participate in election and lease decisions, and where does a lagging or replaced replica fetch state during recovery?

Consider two candidate layouts for shard 184:

Topology A: local voting quorum + remote async copy
clients -> md-db-4 (leader)
             |-- sync --> md-db-2
             `-- async -> sg-db-1

Topology B: stretched voting quorum
clients -> md-db-4 (leader)
             |-- sync --> ny-db-3
             `-- sync --> sg-db-1

In Topology A, Harbor Point can acknowledge once Maryland has a quorum, so steady-state latency stays close to local network round trips. Elections are also mostly local. The cost shows up during regional disaster recovery: if Maryland is lost, Singapore may have the newest surviving copy, but some acknowledged writes may still be missing there because the remote leg was asynchronous. Recovery is fast for local faults and weaker for region loss.

In Topology B, the durability story changes. If a write is acknowledged only after one remote voter responds, a full Maryland outage does not erase committed data. But that guarantee is purchased with normal-path WAN exposure. Tail latency, timeout tuning, lease durations, and snapshot recovery now have to tolerate New York or Singapore behavior continuously, not only during disasters. A topology review that ignores the repair path will miss another effect: once sg-db-1 falls behind, the same intercontinental path must carry snapshot traffic before the node becomes useful again.
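
To put numbers on that comparison, the sketch below models only the acknowledgment rule for the two layouts. The round-trip figures are illustrative assumptions for this lesson, not measurements of any real path.

# Minimal sketch of the ack rule for Topologies A and B. RTT figures are
# invented for illustration, not measured.

RTT_MS = {"md-db-2": 1, "ny-db-3": 8, "sg-db-1": 230}

def ack_latency_ms(follower_rtts, acks_needed):
    """Leader acknowledges once the acks_needed-th fastest synchronous
    follower has replied."""
    return sorted(follower_rtts)[acks_needed - 1]

# Topology A: only the local synchronous follower gates the ack; the
# Singapore copy is asynchronous and off the commit path.
print(ack_latency_ms([RTT_MS["md-db-2"]], acks_needed=1))   # 1 ms

# Topology B: leader plus two remote voters; a majority of three needs
# one remote ack, so New York usually sets the latency, but when its
# path degrades the Singapore RTT (230 ms) becomes the ack time.
print(ack_latency_ms([RTT_MS["ny-db-3"], RTT_MS["sg-db-1"]], acks_needed=1))  # 8 ms

# Durability differs the other way: in B every ack implies a copy outside
# Maryland; in A the remote copy may lag behind acknowledged writes.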

This is why topology work belongs in production engineering, not only in whiteboard architecture. Commit, election, and repair traffic have different shapes, and a design is only as strong as the weakest of those three paths under the failures you actually expect.

Concept 3: Common topology patterns encode different product promises

Harbor Point eventually narrowed the design space to three patterns, each with a defensible use case.

The first pattern is single-region quorum with remote disaster recovery. Put the leader and voting followers in separate local zones, then maintain one or more remote asynchronous replicas. This is often the right answer when the product needs fast writes and can tolerate a bounded remote RPO during a full-region event. The system pays for remote bandwidth and operational complexity, but normal commits stay local and easier to reason about.
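
Expressed as a placement policy, this first pattern might look like the sketch below, using the East Coast layout described earlier in this lesson. The schema and field names are hypothetical, invented for this lesson rather than taken from any real orchestrator, but they show which knobs the pattern actually sets.

# Hypothetical placement policy for pattern 1; the schema is invented for
# this lesson, not a real orchestrator's configuration format.
PATTERN_1 = {
    "shard": "reservation-184",
    "voters": [
        {"node": "md-db-4", "region": "maryland", "zone": "az-a", "role": "leader"},
        {"node": "md-db-2", "region": "maryland", "zone": "az-b"},
        {"node": "ny-db-3", "region": "newyork", "zone": "az-a"},
    ],
    "async_replicas": [
        {"node": "sg-db-1", "region": "singapore", "zone": "az-a"},
    ],
    # The promise this placement encodes: commits stay on the East Coast,
    # and a full East Coast disaster may lose up to the async copy's lag.
    "target_rpo_seconds": 30,
}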

The second pattern is stretched quorum across regions. This is appropriate when acknowledged writes must survive a regional loss with little or no data loss. The hidden cost is that remote uncertainty moves into the hot path: leader failover is slower, lease safety is trickier, and transient WAN noise can look like replica failure. Teams choose this pattern when the durability promise is truly worth permanent cross-region coordination cost.
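
One way to see that failure-detection hazard concretely is to compare heartbeat tail latency against the election timeout, as in the sketch below. All numbers are invented for illustration.

# Illustrative check (numbers invented): when a healthy remote's tail
# heartbeat latency approaches the election timeout, WAN jitter starts
# to look like node failure.

ELECTION_TIMEOUT_MS = 300
heartbeat_p99_ms = {"ny-db-3": 25, "sg-db-1": 280}

for node, p99 in heartbeat_p99_ms.items():
    if p99 > 0.8 * ELECTION_TIMEOUT_MS:
        print(f"{node}: p99 heartbeat {p99} ms is close to the "
              f"{ELECTION_TIMEOUT_MS} ms election timeout; transient WAN "
              "noise may trigger spurious elections")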

The third pattern is relayed or chained replication, where Maryland sends a committed log to New York and New York forwards it to Singapore. This reduces leader fan-out and can keep long-haul bandwidth concentrated on fewer nodes, but it creates a different fragility: Singapore is now downstream of New York's health. If New York is partitioned, Singapore can stop advancing even while Maryland remains healthy. That can be a good trade when read replicas are mostly regional caches, and a bad trade when the far replica is supposed to be an independent disaster-recovery anchor.
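
The downstream dependency is easy to demonstrate. The sketch below is a toy relay, not a real replication protocol: each hop can only apply what its upstream has already applied, so a Maryland-to-New York partition freezes Singapore even while Maryland keeps committing.

# Toy model of relayed replication (illustrative only): each site applies
# what its upstream has, when the link is up.

def advance_chain(md_committed, applied, link_up):
    """md_committed: Maryland's committed log index.
    applied: site -> highest index applied locally.
    link_up: hop name -> reachability."""
    if link_up["md->ny"]:
        applied["newyork"] = md_committed
    if link_up["ny->sg"]:
        # Singapore only ever sees what New York has, never Maryland.
        applied["singapore"] = applied["newyork"]
    return applied

applied = {"newyork": 100, "singapore": 100}
# New York partitioned from Maryland; Maryland keeps committing anyway.
applied = advance_chain(md_committed=250, applied=applied,
                        link_up={"md->ny": False, "ny->sg": True})
print(applied)  # {'newyork': 100, 'singapore': 100} -- Singapore stalls too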

The right choice depends on the product promise in plain language. "Users in Virginia need sub-20 ms writes and we can lose at most 30 seconds during a regional disaster" points toward local quorum plus remote async replication. "No acknowledged reservation may be lost even if an entire region disappears" pushes toward multi-region synchronous participation despite the latency cost. Topology is where those product statements stop being slogans and become measurable engineering commitments.
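
As a closing sketch, that mapping from promise to pattern can be written as a decision function. The thresholds below are the ones quoted in this paragraph; everything else is a deliberate simplification of a real topology review.

# Toy decision rule built from the two product statements above; a real
# review would weigh cost, operations, and read traffic as well.

def choose_topology(max_rpo_seconds, write_latency_budget_ms):
    if max_rpo_seconds == 0:
        # No acknowledged write may be lost in a region loss: remote
        # voters must sit on the commit path.
        return "stretched quorum across regions"
    if write_latency_budget_ms <= 20:
        # Fast local writes with a bounded disaster-recovery gap.
        return "local quorum + remote async replica"
    return "either pattern can work; decide on operational grounds"

print(choose_topology(max_rpo_seconds=30, write_latency_budget_ms=20))
# -> local quorum + remote async replica
print(choose_topology(max_rpo_seconds=0, write_latency_budget_ms=250))
# -> stretched quorum across regions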


Advanced Connections

Connection 1: 041.md showed how a lagging replica catches up; this lesson explains why that catch-up path may be cheap or painful

Snapshot installation is mechanically the same protocol whether the learner sits one rack away or on another continent. The topology determines whether that repair traffic is a background detail or a first-order availability risk during recovery.

Connection 2: 043.md will focus on leader-based replication once the placement choice is fixed

This lesson answers "where do replicas live, and which failures must the system survive?" The next one answers "how does the leader move ordered writes through that placement?" Ack policy, follower lag, and promotion rules only make sense after the topology is chosen.



Key Insights

  1. Replica count is not the same as failure tolerance - The real question is which failures are independent enough that the replicas will not disappear together.
  2. Topology changes more than the write path - It also changes elections, repair bandwidth, snapshot cost, and the blast radius of a degraded link.
  3. Each topology pattern encodes a product promise - Low-latency local quorum, region-surviving synchronous quorum, and relayed replication are all valid only when they match the required durability and latency semantics.
