Replication Topologies and Failure Domains

LESSON

Consistency and Replication

003 30 min advanced

The core idea: A replication topology is not "how many copies exist"; it is a map of which failure domains, network paths, and replica roles must survive for writes, failover, and recovery to keep the promised consistency contract.

Core Insight

Imagine Harbor Point, a reservation platform for high-value municipal bond orders. The product team has already named its API contracts: an accepted reservation must not disappear or be double-booked, while regional dashboards may lag by a few seconds. The next design question sounds simple: where should the replicas live?

The tempting answer is "three replicas." That answer is almost empty. Three processes on the same rack, three zones behind the same regional control plane, and three regions connected by a jittery wide-area network all have different latency, durability, and failover behavior.

Topology is the missing layer between a client-visible guarantee and the replication mechanism that will implement it. If the reservation write requires a strong guarantee, then the topology must say which replicas vote before success. If dashboards may be stale, then their replicas can sit on an asynchronous path. If disaster recovery matters, then the design must say where a surviving copy can be found and how it catches up.

The trade-off is that every added failure domain changes the hot path somewhere. Moving a voter across a region boundary may improve disaster tolerance, but it can put WAN tail latency into normal commits and elections. Keeping all voters local may make writes fast, but it can leave the business with a non-zero recovery point after a regional event.

Failure Domains Before Replica Count

A failure domain is a boundary inside which failures are likely to correlate. A single machine can fail. So can a rack, a power zone, a provider region, a control plane, a fiber path, a DNS dependency, or an identity service that all replicas use to start and authenticate.

Harbor Point's first topology sketch said this:

reservation service: 3 replicas

That is not enough to review. A useful sketch attaches each replica to the domains it depends on:

reservation shard 184
├── iad region
│   ├── zone a: iad-db-1, leader, voter
│   ├── zone b: iad-db-2, voter
│   └── zone c: iad-db-3, voter
└── dub region
    └── zone a: dub-db-1, async replica

shared dependencies:
- global DNS
- cloud identity control plane
- private backbone between iad and dub
- object storage bucket used for base backups

Now the review can ask real questions. Does the write path survive one zone loss? Yes, because two of the three iad voters remain, which is still a majority. Does it survive the loss of the whole iad region with zero acknowledged data loss? No, because the Dublin copy is asynchronous. Can the Dublin replica be rebuilt if the object storage bucket is unavailable? Maybe not, because backup seeding is a separate dependency from steady-state replication.

Counting replicas hides those answers. Modeling domains exposes them.
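The review questions above can be turned into a small check. The following is a minimal sketch, assuming a simple majority rule over voters and modeling only zone and region failures; the `Replica` class and the function names are hypothetical, and shared dependencies such as DNS or the identity control plane are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    region: str
    zone: str
    voter: bool


# Replicas from the shard 184 sketch.
REPLICAS = [
    Replica("iad-db-1", "iad", "a", voter=True),
    Replica("iad-db-2", "iad", "b", voter=True),
    Replica("iad-db-3", "iad", "c", voter=True),
    Replica("dub-db-1", "dub", "a", voter=False),  # async copy, does not vote
]


def surviving_voters(replicas, failed_region=None, failed_zone=None):
    """Return the voters that sit outside the failed domain.

    failed_zone is a (region, zone) pair; failed_region removes a whole region.
    """
    alive = []
    for r in replicas:
        if not r.voter:
            continue
        if failed_region is not None and r.region == failed_region:
            continue
        if failed_zone is not None and (r.region, r.zone) == failed_zone:
            continue
        alive.append(r)
    return alive


def write_quorum_survives(replicas, **failure):
    """True if a majority of all configured voters is still reachable."""
    total_voters = sum(1 for r in replicas if r.voter)
    majority = total_voters // 2 + 1
    return len(surviving_voters(replicas, **failure)) >= majority


print(write_quorum_survives(REPLICAS, failed_zone=("iad", "b")))  # True
print(write_quorum_survives(REPLICAS, failed_region="iad"))       # False
```

The same table of replicas could be extended with a set of shared dependencies per replica, so that a single dependency outage is tested the same way as a zone outage.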

Three Paths Hidden in One Topology

Every topology creates at least three paths, and they do not fail in the same way.

The commit path is the route a write must travel before the client receives success. If Harbor Point requires a reservation to be durable on two local voters before acknowledging, normal latency stays low but regional data loss remains possible. If one remote voter must confirm, regional durability improves but every write now depends on a wider network.

The election or failover path is the set of replicas that can decide who is authoritative after a leader failure. A remote replica that receives data asynchronously may be useful for recovery but may not be eligible to become leader without manual checks. A remote voter can participate automatically, but then timeouts, leases, and quorum behavior must be tuned for remote delay.

The repair path is how a lagging or replaced replica gets back to a complete prefix of history. This path is easy to ignore until a replica falls behind and the catch-up stream saturates the same link needed by user traffic.

Two candidate layouts make the difference visible:

Topology A: local voting quorum + remote async copy

client write
   |
   v
iad-db-1 leader
   |-- sync --> iad-db-2 voter
   |-- sync --> iad-db-3 voter
   `-- async -> dub-db-1 disaster-recovery replica

Topology B: stretched voting quorum

client write
   |
   v
iad-db-1 leader
   |-- sync --> iad-db-2 voter
   `-- sync --> dub-db-1 voter

Topology A is attractive when Harbor Point needs very low write latency and can tolerate a bounded recovery point objective for full-region loss. Topology B is attractive when an acknowledged reservation must survive the loss of the primary region, which holds only if the commit rule waits for the Dublin voter; a plain majority formed by iad-db-1 and iad-db-2 would leave the write inside iad. The cost of Topology B is permanent exposure to the remote path: if Dublin is slow, normal writes, leader decisions, or failover behavior may become slow too.

The topology is not just a diagram. It is a performance and failure contract.
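The latency side of that contract can be estimated with simple arithmetic. This is a rough sketch, assuming the leader sends to all followers in parallel and that its own durable write is not the bottleneck; the round-trip times are made-up illustration values, not measurements, and `commit_latency_ms` is a hypothetical helper.

```python
def commit_latency_ms(follower_rtts_ms, acks_needed, must_include=()):
    """Time until the leader holds enough follower acknowledgments.

    follower_rtts_ms: mapping of follower name to round-trip time in ms.
    acks_needed: number of follower acks required (the leader counts itself).
    must_include: followers whose ack is mandatory, e.g. a remote voter.
    """
    if not 1 <= acks_needed <= len(follower_rtts_ms):
        raise ValueError("acks_needed must be between 1 and the follower count")
    # The k-th fastest ack arrives at the k-th smallest round-trip time.
    quorum_time = sorted(follower_rtts_ms.values())[acks_needed - 1]
    # Mandatory followers can only delay the commit, never speed it up.
    required_time = max((follower_rtts_ms[n] for n in must_include), default=0.0)
    return max(quorum_time, required_time)


# Hypothetical round-trip times from the leader iad-db-1.
topology_a = {"iad-db-2": 1.2, "iad-db-3": 1.5}
topology_b = {"iad-db-2": 1.2, "dub-db-1": 80.0}

# Three voters: a majority is the leader plus one follower ack.
print(commit_latency_ms(topology_a, acks_needed=1))  # 1.2
print(commit_latency_ms(topology_b, acks_needed=1))  # 1.2, but the write may exist only in iad
print(commit_latency_ms(topology_b, acks_needed=1, must_include=["dub-db-1"]))  # 80.0
```

The second and third lines show why the commit rule matters as much as the layout: the stretched topology only pays the WAN round trip, and only gains regional durability, when the remote acknowledgment is actually required.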

Worked Example: Matching Topology to the Promise

Harbor Point writes down four product requirements:

Requirement                                      Topology implication
-----------------------------------------------  --------------------------------------
accepted reservations must not double-book       one authority path for reservation writes
East Coast writes should usually finish < 30 ms   keep normal quorum near East Coast users
regional dashboard may lag up to 5 seconds        async read replica is acceptable there
full-region disaster may lose at most 60 seconds  remote async copy must stay within RPO

Those requirements point toward local voting plus remote asynchronous disaster recovery, not a stretched global quorum. The reservation write still needs one clear authority path, but the business has not demanded zero-data-loss regional failover. Paying remote quorum latency on every write would buy a stronger guarantee than the stated product promise.

The resulting design might be:

write authority:
- leader and voters in three independent zones in `iad`
- acknowledge after local majority has durably accepted the write

remote recovery:
- stream ordered changes to `dub`
- alert if remote lag exceeds 30 seconds
- declare regional RPO breach if lag approaches 60 seconds

read scaling:
- serve Dublin dashboards from `dub-db-1`
- show `last_replicated_at`
- route critical reservation reads back to the authority path

This design has a clear trade-off. It protects the strongest API contract for normal zone failures and gives fast writes. It does not claim zero-data-loss regional failover. That limitation is not a bug if the business accepts the RPO and the system continuously monitors async replica lag against it.
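The lag thresholds in the design can be expressed as a small classifier. This is a sketch under assumptions: `last_replicated_at` is taken to be the commit timestamp of the newest change applied in `dub`, and since the design only says the breach is declared when lag "approaches" 60 seconds, the 5-second safety margin below is an invented value that a real team would choose for itself.

```python
from datetime import datetime, timedelta, timezone

ALERT_LAG = timedelta(seconds=30)    # from the design: alert above 30 s
RPO_LIMIT = timedelta(seconds=60)    # from the requirements: at most 60 s of loss
BREACH_MARGIN = timedelta(seconds=5)  # assumption: how close counts as "approaching"


def classify_lag(last_replicated_at, now):
    """Classify remote replica lag as 'ok', 'alert', or 'rpo_breach'."""
    lag = now - last_replicated_at
    if lag >= RPO_LIMIT - BREACH_MARGIN:
        return "rpo_breach"
    if lag > ALERT_LAG:
        return "alert"
    return "ok"


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(classify_lag(now - timedelta(seconds=10), now))  # ok
print(classify_lag(now - timedelta(seconds=31), now))  # alert
print(classify_lag(now - timedelta(seconds=56), now))  # rpo_breach
```

One caveat with timestamp-based lag: when the primary is idle, the newest applied change can look old even though the replica is fully caught up, so a production monitor would likely need a heartbeat write or a log-position comparison as well.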

If the business later changes the promise to "no acknowledged reservation may be lost even if iad disappears," the topology must change. A remote voter, a multi-region quorum system, or a different authority model becomes necessary. That new design would spend more latency and operational complexity to buy a stronger guarantee.

Topology Review Checklist

Before a replication design is approved, the team should be able to answer these questions without hand-waving:

Question                                      Why it matters
--------------------------------------------  -----------------------------------------
Which replicas vote on the write path?         defines commit latency and durability
Which domains can fail independently?          separates real resilience from labels
Which replicas can become authority?           shapes failover and split-brain risk
How does a lagging replica catch up?           exposes repair bandwidth and retention
What is the accepted RPO and RTO?              connects topology to business promise
Which reads may use stale replicas?            prevents weak replicas from serving strong APIs

The last question ties this lesson back to consistency contracts. A replica topology can support multiple API semantics only if the routing layer respects the contract. A dashboard endpoint may read from a lagging replica. A reservation confirmation endpoint may need to read from the leader, a quorum, or a replica known to have applied a required version.

Topology and API semantics must be reviewed together. Otherwise the system can be technically replicated and still violate the promise clients were told to trust.
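One way to make the routing rule concrete is a read router that takes the endpoint's contract as input. The sketch below assumes each replica exposes a monotonically increasing `applied_version`; the `ReplicaState` class, the `choose_replica` function, and the version numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaState:
    name: str
    is_leader: bool
    applied_version: int  # highest change version this replica has applied


def choose_replica(replicas, local_name, allow_stale, min_version=None):
    """Pick a replica for a read according to the endpoint's contract.

    allow_stale: the endpoint tolerates lag (e.g. a regional dashboard).
    min_version: the read must observe at least this version, if given.
    """
    by_name = {r.name: r for r in replicas}
    local = by_name[local_name]
    if allow_stale:
        return local
    # A non-leader may serve a strong read only if it is known to have
    # applied the version the caller requires.
    if min_version is not None and local.applied_version >= min_version:
        return local
    return next(r for r in replicas if r.is_leader)


replicas = [
    ReplicaState("iad-db-1", is_leader=True, applied_version=1042),
    ReplicaState("dub-db-1", is_leader=False, applied_version=1040),
]

# Dashboard read in Dublin: stale data is acceptable.
print(choose_replica(replicas, "dub-db-1", allow_stale=True).name)  # dub-db-1
# Reservation confirmation that must see version 1042: falls back to the leader.
print(choose_replica(replicas, "dub-db-1", allow_stale=False, min_version=1042).name)  # iad-db-1
```

Defaulting to the leader whenever the contract is strong and no version is supplied keeps the failure mode safe: a misconfigured endpoint becomes slow rather than incorrect.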

Failure Modes

Counting replicas instead of independent domains. Three replicas behind one shared storage service, identity dependency, or network edge can fail like one system. Review correlated dependencies, not only provider zone names.

Putting a remote voter on the hot path by accident. A configuration that improves regional durability may also import WAN latency into every commit and election. That trade-off is acceptable only when the stronger durability promise is intentional.

Treating async disaster recovery as zero-data-loss failover. A remote async replica can be excellent for bounded RPO, but it is not proof that every acknowledged write survived a primary-region loss.

Ignoring repair traffic. A topology can pass steady-state tests and still fail operationally when a replacement replica needs a large catch-up stream over a congested link.
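A back-of-the-envelope estimate makes the repair risk visible before an incident does. The sketch below assumes a constant repair bandwidth budget and a constant rate of new writes arriving while the replica catches up; the function name and all numbers are hypothetical.

```python
def catch_up_seconds(backlog_bytes, repair_bytes_per_s, new_write_bytes_per_s):
    """Estimate how long a lagging replica needs to reach the live stream.

    The backlog shrinks only by the difference between the repair bandwidth
    and the rate of new changes. Returns None if the replica never catches up.
    """
    net_rate = repair_bytes_per_s - new_write_bytes_per_s
    if net_rate <= 0:
        return None
    return backlog_bytes / net_rate


# Hypothetical numbers: 120 GB behind, 50 MB/s budgeted for repair,
# 10 MB/s of new changes still arriving.
seconds = catch_up_seconds(120 * 10**9, 50 * 10**6, 10 * 10**6)
print(seconds)  # 3000.0, i.e. 50 minutes

# If repair is throttled to protect user traffic, catch-up may never finish.
print(catch_up_seconds(120 * 10**9, 8 * 10**6, 10 * 10**6))  # None
```

Comparing the estimate with the log retention window shows whether streaming catch-up is feasible at all or whether the replica has to be re-seeded from a base backup.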

Key Takeaways

  1. Replica count is not a resilience claim unless each replica is mapped to real failure domains and shared dependencies.
  2. A topology defines commit, election, and repair paths; all three must match the promised consistency and recovery behavior.
  3. Local quorum plus remote async replication buys low latency and bounded RPO, while stretched quorum buys stronger regional durability at normal-path cost.
  4. API routing must respect topology: stale replicas are fine for stale endpoints and dangerous for endpoints that promised a stronger guarantee.