Replication Failure Mode Check

LESSON

Consistency and Replication

023 30 min advanced REVIEW

Replication Failure Mode Check

The core idea: A geo-distributed database is not "one writable database everywhere." It is a contract about shard ownership, read guarantees, and failover. If those three choices do not agree, extra regions only multiply the ways the service can be wrong.

Core Insight

Harbor Point has learned how to test a distributed database the hard way: not by admiring a clean failover demo, but by reconstructing the history of real reservation requests and checking whether the cluster kept its promises. The next design question is what architecture is worth testing now that the reservation platform is expanding beyond one region.

The tempting answer is "put writable replicas in Madrid and New York so every desk can write locally." For Harbor Point's workload, that is the wrong instinct. A reservation is not an append-only social post. Each approval also updates the issuer's remaining exposure. If Madrid and New York can independently approve reservations for issuer MUNI-77, the system has to reconcile not just rows, but a financial limit that may already have been oversold. After two desks have both been told "confirmed," there is no harmless merge strategy.

The right design move is to place the invariant before placing the replicas. Harbor Point shards by issuer, keeps issuer_exposure and the issuer's live reservations on the same shard, and gives each shard one home region with one leader at a time. Madrid and New York can both serve the product globally, but they do not both invent the order of writes for the same issuer. Remote replicas, CDC-fed read models, and failover procedures exist to extend visibility and resilience around that ordered write path, not to replace it.

That is the review insight: replication, quorums, sharding, read replicas, CDC, backups, observability, and Jepsen-style validation are not separate tricks. They are pieces of one service contract. If shard ownership, read semantics, and failover rules do not agree, the system can pass individual component checks and still fail the business contract.

Design Review Pressure

Harbor Point is launching follow-the-sun reservation intake for bond desks in Madrid and New York. Traders in both cities need fast confirmations. Support needs to look up any reservation globally. Compliance wants a cross-region search surface within seconds, not hours. Operations also wants a credible story for a regional outage. Those are all reasonable requirements until they are forced into the same database design.

If Harbor Point solves the problem with a single slogan, the service will fail in production. "Strong consistency everywhere" turns every write and many reads into cross-ocean coordination, which pushes ordinary p99 latency into the product budget. "Local writes in every region" preserves latency, but it breaks the issuer-exposure invariant unless the business accepts conflicts that can no longer be repaired after confirmation. "Just use read replicas" helps dashboards and does nothing for shard ownership or disaster recovery.

The review is therefore about choosing a design that is explicit about what is local, what is global, and what is allowed to lag. A geo-distributed database becomes workable only when those categories are tied to specific endpoints, replication paths, validation checks, and failure runbooks.

Check 1: Shard Authority

Harbor Point's critical transaction is still the same: lock issuer_exposure, check remaining headroom, insert the new reservation, and commit. The first design check asks whether those rows stay together physically. If the system shards by trader region, client office, or random hash unrelated to the issuer, the approval path turns into a cross-shard transaction exactly where the business can least afford it. The issuer limit would then depend on distributed transactions or asynchronous compensation in the hot path of every reservation.

Instead, Harbor Point shards by issuer and assigns each shard a home region based on the book that owns the issuer. EU issuers are homed in Madrid. US issuers are homed in New York. Both regions can accept API traffic, but a router first looks up the shard map and forwards the write to the shard leader in its home region. That keeps the approval invariant local to one Raft group or leader-based replication group at a time.

The normal path looks like this:

client in Madrid or New York
          |
          v
regional API gateway
          |
          v
metadata lookup: issuer MUNI-77 -> shard 184 -> home region New York
          |
          v
shard leader in New York
          |
          +--> sync replica in New York AZ-2
          |
          +--> async replica in Madrid

The important point is not the city names. The important point is that only one place is allowed to order writes for shard 184. A Madrid trader reserving a US issuer pays the WAN hop to New York because that hop is cheaper than making the issuer-limit invariant ambiguous. Harbor Point is explicitly choosing single-writer shards over active-active conflict resolution for the OLTP system of record.

That choice also defines the trade-off. Local writers for home-region issuers get predictable commit latency because the acknowledgment depends on local WAL durability plus a synchronous replica in another availability zone, not on a global quorum. Cross-region writers pay extra latency. Regional failover is possible, but it is not zero-cost magic because the remote replica is not in the commit quorum. Harbor Point accepts that trade because the business already has an upstream order journal and can tolerate a tightly bounded RPO more easily than a permanently inflated write path.

Check 2: Read Paths

Once shard ownership is clear, Harbor Point stops pretending every endpoint needs the same consistency contract. The reservation system actually serves three different questions.

The first question is transactional: "did reservation R-88421 commit?" That path must hit the shard leader or a read path that is provably at least as fresh as the caller's last acknowledged write. Harbor Point therefore returns a session token containing the commit LSN or log index with every successful write. If the client later reads from its local region and the replica has not replayed that token yet, the gateway routes the read back to the leader instead of serving a stale answer.

The second question is operational: "show me the newest open reservations for this issuer." That dashboard can tolerate a small lag as long as the contract is explicit. Harbor Point serves it from a region-local follower when replay lag is under two seconds; otherwise it falls back to the leader or marks the view as stale. This uses the read-replica contract from Replication Lag and Read-Your-Writes without lying about read-your-writes.

The third question is global and analytical: "search all reservations matching a compliance filter across regions." Harbor Point should not answer that with a synchronous fan-out query across every shard leader. Instead it streams CDC from each shard's WAL into a global reservation_search index and a same-day compliance view in both regions. That view is seconds behind by design, but it avoids coupling every compliance search to the OLTP commit path.

One way to summarize the service is:

question                               source of truth                    contract
confirmation / token status            shard leader                       linearizable for that shard
issuer dashboard                       regional follower or leader        bounded staleness + session fallback
global compliance search               CDC-fed search/index service       eventual, usually within seconds

CDC, materialized views, follower reads, and global indexes are not separate features to bolt on later. They are how Harbor Point avoids forcing a single "strong everywhere" contract onto endpoints that do not need it. The trade-off is operational clarity: APIs and dashboards have to expose which read mode they use, or developers will accidentally depend on freshness that the system never promised.

Check 3: Failover Semantics

Harbor Point's design deliberately does not put the remote region in the synchronous write quorum. That keeps the normal path fast, but it means the service must declare a real disaster-recovery contract. The production target is explicit: per-shard RPO of at most five seconds, RTO of under ten minutes, and no silent duplicate reservations after failover.

Those numbers are enforced operationally, not by wishful thinking. The gateway watches remote replay lag and WAL archive age for each shard. If the remote copy falls more than five seconds behind, Harbor Point stops accepting new reservations for that shard and returns a retryable failure rather than allowing the recovery promise to drift invisibly. This is the same operational principle as Observability for Replicated Data Systems: internal state, not API latency, defines whether the storage contract is still true.

Every write also carries a client-generated reservation_token. That token is the bridge between the OLTP path and the failover runbook. If a regional outage happens after a client saw a timeout or even after it saw success during the uncertain replication window, the client can ask the promoted shard whether that token committed. The system never resolves ambiguity by replaying the business action blindly. It resolves ambiguity by turning idempotency into an addressable record.

The failover sequence is therefore mechanical:

1. Freeze routing for shard 184 at configuration generation N.
2. Promote the remote replica only up to its last durable replayed position.
3. Publish generation N+1 in the metadata service.
4. Resume writes through the new leader.
5. Reconcile tokens that clients reported during the outage window.

This is exactly where failure testing matters. Harbor Point can and should inject partitions between regions, pause the old leader, delay replication, and verify that one token never becomes two committed reservations. The team should also test that stale followers are never allowed to serve a "strong" read and that routing changes happen once per generation, not in split-brain fashion. A geo-distributed database design is production-ready only when those tests pass repeatedly.

The final trade-off is honesty. Harbor Point is not claiming the same thing as a system that synchronously replicates a write quorum across continents. If the business later requires zero-RPO regional loss for the reservation path, this architecture is no longer sufficient. The team would need to move remote replicas into the commit rule and accept the latency cost directly. The review answer is not universal. It is correct because it matches Harbor Point's stated workload, failure budget, and operational maturity.

Operational Failure Modes

A trader submits a reservation in Madrid, gets success, then immediately refreshes the local dashboard and does not see the reservation. The dashboard is reading from a regional follower whose replay position is behind the writer's commit token. The write is valid, but the read path is weaker than the user assumes. Carry the last-seen commit token through the session. If the local follower has not reached that position, route the read to the shard leader or label the view as stale instead of pretending the reservation disappeared.

A regional failover completes, but support cannot tell whether a just-submitted reservation committed before the outage. The system acknowledged writes from the home region's local quorum, while the promoted remote replica only knows its own replayed prefix. Without an idempotency token, the operator has no reliable way to separate "reply lost" from "write never committed remotely." Require a durable client token for every reservation and make token-status lookup part of the public contract. During failover, resolve uncertainty by token status, not by replaying the business request manually.

Compliance search disagrees with the OLTP record for several seconds after a burst of writes. The search surface is CDC-fed derived state, not the transactional source of truth. WAL shipping or CDC apply can lag temporarily under burst load even while the shard leader remains current. Publish the freshness contract for the search index explicitly, monitor CDC lag as a first-class SLI, and send investigators to the leader-backed path when they need authoritative status during an incident.

Connections

Resources

Key Takeaways

  1. Geo-distribution starts with shard authority, not replica count; if the invariant has no single ordering point, every extra region increases conflict risk.
  2. Confirmations, dashboards, and compliance search deserve different read paths because they make different freshness and correctness promises.
  3. Failover is part of the product contract: RPO, RTO, replica lag gates, and token-based uncertainty handling must be designed before the service can truthfully call itself multi-region.
PREVIOUS Guarantee Matrix Design Review NEXT Replicated Data Service Capstone