LESSON
Day 432: Monthly Capstone: Geo-Distributed Database Service
The core idea: A geo-distributed database is not "one writable database everywhere." It is a contract about shard ownership, read guarantees, and failover. If those three choices do not agree, extra regions only multiply the ways the service can be wrong.
Today's "Aha!" Moment
In 15.md, Harbor Point learned how to test a distributed database the hard way: not by admiring a clean failover demo, but by reconstructing the history of real reservation requests and checking whether the cluster kept its promises. The capstone question is what architecture is worth testing in the first place now that the reservation platform is expanding beyond one region.
The tempting answer is "put writable replicas in Madrid and New York so every desk can write locally." For Harbor Point's workload, that is the wrong instinct. A reservation is not an append-only social post. Each approval also updates the issuer's remaining exposure. If Madrid and New York can independently approve reservations for issuer MUNI-77, the system has to reconcile not just rows, but a financial limit that may already have been oversold. After two desks have both been told "confirmed," there is no harmless merge strategy.
The right design move is to place the invariant before placing the replicas. Harbor Point shards by issuer, keeps issuer_exposure and the issuer's live reservations on the same shard, and gives each shard one home region with one leader at a time. Madrid and New York can both serve the product globally, but they do not both invent the order of writes for the same issuer. Remote replicas, CDC-fed read models, and failover procedures exist to extend visibility and resilience around that ordered write path, not to replace it.
That is the capstone insight for the whole month. Replication, quorums, sharding, read replicas, CDC, backups, observability, and Jepsen-style validation are not separate tricks. They are pieces of one service contract. Once Harbor Point makes that contract explicit, the next lesson, ../28/01.md, becomes inevitable: some metadata service has to publish which region owns each shard and which leader generation is currently authoritative.
Why This Matters
Harbor Point is launching follow-the-sun reservation intake for bond desks in Madrid and New York. Traders in both cities need fast confirmations. Support needs to look up any reservation globally. Compliance wants a cross-region search surface within seconds, not hours. Operations also wants a credible story for a regional outage. Those are all reasonable requirements until they are forced into the same database design.
If Harbor Point solves the problem with a single slogan, the service will fail in production. "Strong consistency everywhere" turns every write and many reads into cross-ocean coordination, which pushes ordinary p99 latency into the product budget. "Local writes in every region" preserves latency, but it breaks the issuer-exposure invariant unless the business accepts conflicts that can no longer be repaired after confirmation. "Just use read replicas" helps dashboards and does nothing for shard ownership or disaster recovery.
The capstone is therefore about choosing a design that is explicit about what is local, what is global, and what is allowed to lag. A geo-distributed database becomes workable only when those categories are tied to specific endpoints, replication paths, and failure runbooks.
Learning Objectives
By the end of this session, you will be able to:
- Choose a shard-placement strategy from the real invariant - Explain why Harbor Point homes each issuer shard in one region instead of allowing the same issuer to be written independently in multiple regions.
- Separate read paths by guarantee level - Map confirmations, trader dashboards, and global compliance search onto different data paths with different freshness contracts.
- Defend a failover design with concrete budgets - Reason about RPO, RTO, idempotency, and validation gates instead of treating "multi-region" as automatic safety.
Core Concepts Explained
Concept 1: Put the invariant on one shard and give that shard one region of authority
Harbor Point's critical transaction is still the one from earlier in the month: lock issuer_exposure, check remaining headroom, insert the new reservation, and commit. The capstone decision is that those rows must stay together physically. If the system shards by trader region, client office, or a random hash unrelated to the issuer, the approval path turns into a cross-shard transaction exactly where the business can least afford it. The issuer limit would then depend on two-phase commit (2PC) or asynchronous compensation in the hot path of every reservation.
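As a concrete sketch, the approval transaction might look like the following, assuming psycopg2-style transaction semantics and hypothetical issuer_exposure and reservations tables; the names are illustrative, not Harbor Point's actual schema.

```python
# A sketch of the single-shard approval transaction, assuming psycopg2-style
# semantics ("with conn:" commits on success, rolls back on exception) and
# hypothetical issuer_exposure / reservations tables.
def approve_reservation(conn, issuer_id: str, token: str, amount: int) -> bool:
    with conn:
        with conn.cursor() as cur:
            # Lock the issuer's exposure row so concurrent approvals
            # serialize on this shard's leader.
            cur.execute(
                "SELECT limit_amount - used_amount FROM issuer_exposure "
                "WHERE issuer_id = %s FOR UPDATE",
                (issuer_id,),
            )
            row = cur.fetchone()
            if row is None or row[0] < amount:
                return False  # no headroom: reject before writing anything
            # Both writes land on the same shard, so no cross-shard 2PC.
            cur.execute(
                "INSERT INTO reservations (token, issuer_id, amount, status) "
                "VALUES (%s, %s, %s, 'confirmed')",
                (token, issuer_id, amount),
            )
            cur.execute(
                "UPDATE issuer_exposure SET used_amount = used_amount + %s "
                "WHERE issuer_id = %s",
                (amount, issuer_id),
            )
    return True
```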
Instead, Harbor Point shards by issuer and assigns each shard a home region based on the book that owns the issuer. EU issuers are homed in Madrid. US issuers are homed in New York. Both regions can accept API traffic, but a router first looks up the shard map and forwards the write to the shard leader in its home region. That keeps the approval invariant local to one Raft group or leader-based replication group at a time.
The normal path looks like this:
client in Madrid or New York
|
v
regional API gateway
|
v
metadata lookup: issuer MUNI-77 -> shard 184 -> home region New York
|
v
shard leader in New York
|
+--> sync replica in New York AZ-2
|
+--> async replica in Madrid
The important point is not the city names. The important point is that only one place is allowed to order writes for shard 184. A Madrid trader reserving a US issuer pays the WAN hop to New York because that hop is cheaper than making the issuer-limit invariant ambiguous. Harbor Point is explicitly choosing single-writer shards over active-active conflict resolution for the OLTP system of record.
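A minimal sketch of that routing decision, with a hypothetical in-memory shard map standing in for the metadata lookup in the diagram above; all names and shapes are illustrative:

```python
# A sketch of write routing against a hypothetical shard map published by
# the metadata service.
from dataclasses import dataclass

@dataclass
class ShardEntry:
    shard_id: int
    home_region: str       # e.g. "new-york" or "madrid"
    leader_endpoint: str   # current leader of this shard's replication group
    generation: int        # leader generation; stale maps must lose

SHARD_MAP = {
    "MUNI-77": ShardEntry(184, "new-york", "ny-db-184.internal:5432", 12),
}

def route_write(issuer_id: str, caller_region: str) -> tuple[str, bool]:
    entry = SHARD_MAP[issuer_id]
    wan_hop = caller_region != entry.home_region
    # A Madrid caller writing a US issuer pays the WAN hop deliberately:
    # only the home-region leader may order writes for this shard.
    return entry.leader_endpoint, wan_hop
```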
That choice also defines the trade-off. Local writers for home-region issuers get predictable commit latency because the acknowledgment depends on local WAL durability plus a synchronous replica in another availability zone, not on a global quorum. Cross-region writers pay extra latency. Regional failover is possible, but it is not zero-cost magic because the remote replica is not in the commit quorum. Harbor Point accepts that trade because the business already has an upstream order journal and can tolerate a tightly bounded RPO more easily than a permanently inflated write path.
Concept 2: Use different data paths for different questions instead of one impossible promise
Once shard ownership is clear, Harbor Point stops pretending every endpoint needs the same consistency contract. The reservation system actually serves three different questions.
The first question is transactional: "did reservation R-88421 commit?" That path must hit the shard leader or a read path that is provably at least as fresh as the caller's last acknowledged write. Harbor Point therefore returns a session token containing the commit LSN or log index with every successful write. If the client later reads from its local region and the replica has not replayed that token yet, the gateway routes the read back to the leader instead of serving a stale answer.
The second question is operational: "show me the newest open reservations for this issuer." That dashboard can tolerate a small lag as long as the contract is explicit. Harbor Point serves it from a region-local follower when replay lag is under two seconds; otherwise it falls back to the leader or marks the view as stale. This uses the read-replica mechanics from 11.md without lying about read-your-writes.
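The first two read modes can share one routing layer. Here is a minimal sketch, assuming a hypothetical Replica shape that exposes its replayed LSN and replay lag; the leader endpoint is a placeholder, and the two-second budget comes from the contract above:

```python
# A sketch of the two non-analytical read modes. The Replica shape and the
# leader endpoint are assumptions for illustration.
from dataclasses import dataclass

MAX_DASHBOARD_LAG_SECONDS = 2.0
LEADER_ENDPOINT = "ny-db-184.internal:5432"   # hypothetical shard leader

@dataclass
class Replica:
    endpoint: str
    replayed_lsn: int          # last WAL position applied locally
    replay_lag_seconds: float

def route_confirmation_read(session_token_lsn: int, replica: Replica) -> str:
    # Read-your-writes: the replica must have replayed at least the LSN the
    # client saw acknowledged; otherwise route the read back to the leader.
    if replica.replayed_lsn >= session_token_lsn:
        return replica.endpoint
    return LEADER_ENDPOINT

def route_dashboard_read(replica: Replica) -> tuple[str, bool]:
    # Bounded staleness: serve locally while replay lag is inside the budget,
    # else fall back to the leader (or serve the view explicitly marked stale).
    if replica.replay_lag_seconds <= MAX_DASHBOARD_LAG_SECONDS:
        return replica.endpoint, False   # (endpoint, is_stale)
    return LEADER_ENDPOINT, False
```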
The third question is global and analytical: "search all reservations matching a compliance filter across regions." Harbor Point should not answer that with a synchronous fan-out query across every shard leader. Instead it streams CDC from each shard's WAL into a global reservation_search index and a same-day compliance view in both regions. That view is seconds behind by design, but it avoids coupling every compliance search to the OLTP commit path.
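A sketch of the consumer behind that search view, with a hypothetical change-event shape; a real pipeline would read these events from each shard's WAL via logical decoding or a log bus:

```python
# A sketch of the CDC apply loop behind reservation_search. The event shape
# and index structure are assumptions; the point is that this path is derived
# state, seconds behind by design, and never part of the OLTP commit.
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    shard_id: int
    lsn: int            # position in the source shard's WAL
    reservation: dict   # row image after the change

class SearchIndex:
    def __init__(self) -> None:
        self.docs: dict = {}
        self.applied_lsn: dict = {}   # shard_id -> last applied position

    def apply(self, ev: ChangeEvent) -> None:
        # Idempotent apply: replaying an already-seen prefix is harmless.
        if ev.lsn <= self.applied_lsn.get(ev.shard_id, -1):
            return
        self.docs[ev.reservation["token"]] = ev.reservation
        self.applied_lsn[ev.shard_id] = ev.lsn   # freshness SLI per shard
```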
One way to summarize the service is:
| question | source of truth | contract |
| --- | --- | --- |
| confirmation / token status | shard leader | linearizable for that shard |
| issuer dashboard | regional follower or leader | bounded staleness + session fallback |
| global compliance search | CDC-fed search/index service | eventual, usually within seconds |
This is the architectural payoff of 09.md, 10.md, 11.md, and 12.md. CDC, materialized views, follower reads, and global indexes are not separate features to bolt on later. They are how Harbor Point avoids forcing a single "strong everywhere" contract onto endpoints that do not need it. The cost is a duty of operational clarity: APIs and dashboards have to declare which read mode they use, or developers will accidentally depend on freshness that the system never promised.
Concept 3: A geo-distributed service is only real when failover, lag budgets, and uncertain writes are all specified
Harbor Point's design deliberately does not put the remote region in the synchronous write quorum. That keeps the normal path fast, but it means the service must declare a real disaster-recovery contract. The production target is explicit: per-shard RPO of at most five seconds, RTO of under ten minutes, and no silent duplicate reservations after failover.
Those numbers are enforced operationally, not by wishful thinking. The gateway watches remote replay lag and WAL archive age for each shard. If the remote copy falls more than five seconds behind, Harbor Point stops accepting new reservations for that shard and returns a retryable failure rather than allowing the recovery promise to drift invisibly. This is the same lesson as 14.md: internal state, not API latency, defines whether the storage contract is still true.
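The gate itself is only a few lines. A sketch, assuming hypothetical per-shard lag probes and a simple combination rule; the five-second budget is the stated RPO:

```python
# A sketch of the per-shard RPO gate. The probe values and the combination
# rule are assumptions; the five-second budget is the stated contract.
RPO_BUDGET_SECONDS = 5.0

def admit_write(remote_replay_lag_s: float, wal_archive_age_s: float) -> bool:
    # The recoverable copy is whichever channel is fresher: the async remote
    # replica or the archived WAL. If even the fresher one is past the budget,
    # refuse new reservations for this shard with a retryable error upstream.
    recoverable_loss = min(remote_replay_lag_s, wal_archive_age_s)
    return recoverable_loss <= RPO_BUDGET_SECONDS
```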
Every write also carries a client-generated reservation_token. That token is the bridge between the OLTP path and the failover runbook. If a regional outage happens after a client saw a timeout or even after it saw success during the uncertain replication window, the client can ask the promoted shard whether that token committed. The system never resolves ambiguity by replaying the business action blindly. It resolves ambiguity by turning idempotency into an addressable record.
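In code terms, resolving the uncertain window is a lookup keyed by the token the client already holds, never a resubmission. A sketch against the same psycopg2-style connection as the earlier transaction sketch:

```python
# A sketch of token-status lookup against the (possibly just promoted) shard
# leader. Table and status values are illustrative.
def resolve_token(conn, token: str) -> str:
    with conn.cursor() as cur:
        cur.execute("SELECT status FROM reservations WHERE token = %s",
                    (token,))
        row = cur.fetchone()
    if row is None:
        return "not-committed"  # safe to retry the original request once
    return row[0]               # e.g. 'confirmed': never submit again
```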
The failover sequence is therefore mechanical, as the sketch after this list shows:
1. Freeze routing for shard 184 at configuration generation N.
2. Promote the remote replica only up to its last durable replayed position.
3. Publish generation N+1 in the metadata service.
4. Resume writes through the new leader.
5. Reconcile tokens that clients reported during the outage window.
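A minimal orchestration sketch of those five steps; the metadata client, replica handle, and notification callback are all hypothetical:

```python
# A sketch of the five runbook steps as one routine. Every handle here is an
# assumption standing in for real control-plane and replica APIs.
def notify_client(token: str, status: str) -> None:
    print(f"token {token}: {status}")  # stand-in for the real callback

def fail_over_shard(meta, shard_id: int, remote_replica, pending_tokens):
    gen = meta.current_generation(shard_id)
    meta.freeze_routing(shard_id, generation=gen)      # 1. stop routing
    promote_lsn = remote_replica.last_durable_replayed_lsn()
    remote_replica.promote(up_to=promote_lsn)          # 2. no invented tail
    meta.publish_leader(shard_id,
                        endpoint=remote_replica.endpoint,
                        generation=gen + 1)            # 3. one new generation
    meta.resume_routing(shard_id, generation=gen + 1)  # 4. writes flow again
    for token in pending_tokens:                       # 5. resolve doubt
        notify_client(token, remote_replica.lookup_token(token))
```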
This is exactly where 15.md matters. Harbor Point can and should inject partitions between regions, pause the old leader, delay replication, and verify that one token never becomes two committed reservations. The team should also test that stale followers are never allowed to serve a "strong" read and that routing changes happen once per generation, not in split-brain fashion. A geo-distributed database design is production-ready only when those tests pass repeatedly.
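The core safety property from those tests can be checked mechanically over a recorded history. A sketch of the checker, assuming the harness logs every committed reservation event:

```python
# A sketch of the post-run safety check: across all shards and leader
# generations, one reservation token must never commit twice. The event
# shape is whatever the test harness records.
from collections import Counter

def assert_no_duplicate_commits(committed_events: list[dict]) -> None:
    counts = Counter(ev["token"] for ev in committed_events)
    duplicates = {t: n for t, n in counts.items() if n > 1}
    assert not duplicates, f"tokens committed more than once: {duplicates}"
```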
The final trade-off is honesty. Harbor Point is not claiming the same thing as a system that synchronously replicates a write quorum across continents. If the business later requires zero RPO across a regional loss for the reservation path, this architecture is no longer sufficient. The team would need to move remote replicas into the commit rule and accept the latency cost directly. The capstone answer is not universal. It is correct because it matches Harbor Point's stated workload, failure budget, and operational maturity.
Troubleshooting
Issue: A trader submits a reservation in Madrid, gets success, then immediately refreshes the local dashboard and does not see the reservation.
Why it happens / is confusing: The dashboard is reading from a regional follower whose replay position is behind the writer's commit token. The write is valid, but the read path is weaker than the user assumes.
Clarification / Fix: Carry the last-seen commit token through the session. If the local follower has not reached that position, route the read to the shard leader or label the view as stale instead of pretending the reservation disappeared.
Issue: A regional failover completes, but support cannot tell whether a just-submitted reservation committed before the outage.
Why it happens / is confusing: The system acknowledged writes from the home region's local quorum, while the promoted remote replica only knows its own replayed prefix. Without an idempotency token, the operator has no reliable way to separate "reply lost" from "write never committed remotely."
Clarification / Fix: Require a durable client token for every reservation and make token-status lookup part of the public contract. During failover, resolve uncertainty by token status, not by replaying the business request manually.
Issue: Compliance search disagrees with the OLTP record for several seconds after a burst of writes.
Why it happens / is confusing: The search surface is CDC-fed derived state, not the transactional source of truth. WAL shipping or CDC apply can lag temporarily under burst load even while the shard leader remains current.
Clarification / Fix: Publish the freshness contract for the search index explicitly, monitor CDC lag as a first-class SLI, and send investigators to the leader-backed path when they need authoritative status during an incident.
Advanced Connections
Connection 1: 15.md tells Harbor Point how to validate this architecture instead of trusting the diagram
The capstone chooses one-writer shards, bounded-lag followers, and token-based failover reconciliation. Jepsen-style validation is what proves those choices actually preserve the business contract under partitions, delayed replication, and leader changes rather than merely sounding disciplined in a design document.
Connection 2: ../28/01.md starts where this capstone leaves off: the metadata plane becomes a system of record too
Once each issuer shard has a home region, leader generation, and failover state, Harbor Point needs a durable place to publish that mapping. The data plane can no longer be understood without the control plane that tells routers which shard owner is authoritative at any given moment.
Resources
Optional Deepening Resources
- [PAPER] Spanner: Google's Globally-Distributed Database
- Focus: Study a system that does choose synchronous cross-region replication, then compare its guarantees and costs with Harbor Point's lower-latency single-writer shard design.
- [DOC] Topology Patterns Overview
- Focus: Compare geo-partitioned low-latency placement with multi-region topologies that trade more latency for stronger regional resilience.
- [DOC] Spanner: TrueTime and external consistency
- Focus: Understand why a "strong everywhere" database needs an explicit ordering model for reads and writes across regions.
- [DOC] Jepsen Analyses
- Focus: Read real failure reports and map them to the split-brain, stale-read, and lost-acknowledgment cases Harbor Point must rule out.
- [BOOK] Designing Data-Intensive Applications
- Focus: Revisit the chapters on partitioning, replication, and derived data with this capstone architecture in mind.
Key Insights
- Geo-distribution starts with shard authority, not with replica count - If the invariant does not have one place that orders it, every additional region increases the chance of unrecoverable conflicts.
- Different questions deserve different read paths - Confirmations, dashboards, and compliance search should not all pay for the same consistency contract.
- Failover is part of the product contract - RPO, RTO, replica lag gates, and token-based uncertainty handling must be designed before the service can truthfully call itself multi-region.