Module Capstone: Replication + Partitioning Plan

LESSON

Consistency and Replication

Lesson 051 · 30 min · Advanced · Capstone

Day 480: Module Capstone: Replication + Partitioning Plan

The core idea: A production scale-out plan starts by deciding which operations must stay under one owner, then chooses replication, shard keys, lag budgets, and rebalancing rules that preserve that contract under load and failure.

Today's "Aha!" Moment

After 050.md, Harbor Point can move hot bucket 173 without corrupting inventory, but leadership is asking for a broader answer than "we can rebalance safely." Summer traffic is projected to triple, the Madrid and New York sales desks will both be active during overlap hours, and POST /bookings/confirm still has to mean one cabin on one trip is sold exactly once. The tempting answer is to turn every region into a writable replica and shard by route_date because most dashboards group work that way.

That answer breaks in two different places. A popular route such as BCN-JFK-2026-07-14 still hotspots one shard if the partition key follows the route, and active-active writes make the decisive inventory check depend on conflict resolution after the user has already been told "confirmed." The important insight is that replication and partitioning are not separate tuning knobs. The shard key decides which state must be decided together; replication decides how that owner survives failure and how weaker read paths are served around it.

Harbor Point's workable plan is narrower and more explicit. The sellable unit, (trip_instance_id, cabin_id), hashes to a logical bucket. Each bucket belongs to one replica group leader at a time. The leader commits only after a same-region follower acknowledges the write, remote replicas trail asynchronously, and the search and itinerary views are fed from change streams instead of sharing the decisive write path. Rebalancing changes bucket placement only through the versioned cutover mechanism from 050.md. Once those pieces are tied together, the trade-off becomes honest instead of aspirational: definitive booking stays safe and explainable, while weaker reads and disaster recovery are served through paths that admit their lag budget up front.

Why This Matters

This capstone matters because a replication strategy that ignores partition boundaries is incomplete, and a partitioning scheme that ignores replication contracts is dangerous. Harbor Point does not need a diagram with more boxes. It needs an operating plan that answers concrete production questions: which endpoint can read from a follower, how much lag is acceptable before a region stops advertising itself as failover-ready, which state is allowed to move during rebalancing, and which operations must never span shards on the hot path.

Before that plan exists, engineering discussions collapse into slogans. One team asks for multi-leader writes to reduce latency. Another asks for a larger primary database to postpone repartitioning. Product assumes every successful write will be visible everywhere immediately because nobody has said otherwise. After the plan exists, each API is mapped to a guarantee and a data path: final confirmation is a single-shard linearizable decision, hold details use read-your-write routing, itinerary views wait for causal dependencies, and search is explicitly bounded-stale derived state. The trade-off is visible too. Harbor Point accepts coordination cost on the narrow path that protects inventory ownership so it can avoid paying that same cost on every browse query and dashboard read.

The production consequence is that scale events stop being correctness experiments. Capacity expansion, zone maintenance, lag spikes, and hot-bucket migration become operational exercises with known guardrails. That is what a real replication-plus-partitioning plan buys: fewer hidden promises, clearer runbooks, and a system whose failure behavior matches what the product actually claims.

Core Walkthrough

Part 1: Start from the contract Harbor Point is actually selling

Harbor Point begins the design review by listing the user-facing operations that matter and the guarantees each one really needs:

Endpoint: POST /bookings/confirm
  Required guarantee: linearizable single-shard commit
  Why: a cabin can only be sold once, and a success response must mean the decision is final
  Serving path: route to the authoritative inventory leader for that bucket

Endpoint: GET /holds/H-8821 after the same agent updates it
  Required guarantee: read-your-write plus monotonic reads
  Why: the agent must never see hold expiry move backward within the same session
  Serving path: read from a replica only if it has replayed the caller's observed version

Endpoint: GET /customer-trips immediately after confirmation
  Required guarantee: causal visibility
  Why: the customer should not see payment or confirmation without the trip record that depends on it
  Serving path: use a dependency token and wait for the trip projection to catch up

Endpoint: GET /search/cabins?route=BCN-JFK
  Required guarantee: bounded staleness, up to 2s
  Why: search is advisory and will be revalidated before booking
  Serving path: serve from a CDC-fed search index with a freshness budget

That table comes before any topology diagram because it tells Harbor Point where coordination is mandatory and where it is wasteful. The decisive inventory write cannot be left to multi-leader reconciliation. The browse surface does not deserve the same commit rule as a booking confirmation. Once the API contract is explicit, the rest of the architecture becomes a cost-optimization problem with hard correctness boundaries.
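The contract above can live in code as well as in a design document. A minimal sketch, assuming a hypothetical gateway that consults a static plan (the `Guarantee` enum and `SERVING_PLAN` names are illustrative, not Harbor Point's actual configuration):

```python
from enum import Enum

class Guarantee(Enum):
    LINEARIZABLE = "linearizable single-shard commit"
    READ_YOUR_WRITE = "read-your-write plus monotonic reads"
    CAUSAL = "causal visibility"
    BOUNDED_STALE = "bounded staleness (up to 2s)"

# Each endpoint maps to (required guarantee, serving path). The gateway
# consults this table before any topology decision is made.
SERVING_PLAN = {
    "POST /bookings/confirm": (Guarantee.LINEARIZABLE,
                               "authoritative bucket leader"),
    "GET /holds/{hold_id}":   (Guarantee.READ_YOUR_WRITE,
                               "follower, only past the caller's observed version"),
    "GET /customer-trips":    (Guarantee.CAUSAL,
                               "trip projection, gated by a dependency token"),
    "GET /search/cabins":     (Guarantee.BOUNDED_STALE,
                               "CDC-fed search index"),
}
```

Making the plan a lookup table means a new endpoint cannot ship without someone stating, in review, which row it belongs to.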

This is also where the capstone pulls together the preceding lessons. 048.md defined the endpoint semantics, 049.md showed that the shard key has to keep the must-decide-together state local, and 050.md showed how that local state can move without splitting ownership. The plan is therefore not "pick a database mode." It is "make each guarantee cheap enough to keep."

Part 2: Choose placement and replication that keep the strict path local

Harbor Point's authoritative key is the inventory unit itself: inventory_id = trip_instance_id + cabin_id. That key hashes into 2048 logical buckets, and the control plane maps those buckets onto 16 physical replica groups. A logical bucket is small enough to move later, but large enough to avoid managing millions of independent placements.
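The key-to-bucket step can be sketched in a few lines. This is an illustrative implementation under the stated assumptions (2048 buckets, 16 groups); the hash function and the naive initial placement are choices made for the sketch, not Harbor Point's actual code:

```python
import hashlib

NUM_BUCKETS = 2048   # logical buckets, stable across the system's lifetime
NUM_GROUPS = 16      # physical replica groups at current capacity

def inventory_bucket(trip_instance_id: str, cabin_id: str) -> int:
    """Hash the sellable unit (trip_instance_id, cabin_id) to a stable bucket."""
    key = f"{trip_instance_id}:{cabin_id}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

# The control plane, not the hash, owns bucket -> group placement, which is
# what lets buckets move later without changing the key. A naive first layout:
initial_placement = {b: b % NUM_GROUPS for b in range(NUM_BUCKETS)}
```

Because the hash is deterministic, every router computes the same bucket for a given cabin; only the placement map, which is small and versioned, ever needs to change.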

One replica group for bucket 173 looks like this:

bucket 173 -> replica group B

group B
  leader          madrid-a
  sync follower   madrid-b
  async follower  newyork-a

The commit rule is intentionally narrow. POST /bookings/confirm reaches the leader for the bucket that owns cabin C14, checks that the matching hold is still active, appends the mutation to the partition log, and waits for the same-region follower to acknowledge durability. The remote follower is not in the normal commit quorum. Harbor Point is choosing lower normal-path latency and a non-zero cross-region RPO over continent-spanning synchronous consensus on every booking.

def confirm_booking(cmd, observed_tokens):
    # Locate the bucket that owns this inventory unit and its replica group.
    bucket = inventory_bucket(cmd.trip_instance_id, cmd.cabin_id)
    group = partition_map.lookup(bucket)

    # Commit only if the matching hold is still active; the write is durable
    # once one same-region follower acknowledges it (sync_replicas=1).
    return group.leader.commit_if(
        key=cmd.inventory_id,
        predicate=lambda row: row.status == "held" and row.hold_id == cmd.hold_id,
        writes=[
            ("inventory", cmd.inventory_id, {"status": "booked"}),
            ("holds", cmd.hold_id, {"status": "confirmed"}),
        ],
        sync_replicas=1,
        # The returned token lets session-bound reads demand this log position.
        emit_token={"inventory": group.last_log_index()},
    )

The choice of shard key is doing most of the work here. By keeping the inventory row and authoritative hold state on the same partition boundary, Harbor Point avoids turning the booking decision into a cross-shard transaction. Sharding by customer_id would make the "my trips" page convenient but would scatter two customers competing for the same cabin. Sharding by route_date would make search locality convenient but would concentrate a flash sale on one route into one hot shard. The plan keeps the decisive ownership check local on purpose and pays for convenience reads somewhere else.

Those convenience reads are explicit secondary paths, not accidental side effects. The leader's log feeds a search index keyed by route and departure date, and it feeds a trip projection keyed by customer. Session-bound reads carry an observed-version token so the gateway can reject a stale follower and fall back to the leader when necessary. That is what lets Harbor Point mix strong and weak paths without lying about which one the caller is on.
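The gateway-side check is small. A sketch, assuming each replica exposes the highest log index it has applied (the `Replica` shape and field names are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    replayed_index: int   # highest partition-log position this replica has applied
    store: dict           # simplified key-value view of replicated state

def read_hold(hold_id: str, observed_token: dict,
              leader: Replica, follower: Replica):
    """Serve a session-bound read from a follower only if it has replayed
    the log position the caller last observed; otherwise use the leader,
    so the session never sees its own write disappear."""
    if follower.replayed_index >= observed_token.get("inventory", 0):
        return follower.store[hold_id]
    return leader.store[hold_id]
```

The token from `confirm_booking`'s `emit_token` is exactly what flows into `observed_token` here, which is what ties the weak read path back to the strong write path.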

Harbor Point also rejects two designs on purpose. It does not use multi-leader replication for the authoritative inventory path because "merge after conflict" is not a real answer once two agents believe cabin C14 is theirs. It does not use leaderless quorum writes for confirmation because the conflict classes from 047.md are acceptable for some distributed stores and derived views, but not for a one-seat inventory invariant with immediate user-visible confirmation.

Part 3: Turn the architecture into operating rules before traffic arrives

The design is only real once Harbor Point can say what happens during lag, failover, and capacity movement.

First, lag budgets become release criteria, not dashboard trivia. Search may serve results up to 2s stale. Session-bound hold reads may use a follower only if it has replayed the client's observed token. The remote async follower may lag by at most 5s if Harbor Point wants to claim an RPO of 5s for regional loss. If the lag exceeds that threshold, the system does not quietly keep selling tickets while pretending disaster recovery is unchanged. It pages the on-call team, slows non-essential background work, and can pause planned topology changes until the budget is back inside bounds.
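As a sketch, the lag-budget rule is a tiny policy function. The action names here are made up for illustration; the budgets come from the plan itself:

```python
SEARCH_STALENESS_BUDGET_S = 2.0   # search index freshness budget
REMOTE_RPO_BUDGET_S = 5.0         # claimed RPO for regional loss

def dr_actions(remote_lag_s: float) -> list:
    """What the control plane does when the async remote follower's lag
    crosses the RPO budget. Illustrative policy only."""
    if remote_lag_s <= REMOTE_RPO_BUDGET_S:
        return []
    return [
        "page-oncall",               # the advertised RPO no longer holds
        "throttle-background-work",  # free replication bandwidth
        "pause-topology-changes",    # no rebalancing while over budget
    ]
```

The point of encoding it is that "failover-ready" becomes a computed property of measured lag, not an assumption baked into a runbook nobody rereads.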

Second, rebalancing is part of the replication plan, not a separate maintenance trick. Harbor Point watches per-bucket write concentration and storage skew. When one bucket starts consuming a disproportionate share of a replica group's capacity, the control plane copies the bucket to a new group, streams the delta, and cuts over only after the target has replayed through the barrier sequence from 050.md. The partition-map generation is part of every routed write, so stale routers are forced to refresh instead of mutating the old owner after cutover.
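The generation check on routed writes can be sketched directly; the exception name and map shape are assumptions for the sketch:

```python
class StaleRoutingError(Exception):
    """Raised so the router refreshes its partition map before retrying."""

def route_write(bucket: int, request_generation: int,
                placement: dict, current_generation: int) -> str:
    # Every routed write carries the map generation the router believes is
    # live. A mismatch after cutover fences the stale router out instead of
    # letting it mutate the old owner.
    if request_generation != current_generation:
        raise StaleRoutingError(
            f"router at gen {request_generation}, "
            f"map is at gen {current_generation}")
    return placement[bucket]
```

This is the same fencing idea as the versioned cutover in 050.md: correctness does not depend on every router hearing about the move promptly, only on stale writes being rejected.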

Third, failover needs an explicit uncertainty story. Because the remote replica is asynchronous, an outage can leave clients unsure whether a timed-out confirmation actually committed before the region failed. Harbor Point therefore treats the idempotency token on POST /bookings/confirm as part of the storage contract. After promotion, support and clients can ask the new leader whether token BK-2026-88421 committed, rather than retrying the business action blindly.
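A minimal sketch of that lookup, assuming the new leader can expose the set of idempotency tokens in its durably replayed prefix (the function and field names are hypothetical):

```python
def resolve_retry(token: str, committed_tokens: set,
                  promotion_complete: bool) -> str:
    """After regional failover, answer 'did my confirmation commit?' from
    the new leader's durable log instead of blindly retrying the booking."""
    if not promotion_complete:
        return "unknown"   # keep the client retry queued until promotion ends
    return "committed" if token in committed_tokens else "not-committed"
```

A "not-committed" answer means the client may safely re-run the business action; a "committed" answer means the earlier timeout was a lost response, not a lost write.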

The operating sequence during a regional event is mechanical:

1. Freeze routing at partition-map generation N.
2. Promote only the replica's durable replayed prefix.
3. Publish generation N+1 with the new leader.
4. Resume writes through the promoted owner.
5. Resolve ambiguous client retries by idempotency token.

This is also where the boundary to the next lesson appears. Harbor Point's plan is strongest when the decisive operation stays inside one shard: inventory state plus hold state plus the booking record that belongs to that inventory unit. If the product later insists that booking confirmation must atomically reserve a loyalty upgrade, decrement a global promotion budget, and write to a payment ledger on another partition, the topology plan alone is no longer enough. That is the handoff to 052.md: transaction semantics become the next design problem once the business action stops fitting inside the carefully chosen partition boundary.

Connections

Connection 1: 048.md explains why the capstone starts with endpoint guarantees instead of database slogans

Harbor Point's topology choices only make sense because the platform already decided which endpoints need linearizable confirmation, which need session guarantees, and which can tolerate bounded staleness.

Connection 2: 050.md supplies the safe movement mechanism that this plan depends on

Logical buckets and versioned cutover are what let Harbor Point change capacity distribution without redefining the shard key or briefly allowing two owners for the same inventory unit.

Connection 3: 052.md begins exactly where this plan stops being sufficient

Once one business action must span more than one shard or more than one authoritative subsystem, the next problem is no longer placement. It is transaction semantics and which anomalies the system is willing to rule out.

Key Takeaways

  1. A replication-plus-partitioning plan starts with the user-visible invariant and the endpoint contract, not with a preferred topology slogan.
  2. Harbor Point keeps definitive booking on one shard leader, then uses follower reads, CDC projections, and bounded staleness only where the product can tolerate them.
  3. Lag budgets, idempotency tokens, and partition-map generations are what turn replication and partitioning from theory into a production operating model.
  4. The plan is intentionally optimized so the critical booking path stays single-shard; the moment the business action crosses that boundary, transaction semantics become the next design problem.
