Day 496: Module Capstone: Consistency and Coordination Design

LESSON 065 · Consistency and Replication · 30 min · Advanced · Capstone

The core idea: A production consistency design works when it narrows expensive coordination to the business decisions that must be final, then serves everything else from replayable logs and bounded-stale projections with explicit lag budgets.

Today's "Aha!" Moment

Suppose Harbor Point opens the last wheelchair-accessible cabin on sailing HP-2026-07-14 to three channels at once: a Madrid agent desk, a New York call center, and the passenger self-service site. At the same time, embarkation staff need the boarding manifest to reflect cleared passport holds within seconds, and finance needs payment captures reconciled without double charging after retries.

The capstone question is not "should Harbor Point use strong consistency or eventual consistency?" That is too coarse to help. The real question is where the system is willing to pay for coordination. If every read and integration waits on a global decision path, latency and availability collapse under traffic. If none of the paths coordinate, the last accessible cabin can be sold twice and no amount of projection cleanup will repair the original promise made to the customer.

The design becomes defendable once Harbor Point names its invariants first and then assigns a mechanism to each one. The cabin sale is a single authoritative shard decision. Immediate user confirmation is protected by idempotency keys and leader-routed writes. Search and boarding dashboards are derived from the committed log and allowed to be slightly stale within declared budgets. The trade-off is intentional: the system pays coordination cost only where the product cannot tolerate ambiguity.

Why This Matters

Production systems do not fail because engineers forget the vocabulary of linearizability or eventual consistency. They fail because a product promise crosses paths with an architecture that never said which state is authoritative, which retries are safe, or which read paths are allowed to lag. For Harbor Point, overselling an accessible cabin is not a harmless stale-read bug. It triggers manual relocations, support escalations, and potential regulatory trouble once the passenger arrives at the port.

Before the design is explicit, teams compensate with local patches. Search is refreshed more aggressively. Support is given a "fix booking" admin tool. Payment retries are tuned. None of those patches answer the core coordination question. After the design is explicit, Harbor Point can defend each path: POST /holds and POST /bookings/confirm route to the authoritative inventory owner, the payment gateway is integrated through idempotent confirmation rather than fantasy distributed transactions, and every downstream view can explain exactly how fresh it is and how it recovers from replay.

The production relevance is that incidents become diagnosable by contract instead of by folklore. When a user says "I got a timeout but my card was charged," the team knows which idempotency record and commit index to inspect. When the manifest lags, the team knows whether to delay boarding or fail over to the leader-backed read path. That is what a capstone consistency design should buy: a small set of hard guarantees, a larger set of honest weaker guarantees, and clear operating rules between them.

Core Walkthrough

Part 1: Start with the product promises Harbor Point cannot break

Keep one flow in view: a customer is confirming booking BK-88421 for the last wheelchair-accessible cabin C14 on sailing HP-2026-07-14 while an agent in another region attempts the same sale.

Harbor Point writes down the contract before it chooses infrastructure:

| User-facing action | Required guarantee | Chosen mechanism | What Harbor Point is willing to pay |
| --- | --- | --- | --- |
| POST /holds for cabin C14 | One current hold owner at a time for the inventory unit | Route to the shard that owns (sailing_id, cabin_id) and commit through its replicated write log | Cross-node coordination inside one replica group |
| POST /bookings/confirm | Success means the cabin is definitively booked once | Same shard transaction validates hold, payment authorization, and idempotency key before commit | Leader-routed write, retry bookkeeping, and synchronous local replica acknowledgment |
| GET /bookings/BK-88421 right after confirm | Read-your-write for the caller | Return an observed commit token and route immediate reads to a caught-up replica or the leader | Occasional leader reads when followers lag |
| GET /search/cabins?route=BCN-JFK | Bounded staleness, up to 2s | Serve from a projection fed by committed events | Slightly stale browse results, followed by decisive revalidation at hold time |
| GET /manifest/HP-2026-07-14 | Near-real-time operational correctness, up to 5s lag | Consume the event log into a boarding projection with lag alarms | Projection infrastructure and operational lag monitoring |

This table is the real design artifact. It prevents Harbor Point from pretending that every endpoint deserves the same consistency cost. It also prevents the opposite mistake: hiding critical user-visible decisions behind "eventual" language when the business action is irreversible.
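
One way to keep that artifact enforceable is to encode it as data the gateway and read routers consult on every request. A minimal Python sketch, assuming hypothetical field names such as routing and max_staleness_s (nothing here is prescribed by Harbor Point's actual stack):

# Hypothetical per-endpoint consistency policy derived from the contract table.
# The policy names are illustrative, not a real routing API.
CONSISTENCY_POLICY = {
    "POST /holds":             {"routing": "leader_routed", "idempotent": False, "max_staleness_s": 0},
    "POST /bookings/confirm":  {"routing": "leader_routed", "idempotent": True,  "max_staleness_s": 0},
    "GET /bookings/{id}":      {"routing": "commit_token",  "idempotent": True,  "max_staleness_s": 0},
    "GET /search/cabins":      {"routing": "projection",    "idempotent": True,  "max_staleness_s": 2},
    "GET /manifest/{sailing}": {"routing": "projection",    "idempotent": True,  "max_staleness_s": 5},
}

def route_for(endpoint: str) -> dict:
    # Look up the declared guarantee so routing code cannot silently drift from the contract.
    return CONSISTENCY_POLICY[endpoint]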

Part 2: Keep the decisive write path narrow and explicit

Harbor Point chooses inventory_id = sailing_id + cabin_id as the authoritative key. That key maps to a single shard owner at any moment, and that owner is a small replica group rather than a globally coordinated cluster:

gateway
  -> partition map service (generation 118)
  -> inventory shard 173
       leader: madrid-a
       sync follower: madrid-b
       async follower: newyork-a
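
A sketch of what that ownership lookup could look like, assuming a hash-partitioned map keyed by (sailing_id, cabin_id) and a generation number attached to every routed request; the class and field names are illustrative, not part of Harbor Point's stack:

import hashlib
from dataclasses import dataclass

@dataclass
class ReplicaGroup:
    shard_id: int
    leader: str              # e.g. "madrid-a"
    sync_follower: str       # e.g. "madrid-b"
    async_followers: list    # e.g. ["newyork-a"]

class PartitionMap:
    """Versioned ownership map: the generation changes only on a planned cutover."""

    def __init__(self, generation: int, shards: dict, num_shards: int):
        self.generation = generation   # e.g. 118
        self.shards = shards           # shard_id -> ReplicaGroup
        self.num_shards = num_shards

    def owner(self, sailing_id: str, cabin_id: str):
        # The authoritative key: one inventory unit, one owning shard at any moment.
        inventory_id = f"{sailing_id}:{cabin_id}"
        shard_id = int(hashlib.sha256(inventory_id.encode()).hexdigest(), 16) % self.num_shards
        # Callers carry the generation on the write so a stale map can be detected downstream.
        return self.generation, self.shards[shard_id]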

The write path for BK-88421 looks like this:

  1. The self-service site asks search for available cabins and sees C14 in a projection that may be up to 2s old.
  2. POST /holds routes to shard 173, which verifies that C14 is still free, writes hold H-9917, replicates to the synchronous follower, and returns hold version v44.
  3. The payment service obtains an external authorization and returns auth_7QK2.
  4. POST /bookings/confirm sends hold_id=H-9917, hold_version=v44, payment_auth=auth_7QK2, and idempotency_key=confirm-BK-88421.
  5. The shard leader checks that the hold is still active, that the payment authorization has not already been consumed, and that the idempotency key has not been committed before. It then writes the booking, marks the hold consumed, stores the idempotency record, and emits an outbox event in the same durable commit.
  6. The response includes commit token inventory:173@982441 so immediate follow-up reads can avoid stale replicas; a sketch of that check follows this list.
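
A replica can honor that token by refusing to answer until it has applied at least the observed log position. A minimal sketch, assuming an illustrative replica object with applied_index, forward_to_leader, and get helpers:

def read_booking(replica, booking_id, commit_token=None):
    # commit_token looks like "inventory:173@982441": shard 173, replicated-log position 982441.
    # replica.applied_index, replica.forward_to_leader, and replica.get are illustrative helpers.
    if commit_token is not None:
        _, position = commit_token.rsplit("@", 1)
        if replica.applied_index < int(position):
            # This follower has not yet applied the caller's own write; go to the leader instead.
            return replica.forward_to_leader(booking_id)
    return replica.get("booking", booking_id)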

A simplified storage routine makes the boundary concrete:

def confirm_booking(cmd):
    # Route to the single shard that owns (sailing_id, cabin_id); the leader evaluates the
    # check and applies the writes atomically, inside one replicated commit.
    return inventory_shard(cmd.inventory_id).commit(
        # A repeated key returns the stored outcome instead of booking twice.
        idempotency_key=cmd.idempotency_key,
        # Validate the hold, its version, and the payment authorization against current shard state.
        check=lambda state: (
            state.cabin_status == "held"
            and state.hold_id == cmd.hold_id
            and state.hold_version == cmd.hold_version
            and state.payment_auth == cmd.payment_auth
        ),
        # Booking, hold consumption, idempotency record, and outbox event commit together.
        writes=[
            ("inventory", cmd.inventory_id, {"cabin_status": "booked"}),
            ("booking", cmd.booking_id, {"status": "confirmed"}),
            ("idempotency", cmd.idempotency_key, {"booking_id": cmd.booking_id}),
            ("outbox", next_event(), {"type": "booking-confirmed", "booking_id": cmd.booking_id}),
        ],
    )

The mechanism matters more than the syntax. Harbor Point is deliberately avoiding a global transaction with the payment provider or the search index. Payment authorization happens before confirmation, but the irreversible local decision is made once, on the inventory owner, with an idempotency record that survives retries and failover. Search is advisory. Booking confirmation is authoritative.
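
To make "survives retries and failover" concrete, the sketch below shows one way the shard-side commit routine called above could behave; the storage and replicated_log objects are stand-ins, not a real API:

class InventoryShardLeader:
    """Illustrative shard leader: decide once, answer duplicate confirms from the stored record."""

    def __init__(self, shard_id, storage, replicated_log):
        self.shard_id = shard_id
        self.storage = storage              # rows for inventory, booking, idempotency
        self.replicated_log = replicated_log

    def commit(self, idempotency_key, check, writes):
        existing = self.storage.get("idempotency", idempotency_key)
        if existing is not None:
            # A retry after a timeout or failover: return the original decision, never book twice.
            return {"duplicate": True, "booking_id": existing["booking_id"]}

        if not check(self.storage.current_state()):
            return {"rejected": True, "reason": "hold expired, version mismatch, or auth already consumed"}

        # Hold consumption, booking row, idempotency record, and outbox event land in one replicated commit.
        position = self.replicated_log.append(writes)
        return {"committed": True, "commit_token": f"inventory:{self.shard_id}@{position}"}

The decisive property is the duplicate branch: it only reads the stored record, so the business action never runs twice no matter how many retries arrive.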

The trade-off is visible here. Harbor Point accepts extra latency and operational machinery on a tiny fraction of traffic so it can keep the critical invariant local and explainable. It refuses to pay the same coordination cost on every search query, customer history page, or boarding dashboard refresh.

Part 3: Push everything else through committed logs, projections, and control-plane rules

Once BK-88421 commits, Harbor Point stops pretending that all consumers belong on the write path. The shard's outbox is captured into the event backbone and turned into specialized read models:

inventory shard commit
  -> outbox / WAL
  -> booking-events topic
       -> search availability projector
       -> customer trips projector
       -> boarding manifest projector
       -> finance reconciliation stream

This is where 064.md becomes operationally central. The search index can lag by 2s because any final sale still revalidates against the authoritative shard. The boarding manifest can lag by 5s because port staff can fall back to a leader-backed check for edge cases during active boarding. The finance stream can be replayed because it is derived from committed events, not from side effects observed out of band. Harbor Point gains scale and failure isolation by letting each read model speak honestly about freshness instead of forcing every consumer into the critical write quorum.
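
A boarding-manifest projector under that 5s budget might look like the sketch below; the consumer interface, event fields, and alert hook are assumptions for illustration, not a prescribed API:

import time

MANIFEST_LAG_BUDGET_S = 5  # from the contract: the manifest may lag committed truth by up to 5s

def run_manifest_projector(consumer, manifest_store, alert):
    # consumer yields ordered, replayable events from the booking-events topic (illustrative interface).
    for event in consumer:
        if event["type"] == "booking-confirmed":
            manifest_store.upsert(event["sailing_id"], event["booking_id"], event["cabin_id"])
        # Measure how far behind committed truth this projection currently is.
        lag = time.time() - event["committed_at"]
        if lag > MANIFEST_LAG_BUDGET_S:
            # Signal operations: port staff fall back to the leader-backed check until the projection catches up.
            alert(f"manifest projection lag {lag:.1f}s exceeds {MANIFEST_LAG_BUDGET_S}s budget")
        consumer.commit_offset(event["offset"])

Because the projector consumes only committed events, rebuilding it after a bug is a replay from an earlier offset, not a forensic reconstruction of side effects.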

The control plane also needs explicit rules. The partition map is versioned, so every routed write carries a generation number. Rebalancing cabin C14's shard to a new replica group requires copy, catch-up, cutover, and generation change, not an ad hoc router update. Failover promotes only the durable replicated log prefix, and ambiguous client retries are resolved through the stored idempotency key rather than by replaying the business action blindly. If the remote region was behind by three seconds at failure time, Harbor Point admits that recovery point explicitly instead of claiming imaginary zero-loss failover.
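
The fencing rule behind "generation change, not an ad hoc router update" can be as small as the check below; the field names are illustrative:

def accept_write(shard, request):
    # Reject writes routed with an out-of-date partition map generation (fencing).
    if request.map_generation < shard.current_generation:
        # The client routed with an old map, e.g. generation 118 after a cutover to 119:
        # tell it to refresh and retry rather than committing on the wrong owner.
        return {"rejected": True, "retry_with_generation": shard.current_generation}
    if not shard.is_leader:
        # A demoted leader must not accept decisive writes after failover.
        return {"rejected": True, "leader_hint": shard.known_leader}
    return shard.apply(request)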

That produces a complete coordination design rather than a pile of mechanisms. One narrow path decides inventory. A durable log exports committed truth. Projections materialize the views humans and neighboring systems need. The control plane governs who owns each shard and when that ownership is allowed to move. The whole design stays coherent because every part answers the same recurring scenario: what happens when two actors race for the last cabin and the system must still explain its answer after retries, lag, and failover?

Failure Modes and Misconceptions

Connections

Connection 1: 064.md shows how Harbor Point serves search, manifest, and finance without dragging those consumers into the booking quorum

This capstone uses projections as a deliberate escape hatch from over-coordination. The log is not an implementation detail; it is the boundary that lets strong writes and weaker reads coexist honestly.

Connection 2: 060.md and 061.md explain why the booking path stores idempotency outcomes instead of trusting transport slogans

Retries, duplicates, and partial failures are normal on the authoritative path. The capstone turns those lessons into a concrete write contract that survives timeouts and failover.

Connection 3: 048.md and 058.md provide the vocabulary for the guarantees Harbor Point assigns to each endpoint

The capstone is the final assembly step for the track: endpoint semantics, shard ownership, replicated commits, retry safety, and derived views all have to agree on what "correct" means for one business action.

Key Takeaways

  1. A consistency design is not one global mode; it is a mapping from user-visible invariants to specific coordination points, lag budgets, and recovery rules.
  2. Harbor Point keeps the irreversible cabin sale on one authoritative shard with idempotent confirmation, then exports committed truth through logs and projections for everything else.
  3. Bounded-stale reads are safe only when the decisive write path revalidates against the source of truth and when each projection has explicit replay and lag semantics.
  4. Failover, rebalancing, and retries are part of the consistency contract itself, because a design is only real if it can explain ambiguous outcomes after something breaks.