LESSON
Day 496: Module Capstone: Consistency and Coordination Design
The core idea: A production consistency design works when it narrows expensive coordination to the business decisions that must be final, then serves everything else from replayable logs and bounded-stale projections with explicit lag budgets.
Today's "Aha!" Moment
Suppose Harbor Point opens the last wheelchair-accessible cabin on sailing HP-2026-07-14 to three channels at once: a Madrid agent desk, a New York call center, and the passenger self-service site. At the same time, embarkation staff need the boarding manifest to reflect cleared passport holds within seconds, and finance needs payment captures reconciled without double charging after retries.
The capstone question is not "should Harbor Point use strong consistency or eventual consistency?" That is too coarse to help. The real question is where the system is willing to pay for coordination. If every read and integration waits on a global decision path, latency and availability collapse under traffic. If none of the paths coordinate, the last accessible cabin can be sold twice and no amount of projection cleanup will repair the original promise made to the customer.
The design becomes defendable once Harbor Point names its invariants first and then assigns a mechanism to each one. The cabin sale is a single authoritative shard decision. Immediate user confirmation is protected by idempotency keys and leader-routed writes. Search and boarding dashboards are derived from the committed log and allowed to be slightly stale within declared budgets. The trade-off is intentional: the system pays coordination cost only where the product cannot tolerate ambiguity.
Why This Matters
Production systems do not fail because engineers forget the vocabulary of linearizability or eventual consistency. They fail because a product promise crosses paths with an architecture that never said which state is authoritative, which retries are safe, or which read paths are allowed to lag. For Harbor Point, overselling an accessible cabin is not a harmless stale-read bug. It triggers manual relocations, support escalations, and potential regulatory trouble once the passenger arrives at the port.
Before the design is explicit, teams compensate with local patches. Search is refreshed more aggressively. Support is given a "fix booking" admin tool. Payment retries are tuned. None of those patches answer the core coordination question. After the design is explicit, Harbor Point can defend each path: POST /holds and POST /bookings/confirm route to the authoritative inventory owner, the payment gateway is integrated through idempotent confirmation rather than fantasy distributed transactions, and every downstream view can explain exactly how fresh it is and how it recovers from replay.
The production relevance is that incidents become diagnosable by contract instead of by folklore. When a user says "I got a timeout but my card was charged," the team knows which idempotency record and commit index to inspect. When the manifest lags, the team knows whether to delay boarding or fail over to the leader-backed read path. That is what a capstone consistency design should buy: a small set of hard guarantees, a larger set of honest weaker guarantees, and clear operating rules between them.
Core Walkthrough
Part 1: Start with the product promises Harbor Point cannot break
Keep one flow in view: customer BK-88421 is trying to confirm the last wheelchair-accessible cabin C14 on sailing HP-2026-07-14 while another agent is attempting the same sale from another region.
Harbor Point writes down the contract before it chooses infrastructure:
| User-facing action | Required guarantee | Chosen mechanism | What Harbor Point is willing to pay |
|---|---|---|---|
| `POST /holds` for cabin C14 | One current hold owner at a time for the inventory unit | Route to the shard that owns `(sailing_id, cabin_id)` and commit through its replicated write log | Cross-node coordination inside one replica group |
| `POST /bookings/confirm` | Success means the cabin is definitively booked once | Same shard transaction validates hold, payment authorization, and idempotency key before commit | Leader-routed write, retry bookkeeping, and synchronous local replica acknowledgment |
| `GET /bookings/BK-88421` right after confirm | Read-your-write for the caller | Return an observed commit token and route immediate reads to a caught-up replica or the leader | Occasional leader reads when followers lag |
| `GET /search/cabins?route=BCN-JFK` | Bounded staleness, up to 2 s | Serve from a projection fed by committed events | Slightly stale browse results, followed by decisive revalidation at hold time |
| `GET /manifest/HP-2026-07-14` | Near-real-time operational correctness, up to 5 s lag | Consume the event log into a boarding projection with lag alarms | Projection infrastructure and operational lag monitoring |
This table is the real design artifact. It prevents Harbor Point from pretending that every endpoint deserves the same consistency cost. It also prevents the opposite mistake: hiding critical user-visible decisions behind "eventual" language when the business action is irreversible.
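One way to keep this contract executable rather than aspirational is to encode it as data the gateway can enforce. A minimal sketch, assuming a hypothetical `CONSISTENCY_CONTRACT` lookup (the routing labels and field names are illustrative, not part of the Harbor Point design above):

```python
# Hypothetical: the endpoint contract table encoded as routing policy, so a
# missing guarantee fails loudly instead of defaulting to "whatever the
# nearest replica returns".
CONSISTENCY_CONTRACT = {
    "POST /holds":              {"route": "shard-leader",    "max_staleness_s": 0},
    "POST /bookings/confirm":   {"route": "shard-leader",    "max_staleness_s": 0},
    "GET /bookings/{id}":       {"route": "read-your-write", "max_staleness_s": 0},
    "GET /search/cabins":       {"route": "projection",      "max_staleness_s": 2},
    "GET /manifest/{sailing}":  {"route": "projection",      "max_staleness_s": 5},
}

def policy_for(endpoint: str) -> dict:
    """Look up the declared guarantee; an undeclared endpoint is a design gap."""
    try:
        return CONSISTENCY_CONTRACT[endpoint]
    except KeyError:
        raise ValueError(f"no declared consistency policy for {endpoint}")
```

The point of the encoding is the `ValueError`: an endpoint nobody classified is treated as a bug, not as an implicit strong-consistency promise.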
Part 2: Keep the decisive write path narrow and explicit
Harbor Point chooses inventory_id = sailing_id + cabin_id as the authoritative key. That key maps to a single shard owner at any moment, and that owner is a small replica group rather than a globally coordinated cluster:
```
gateway
  -> partition map service (generation 118)
    -> inventory shard 173
         leader:         madrid-a
         sync follower:  madrid-b
         async follower: newyork-a
```
The write path for BK-88421 looks like this:
1. The self-service site asks search for available cabins and sees `C14` in a projection that may be up to 2 s old. `POST /holds` routes to shard 173, which verifies that `C14` is still free, writes hold `H-9917`, replicates to the synchronous follower, and returns hold version `v44`.
2. The payment service obtains an external authorization and returns `auth_7QK2`. `POST /bookings/confirm` sends `hold_id=H-9917`, `hold_version=v44`, `payment_auth=auth_7QK2`, and `idempotency_key=confirm-BK-88421`.
3. The shard leader checks that the hold is still active, that the payment authorization has not already been consumed, and that the idempotency key has not been committed before. It then writes the booking, marks the hold consumed, stores the idempotency record, and emits an outbox event in the same durable commit.
4. The response includes commit token `inventory:173@982441` so immediate follow-up reads can avoid stale replicas.
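The commit token in the last step is what makes read-your-write cheap. A minimal sketch of the routing decision, assuming an illustrative `Replica` type with an `applied_index` field (not a real Harbor Point API):

```python
# Sketch: honoring a commit token like "inventory:173@982441" on a follow-up
# read. A follower is eligible only if it has applied at least the caller's
# commit index; otherwise fall back to the leader.
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    applied_index: int  # highest committed log index this replica has applied

def route_read(token: str, followers: list, leader: Replica) -> Replica:
    """Pick a replica that can serve the caller's own write."""
    _, index = token.rsplit("@", 1)
    needed = int(index)
    for replica in followers:
        if replica.applied_index >= needed:
            return replica  # caught-up follower: cheap read, still read-your-write
    return leader           # occasional leader read when followers lag

leader = Replica("madrid-a", 982_441)
followers = [Replica("madrid-b", 982_441), Replica("newyork-a", 982_300)]
assert route_read("inventory:173@982441", followers, leader).name == "madrid-b"
```

This is the "occasional leader reads when followers lag" cost from the contract table: the leader is only touched when no follower has caught up to the token.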
A simplified storage routine makes the boundary concrete:
```python
def confirm_booking(cmd):
    # All checks and writes below commit atomically on the shard that owns
    # cmd.inventory_id; the idempotency record and the outbox event are part
    # of the same durable commit, so a retry or failover cannot split them.
    return inventory_shard(cmd.inventory_id).commit(
        idempotency_key=cmd.idempotency_key,
        check=lambda state: (
            state.cabin_status == "held"
            and state.hold_id == cmd.hold_id
            and state.hold_version == cmd.hold_version
            and state.payment_auth == cmd.payment_auth
        ),
        writes=[
            ("inventory", cmd.inventory_id, {"cabin_status": "booked"}),
            ("booking", cmd.booking_id, {"status": "confirmed"}),
            ("idempotency", cmd.idempotency_key, {"booking_id": cmd.booking_id}),
            ("outbox", next_event(), {"type": "booking-confirmed", "booking_id": cmd.booking_id}),
        ],
    )
```
The mechanism matters more than the syntax. Harbor Point is deliberately avoiding a global transaction with the payment provider or the search index. Payment authorization happens before confirmation, but the irreversible local decision is made once, on the inventory owner, with an idempotency record that survives retries and failover. Search is advisory. Booking confirmation is authoritative.
The trade-off is visible here. Harbor Point accepts extra latency and operational machinery on a tiny fraction of traffic so it can keep the critical invariant local and explainable. It refuses to pay the same coordination cost on every search query, customer history page, or boarding dashboard refresh.
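The retry behavior the idempotency record buys can be shown in miniature. A sketch with in-memory dicts standing in for the shard's durable state (the names `bookings` and `idempotency` are illustrative):

```python
# Sketch: a retried confirm resolves against the stored idempotency record
# instead of re-running the irreversible business action.
bookings = {}
idempotency = {}

def confirm(key: str, booking_id: str) -> dict:
    if key in idempotency:               # retry after a timeout or failover:
        return idempotency[key]          # return the stored outcome, do nothing
    bookings[booking_id] = "confirmed"   # the irreversible decision...
    outcome = {"booking_id": booking_id, "status": "confirmed"}
    idempotency[key] = outcome           # ...committed together with its record
    return outcome

first = confirm("confirm-BK-88421", "BK-88421")
retry = confirm("confirm-BK-88421", "BK-88421")  # client saw a timeout, retried
assert first == retry and len(bookings) == 1     # exactly one booking exists
```

The essential property is that the check and the record live in the same durable commit as the booking itself; a cache-style deduplicator that can be lost independently does not give this guarantee.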
Part 3: Push everything else through committed logs, projections, and control-plane rules
Once BK-88421 commits, Harbor Point stops pretending that all consumers belong on the write path. The shard's outbox is captured into the event backbone and turned into specialized read models:
```
inventory shard commit
  -> outbox / WAL
    -> booking-events topic
         -> search availability projector
         -> customer trips projector
         -> boarding manifest projector
         -> finance reconciliation stream
```
This is where 064.md becomes operationally central. The search index can lag by 2s because any final sale still revalidates against the authoritative shard. The boarding manifest can lag by 5s because port staff can fall back to a leader-backed check for edge cases during active boarding. The finance stream can be replayed because it is derived from committed events, not from side effects observed out of band. Harbor Point gains scale and failure isolation by letting each read model speak honestly about freshness instead of forcing every consumer into the critical write quorum.
The control plane also needs explicit rules. The partition map is versioned, so every routed write carries a generation number. Rebalancing cabin C14's shard to a new replica group requires copy, catch-up, cutover, and generation change, not an ad hoc router update. Failover promotes only the durable replicated log prefix, and ambiguous client retries are resolved through the stored idempotency key rather than by replaying the business action blindly. If the remote region was behind by three seconds at failure time, Harbor Point admits that recovery point explicitly instead of claiming imaginary zero-loss failover.
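Generation fencing is simple to state in code. A sketch, assuming hypothetical `ShardOwner` and `StaleGeneration` names (the generation number 118 comes from the topology diagram above):

```python
# Sketch: a write routed under a stale partition map generation is rejected
# instead of silently landing on a deposed owner during a rebalance.
class StaleGeneration(Exception):
    pass

class ShardOwner:
    def __init__(self, generation: int):
        self.generation = generation  # bumped at cutover by the control plane

    def commit(self, write_generation: int, payload: dict) -> str:
        if write_generation < self.generation:
            # The router used an old partition map; it must re-fetch and retry.
            raise StaleGeneration("write routed under a superseded partition map")
        return f"committed@{self.generation}"

owner = ShardOwner(generation=118)
assert owner.commit(118, {"cabin": "C14"}) == "committed@118"
try:
    owner.commit(117, {"cabin": "C14"})   # router missed the rebalance cutover
except StaleGeneration:
    pass                                  # safe: the write never lands twice
```

The fence is what turns "copy, catch-up, cutover, generation change" from a procedure into a guarantee: after cutover, writes carrying the old generation cannot succeed anywhere.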
That produces a complete coordination design rather than a pile of mechanisms. One narrow path decides inventory. A durable log exports committed truth. Projections materialize the views humans and neighboring systems need. The control plane governs who owns each shard and when that ownership is allowed to move. The whole design stays coherent because every part answers the same recurring scenario: what happens when two actors race for the last cabin and the system must still explain its answer after retries, lag, and failover?
Failure Modes and Misconceptions
- "We should make search strongly consistent with booking so users never see stale availability." That is tempting because stale search results are visible, but it spends coordination on browse traffic instead of on the decisive sale. Harbor Point accepts bounded-stale search and revalidates at hold creation, which keeps the invariant safe without putting the search index in the booking quorum.
- "A confirmation timeout means the booking probably failed, so retry the whole action." Timeouts create uncertainty, not a clean rollback signal. Without an idempotency record, Harbor Point risks double booking or double charging. The operational fix is to persist the idempotency key with the authoritative commit and resolve retries against the stored outcome.
- "If the manifest is wrong, operators can patch the projection directly." Direct projection edits create drift from the committed log. The durable fix is to emit a corrective event, or to repair and replay the projector so the derived state converges from the source of truth.
- "Exactly-once delivery from the broker means downstream correctness is solved." Delivery semantics do not replace application invariants. Harbor Point still needs deduplication, checkpoint discipline, and reducers that tolerate replay, because correctness lives at the sink and the business rule, not at the transport label.
- "Cross-region failover is just leader election on a replica." Promotion is only safe for the log prefix that is durably replicated and fenced by the control plane. If Harbor Point ignores lag, shard ownership generations, or ambiguous retries, failover turns a regional outage into a consistency incident.
Connections
Connection 1: 064.md shows how Harbor Point serves search, manifest, and finance without dragging those consumers into the booking quorum
This capstone uses projections as a deliberate escape hatch from over-coordination. The log is not an implementation detail; it is the boundary that lets strong writes and weaker reads coexist honestly.
Connection 2: 060.md and 061.md explain why the booking path stores idempotency outcomes instead of trusting transport slogans
Retries, duplicates, and partial failures are normal on the authoritative path. The capstone turns those lessons into a concrete write contract that survives timeouts and failover.
Connection 3: 048.md and 058.md provide the vocabulary for the guarantees Harbor Point assigns to each endpoint
The capstone is the final assembly step for the track: endpoint semantics, shard ownership, replicated commits, retry safety, and derived views all have to agree on what "correct" means for one business action.
Resources
- [BOOK] Designing Data-Intensive Applications
  - Focus: Revisit the chapters on transactions, replication, logs, and derived data as one connected design space rather than isolated topics.
- [PAPER] Spanner: Google's Globally-Distributed Database
  - Focus: Compare Harbor Point's narrow local coordination choice with a system that pays for stronger cross-region ordering directly.
- [PAPER] In Search of an Understandable Consensus Algorithm (Raft)
  - Focus: Pay attention to leader ownership, log replication, and why failover safety depends on a well-defined committed prefix.
- [DOC] Apache Kafka Streams Core Concepts
  - Focus: Study stream-table duality and replay-backed state reconstruction, which is exactly how the capstone keeps projections out of the booking quorum.
- [DOC] PostgreSQL Logical Decoding Concepts
  - Focus: Use it to ground the outbox or WAL-capture portion of the design in a concrete production implementation.
Key Takeaways
- A consistency design is not one global mode; it is a mapping from user-visible invariants to specific coordination points, lag budgets, and recovery rules.
- Harbor Point keeps the irreversible cabin sale on one authoritative shard with idempotent confirmation, then exports committed truth through logs and projections for everything else.
- Bounded-stale reads are safe only when the decisive write path revalidates against the source of truth and when each projection has explicit replay and lag semantics.
- Failover, rebalancing, and retries are part of the consistency contract itself, because a design is only real if it can explain ambiguous outcomes after something breaks.