Consistency Spectrum and API Semantics

LESSON

Consistency and Replication

048 30 min advanced

Day 477: Consistency Spectrum and API Semantics

The core idea: A consistency model matters only when an API states what a caller may observe after a write, because the real trade-off is coordination cost versus which anomalies the product is willing to expose.

Today's "Aha!" Moment

In 047.md, Harbor Point taught its reservation store how to resolve concurrent branches of hold H-8821. That still left support with a harder question. After a Lisbon agent extends the hold, which screen is allowed to show the old expiry time? Can the search grid lag for a second? Can the itinerary page lag after payment succeeds? Can the final booking confirmation ever lag? Those are not storage-engine questions by themselves. They are API questions.

At 14:05:02, POST /holds/H-8821/extend succeeds and returns version 918447. A follow-up GET /search/cabins?route=BCN-JFK can tolerate a slightly stale answer because search is advisory and the booking flow revalidates inventory before commit. A follow-up GET /holds/H-8821 for the same agent cannot safely return version 918442 without making the UI look broken. A follow-up POST /bookings/confirm is stricter still: it must not tell two regions that cabin C14 is theirs just because replicas have not converged yet.

The non-obvious insight is that "eventually consistent" is not a user-facing contract. It says replicas will converge at some point, but it does not tell a caller whether they will see their own write, whether reads can move backward, or whether a confirmation response represents a globally decisive check. If the API does not define those semantics explicitly, every client quietly assumes the strongest promise and discovers the weaker reality in production.

Why This Matters

By the time a system reaches Harbor Point's scale, one database label is not enough. The same replicated data set often serves at least four different jobs: browse availability quickly, show an agent the hold they just changed, sequence related side effects such as booking plus itinerary publication, and enforce a no-double-booking invariant. Treating all of those as "the consistency level of the database" forces one of two bad outcomes. Either every path pays the cost of the strongest guarantee, or some path silently inherits a weaker guarantee than the product can actually survive.

Making API semantics explicit fixes that. Search can declare bounded staleness. Session-bound views can promise read-your-write and monotonic reads. Cross-service workflows can preserve causality so that if the customer sees "payment succeeded," the trip page cannot omit the booking event that caused it. Final confirmation can use linearizable or transactional coordination and admit that it may be slower or temporarily unavailable during quorum loss. The trade-off becomes visible: stronger guarantees spend coordination budget, while weaker guarantees spend anomaly budget and require compensation logic.

This matters operationally because incidents rarely say "the system violated causal consistency." They show up as "I just extended the hold and the screen moved backward," "the email confirmation arrived before the itinerary page updated," or "two agents both thought cabin C14 was still free." Good API semantics turn those complaints into designed behaviors or clear bugs instead of surprises.

Core Walkthrough

Part 1: Start with the user promise, not the storage slogan

Harbor Point writes down the contract for each API surface instead of assigning one adjective to the whole platform:

| API surface | Consistency contract | What the caller is allowed to assume |
| --- | --- | --- |
| GET /search/cabins | Bounded staleness up to 2s | Results may lag slightly, but the booking flow will revalidate before committing |
| GET /holds/H-8821 after the same agent wrote to it | Read-your-write plus monotonic reads | Once the agent sees version 918447, later reads in that session cannot go backward |
| GET /customer-trips immediately after payment success | Causal consistency across booking and trip projection | If the caller has observed the payment-confirmed event, the trip view must include effects caused by that event |
| POST /bookings/confirm | Linearizable check-and-commit on inventory ownership | The success response means Harbor Point has definitively assigned the cabin, not merely queued an eventual reconciliation |

This is the spectrum in practice. Eventual consistency is the weakest meaningful end of it: the system converges eventually, but the caller gets no bound and no session guarantee. Bounded staleness adds a limit on how old the answer may be. Session guarantees such as read-your-write and monotonic reads narrow the anomalies one caller can observe. Causal consistency preserves order for related actions across services. Linearizability makes a single operation look as if it happened at one globally agreed point in time. If Harbor Point ever needs a multi-row invariant such as "confirm cabin and decrement upgrade inventory atomically," it may need a transaction boundary stronger than a linearizable single-key read.

The important point is that these are not academic labels to paste into docs. Each one answers a different product question. "Can the agent trust the hold details page right after editing?" is a session-guarantee question. "Can payment success race ahead of the itinerary projection?" is a causal-consistency question. "Can two regions both commit the same cabin?" is a linearizability or transaction-boundary question.
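
One way to keep those answers from living only in prose is to declare each endpoint's contract in one place that routing code can consult. The sketch below is illustrative only, assuming a gateway-level registry; ConsistencyLevel, EndpointContract, and ENDPOINT_CONTRACTS are hypothetical names, not Harbor Point's actual code.

# Illustrative per-endpoint contract registry mirroring the table above.
# All names here are assumptions for the sake of the sketch.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ConsistencyLevel(Enum):
    BOUNDED_STALENESS = "bounded_staleness"
    SESSION = "read_your_write_plus_monotonic_reads"
    CAUSAL = "causal"
    LINEARIZABLE = "linearizable"


@dataclass(frozen=True)
class EndpointContract:
    level: ConsistencyLevel
    max_staleness_seconds: Optional[float] = None  # only meaningful for bounded staleness


ENDPOINT_CONTRACTS = {
    ("GET", "/search/cabins"): EndpointContract(ConsistencyLevel.BOUNDED_STALENESS, 2.0),
    ("GET", "/holds/{hold_id}"): EndpointContract(ConsistencyLevel.SESSION),
    ("GET", "/customer-trips"): EndpointContract(ConsistencyLevel.CAUSAL),
    ("POST", "/bookings/confirm"): EndpointContract(ConsistencyLevel.LINEARIZABLE),
}

The point of a single registry is that the read router, retry logic, and client SDK all make decisions from the same declared contract instead of each guessing independently.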

Part 2: The API needs mechanism, not just terminology

Once Harbor Point defines the contracts, each one needs an implementation path.

For the hold-details endpoint, the server returns a session token with the successful write:

{
  "hold_id": "H-8821",
  "version": 918447,
  "session_observed": {
    "holds": 918447
  }
}

Later reads carry that observed version implicitly in the agent session or explicitly in a header. The read router chooses a replica only if two conditions are true:

  1. the replica is fresh enough for the endpoint's staleness budget, and
  2. the replica has applied at least version 918447.

If no nearby replica qualifies, the server must route to a fresher replica or leader. Silently serving an older version would violate the API's advertised semantics even if the database itself is healthy.
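
A minimal sketch of that routing rule, assuming each replica reports the highest version it has applied and an estimated replication lag; Replica and choose_replica are hypothetical names for whatever the read router actually uses.

# Illustrative read-routing rule: a replica is eligible only if it is inside the
# endpoint's staleness budget AND has applied the version the session already observed.
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Replica:
    name: str
    applied_version: int          # highest write version this replica has applied
    estimated_lag_seconds: float  # how far behind the leader it is believed to be
    is_leader: bool = False


def choose_replica(replicas: Sequence[Replica],
                   session_observed_version: int,
                   max_staleness_seconds: Optional[float]) -> Replica:
    # Prefer the freshest nearby candidate that satisfies both conditions.
    for replica in sorted(replicas, key=lambda r: r.estimated_lag_seconds):
        fresh_enough = (max_staleness_seconds is None
                        or replica.estimated_lag_seconds <= max_staleness_seconds)
        has_session_write = replica.applied_version >= session_observed_version
        if fresh_enough and has_session_write:
            return replica
    # No nearby replica qualifies: fall back to the leader rather than silently
    # serving a version older than the one the session has already seen.
    return next(r for r in replicas if r.is_leader)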

For the trip view, Harbor Point does something slightly different. The itinerary page is built from a projection service fed by booking events. After POST /bookings/confirm succeeds, the response includes a dependency token representing the confirmed booking event. When the customer immediately requests GET /customer-trips, the gateway waits until the projection has applied that dependency or routes the request to a view that has. That is causal consistency operationalized: effects must not appear before their causes, and a caller who has already observed the cause must not be sent to a read model that predates it.
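
A rough sketch of that gateway behavior, assuming the dependency token carries the offset of the confirmed booking event and the projection exposes the highest offset it has applied; wait_for_projection and applied_offset are hypothetical names.

# Illustrative "wait for the cause before showing the effect" step at the gateway.
import time


class ProjectionNotCaughtUp(Exception):
    pass


def wait_for_projection(projection, dependency_offset: int,
                        timeout_seconds: float = 1.0,
                        poll_interval_seconds: float = 0.05) -> None:
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if projection.applied_offset() >= dependency_offset:
            return  # the cause has been applied; the trip view may now be served
        time.sleep(poll_interval_seconds)
    # Causality cannot be satisfied here: reroute to a read model that has the event,
    # rather than returning a trip view that omits a booking the caller already saw.
    raise ProjectionNotCaughtUp()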

Final confirmation uses the strongest path. Harbor Point performs a conditionally guarded write against the authoritative inventory shard:

def confirm_booking(cabin_id, hold_id, payment_id):
    # Check-and-commit against the authoritative inventory shard: the predicate and the
    # writes succeed or fail together, so a success response is decisive, not eventual.
    return linearizable_transaction(
        # Read the current owner of the cabin, not a possibly stale replica copy.
        read_key=("inventory", cabin_id),
        # Commit only if this hold still owns the cabin and it has not already been booked.
        assert_predicate=lambda row: row.hold_id == hold_id and row.status == "held",
        writes=[
            ("inventory", cabin_id, {"status": "booked", "booking_payment": payment_id}),
            ("bookings", hold_id, {"status": "confirmed"}),
        ],
    )

That flow is slower and less available during quorum loss than a nearby stale read, but it buys the semantic guarantee Harbor Point needs: a success response means the cabin is no longer merely "likely booked once replicas catch up." It is booked.
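
The lesson treats linearizable_transaction as a given primitive. As a rough sketch of the shape it implies, the toy below applies the predicate check and all writes atomically under one lock, standing in for a write executed on the authoritative leader or through a consensus round; the in-memory store and every name in it are hypothetical, not a real linearizability protocol.

# Toy check-and-commit primitive matching the call signature used by confirm_booking.
import threading
from types import SimpleNamespace

# (table, key) -> row object; a stand-in for the authoritative inventory shard,
# seeded with the lesson's example: cabin C14 currently held by H-8821.
_store = {("inventory", "C14"): SimpleNamespace(hold_id="H-8821", status="held")}
_lock = threading.Lock()


class TransactionAborted(Exception):
    pass


def linearizable_transaction(read_key, assert_predicate, writes):
    with _lock:                                   # serialize against the authoritative copy
        row = _store.get(read_key)
        if row is None or not assert_predicate(row):
            raise TransactionAborted(f"precondition failed for {read_key}")
        for table, key, values in writes:         # applied atomically with the check
            target = _store.setdefault((table, key), SimpleNamespace())
            for field, value in values.items():
                setattr(target, field, value)
    return {"status": "committed"}

With the store seeded this way, a first call to confirm_booking for cabin C14 commits, and a second attempt fails the predicate instead of double-booking, which is exactly the decisiveness the success response promises.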

Mechanism is why API semantics must be scoped endpoint by endpoint. The same replicated storage layer can support multiple contracts, but only if routing, tokens, retries, and write paths enforce the right one each time.

Part 3: Weakening or strengthening semantics must be explicit

The easiest way to break trust is to promise one consistency level and quietly deliver another under stress. Harbor Point therefore makes degradation rules part of the API design:

  1. If no nearby replica can satisfy a session token, the read routes to a fresher replica or the leader instead of silently serving an older version.
  2. If the trip projection has not yet applied an event the caller has already observed, the gateway waits or reroutes rather than returning a view that omits the cause.
  3. If quorum is lost, POST /bookings/confirm fails or returns an explicit pending state instead of quietly downgrading to an eventual write.
  4. If search replicas fall behind the 2-second staleness budget, the response says so rather than pretending the budget still holds.

This is where API semantics shape client code. Stronger contracts often need extra metadata such as ETag, observed-version tokens, or explicit pending states. Weaker contracts need compensating product behavior such as revalidation before commit, freshness indicators, or UI copy that frames a view as advisory. The trade-off is not only infrastructure cost. It is also how much uncertainty the product and client code must absorb.
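
As a client-side illustration of those points, the sketch below uses the requests library; the X-Observed-Version header, the 202 pending response, and the endpoint paths are assumptions about how such contracts might surface to callers, not Harbor Point's documented API.

# Illustrative client code: carry the observed-version token on session-bound reads,
# and treat a pending confirmation as pending rather than as success.
import requests


def get_hold(session: requests.Session, base_url: str, hold_id: str,
             observed_version: int) -> dict:
    # Session-consistent read: send the highest version this session has observed so the
    # server will not route the request to a replica that would make the view go backward.
    response = session.get(
        f"{base_url}/holds/{hold_id}",
        headers={"X-Observed-Version": str(observed_version)},
        timeout=5,
    )
    response.raise_for_status()
    return response.json()


def confirm_booking_request(session: requests.Session, base_url: str,
                            hold_id: str, payment_id: str) -> dict:
    response = session.post(
        f"{base_url}/bookings/confirm",
        json={"hold_id": hold_id, "payment_id": payment_id},
        timeout=10,
    )
    if response.status_code == 202:
        # Explicit pending state: coordination could not complete, so the client must poll
        # or show "still confirming" rather than treating the cabin as booked.
        return {"status": "pending", "retry_after": response.headers.get("Retry-After")}
    response.raise_for_status()
    return response.json()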

These choices are also the bridge to partitioning. In 049.md, Harbor Point will split data across shards. Once that happens, the cost of a strong guarantee depends heavily on whether the operation stays inside one shard or spans several. The API contracts defined here are the reason shard-key design matters in the next lesson.

Failure Modes and Misconceptions

  1. Treating "eventually consistent" as a user-facing contract. It promises convergence, not read-your-write, not monotonic reads, and not a decisive confirmation.
  2. Assigning one consistency adjective to the whole platform, so every path either overpays for coordination or silently inherits anomalies it cannot survive.
  3. Quietly downgrading the advertised level under stress, which turns designed behavior back into surprises like a hold screen moving backward or two agents claiming cabin C14.

Connections

Connection 1: 046.md gave Harbor Point a way to talk about bounded staleness

Lag budgets turned replica freshness into an explicit number. This lesson widens that idea into a full API contract: freshness bounds are only one point on the spectrum.

Connection 2: 047.md showed what happens when semantics are too weak for the write path

Conflict resolution exists because the system accepted concurrent branches. API semantics decide when that is an acceptable choice, when a client must retry, and when the operation must take a stronger path up front.

Connection 3: 049.md will make these contracts more expensive or cheaper depending on shard boundaries

Once data is partitioned, "strong enough" can no longer be discussed without asking whether the relevant read or write stays on one shard or fans out across many.

Key Takeaways

  1. A consistency model becomes useful only when the API says what a successful write lets the caller observe next.
  2. Different endpoints over the same replicated data can legitimately need different guarantees, from bounded staleness to causal consistency to linearizable confirmation.
  3. Tokens, routing rules, and explicit fallback behavior are the mechanisms that turn consistency vocabulary into an enforceable contract.
  4. The stronger the guarantee, the more coordination cost you pay, which is why the next step is designing shard boundaries that keep expensive guarantees local.