Guarantee Matrix Design Review
The core idea: A production scale-out plan starts by deciding which operations must stay under one owner, then chooses replication, shard keys, lag budgets, and rebalancing rules that preserve that contract under load and failure.
Core Insight
Harbor Point has enough individual mechanisms to build something dangerous if it connects them casually. It has issuer sharding, leader leases, global secondary indexes, async replicas, WAL archives, backpressure, membership changes, observability, and adversarial failure tests. Each mechanism can be correct in isolation while the service contract remains confused.
The design review starts with a concrete pressure: Madrid and New York desks both want fast reservation workflows, support wants global lookup, compliance wants search within seconds, and risk systems need issuer exposure to stay authoritative. The wrong move is to answer with a database slogan such as "strong consistency everywhere" or "active-active writes." Those slogans hide the real question: which operation is allowed to be stale, which operation must be final, and which component is the authority when the system is under stress?
A guarantee matrix is the artifact that makes those choices inspectable. It maps each endpoint or workflow to its authoritative state, required consistency, allowed lag, serving path, failure behavior, and validation evidence. The matrix is not paperwork. It is how Harbor Point prevents an optimized read path, a rebalancing plan, or a regional failover runbook from quietly changing the promise made to traders and compliance.
The trade-off becomes visible once the matrix exists. Harbor Point pays coordination cost on the narrow path that protects issuer exposure and confirmed reservations. It deliberately allows bounded staleness for dashboards and derived search. It accepts non-zero remote RPO in exchange for lower normal-path latency, but only while the system can measure and enforce that recovery budget.
The Matrix Comes Before The Diagram
Harbor Point begins by listing the operations that matter most and refusing to treat them as one generic "database read/write" surface.
| Workflow | Authority | Required guarantee | Allowed lag | Serving path |
|---|---|---|---|---|
| POST /reservations for issuer MUNI-77 | Issuer shard leader | Linearizable per issuer shard | None after success | Route to the shard leader that owns issuer_exposure and live reservations |
| GET /reservation-tokens/{token} after a timeout | Same issuer shard | Idempotent token status | None for the token record | Leader or follower that has replayed the token's commit index |
| GET /issuers/{issuer_id}/open-reservations | Base shard plus index | Bounded-stale with session fallback | Usually under 2s | Regional follower or global secondary index, then validate when needed |
| Compliance search across issuers | CDC-derived search view | Eventually consistent, freshness published | Usually under 5s | CDC-fed search/index service with lag watermark |
| Regional failover readiness | Remote replica and WAL archive | RPO no worse than 5s, RTO under 10m | Explicit recovery window | Promote only durable replayed prefix and reconcile tokens |
That table forces useful disagreements early. If product wants compliance search to be fully linearizable, the team can say what it would cost: synchronous fan-out or a very different indexing architecture. If operations wants to advertise five-second RPO, the team can point to remote replay lag and WAL archive continuity as release-blocking signals. If a dashboard reads from a follower, the UI and API contract must admit when the result is stale or route through a stronger path.
The matrix also separates correctness from convenience. Sharding by issuer makes the exposure invariant local. A global secondary index makes issuer and compliance lookup cheaper. CDC makes derived surfaces scalable. Those are different paths with different promises, and the matrix keeps them from being mistaken for one another.
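One way to keep the matrix inspectable is to hold each row as data and block review on blank fields. The sketch below is only an illustration: the `GuaranteeRow` type, its field names, and the `incomplete_fields` helper are assumptions made for this example, not part of Harbor Point's tooling.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GuaranteeRow:
    """One row of the guarantee matrix (hypothetical structure)."""
    workflow: str
    authority: str
    guarantee: str
    allowed_lag: str
    serving_path: str
    failure_behavior: str  # what the path does when it cannot prove the guarantee
    evidence: str          # metric, test, or runbook that backs the row


def incomplete_fields(row: GuaranteeRow) -> list[str]:
    """Return the names of blank fields so a review can block on them."""
    return [f.name for f in fields(row) if not getattr(row, f.name).strip()]


reservation_write = GuaranteeRow(
    workflow="POST /reservations",
    authority="issuer shard leader",
    guarantee="linearizable per issuer shard",
    allowed_lag="none after success",
    serving_path="route to the shard leader",
    failure_behavior="fail or retry with a token",
    evidence="",  # deliberately blank: no test has been linked yet
)

print(incomplete_fields(reservation_write))  # ['evidence']
```

A row with a blank `evidence` field is a promise nobody has tested yet, which is the gap the review is meant to surface.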
Turn Guarantees Into Architecture
Once Harbor Point has the matrix, the topology choices become much less arbitrary. The decisive write path is organized around issuer authority:
    reservation request
            |
            v
    gateway resolves issuer_id -> shard_id -> home region -> leader
            |
            v
    leader updates issuer_exposure + reservation row atomically
            |
            +--> same-region sync follower for normal durability
            |
            +--> remote async follower for disaster recovery
            |
            +--> CDC stream for search, compliance, and dashboards
The shard key is doing real work here. If Harbor Point shards by trader office or client account, or routes writes through a random active-active surface, the issuer exposure check can no longer be decided in one place. The system would have to reconcile financial limits after multiple desks may already have seen "confirmed." That is not a tolerable merge. The matrix marks the reservation write as linearizable per issuer shard, so the architecture keeps that state under one leader.
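The routing step can be sketched as a deterministic lookup from issuer to shard to owner. This is a minimal illustration under stated assumptions: the hash-based placement, the shard count, the node names, and the `ShardOwner` type are all hypothetical, and the source does not say how Harbor Point actually maps issuers to shards.

```python
import hashlib
from dataclasses import dataclass

NUM_SHARDS = 16  # hypothetical shard count


@dataclass(frozen=True)
class ShardOwner:
    shard_id: int
    home_region: str
    leader: str


def shard_for_issuer(issuer_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map an issuer to a shard with a stable hash (assumed placement rule)."""
    digest = hashlib.sha256(issuer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def resolve_leader(issuer_id: str, partition_map: dict[int, ShardOwner]) -> ShardOwner:
    """issuer_id -> shard_id -> home region -> leader, as in the write path."""
    return partition_map[shard_for_issuer(issuer_id)]


# Hypothetical partition map: even shards live in Madrid, odd ones in New York.
partition_map = {
    shard: ShardOwner(shard, "madrid" if shard % 2 == 0 else "new-york", f"node-{shard}")
    for shard in range(NUM_SHARDS)
}

# Every request for the same issuer resolves to the same single owner.
assert resolve_leader("MUNI-77", partition_map) == resolve_leader("MUNI-77", partition_map)
```

Because the key is the issuer, any desk in any region resolves the same owner for MUNI-77, so the exposure check is decided in one place.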
The same matrix allows weaker paths without shame. Compliance search does not need to block every reservation commit on a global query service. It can read from a CDC-fed index as long as the freshness watermark is visible and investigators know when to fall back to the authoritative shard. A trader's session-sensitive read can use a local follower only if that follower has replayed the caller's observed commit token. Otherwise the gateway routes to the leader. The system is not "eventually consistent" or "strongly consistent" as a whole. Each path has a named promise.
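The session-sensitive read rule reduces to one comparison. The sketch below assumes a hypothetical `Replica` record and a monotonically increasing commit index; the names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    is_leader: bool
    replayed_index: int  # highest commit index this replica has applied


def choose_read_replica(observed_commit_index: int,
                        local_follower: Replica,
                        leader: Replica) -> Replica:
    """Serve from the follower only if it has replayed what the caller saw."""
    if local_follower.replayed_index >= observed_commit_index:
        return local_follower
    return leader  # the follower cannot prove the required version


leader = Replica("ny-leader", True, replayed_index=42)
follower = Replica("madrid-follower", False, replayed_index=41)

# The caller observed commit 42, so the lagging follower is skipped.
print(choose_read_replica(42, follower, leader).name)  # ny-leader
```

The caller's observed commit token is what makes the follower path safe: without it, the gateway has no way to tell a fresh follower from a stale one.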
Failure handling also becomes part of the architecture rather than a separate operations appendix. If the remote replica falls more than five seconds behind, Harbor Point can no longer honestly claim a five-second RPO for that shard. If WAL archive continuity breaks, point-in-time recovery is no longer trustworthy. If a removed member returns with an old configuration epoch, it must be fenced before it can serve traffic. These are not optional alerts; they are the evidence that the matrix is still true.
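The first two signals can be turned into a per-shard check that says which recovery claim has stopped being true. This is a sketch under assumptions: the `ShardRecoveryEvidence` fields and the way lag and archive continuity are measured are hypothetical; only the five-second budget comes from the matrix.

```python
from dataclasses import dataclass

RPO_BUDGET_SECONDS = 5.0  # from the regional failover row of the matrix


@dataclass(frozen=True)
class ShardRecoveryEvidence:
    shard_id: int
    remote_replay_lag_seconds: float
    wal_archive_contiguous: bool  # no gap in archived WAL segments


def recovery_claim_violations(evidence: ShardRecoveryEvidence,
                              rpo_budget: float = RPO_BUDGET_SECONDS) -> list[str]:
    """List the recovery promises this shard can no longer honestly make."""
    violations = []
    if evidence.remote_replay_lag_seconds > rpo_budget:
        violations.append(
            f"shard {evidence.shard_id}: remote replay lag "
            f"{evidence.remote_replay_lag_seconds}s exceeds RPO budget {rpo_budget}s"
        )
    if not evidence.wal_archive_contiguous:
        violations.append(
            f"shard {evidence.shard_id}: WAL archive gap, "
            "point-in-time recovery is not trustworthy"
        )
    return violations


lagging = ShardRecoveryEvidence(shard_id=3, remote_replay_lag_seconds=7.2,
                                wal_archive_contiguous=True)
print(len(recovery_claim_violations(lagging)))  # 1
```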
Review Checks
The design review asks a small set of hard questions for every row in the matrix.
1. What state is authoritative for this operation?
2. Which replica or derived view is allowed to serve it?
3. What lag, if any, is part of the public contract?
4. What happens when the serving path cannot prove the guarantee?
5. Which metric, test, or runbook shows that the guarantee still holds?
For POST /reservations, the answer is strict. The authority is the issuer shard leader, the success response means the exposure update and reservation row committed together, and the fallback when the guarantee cannot be proven is to fail or retry with a token, not to accept a weaker write.
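A single-threaded, in-memory stand-in shows the shape of that contract: the limit check, the exposure update, and the reservation record happen together, and a retry with the same token never counts twice. The class and method names are hypothetical, and a real leader would do this inside one transaction on replicated storage.

```python
from dataclasses import dataclass, field


class LimitExceeded(Exception):
    """Raised when a reservation would push exposure past the issuer limit."""


@dataclass
class IssuerShardLeader:
    exposure_limit: int
    exposure: int = 0
    reservations: dict[str, int] = field(default_factory=dict)  # token -> amount

    def reserve(self, token: str, amount: int) -> str:
        if token in self.reservations:
            return "confirmed"  # retry of an already committed request
        if self.exposure + amount > self.exposure_limit:
            raise LimitExceeded(f"{amount} would exceed limit {self.exposure_limit}")
        # Both updates are applied together; nothing weaker is ever accepted.
        self.exposure += amount
        self.reservations[token] = amount
        return "confirmed"

    def token_status(self, token: str) -> str:
        return "confirmed" if token in self.reservations else "unknown"


leader = IssuerShardLeader(exposure_limit=100)
leader.reserve("tok-1", 60)
leader.reserve("tok-1", 60)  # client retried after a timeout
print(leader.exposure)       # 60
```

The token lookup is what lets a client resolve a timeout without guessing: it asks for the token's status instead of blindly resubmitting under a new identity.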
For compliance search, the answer is different. The authority is still the base reservation data, but the normal serving path is derived. The guarantee is freshness within a published lag budget, not immediate linearizability. If CDC lag exceeds the budget, the search surface should report degraded freshness or direct investigators to a stronger shard-backed lookup.
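A search response can carry its own freshness verdict. The sketch below assumes a hypothetical CDC watermark expressed as a timestamp in seconds and a `SearchResponse` shape invented for this example; only the five-second budget comes from the matrix.

```python
from dataclasses import dataclass

SEARCH_LAG_BUDGET_SECONDS = 5.0  # from the compliance search row of the matrix


@dataclass(frozen=True)
class SearchResponse:
    hits: list
    lag_seconds: float
    freshness: str  # "fresh" or "degraded"


def annotate_search(hits: list,
                    latest_commit_ts: float,
                    watermark_ts: float,
                    budget: float = SEARCH_LAG_BUDGET_SECONDS) -> SearchResponse:
    """Attach the CDC lag and a freshness verdict to a search result."""
    lag = max(0.0, latest_commit_ts - watermark_ts)
    freshness = "fresh" if lag <= budget else "degraded"
    return SearchResponse(hits=hits, lag_seconds=lag, freshness=freshness)


response = annotate_search(["res-1"], latest_commit_ts=100.0, watermark_ts=93.0)
print(response.freshness)  # degraded
```

A caller that sees `degraded` knows to fall back to a shard-backed lookup instead of trusting the derived view.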
For regional failover, the answer is operational. Promotion can only use the remote replica's durable replayed prefix, not whatever the old region might have accepted during the outage window. Ambiguous writes are resolved by reservation token status. If the team cannot answer token status after promotion, the runbook does not actually satisfy the storage contract.
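The promotion rule and the token reconciliation step can be sketched together. This is an illustration under assumptions: the log is modeled as an ordered list of committed reservation tokens, and `promote` and `reconcile` are hypothetical helper names, not a real runbook step.

```python
def promote(remote_log: list[str], durable_replayed_index: int) -> set[str]:
    """Return the tokens inside the durable replayed prefix.

    Entries past the prefix may have been accepted by the old region,
    but the promoted replica cannot prove them, so they are excluded.
    """
    return set(remote_log[:durable_replayed_index])


def reconcile(ambiguous_tokens: list[str], committed: set[str]) -> dict[str, str]:
    """Answer token status for requests that timed out around the outage."""
    return {
        token: "confirmed" if token in committed else "not_committed"
        for token in ambiguous_tokens
    }


remote_log = ["tok-1", "tok-2", "tok-3"]  # tok-3 arrived but was never made durable
committed = promote(remote_log, durable_replayed_index=2)

print(reconcile(["tok-2", "tok-3"], committed))
# {'tok-2': 'confirmed', 'tok-3': 'not_committed'}
```

A client holding `tok-3` gets a definite answer and can retry with the same token against the new leader.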
This is where the failure-testing discipline matters. Every high-value row needs an adversarial test: partitions around leader changes, delayed CDC, stale follower reads, lagging remote replicas, and retries after timeouts. A matrix without tests is just a clearer promise. A matrix with tests becomes an acceptance criterion.
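For the reservation row, the acceptance criterion can be written as a checker over a recorded history. The sketch below assumes a hypothetical history format of `(token, amount, outcome)` tuples in commit order; generating the workload and injecting faults are out of scope here.

```python
def check_history(history: list[tuple[str, int, str]], exposure_limit: int) -> list[str]:
    """Check two invariants of the reservation write path.

    1. A token is confirmed at most once.
    2. Confirmed exposure never exceeds the issuer limit.
    """
    violations = []
    seen_tokens = set()
    exposure = 0
    for token, amount, outcome in history:
        if outcome != "confirmed":
            continue
        if token in seen_tokens:
            violations.append(f"token {token} confirmed more than once")
            continue
        seen_tokens.add(token)
        exposure += amount
        if exposure > exposure_limit:
            violations.append(
                f"exposure {exposure} exceeds limit {exposure_limit} at token {token}"
            )
    return violations


# Two desks both saw "confirmed" during a partition: the checker flags it.
history = [("tok-a", 60, "confirmed"), ("tok-b", 50, "confirmed")]
print(check_history(history, exposure_limit=100))
# ['exposure 110 exceeds limit 100 at token tok-b']
```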
Operational Failure Modes
A team says "the database is strongly consistent" without naming the endpoint. The phrase hides too much. The reservation write, token lookup, follower dashboard, and CDC search surface have different guarantees. Split the claim into matrix rows and require each row to name authority, serving path, lag budget, and fallback behavior.
A dashboard starts depending on a follower read as if it were authoritative. The follower path may be correct for bounded-stale operational views, but the caller is treating it like a leader-backed read. Carry observed commit tokens through session-sensitive flows, expose freshness watermarks, and route to the leader when the follower cannot prove the required version.
Regional failover is declared ready while remote replay lag is outside the RPO budget. The system may still be accepting local writes, but the advertised recovery promise is already false. Gate failover readiness and planned topology changes on replay lag, WAL archive continuity, and restore rehearsal evidence.
The architecture diagram shows ownership, but the runbook changes ownership through a different path. Rebalancing, promotion, and membership changes all need a versioned authority update. Require partition-map or configuration generations in routed requests, and make cutover or promotion publish a new generation before traffic resumes.
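Generation checks on routed requests can be sketched as a strict equality test at the receiving node. The `ShardNode` class and the integer generation counter are assumptions for illustration; the source only requires that requests carry a partition-map or configuration generation.

```python
from dataclasses import dataclass


class StaleGeneration(Exception):
    """Router and node disagree about the current configuration generation."""


@dataclass
class ShardNode:
    name: str
    generation: int  # configuration generation this node believes is current

    def handle(self, request_generation: int) -> str:
        # Reject on any mismatch: either the router holds an old partition map,
        # or this node is a returning member that has not caught up yet.
        if request_generation != self.generation:
            raise StaleGeneration(
                f"{self.name}: request generation {request_generation}, "
                f"node generation {self.generation}"
            )
        return "accepted"


node = ShardNode("node-4", generation=7)
print(node.handle(7))  # accepted
```

A request stamped with generation 6 raises `StaleGeneration`, which forces the router to refresh its map before any traffic resumes.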
Connections
- Failure Testing Replication Claims supplies the validation method: every important guarantee row should map to a workload, fault model, and checker.
- Replication Failure Mode Check applies the matrix to a concrete geo-distributed architecture and asks whether shard authority, read paths, and failover semantics agree.
- Observability for Replicated Data Systems explains why lag, replay position, WAL continuity, and restore proof need to be measured directly instead of inferred from API health.
Resources
- [BOOK] Designing Data-Intensive Applications
  - Focus: Revisit partitioning, replication, transactions, and derived data as one design space instead of separate implementation topics.
- [PAPER] Spanner: Google's Globally-Distributed Database
  - Focus: Compare Harbor Point's local-quorum design with a system that pays for stronger cross-region ordering directly.
- [DOC] CockroachDB Multi-Region Overview
  - Focus: Study how production systems expose locality, survival goals, and stale-read trade-offs as explicit schema and topology choices.
- [DOC] Jepsen Analyses
  - Focus: Read failure reports as examples of guarantees that were claimed informally but broken under specific workloads and faults.
- [DOC] Amazon DynamoDB: Best practices for designing and using partition keys effectively
  - Focus: Use it as a concrete checklist for cardinality, request concentration, and hot-partition risk when evaluating candidate shard keys.
Key Takeaways
- A guarantee matrix starts with user-visible operations and authoritative state, not with a preferred replication slogan.
- Harbor Point keeps issuer exposure and confirmed reservations on one shard leader, then uses follower reads, indexes, and CDC only where their weaker guarantees are explicit.
- Lag budgets, token status, WAL continuity, and configuration generations turn replication choices into an operating contract the team can test and enforce.