LESSON
Day 428: Global Secondary Indexes Across Shards
The core idea: Once a table is sharded by one key and queried by another, a global secondary index becomes its own distributed data structure, with separate shards, separate write amplification, and separate correctness hazards that must stay aligned with the base rows.
Today's "Aha!" Moment
In 11.md, Harbor Point learned that a replica can answer quickly and still be too old for a session-sensitive read. Now the team has a different pressure. The bond_reservations table is sharded by desk_id so the 09:30 market burst no longer crushes a single primary, but compliance still needs a fast answer to: "show every open reservation for issuer CA-MUNI across all desks."
The first instinct is usually wrong: "add an index on issuer_id." A local secondary index on each shard helps only after the router has already fanned out to every shard. The expensive part is still there. Harbor Point has not solved the lookup problem; it has only made each shard's local scan cheaper.
The real design shift is to treat the index as another table with another partitioning rule. Harbor Point creates gsi_open_by_issuer, sharded by issuer_id, where each entry points to the base reservation's home shard and primary key. Now an issuer lookup can go straight to the relevant index shard, find the matching reservation IDs, and then fetch the base rows that are still needed.
That is the important mental model: a global secondary index is not a schema decoration. It is a second distributed write path. When reservation R-88421 changes from open to released on data shard D5, the system must also delete or invalidate the corresponding entry on index shard I3. If those state transitions drift apart, the query is wrong even when every shard is healthy in isolation. That is why this lesson sits naturally between 11.md, which focused on read freshness, and 13.md, which will ask how much of this derived state must survive restore and point-in-time recovery.
Why This Matters
Harbor Point's sharding decision is sensible. Partitioning bond_reservations by desk_id spreads write traffic, keeps desk-local workflows fast, and avoids one giant write hotspot during trading hours. The problem is that production systems are rarely queried by exactly the same key they use for partitioning. Risk, compliance, and search workflows nearly always want alternate access paths.
Without a global secondary index, the query WHERE issuer_id = 'CA-MUNI' AND status = 'open' becomes a scatter-gather operation over every desk shard. That may still work at small scale. It gets expensive once the system holds millions of reservations, each shard has its own backlog profile, and the endpoint has a real latency objective during the busiest minute of the day. Tail latency is now tied to the slowest shard in the fan-out.
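The tail-latency effect of scatter-gather can be made concrete with a back-of-envelope calculation. This sketch is illustrative: the shard count and per-shard hit rate are assumed numbers, not Harbor Point measurements.

```python
# If each of N shards independently meets its latency target with
# probability p, a scatter-gather query that must wait for ALL of
# them meets the target with probability p ** N.
def fanout_success_probability(p: float, n_shards: int) -> float:
    return p ** n_shards

# With 12 desk shards each meeting the target 99% of the time, the
# fan-out as a whole meets it only about 88.6% of the time: each
# shard's p99 effectively becomes the whole query's p88.6.
combined = fanout_success_probability(0.99, 12)
```

This is why "each shard is fast" does not imply "the fan-out is fast": the slowest responder in the fan-out sets the latency for the whole query.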
A global secondary index fixes the read shape by precomputing the alternate path. The endpoint can route by issuer_id, not by desk_id, and fetch only the reservations that actually match. The cost is that every relevant write now has at least two storage responsibilities: maintain the source row and maintain the index row. Harbor Point is trading read fan-out for write amplification and cross-shard consistency work.
That trade is often worth it, but only if the team is honest about what the index is. It is not free performance. It is derived state with its own shards, failure modes, lag surface, and repair plan. If compliance treats the index answer as authoritative, operations must be able to say whether the index is current, how stale it can become, and how it is repaired after crashes, retries, or rebalancing.
Learning Objectives
By the end of this session, you will be able to:
- Explain why local indexes stop helping once a query crosses shard boundaries - Distinguish scatter-gather with per-shard indexes from a true global secondary index that is partitioned by the alternate lookup key.
- Trace how a base-row write turns into index mutations on different shards - Follow inserts, updates, deletes, retries, and stale-entry cleanup for Harbor Point's issuer lookup path.
- Evaluate the production trade-offs of global secondary indexes - Decide when the read-latency win justifies the extra write coordination, hotspot risk, and repair machinery.
Core Concepts Explained
Concept 1: A global secondary index is a separately sharded lookup structure
Harbor Point's base table is partitioned by desk_id, which is a good fit for the write path because most reservation updates originate from desk-local workflows. That layout says nothing about issuer-centric queries. If the router receives GET /issuers/CA-MUNI/open-reservations, it cannot infer which desk shards might contain matching rows, because issuer_id is not the partition key.
This is where local and global indexes diverge. A local secondary index on each desk shard can speed up the search inside that shard, but the router still has to ask every shard whether it owns any CA-MUNI rows. The cluster has simply become "many efficient local lookups plus one expensive fan-out." A global secondary index changes the shape of the problem by materializing a second access path that is itself partitioned by the alternate key.
Harbor Point's index entries might look like this:
index key: (issuer_id, status, reservation_id)
payload: home_shard, row_pk, row_version, desk_id, notional
index shard: hash(issuer_id)
With that structure, every CA-MUNI lookup routes to the same index shard family. The read path becomes:
issuer lookup for CA-MUNI
|
v
index shard I3 for hash("CA-MUNI")
|
v
reservation ids + home shards for matching open rows
|
v
fetch base rows from D1, D5, D7 as needed
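The two-hop read path above can be sketched in Python. This is a minimal in-memory model, not Harbor Point's real router: the shard count, the shard-map dictionaries, and the entry fields are illustrative assumptions.

```python
import hashlib

NUM_INDEX_SHARDS = 8  # illustrative index-shard count

def index_shard_for(issuer_id: str) -> int:
    # Route by a stable hash of the ALTERNATE key, not the base shard key.
    digest = hashlib.sha256(issuer_id.encode()).hexdigest()
    return int(digest, 16) % NUM_INDEX_SHARDS

def open_reservations_for_issuer(issuer_id, index_shards, data_shards):
    # Hop 1: read pointer entries from the issuer's single index shard.
    entries = index_shards[index_shard_for(issuer_id)]
    pointers = [e for e in entries
                if e["issuer_id"] == issuer_id and e["status"] == "open"]
    # Hop 2: fetch base rows only from the shards the pointers name.
    return [data_shards[p["home_shard"]][p["row_pk"]] for p in pointers]
```

The key point the sketch makes: the router touches one index shard and then only the data shards that actually hold matching rows, instead of asking every desk shard.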
The payload choice matters. A pointer-style index stores only enough to find the base row, which keeps the index smaller and cheaper to update. A covering index duplicates read-mostly fields such as desk_id, notional, or reservation_ts, which can avoid the second hop for some endpoints. Harbor Point must decide whether the saved read latency is worth duplicating more bytes and widening every index update.
The trade-off is now concrete. The system gains a targeted lookup path for non-shard-key queries, but it also creates a second partition map and a second place where skew can appear. If one issuer becomes disproportionately hot, the problem moves from "all desk shards are queried" to "one index shard is overloaded." Global indexes remove one form of fan-out while potentially concentrating another.
Concept 2: Correct index maintenance means transforming row changes into cross-shard mutations
Suppose desk ALPHA creates reservation R-88421 for issuer CA-MUNI, and the base row lands on shard D5 because desk_id='ALPHA'. Harbor Point also wants the row to appear in gsi_open_by_issuer on index shard I3. That means one logical user action produces at least two physical state changes:
- insert or update the base reservation on D5
- insert or update the issuer index entry on I3
The same issue shows up on updates. If R-88421 changes from open to released, the old row contributed to the index and the new row no longer should. If the issuer changes from CA-MUNI to NY-MUNI, Harbor Point must delete the old index key and insert a new one on a different index shard. If the row moves during resharding, the index payload's home_shard must change too. Global indexes are therefore maintained from before and after images, not from the SQL verb alone.
Harbor Point's mutation builder can be reasoned about like this:
def issuer_index_mutations(before, after):
    if before and before.status == "open":
        yield Delete(
            key=(before.issuer_id, "open", before.reservation_id),
            row_version=before.row_version,
        )
    if after and after.status == "open":
        yield Put(
            key=(after.issuer_id, "open", after.reservation_id),
            home_shard=after.home_shard,
            row_pk=after.reservation_id,
            row_version=after.row_version,
            desk_id=after.desk_id,
            notional=after.notional,
        )
The hard part is not producing those mutations. It is deciding how strongly they must line up with the base commit. A synchronous design updates the base row and the index entry in one distributed transaction. That gives Harbor Point the cleanest semantics: once the write commits, issuer queries see the new state immediately. The price is higher write latency, more coordination, and more distributed transaction failure handling.
An asynchronous design commits the base row first, then derives index mutations from the ordered log or CDC stream. That usually scales writes better and isolates the index pipeline operationally, but it creates a visibility gap. The index may temporarily miss a new reservation or still expose a released one. If Harbor Point chooses that route, it needs explicit freshness metadata on the index path, the same way 11.md needed explicit replay positions on replica reads.
Retries make the maintenance contract even stricter. If the system replays the same index mutation after a worker restart, the index must not silently duplicate entries. That is why practical designs keep stable row identifiers and versions in the index payload. A Put should overwrite the logical entry for (issuer_id, status, reservation_id) at a known version, and a Delete should be idempotent. Otherwise, normal retry behavior turns into correctness drift.
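One way to make replayed mutations safe is a version check plus a tombstone on delete, so a replayed older Put cannot resurrect a removed entry. This is a sketch over a dict-backed index shard; the mutation shape and the `deleted` flag are illustrative assumptions, not Harbor Point's actual storage format.

```python
def apply_mutation(index_shard: dict, mutation: dict) -> None:
    # Skip any mutation at or below the version already applied for
    # this key, so worker restarts and replays are harmless no-ops.
    current = index_shard.get(mutation["key"])
    if current is not None and current["row_version"] >= mutation["row_version"]:
        return
    if mutation["op"] == "put":
        index_shard[mutation["key"]] = {
            "row_version": mutation["row_version"],
            "deleted": False,
            **mutation.get("payload", {}),
        }
    else:
        # Delete keeps a versioned tombstone instead of removing the key,
        # so a replayed older Put cannot resurrect the entry.
        index_shard[mutation["key"]] = {
            "row_version": mutation["row_version"],
            "deleted": True,
        }

def read_entry(index_shard: dict, key):
    entry = index_shard.get(key)
    return None if entry is None or entry["deleted"] else entry
```

Without the tombstone, a Delete followed by a replayed Put at an older version would silently reinsert the released reservation, which is exactly the correctness drift the text describes.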
Concept 3: Operationally safe global indexes need hotspot control, observability, and repair paths
Once Harbor Point adds gsi_open_by_issuer, the index becomes part of the production surface, not just part of the schema. The team now needs index-specific metrics: write amplification per base mutation, lag between base commits and index application, query hit rate by index key, index-shard p99 latency, and mismatch counts from consistency checks. Without those signals, the team can see "issuer search is slow" but cannot tell whether the cause is a hot key, a lagging indexer, or stale payloads that trigger too many second-hop reads.
Skew is the first operational hazard. If CA-MUNI is the busiest issuer on the platform, hashing only on issuer_id can make one index shard much hotter than the others. The fix depends on the query contract. Harbor Point can pre-split the issuer's index range, bucket the key with an extra dimension such as date window, or store multiple partitions for very hot issuers and accept a bounded fan-out across those buckets. None of these are free. They trade a single lookup for a slightly wider query in order to stop one shard from melting.
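The bucketing idea can be sketched like this. The per-issuer bucket counts and the `issuer#bucket` key format are assumed conventions for illustration; real systems vary in how they pre-split hot ranges.

```python
import hashlib

# Hypothetical config: very hot issuers get extra index buckets.
HOT_KEY_BUCKETS = {"CA-MUNI": 4}

def bucket_count(issuer_id: str) -> int:
    return HOT_KEY_BUCKETS.get(issuer_id, 1)

def write_bucket(issuer_id: str, reservation_id: str) -> str:
    # Spread writes for a hot issuer across its buckets by hashing a
    # second dimension (here the row id), so no single bucket absorbs
    # all of the issuer's update traffic.
    b = int(hashlib.sha256(reservation_id.encode()).hexdigest(), 16)
    return f"{issuer_id}#{b % bucket_count(issuer_id)}"

def read_buckets(issuer_id: str) -> list:
    # Reads accept a small, bounded fan-out: one lookup per bucket.
    return [f"{issuer_id}#{b}" for b in range(bucket_count(issuer_id))]
```

This is the trade stated in the text made explicit: a CA-MUNI query now touches four buckets instead of one, in exchange for no single index shard melting under that issuer's load.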
The second hazard is divergence. The index can contain orphaned entries after failed deletes, stale home_shard pointers after row moves, or missing entries after log gaps. The only durable answer is repair tooling that treats the base table as the source of truth, scans a slice of reservations, regenerates the expected index entries, and compares them to what the index currently stores. In other words, Harbor Point must plan for index reconciliation before the first incident, not after one.
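A reconciliation pass over one slice can be sketched as: regenerate the entries the base rows imply (only open reservations are indexed, matching the mutation builder's rule), then diff against what the index actually stores. Field names are illustrative.

```python
def expected_entries(base_rows):
    # The base table is the source of truth: derive the index entries
    # it implies, keyed the same way as gsi_open_by_issuer.
    return {
        (r["issuer_id"], "open", r["reservation_id"]): {"home_shard": r["home_shard"]}
        for r in base_rows
        if r["status"] == "open"
    }

def reconcile(base_rows, index_entries):
    # Diff the live index against the regenerated truth: orphaned keys
    # should be deleted, missing keys inserted, stale payloads repaired.
    expected = expected_entries(base_rows)
    orphaned = [k for k in index_entries if k not in expected]
    missing = [k for k in expected if k not in index_entries]
    stale = [k for k in expected
             if k in index_entries and index_entries[k] != expected[k]]
    return orphaned, missing, stale
```

A real job would scan slices incrementally and guard against racing with live writers, but the core shape is this three-way diff.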
This also explains why teams often classify a global secondary index as rebuildable derived state. If Harbor Point loses an index shard but still has the base rows plus a durable commit history, it can usually rebuild the index. The rebuild cost may be painful, but the data model remains recoverable. That classification matters because 13.md is next: backup and point-in-time recovery are much simpler when the system is clear about which structures are authoritative and which ones can be regenerated to a specific log position.
Finally, the query semantics still matter after the index exists. A fast issuer lookup is not automatically a trustworthy one. If Harbor Point serves the result from an asynchronously maintained index and then from a lagging replica of that index shard, staleness compounds. The read contract must say whether issuer lookup is fully current, session-consistent, or merely bounded-stale. A global index makes the query cheaper. It does not eliminate the need to state what "correct enough right now" means.
Troubleshooting
Issue: A released reservation still appears in GET /issuers/CA-MUNI/open-reservations for several seconds after the base write commits.
Why it happens / is confusing: The base shard is current, but the global index is maintained asynchronously and has not yet applied the delete for the old (issuer_id, status, reservation_id) entry. If the read then goes to a replica of the index shard, the visibility gap is even larger.
Clarification / Fix: Expose an index freshness watermark, route session-sensitive issuer lookups through a path that can prove the required commit position, and fall back to the base shard or a stronger consistency mode when the index is behind.
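The fix can be sketched as a watermark gate: the session carries the commit position it requires, and the read path serves from the index only when the index's applied watermark proves it has caught up that far. The position type and the fallback callables are illustrative assumptions.

```python
def serve_issuer_lookup(required_position, index_watermark,
                        read_from_index, read_from_base):
    # Serve from the cheap index path only when its applied watermark
    # proves it has replayed at least to the session's required commit
    # position; otherwise fall back to the authoritative base path.
    if index_watermark >= required_position:
        return read_from_index()
    return read_from_base()
```

This mirrors the replica-read pattern from 11.md: freshness is something the serving path must be able to prove, not assume.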
Issue: One index shard shows far higher CPU and p99 latency than the rest, even though the base-table shards look balanced.
Why it happens / is confusing: The alternate lookup key is skewed. Sharding the base table by desk_id balanced writes, but the global index is partitioned by issuer_id, and a few issuers dominate query and update volume.
Clarification / Fix: Measure cardinality and request rate by index key, not just by shard. For hot issuers, pre-split or bucket the index key and let the router query a small bounded set of buckets instead of forcing one shard to absorb all traffic.
Issue: Index lookups return reservation IDs that no longer exist on the referenced data shard after a rebalance.
Why it happens / is confusing: The row moved, but the old index payload still points to the previous home_shard, or a retry reinserted an old pointer after the move had already completed.
Clarification / Fix: Store row version and shard epoch information with the index entry, make move operations update index payloads as first-class mutations, and run reconciliation jobs that compare current base locations with index pointers.
Advanced Connections
Connection 1: 11.md taught that freshness needs proof; a global secondary index is another place where that proof can be lost
Replica lag showed Harbor Point that a fast answer can still be stale when the serving node has not replayed far enough. A global secondary index adds another replay surface. Even if the base row committed correctly, the query is wrong until the index entry representing that row is updated and visible on the chosen read path.
Connection 2: 13.md will turn global index design into a recovery question
Backups and point-in-time recovery force a decision: is the global index part of the authoritative snapshot, or is it derived state that should be rebuilt to the restored log position? That choice changes backup size, restore time, and the operational steps after a partial data loss event.
Resources
Optional Deepening Resources
- [DOC] Secondary indexes in Cloud Spanner
- Focus: See how a distributed SQL system exposes secondary indexes, backfill behavior, and the cost of maintaining alternate access paths.
- [DOC] Using Global Secondary Indexes in DynamoDB
- Focus: Compare a production system where alternate-key lookups are explicit, eventually consistent by default, and paid for through extra write and storage cost.
- [DOC] Lookup Vindexes in Vitess
- Focus: Study a concrete sharded-database mechanism where a separate lookup table maps alternate keys to the correct shard.
- [DOC] Indexes | CockroachDB Docs
- Focus: Review how covering indexes, stored columns, and distributed execution affect the read-versus-write trade-off.
Key Insights
- A local index does not become global just because the table is sharded - The router still needs a separately partitioned access path if the query key differs from the data shard key.
- Global secondary indexes are maintained state, not free metadata - Every base-row change can require cross-shard index mutations, idempotent retries, and explicit freshness handling.
- The hardest part is operational, not syntactic - Hotspot control, divergence repair, and recovery strategy determine whether the index is safe to trust in production.