Secondary Indexes Across Shards
LESSON
Secondary Indexes Across Shards
The core idea: Once a table is sharded by one key and queried by another, a global secondary index becomes its own distributed data structure, with separate shards, separate write amplification, and separate correctness hazards that must stay aligned with the base rows.
Core Insight
Harbor Point now has a sharded reservation service whose strongest writes route by allocation_bucket_id. That is the right authority boundary for hold, release, extend, and confirm decisions: one shard owns the scarce allocation bucket and can decide the invariant locally. The trouble starts when compliance asks a different question during the market open: "show every open reservation for issuer CA-MUNI, across every desk and allocation bucket."
The first instinct is usually too small: "add an index on issuer_id." A local index on each allocation shard helps only after the router has already asked that shard to search. The expensive part is still there. The router does not know which allocation-bucket owners contain open CA-MUNI rows, so it must fan out across the cluster and wait for the slowest useful answer.
The real design shift is to treat the index as another sharded table with another authority rule. Harbor Point creates gsi_open_by_issuer, partitioned by (issuer_id, status), where each entry points back to the base reservation's allocation bucket, row key, version, and current owner epoch. Now an issuer lookup can route to the relevant index shard first, find matching reservation IDs, and fetch or validate only the base rows it needs.
That is the important mental model: a global secondary index is not a schema decoration. It is a second distributed write path. When reservation R-88421 changes from open to released on the base shard, the system must also delete or invalidate the corresponding entry on index shard I3. If those state transitions drift apart, the issuer query is wrong even when every shard is healthy in isolation.
From Local Index to Global Lookup
Harbor Point's primary layout is still sensible. The base table is partitioned so decisive reservation writes stay close to the allocation bucket they modify. The problem is that production systems are rarely queried only by the key that defines write authority. Risk, compliance, search, and operations workflows ask for alternate access paths.
Without a global secondary index, this query becomes scatter-gather over every allocation shard:
SELECT *
FROM reservations
WHERE issuer_id = 'CA-MUNI'
AND status = 'open';
Each shard can use its local issuer_id index to search quickly inside its own slice, but the router still has to ask all of them. The cluster has become "many efficient local lookups plus one expensive fan-out." That may work at small scale. It gets fragile once each shard has a different backlog, the endpoint has a real latency objective, and tail latency is tied to the slowest shard in the fan-out.
A global secondary index changes the read shape by materializing a separately partitioned lookup structure:
index key: (issuer_id, status, reservation_id)
payload: allocation_bucket_id, row_pk, row_version, owner_epoch, desk_id, notional
index shard: hash(issuer_id, status)
With that structure, the issuer query routes to the index shard family for CA-MUNI/open:
GET /issuers/CA-MUNI/open-reservations
|
v
index shard I3 for hash("CA-MUNI", "open")
|
v
reservation ids + allocation buckets + owner epochs
|
v
fetch or validate base rows from current bucket owners
The payload choice matters. A pointer-style index stores only enough to find the base row, which keeps the index smaller and cheaper to update. A covering index duplicates read-mostly fields such as desk_id, notional, reservation_ts, or limit_bucket, which can avoid the second hop for some endpoints. Harbor Point has to decide whether the saved read latency is worth duplicating more bytes and widening every index update.
The trade-off is concrete. The system gains a targeted lookup path for non-shard-key queries, but it also creates a second partition map and a second place where skew can appear. If one issuer dominates the morning, the problem moves from "all base shards are queried" to "one index shard is overloaded." Global indexes remove one form of fan-out while potentially concentrating another.
Maintaining the Index
Suppose desk ALPHA creates reservation R-88421 for issuer CA-MUNI, and the base row lands on the owner for allocation bucket 173. Harbor Point also wants that row to appear in gsi_open_by_issuer on index shard I3. One logical user action now produces at least two physical state changes:
- insert or update the base reservation on the owner of bucket
173 - insert or update the issuer index entry on
I3
The same issue appears on updates. If R-88421 changes from open to released, the old row contributed to the index and the new row no longer should. If the issuer changes from CA-MUNI to NY-MUNI, Harbor Point must delete the old index key and insert a new one on a different index shard. If bucket 173 moves from shard B to shard F during rebalancing, the index payload's owner information must change or be validated through a fresh partition map.
That is why global indexes are maintained from before and after images, not from the SQL verb alone:
def issuer_index_mutations(before, after):
if before and before.status == "open":
yield Delete(
key=(before.issuer_id, "open", before.reservation_id),
base_version=before.row_version,
)
if after and after.status == "open":
yield Put(
key=(after.issuer_id, "open", after.reservation_id),
allocation_bucket_id=after.allocation_bucket_id,
row_pk=after.reservation_id,
base_version=after.row_version,
owner_epoch=after.owner_epoch,
desk_id=after.desk_id,
notional=after.notional,
)
The hard part is deciding how strongly those mutations must line up with the base commit. A synchronous design updates the base row and the index entry in one distributed transaction. That gives Harbor Point the cleanest semantics: once the write commits, issuer queries see the new state immediately. The price is higher write latency, more coordination, and more distributed transaction failure handling.
An asynchronous design commits the base row first, then derives index mutations from the ordered log or CDC stream. That usually scales writes better and isolates the index pipeline operationally, but it creates a visibility gap. The index may temporarily miss a new reservation or still expose a released one. If Harbor Point chooses that route, it needs explicit freshness metadata on the index path.
Retries make the maintenance contract even stricter. If the system replays the same index mutation after a worker restart, the index must not silently duplicate entries. That is why practical designs keep stable row identifiers and versions in the index payload. A Put should overwrite the logical entry for (issuer_id, status, reservation_id) at a known version, and a Delete should be idempotent. Otherwise, normal retry behavior turns into correctness drift.
Freshness, Hotspots, and Repair
Once Harbor Point adds gsi_open_by_issuer, the index becomes part of the production surface, not just part of the schema. The team now needs index-specific metrics: write amplification per base mutation, lag between base commits and index application, query hit rate by index key, index-shard p99 latency, and mismatch counts from consistency checks. Without those signals, the team can see "issuer search is slow" but cannot tell whether the cause is a hot key, a lagging indexer, or stale payloads that trigger too many second-hop reads.
Skew is the first operational hazard. If CA-MUNI is the busiest issuer on the platform, hashing only on issuer_id can make one index shard much hotter than the others. The fix depends on the query contract. Harbor Point can pre-split the issuer's index range, bucket the key with an extra dimension such as date window, or store multiple partitions for very hot issuers and accept a bounded fan-out across those buckets. None of these are free. They trade a single lookup for a slightly wider query in order to stop one shard from melting.
The second hazard is divergence. The index can contain orphaned entries after failed deletes, stale owner-epoch payloads after row moves, or missing entries after log gaps. The only durable answer is repair tooling that treats the base table as the source of truth, scans a slice of reservations, regenerates the expected index entries, and compares them to what the index currently stores. In other words, Harbor Point must plan for index reconciliation before the first incident, not after one.
This also explains why teams often classify a global secondary index as rebuildable derived state. If Harbor Point loses an index shard but still has the base rows plus a durable commit history, it can usually rebuild the index. The rebuild cost may be painful, but the data model remains recoverable. That classification matters for backup and restore: the system must know whether the index is part of the authoritative snapshot or whether it should be regenerated to a specific log position.
Finally, the query semantics still matter after the index exists. A fast issuer lookup is not automatically a trustworthy one. If Harbor Point serves the result from an asynchronously maintained index and then from a lagging replica of that index shard, staleness compounds. The read contract must say whether issuer lookup is fully current, session-consistent, or merely bounded-stale. A global index makes the query cheaper. It does not eliminate the need to state what "correct enough right now" means.
Operational Failure Modes
A released reservation still appears in GET /issuers/CA-MUNI/open-reservations. The base shard is current, but the global index is maintained asynchronously and has not yet applied the delete for the old (issuer_id, status, reservation_id) entry. If the read then goes to a replica of the index shard, the visibility gap is even larger. Expose an index freshness watermark, route session-sensitive issuer lookups through a path that can prove the required commit position, and fall back to the base owner or a stronger consistency mode when the index is behind.
One index shard is hot even though the base shards look balanced. The alternate lookup key is skewed. Sharding the base table by allocation bucket balanced the decisive write path, but the global index is partitioned by issuer_id, and a few issuers dominate query and update volume. Measure cardinality and request rate by index key, not just by shard. For hot issuers, pre-split or bucket the index key and let the router query a small bounded set of buckets instead of forcing one shard to absorb all traffic.
Index lookups return pointers that no longer match the base owner after a rebalance. The row moved, but the index payload still points to the previous owner epoch, or a retry reinserted an old pointer after the move had already completed. Store row version and owner epoch information with the index entry, make move operations update or invalidate index payloads as first-class mutations, and run reconciliation jobs that compare current base locations with index pointers.
Connections
- Rebalancing Partitions Under Live Traffic explains why index payloads need owner versions or fresh routing when base rows move.
- Consistency Contracts and API Semantics explains why a fast index answer still needs a stated freshness contract.
- Clocks, Leases, and Safe Reads continues the same pattern from a different angle: optimized reads are safe only when the system can prove the serving state is current enough.
Resources
- [DOC] Secondary indexes in Cloud Spanner
- Focus: See how a distributed SQL system exposes secondary indexes, backfill behavior, and the cost of maintaining alternate access paths.
- [DOC] Using Global Secondary Indexes in DynamoDB
- Focus: Compare a production system where alternate-key lookups are explicit, eventually consistent by default, and paid for through extra write and storage cost.
- [DOC] Lookup Vindexes in Vitess
- Focus: Study a concrete sharded-database mechanism where a separate lookup table maps alternate keys to the correct shard.
- [DOC] Indexes | CockroachDB Docs
- Focus: Review how covering indexes, stored columns, and distributed execution affect the read-versus-write trade-off.
Key Takeaways
- A local index does not become global just because the table is sharded; the router still needs a separately partitioned access path when the query key differs from the authority key.
- A global secondary index is maintained distributed state, so every relevant base-row change can create cross-shard index mutations, retries, and freshness questions.
- Async indexes can be excellent for read latency, but only when their staleness contract is explicit and visible.
- Hotspot control, divergence repair, and rebuild strategy determine whether the index is safe to trust in production.