Distributed Query Routing and Shard Targeting

LESSON

Database Engine Internals and Implementation

Lesson 040 · 30 min · Advanced

Day 441: Distributed Query Routing and Shard Targeting

The core idea: A distributed SQL router is only as smart as the placement facts it can prove. If a query carries the shard key, the router can hit one shard and let local indexes do their job. If it does not, the system pays for fan-out, lookup indirection, or both.

Today's "Aha!" Moment

In 039.md, Harbor Point chose a page-oriented MVCC engine, a small index portfolio, and an issuer_exposure row that every approval transaction must lock. That design made the single primary coherent. It did not make the primary infinite. At the market open, one machine now spends too much time on WAL flushes, buffer churn, and hot issuer traffic, so Harbor Point splits the reservation system into eight shards.

The crucial shift is that sharding does not turn local indexes into global knowledge. Each shard still has its own B-trees, MVCC snapshots, and lock table. The new component is a router that has to decide, before execution, which shard or shards own the rows a query might touch. If the router can target exactly one shard, the old single-node reasoning mostly survives. If it cannot, even a well-indexed query turns into eight remote queries plus merge work.

That is why query routing is not just a latency optimization. It is also part of the correctness contract. Harbor Point's issuer-limit invariant only remains safe if every reservation for one issuer and that issuer's summary row land on the same shard. Resharding only remains safe if routers notice metadata epochs changing and stop sending writes to the old owner. The next lesson, 041.md, will assume the target shard set is already known; this lesson is about how the system earns that knowledge.

Why This Matters

Harbor Point's workload has three very different query shapes. Traders ask for "open reservations for issuer MUNI-77, newest first." Incident responders fetch a reservation by reservation_id when an order is disputed. Compliance runs same-day scans by risk_bucket and trading_day, which may span the whole book. A sharded system only feels fast when those shapes are treated differently instead of being pushed through one generic "distributed SQL" abstraction.

Without deliberate shard targeting, the failure mode is subtle. The dashboard query fans out to every shard even though the data for one issuer is colocated. The reservation lookup becomes a broadcast because the identifier does not reveal ownership. A reshard operation quietly leaves some routers with stale placement metadata, so retries bounce between shards and p99 spikes during the busiest part of the day. None of those incidents are fixed by adding another local index. They are routing failures.

Good routing turns scale-out from "more nodes" into "smaller search spaces." The router extracts the shard key when possible, consults a versioned placement map, and sends the query only to the shards that can possibly match. When the query shape does not allow that, the system should admit it honestly and choose a controlled fan-out plan. Production relevance comes from knowing which case you are in before the query hits the network.

Learning Objectives

By the end of this session, you will be able to:

  1. Explain what a query router must know before execution - Describe the role of shard keys, placement metadata, and metadata epochs in directing a query to the correct shard set.
  2. Distinguish exact targeting from lookup-based routing and scatter-gather - Predict how Harbor Point's main query shapes map to one shard, a small shard set, or the entire cluster.
  3. Reason about routing failures during resharding and replica changes - Explain why stale route caches, global secondary lookups, and fan-out queries create different operational risks.

Core Concepts Explained

Concept 1: Routing metadata is the control plane for data ownership

Harbor Point shards both reservations and issuer_exposure by a stable function of issuer, so every issuer's hot rows and summary row stay together. The router does not discover that ownership by scanning shards. It keeps a placement map from key range to shard, plus role and health information for each shard replica. A simplified routing entry looks like this:

epoch 18
[00-1f] -> shard-a  primary=db-a-p  replicas=[db-a-r1, db-a-r2]
[20-3f] -> shard-b  primary=db-b-p  replicas=[db-b-r1, db-b-r2]
...

When the query arrives, the router normalizes the SQL into something it can reason about: predicates, parameter values, statement type, and transaction context. For SELECT ... WHERE issuer = 'MUNI-77' AND status = 'open' ORDER BY submitted_at DESC LIMIT 50, the router applies the same partitioning function used when the row was inserted, finds the owning key range, and targets exactly one shard. Only then do the local indexes from 039.md matter.
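
A minimal sketch of that targeting step, assuming a hash-prefix partitioning function consistent with the epoch-18 map above; the PlacementMap class and its method names are illustrative, not a specific product's API:

import hashlib
from bisect import bisect_left

class PlacementMap:
    """Epoch-versioned map from hash-prefix key ranges to owning shards."""
    def __init__(self, epoch, ranges):
        self.epoch = epoch
        self.ranges = ranges                      # sorted list of (upper_bound, shard_name)

    def owner_for_issuer(self, issuer):
        # Apply the same partitioning function used when the row was inserted:
        # hash the shard key, keep the first byte, and find the owning key range.
        prefix = hashlib.sha256(issuer.encode()).digest()[0]
        bounds = [upper for upper, _ in self.ranges]
        return self.ranges[bisect_left(bounds, prefix)][1]

# Eight shards, each owning a 0x20-wide slice: [00-1f] -> shard-a, ..., [e0-ff] -> shard-h.
placement = PlacementMap(
    epoch=18,
    ranges=[((i + 1) * 0x20 - 1, f"shard-{chr(ord('a') + i)}") for i in range(8)],
)
placement.owner_for_issuer("MUNI-77")             # exactly one shard; local indexes do the rest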

This explains a common misconception in distributed databases: a local secondary index is not a cluster-wide directory. Every shard may have an index on (issuer, submitted_at DESC), but the router still needs the shard key to know which shard's index to use. If the query arrives without an issuer predicate, the router cannot infer ownership from thin air. It either needs a separate lookup structure or it must ask multiple shards.

The main trade-off is granularity. More shards let Harbor Point spread write load and isolate failures, but they also increase the routing-table size, metadata churn, and the chance that one query spans multiple key ranges. Sharding makes ownership explicit; it does not make ownership free.

Concept 2: Query shape determines whether targeting is exact, lookup-based, or scatter-gather

Harbor Point's three important query families show why "distributed query routing" is really a collection of routing modes.

For dashboard reads filtered by issuer, targeting is exact. The router hashes or ranges on issuer, picks one shard, and sends down a normal local SQL plan. For writes that update issuer_exposure and insert a reservation, exact targeting is even more important because the transaction must remain local to preserve the invariant from 039.md.
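
A sketch of that single-shard write path, assuming a per-shard connection helper; shard_connection, its transaction() context manager, the column names, and the exact SQL are illustrative:

def reserve(router, issuer, amount, reservation_id):
    # Both statements key on the same issuer, so they land on one shard and the
    # issuer-limit invariant from 039.md stays a single-node transaction.
    shard = router.owner_for_issuer(issuer)
    with shard_connection(shard) as conn, conn.transaction():
        conn.execute(
            "UPDATE issuer_exposure SET reserved = reserved + %s WHERE issuer = %s",
            (amount, issuer),
        )
        conn.execute(
            "INSERT INTO reservations (reservation_id, issuer, status, submitted_at) "
            "VALUES (%s, %s, 'open', now())",
            (reservation_id, issuer),
        )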

For incident lookups by reservation_id, Harbor Point has a design choice. It can encode shard ownership into the identifier itself, or it can maintain a lookup table such as reservation_locator, which maps reservation_id to the owning issuer (or directly to a shard). That turns the route into a two-step process: first resolve ownership, then target the shard. The operational cost is that the lookup structure itself must stay consistent with the base row. A missing or stale lookup entry makes a point lookup fail even though the reservation still exists.
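
A sketch of the two-step route under that design, assuming the router can read the locator cheaply; locator_get and the other helper names are illustrative:

def route_point_lookup(router, reservation_id):
    # Step 1: resolve ownership. The locator maps reservation_id to the issuer,
    # and the issuer is the real shard key.
    issuer = router.locator_get(reservation_id)     # may be missing or stale
    if issuer is None:
        # A missing locator entry should degrade to a deliberate, logged
        # broadcast rather than failing a lookup for a row that still exists.
        return router.all_shards()
    # Step 2: target the owning shard exactly as an issuer-keyed query would.
    return [router.owner_for_issuer(issuer)]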

For compliance scans like WHERE trading_day = CURRENT_DATE AND risk_bucket >= 4, the query does not contain the shard key at all. Harbor Point should not pretend otherwise. The router must fan the query out to all shards, push filters down, and merge the results:

def target_shards(predicates, route_cache):
    # Exact targeting: the shard key is present, so one shard owns every match.
    if "issuer" in predicates:
        return [route_cache.owner_for_issuer(predicates["issuer"])]
    # Lookup-based targeting: resolve ownership via the reservation locator first.
    if "reservation_id" in predicates:
        return [lookup_locator(predicates["reservation_id"])]
    # No shard key anywhere in the predicates: controlled scatter-gather.
    return route_cache.all_shards()
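
Run against the three query shapes above, the sketch makes the cost difference visible (the predicate dictionaries, values, and route_cache object are illustrative):

target_shards({"issuer": "MUNI-77", "status": "open"}, route_cache)           # one shard
target_shards({"reservation_id": "R-000123"}, route_cache)                    # one shard, after a locator hop
target_shards({"trading_day": "2024-06-03", "risk_bucket": 4}, route_cache)   # every shard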

That scatter-gather path is sometimes the right answer, but it should be treated as expensive by design. It consumes network bandwidth, opens more remote executors, and turns tail latency into "slowest shard wins." This is the exact handoff into 041.md: once a router targets many shards, the engine needs exchange operators and merge logic to produce one result set.

The trade-off is honest and production-relevant. Lookup-based routing reduces fan-out for non-shard-key queries, but it adds write amplification and another consistency surface. Pure scatter-gather keeps writes simpler, but it makes certain analytical or operational queries scale with the number of shards instead of the selectivity of the predicate.

Concept 3: The hard part is staying correct while ownership changes

Routing is easiest when the cluster is static. Production clusters are not static. Harbor Point will split a hot shard when one issuer range grows too fast, drain replicas during maintenance, and fail over a primary after a crash. Every one of those actions changes the answer to "where should this query go?" while clients are still sending traffic.

Suppose shard-b is split at metadata epoch 19 into shard-b1 and shard-b2. A router that still caches epoch 18 may send issuer = 'MUNI-77' to the old shard. The old owner should not silently execute the write if ownership has moved. A safer pattern is to reject with a stale-route signal, including the newer epoch or a config reference, so the router refreshes placement metadata and retries. For retried writes, Harbor Point also needs idempotency keys; otherwise the refresh path can create duplicate reservations if the first attempt partially succeeded before the topology change was noticed.
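
A sketch of that refresh-and-retry loop, assuming the old owner's rejection carries the newer epoch and that writes carry an idempotency key; StaleRouteError and the method names are illustrative:

class StaleRouteError(Exception):
    """Raised (conceptually) by a shard that no longer owns the key range."""
    def __init__(self, current_epoch):
        super().__init__(f"shard reports placement epoch {current_epoch}")
        self.current_epoch = current_epoch

def execute_write(router, statement, params, issuer, idempotency_key, max_attempts=3):
    for _ in range(max_attempts):
        shard = router.owner_for_issuer(issuer)
        try:
            # The idempotency key lets a re-sent write be deduplicated if the first
            # attempt partially succeeded before the topology change was noticed.
            return shard.execute(statement, params,
                                 idempotency_key=idempotency_key,
                                 expected_epoch=router.epoch)
        except StaleRouteError as err:
            # The old owner refuses the write and reports how stale the router is.
            router.refresh_placement(min_epoch=err.current_epoch)
    raise RuntimeError("route still stale after refresh; escalate rather than retry blindly")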

Replica choice is part of shard targeting too. Once the router knows the shard, it still has to choose a primary or follower. Harbor Point's approval path and trader dashboard both want current open state, so they normally target the primary or a follower that meets a strict staleness budget. A low-priority compliance scan may accept a replica that is a few seconds behind. That means the routing layer consumes placement metadata and freshness metadata together.
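
A sketch of replica choice once the shard is known, assuming the router tracks per-replica health and replication lag; the field names are illustrative:

def pick_replica(shard, max_staleness_seconds):
    # The approval path and trader dashboard pass 0: current open state only.
    if max_staleness_seconds == 0:
        return shard.primary
    # Low-priority scans may use any healthy follower within the staleness budget,
    # preferring the least-loaded one; fall back to the primary if none qualifies.
    candidates = [r for r in shard.replicas
                  if r.healthy and r.lag_seconds <= max_staleness_seconds]
    return min(candidates, key=lambda r: r.load, default=shard.primary)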

The production trade-off is cache aggressiveness versus safety. Caching route maps and health data in the router reduces per-query control-plane chatter, which matters at high QPS. The price is a staleness window during resharding or failover. Forcing the router to revalidate on every query shrinks that window, but it pushes topology latency into the hot path. Mature systems usually mix both approaches: cache aggressively, version everything, and make stale-route retries explicit and observable.

Troubleshooting

Issue: A query that should be single-shard suddenly fans out to every shard.

Why it happens / is confusing: The application stopped carrying the shard key explicitly, or wrapped it in a transformation the router cannot analyze, such as WHERE lower(issuer) = 'muni-77'. The local indexes still exist, so teams often blame the database engine instead of the targeting logic.

Clarification / Fix: Canonicalize shard-key values before the query reaches the router and keep the predicate directly visible. If the product truly needs shard-key-free lookups, add a lookup structure deliberately instead of hoping the router will infer ownership.
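
For example, one possible canonicalization step in the application layer (the normalization rule itself is application-specific):

def canonical_issuer(raw):
    # Normalize once in the application, so the router sees a plain equality
    # predicate on the shard key instead of lower(issuer) = %s.
    return raw.strip().upper()

# WHERE issuer = %s with canonical_issuer(user_input) stays single-shard;
# WHERE lower(issuer) = %s hides the shard key and forces fan-out.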

Issue: Resharding causes bursts of "wrong shard" or "stale metadata" errors even though the new shards are healthy.

Why it happens / is confusing: The data move succeeded, but one or more routers are still serving traffic from an older placement epoch. The cluster looks healthy at the storage layer while the routing layer is temporarily inconsistent.

Clarification / Fix: Version route metadata, make stale-route retries automatic, and log the retry count by query shape. During shard moves, keep the old owner in a draining state long enough for routers to refresh instead of dropping traffic immediately.

Issue: Total cluster CPU is low, but one shard is saturated every market open.

Why it happens / is confusing: The shard key preserves locality for a hot issuer, which is useful for correctness, but it also preserves that tenant's traffic hotspot. Sharding by issuer solved cross-issuer routing but not within-issuer skew.

Clarification / Fix: Decide whether the workload can tolerate a finer-grained shard key, such as issuer plus time bucket, or whether the hot issuer deserves a dedicated shard. Do not split the key blindly if it would break the single-shard transaction boundary for issuer_exposure.

Advanced Connections

Connection 1: 039.md still determines what happens after routing succeeds

The previous capstone chose the local engine, index structures, and isolation boundary inside one database node. This lesson does not replace that work. It decides which node gets to apply it. Distributed routing is therefore a multiplier on local design, not an alternative to it.

Connection 2: 041.md begins once the shard set is known

This lesson stops at target selection. The next lesson assumes the router has already chosen one shard or many and asks a different question: how do you run a distributed plan, exchange rows between workers, and merge partial results without letting the slowest shard dominate the whole query?

Key Insights

  1. Shard targeting starts before execution - The router needs explicit ownership information from shard keys, lookup structures, or placement metadata before local indexes can help.
  2. Query shape determines cluster cost - A shard-key predicate gives Harbor Point a local query on one shard; a missing shard key turns the same logical request into cluster-wide work.
  3. Topology changes are routing events, not just storage events - Resharding and failover are safe only when route metadata is versioned, stale caches are expected, and retries are idempotent.