Partitioning Strategies and Shard Keys
LESSON
Partitioning Strategies and Shard Keys
The core idea: A shard key is a workload argument encoded in data layout: it decides which operations stay local, which invariants remain cheap, and where real traffic will concentrate.
Core Insight
Harbor Point has accepted the lesson from sharding: the reservation system needs multiple authority domains. The next question is sharper. Which key should decide those domains?
The tempting answer is "pick the field we query most." For Harbor Point, that might be desk_id, because traders often view their own reservations, or issuer_id, because risk teams inspect exposure by issuer. But the strongest path is not a dashboard. It is the decision that one scarce allocation bucket can move from held to confirmed exactly once. If two desks race for the same bucket, the system wants one shard to decide that invariant, not a distributed workflow by accident.
That is the non-obvious role of a shard key. It is not just a spreading function. It says which facts belong under the same owner. A key that makes convenient reads local can make decisive writes cross-shard. A key that keeps the invariant local may force search, reporting, and history screens into derived indexes or projections.
For Harbor Point, the strongest candidate is an allocation-centric key such as allocation_bucket_id, derived from the issuer, instrument, settlement window, and reservation bucket. That does not make every query perfect. It makes the most important consistency boundary explicit.
Logical Partitions Before Physical Shards
Harbor Point separates three concepts that teams often blur together:
- A shard key is the business value used to choose an authority boundary.
- A logical partition is a stable slice of the keyspace, such as bucket
173out of4096. - A physical shard is the replica group currently serving one or more logical partitions.
That distinction lets the data model stay stable while placement changes. Harbor Point does not want application code to say "send this issuer to machine 7." It wants application code to compute a stable logical partition, then let the partition map decide which physical shard owns that partition today.
def allocation_partition(issuer_id: str, instrument_id: str, window: str) -> int:
key = f"{issuer_id}:{instrument_id}:{window}"
return murmur3(key) % 4096
def route_confirmation(command):
partition = allocation_partition(
command.issuer_id,
command.instrument_id,
command.settlement_window,
)
return partition_map.owner_for(partition)
With 4096 logical partitions on 12 physical shard groups, Harbor Point can later move partition 173 from shard B to shard F without redefining the application key. That is why mature sharded systems usually keep more logical partitions than machines. The key defines ownership; placement is allowed to evolve.
This also keeps the next lesson possible. Rebalancing can move logical partitions under live traffic only if those partitions already exist as stable units of ownership.
Choosing the Boundary
Harbor Point evaluates candidate keys by asking: "Which state must be decided together on the strongest path?"
| Candidate key | What becomes local | What breaks first |
|---|---|---|
desk_id |
Trader dashboards and desk-scoped history | Two desks competing for the same allocation cross shards |
issuer_id |
Issuer exposure checks and issuer-scoped reports | A hot issuer can overload one shard during volatile markets |
allocation_bucket_id |
Hold, release, extend, and confirm for one scarce allocation bucket | Desk history and issuer-wide reports need derived views |
The last option is usually the safest for the decisive write path. If bucket B-173 is the unit that can be reserved exactly once, then the rows that enforce that invariant should live with the owner of B-173. The confirmation path can stay single-shard:
def confirm_allocation(command):
shard = route_confirmation(command)
return shard.transaction(
assert_bucket_state(
bucket_id=command.bucket_id,
status="held",
hold_id=command.hold_id,
),
mark_bucket_confirmed(command.bucket_id),
append_audit_event(command.hold_id, "confirmed"),
)
That local transaction is the payoff. Harbor Point still replicates the shard for durability and failover, but it does not need cross-shard coordination merely to decide whether one allocation bucket is available.
The cost is that not every read is local. A dashboard that asks "show every reservation for desk 12" may need a desk-index projection. A risk report that asks "show issuer-wide exposure" may need an issuer-index projection. Those read models are not failures. They are the price of keeping the decisive write invariant cheap.
Strategy Trade-offs
Different partitioning strategies emphasize different risks.
Strategy Helps with Hurts with
----------------------- --------------------------------- -----------------------------------
hash by stable id even spread, simple ownership range scans and locality
range by business field locality, ordered scans hot ranges and monotonic writes
directory lookup custom placement, tenant control metadata complexity and routing bugs
compound key matching a domain invariant schema/API discipline
Hashing allocation_bucket_id spreads writes across many partitions, which is useful when no single issuer dominates. It also destroys natural range locality. If the risk team wants all buckets for CA-MUNI in one scan, the system may need an issuer projection or a scatter-gather query with a clear latency budget.
Range partitioning by issuer_id keeps issuer reports local, but it can create a hot shard exactly when the market focuses on one issuer. Splitting by issuer_id + instrument_id may help, but the design still has to ask whether the key has enough cardinality and whether the hot path carries the key at request time.
A directory strategy can place specific tenants, issuers, or buckets wherever the control plane chooses. That is flexible, especially for large customers or uneven workloads, but it turns routing metadata into a larger operational surface. The directory must be cached, versioned, monitored, and kept consistent during rebalancing.
No strategy wins universally. The shard key should be chosen with the system's strongest operations, hottest traffic, and most painful migrations in view.
Evaluating a Shard Key
A practical shard-key review for Harbor Point asks five questions:
- Is the key present at the start of every authoritative write?
- Does the key keep the strongest invariant inside one shard?
- Does the key have enough cardinality to spread real production traffic?
- Is the key stable, or would normal business edits move ownership?
- What read models, indexes, or projections are required because of this choice?
The answers should become metrics after launch. Average cluster QPS is not enough. Harbor Point watches per-shard write rate, hottest logical partition concentration, cross-shard transaction count, scatter-gather read count, routing-cache misses, and partition-size skew. Those metrics show whether the chosen key still matches the workload.
The lesson is not that allocation_bucket_id is magic. It is that the primary shard key should encode a deliberate claim: "This is the smallest state we need to decide together." Everything that does not fit that claim should be handled explicitly instead of smuggled into the strongest path.
Failure Modes
Choosing the friendliest read query as the primary key. This often makes dashboards pleasant while pushing critical writes across shards. Start with the strongest invariant, then build projections for convenience reads.
Using mutable business fields as shard keys. Region, desk ownership, priority tier, or routing label may be meaningful, but if the value changes, ownership changes. Use stable identifiers for primary ownership and expose mutable fields through indexes or derived views.
Assuming hashing eliminates hotspots. Hashing smooths load only when the input key has enough cardinality and traffic is not dominated by a tiny set of keys. Monitor hot logical partitions, not just physical shard averages.
Baking physical placement into the data model. If records know they live on shard_7, rebalancing becomes a schema problem. Keep logical partitions stable and map them to physical shards through versioned metadata.
Letting derived views pretend to be authoritative. Desk and issuer projections are useful, but the owner of the allocation bucket should remain the source of truth for hold, release, extend, and confirm decisions.
Connections
- The previous lesson defined sharding as splitting authority; this lesson chooses the key that defines those authority boundaries.
- The next lesson uses logical partitions as the unit that can be moved during live rebalancing.
- Bigtable, DynamoDB, Spanner, Vitess, and similar systems expose the same key-design pressure under different names: row keys, partition keys, primary keys, keyspaces, tablets, chunks, or ranges.
Resources
- [BOOK] Designing Data-Intensive Applications
- Focus: Revisit the partitioning and secondary-index chapters with attention to how access patterns and invariants drive primary-key choice.
- [PAPER] Bigtable: A Distributed Storage System for Structured Data
- Focus: Pay attention to row-key design and tablet hotspots; the same locality-versus-distribution trade-off appears in modern shard-key decisions.
- [DOC] Amazon DynamoDB: Best practices for choosing a partition key
- Focus: Use it as a concrete checklist for cardinality, hot partitions, and request concentration.
- [DOC] Cloud Spanner schema design best practices
- Focus: Look at primary-key ordering and hotspot guidance to see how key layout affects both transaction locality and write concentration.
Key Takeaways
- A shard key decides which work stays local, not merely how rows are distributed.
- The primary shard key should usually match the smallest invariant boundary that must be decided together.
- Hash, range, directory, and compound-key strategies all trade locality, spread, routing complexity, and hotspot risk differently.
- Many logical partitions on fewer physical shards make future rebalancing possible without redefining the data model.