Day 478: Partitioning Fundamentals and Shard Keys

The core idea: A shard key is not just a scaling knob. It decides which requests stay local, which invariants remain cheap to enforce, and where hotspots appear when real traffic arrives.

Today's "Aha!" Moment

Harbor Point left the last lesson with a clear API contract: POST /bookings/confirm must mean cabin C14 on trip TRIP-BCN-JFK-2026-07-14 is definitively assigned, not merely queued for eventual reconciliation. The contract is correct, but the implementation is now hitting a physical limit. Every hold, extend, release, and confirm still runs through one authoritative inventory leader. During a summer-sale spike, the queue behind that leader climbs, tail latency stretches, and the team is tempted to "just shard the table."

The non-obvious part is that partitioning is not merely slicing rows into smaller piles. The shard key is a statement about what belongs together. If Harbor Point shards inventory by customer_id, the "my trips" page becomes convenient, but two customers racing for the same cabin now drive the decisive ownership check across shards. If the team shards by route_date, search for BCN-JFK is easy, but one popular route can pin an entire sale to a single hot shard. The right key is the one that keeps the narrowest must-decide-together state local.

For Harbor Point, that narrow state is the sellable inventory unit itself: one cabin on one trip instance can be held, released, or booked exactly once. That pushes the authoritative store toward an inventory-centric key such as inventory_id or a compound key like (trip_instance_id, cabin_id). Search results and customer history may then need separate indexes or projections. That is the trade-off: make the invariant cheap on purpose, and move convenience reads to explicitly weaker or derived paths.

Why This Matters

Partitioning and replication solve different problems. Replication copies the same partition so the system survives machine loss and can serve more reads. Partitioning spreads ownership across nodes so write throughput, storage, and coordination load can grow past one machine. If you confuse the two, you keep adding replicas to a leader that is already saturated by authoritative writes and wonder why the bottleneck does not move.

This matters because the consistency promises from 048.md become much more expensive once data spans multiple partitions. A linearizable confirmation on one shard is often a normal leader transaction. The same confirmation across two shards becomes a distributed transaction or a compensating workflow with more latency, more failure states, and more operational risk. Shard-key design is therefore not an optimization pass after correctness. It is part of how correctness stays affordable at scale.

The production shift is concrete. Before partitioning, Harbor Point has one authoritative inventory path and one obvious hotspot. After partitioning, the system can spread load, but only if the request router, partition map, and schema all agree on which state lives together. A good shard key turns "scale out" into local decisions on many shards. A bad one turns every hot path into scatter-gather traffic or cross-shard coordination.

Core Walkthrough

Part 1: Partitioning means assigning ownership, not just storage

Harbor Point starts by separating three ideas that teams often blur together:

  1. The shard key: the piece of business data that decides where a record logically belongs.
  2. The logical partition: the bucket that key hashes or maps into.
  3. Physical placement: the shard that currently owns a given bucket, plus the replicas that copy it for resilience.

That distinction matters because the shard key should stay stable while placement changes over time. Harbor Point does not want "shard 7" baked into its schema. It wants a deterministic function from business data to a logical partition, and then a lookup from logical partition to the shard that currently owns it.

import mmh3  # MurmurHash3 bindings; any stable, well-distributed hash works here

NUM_BUCKETS = 1024  # logical partitions, deliberately more than the physical shard count


def inventory_bucket(trip_instance_id: str, cabin_id: str) -> int:
    # Deterministic: the same sellable unit always maps to the same logical bucket.
    logical_key = f"{trip_instance_id}:{cabin_id}"
    return mmh3.hash(logical_key) % NUM_BUCKETS


def route_request(trip_instance_id: str, cabin_id: str) -> str:
    # partition_map is cluster metadata: which shard currently owns each bucket.
    bucket = inventory_bucket(trip_instance_id, cabin_id)
    return partition_map.lookup(bucket)

The key idea is simple: partition first, place later. With 1024 logical buckets spread over, say, 12 physical shards, Harbor Point can move bucket 173 from shard B to shard F later without redefining the key. That is why mature systems keep more logical partitions than physical machines.
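
A minimal sketch of that lookup layer, assuming a simple in-memory table; in a real deployment this map would live in shared cluster metadata so every router agrees on placement, and the class and shard names here are invented for illustration:

class PartitionMap:
    """Hypothetical bucket-to-shard table behind partition_map.lookup() above."""

    def __init__(self, num_buckets: int, shard_ids: list[str]):
        # Initial placement: spread logical buckets round-robin across shards.
        self._owner = {b: shard_ids[b % len(shard_ids)] for b in range(num_buckets)}

    def lookup(self, bucket: int) -> str:
        return self._owner[bucket]

    def move_bucket(self, bucket: int, new_shard: str) -> None:
        # Rebalancing changes placement only; shard keys and bucket numbers never change.
        self._owner[bucket] = new_shard


partition_map = PartitionMap(NUM_BUCKETS, [f"shard-{c}" for c in "ABCDEFGHIJKL"])
partition_map.move_bucket(173, "shard-F")  # a later rebalance; inventory_bucket() output is unchanged

Routers consult this table on every request, which is why the next lesson's rebalancing problem is largely a problem of updating this map safely under live traffic.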

The authoritative rows that protect the booking invariant need to follow the same partitioning boundary. If the inventory row for cabin C14 is keyed by inventory_id, but the hold row is keyed only by hold_id, then confirmation may already require two partitions. Production schemas usually keep the authoritative mutation path organized around the invariant boundary, even if that makes some reads less convenient.
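
One way to express that boundary in the schema (a hypothetical sketch; field names are invented for illustration) is to give the hold row the same routing columns as the inventory row it protects, so both hash into the same bucket:

from dataclasses import dataclass

@dataclass
class InventoryRow:
    trip_instance_id: str   # routing columns: these two fields feed inventory_bucket()
    cabin_id: str
    status: str             # "available" | "held" | "booked"

@dataclass
class HoldRow:
    hold_id: str
    trip_instance_id: str   # same routing columns as the inventory row...
    cabin_id: str           # ...so the hold lands on the same shard as the unit it holds
    expires_at: str

Because both rows carry the same routing columns, checking the hold and flipping the inventory row can happen on one shard.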

Part 2: The shard key encodes which work stays local

Harbor Point evaluates candidate shard keys by asking a stricter question than "Which query is most common?" The real question is "Which state must be decided together on the strongest path?"

Candidate shard key | What stays local | What breaks first
--- | --- | ---
customer_id | Customer profile and trip history reads | Two customers competing for the same cabin drive final confirmation across shards
trip_instance_id | Availability for one trip and route-scoped reports | A flash sale on one trip can overload a single shard
inventory_id = trip_instance_id + cabin_id | Hold, release, and confirm for one sellable unit | Search and customer-history reads need secondary indexes or projections

The last option is usually the safest for Harbor Point's decisive write path because the invariant lives at the inventory-unit boundary. Only one owner may successfully move TRIP-BCN-JFK-2026-07-14:C14 from held to booked. If that state lives on one shard, the strongest operation stays single-shard even while total platform load spreads out.

That does not mean the rest of the product becomes free. Search for "all cabins on BCN-JFK next Friday" no longer maps naturally to the same key if the authoritative store hashes by inventory unit. Harbor Point accepts that and serves browse traffic from a separate projection indexed by route and departure date. The customer timeline is another projection keyed by customer_id. No single shard key makes every read and write local. Good production design chooses the key that protects the invariant and then builds explicit read models for other access patterns.
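
A rough sketch of such a browse projection, assuming an asynchronous feed of inventory change events (the event fields here are invented for illustration):

# Browse reads: availability keyed by (route, departure_date), not by inventory unit.
availability_by_route: dict[tuple[str, str], set[str]] = {}

def apply_inventory_event(event) -> None:
    # Applied asynchronously from the authoritative shards, so this view can lag;
    # it is never consulted to decide whether a cabin can actually be booked.
    key = (event.route, event.departure_date)
    cabins = availability_by_route.setdefault(key, set())
    if event.status == "available":
        cabins.add(event.cabin_id)
    else:  # held or booked
        cabins.discard(event.cabin_id)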

The router and transaction path become straightforward once the key matches the invariant:

def confirm_booking(hold):
    # (trip_instance_id, cabin_id) is the shard key, so the one shard that owns
    # this inventory unit can decide the confirmation in a local transaction.
    shard = route_request(hold.trip_instance_id, hold.cabin_id)
    return shard.transaction(
        # Guard: the unit must still be held by this exact hold, not expired or resold.
        assert_inventory_state(
            inventory_id=hold.inventory_id,
            status="held",
            hold_id=hold.hold_id,
        ),
        # Flip the sellable unit to booked and confirm the hold in the same commit.
        mark_inventory_booked(inventory_id=hold.inventory_id),
        mark_hold_confirmed(
            inventory_id=hold.inventory_id,
            hold_id=hold.hold_id,
        ),
    )

The local transaction is the payoff. Harbor Point still replicates that shard for failover, but it does not need cross-shard coordination merely to decide whether one cabin is sold.

Part 3: Good shard keys trade fan-out for hotspot control in explicit ways

A useful shard key has four properties in practice:

  1. It is present on every authoritative write.
  2. It has enough cardinality to spread load.
  3. It stays stable for the life of the record.
  4. It matches the smallest state that must be updated atomically or under the same coordination boundary.

That still leaves trade-offs. Hashing inventory_id spreads load well, but it destroys natural ordering for range scans such as "all cabins on trip TRIP-BCN-JFK-2026-07-14." Range partitioning by trip_instance_id keeps those scans local, but it makes popular trips or launch-day inventory batches much easier to hotspot. Harbor Point chooses the hash-based inventory key for the authoritative OLTP path and pays the cost of maintaining separate read models because correctness during confirmation matters more than making search local in the same store.
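
The locality trade-off is easy to demonstrate with the bucket function from Part 1: cabins that sit next to each other on the same trip scatter across unrelated buckets, so a whole-trip scan fans out across shards.

# Adjacent cabins on one trip do not share a bucket under hash partitioning.
for cabin in ("C12", "C13", "C14", "C15"):
    print(cabin, "->", inventory_bucket("TRIP-BCN-JFK-2026-07-14", cabin))
# Under range partitioning by trip_instance_id these would all stay on one shard,
# which keeps the scan local but concentrates a hot trip in one place.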

This also changes what the team measures. Once the data is partitioned, average cluster QPS is not enough. Harbor Point watches per-shard write rate, hottest logical bucket concentration, cross-shard transaction count, scatter-gather read count, and partition-size skew. Those metrics reveal whether the chosen shard key still matches the workload or whether the system is drifting toward a future rebalancing problem.
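
A small sketch of one such signal, assuming per-bucket write counters are already collected somewhere; the counter source and function name are illustrative, not a specific monitoring API:

def bucket_skew(writes_per_bucket: dict[int, int]) -> dict[str, float]:
    # How concentrated is write traffic on the hottest logical bucket?
    total = sum(writes_per_bucket.values()) or 1
    hottest = max(writes_per_bucket.values(), default=0)
    mean = total / max(len(writes_per_bucket), 1)
    return {
        "hottest_bucket_share": hottest / total,  # 0.30 means one bucket absorbs 30% of writes
        "hot_to_mean_ratio": hottest / mean,      # far above 1 means the key no longer spreads load
    }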

The lesson is not that one shard key is universally best. It is that shard keys are workload arguments encoded in schema. If the argument is wrong, every scale fix after that becomes more expensive: caches hide fan-out for a while, bigger boxes delay hotspots for a while, and emergency compensations paper over correctness problems for a while. None of those change the ownership boundary that the shard key already defined.

Failure Modes and Misconceptions

  1. Treating replicas as a fix for a write-saturated leader. Replication copies a partition; it does not split authoritative write ownership, so the bottleneck stays put.
  2. Picking the shard key from the most common read query instead of the invariant boundary, which quietly turns the decisive confirm path into cross-shard coordination.
  3. Keys that look high-cardinality but concentrate real traffic, such as route_date during a flash sale, leaving one shard hot while the rest idle.
  4. Expecting a single shard key to make every access pattern local, instead of planning projections and secondary indexes for the other reads from the start.

Connections

Connection 1: 048.md defined the API semantics that the shard key now has to make affordable

Harbor Point decided that final booking confirmation needs a decisive, strong path. This lesson shows how partitioning can either preserve that as a single-shard decision or accidentally turn it into cross-shard coordination.

Connection 2: 050.md will use the logical-partition boundary introduced here

Once the shard key and logical buckets exist, the next operational problem is moving those buckets under live traffic without breaking routing or consistency promises.

Connection 3: Bigtable and Spanner expose the same row-key problem in different forms

Whether a system talks about tablets, splits, or shards, the underlying question is identical: which adjacent or hashed keys should one serving group own, and what workload becomes cheaper or more dangerous because of that choice?

Key Takeaways

  1. Partitioning decides ownership; replication copies that ownership for resilience and read distribution.
  2. A shard key should align with the smallest invariant boundary that must be decided together, not simply the most convenient read query.
  3. No single shard key makes every access pattern local, so production systems often keep authoritative writes on one key and serve other views from indexes or projections.
  4. Many logical partitions on fewer physical shards make future rebalancing possible without redefining the data model.