Rebalancing and Consistent Hashing

The core idea: Rebalancing is safe only when placement changes move a bounded slice of the keyspace and ownership changes are fenced so one shard hands off to another without creating two writers or losing recent history.

Today's "Aha!" Moment

In 05.md, Harbor Point chose a state-plus-hash layout so California traffic would not melt one giant shard. That solved the static placement problem, but it did not solve the dynamic one. By the next week, California bucket CA/3 was three times larger than CA/7, and two new storage nodes were waiting to be added before the morning market open. The tempting implementation was simple: change the router from hash(issuer_id) % 8 to hash(issuer_id) % 10 and let the fleet spread out.

That change would have been a production incident, not a scale-out. Changing the divisor in a direct modulo scheme remaps almost every issuer, which means almost every cache entry, ownership table, and background replica stream becomes stale at once. Harbor Point does not need "more even math" in the abstract. It needs a placement rule where adding one node mostly affects the data that should move to that node, not the entire California issuer population.
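To see the blast radius concretely, the short sketch below (hypothetical issuer IDs, any stable hash would do) counts how many keys change buckets when the divisor moves from 8 to 10. Roughly four out of five issuers remap, which is exactly the fleet-wide invalidation the team is trying to avoid.

import hashlib

def bucket(issuer_id: str, n_buckets: int) -> int:
    # Stable 64-bit hash of the issuer, reduced by the divisor.
    digest = hashlib.md5(issuer_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

issuers = [f"CA-MUNI-{i}" for i in range(10_000)]
moved = sum(1 for k in issuers if bucket(k, 8) != bucket(k, 10))
print(f"{moved / len(issuers):.0%} of issuers changed buckets")  # ~80%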

This is why consistent hashing matters. The point is not that the ring diagram looks elegant. The point is that capacity changes should have local consequences. A node join, failure, or decommission should move a bounded subset of logical shards, and the move should happen through a staged handoff: copy state, catch up on live writes, fence the old owner, switch routing, then drain the source. Once that mental model clicks, rebalancing stops sounding like "copy data somewhere else" and starts sounding like what it really is: a controlled transfer of authority.

That transfer-of-authority idea also sets up the next lesson, 07.md. Rebalancing can preserve single-shard correctness while ownership moves, but it does not define how one business operation commits across multiple shards. That is where distributed transactions enter.

Why This Matters

Harbor Point's reservation API cannot treat rebalancing as background housekeeping. At 09:28, traders are already reserving California paper for the 09:30 open. If the team adds capacity and accidentally remaps most issuers, the cluster pays for that mistake all at once: caches cold-start, follower catch-up spikes, and the risk dashboard begins timing out on exactly the data the business is watching most closely. A "balanced" result on paper is useless if the path to get there destabilizes the live system.

Safe rebalancing therefore has two concrete jobs. First, it must limit how much data changes owner when topology changes. Second, it must change ownership without ambiguity. For one issuer bucket, Harbor Point needs to know which replica group is authoritative right now, which one is still catching up, and what happens when a router still holds stale placement metadata. Those are routing and fencing questions, not just copy-speed questions.

This lesson follows naturally from sharding strategy. 05.md decided which workloads deserved locality. This lesson is about keeping that layout operable as load drifts, nodes fail, and new capacity arrives. The design trade-off is explicit: consistent hashing and virtual buckets reduce movement, but they add metadata, background streaming, and more control-plane logic to the storage layer.

Learning Objectives

By the end of this session, you will be able to:

  1. Explain why direct modulo hashing makes rebalancing expensive - Show why a membership change remaps most keys and how consistent hashing or stable hash slots bound the blast radius.
  2. Trace a shard move from planning to cutover - Follow snapshot transfer, catch-up replication, routing-epoch changes, and source-drain behavior for one bucket.
  3. Evaluate the real trade-offs of consistent hashing in production - Reason about virtual nodes, hotspot limits, control-plane complexity, and why bounded movement is not the same as automatic balance.

Core Concepts Explained

Concept 1: Consistent hashing exists to keep topology changes local

Harbor Point's California partition currently uses eight logical buckets. Bucket CA/3 lives on replica group rg-ca-b, bucket CA/4 lives on rg-ca-c, and routers compute the bucket from the issuer identifier before sending the request onward. If the team directly uses hash(issuer_id) % 8, a later change to % 10 does not merely move "some" issuers to the two new nodes. It changes the bucket identity of almost every issuer because the divisor defines every bucket boundary.

That is the failure mode consistent hashing is trying to avoid. In the textbook ring form, both nodes and keys are hashed onto the same circular space, and a key belongs to the first node clockwise from its hash. When Harbor Point adds node node-e, only the interval between node-e and its predecessor changes owner. Keys outside that interval stay where they were. The operational win is not mystical; it is simply bounded movement under membership change.

Direct modulo resize

hash(CA-MUNI-77) % 8  -> bucket 3
hash(CA-MUNI-77) % 10 -> bucket 7   # divisor changed, so the mapping jumps

Consistent-hash ring

... token 120 node-b ... token 170 node-e(new) ... token 240 node-c ...
keys in (120, 170] move to node-e
all other keys stay on their previous owners

In practice, Harbor Point would not hash every issuer directly to a small set of physical nodes. Real systems usually add indirection: many virtual tokens, hash slots, or fixed logical buckets mapped onto machines. That is how Redis Cluster can move a subset of hash slots, and how Dynamo-style systems use virtual nodes to avoid one huge interval per server. The principle is the same in every case: keep the key-to-logical-partition mapping stable enough that adding or removing capacity only relocates a fraction of the workload.
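A minimal sketch of that lookup, with hypothetical node names and a fixed virtual-token count, might look like the following. Real systems layer replication, weights, and failure-domain rules on top of this core.

import bisect
import hashlib

def token(label: str) -> int:
    # Hash a node label or key onto a 64-bit circular token space.
    return int.from_bytes(hashlib.md5(label.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes, vnodes=64):
        # Each node contributes many virtual tokens to smooth its share.
        self.tokens = sorted(
            (token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.points = [t for t, _ in self.tokens]

    def owner(self, key: str) -> str:
        # First token at or clockwise after the key's hash, wrapping
        # past the highest token back to the lowest.
        i = bisect.bisect_left(self.points, token(key)) % len(self.points)
        return self.tokens[i][1]

before = Ring(["node-a", "node-b", "node-c", "node-d"])
after = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
moved = sum(
    1 for i in range(10_000)
    if before.owner(f"CA-MUNI-{i}") != after.owner(f"CA-MUNI-{i}")
)
# moved is roughly node-e's one-fifth share of the keys, nothing more.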

The trade-off is that bounded movement does not guarantee perfect balance. With too few tokens, one node may own a lopsided share of the ring. With heterogeneous hardware, equal token counts may still be wrong. Consistent hashing solves the "everything moves" problem much better than it solves the "every node is equally busy" problem, which is why production systems combine it with virtual nodes, weights, and separate hotspot handling.

Concept 2: Rebalancing is a staged handoff of state and authority

Suppose Harbor Point decides to move logical bucket CA/3 from replica group rg-ca-b to a new group rg-ca-e. The safe sequence is not "update the routing table and start copying." If routers switch first, fresh writes land on a destination that does not yet have the bucket's prior history. If copying finishes first but the source keeps accepting writes with no fencing, the two groups can diverge. Rebalancing works only when state transfer and authority transfer are coordinated deliberately.

The usual flow has four distinct phases, plus a final drain. First, the allocator chooses the bucket based on size, QPS, disk pressure, or hotspot concentration. Second, the destination ingests a snapshot or SSTable copy representing bucket CA/3 at a specific log position. Third, the source streams every write after that position so the destination can catch up while traffic stays live. Fourth, once the lag is near zero, Harbor Point advances the placement epoch and gives the destination the right to serve the bucket. Finally, the source drains: it forwards or rejects stragglers, then drops its copy.

For one move, the timeline looks like this:

1. plan move      CA/3 : rg-ca-b -> rg-ca-e
2. snapshot       rg-ca-e loads CA/3 at log position 9F/40
3. catch up       rg-ca-b streams updates after 9F/40
4. cut over       epoch 18 says owner=rg-ca-e
5. drain source   rg-ca-b forwards or rejects stale writes, then drops data
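In code, a move coordinator would enforce that ordering. This is a hedged sketch with hypothetical snapshot, streaming, and placement APIs; the sequencing, not the method names, is the point.

MAX_LAG = 100  # acceptable catch-up gap before cutover, in log records

def move_bucket(bucket, source, dest, placement):
    # Phase 2: destination ingests a snapshot pinned at a log position.
    pos = source.snapshot(bucket, to=dest)
    # Phase 3: source streams post-snapshot writes until the gap is small.
    while source.log_position(bucket) - pos > MAX_LAG:
        pos = source.stream_updates(bucket, to=dest, after=pos)
    # Phase 4: advancing the epoch fences the old owner before routing flips.
    placement.advance_epoch(bucket, new_owner=dest)
    # Phase 5: source forwards or rejects stragglers, then drops its copy.
    source.drain(bucket, forward_to=dest)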

The fencing step is the part that protects correctness. Harbor Point's routers attach the placement epoch they believe is current. If a request reaches rg-ca-b after epoch 18 made rg-ca-e the owner, the old group must not quietly accept it. It should forward the request or reject it with enough metadata for the router to refresh. That turns stale routing into a retriable control-plane problem instead of a data-corruption bug.

def handle_write(request, local_epoch, owner):
    # Fence stale routers: a request carrying an older placement epoch
    # is rejected with the current epoch and owner so the router can
    # refresh its metadata and retry, instead of writing through a
    # stale ownership view.
    if request.placement_epoch < local_epoch:
        raise RetryWithNewOwner(epoch=local_epoch, owner=owner)
    apply_write(request)

This is why rebalancing metrics need more than "bytes copied." Harbor Point should watch snapshot age, catch-up lag in log positions or milliseconds, forwarded-write rate, and how long the bucket spends in the moving state. The move finishes when the destination is authoritative and current, not when the first bulk copy ends. The trade-off is straightforward: aggressive parallel moves shorten imbalance but consume network, disk, and compaction budget that the live workload also needs.
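One way to make those health signals actionable is to gate the move's state transitions on them. The thresholds below are illustrative, not recommendations:

from dataclasses import dataclass

@dataclass
class MoveMetrics:
    snapshot_age_s: float          # how stale the bulk copy is
    catch_up_lag_ms: float         # how far the destination trails the log
    forwarded_writes_per_s: float  # stale-routed writes hitting the source

def ready_to_cut_over(m: MoveMetrics) -> bool:
    # Cut over when the destination has nearly closed the log gap,
    # not when the snapshot finished loading.
    return m.catch_up_lag_ms < 100

def safe_to_drop_source(m: MoveMetrics) -> bool:
    # The drain ends only when no router still writes to the old owner.
    return m.forwarded_writes_per_s == 0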

Concept 3: Production systems usually rebalance logical shards, not raw keys, and consistent hashing does not remove hotspot work

Harbor Point's best production design is therefore not "hash every issuer directly around a ring." It is "keep a stable set of logical buckets, then place those buckets with a bounded-movement policy." For example, California might have 256 virtual buckets even if only 12 replica groups exist. Rebalancing then means moving selected buckets, not rewriting the hash rule for every issuer. That keeps movement granular enough to smooth load without forcing cluster-wide remapping.
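As a sketch, the stable hash rule and the mutable placement table are two separate pieces, and only the table changes during a rebalance (the bucket count and group names here are illustrative):

N_BUCKETS = 256  # fixed for the life of the keyspace

def bucket_of(issuer_id: str) -> int:
    # token() as in the ring sketch above: a stable hash of the issuer.
    return token(issuer_id) % N_BUCKETS

# Placement is a mutable table, separate from the hash rule.
bucket_to_group = {3: "rg-ca-b", 4: "rg-ca-c"}  # ... one entry per bucket

# Moving CA/3 edits one entry; no issuer's bucket identity changes.
bucket_to_group[3] = "rg-ca-e"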

This indirection is also how the team handles asymmetric capacity. A newer node with more SSD bandwidth can own more virtual buckets than an older host. A rack-aware policy can require that bucket replicas land in distinct failure domains. A node drain can move buckets away gradually rather than as one giant operation. The consistent-hashing idea still matters here, but it has been adapted into a control plane that respects operational constraints the textbook ring does not model.
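A hedged sketch of that policy layer, with made-up rack names and weights, might look like this:

# (failure domain, capacity weight) per node; values are illustrative.
NODES = {
    "node-a": ("rack-1", 1.0),
    "node-b": ("rack-1", 1.0),
    "node-c": ("rack-2", 1.0),
    "node-e": ("rack-3", 2.0),  # newer host with more SSD bandwidth
}

def vnode_count(node: str, base: int = 64) -> int:
    # Weighted placement: node-e owns twice as many virtual buckets.
    return int(base * NODES[node][1])

def place_replicas(candidates, n_replicas: int = 3):
    # Greedy rack-aware pick: highest weight first, distinct domains only.
    chosen, racks = [], set()
    for node in sorted(candidates, key=lambda n: -NODES[n][1]):
        rack, _ = NODES[node]
        if rack not in racks:
            chosen.append(node)
            racks.add(rack)
        if len(chosen) == n_replicas:
            break
    return chosen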

What consistent hashing does not do is split a single hot issuer automatically. If CA-MUNI-77 alone dominates traffic because every desk is trading the same bond family, keeping movement bounded will not make that one key cooler. Harbor Point may need to split the logical bucket around that issuer, introduce a more granular shard key, or create a specialized aggregation path for the hotspot. That is the same limitation from 05.md in a new form: distribution across many keys is different from relief for one extremely hot key.

The final trade-off is conceptual clarity versus operational flexibility. A direct ring is elegant and easy to explain. A production rebalancer with virtual buckets, weighted placement, epochs, and background catch-up is messier, but it lets Harbor Point scale without treating every node join as a fleet-wide cache invalidation event. Once the system reaches that point, the next hard question is no longer "who owns this bucket?" It is "what if one business operation must update two different buckets together?" That is the bridge to 07.md.

Troubleshooting

Issue: Harbor Point adds two nodes, and suddenly most California issuer caches miss even though only a small amount of capacity was added.

Why it happens / is confusing: The router changed a direct modulo rule such as % 8 to % 10, which remapped far more issuers than the team expected.

Clarification / Fix: Introduce stable logical buckets or a consistent-hash placement layer so topology changes move only the buckets assigned to the new capacity.

Issue: A rebalance spends hours in "catching up" and never reaches cutover.

Why it happens / is confusing: The destination can copy the snapshot, but live write traffic for that bucket is arriving faster than the catch-up stream can drain.

Clarification / Fix: Move a smaller bucket, throttle foreground traffic if the product allows it, or split the hotspot before moving it. A rebalance cannot complete if the destination never closes the log gap.

Issue: Some clients keep writing to the old owner after cutover and see intermittent retry storms.

Why it happens / is confusing: Routers are holding stale placement metadata, so they continue sending requests with an old epoch to the source group.

Clarification / Fix: Fence writes with placement epochs, return the new owner explicitly, and keep a short forwarding window instead of accepting ambiguous writes on both sides.

Issue: One node is still overloaded even after consistent hashing was introduced.

Why it happens / is confusing: Bounded movement is not the same as perfect balance, and a single hot issuer or too-few virtual buckets can still concentrate load.

Clarification / Fix: Increase virtual bucket granularity, use weighted placement for stronger hardware, and treat true hot keys as an application or data-model problem rather than a hashing problem.

Advanced Connections

Connection 1: 05.md chooses the locality shape; this lesson makes that shape evolvable over time

The previous lesson asked whether Harbor Point wanted range locality, hash spread, or a hybrid. Rebalancing adds the time dimension to that decision. A shard layout that looks good only on the day it was created is incomplete. Consistent hashing and stable buckets are the mechanisms that let the chosen layout survive growth, skew, and hardware replacement without rewriting the entire keyspace map.

Connection 2: 07.md starts where rebalancing stops

Rebalancing preserves the invariant that one logical shard has one authoritative owner at a time. It does not make a two-shard reservation update atomic. Once Harbor Point needs one workflow to commit across CA/3 and GLOBAL-EXPOSURE, the system leaves the world of ownership transfer and enters the world of distributed transaction coordination.

Key Insights

  1. Bounded movement is the real value of consistent hashing - Capacity changes should relocate only a subset of logical shards, not force a near-total remap of the keyspace.
  2. Rebalancing is an ownership protocol, not a copy job - Snapshot transfer matters, but correctness depends on catch-up replication, fencing, and a clear cutover epoch.
  3. Consistent hashing reduces remapping, not all imbalance - Virtual buckets and weighted placement help, but hot keys and operational constraints still require separate design work.