Rebalancing Partitions Under Live Traffic
LESSON
Rebalancing Partitions Under Live Traffic
The core idea: Live rebalancing is an ownership handoff problem. Safe systems copy a partition, stream the delta, and cut traffic over with versioned routing so only one shard is authoritative for writes at any moment.
Core Insight
Harbor Point chose a shard key that makes scarce allocation decisions local. It also kept many more logical buckets than physical shard groups, so placement can change after launch. That preparation matters when bucket 173, which owns a busy slice of CA-MUNI allocation reservations, starts driving four times the write rate of neighboring buckets. Shard B is saturating while shard F is mostly idle, but desks are still extending holds and confirming reservations in real time.
The tempting plan is naive: copy bucket 173 to shard F, switch the routing table, and delete the old copy. That fails as soon as a trader extends hold H-8821 halfway through the copy. If that write lands only on B, the target comes online stale. If Harbor Point lets both shards accept writes "just for safety," it now has two places that can appear authoritative for the same allocation bucket.
The useful mental shift is to treat rebalancing as a controlled ownership transfer. The source shard stays authoritative while the target catches up, and the cutover is gated by an explicit routing version plus a concrete replay marker. The hard part is not moving bytes. It is preserving one authority boundary while traffic keeps flowing.
Design Pressure
Partitioning only solves scale if the cluster can keep adapting after launch. Traffic patterns move. One issuer becomes the market focus, one desk takes on a larger book, a node pool is retired, or a region absorbs failover traffic from somewhere else. If every load shift requires downtime or a full reshard, the system is horizontally scalable on paper and operationally rigid in production.
The API contract still applies during maintenance. A successful POST /reservations/confirm must still mean one allocation is definitively assigned, even while ownership of the underlying bucket is moving. The trade-off is direct: live rebalancing reduces the need for maintenance windows, but it raises the bar for routing metadata, replay logs, cutover discipline, and migration observability.
The production consequence is not abstract. Capacity expansion, hotspot relief, hardware evacuation, and zone maintenance all depend on moving partitions safely. Teams that understand the mechanism can add or drain nodes deliberately. Teams that treat rebalancing as "background copy plus hope" eventually discover that balanced storage and safe write ownership are different problems.
Migration Mechanism
Start from one bucket that is hot enough to matter
Harbor Point designed the cluster with many more logical buckets than physical shards precisely so it could move a narrow slice of the keyspace under load. Bucket 173 currently maps to shard B in partition-map version 104:
map v104
bucket 173 -> shard B
bucket 174 -> shard F
...
During the market burst, most of shard B's pressure comes from bucket 173. That makes the decision small and surgical: move that one bucket to shard F instead of revisiting the primary key or splitting the whole reservation catalog. The shard key from the previous lesson stays the same; only placement changes.
The online migration still has to answer three questions precisely:
- Which shard is authoritative for new writes right now?
- How does the target catch up while the source is still serving traffic?
- How do routers and shards agree on the instant ownership changes?
If any of those answers are fuzzy, rebalancing turns into split-brain risk disguised as an operations task.
Separate snapshot, catch-up, and cutover
For Harbor Point's authoritative reservation path, the source shard B remains the only writer during migration. The target shard F receives a baseline snapshot plus a stream of later changes. This is safer than dual authoritative writes because the platform never has to reconcile two live decision points for the same allocation.
Phase 1: Snapshot
router -> B (owner in map v104)
B ---- bulk copy bucket 173 ----> F
Phase 2: Catch-up
router -> B
B ---- change stream seq 918443..918499 ----> F
Phase 3: Cutover
pause new writes for bucket 173
wait until F has replayed through seq 918499
publish map v105: bucket 173 -> F
resume writes through F
The routing layer carries the partition-map version it used when it sent a request. Each shard checks that version before it accepts a write:
def apply_reservation_op(bucket_id, map_version, op):
if map_version != partition_map.current_version(bucket_id):
raise RetryWithFreshMap(bucket_id)
if not partition_map.is_owner(bucket_id, this_shard_id):
raise WrongOwner(bucket_id)
append_to_partition_log(bucket_id, op)
apply_to_local_state(bucket_id, op)
That version check is not defensive polish. It is the mechanism that prevents stale routers from writing to the old owner after cutover. If a router still believes bucket 173 -> shard B after map v105 is published, shard B rejects the write and forces a retry with fresh metadata. Harbor Point would rather surface a short retry than accept a decisive reservation mutation on the wrong shard.
The catch-up stream is what makes the cutover meaningful. Suppose a trader extends hold H-8821 after the snapshot started. Shard B assigns that mutation log sequence 918487, applies it locally, and streams it to F. At cutover time, Harbor Point does not ask "Did the copy finish?" It asks "Has the target replayed every mutation through the barrier sequence?" Only when the answer is yes does the control plane publish map v105.
For high-value writes, Harbor Point accepts a tiny write pause on one bucket during cutover. That brief barrier keeps the ownership handoff simple: no concurrent writers, no ambiguous order, no post-hoc merge logic. The trade-off is explicit. A short and scoped availability dip is cheaper than turning a definitive reservation path into a distributed reconciliation problem.
After the Ownership Flip
Once the new map is live, the move is not over. In-flight requests may still arrive at B with stale routing metadata for a few seconds. Those must be rejected or forwarded consistently. Read paths also need a policy. Harbor Point keeps advisory search reads flexible, but it does not let POST /reservations/confirm take a weaker path just because a migration is running. The consistency contract still governs what the target must know before it serves the critical endpoint.
Harbor Point also delays deleting the source copy. First it checks parity: row counts, highest replicated log sequence, and a sample of sensitive records such as active holds and newly confirmed reservations. Only after the drain period is clean does shard B garbage-collect bucket 173. Fast deletion feels efficient, but delayed cleanup preserves an audit path and a recovery option when migration bugs surface immediately after cutover.
This is why live rebalancing needs dedicated observability rather than a generic "migration succeeded" metric. Harbor Point tracks source-to-target catch-up lag, stale-route reject rate, cutover barrier duration, per-bucket error rate, and latency before and after the move. Those signals reveal whether the migration is merely slow or actually unsafe. They also tell the team when the deeper fix is a different partition layout rather than repeatedly moving the same hotspot around the cluster.
The broader lesson is that rebalancing is the operational continuation of shard-key design. Many logical buckets made motion possible. Versioned routing, delta replay, and disciplined cutover make that motion safe under production traffic.
Operational Failure Modes
-
Issue: "Once the bulk copy finishes, the partition has moved."
- Why it is tempting: The snapshot is the visibly expensive step, so teams treat it as the whole migration.
- Corrective mental model: The snapshot is only a starting point. Writes that landed after the snapshot must be replayed before the target can become authoritative.
- Operational fix: Gate cutover on a concrete catch-up marker such as a replicated log sequence, not on copy completion alone.
-
Issue: "Dual-writing to source and target is safer because both copies stay current."
- Why it is tempting: Mirroring writes sounds redundant and therefore safe.
- Corrective mental model: For authoritative reservations, dual writes create two live decision points and make ordering bugs harder to reason about.
- Operational fix: Prefer single-owner migrations with snapshot-plus-delta replay, and reserve dual-write patterns for derived or explicitly reconcilable data.
-
Issue: "Routers will refresh quickly enough, so owner-version checks are optional."
- Why it is tempting: Metadata propagation usually looks fast in staging.
- Corrective mental model: Even short-lived stale routes can send decisive writes to the wrong shard after cutover.
- Operational fix: Attach partition-map versions to routed requests and make old owners reject stale writes predictably.
-
Issue: "Delete the source copy immediately after cutover so nobody can route back."
- Why it is tempting: Fast cleanup feels like a clean finish.
- Corrective mental model: Immediate deletion removes the easiest rollback and audit path right when migration risk is highest.
- Operational fix: Keep the old copy read-only through a drain window, verify parity, then remove it after traffic and control-plane state stabilize.
Connections
- Partitioning Strategies and Shard Keys created the precondition for safe motion: logical partitions make it possible to move one bucket instead of redesigning the whole keyspace.
- Consistency Contracts and API Semantics explains why cutover rules differ by endpoint. An advisory query can tolerate a softer handoff than
POST /reservations/confirm. - Secondary Indexes Across Shards extends the same theme to derived lookup structures. Moving base rows is easier when the system also knows how index entries, pointers, and home-shard metadata change.
Resources
- [BOOK] Designing Data-Intensive Applications
- Focus: Revisit the partitioning and data-migration chapters with attention to how ownership boundaries and online change streams keep systems available during topology changes.
- [DOC] Vitess Resharding
- Focus: Study how Vitess stages copy, catch-up, traffic switching, and rollback rather than treating resharding as a single bulk move.
- [DOC] CockroachDB Replication Layer
- Focus: Pay attention to the snapshot-plus-log-replay description of replica rebalancing; it is a concrete example of the same catch-up-before-cutover pattern.
- [DOC] MongoDB Sharded Cluster Balancer
- Focus: Use the balancer guidance to see the operational costs, throttling, and cleanup behavior of live range movement.
Key Takeaways
- Live rebalancing is an ownership-transfer problem, not merely a file-copy problem.
- Safe migrations keep one authoritative writer, stream deltas after the snapshot, and cut over with versioned routing so stale owners reject writes.
- A brief, scoped write barrier at cutover is often a better trade-off than the complexity of dual authoritative writers.
- The same partitioning design that enables scale must also enable safe motion, or the cluster becomes operationally rigid as traffic shifts.