Rebalancing Partitions under Live Traffic

The core idea: Live rebalancing is an ownership handoff problem. Safe systems copy a partition, stream the delta, and cut traffic over with versioned routing so only one shard is authoritative for writes at any moment.

Today's "Aha!" Moment

After 049.md, Harbor Point finally has room to move. The reservation platform keeps 1024 logical buckets spread across 12 shard replica groups, which means the team can shift one hot slice of inventory without redesigning the schema. The first hard case arrives during a summer-sale burst: bucket 173, which owns a dense slice of BCN-JFK cabin inventory, is driving four times the write rate of neighboring buckets. Shard B is saturating while shard F is mostly idle, but agents are still extending holds and confirming bookings in real time.

The tempting plan is naive: copy bucket 173 to shard F, switch the routing table, and delete the old copy. That fails as soon as agent Marta extends hold H-8821 halfway through the copy. If that write lands only on B, the target comes online stale. If Harbor Point lets both shards accept writes "just for safety," it now has two places that can appear authoritative for the same inventory. The hard part of rebalancing is therefore not moving bytes. It is preserving one ownership boundary while traffic keeps flowing.

The useful mental shift is to treat rebalancing as a controlled partition leadership transfer. The source shard stays authoritative while the target catches up, and the cutover is gated by an explicit routing version plus a concrete catch-up marker. That is what turns rebalancing from a risky operator maneuver into a repeatable production mechanism.

Why This Matters

Partitioning only solves scale if the cluster can keep adapting after launch. Traffic patterns move. One route goes viral, a tenant lands a promotion, a node pool is retired, or a region absorbs failover traffic from somewhere else. If every load shift requires downtime or a full reshard, the system is horizontally scalable on paper and operationally rigid in production.

This lesson matters because the API contract from 048.md still applies during maintenance. A successful POST /bookings/confirm must still mean one cabin is definitively assigned, even while ownership of the underlying bucket is moving. The trade-off is direct: live rebalancing reduces the need for maintenance windows, but it raises the bar for routing metadata, replay logs, cutover discipline, and migration observability.

The production consequence is not abstract. Capacity expansion, hotspot relief, hardware evacuation, and zone maintenance all depend on moving partitions safely. Teams that understand the mechanism can add or drain nodes deliberately. Teams that treat rebalancing as "background copy plus hope" eventually discover that balanced storage and safe write ownership are different problems.

Core Walkthrough

Part 1: Start from one bucket that is hot enough to matter

Harbor Point designed the cluster with many more logical buckets than physical shards precisely so it could move a narrow slice of the keyspace under load. Bucket 173 currently maps to shard B in partition-map version 104:

map v104
bucket 173 -> shard B
bucket 174 -> shard F
...

During the sale window, most of shard B's pressure comes from bucket 173. That makes the decision small and surgical: move that one bucket to shard F instead of revisiting the primary key or splitting the whole route catalog. The data model from 049.md stays the same; only placement changes.
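
A minimal sketch makes the versioned map concrete. The class and method names below (PartitionMap, move_bucket, and so on) are illustrative assumptions, not Harbor Point's actual control plane, but the shape is the point: many buckets, few shards, and a version bump on every placement change.

class PartitionMap:
    """Versioned bucket-to-shard map (illustrative sketch, assumed names)."""

    def __init__(self, num_buckets, assignments, version):
        self.num_buckets = num_buckets
        self.assignments = dict(assignments)  # bucket_id -> shard_id
        self.version = version

    def owner(self, bucket_id):
        return self.assignments[bucket_id]

    def move_bucket(self, bucket_id, target_shard):
        # Placement changes; the key-to-bucket mapping does not.
        new_assignments = dict(self.assignments)
        new_assignments[bucket_id] = target_shard
        return PartitionMap(self.num_buckets, new_assignments, self.version + 1)

v104 = PartitionMap(1024, {173: "B", 174: "F"}, version=104)
v105 = v104.move_bucket(173, "F")  # publishes bucket 173 -> shard F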

The online migration still has to answer three questions precisely:

  1. Which shard is authoritative for new writes right now?
  2. How does the target catch up while the source is still serving traffic?
  3. How do routers and shards agree on the exact moment ownership changes?

If any of those answers are fuzzy, rebalancing turns into split-brain risk disguised as an operations task.

Part 2: Safe migration separates snapshot, catch-up, and cutover

For Harbor Point's authoritative inventory path, the source shard B remains the only writer during migration. The target shard F receives a baseline snapshot plus a stream of later changes. This is safer than dual authoritative writes because the platform never has to reconcile two live decision points for the same cabin.

Phase 1: Snapshot
router -> B (owner in map v104)
B ---- bulk copy bucket 173 ----> F

Phase 2: Catch-up
router -> B
B ---- change stream seq 918443..918499 ----> F

Phase 3: Cutover
pause new writes for bucket 173
wait until F has replayed through seq 918499
publish map v105: bucket 173 -> F
resume writes through F
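
Stitched together, the three phases form a short control-plane sequence. The sketch below assumes helper interfaces (bulk_copy, stream_changes, pause_writes, publish_map, resume_writes) that a real control plane would have to provide:

def migrate_bucket(bucket_id, source, target, control_plane):
    # Phase 1: snapshot while the source keeps serving all traffic.
    snapshot_seq = source.bulk_copy(bucket_id, target)

    # Phase 2: stream every mutation logged after the snapshot point.
    source.stream_changes(bucket_id, target, from_seq=snapshot_seq)

    # Phase 3: brief write barrier scoped to this one bucket.
    barrier_seq = control_plane.pause_writes(bucket_id)
    target.wait_until_replayed(bucket_id, through_seq=barrier_seq)

    # Only now is it safe to flip ownership and resume writes.
    control_plane.publish_map(bucket_id, new_owner=target.shard_id)
    control_plane.resume_writes(bucket_id)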

The routing layer carries the partition-map version it used when it sent a request. Each shard checks that version before it accepts a write:

class RetryWithFreshMap(Exception):
    """Router sent a stale partition-map version; retry with fresh metadata."""

class WrongOwner(Exception):
    """This shard does not own the bucket under the current map."""

def apply_booking_op(bucket_id, map_version, op):
    # Reject writes routed under a stale map version.
    if map_version != partition_map.current_version(bucket_id):
        raise RetryWithFreshMap(bucket_id)

    # Reject writes if this shard no longer owns the bucket.
    if not partition_map.is_owner(bucket_id, this_shard_id):
        raise WrongOwner(bucket_id)

    # Durably log the mutation, then apply it to local state.
    append_to_partition_log(bucket_id, op)
    apply_to_local_state(bucket_id, op)

That version check is not defensive polish. It is the mechanism that prevents stale routers from writing to the old owner after cutover. If a router still believes bucket 173 -> shard B after map v105 is published, shard B rejects the write and forces a retry with fresh metadata. Harbor Point would rather surface a short retry than accept a decisive booking mutation on the wrong shard.

The catch-up stream is what makes the cutover meaningful. Suppose Marta extends hold H-8821 after the snapshot started. Shard B assigns that mutation log sequence 918487, applies it locally, and streams it to F. At cutover time, Harbor Point does not ask "Did the copy finish?" It asks "Has the target replayed every mutation through the barrier sequence?" Only when the answer is yes does the control plane publish map v105.
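
On the target side, that barrier question reduces to tracking the highest contiguously replayed sequence number. A sketch, again with assumed names:

class BucketReplica:
    """Target-side replay state for one migrating bucket (sketch, assumed names)."""

    def __init__(self, bucket_id, snapshot_seq):
        self.bucket_id = bucket_id
        # The snapshot already covers everything up to snapshot_seq.
        self.applied_through_seq = snapshot_seq

    def replay(self, seq, op):
        # Mutations arrive in log order from the source's change stream.
        assert seq == self.applied_through_seq + 1, "gap in change stream"
        apply_to_local_state(self.bucket_id, op)  # assumed helper from above
        self.applied_through_seq = seq

    def caught_up(self, barrier_seq):
        # True only once every mutation through the barrier has been
        # applied, including Marta's extension at seq 918487.
        return self.applied_through_seq >= barrier_seq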

For high-value writes, Harbor Point accepts a tiny write pause on one bucket during cutover. That brief barrier keeps the ownership handoff simple: no concurrent writers, no ambiguous order, no post-hoc merge logic. The trade-off is explicit. A short and scoped availability dip is cheaper than turning a definitive booking path into a distributed reconciliation problem.

Part 3: The operational work continues after the flip

Once the new map is live, the move is not over. In-flight requests may still arrive at B with stale routing metadata for a few seconds. Those must be rejected or forwarded consistently. Read paths also need a policy. Harbor Point keeps advisory browse reads flexible, but it does not let POST /bookings/confirm take a weaker path just because a migration is running. The consistency contract from 048.md still governs what the target must know before it serves the critical endpoint.
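
A stale router does not need special-case handling; it just retries with refreshed metadata. A minimal sketch of that loop, with an assumed client interface:

def send_booking_op(bucket_id, op, client, max_retries=3):
    for _ in range(max_retries):
        shard = client.partition_map.owner(bucket_id)
        try:
            return client.send(shard, bucket_id, client.partition_map.version, op)
        except RetryWithFreshMap:
            # The shard saw a newer map than we routed with:
            # fetch fresh metadata and try again.
            client.refresh_partition_map()
    raise RuntimeError("routing metadata still unstable after retries")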

Harbor Point also delays deleting the source copy. First it checks parity: row counts, highest replicated log sequence, and a sample of sensitive records such as active holds and newly confirmed bookings. Only after the drain period is clean does shard B garbage-collect bucket 173. Fast deletion feels efficient, but delayed cleanup preserves an audit path and a recovery option when migration bugs surface immediately after cutover.
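
The parity check itself can be a simple gate. Every helper below is an assumed interface; what matters is that garbage collection waits for all three checks:

def safe_to_garbage_collect(bucket_id, source, target, sample_keys):
    # 1. Structural parity: same number of rows on both sides.
    if source.row_count(bucket_id) != target.row_count(bucket_id):
        return False

    # 2. Log parity: the target replayed everything the source logged.
    if target.applied_through_seq(bucket_id) < source.max_log_seq(bucket_id):
        return False

    # 3. Spot-check sensitive records: active holds, fresh confirmations.
    for key in sample_keys:
        if source.read_record(bucket_id, key) != target.read_record(bucket_id, key):
            return False

    return True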

This is why live rebalancing needs dedicated observability rather than a generic "migration succeeded" metric. Harbor Point tracks source-to-target catch-up lag, stale-route reject rate, cutover barrier duration, per-bucket error rate, and latency before and after the move. Those signals reveal whether the migration is merely slow or actually unsafe. They also tell the team when the deeper fix is a different partition layout rather than repeatedly moving the same hotspot around the cluster.
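
Those signals are cheap to compute from state the migration already maintains. An illustrative snapshot function, with placeholder names:

def migration_health(bucket_id, source, target, router_stats):
    return {
        # Should trend toward zero before cutover is attempted.
        "catchup_lag_ops": source.max_log_seq(bucket_id)
                           - target.applied_through_seq(bucket_id),
        # Should spike briefly at cutover, then decay fast.
        "stale_route_rejects_per_sec": router_stats.reject_rate(bucket_id),
        # Should stay within the agreed barrier budget.
        "cutover_barrier_ms": router_stats.barrier_duration_ms(bucket_id),
    }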

The broader lesson is that rebalancing is the operational continuation of shard-key design. Many logical buckets made motion possible. Versioned routing, delta replay, and disciplined cutover make that motion safe under production traffic.

Connections

Connection 1: 049.md created the precondition for safe motion

Logical partitions made it possible to move one bucket instead of redesigning the whole keyspace. Rebalancing is the operational proof that the shard-key strategy can survive changing traffic.

Connection 2: 048.md explains why cutover rules differ by endpoint

A browse query can tolerate a softer handoff than POST /bookings/confirm. Rebalancing has to respect each API's consistency contract instead of treating all reads and writes as equally flexible.

Connection 3: 051.md will turn this mechanism into a full operating plan

The capstone will combine replication, partitioning, lag budgets, and rebalancing into one scale-out design, which is how production teams justify topology changes before they execute them.

Key Takeaways

  1. Live rebalancing is an ownership-transfer problem, not merely a file-copy problem.
  2. Safe migrations keep one authoritative writer, stream deltas after the snapshot, and cut over with versioned routing so stale owners reject writes.
  3. A brief, scoped write barrier at cutover is often a better trade-off than the complexity of dual authoritative writers.
  4. The same partitioning design that enables scale must also enable safe motion, or the cluster becomes operationally rigid as traffic shifts.