Lesson 014 | 30 min | Intermediate

Day 285: Sharding Strategies and Rebalancing in Production

The core idea: sharding is not just spreading load across more machines. It is dividing data authority across partitions, which means every choice about shard key, routing, and rebalancing changes both scale behavior and application semantics.

Today's "Aha!" Moment

The insight: Replication gave us extra copies of the same authority domain. Sharding changes the game completely: now each node owns only part of the data, so the system must know where a key lives and what happens when that mapping changes.

Why this matters: Teams often reach for sharding as if it were just "horizontal scaling for databases." In practice, it changes query shape, transaction boundaries, operational playbooks, and failure modes.

Concrete anchor: A multi-tenant SaaS platform outgrows a single primary-replica cluster. Tenant data is split across shards by tenant ID. That helps scale writes and storage, but now any rebalance must move data safely, update routing, and avoid making one "hot tenant" dominate a single shard.

The practical sentence to remember:
Sharding scales by splitting ownership, and every split creates a routing and rebalancing problem.

Why This Matters

The problem: A single database, even with replicas, still has one write authority domain. At some point the problem is no longer "how do I make copies of this node?" but "how do I divide the dataset so multiple authority domains can work in parallel?"

Without this model:

Teams pick a shard key by convenience instead of workload shape.
Cross-shard queries or joins become accidental architecture.
Rebalancing is treated as a background migration rather than a correctness-sensitive data movement event.

With this model:

You can reason about what makes a good partition key.
You can predict hotspot and skew problems before they become incidents.
You can separate "steady-state routing" from "live movement of ownership."

Operational payoff: Better shard-key choices, fewer hot shards, safer rebalancing, and clearer boundaries for transactions and consistency.

Learning Objectives

By the end of this lesson, you should be able to:

Explain why sharding exists as a scale-out strategy for write load, storage growth, and authority partitioning.
Describe the major sharding strategies and how shard routing works in practice.
Reason about rebalancing trade-offs including hot shards, movement cost, and operational risk.

Core Concepts Explained

Concept 1: Sharding Splits Authority by Data Partition

Concrete example / mini-scenario: A user service stores profiles for 200 million accounts. At first everything fits on one primary-replica cluster. Eventually writes, storage, or maintenance windows become too large for one authority domain.

Intuition: Sharding means deciding that different subsets of the dataset are owned by different nodes or clusters.

What changes after sharding:

Every query needs shard-aware routing, and some queries must touch more than one shard
Transactions are atomic only within one shard unless you pay for cross-shard coordination
Capacity planning becomes about per-shard balance, not just cluster totals

The most important design choice: the shard key

That key determines:

How evenly data is distributed
Whether common queries stay shard-local
How likely hotspots are
How painful rebalancing becomes later

Typical shard-key strategies:

Range-based sharding
- Good for locality and range scans
- Dangerous if new writes cluster at one end
Hash-based sharding
- Good for spreading load evenly
- Worse for range queries and locality-sensitive access
Directory / lookup-based sharding
- Flexible mapping layer
- More operational metadata and control-plane complexity
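
Sketch (illustrative, not a production design): the three strategies above differ mainly in how a key is turned into a shard number. The shard count, split points, and directory entries below are hypothetical values chosen for the example.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def hash_shard(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Hash-based: a stable digest modulo the shard count spreads keys evenly."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Range-based: shard i owns keys below RANGE_UPPER_BOUNDS[i];
# the last shard owns everything from the final bound upward.
RANGE_UPPER_BOUNDS = ["g", "n", "t"]  # hypothetical split points, 4 shards


def range_shard(key: str) -> int:
    """Range-based: neighbouring keys land on the same shard."""
    return bisect.bisect_right(RANGE_UPPER_BOUNDS, key)


# Directory-based: an explicit mapping that operators can edit per key.
DIRECTORY = {"tenant-48291": 2, "tenant-7": 0}  # hypothetical entries


def directory_shard(key: str) -> int:
    """Directory-based: look the owner up; unknown keys raise KeyError."""
    return DIRECTORY[key]


print(range_shard("alice"), range_shard("mike"), range_shard("zoe"))  # 0 1 3
print(directory_shard("tenant-48291"))  # 2
```

Note that the hash version uses a fixed digest rather than Python's built-in hash(), which is randomized per process for strings and would route the same key differently after a restart.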

Rule of thumb:
Choose a key that keeps common write paths and common invariants local whenever possible.

Concept 2: Routing Is Part of the Database Design

Concrete example / mini-scenario: An API request arrives for tenant_id = 48291. Before the database can answer anything, the system must know which shard owns that tenant.

Intuition: Once data is partitioned, a request cannot even begin until routing resolves the target shard or shard set.

Where routing logic can live:

In the application
In a proxy/router tier
In a coordinator layer
In a client library with topology metadata

Why routing matters so much:

Routing to the wrong shard causes false misses or stale reads, and can send writes to a shard that no longer owns the key
Multi-shard fanout multiplies cost quickly
Cross-shard joins or aggregations become architectural choices instead of simple SQL conveniences

Healthy sharded design tends to prefer:

Point queries that map to one shard
Transactions that stay local to one shard
APIs shaped around the partition key

Smells to watch for:

Frequent "scatter-gather" queries across all shards
Business workflows that need many shards inside one logical operation
A partition key that looked balanced at launch but no longer matches access patterns

These are signs the system is fighting the shard boundary instead of using it.
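
Sketch (illustrative): the cost difference between a point query on the partition key and a scatter-gather query can be made visible by counting shard round trips. The ShardedStore class, its methods, and the "plan" field are hypothetical names invented for this example.

```python
import hashlib
from typing import Dict, List, Optional


class ShardedStore:
    """Toy in-memory store partitioned by tenant_id (hypothetical)."""

    def __init__(self, num_shards: int) -> None:
        self.shards: List[Dict[str, dict]] = [{} for _ in range(num_shards)]
        self.shard_calls = 0  # counts read round trips to shards

    def _shard_for(self, tenant_id: str) -> int:
        digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, tenant_id: str, record: dict) -> None:
        self.shards[self._shard_for(tenant_id)][tenant_id] = record

    def get_by_tenant(self, tenant_id: str) -> Optional[dict]:
        """Point query on the partition key: exactly one shard is touched."""
        self.shard_calls += 1
        return self.shards[self._shard_for(tenant_id)].get(tenant_id)

    def find_by_plan(self, plan: str) -> List[str]:
        """Query on a non-key field: every shard must be asked."""
        matches: List[str] = []
        for shard in self.shards:
            self.shard_calls += 1
            matches.extend(t for t, r in shard.items() if r["plan"] == plan)
        return sorted(matches)


store = ShardedStore(num_shards=8)
store.put("48291", {"plan": "pro"})
store.put("7", {"plan": "free"})
store.put("99", {"plan": "pro"})

store.get_by_tenant("48291")
print(store.shard_calls)  # 1
store.find_by_plan("pro")
print(store.shard_calls)  # 9: one point read plus eight fanout reads
```

The fanout cost grows with the shard count, which is why a query that was cheap on 8 shards can become a problem on 80.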

Concept 3: Rebalancing Is a Live Authority Migration

Concrete example / mini-scenario: One shard becomes much larger and busier than the others. The team decides to split it or move part of its range to a new node.

Intuition: Rebalancing is not just data copying. It is a controlled change to who owns which subset of data, while live reads and writes are still happening.

Why rebalancing is hard:

Data must be copied safely
Writes may still be arriving during movement
Routing metadata must switch at the right time
Old and new owners must not both accept writes for the same keys, even briefly during cutover
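
Sketch (a simplified model, not how any particular database implements it): one common shape for moving a range is bulk copy, a brief write block, catch-up, a single metadata cutover, and cleanup on the old owner. The Shard, Router, and migrate names are hypothetical, the store only supports upserts, and the during_copy hook exists only to simulate writes that arrive while the copy is running.

```python
from typing import Callable, Dict, Optional, Set


class Shard:
    """Hypothetical in-memory shard: range_id -> {key: value}."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.ranges: Dict[str, Dict[str, str]] = {}


class Router:
    """Routing metadata: which shard is the single authority per range."""

    def __init__(self, owner: Dict[str, Shard]) -> None:
        self.owner = owner
        self.epoch = 1  # bumped on every ownership change
        self.write_blocked: Set[str] = set()

    def write(self, range_id: str, key: str, value: str) -> None:
        if range_id in self.write_blocked:
            raise RuntimeError(f"{range_id} is mid-cutover; retry shortly")
        self.owner[range_id].ranges.setdefault(range_id, {})[key] = value

    def read(self, range_id: str, key: str) -> Optional[str]:
        return self.owner[range_id].ranges.get(range_id, {}).get(key)


def migrate(router: Router, range_id: str, target: Shard,
            during_copy: Optional[Callable[[], None]] = None) -> None:
    source = router.owner[range_id]
    # 1. Bulk copy while the source still serves reads and writes.
    target.ranges[range_id] = dict(source.ranges.get(range_id, {}))
    if during_copy is not None:
        during_copy()  # stand-in for writes that arrive mid-copy
    # 2. Briefly block writes so the source stops changing.
    router.write_blocked.add(range_id)
    try:
        # 3. Catch up on writes made since the bulk copy (upserts only).
        target.ranges[range_id].update(source.ranges.get(range_id, {}))
        # 4. Cut over: one metadata change flips authority.
        router.owner[range_id] = target
        router.epoch += 1
    finally:
        router.write_blocked.discard(range_id)
    # 5. The old owner drops its copy so it cannot act as an authority.
    source.ranges.pop(range_id, None)


a, b = Shard("shard-a"), Shard("shard-b")
router = Router({"range-1": a})
router.write("range-1", "k1", "v1")
migrate(router, "range-1", b,
        during_copy=lambda: router.write("range-1", "k2", "v2"))
print(router.owner["range-1"].name, router.read("range-1", "k2"), router.epoch)
```

The design choice worth noticing is that authority changes in exactly one place (the owner map), and the write made during the copy survives only because of the catch-up step.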

What usually drives rebalancing:

Storage growth
Uneven write load
Hot tenants or hot key ranges
Infrastructure expansion or shrinkage

Main trade-offs:

Aggressive balance
- Better long-term capacity distribution
- More short-term movement cost and risk
Minimal movement
- Safer in the short run
- Leaves skew and hotspots alive longer
Online rebalancing
- Better availability
- Much more operational complexity
Offline or maintenance-window moves
- Simpler coordination
- Worse availability or longer degraded periods

Key practical point:
Hotspot risk is usually not solved by rebalancing alone.
If the shard key creates concentrated write traffic, the hotspot may simply move with the data.
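
Sketch (illustrative): a small simulation shows why adding shards does not dilute a hot tenant. The workload below is hypothetical: one tenant produces half of all writes and 500 other tenants share the rest.

```python
import hashlib
from collections import Counter
from typing import List


def shard_of(tenant_id: str, num_shards: int) -> int:
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def busiest_shard_share(writes: List[str], num_shards: int) -> float:
    """Fraction of all writes that land on the most loaded shard."""
    load = Counter(shard_of(tenant, num_shards) for tenant in writes)
    return max(load.values()) / len(writes)


# Hypothetical workload: 5,000 writes from one tenant, 5,000 from 500 others.
writes = ["tenant-hot"] * 5000 + [f"tenant-{i}" for i in range(500)] * 10

for n in (4, 16, 64):
    print(n, round(busiest_shard_share(writes, n), 2))
```

Because every write for "tenant-hot" maps to the same shard, the busiest shard carries at least 50% of the traffic at 4, 16, or 64 shards. Moving that tenant elsewhere relocates the hotspot without shrinking it.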

Troubleshooting

Issue: Total cluster capacity looks fine, but one shard is constantly overloaded.

Why it happens: Sharding failures are often local, not global. A hot shard can exist even when the cluster average looks healthy.

Clarification / Fix: Monitor per-shard load, storage, and tail latency. Cluster-wide averages hide skew.
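
Sketch (illustrative): a simple skew ratio, the hottest shard's load divided by the mean, is one way to make this visible. The function name and the load figures are hypothetical.

```python
from typing import Dict


def skew_report(per_shard_qps: Dict[str, float]) -> Dict[str, object]:
    """Compare the hottest shard against the cluster-wide mean."""
    mean_qps = sum(per_shard_qps.values()) / len(per_shard_qps)
    hottest = max(per_shard_qps, key=per_shard_qps.get)
    return {
        "mean_qps": mean_qps,
        "hottest_shard": hottest,
        "skew_ratio": per_shard_qps[hottest] / mean_qps,
    }


# Hypothetical measurements: the average looks moderate, one shard does not.
load = {"shard-0": 900, "shard-1": 1100, "shard-2": 1000, "shard-3": 5000}
print(skew_report(load))
# {'mean_qps': 2000.0, 'hottest_shard': 'shard-3', 'skew_ratio': 2.5}
```

A ratio near 1.0 means load is balanced. Here the mean of 2,000 QPS hides a shard running at 2.5 times that.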

Issue: New data always lands on the same shard range.

Why it happens: Range-based partitioning often creates "last partition" hotspots when keys are time-ordered or monotonic.

Clarification / Fix: Revisit key design, write distribution, or range-splitting strategy rather than just adding replicas.
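
Sketch (illustrative): the effect is easy to reproduce with monotonically increasing IDs. The split points and ID range are hypothetical, and the hash variant is shown only for contrast, with the usual cost to range scans.

```python
import bisect
import hashlib
from collections import Counter

# Hypothetical split points: shard 3 owns every id from 3000 upward.
RANGE_UPPER_BOUNDS = [1000, 2000, 3000]


def range_shard(order_id: int) -> int:
    return bisect.bisect_right(RANGE_UPPER_BOUNDS, order_id)


def hash_shard(order_id: int, num_shards: int = 4) -> int:
    digest = hashlib.sha256(str(order_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# New rows get the next 1,000 sequential ids.
new_ids = range(3000, 4000)
range_load = Counter(range_shard(i) for i in new_ids)
hash_load = Counter(hash_shard(i) for i in new_ids)

print(dict(range_load))  # {3: 1000}: every new write hits the last range
print(sorted(hash_load.items()))  # writes spread across all four shards
```

Adding replicas to shard 3 would not help in the range case, because replicas do not add write authority.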

Issue: Rebalancing causes latency spikes and client errors.

Why it happens: Data movement, dual writes, metadata changes, and cache invalidation all compete with normal traffic.

Clarification / Fix: Treat rebalance as a controlled migration with observability, rate limits, and a clear cutover plan.
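
Sketch (illustrative): one simple form of rate limiting is to copy in fixed-size batches and pause between them so that the move stays under a row-per-second budget. The function name, parameters, and default numbers are hypothetical. The sleep function is injectable so the example runs without real delays.

```python
import time
from typing import Callable, Iterable, List


def throttled_copy(rows: Iterable[object],
                   write_batch: Callable[[List[object]], None],
                   batch_size: int = 100,
                   max_rows_per_sec: float = 1000.0,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Copy rows in batches, pausing after each full batch; returns rows copied."""
    pause = batch_size / max_rows_per_sec  # seconds to wait per full batch
    batch: List[object] = []
    copied = 0
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            write_batch(batch)
            copied += len(batch)
            batch = []
            sleep(pause)  # leave headroom for normal traffic
    if batch:  # final partial batch
        write_batch(batch)
        copied += len(batch)
    return copied


batches: List[List[object]] = []
pauses: List[float] = []
total = throttled_copy(range(250), batches.append, sleep=pauses.append)
print(total, [len(b) for b in batches], pauses)
# 250 [100, 100, 50] [0.1, 0.1]
```

In a real migration the budget would usually be tied to observed latency on the source shard rather than a fixed number.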

Issue: Queries increasingly fan out across many shards.

Why it happens: The API or query model is drifting away from the partition key that the data model assumes.

Clarification / Fix: Revisit access patterns. Sharding works best when the application speaks the same partition language as the database.

Advanced Connections

Connection 1: Sharding <-> Replication

The contrast: Replication keeps multiple copies of the same authority. Sharding creates multiple authorities for different subsets of data.

Why this matters: Most real systems need both eventually, but they solve different scale problems.

Connection 2: Sharding <-> Distributed Transactions

The bridge: Once a single business operation spans multiple shards, local transactions are no longer enough.

Why this matters: That is why the next lesson matters. Sharding often forces you to confront 2PC, sagas, or application-level coordination sooner than expected.

Resources

Suggested Resources

[BOOK] Designing Data-Intensive Applications
Focus: conceptual grounding for partitioning, skew, and operational scale trade-offs.
[DOC] Vitess Sharding Documentation
Focus: practical examples of routing, resharding, and control-plane concerns in a production sharded system.
[BOOK] Database Internals
Focus: useful systems-level intuition for partitioning and operational database design.

Key Insights

Sharding scales by splitting ownership, not by merely adding more replicas.
The shard key is the core design decision because it shapes balance, locality, and future pain.
Rebalancing is a live authority migration, so it must be treated as a correctness-sensitive operation, not just bulk copy.
