Replication, Partitioning, and State Placement
LESSON
Replication, Partitioning, and State Placement
Core Insight
When Maya opens a social app, it loads the timeline and settings for user-42. The request looks like one lookup. In the service, it is a chain of decisions: a router must find the right slice of data, an authoritative machine must accept a new post, and a nearby copy may answer a read without making Maya wait for a distant region.
The chain becomes visible when user-42 becomes popular. Their posts and notifications overload the machine that currently owns the account's timeline. The team wants to move that state to a less busy shard while users keep reading and posting. Simply copying the rows is not enough. While the copy is moving, new posts are still arriving. If both old and new machines accept writes, the two histories can diverge. If routing switches too early, the new machine may be missing a post that Maya was just told was saved.
State placement is the set of rules that makes this path safe: where a key belongs, who is allowed to make official changes, which copies may serve which reads, and how those facts change over time. Partitioning divides keys into manageable ownership units. Replication keeps additional copies for survival, locality, read capacity, or recovery. Neither term, by itself, says who has authority.
The key idea is that a replica is a copy of state, while an owner is a right to change state. A good placement design turns that distinction into a visible protocol. For each key, it can answer: “Where do I send this request?”, “Which version is safe to read?”, “Who may write now?”, and “How do we move the answer without losing or duplicating work?”
Follow One Key Through The Placement Map
Start with one account rather than a whole cluster. The service uses a placement map that routers cache and refresh.
placement map, epoch 84
key range: users 40,000–49,999
shard: timeline-7
write owner: Madrid leader
read replicas: Madrid follower, Frankfurt follower
state position: log entry 900
The map is a routing and authority contract. A request for user-42 is assigned to timeline-7 because a partitioning rule places that user-id range there. A timeline write goes to the Madrid leader because it is the current write owner. A timeline read may go to the Frankfurt follower only if the product permits the follower's current lag.
The terms describe different layers:
- A key is the item being located:
user-42, an account id, an order id, or a document id. - A partition or shard is a group of keys managed together. It gives the system a unit for ownership, movement, and load measurement.
- A replica is a machine holding a copy of that shard's state.
- The owner is the replica, or group acting through a leader, that has authority to accept the writes that matter.
- A placement epoch is a version number for the map. It lets the system tell an old routing decision from a new one.
These pieces work together. The router uses the key and epoch to find the shard. The owner serializes or coordinates changes for that shard. Replicas apply those changes and report how far they have caught up. If a router does not know the current epoch, it must refresh, be redirected, or fail safely. Guessing creates split ownership.
Partitioning Spreads Work, Not Every Kind Of Load
Partitioning prevents every key from landing on one machine. A simple rule can be a hash:
shard = hash(user_id) mod 16
user-42 -> timeline-7
user-43 -> timeline-2
Hashing usually distributes a large population of ordinary users reasonably well. A range rule instead groups adjacent values, such as account ids from 40,000 to 49,999. Ranges can make scans and locality easier to reason about, but a busy range may concentrate traffic. A directory can map a tenant to a chosen shard, which is useful when tenants have very different sizes but adds metadata that must itself remain correct.
None of these rules removes skew. If user-42 receives millions of reads or creates a very active live room, one key can overload one owner even when the rest of the cluster is balanced. That is a hot key. Adding read replicas may absorb read traffic. It does not automatically spread writes when one authority must still serialize the account's official timeline. The solution might need a different data model: split a large feed, precompute fan-out, rate-limit a tenant, or give a hot account specialized placement.
Partitioning also chooses what is kept together. Keeping a user's profile, settings, and timeline on one shard can make local updates simple. Separating them can spread load but turns a workflow such as “publish a post and update the profile's activity” into a cross-shard operation. Placement is therefore an application boundary, not merely a capacity setting.
Replication Gives Copies A Job Description
The Frankfurt follower in the map has a copy of timeline-7, but “it has a copy” is not enough to safely use it. The service needs a job description for every replica.
Madrid leader:
may accept timeline writes at epoch 84
publishes ordered changes to followers
Frankfurt follower:
may serve timeline reads up to its applied position
may not accept official writes
may become a candidate owner only after it catches up and is elected
For a low-risk timeline view, the app may allow Frankfurt to answer from position 895 while the leader has reached 900. The screen is slightly behind, but the product can tolerate that. For an action whose result must be immediately visible, the client might read from the leader, send a minimum required position such as 900, or wait until the follower catches up. The next lesson names these as user-visible consistency guarantees; placement supplies the machinery that makes them possible.
Replication improves several things at once, but it spends work as well. Extra copies can keep data available after a machine loss, place reads near users, and provide a source for repair. They also create replication lag, transfer traffic, lag monitoring, and a decision about which copy is authoritative after failure. A stale follower may be useful for a feed and unsafe for a password check. Freshness is a policy, not a property acquired by making a second copy.
Worked Trace: Move user-42 Without Split Ownership
The Madrid leader for timeline-7 is saturated. The system will move the shard's range, including user-42, to a new shard named timeline-12. The central rule is simple: at every moment, the protocol identifies one write authority.
1. Prepare A Destination That Cannot Yet Write
The controller creates a new placement plan but leaves epoch 84 active for clients. timeline-12 receives a snapshot of timeline-7 at log position 900.
active map: epoch 84 -> timeline-7 owns writes
source: timeline-7 at position 900
destination: timeline-12 imports snapshot at position 900
client routing: still goes to timeline-7
The destination is a replica during this phase, not an owner. New posts continue to go to timeline-7. The source streams the new log entries to timeline-12: 901, 902, 903, and so on. This avoids a long write outage while data is copied.
2. Catch Up And Verify The Same History
At one moment, the source has reached position 916 and the destination has applied through 912. The move is not ready. The controller can measure the gap, checksum a snapshot segment, and verify that the destination is applying the same ordered changes. When the destination reaches a chosen handoff point, it is safe to consider changing authority.
source position: 916
destination position: 916
verification: checksums and log order agree
still authoritative: timeline-7, epoch 84
The important detail is that being caught up is necessary but not sufficient. The destination has the data; it has not yet received permission to create the next official entry.
3. Hand Off The Right To Write
The system briefly fences the old owner: it stops accepting new writes at epoch 84 after a final entry, say 917. It makes sure timeline-12 applies that entry. Then the controller commits a new map.
final source entry: 917
destination applied: 917
new active map, epoch 85
key range users 40,000–49,999 -> timeline-12
write owner -> timeline-12 leader
timeline-7 -> follower or forwarding source only
The formal tool behind this is often called fencing. The new epoch acts like a newer authorization token. Storage and replicas reject a write that claims the old epoch after the handoff. A delayed message from the former owner cannot quietly become a valid new write just because it arrived late.
Clients with epoch 84 do not need to create a second history. They can be redirected to the new owner, forwarded through a controlled path, or receive an error that makes them refresh the placement map and retry with the same idempotency key. The protocol must define which behavior it uses. What it must not do is let an old client write to timeline-7 while a new client writes independently to timeline-12.
4. Drain, Observe, And Keep A Repair Path
Once traffic uses epoch 85, the old shard keeps a replica long enough to serve controlled forwarding, verify that the new owner remains caught up, and provide a rollback or repair source. Only after the system observes stable routing, replica health, and no old-epoch writes can it remove the old copy or reuse its capacity.
key -> router with epoch 85 -> timeline-12 owner -> followers
|
-> replication, metrics, repair
The move is successful not when the copy finishes, but when requests, authority, and evidence all agree on the new location. This choreography is why state placement is a live protocol rather than a static diagram.
Failover Uses The Same Authority Discipline
Movement is planned; failure is not. If the Madrid leader disappears, promoting the nearest follower immediately may preserve availability, but it risks choosing a follower that lacks recent writes or overlaps with a slow-to-detect former leader.
A safer failover procedure checks which replicas have an acceptable state position, chooses a new owner through the system's membership or consensus rule, and advances the ownership epoch. The old leader, if it returns, sees that its epoch is obsolete and becomes a follower instead of resuming writes. This is why “replica B is the backup” is not a complete failover plan. The plan must state how staleness is measured, who decides promotion, and how the old authority is fenced off.
There is a trade-off. Conservative failover may wait for evidence and temporarily reject writes. Aggressive failover restores service faster but can expose stale state or force more repair. The right balance depends on the promise attached to the shard: a social timeline, a payment ledger, and a password store should not use the same promotion rule by default.
Failure Modes And Operational Signals
Three failures reveal weak placement design. A hot shard appears when one key, tenant, or range consumes a disproportionate share of CPU, queue depth, or write capacity. Replicating the hot data may ease reads but leaves a single write authority saturated.
Routing drift appears when some routers still use epoch 84 after epoch 85 is active. If old routing is not redirected or fenced, requests can reach an owner that no longer has authority. Replica lag appears when a follower's applied position stays behind the leader, making it unsafe for a promised-fresh read or a rapid failover.
Useful signals include:
- per-shard read/write rate, queue depth, and tail latency, to find hot keys and ranges;
- placement-map age and the number of requests rejected or redirected for an old epoch;
- replication lag in bytes, log entries, and elapsed time;
- time spent in snapshot copy and catch-up phases during a move;
- failed checksum or state-verification counts; and
- old-epoch write attempts after a handoff.
The main trade-off is between a simple, stable ownership rule and flexible distribution. One owner per key is easy to explain and protects authority, but it can bottleneck. More replicas, finer partitions, and frequent movement improve locality and capacity, but require better metadata, fencing, monitoring, and recovery. The best placement plan is not the one with the most copies; it is the one whose copies and transitions have clear responsibilities.
Design Check
Choose one key from a system you know: user_id, tenant_id, account_id, order_id, room_id, or document_id. Without looking back, write a placement record and a move plan:
key and partitioning rule:
current placement epoch:
write owner and why it has authority:
replicas and the reads each may serve:
maximum acceptable replica lag:
condition that permits failover:
handoff point when the key moves:
how an old owner is fenced:
signal that reveals a hot key or stale routing:
Then test the design with one delayed request sent under the old epoch. If it can create a new official write after the new owner starts, the handoff is incomplete. If it is rejected, forwarded, or made to refresh and retry safely, the authority boundary is explicit.
Resources
- [PAPER] Dynamo: Amazon's Highly Available Key-value Store
- Focus: Partitioning, replication, versioning, and repair in a highly available key-value store.
- [PAPER] Bigtable: A Distributed Storage System for Structured Data
- Focus: Tablet assignment, serving, splitting, and the operational movement of structured data.
- [BOOK] Designing Data-Intensive Applications
- Focus: Partitioning, replication, leader/follower replication, and the trade-offs behind rebalancing.
Key Takeaways
- Placement maps connect a key to a shard, a write authority, permitted read replicas, and a versioned routing decision.
- Replication creates copies; it does not grant those copies authority to write or prove that they are fresh enough for every read.
- A safe rebalance copies and catches up state before changing authority, then fences the old owner with a newer placement epoch.
- Hot keys, replica lag, and stale routing are operational evidence that placement rules need attention.