Replication, Partitioning, and State Placement

Distributed Systems Foundations

Lesson 008, 20 min, beginner

Core Insight

When Maya opens a social app, it loads the timeline and settings for user-42. The request looks like one lookup. In the service, it is a chain of decisions: a router must find the right slice of data, an authoritative machine must accept a new post, and a nearby copy may answer a read without making Maya wait for a distant region.

The chain becomes visible when user-42 becomes popular. Their posts and notifications overload the machine that currently owns the account's timeline. The team wants to move that state to a less busy shard while users keep reading and posting. Simply copying the rows is not enough. While the copy is moving, new posts are still arriving. If both old and new machines accept writes, the two histories can diverge. If routing switches too early, the new machine may be missing a post that Maya was just told was saved.

State placement is the set of rules that makes this path safe: where a key belongs, who is allowed to make official changes, which copies may serve which reads, and how those facts change over time. Partitioning divides keys into manageable ownership units. Replication keeps additional copies for survival, locality, read capacity, or recovery. Neither term, by itself, says who has authority.

The key idea is that a replica is a copy of state, while ownership is the right to change state. A good placement design turns that distinction into a visible protocol. For each key, it can answer: “Where do I send this request?”, “Which version is safe to read?”, “Who may write now?”, and “How do we move the answer without losing or duplicating work?”

Follow One Key Through The Placement Map

Start with one account rather than a whole cluster. The service uses a placement map that routers cache and refresh.

placement map, epoch 84

key range:      users 40,000–49,999
shard:          timeline-7
write owner:    Madrid leader
read replicas:  Madrid follower, Frankfurt follower
state position: log entry 900

The map is a routing and authority contract. A request for user-42 is assigned to timeline-7 because a partitioning rule places that user-id range there. A timeline write goes to the Madrid leader because it is the current write owner. A timeline read may go to the Frankfurt follower only if the product permits the follower's current lag.

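The terms describe different layers: the partitioning rule maps a key to a shard, the write owner holds the authority to change that shard, the read replicas hold copies that may serve reads, the state position records how far a copy has applied the log, and the epoch versions the map itself.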

These pieces work together. The router uses the key and epoch to find the shard. The owner serializes or coordinates changes for that shard. Replicas apply those changes and report how far they have caught up. If a router does not know the current epoch, it must refresh, be redirected, or fail safely. Guessing creates split ownership.
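As a minimal sketch of that contract, the router below refuses to guess when its cached epoch differs from the active one. The fields mirror the placement map above. The names Placement, StaleEpoch, and route_write, and the numeric id 40_042 standing in for user-42 inside the 40,000–49,999 range, are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Hypothetical in-memory form of the placement map shown above."""
    epoch: int
    key_lo: int  # inclusive lower bound of the user-id range
    key_hi: int  # inclusive upper bound of the user-id range
    shard: str
    write_owner: str
    read_replicas: tuple[str, ...]


class StaleEpoch(Exception):
    """The caller's cached map does not match the active epoch."""


def route_write(active: Placement, cached_epoch: int, user_id: int) -> str:
    # A router that is not on the active epoch must refresh, not guess.
    if cached_epoch != active.epoch:
        raise StaleEpoch(f"cached epoch {cached_epoch}, active epoch {active.epoch}")
    if not (active.key_lo <= user_id <= active.key_hi):
        raise KeyError(f"user {user_id} is outside this placement's range")
    return active.write_owner


epoch_84 = Placement(
    epoch=84,
    key_lo=40_000,
    key_hi=49_999,
    shard="timeline-7",
    write_owner="madrid-leader",
    read_replicas=("madrid-follower", "frankfurt-follower"),
)
```

A caller that catches StaleEpoch would refresh its map and retry, which is the "fail safely" branch described above.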

Partitioning Spreads Work, Not Every Kind Of Load

Partitioning prevents every key from landing on one machine. A simple rule can be a hash:

shard = hash(user_id) mod 16

user-42 -> timeline-7
user-43 -> timeline-2

Hashing usually distributes a large population of ordinary users reasonably well. A range rule instead groups adjacent values, such as account ids from 40,000 to 49,999. Ranges can make scans and locality easier to reason about, but a busy range may concentrate traffic. A directory can map a tenant to a chosen shard, which is useful when tenants have very different sizes but adds metadata that must itself remain correct.
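The three rules can be sketched side by side. This is an illustration under stated assumptions: the range boundaries, the shard names other than timeline-7 and timeline-12, and the tenant name are invented, and the hash rule uses SHA-256 only because Python's built-in hash() of strings varies between processes. The shard numbers it produces will not match the illustrative user-42 -> timeline-7 mapping above.

```python
import bisect
import hashlib


def hash_shard(user_id: int, shard_count: int = 16) -> int:
    """Hash rule: a stable digest of the id, reduced modulo the shard count."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Range rule: sorted inclusive lower bounds and a parallel list of shards.
# These boundaries and names are hypothetical.
RANGE_STARTS = [0, 40_000, 50_000]
RANGE_SHARDS = ["timeline-3", "timeline-7", "timeline-9"]


def range_shard(user_id: int) -> str:
    """Range rule: find the last range whose lower bound is <= user_id."""
    index = bisect.bisect_right(RANGE_STARTS, user_id) - 1
    if index < 0:
        raise KeyError(f"no range covers user {user_id}")
    return RANGE_SHARDS[index]


# Directory rule: explicit metadata that must itself be kept correct.
DIRECTORY = {"tenant-big": "timeline-12"}


def directory_shard(tenant: str, default: str = "timeline-0") -> str:
    """Directory rule: look the tenant up, falling back to a default shard."""
    return DIRECTORY.get(tenant, default)
```

The range rule keeps neighboring ids together, which is what makes scans cheap and busy ranges dangerous; the hash rule scatters them.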

None of these rules removes skew. If user-42 receives millions of reads or creates a very active live room, one key can overload one owner even when the rest of the cluster is balanced. That is a hot key. Adding read replicas may absorb read traffic. It does not automatically spread writes when one authority must still serialize the account's official timeline. The solution might need a different data model: split a large feed, precompute fan-out, rate-limit a tenant, or give a hot account specialized placement.
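One rough way to notice a hot key is to measure each key's share of recent requests. The sketch below assumes a sample of request keys is available in memory, and the 20 percent threshold is an arbitrary illustrative value rather than a recommendation.

```python
from collections import Counter


def hot_keys(request_keys, threshold=0.2):
    """Return keys whose share of the sampled requests is at least `threshold`.

    `request_keys` is any iterable of keys, one per observed request.
    The threshold is a hypothetical tuning knob.
    """
    counts = Counter(request_keys)
    total = sum(counts.values())
    if total == 0:
        return []
    # most_common() yields keys from busiest to quietest.
    return [key for key, n in counts.most_common() if n / total >= threshold]
```

A balanced hash over shards would not reveal this problem, because the skew lives inside one key rather than between shards.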

Partitioning also chooses what is kept together. Keeping a user's profile, settings, and timeline on one shard can make local updates simple. Separating them can spread load but turns a workflow such as “publish a post and update the profile's activity” into a cross-shard operation. Placement is therefore an application boundary, not merely a capacity setting.

Replication Gives Copies A Job Description

The Frankfurt follower in the map has a copy of timeline-7, but “it has a copy” is not enough to safely use it. The service needs a job description for every replica.

Madrid leader:
  may accept timeline writes at epoch 84
  publishes ordered changes to followers

Frankfurt follower:
  may serve timeline reads up to its applied position
  may not accept official writes
  may become a candidate owner only after it catches up and is elected

For a low-risk timeline view, the app may allow Frankfurt to answer from position 895 while the leader has reached 900. The screen is slightly behind, but the product can tolerate that. For an action whose result must be immediately visible, the client might read from the leader, send a minimum required position such as 900, or wait until the follower catches up. The next lesson names these as user-visible consistency guarantees; placement supplies the machinery that makes them possible.
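The "minimum required position" option can be sketched as a replica chooser. The function name, the replica names, and the idea of passing applied positions in as a dictionary are assumptions for illustration; a real client would learn positions from replica metadata.

```python
def pick_read_replica(
    applied: dict[str, int],
    leader: str,
    min_position: int,
    preferred: list[str],
) -> str:
    """Return the first preferred replica that has applied `min_position`.

    Falls back to the leader when no preferred replica is fresh enough.
    """
    for name in preferred:
        # Unknown replicas are treated as having applied nothing.
        if applied.get(name, -1) >= min_position:
            return name
    return leader
```

With Frankfurt at 895 and the leader at 900, a request that tolerates position 890 can stay in Frankfurt, while a request that needs 900 is sent to Madrid.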

Replication improves several things at once, but it has costs as well. Extra copies can keep data available after a machine loss, place reads near users, and provide a source for repair. They also introduce replication lag, transfer traffic, a monitoring burden, and a decision about which copy is authoritative after failure. A stale follower may be useful for a feed and unsafe for a password check. Freshness is a policy, not a property acquired by making a second copy.

Worked Trace: Move user-42 Without Split Ownership

The Madrid leader for timeline-7 is saturated. The system will move the shard's range, including user-42, to a new shard named timeline-12. The central rule is simple: at every moment, the protocol identifies one write authority.

1. Prepare A Destination That Cannot Yet Write

The controller creates a new placement plan but leaves epoch 84 active for clients. timeline-12 receives a snapshot of timeline-7 at log position 900.

active map:      epoch 84 -> timeline-7 owns writes

source:          timeline-7 at position 900
destination:     timeline-12 imports snapshot at position 900
client routing:  still goes to timeline-7

The destination is a replica during this phase, not an owner. New posts continue to go to timeline-7. The source streams the new log entries to timeline-12: 901, 902, 903, and so on. This avoids a long write outage while data is copied.
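A small model of this phase: the destination starts from a snapshot position and then applies whatever the source has appended since. ShardLog, import_snapshot, and catch_up are hypothetical names, and the snapshot's contents are left out so that only log positions are tracked.

```python
class ShardLog:
    """Toy ordered log. entries[i] sits at position base_position + i + 1."""

    def __init__(self, base_position=0):
        self.base_position = base_position
        self.entries = []

    @property
    def position(self):
        return self.base_position + len(self.entries)

    def append(self, entry):
        self.entries.append(entry)
        return self.position

    def entries_after(self, position):
        if position < self.base_position:
            raise ValueError("entries before the snapshot base are not retained")
        return self.entries[position - self.base_position:]


def import_snapshot(source):
    """Create a destination log that starts at the source's current position."""
    return ShardLog(base_position=source.position)


def catch_up(source, destination):
    """Apply the source entries the destination is missing; return how many."""
    missing = source.entries_after(destination.position)
    for entry in missing:
        destination.append(entry)
    return len(missing)
```

Note that nothing here gives the destination permission to accept a write of its own. It only replays what the source has already ordered.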

2. Catch Up And Verify The Same History

At one moment, the source has reached position 916 and the destination has applied through 912. The move is not ready: the destination is still four entries behind. The controller can measure the gap, checksum a snapshot segment, and verify that the destination is applying the same ordered changes. Only when the destination reaches a chosen handoff point, here position 916, is it safe to consider changing authority.

source position:       916
destination position:  916
verification:          checksums and log order agree
still authoritative:   timeline-7, epoch 84

The important detail is that being caught up is necessary but not sufficient. The destination has the data; it has not yet received permission to create the next official entry.
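The verification step can be approximated by comparing an order-sensitive digest of both logs. This is a sketch assuming entries are strings held in memory; log_digest and ready_for_handoff are invented names, and a real system would more likely checksum segments incrementally.

```python
import hashlib


def log_digest(entries):
    """Order-sensitive SHA-256 digest over a sequence of string entries."""
    hasher = hashlib.sha256()
    for entry in entries:
        data = entry.encode()
        # The length prefix keeps ["ab"] distinct from ["a", "b"].
        hasher.update(len(data).to_bytes(4, "big"))
        hasher.update(data)
    return hasher.hexdigest()


def ready_for_handoff(source_entries, destination_entries):
    """True when both sides hold the same entries in the same order.

    This only reports readiness. It does not transfer write authority.
    """
    return (
        len(source_entries) == len(destination_entries)
        and log_digest(source_entries) == log_digest(destination_entries)
    )
```

A True result corresponds to the "checksums and log order agree" line in the trace, with timeline-7 still authoritative at epoch 84.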

3. Hand Off The Right To Write

The system briefly fences the old owner: it stops accepting new writes at epoch 84 after a final entry, say 917. It makes sure timeline-12 applies that entry. Then the controller commits a new map.

final source entry: 917
destination applied: 917

new active map, epoch 85
  key range users 40,000–49,999 -> timeline-12
  write owner -> timeline-12 leader
  timeline-7 -> follower or forwarding source only
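The order of these steps matters, and a sketch can make the preconditions explicit. ActiveMap and commit_handoff are hypothetical names; the function only checks that the source is fenced and that the destination has applied the final entry before it produces the next epoch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ActiveMap:
    """Hypothetical minimal view of the active placement map."""
    epoch: int
    write_owner: str


def commit_handoff(
    active: ActiveMap,
    new_owner: str,
    source_fenced: bool,
    final_source_position: int,
    destination_position: int,
) -> ActiveMap:
    """Return the next map only if the handoff preconditions hold."""
    if not source_fenced:
        raise RuntimeError("source must stop accepting writes before the map changes")
    if destination_position < final_source_position:
        raise RuntimeError("destination has not applied the final source entry")
    # Authority and epoch change together, in a single new map.
    return replace(active, epoch=active.epoch + 1, write_owner=new_owner)
```

Called with the values from the trace (source fenced, final entry 917, destination at 917), it yields epoch 85 with timeline-12 as write owner.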

The mechanism that makes this handoff enforceable is usually called fencing. The new epoch acts like a newer authorization token: storage and replicas reject any write that claims the old epoch after the handoff. A delayed message from the former owner cannot quietly become a valid new write just because it arrived late.

Clients with epoch 84 do not need to create a second history. They can be redirected to the new owner, forwarded through a controlled path, or receive an error that makes them refresh the placement map and retry with the same idempotency key. The protocol must define which behavior it uses. What it must not do is let an old client write to timeline-7 while a new client writes independently to timeline-12.
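The "refresh and retry with the same idempotency key" path can be sketched with a storage layer that checks the epoch on every write. FencedStore and FencedWrite are hypothetical names, and the model assumes the idempotency records are visible to whichever machine owns the shard, which a real system would have to arrange through replication.

```python
class FencedWrite(Exception):
    """The write carried an epoch that is not the active one."""


class FencedStore:
    """Toy storage layer that rejects writes from any non-active epoch."""

    def __init__(self, epoch: int) -> None:
        self.epoch = epoch
        self.log: list[str] = []
        self.positions: dict[str, int] = {}  # idempotency key -> log position

    def advance_epoch(self, new_epoch: int) -> None:
        if new_epoch <= self.epoch:
            raise ValueError("epochs must only move forward")
        self.epoch = new_epoch

    def write(self, epoch: int, idempotency_key: str, entry: str) -> int:
        if epoch != self.epoch:
            raise FencedWrite(f"write at epoch {epoch}, active epoch {self.epoch}")
        # A retry with a known key returns the original position, no new entry.
        if idempotency_key in self.positions:
            return self.positions[idempotency_key]
        self.log.append(entry)
        self.positions[idempotency_key] = len(self.log)
        return len(self.log)
```

A client whose write at epoch 84 is rejected after the handoff can refresh, resend at epoch 85 with the same key, and get back either a fresh position or the position of its earlier successful attempt, never a duplicate entry.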

4. Drain, Observe, And Keep A Repair Path

Once traffic uses epoch 85, the old shard keeps a replica long enough to serve controlled forwarding, verify that the new owner remains caught up, and provide a rollback or repair source. Only after the system observes stable routing, replica health, and no old-epoch writes can it remove the old copy or reuse its capacity.

key -> router with epoch 85 -> timeline-12 owner -> followers
                                   |
                                   -> replication, metrics, repair

The move is successful not when the copy finishes, but when requests, authority, and evidence all agree on the new location. This choreography is why state placement is a live protocol rather than a static diagram.

Failover Uses The Same Authority Discipline

Movement is planned; failure is not. If the Madrid leader disappears, promoting the nearest follower immediately may preserve availability, but it carries two risks: the chosen follower may lack recent writes, or the former leader may be slow rather than dead and still accepting writes alongside the new one.

A safer failover procedure checks which replicas have an acceptable state position, chooses a new owner through the system's membership or consensus rule, and advances the ownership epoch. The old leader, if it returns, sees that its epoch is obsolete and becomes a follower instead of resuming writes. This is why “replica B is the backup” is not a complete failover plan. The plan must state how staleness is measured, who decides promotion, and how the old authority is fenced off.
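The checks in that procedure can be sketched as a selection function. This is an illustration only: choose_new_owner and max_missing are invented names, and in a real system the decision would be made through the membership or consensus rule rather than by a single local function.

```python
from typing import Optional


def choose_new_owner(
    current_epoch: int,
    last_leader_position: int,
    follower_positions: dict[str, int],
    max_missing: int = 0,
) -> Optional[tuple[str, int]]:
    """Pick the most caught-up eligible follower and the next epoch.

    Returns None when no follower is within `max_missing` entries of the
    last known leader position, which means writes stay rejected for now.
    """
    eligible = {
        name: position
        for name, position in follower_positions.items()
        if last_leader_position - position <= max_missing
    }
    if not eligible:
        return None
    # Sorting first makes ties resolve to the alphabetically first name.
    winner = max(sorted(eligible), key=lambda name: eligible[name])
    return winner, current_epoch + 1
```

Raising max_missing is the aggressive end of the trade-off: service returns sooner, at the cost of possibly promoting a copy that lacks recent entries.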

There is a trade-off. Conservative failover may wait for evidence and temporarily reject writes. Aggressive failover restores service faster but can expose stale state or force more repair. The right balance depends on the promise attached to the shard: a social timeline, a payment ledger, and a password store should not use the same promotion rule by default.

Failure Modes And Operational Signals

Three failures reveal weak placement design. A hot shard appears when one key, tenant, or range consumes a disproportionate share of CPU, queue depth, or write capacity. Replicating the hot data may ease reads but leaves a single write authority saturated.

Routing drift appears when some routers still use epoch 84 after epoch 85 is active. If old routing is not redirected or fenced, requests can reach an owner that no longer has authority. Replica lag appears when a follower's applied position stays behind the leader, making it unsafe for a promised-fresh read or a rapid failover.

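Useful signals include per-shard and per-key CPU, queue depth, and write rate; the number of routers or requests still carrying an old epoch; and each follower's applied position relative to the leader's.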
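Two of the failures described above, routing drift and replica lag, reduce to simple comparisons once the inputs are collected. The sketch below assumes those inputs arrive as dictionaries; placement_signals, the router and follower names, and the max_lag threshold are all hypothetical.

```python
def placement_signals(
    active_epoch,
    router_epochs,
    leader_position,
    follower_positions,
    max_lag,
):
    """Summarize routing drift and replica lag from collected observations.

    router_epochs maps router name -> epoch it is currently using.
    follower_positions maps follower name -> last applied log position.
    """
    stale_routers = sorted(
        name for name, epoch in router_epochs.items() if epoch < active_epoch
    )
    lagging_followers = {
        name: leader_position - position
        for name, position in follower_positions.items()
        if leader_position - position > max_lag
    }
    return {"stale_routers": stale_routers, "lagging_followers": lagging_followers}
```

An empty result on both fields is one piece of the evidence needed before an old copy is retired; it does not by itself prove that no old-epoch write is still in flight.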

The main trade-off is between a simple, stable ownership rule and flexible distribution. One owner per key is easy to explain and protects authority, but it can bottleneck. More replicas, finer partitions, and frequent movement improve locality and capacity, but require better metadata, fencing, monitoring, and recovery. The best placement plan is not the one with the most copies; it is the one whose copies and transitions have clear responsibilities.

Design Check

Choose one key from a system you know: user_id, tenant_id, account_id, order_id, room_id, or document_id. Without looking back, write a placement record and a move plan:

key and partitioning rule:
current placement epoch:
write owner and why it has authority:
replicas and the reads each may serve:
maximum acceptable replica lag:
condition that permits failover:
handoff point when the key moves:
how an old owner is fenced:
signal that reveals a hot key or stale routing:

Then test the design with one delayed request sent under the old epoch. If it can create a new official write after the new owner starts, the handoff is incomplete. If it is rejected, forwarded, or made to refresh and retry safely, the authority boundary is explicit.

