Leaderless Replication, Sloppy Quorums, and Hinted Handoff
The core idea: A sloppy quorum preserves write availability by accepting data on reachable fallback nodes, but it weakens the clean quorum-overlap story until hinted handoff and repair return data to the intended replica set.
Core Insight
Imagine Harbor Point storing trader watchlist preferences in a leaderless replicated store. This data matters to the user experience, but it is not the legal source of truth for reservation approval. Each watchlist key has three home replicas. Under normal conditions, the service waits for acknowledgments from two of those replicas before returning success.
Then two home replicas for watchlist:trader-17 become unreachable during a zone networking incident. The product has a choice: reject preference updates even though many other nodes are healthy, or accept the write somewhere else and repair placement later.
A sloppy quorum chooses the second path. The coordinator writes to reachable fallback nodes outside the key's home replica set, often leaving hints that say, "this value belongs to a temporarily unavailable home replica." That keeps the product available, but it changes what a successful quorum means.
The trade-off is precise. A strict quorum buys overlap inside the intended replica set. A sloppy quorum buys durable acceptance on reachable nodes. Those are not the same guarantee. The system now depends on hinted handoff, read repair, and anti-entropy to move from temporary availability back toward correct placement and convergence.
The Hidden Assumption in Quorum Math
Lesson 006 used the rule:
R + W > N
That rule depends on a quiet assumption: reads and writes are both choosing from the same home replica set for the key.
For watchlist:trader-17, the intended placement might be:
home replicas: A, B, C
N = 3
W = 2
R = 2
In the strict case, a successful write and later quorum read must overlap:
write W=2:  A  B
read  R=2:     B  C
overlap:       B
That overlap is why the read has a chance to see the newest version. It is not magic attached to the word "quorum"; it is a property of selecting enough replicas from the same family.
Once writes can land outside that family, the proof changes. The write may be durable somewhere, but a later read from the home replicas may not intersect the temporary write set until repair has moved the data home.
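The overlap argument can be checked by brute force. The following is a minimal sketch, not taken from any particular store: the helper name `always_overlaps` is hypothetical, and it simply enumerates every possible write set and read set for the A, B, C example, with D and E assumed as fallback nodes.

```python
from itertools import combinations

def always_overlaps(write_pool, read_pool, w, r):
    """Return True if every w-sized write set drawn from write_pool
    shares at least one node with every r-sized read set from read_pool."""
    return all(
        set(ws) & set(rs)
        for ws in combinations(write_pool, w)
        for rs in combinations(read_pool, r)
    )

home = ["A", "B", "C"]      # N = 3 home replicas for the key
fallbacks = ["D", "E"]      # assumed healthy nodes outside the home set

# Strict quorum: reads and writes both choose from the home set (R + W > N).
strict_ok = always_overlaps(home, home, w=2, r=2)

# R + W = N is not enough: write {A} and read {B, C} never meet.
weak_ok = always_overlaps(home, home, w=1, r=2)

# Sloppy quorum: writes may also land on fallbacks, reads still go home.
sloppy_ok = always_overlaps(home + fallbacks, home, w=2, r=2)

print(strict_ok, weak_ok, sloppy_ok)  # True False False
```

The third result is the point of this lesson: the numbers R, W, and N did not change, but widening the pool that writes draw from is enough to break the guarantee.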
Sloppy Quorum Write Path
Now suppose the home replicas are A, B, and C, but A and B are unreachable. The cluster still has healthy fallback nodes D and E.
preferred order for key K: [A, B, C, D, E, F]
A = unreachable
B = unreachable
C = healthy
D = healthy fallback
E = healthy fallback
With strict quorum, the write fails: only one home replica (C) is reachable, and W = 2 requires two. With sloppy quorum, the coordinator can accept the write on reachable nodes:
intended home set:  A     B         C
failure state:      down  down      ok
sloppy write set:   C     D         E
                    home  fallback  fallback
The fallback nodes usually store hints that identify the intended owners:
D stores value for K with hint: "deliver to A when A returns"
E stores value for K with hint: "deliver to B when B returns"
This is hinted handoff. When A and B recover, D and E try to hand the temporarily stored updates back to the proper home replicas. If handoff succeeds quickly, the system enjoyed higher write availability and then restored normal placement.
If handoff stalls, the cluster accumulates repair debt. Some reads may miss the newest value because they consult the home set before the hints have been delivered. The write was not lost, but it is not yet where the normal quorum proof expected it to be.
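The write path and handoff described above can be sketched in a few lines. This is an illustrative model only, assuming the simplest policy (walk the preference list and substitute the next healthy node for each unreachable home replica). The function names `plan_sloppy_write` and `hand_off` are hypothetical and do not correspond to any real store's API.

```python
def plan_sloppy_write(preference_list, home_count, reachable):
    """Pick write targets for one key.

    preference_list: nodes in preferred order; the first home_count are home replicas.
    reachable: set of nodes the coordinator can currently contact.
    Returns (targets, hints), where hints maps fallback node -> intended home node.
    """
    home = preference_list[:home_count]
    fallbacks = [n for n in preference_list[home_count:] if n in reachable]
    targets, hints = [], {}
    for node in home:
        if node in reachable:
            targets.append(node)
        elif fallbacks:
            stand_in = fallbacks.pop(0)   # next healthy node in preference order
            targets.append(stand_in)
            hints[stand_in] = node        # "deliver to <node> when it returns"
    return targets, hints


def hand_off(hints, recovered):
    """Split hints into those deliverable now and those still pending."""
    delivered = {fb: owner for fb, owner in hints.items() if owner in recovered}
    pending = {fb: owner for fb, owner in hints.items() if owner not in recovered}
    return delivered, pending


targets, hints = plan_sloppy_write(
    ["A", "B", "C", "D", "E", "F"], home_count=3, reachable={"C", "D", "E"}
)
print(sorted(targets))  # ['C', 'D', 'E']
print(hints)            # {'D': 'A', 'E': 'B'}

# Only A has come back so far: D can deliver, E keeps its hint.
delivered, pending = hand_off(hints, recovered={"A"})
print(delivered, pending)  # {'D': 'A'} {'E': 'B'}
```

The `pending` map is the repair debt in miniature: every entry is a write that exists in the cluster but is not yet on the replica that normal reads expect.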
Worked Example: Availability for the Right Data
Harbor Point should not use sloppy quorum for every operation. A watchlist preference is a good candidate because the user benefits from progress and the system can repair or reconcile later. Reservation approval is a poor candidate because accepting writes on fallback nodes may create exactly the kind of ambiguity the business cannot tolerate.
Operation                    Sloppy quorum fit?  Reason
---------------------------  ------------------  ------------------------------------
watchlist preference update  yes, usually        user-visible, repairable state
dashboard layout setting     yes, usually        low correctness consequence
reservation approval         no, usually         authority and audit matter
issuer limit change          no, usually         stale or misplaced writes are costly
This is the same discipline as earlier lessons: attach the replication behavior to the API contract. Sloppy quorum is not "less correct" in a vacuum. It is correct for data whose product promise prioritizes availability and repairable convergence. It is risky for data whose product promise requires immediate authority and clear auditability.
The operational signs also differ. For strict quorum, the main question is whether enough home replicas are available. For sloppy quorum, the questions include:
Question                               Why it matters
-------------------------------------  -----------------------------------
How many writes landed on fallbacks?   measures temporary placement drift
How old are the oldest hints?          shows whether handoff is keeping up
Which home replicas are missing data?  predicts stale reads after recovery
Is anti-entropy catching leftovers?    covers hints that failed or expired
Sloppy quorum turns a hard availability failure into a convergence obligation. That is a good trade only if the system is built to pay that obligation.
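The first three questions in the table can be answered from the set of pending hints alone. The sketch below assumes a hypothetical `Hint` record with the fields shown. Real stores expose this information in their own formats, so treat the shape as illustrative.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Hint:
    key: str
    held_by: str        # fallback node storing the value
    intended_for: str   # home replica that should receive it
    created_at: float   # seconds since epoch

def hint_backlog(hints, now):
    """Summarize pending hints: total, oldest age in seconds, per-home counts."""
    if not hints:
        return {"pending": 0, "oldest_age_s": 0.0, "by_home": {}}
    return {
        "pending": len(hints),
        "oldest_age_s": now - min(h.created_at for h in hints),
        "by_home": dict(Counter(h.intended_for for h in hints)),
    }

backlog = hint_backlog(
    [
        Hint("watchlist:trader-17", held_by="D", intended_for="A", created_at=100.0),
        Hint("watchlist:trader-17", held_by="E", intended_for="B", created_at=400.0),
        Hint("watchlist:trader-42", held_by="D", intended_for="A", created_at=700.0),
    ],
    now=1000.0,
)
print(backlog)
# {'pending': 3, 'oldest_age_s': 900.0, 'by_home': {'A': 2, 'B': 1}}
```

A growing `oldest_age_s` is the clearest early signal that handoff is falling behind, and `by_home` shows which recovered replicas are most likely to serve stale reads.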
Repair Debt and Read Semantics
The most common misunderstanding is to keep using strict-quorum intuition after a sloppy write.
Suppose C, D, and E accepted the newest watchlist value while A and B were down. After A and B recover, a later read with R = 2 may ask only A and B, because they are home replicas and are reachable again:
latest write: C, D, E
later read: A, B
overlap: none
The read may return an older value until hinted handoff, read repair, or anti-entropy closes the gap. That does not mean the system is broken. It means the system chose availability during failure and now has to converge.
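The stale read and its repair can be reproduced with a toy model. This is a sketch under simplifying assumptions: each replica is a plain dictionary, versions are integers, and the watchlist contents ("ACME", "INIT") are made-up values for illustration.

```python
def quorum_read(stores, replicas, key):
    """Return the highest-versioned (version, value) among the contacted replicas."""
    seen = [stores[r][key] for r in replicas if key in stores[r]]
    return max(seen) if seen else None   # tuples compare by version first

key = "watchlist:trader-17"
old, new = (1, ["ACME"]), (2, ["ACME", "INIT"])

# A and B hold the old version; C, D, E accepted the new one while A and B were down.
stores = {n: {key: old} for n in "AB"}
stores.update({n: {key: new} for n in "CDE"})

before = quorum_read(stores, ["A", "B"], key)   # no overlap with C, D, E

# Hinted handoff: D delivers to A, E delivers to B.
for fallback, owner in {"D": "A", "E": "B"}.items():
    stores[owner][key] = stores[fallback][key]

after = quorum_read(stores, ["A", "B"], key)
print(before[0], after[0])  # 1 2
```

The same read, against the same two replicas, returns version 1 before handoff and version 2 after it. Nothing about the read path changed; only the placement of the data did.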
This is why the next lesson matters. Hinted handoff is the targeted repair path for temporarily misplaced writes. Read repair and anti-entropy are broader mechanisms that detect and heal divergence after the fact. Sloppy quorum only makes sense as part of that full repair system.
Failure Modes
- Assuming R + W > N still proves freshness after fallback writes: the overlap proof assumes reads and writes draw from the same home replica set. Sloppy writes may land elsewhere.
- Using sloppy quorum for authoritative data: if the operation cannot tolerate temporary ambiguity, misplacement, or later reconciliation, sloppy quorum is usually the wrong availability trade-off.
- Treating hints as free storage: hints consume disk, network, and replay capacity. If home replicas stay down or handoff falls behind, hints become operational debt.
- Ignoring failed handoff: hints can expire, be dropped, or fail repeatedly. Anti-entropy and read repair need to catch what handoff does not.
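The last two failure modes interact: a hint that expires is a write that handoff will never deliver, so something else has to notice the gap. The sketch below assumes a hypothetical time-to-live policy for hints and a deliberately naive anti-entropy check that compares per-key version numbers directly. Production systems typically compare compact summaries such as hash trees rather than full key maps.

```python
def expire_hints(hints, now, ttl_s):
    """Split hints into (kept, dropped) by age; dropped hints are never replayed."""
    kept = [h for h in hints if now - h["created_at"] <= ttl_s]
    dropped = [h for h in hints if now - h["created_at"] > ttl_s]
    return kept, dropped

def stale_keys(reference, replica):
    """Keys where replica is missing or behind reference (maps of key -> version)."""
    return sorted(k for k, v in reference.items() if replica.get(k, 0) < v)

hints = [
    {"key": "k1", "intended_for": "A", "created_at": 0},      # written early in the outage
    {"key": "k2", "intended_for": "A", "created_at": 9000},   # written late in the outage
]
kept, dropped = expire_hints(hints, now=10800, ttl_s=3600)
print([h["key"] for h in kept], [h["key"] for h in dropped])  # ['k2'] ['k1']

# After recovery, compare home replica A against healthy home replica C.
replica_c = {"k1": 2, "k2": 5}
replica_a = {"k1": 1}
print(stale_keys(replica_c, replica_a))  # ['k1', 'k2']
```

Handoff can still fix k2, but k1 will only converge through read repair or anti-entropy, because its hint no longer exists.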
Resources
- [PAPER] Dynamo: Amazon's Highly Available Key-value Store
- Focus: Read the sections on sloppy quorum and hinted handoff as an availability-first response to partial failure.
- [DOC] Riak KV Replication and Sloppy Quorum
- Focus: Use this as a concrete production-family explanation of fallback placement and repair.
- [DOC] Riak KV Glossary: Sloppy Quorum
- Focus: Compare the glossary definition with the stricter quorum-overlap model from the previous lesson.
- [BOOK] Designing Data-Intensive Applications
- Focus: Review leaderless replication, sloppy quorums, hinted handoff, read repair, and anti-entropy as one repair-oriented design family.
Key Takeaways
- Strict quorum overlap assumes reads and writes contact the same intended replica set for a key.
- Sloppy quorum improves write availability by accepting writes on reachable fallback nodes when home replicas are unavailable.
- Hinted handoff records where fallback data belongs and tries to move it home after recovery.
- Sloppy quorum is a repair-backed availability trade-off, not a standalone freshness guarantee.