Leaderless Replication, Sloppy Quorums, and Hinted Handoff

LESSON

Consistency and Replication

007 30 min intermediate

The core idea: A sloppy quorum preserves write availability by accepting data on reachable fallback nodes, but it weakens the clean quorum-overlap story until hinted handoff and repair return data to the intended replica set.

Core Insight

Imagine Harbor Point storing trader watchlist preferences in a leaderless replicated store. This data matters to the user experience, but it is not the legal source of truth for reservation approval. Each watchlist key has three home replicas. Under normal conditions, the service writes to two of those replicas before returning success.

Then two home replicas for watchlist:trader-17 become unreachable during a zone networking incident. The product has a choice: reject preference updates even though many other nodes are healthy, or accept the write somewhere else and repair placement later.

A sloppy quorum chooses the second path. The coordinator writes to reachable fallback nodes outside the key's home replica set, often leaving hints that say, "this value belongs to a temporarily unavailable home replica." That keeps the product available, but it changes what a successful quorum means.

The trade-off is precise. A strict quorum buys overlap inside the intended replica set. A sloppy quorum buys durable acceptance on reachable nodes. Those are not the same guarantee. The system now depends on hinted handoff, read repair, and anti-entropy to move from temporary availability back toward correct placement and convergence.

The Hidden Assumption in Quorum Math

Lesson 006 used the rule:

R + W > N

That rule depends on a quiet assumption: reads and writes are both choosing from the same home replica set for the key.

For watchlist:trader-17, the intended placement might be:

home replicas: A, B, C
N = 3
W = 2
R = 2

In the strict case, a successful write and later quorum read must overlap:

write W=2:  A   B
read R=2:   B   C
overlap:    B

That overlap is why a quorum read contacts at least one replica that accepted the latest acknowledged write. It is not magic attached to the word "quorum"; it is a property of selecting enough replicas from the same family.

Once writes can land outside that family, the proof changes. The write may be durable somewhere, but a later read from the home replicas may not intersect the temporary write set until repair has moved the data home.
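The overlap argument can be checked by brute force. The sketch below is illustrative only: the helper name always_overlaps is an assumption, not part of any real store. It enumerates every possible write set and read set drawn from the home replicas, then contrasts that with a sloppy write set that includes nodes outside the home set.

```python
from itertools import combinations

def always_overlaps(home, w, r):
    """True if every w-sized write set and every r-sized read set
    drawn from `home` share at least one replica."""
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(home, w)
        for read_set in combinations(home, r)
    )

home = ["A", "B", "C"]

# R + W = 4 > N = 3: every read set intersects every write set.
strict_ok = always_overlaps(home, w=2, r=2)

# R + W = 3, which is not > N: some read sets miss the write set.
weak_ok = always_overlaps(home, w=1, r=2)

# A sloppy write landed on C, D, E; a later read asks home replicas A and B.
sloppy_write_set = {"C", "D", "E"}
home_read_set = {"A", "B"}
sloppy_overlap = sloppy_write_set & home_read_set  # empty set
```

The last three lines show the gap: R + W > N still holds numerically, but the write set is no longer a subset of the home replicas, so the guarantee does not apply.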

Sloppy Quorum Write Path

Now suppose the home replicas are A, B, and C, but A and B are unreachable. The cluster still has healthy fallback nodes D and E.

preferred order for key K: [A, B, C, D, E, F]

A = unreachable
B = unreachable
C = healthy
D = healthy fallback
E = healthy fallback

With strict quorum, the write fails: only one home replica (C) is reachable, and W = 2 requires two. With sloppy quorum, the coordinator can still accept the write on reachable nodes:

intended home set:       A      B      C
failure state:           down   down   ok

sloppy write set:                      C      D          E
                                       home   fallback   fallback

The fallback nodes usually store hints that identify the intended owners:

D stores value for K with hint: "deliver to A when A returns"
E stores value for K with hint: "deliver to B when B returns"

This is hinted handoff. When A and B recover, D and E try to hand the temporarily stored updates back to the proper home replicas. If handoff succeeds quickly, the system enjoyed higher write availability and then restored normal placement.

If handoff stalls, the cluster accumulates repair debt. Some reads may miss the newest value because they consult the home set before the hints have been delivered. The write was not lost, but it is not yet where the normal quorum proof expected it to be.
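A minimal in-memory sketch of this write path follows. Everything in it is an assumption made for illustration: the Node class, the sloppy_write and hinted_handoff functions, and the choice to overwrite values without comparing versions. Real stores differ in how they pick fallbacks, version data, and expire hints.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)   # key -> value
    hints: list = field(default_factory=list)  # (intended_owner, key, value)

def sloppy_write(nodes, preferred, n, w, key, value):
    """Write to the first n reachable nodes in preference order.

    The first n entries of `preferred` are the home replicas. Each fallback
    that accepts the write records a hint naming one unreachable home replica.
    """
    home = preferred[:n]
    reachable = [name for name in preferred if nodes[name].up][:n]
    if len(reachable) < w:
        raise RuntimeError("fewer than W reachable nodes; write rejected")
    down_home = [name for name in home if not nodes[name].up]
    for name in reachable:
        nodes[name].data[key] = value
        if name not in home:
            nodes[name].hints.append((down_home.pop(0), key, value))
    return reachable

def hinted_handoff(nodes):
    """Deliver hints whose owner is reachable again; keep the rest.

    Simplification: overwrites the owner's value without a version check.
    """
    for node in nodes.values():
        if not node.up:
            continue
        remaining = []
        for owner, key, value in node.hints:
            if nodes[owner].up:
                nodes[owner].data[key] = value
            else:
                remaining.append((owner, key, value))
        node.hints = remaining

nodes = {name: Node(name) for name in "ABCDEF"}
nodes["A"].up = False
nodes["B"].up = False

accepted = sloppy_write(nodes, list("ABCDEF"), n=3, w=2,
                        key="watchlist:trader-17", value="v2")

nodes["A"].up = True   # A recovers, B is still down
hinted_handoff(nodes)
```

After the run, A holds the value and D's hint is gone, while E still carries a hint for B. That leftover hint is the repair debt described above.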

Worked Example: Availability for the Right Data

Harbor Point should not use sloppy quorum for every operation. A watchlist preference is a good candidate because the user benefits from progress and the system can repair or reconcile later. Reservation approval is a poor candidate because accepting writes on fallback nodes may create exactly the kind of ambiguity the business cannot tolerate.

Operation                         Sloppy quorum fit?      Reason
--------------------------------  ----------------------  ------------------------------
watchlist preference update        yes, usually            user-visible, repairable state
dashboard layout setting           yes, usually            low correctness consequence
reservation approval               no, usually             authority and audit matter
issuer limit change                no, usually             stale or misplaced writes are costly

This is the same discipline as earlier lessons: attach the replication behavior to the API contract. Sloppy quorum is not "less correct" in a vacuum. It is correct for data whose product promise prioritizes availability and repairable convergence. It is risky for data whose product promise requires immediate authority and clear auditability.

The operational signs also differ. For strict quorum, the main question is whether enough home replicas are available. For sloppy quorum, the questions include:

Question                              Why it matters
------------------------------------  ----------------------------------------------
How many writes landed on fallbacks?   measures temporary placement drift
How old are the oldest hints?          shows whether handoff is keeping up
Which home replicas are missing data?  predicts stale reads after recovery
Is anti-entropy catching leftovers?    covers hints that failed or expired

Sloppy quorum turns a hard availability failure into a convergence obligation. That is a good trade only if the system is built to pay that obligation.
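The questions in the table can be turned into a small report over outstanding hints. This is a sketch under assumptions: the Hint record, its field names, and the repair_debt function are hypothetical, and a real system would expose such numbers through its own metrics.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Hint:
    holder: str        # fallback node storing the hint
    owner: str         # home replica the value belongs to
    key: str
    created_at: float  # seconds on whatever clock the caller uses

def repair_debt(hints, now):
    """Summarize outstanding hints: total count, age of the oldest hint,
    and how many hints each home replica is still waiting for."""
    if not hints:
        return {"pending": 0, "oldest_age_s": 0.0, "by_owner": {}}
    return {
        "pending": len(hints),
        "oldest_age_s": now - min(h.created_at for h in hints),
        "by_owner": dict(Counter(h.owner for h in hints)),
    }

hints = [
    Hint("D", "A", "watchlist:trader-17", created_at=100.0),
    Hint("E", "B", "watchlist:trader-17", created_at=130.0),
    Hint("D", "A", "watchlist:trader-42", created_at=160.0),
]
report = repair_debt(hints, now=400.0)
```

A growing oldest_age_s suggests handoff is falling behind, and by_owner points at the home replicas most likely to serve stale reads once they rejoin.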

Repair Debt and Read Semantics

The most common misunderstanding is to keep using strict-quorum intuition after a sloppy write.

Suppose C, D, and E accepted the newest watchlist value while A and B were down. After A and B recover, but before the hints have been delivered, a later read asks A and B because they are reachable home replicas again:

latest write:   C, D, E
later read:     A, B
overlap:        none

The read may return an older value until hinted handoff, read repair, or anti-entropy closes the gap. That does not mean the system is broken. It means the system chose availability during failure and now has to converge.
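The stale read and its repair can be traced with a small simulation. It assumes each stored value carries an integer version and that a read returns the highest version among the replicas it contacts. The quorum_read function and the "v1"/"v2" values are illustrative, not a real API.

```python
def quorum_read(replicas, contacted, key):
    """Return the (version, value) pair with the highest version among the
    contacted replicas, or None if none of them has the key."""
    found = [replicas[name][key] for name in contacted if key in replicas[name]]
    return max(found, key=lambda pair: pair[0], default=None)

key = "watchlist:trader-17"
old = (1, "v1")
new = (2, "v2")

# A and B were down during the newest write, so they still hold version 1.
replicas = {
    "A": {key: old},
    "B": {key: old},
    "C": {key: new},
    "D": {key: new},
    "E": {key: new},
}

# R = 2 read against home replicas A and B: no overlap with C, D, E.
stale = quorum_read(replicas, ["A", "B"], key)

# Hinted handoff: D delivers its copy to A.
replicas["A"][key] = replicas["D"][key]

# The same read now intersects the repaired replica.
fresh = quorum_read(replicas, ["A", "B"], key)
```

In this sketch a read that happened to contact C would have seen version 2 even before handoff. The staleness depends on which home replicas answer, which is why it can appear intermittently.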

This is why the next lesson matters. Hinted handoff is the targeted repair path for temporarily misplaced writes. Read repair and anti-entropy are broader mechanisms that detect and heal divergence after the fact. Sloppy quorum only makes sense as part of that full repair system.

Failure Modes

Assuming R + W > N still proves freshness after fallback writes. The overlap proof assumes reads and writes draw from the same home replica set. Sloppy writes may land elsewhere.

Using sloppy quorum for authoritative data. If the operation cannot tolerate temporary ambiguity, misplacement, or later reconciliation, sloppy quorum is usually the wrong availability trade-off.

Treating hints as free storage. Hints consume disk, network, and replay capacity. If home replicas stay down or handoff falls behind, hints become operational debt.

Ignoring failed handoff. Hints can expire, be dropped, or fail repeatedly. Anti-entropy and read repair need to catch what handoff does not.

Key Takeaways

  1. Strict quorum overlap assumes reads and writes contact the same intended replica set for a key.
  2. Sloppy quorum improves write availability by accepting writes on reachable fallback nodes when home replicas are unavailable.
  3. Hinted handoff records where fallback data belongs and tries to move it home after recovery.
  4. Sloppy quorum is a repair-backed availability trade-off, not a standalone freshness guarantee.