Day 232: Read Repair & Anti-Entropy - Healing Diverged Replicas

Replication systems often accept that replicas will drift under failures, lag, and partial writes. Read repair and anti-entropy are the mechanisms that push them back together, but they do it at different times and with very different operational profiles.

Today's "Aha!" Moment

By now we have seen several ways replicas can stop agreeing:

a sloppy quorum writes to fallback nodes
a home replica misses updates while it is down
a recovery path catches up some keys but not others
replicas return at different times with different histories

Once that happens, the system needs ways to heal itself.

The aha is:

read repair and anti-entropy solve the same broad problem, but they wake up for different reasons

Read repair wakes up because a client happened to read a key and the system noticed the replicas disagree.

Anti-entropy wakes up because the system proactively compares replicas in the background, even if no user is reading that data right now.

That makes the distinction much more useful than the names suggest:

read repair is opportunistic healing on the hot path
anti-entropy is systematic healing off the hot path

Once we see that, we can reason clearly about cold data, repair backlog, extra read latency, background I/O, and why many systems need both.

Why This Matters

Imagine a product catalog stored in an eventually consistent cluster.

One replica missed a set of updates during a short outage. Another replica has the newest value. A third still returns an older version because nothing has repaired it yet.

If a customer reads the item right now, the system might notice the mismatch and repair the stale copy during that read. That is read repair.

But what if the item is rarely read?

Then the stale replica may remain stale for days or months unless there is a background process actively scanning for inconsistency. That is where anti-entropy matters.

This is why teams get into trouble when they say vague things like:

"replicas eventually converge"

They do not converge by magic.

They converge because some mechanism:

observes disagreement
determines which version or state should win
copies or merges data so replicas move back toward agreement

The product consequence is direct:

hot keys may heal quickly with read repair
cold keys may remain broken unless anti-entropy exists

So this lesson is really about understanding the repair surface of an eventually consistent system.

Learning Objectives

By the end of this session, you will be able to:

Explain why replica healing is needed - Describe the kinds of divergence that accumulate in distributed storage even after writes have "succeeded."
Differentiate read repair from anti-entropy - Show what triggers each mechanism and what kinds of data each one can realistically heal.
Evaluate their operational cost - Connect healing quality to tail latency, background I/O, CPU, and the presence of cold data.

Core Concepts Explained

Concept 1: Read Repair Heals Divergence When a Read Exposes It

Suppose a key should exist on replicas A, B, and C.

But due to a temporary outage, they now hold:

A -> version 7
B -> version 7
C -> version 5

A client reads the key.

The coordinator asks enough replicas to satisfy the read policy. If C is among the replicas it contacts, the coordinator compares the returned versions and discovers that C is behind.

At that moment, the system can do two things:

return the correct value to the client
also push the fresher state back to the stale replica

That second part is read repair.

ASCII sketch:

client read
   |
   v
coordinator asks replicas
   |
   v
A = v7, B = v7, C = v5
   |
   v
return v7 to client
   |
   v
repair C in response to what the read discovered

This is elegant because the system uses real traffic to find and heal inconsistencies.

But the limitation is just as important:

read repair can only heal keys that somebody actually reads

So it is strongest for hot data and weakest for cold data.
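
The read path above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular database implements it. The names Replica, Versioned, and read_with_repair are hypothetical, the sketch assumes a single integer version where the highest number wins (real systems often use timestamps or vector clocks), and it assumes the coordinator contacts all three replicas and repairs synchronously.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Versioned:
    version: int
    value: str


class Replica:
    """Hypothetical in-memory replica mapping key -> Versioned."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: dict[str, Versioned] = {}

    def get(self, key: str) -> Optional[Versioned]:
        return self.store.get(key)

    def put(self, key: str, item: Versioned) -> None:
        current = self.store.get(key)
        # Only move forward: never overwrite a newer version with an older one.
        if current is None or item.version > current.version:
            self.store[key] = item


def read_with_repair(key: str, replicas: list[Replica]):
    """Return the freshest value among the contacted replicas and
    push it to every contacted replica that is missing it or behind."""
    responses = [(replica, replica.get(key)) for replica in replicas]
    found = [item for _, item in responses if item is not None]
    if not found:
        return None, []
    newest = max(found, key=lambda item: item.version)
    repaired = []
    for replica, item in responses:
        if item is None or item.version < newest.version:
            replica.put(key, newest)  # the repair step
            repaired.append(replica.name)
    return newest, repaired


a, b, c = Replica("A"), Replica("B"), Replica("C")
a.put("sku-1", Versioned(7, "price=19"))
b.put("sku-1", Versioned(7, "price=19"))
c.put("sku-1", Versioned(5, "price=25"))

value, repaired = read_with_repair("sku-1", [a, b, c])
print(value.version, repaired)  # 7 ['C']
```

Note that the repair only happens because C was among the contacted replicas. If the read had been served by A and B alone, C would have stayed at version 5.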

Concept 2: Anti-Entropy Heals Divergence Even When Clients Are Not Touching the Data

Now imagine a different key that no client has read for months.

If one replica is stale, read repair never fires because nothing triggers it.

Anti-entropy exists for exactly this gap.

It is a background process that compares replica state and repairs differences even without a foreground read.

In Dynamo-style systems, a common strategy is:

compare replica ranges using Merkle trees or similar summaries
detect where differences exist without copying every object eagerly
synchronize only the ranges or keys that are actually out of sync

That gives us a second mental model:

read repair is reactive and key-by-key
anti-entropy is proactive and range-by-range

This is especially important for:

cold data
long-lived clusters
replicas that missed updates during outages
silent drift that would otherwise remain invisible

Without anti-entropy, the system may look healthy in normal traffic while quietly carrying stale data that nobody has asked for yet.
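
The Merkle-tree comparison described above can be sketched as follows. This is a simplified illustration under several assumptions: keys are hashed into a fixed number of buckets (NUM_BUCKETS, an arbitrary choice that must be a power of two here), each replica is a plain dictionary of key -> (version, value), and the higher version wins. The function names are hypothetical and do not correspond to any real database API.

```python
import hashlib

NUM_BUCKETS = 8  # assumed for this sketch; must be a power of two


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def bucket_of(key: str) -> int:
    return int.from_bytes(_h(key.encode())[:4], "big") % NUM_BUCKETS


def build_tree(store: dict[str, tuple[int, str]]) -> list[list[bytes]]:
    """Return the tree as a list of levels, leaves first and root last."""
    buckets: list[list[str]] = [[] for _ in range(NUM_BUCKETS)]
    for key in sorted(store):
        version, value = store[key]
        buckets[bucket_of(key)].append(f"{key}:{version}:{value}")
    level = [_h("\n".join(entries).encode()) for entries in buckets]
    levels = [level]
    while len(level) > 1:
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_buckets(tree_a, tree_b) -> list[int]:
    """Walk from the root down, descending only into subtrees that differ."""
    top = len(tree_a) - 1
    suspects = [0] if tree_a[top][0] != tree_b[top][0] else []
    for depth in range(top - 1, -1, -1):
        next_suspects = []
        for idx in suspects:
            for child in (2 * idx, 2 * idx + 1):
                if tree_a[depth][child] != tree_b[depth][child]:
                    next_suspects.append(child)
        suspects = next_suspects
    return suspects


def sync(store_a, store_b) -> list[str]:
    """Repair only keys that fall in differing buckets."""
    bad = set(differing_buckets(build_tree(store_a), build_tree(store_b)))
    repaired = []
    # In one process we can filter all keys; a real system would exchange
    # only the contents of the differing buckets over the network.
    for key in sorted(set(store_a) | set(store_b)):
        if bucket_of(key) not in bad:
            continue
        va, vb = store_a.get(key), store_b.get(key)
        if va == vb:
            continue
        # Higher version wins; equal versions fall back to comparing values.
        winner = max(v for v in (va, vb) if v is not None)
        store_a[key] = store_b[key] = winner
        repaired.append(key)
    return repaired


replica_a = {f"item-{i}": (1, "v1") for i in range(20)}
replica_b = dict(replica_a)
replica_a["item-3"] = (2, "v2")  # replica B missed this update

repaired_keys = sync(replica_a, replica_b)
print(repaired_keys)  # ['item-3']
```

The point of the tree is that two replicas that agree can confirm it by comparing a single root hash, and replicas that disagree only need to exchange the buckets under the mismatching branches.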

Concept 3: These Mechanisms Complement Each Other, But They Push Cost Into Different Places

Both mechanisms improve convergence, but they spend system resources differently.

Read repair pushes some healing work into the foreground path:

more replica comparison during reads
possible extra latency on unlucky requests
healing concentrated on data users already care about

Anti-entropy pushes healing into the background:

sustained disk and network work
CPU cost for tree building, comparison, and repair
broader coverage, especially for cold data

So the trade-off is not "which one is correct?" It is:

where do you want to pay for healing?

A useful summary:

Mechanism       Trigger              Strength                    Weakness
--------------  -------------------  --------------------------  -----------------------------
Read repair     Foreground read      Fast healing for hot keys   Cold data may stay stale
Anti-entropy    Background scan      Broad, proactive coverage   Ongoing background overhead

This is why many systems use both.

Read repair catches obvious divergence on data users are touching right now. Anti-entropy catches the rest.
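
A toy simulation makes the coverage difference concrete. It is deliberately simplified and every number in it is an assumption: all keys start stale on one replica, read traffic touches only a small hot set in round-robin order, and one anti-entropy pass is modeled as a full scan of the key space. The function simulate and its counters are hypothetical names.

```python
def simulate(num_keys: int, hot_keys: int, reads: int, run_anti_entropy: bool) -> dict:
    """Count what stays stale and where the healing work was spent."""
    stale = set(range(num_keys))  # every key starts stale on one replica
    foreground_repairs = 0
    background_scanned = 0

    for i in range(reads):
        key = i % hot_keys  # assumed: traffic only ever touches the hot keys
        if key in stale:
            stale.discard(key)  # read repair heals what the read exposed
            foreground_repairs += 1

    if run_anti_entropy:
        for key in range(num_keys):  # one full background pass
            background_scanned += 1
            stale.discard(key)

    return {
        "stale_left": len(stale),
        "foreground_repairs": foreground_repairs,
        "background_scanned": background_scanned,
    }


only_read_repair = simulate(100, 10, 1000, run_anti_entropy=False)
with_anti_entropy = simulate(100, 10, 1000, run_anti_entropy=True)
print(only_read_repair["stale_left"])   # 90
print(with_anti_entropy["stale_left"])  # 0
```

With read repair alone, the 90 cold keys never heal no matter how many reads arrive. Adding the background pass heals them, at the cost of scanning all 100 keys.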

Troubleshooting

Issue: "Eventually consistent means replicas will converge on their own."

Why it happens / is confusing: The word "eventually" sounds like time alone solves the problem.

Clarification / Fix: Convergence needs a mechanism. Read repair and anti-entropy are examples of the concrete processes that make "eventually" true.

Issue: "If we have read repair, we do not need anti-entropy."

Why it happens / is confusing: Hot paths are visible, so teams assume real traffic will heal everything important.

Clarification / Fix: Read repair only heals keys that are actually read. Cold data can remain stale indefinitely without a background repair path.

Issue: "Anti-entropy is free because it runs in the background."

Why it happens / is confusing: Background work feels detached from user latency.

Clarification / Fix: Anti-entropy still consumes disk, CPU, and network resources. It must be tuned so repair coverage does not create unacceptable production pressure.
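
One way to think about that tuning is as a budget per time slice. The sketch below is only an illustration of the idea, not a real scheduler: plan_repair is a hypothetical helper, the range names and sizes are made up, and the budget stands in for whatever limit (bytes, keys, or I/O operations per tick) an operator would actually configure.

```python
def plan_repair(range_sizes: dict[str, int], budget_per_tick: int) -> list[list[str]]:
    """Greedily pack out-of-sync ranges into ticks without exceeding the budget.

    A range larger than the budget still gets repaired, in a tick of its own.
    """
    ticks: list[list[str]] = []
    current: list[str] = []
    used = 0
    for name, size in range_sizes.items():
        if current and used + size > budget_per_tick:
            ticks.append(current)  # close the tick before it overflows
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        ticks.append(current)
    return ticks


plan = plan_repair(
    {"r1": 40, "r2": 30, "r3": 50, "r4": 120, "r5": 10},
    budget_per_tick=100,
)
print(plan)  # [['r1', 'r2'], ['r3'], ['r4'], ['r5']]
```

A smaller budget spreads the same repair work over more ticks, which lowers production pressure but lengthens the time stale ranges stay stale.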

Advanced Connections

Connection 1: Read Repair & Anti-Entropy <-> Hinted Handoff

The parallel: Hinted handoff is a focused recovery mechanism for temporary misplacement. Read repair and anti-entropy are broader healing tools for replicas that still diverge after that first recovery path.

Connection 2: Read Repair & Anti-Entropy <-> Chain Replication

The parallel: Chain replication tries to avoid ordinary divergence by enforcing one committed order up front. Read repair and anti-entropy belong to architectures that accept divergence more readily and therefore need explicit healing mechanisms afterward.

Resources

[DOC] Active Anti-Entropy
[DOC] Replication
[PAPER] Dynamo: Amazon's Highly Available Key-value Store
[DOC] Riak KV Glossary
[BOOK] Designing Data-Intensive Applications

Key Insights

Read repair heals what reads happen to expose - It is reactive, opportunistic, and especially useful for hot keys.
Anti-entropy heals what reads may never touch - It provides proactive background coverage, especially for cold data and long-lived divergence.
Healing is not free, only relocated - Read repair spends some cost on the foreground path, while anti-entropy spends it in the background.

Knowledge Check

1. What best describes read repair?
- A) A background scanner that continuously compares all key ranges
- B) A repair mechanism triggered when a read discovers replica disagreement
- C) A way to avoid storing multiple replicas

2. Why is anti-entropy important even if read repair exists?
- A) Because anti-entropy heals cold data that clients may not read for a long time
- B) Because read repair only works on deletes
- C) Because anti-entropy removes the need for replication

3. What is the main operational difference between read repair and anti-entropy?
- A) Read repair runs only on leaders, anti-entropy only on followers
- B) Read repair pushes some cost into the read path, while anti-entropy pushes cost into background processing
- C) There is no meaningful difference; they are two names for the same mechanism

Answers

1. B: Read repair is triggered when a foreground read discovers that replicas disagree and the system uses that opportunity to heal stale copies.

2. A: Without anti-entropy, stale cold data may remain unrepaired indefinitely because nobody is reading it.

3. B: Both heal divergence, but they spend resources in different places: one on the hot path, the other in the background.
