CRDTs and Coordination Avoidance: Metadata Growth, Compaction, and Tombstone Collection
Core Insight
Imagine the replicated workspace from the previous lessons. A user deletes the label urgent from a document in Madrid. Dublin has already seen the label and receives the delete. Lisbon was offline for two weeks and still has an old add event for urgent in its local state. When Lisbon reconnects, what prevents the deleted label from coming back?
The answer is metadata. A CRDT value often stores more than the user-visible value. It stores dots, causal context, remove markers, version-vector entries, delta history, schema markers, and sometimes explicit tombstones. That metadata is not clutter. It is evidence the merge function needs to distinguish "this update is new" from "this update was already removed."
Compaction is the work of replacing detailed evidence with smaller evidence, or deleting evidence that is no longer needed. Tombstone collection is the specific cleanup of remove evidence. The hard part is knowing when cleanup is safe. If you compact too late, metadata becomes the dominant cost. If you compact too early, old state can resurrect deleted items, duplicate effects, or create false conflicts.
The trade-off is between safety and the cost of retained history. CRDTs buy local writes and delayed merge by carrying enough history to make later reconciliation meaningful. The production skill is to define a stability boundary: the point after which old events cannot arrive in a form that would change the outcome.
The Naive Cleanup Breaks Deletes
Start with an observed-remove set of labels. An add creates a dot:
Madrid adds urgent:
  live adds:
    urgent -> {M:7}
  tombstones:
    {}
Dublin sees M:7 and removes urgent:
Dublin removes urgent after observing M:7:
  live adds:
    urgent -> {}
  tombstones:
    urgent -> {M:7}
The tombstone says: if an old message later carries add dot M:7, do not show it again.
Now imagine an operator sees that the label is not visible anywhere in the current healthy regions and deletes the tombstone to save space:
after unsafe cleanup:
  live adds:
    urgent -> {}
  tombstones:
    {}
Lisbon reconnects with an old state:
Lisbon old state:
  live adds:
    urgent -> {M:7}
Merge has no evidence that M:7 was removed, so urgent can reappear.
merged state:
  live adds:
    urgent -> {M:7}
This is why "it is not visible right now" is not a safe cleanup rule. A tombstone is needed not because the item is visible, but because old add evidence may still be in flight, on disk, in a backup, or on an offline device.
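The resurrection above can be reproduced with a minimal sketch of a tombstone-based observed-remove set. The `LabelSet` class and its method names are illustrative assumptions, not an API from the earlier lessons or from any library; a dot is modeled as an `(actor, counter)` tuple.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

Dot = Tuple[str, int]  # (actor, counter), e.g. ("M", 7)


@dataclass
class LabelSet:
    """Hypothetical OR-set of labels that keeps explicit tombstones."""
    adds: Dict[str, Set[Dot]] = field(default_factory=dict)
    tombstones: Dict[str, Set[Dot]] = field(default_factory=dict)

    def add(self, label: str, dot: Dot) -> None:
        self.adds.setdefault(label, set()).add(dot)

    def remove(self, label: str) -> None:
        # A remove tombstones only the add dots this replica has observed.
        observed = self.adds.pop(label, set())
        if observed:
            self.tombstones.setdefault(label, set()).update(observed)

    def merge(self, other: "LabelSet") -> "LabelSet":
        merged = LabelSet()
        for label in self.tombstones.keys() | other.tombstones.keys():
            merged.tombstones[label] = (
                self.tombstones.get(label, set()) | other.tombstones.get(label, set())
            )
        for label in self.adds.keys() | other.adds.keys():
            dots = self.adds.get(label, set()) | other.adds.get(label, set())
            live = dots - merged.tombstones.get(label, set())
            if live:
                merged.adds[label] = live
        return merged

    def visible(self) -> Set[str]:
        return {label for label, dots in self.adds.items() if dots}


# Lisbon still holds the old add; Dublin observed M:7 and removed it.
lisbon = LabelSet()
lisbon.add("urgent", ("M", 7))
dublin = LabelSet()
dublin.add("urgent", ("M", 7))
dublin.remove("urgent")

safe = dublin.merge(lisbon)      # tombstone still present
dublin.tombstones.clear()        # the unsafe cleanup
unsafe = dublin.merge(lisbon)    # no evidence that M:7 was removed

print(safe.visible())    # set()
print(unsafe.visible())  # {'urgent'}
```

The only difference between the two merges is whether the tombstone for M:7 still exists when Lisbon's stale state arrives.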
What Counts As Stable?
A dot or tombstone is stable when the system has enough evidence that forgetting it cannot change future merges.
Different systems use different stability rules:
all active replicas acknowledged:
  every replica in the current membership has seen the remove
durable snapshot boundary:
  old logs and old states before snapshot S will never be replayed
membership epoch change:
  replicas from epoch E are retired or must rejoin through a full repair path
time-based retention plus repair policy:
  old offline clients cannot directly merge stale state after the retention window
central compaction authority:
  one service decides when enough reachability evidence exists
Only the first sounds purely logical. The others are operational contracts. That is normal. Production compaction is where CRDT algebra meets backups, client lifetimes, rollout policy, and incident recovery.
The important difference is this:
unsafe:
  "The tombstone is old, so delete it."
safer:
  "No accepted merge path can reintroduce the removed dot after this boundary."
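As a rough illustration of the first rule only (all active replicas acknowledged), the sketch below intersects per-replica acknowledgments. The function name `stable_dots` and the shape of `acked_by_replica` are assumptions made for this example. It says nothing about offline clients or backups, which need the operational contracts listed above.

```python
def stable_dots(tombstoned, acked_by_replica, membership):
    """Return the tombstoned dots that every replica in `membership` has acknowledged.

    tombstoned:        set of (actor, counter) dots currently held as tombstones
    acked_by_replica:  dict mapping replica name -> set of dots whose removal it has seen
    membership:        iterable of replica names in the current membership
    """
    stable = set(tombstoned)
    for replica in membership:
        # A replica with no recorded acknowledgments blocks everything.
        stable &= acked_by_replica.get(replica, set())
    return stable


membership = {"Madrid", "Dublin", "Lisbon"}
dot = ("M", 7)
acks = {"Madrid": {dot}, "Dublin": {dot}}  # Lisbon has not acknowledged yet

before = stable_dots({dot}, acks, membership)
acks["Lisbon"] = {dot}
after = stable_dots({dot}, acks, membership)

print(before)  # set()
print(after)   # {('M', 7)}
```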
Worked Example: Collecting A Tombstone
Suppose a workspace has three server replicas and many mobile clients:
servers:
  Madrid
  Dublin
  Lisbon
mobile clients:
  partial replicas
  can be offline for up to 30 days
The system stores document labels as an OR-set. A remove creates a tombstone for the observed add dots.
remove label urgent:
  removed dots:
    M:7
A safe cleanup policy could be:
1. Server replicas must all acknowledge the remove.
2. The remove must be included in a durable workspace snapshot.
3. Mobile clients older than the snapshot cannot sync deltas directly.
4. Old clients must rejoin by downloading the compacted snapshot.
5. Backups older than the snapshot cannot be restored into live anti-entropy.
After those conditions, the system can replace detailed remove evidence with the snapshot boundary:
before compaction:
  live adds:
    urgent -> {}
  tombstones:
    urgent -> {M:7}
  context:
    M up to 7 is known
after compaction:
  live adds:
    urgent -> {}
  tombstones:
    {}
  compacted snapshot:
    all states before snapshot S must rejoin through S
Notice what changed. The system did not simply forget. It moved the evidence from one place to another: from per-dot tombstone storage into a rejoin and snapshot rule.
If an old Lisbon mobile client returns with M:7, the server must not merge that old state blindly. It should say:
client state is before snapshot S:
  direct delta merge rejected
  client downloads compacted state
  local unsynced operations are replayed only if they are after the accepted boundary
Compaction is safe only because the topology and rejoin protocol now enforce the same guarantee that the tombstone metadata used to enforce.
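The server-side gate in this rejoin rule could be sketched as follows. The names `ClientSync`, `SyncDecision`, and `decide_sync` are hypothetical, and modeling snapshots as increasing integer sequence numbers is an assumption for illustration. Replay of local unsynced operations is left out because its rules depend on the application.

```python
from dataclasses import dataclass
from enum import Enum


class SyncDecision(Enum):
    MERGE_DELTA = "merge_delta"
    REJOIN_FROM_SNAPSHOT = "rejoin_from_snapshot"


@dataclass(frozen=True)
class ClientSync:
    client_id: str
    base_snapshot: int  # assumed: sequence number of the snapshot the client's state builds on


def decide_sync(request: ClientSync, compacted_snapshot: int) -> SyncDecision:
    """Reject direct merges from state older than the compaction boundary."""
    if request.base_snapshot < compacted_snapshot:
        # Tombstones before this boundary are gone, so a blind merge could resurrect deletes.
        return SyncDecision.REJOIN_FROM_SNAPSHOT
    return SyncDecision.MERGE_DELTA


S = 5  # the latest compacted snapshot in this example
stale = decide_sync(ClientSync("lisbon-mobile", base_snapshot=3), S)
fresh = decide_sync(ClientSync("dublin-mobile", base_snapshot=5), S)

print(stale.value)  # rejoin_from_snapshot
print(fresh.value)  # merge_delta
```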
Compaction Techniques
Compaction usually means one of several moves.
explicit tombstones -> causal context:
  replace individual removed dots with a version-vector summary
full dot set -> version vector plus exceptions:
  summarize contiguous actor histories and keep only gaps
delta log -> durable state snapshot:
  stop retaining every old delta after peers can repair from a snapshot
many client actors -> server actors:
  route dot creation through fewer durable actors
old schema markers -> migration boundary:
  remove compatibility state only after old writers cannot appear
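To make the "version vector plus exceptions" move concrete, here is a small sketch. It assumes counters start at 1 for each actor and that the exceptions are the dots seen beyond each actor's contiguous prefix; the function names `compact_dots` and `contains` are made up for this example.

```python
def compact_dots(dots):
    """Summarize a set of (actor, counter) dots as (vector, exceptions).

    vector[actor] = n means counters 1..n from that actor are all present.
    exceptions holds the dots that sit above a gap in an actor's history.
    """
    by_actor = {}
    for actor, counter in dots:
        by_actor.setdefault(actor, set()).add(counter)

    vector, exceptions = {}, set()
    for actor, counters in by_actor.items():
        n = 0
        while n + 1 in counters:  # extend the contiguous prefix
            n += 1
        if n:
            vector[actor] = n
        exceptions |= {(actor, c) for c in counters if c > n}
    return vector, exceptions


def contains(vector, exceptions, dot):
    """True if the compact summary covers the given dot."""
    actor, counter = dot
    return counter <= vector.get(actor, 0) or dot in exceptions


dots = {("M", 1), ("M", 2), ("M", 3), ("M", 5), ("D", 1)}
vector, exceptions = compact_dots(dots)

print(vector == {"M": 3, "D": 1})           # True
print(exceptions)                            # {('M', 5)}
print(contains(vector, exceptions, ("M", 4)))  # False
```

Five dots shrink to two vector entries and one exception, and membership checks still give the same answers as the full dot set.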
Each technique has a cost.
causal summaries:
  smaller, but harder to debug than explicit tombstones
snapshots:
  compact, but require a rejoin protocol and backup discipline
fewer actors:
  smaller vectors, but more routing and less pure client-local authorship
retention windows:
  simple, but only safe if stale replicas are blocked or repaired
There is no free compaction. The metadata leaves the object only when another part of the system takes over the guarantee.
What To Measure
Metadata problems usually appear before correctness failures do. Track them directly:
tombstone_count_per_object
causal_context_bytes_per_object
vector_entries_per_object
delta_log_retention_bytes
oldest_uncompacted_dot_age
replicas_blocking_compaction
offline_client_age_distribution
snapshot_rejoin_failures
delete_resurrection_incidents
The operational question is not only "how much metadata do we have?" It is "which promise is preventing cleanup?"
metadata remains because:
  a replica has not acknowledged
  an old client can still sync
  a backup may still be restored
  a schema migration is incomplete
  a safety-sensitive remove needs longer evidence
Once you can name the blocker, you can decide whether to wait, force repair, retire a replica, shorten offline support, or redesign the CRDT state.
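One way to turn "which promise is preventing cleanup?" into something a dashboard can report is a function that names the blockers for a given tombstone. The sketch below is deliberately simplified: the parameter names are hypothetical, ages are in days, and it treats a client or backup as dangerous whenever state older than the tombstone could still enter a direct merge.

```python
def compaction_blockers(
    tombstone_age_days,
    unacked_replicas,
    max_direct_sync_age_days,
    max_direct_restore_age_days,
    migration_complete=True,
):
    """Return human-readable reasons a tombstone cannot be collected yet."""
    blockers = []
    if unacked_replicas:
        blockers.append("a replica has not acknowledged")
    # State older than the tombstone may still carry the removed add dot.
    if max_direct_sync_age_days >= tombstone_age_days:
        blockers.append("an old client can still sync")
    if max_direct_restore_age_days >= tombstone_age_days:
        blockers.append("a backup may still be restored")
    if not migration_complete:
        blockers.append("a schema migration is incomplete")
    return blockers


# A 45-day-old tombstone: servers acknowledged, clients limited to 30 days,
# but 90-day-old backups can still be restored directly into live merge.
blocked = compaction_blockers(
    tombstone_age_days=45,
    unacked_replicas=set(),
    max_direct_sync_age_days=30,
    max_direct_restore_age_days=90,
)
print(blocked)  # ['a backup may still be restored']
```

If old backups were instead forced through the snapshot rejoin path, `max_direct_restore_age_days` would drop to the snapshot age and the list could become empty.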
Failure Modes
- Deleting tombstones by age alone: Time does not prove that old adds cannot arrive.
- Forgetting offline clients: A mobile replica can carry old state long after servers look clean.
- Restoring old backups into live merge: A restore path can reintroduce dots the live system has compacted away.
- Compacting without rejoin rules: Old replicas need a safe path back, usually through a snapshot or full repair.
- Letting actor IDs explode: One vector entry per transient client can make metadata larger than the user data.
- Reusing actor counters: A restarted actor that reuses old dots can break duplicate detection and remove evidence.
- Hiding metadata from product decisions: Offline support, residency, and deletion semantics determine how much history must be retained.
- Cleaning derived indexes but not source state: Search or analytics cleanup does not prove source CRDT metadata is safe to drop.
Practice
Design a compaction policy for a replicated document label set.
system:
  three regional server replicas
  mobile clients can be offline for 30 days
  weekly durable snapshots
  backups retained for 90 days
  labels use observed-remove semantics
Answer:
1. What evidence must exist before an explicit tombstone can be removed?
2. What happens when a mobile client older than the latest compacted snapshot reconnects?
3. Can a 90-day-old backup be restored directly into anti-entropy?
4. Which metrics show that compaction is blocked?
5. What product promise would change if offline clients were allowed to merge stale state forever?
The small win is recognizing that compaction is not a garbage-collection timer. It is a correctness boundary.
Connections
- 006.md introduced dots and causal context; this lesson shows how those structures become operational retention rules.
- 008.md showed why observed-remove sets need tombstones or compact causal summaries.
- 014.md explained topology and repair paths; compaction is safe only when those paths enforce the cleanup boundary.
- 016.md moves from cleanup boundaries to transaction and atomic-update boundaries.
Resources
- [PAPER] A comprehensive study of Convergent and Commutative Replicated Data Types
  - Focus: Review the metadata carried by sets, counters, registers, and maps.
- [PAPER] An Optimized Conflict-free Replicated Set
  - Focus: Study OR-set metadata reduction and how causal context can replace explicit tombstone detail.
- [PAPER] Delta State Replicated Data Types
  - Focus: Connect delta retention, repair, and compaction to the anti-entropy protocol.
- [DOC] Riak Data Types: Sets
  - Focus: See how an available database exposes observed-remove behavior to application code.
- [BOOK] Designing Data-Intensive Applications
  - Focus: Connect tombstones, compaction, snapshots, and derived data to operational reliability.
Key Takeaways
- CRDT metadata is correctness evidence, not incidental storage overhead.
- Tombstones and causal context can be removed only after a stability boundary prevents old state from changing future merges.
- Safe compaction usually moves evidence into snapshots, rejoin rules, acknowledgments, or membership epochs.
- The operational trade-off is retained history versus storage, bandwidth, privacy, backup, and offline-support cost.