CRDTs and Coordination Avoidance: Sets, Tombstones, and Observed-Remove Semantics
LESSON
CRDTs and Coordination Avoidance: Sets, Tombstones, and Observed-Remove Semantics
Core Insight
Consider a shared checklist replicated to phones, browsers, and two regions. Ana adds milk while she is online in Madrid. Bruno later sees milk in another replica and removes it because the item is already bought. At the same time, Celia is offline and adds milk again because she is planning a second trip. When every replica reconnects, should milk be present?
A plain set cannot answer that question. It only knows whether the value milk is in or out. A replicated set needs to know which addition a removal actually saw. If Bruno's removal observed Ana's add, it can remove that add. If it did not observe Celia's concurrent add, deleting Celia's intent would be a hidden lost update.
Observed-remove semantics make that distinction explicit. An add creates a unique event identifier, often called a dot. A remove deletes the dots it has observed for that element. Tombstones, or compact causal summaries that play the same role, keep removed dots from returning when old state arrives later.
The design trade-off is that a set can remain available and converge without asking every replica for permission, but it must carry enough causal evidence to make removals precise. The system pays metadata and cleanup costs so that "remove what I saw" does not accidentally become "remove every add with this value forever."
Why Plain Sets Break
Start with the simplest possible replicated set:
Replica A set: {milk}
Replica B set: {}
merge: union
result: {milk}
Union is a valid merge for a grow-only set, usually called a G-set. It is monotonic: once an element appears, it never disappears. That is useful for facts that only accumulate, such as "devices ever registered" or "audit markers ever seen."
Most application sets need removal. A tempting next step is a two-phase set, or 2P-set. It keeps two grow-only sets:
adds: values ever added
removes: values ever removed
visible value = adds - removes
This converges because both internal sets grow by union. It also has a sharp limitation: once milk appears in removes, milk can never be visible again. The tombstone is at the value level, not at the add-event level.
For some domains, that is acceptable. If account_id=123 is revoked forever, a 2P-style rule may fit. For a shopping list, tag set, membership list, or collaborative document selection, it is too strong. People remove and re-add the same value all the time.
The problem is not that sets are impossible to merge. The problem is that an application value such as milk is not enough information. The replica must distinguish one add of milk from another add of milk.
Adds Need Identity
An observed-remove set, often shortened to OR-set, treats each add as a separate event. The visible element is still milk, but internally the set stores one or more dots for that element.
A dot is a unique causal identifier, commonly shaped like:
(replica_id, local_counter)
If replica A adds milk, it might create dot A:10:
adds:
milk -> {A:10}
tombstones:
{}
visible:
{milk}
If replica B also adds milk without seeing A's add, it creates a different dot:
adds:
milk -> {B:4}
After merge, both additions are known:
adds:
milk -> {A:10, B:4}
visible:
{milk}
The user sees one checklist item, but the data type remembers two independent reasons that the item is present. That internal distinction is what makes removal precise.
Observed Remove
Now suppose B removes milk after seeing only dot A:10. The remove operation does not say "delete the word milk from all time." It says "for element milk, remove the add dots I have observed."
B removes milk after observing A:10
adds:
milk -> {A:10}
tombstones:
milk -> {A:10}
visible:
{}
If an old message later brings A:10 back, the tombstone prevents resurrection:
incoming old state:
milk -> {A:10}
local tombstone:
milk -> {A:10}
visible after merge:
{}
That is the practical job of a tombstone. It is evidence that a particular add event was removed. Without that evidence, anti-entropy could replay older add state and make deleted items reappear.
Now add concurrency. C adds milk as dot C:2 while B is removing the observed dot A:10. B did not see C:2, so B's remove cannot honestly remove it.
B state:
tombstones:
milk -> {A:10}
C state:
adds:
milk -> {C:2}
merge:
adds:
milk -> {A:10, C:2}
tombstones:
milk -> {A:10}
visible:
{milk} because C:2 is still live
This is called add-wins observed-remove behavior: a concurrent add survives a remove that did not observe it. The name can be misleading if read casually. The add does not win because adds are more important. It wins because the remove lacks evidence that it was meant to remove that particular add.
Tombstones and Causal Context
The teaching model above stores tombstones directly:
removed dots:
milk -> {A:10}
Many production implementations do something more compact. They keep a causal context: a summary of dots already known by the replica, often represented with version vectors plus exceptions. If a dot is covered by the causal context and no longer appears in the live add set, the system can infer that the dot was removed.
The idea is the same as the dots and version vectors from the earlier causal-context lesson. The set needs to answer two questions:
Have I seen this add dot before?
Is this add dot still live, or was it removed?
Explicit tombstones are easy to explain and debug, but they can grow without bound. Compact causal context is more efficient, but it is more subtle to implement. Either way, cleanup must be conservative. A replica can drop remove evidence only when it knows no delayed replica or old message can reintroduce the removed dot.
This creates an operational trade-off. Keeping tombstones forever is safe but expensive. Collecting them too early is cheap but can resurrect deletes. Good systems need a stability rule, such as "every replica in this membership epoch has acknowledged seeing this dot" or "states older than this durable snapshot cannot reappear."
Remove-Wins Is A Different Policy
Observed-remove sets are often configured so concurrent add wins. That is not the only possible policy.
A remove-wins set gives deletion priority over a concurrent add for the same element. That can fit domains where a remove is a safety decision. For example, if an operator removes a compromised API key from an allow-list, a concurrent add of the same key should probably not keep it active.
Remove-wins behavior requires different metadata. The remove must be able to suppress adds it has not individually observed, usually by recording removal context for an element or by using allocation rules that make later adds prove they happened after the remove. That can make the data type more expensive and the semantics harder to explain to application developers.
The important design question is user intent:
shopping list item:
concurrent add usually means "someone still wants it"
add-wins is often reasonable
security allow-list entry:
concurrent remove may mean "do not trust it"
remove-wins or coordination may be required
The CRDT gives convergence. It does not choose the product meaning for you.
Worked Example: Labels On A Task
Imagine a replicated task tracker. Each task has a set of labels:
task.labels = {urgent, backend}
Replica A and B both start with this live dot:
backend -> {A:1}
A removes backend because the task moved to frontend work:
A remove backend:
tombstone backend -> {A:1}
At the same time, B is offline. A teammate at B adds backend again after deciding the API still needs backend work:
B add backend:
backend -> {B:8}
When the replicas merge:
adds:
backend -> {A:1, B:8}
tombstones:
backend -> {A:1}
live dots:
backend -> {B:8}
visible labels:
{backend}
This result is not a conflict bug. It says A removed the old reason for backend, while B introduced a new reason. If the product wants "remove the label even if someone concurrently re-adds it," the team should not call this implementation wrong. It should choose a remove-wins set or require coordination for that label category.
Failure Modes
- Using a G-set where deletion is part of the product: The set converges, but users can never remove values.
- Using a 2P-set for reusable values: A value-level tombstone prevents legitimate re-adds of the same value.
- Dropping tombstones too early: A delayed add message can make a removed value reappear.
- Treating add-wins as universally correct: Concurrent additions survive only because the remove did not observe them; some domains need remove-wins or coordination.
- Reusing dots after restart: A replica that loses its counter state can create ambiguous identifiers, breaking the remove evidence.
- Hiding metadata growth: Set CRDTs can look small at the API while storing many dots, tombstones, or causal-context entries internally.
Practice
For each replicated field, decide whether an add-wins OR-set is appropriate:
team members in a shared workspace
labels on an issue
revoked authentication tokens
items in a collaborative grocery list
For each choice, write down what a concurrent add and remove should mean. If your answer depends on safety, billing, access control, or legal responsibility, the field may need remove-wins semantics or explicit coordination instead of a default add-wins set.
Connections
006.mdintroduced dots and causal context, which are the metadata that make observed-remove semantics possible.007.mdshowed why "one value" is not enough for registers; this lesson applies the same observation discipline to membership in a set.009.mdbuilds on sets by placing CRDT fields inside maps and nested objects, where tombstone and schema choices interact.
Resources
- [PAPER] A comprehensive study of Convergent and Commutative Replicated Data Types
- Focus: Read the set CRDT definitions and compare grow-only, two-phase, and observed-remove designs.
- [PAPER] An Optimized Conflict-free Replicated Set
- Focus: Study OR-set implementation details, metadata reduction, and the role of causal context.
- [DOC] Riak Data Types: Sets
- Focus: Look at how an available database exposes set behavior and update semantics to application code.
- [BOOK] Designing Data-Intensive Applications
- Focus: Use the replication chapters to connect tombstones, causality, and conflict resolution to operational behavior.
Key Takeaways
- A replicated set needs add identities, not just element values, when removes and re-adds can happen.
- Observed-remove semantics delete only the add dots the remover has seen; concurrent adds survive under add-wins OR-set behavior.
- Tombstones or compact causal context prevent removed add events from reappearing during delayed synchronization.
- The main trade-off is availability and deterministic convergence versus metadata, cleanup, and choosing the right add-wins or remove-wins domain policy.