CRDTs and Coordination Avoidance: Operation-Based CRDTs and Causal Delivery
LESSON
CRDTs and Coordination Avoidance: Operation-Based CRDTs and Causal Delivery
Core Insight
CartService can replicate by sending full mergeable state, but that is not always cheap. If a customer adds one item to a cart whose internal CRDT payload contains thousands of entries and tombstones, shipping the whole payload after every edit wastes bandwidth. The natural alternative is to send the edit itself: "add item book with tag A:42."
Operation-based CRDTs replicate updates rather than full payloads. A replica executes a local operation, turns it into a message, and other replicas apply that message to their local state. The formal name is commutative replicated data type, often abbreviated as CmRDT.
This can be much smaller than sending state, but it makes the delivery contract sharper.
The non-obvious insight is that operation-based replication moves complexity from merge algebra into message semantics. The message has to be more than "the user clicked add." It must be a prepared CRDT operation.
That operation must be designed so concurrent operations commute. The transport must also deliver enough causal context for the receiver to interpret each operation correctly. If an operation depends on a prior operation that has not arrived yet, applying it immediately can produce the wrong state.
The trade-off is bandwidth and immediacy versus delivery discipline. Operation-based CRDTs can be efficient, especially for small updates to large objects, but they need reliable dissemination, deduplication, and often causal delivery.
From Payloads To Operations
State-based replication sends the receiver a payload to join.
state-based:
A sends payload {book, charger}
B computes B = join(B, payload)
Operation-based replication sends an update event.
operation-based:
A sends add(book, tag=A:42)
B applies add(book, tag=A:42)
The operation is not just an API request. It is a prepared message with the metadata needed for all replicas to apply it safely.
For a set, the operation might include an element id, a unique dot, and causal context. For a counter, it might include the replica id and sequence number.
This is the core contract:
source replica:
validate local preconditions
assign operation metadata
apply locally
disseminate operation
receiving replica:
wait until causal dependencies are satisfied
ignore duplicate operation ids
apply operation exactly once
The CRDT property does not come from sending "less data." It comes from sending operations whose effects commute once their dependencies are respected.
Causal Delivery
Causal delivery means that if operation b depends on operation a, every replica applies a before b.
Suppose an observed-remove set stores adds with unique tags:
add(book) creates tag A:42
remove(book) removes the tags for book that the replica has observed
Replica A sees:
add(book) -> tag A:42
remove(book) -> remove tag A:42
If replica B receives remove(A:42) before it has ever received add(A:42), it must understand what that remove means. Different designs handle this differently:
- buffer the remove until the add arrives
- carry enough causal context to record the remove as a tombstone
- use a delivery layer that guarantees causal order
Without one of those choices, B might ignore the remove as meaningless and later resurrect book when the add arrives. The issue is not ordinary network delay. The issue is missing causal context.
Delivery Guarantees
Operation-based CRDTs usually need a stronger transport contract than state-based CRDTs.
Useful guarantees include:
- Reliable dissemination: every nonfaulty replica eventually receives every operation it needs.
- Deduplication: duplicate operation delivery does not apply the effect twice.
- Causal delivery or causal metadata: dependent operations are applied in a safe order.
- Stable operation identity: retries do not create new logical operations by accident.
For a grow-only set, delivery can be simple:
op id:
A:42
effect:
add book with tag A:42
duplicate:
ignored because tag A:42 is already present
For a remove-aware set, delivery is harder because remove operations refer to previous adds. The remove must either arrive after those adds or carry enough information to make the later add harmless.
This is where operation-based CRDTs begin to look less like ordinary event sourcing. The replicated operation is not only "what the user did."
It is "what effect the CRDT state machine must apply, with enough metadata to commute safely."
Worked Example: Add-Wins Set Operation
Consider an add-wins set for a collaborative checklist.
Local add:
add(item):
dot = next_dot(replica_id)
local_state[item].add(dot)
emit add(item, dot)
Local remove:
remove(item):
observed = local_state[item].dots
local_state[item].remove_all(observed)
tombstones.add_all(observed)
emit remove(item, observed)
Remote apply:
apply add(item, dot):
if dot not in tombstones:
local_state[item].add(dot)
apply remove(item, observed):
local_state[item].remove_all(observed)
tombstones.add_all(observed)
This operation design makes the conflict policy explicit. A remove only removes adds it has observed. A concurrent add with a different dot is not in the observed set, so it remains. That is why the data type is add-wins.
The operation payload is smaller than full state when the set is large. The price is that every dot must be uniquely identified, remove operations must carry the observed dots or equivalent causal context, and replicas must deduplicate operation delivery.
Operation-Based Versus State-Based
State-based CRDTs are forgiving:
receive old state:
join keeps newer local facts
receive duplicate state:
idempotent join absorbs it
miss one sync:
later anti-entropy can send the full payload again
Operation-based CRDTs are leaner but sharper:
miss an operation:
later operations may depend on it
receive duplicate operation:
receiver must detect the duplicate effect
receive remove before add:
receiver needs causal delivery or tombstone context
The choice is not a matter of which style is better in general. It is a system design decision. If bandwidth is scarce and the dissemination layer already gives causal delivery, operation-based can be attractive. If the network is unreliable and full state is affordable, state-based may be simpler to operate.
Failure Modes
- Treating API calls as CRDT operations: A user request may lack the metadata needed to commute at every replica.
- Applying operations without causal readiness: A remove can arrive before the add it references and create resurrection bugs.
- Missing deduplication: Retried operation delivery can double count a non-idempotent effect.
- Assuming reliable multicast exists: The CRDT design may be correct while the transport silently drops operations.
- Using wall-clock order as causal order: Timestamps do not prove which operation observed which prior state.
Practice
Design an operation-based replicated "starred documents" set.
Specify:
- the operation emitted by
star(document_id) - the operation emitted by
unstar(document_id) - the metadata needed to make duplicate delivery harmless
- the causal dependency of
unstar - whether concurrent
starandunstarshould be add-wins or remove-wins - what the receiver does if an
unstararrives before the correspondingstar
Then compare the design with a state-based version. Which one sends less data? Which one is easier to recover after a missed message?
Connections
003.mdintroduced state-based CRDTs, where replicas exchange payloads and merge by join.005.mdshows how delta-state synchronization recovers some operation-like efficiency while preserving a state-based merge contract.006.mddeepens the causal context vocabulary needed to understand dots, version vectors, and observed removes.
Resources
- [PAPER] A comprehensive study of Convergent and Commutative Replicated Data Types
- Focus: Compare the CvRDT and CmRDT definitions and their delivery assumptions.
- [PAPER] An optimized conflict-free replicated set
- Focus: Study how observed-remove sets use unique tags and causal context to define remove behavior.
- [BOOK] Designing Data-Intensive Applications
- Focus: Use the replication and causality chapters to keep delivery assumptions explicit.
Key Takeaways
- Operation-based CRDTs replicate prepared update operations instead of full payloads.
- The operation payload must carry enough identity and causal context for replicas to apply it safely.
- Causal delivery matters when one operation refers to or removes the effect of another operation.
- The main trade-off is bandwidth efficiency versus stronger delivery, deduplication, and causal-readiness requirements.