CRDTs and Coordination Avoidance: Operation-Based CRDTs and Causal Delivery

LESSON

CRDTs and Coordination Avoidance

004 30 min intermediate

CRDTs and Coordination Avoidance: Operation-Based CRDTs and Causal Delivery

Core Insight

CartService can replicate by sending full mergeable state, but that is not always cheap. If a customer adds one item to a cart whose internal CRDT payload contains thousands of entries and tombstones, shipping the whole payload after every edit wastes bandwidth. The natural alternative is to send the edit itself: "add item book with tag A:42."

Operation-based CRDTs replicate updates rather than full payloads. A replica executes a local operation, turns it into a message, and other replicas apply that message to their local state. The formal name is commutative replicated data type, often abbreviated as CmRDT.

This can be much smaller than sending state, but it makes the delivery contract sharper.

The non-obvious insight is that operation-based replication moves complexity from merge algebra into message semantics. The message has to be more than "the user clicked add." It must be a prepared CRDT operation.

That operation must be designed so concurrent operations commute. The transport must also deliver enough causal context for the receiver to interpret each operation correctly. If an operation depends on a prior operation that has not arrived yet, applying it immediately can produce the wrong state.

The trade-off is bandwidth and immediacy versus delivery discipline. Operation-based CRDTs can be efficient, especially for small updates to large objects, but they need reliable dissemination, deduplication, and often causal delivery.

From Payloads To Operations

State-based replication sends the receiver a payload to join.

state-based:
  A sends payload {book, charger}
  B computes B = join(B, payload)

Operation-based replication sends an update event.

operation-based:
  A sends add(book, tag=A:42)
  B applies add(book, tag=A:42)

The operation is not just an API request. It is a prepared message with the metadata needed for all replicas to apply it safely.

For a set, the operation might include an element id, a unique dot, and causal context. For a counter, it might include the replica id and sequence number.

This is the core contract:

source replica:
  validate local preconditions
  assign operation metadata
  apply locally
  disseminate operation

receiving replica:
  wait until causal dependencies are satisfied
  ignore duplicate operation ids
  apply operation exactly once

The CRDT property does not come from sending "less data." It comes from sending operations whose effects commute once their dependencies are respected.

Causal Delivery

Causal delivery means that if operation b depends on operation a, every replica applies a before b.

Suppose an observed-remove set stores adds with unique tags:

add(book) creates tag A:42
remove(book) removes the tags for book that the replica has observed

Replica A sees:

add(book) -> tag A:42
remove(book) -> remove tag A:42

If replica B receives remove(A:42) before it has ever received add(A:42), it must understand what that remove means. Different designs handle this differently:

Without one of those choices, B might ignore the remove as meaningless and later resurrect book when the add arrives. The issue is not ordinary network delay. The issue is missing causal context.

Delivery Guarantees

Operation-based CRDTs usually need a stronger transport contract than state-based CRDTs.

Useful guarantees include:

For a grow-only set, delivery can be simple:

op id:
  A:42

effect:
  add book with tag A:42

duplicate:
  ignored because tag A:42 is already present

For a remove-aware set, delivery is harder because remove operations refer to previous adds. The remove must either arrive after those adds or carry enough information to make the later add harmless.

This is where operation-based CRDTs begin to look less like ordinary event sourcing. The replicated operation is not only "what the user did."

It is "what effect the CRDT state machine must apply, with enough metadata to commute safely."

Worked Example: Add-Wins Set Operation

Consider an add-wins set for a collaborative checklist.

Local add:

add(item):
  dot = next_dot(replica_id)
  local_state[item].add(dot)
  emit add(item, dot)

Local remove:

remove(item):
  observed = local_state[item].dots
  local_state[item].remove_all(observed)
  tombstones.add_all(observed)
  emit remove(item, observed)

Remote apply:

apply add(item, dot):
  if dot not in tombstones:
    local_state[item].add(dot)

apply remove(item, observed):
  local_state[item].remove_all(observed)
  tombstones.add_all(observed)

This operation design makes the conflict policy explicit. A remove only removes adds it has observed. A concurrent add with a different dot is not in the observed set, so it remains. That is why the data type is add-wins.

The operation payload is smaller than full state when the set is large. The price is that every dot must be uniquely identified, remove operations must carry the observed dots or equivalent causal context, and replicas must deduplicate operation delivery.

Operation-Based Versus State-Based

State-based CRDTs are forgiving:

receive old state:
  join keeps newer local facts

receive duplicate state:
  idempotent join absorbs it

miss one sync:
  later anti-entropy can send the full payload again

Operation-based CRDTs are leaner but sharper:

miss an operation:
  later operations may depend on it

receive duplicate operation:
  receiver must detect the duplicate effect

receive remove before add:
  receiver needs causal delivery or tombstone context

The choice is not a matter of which style is better in general. It is a system design decision. If bandwidth is scarce and the dissemination layer already gives causal delivery, operation-based can be attractive. If the network is unreliable and full state is affordable, state-based may be simpler to operate.

Failure Modes

Practice

Design an operation-based replicated "starred documents" set.

Specify:

Then compare the design with a state-based version. Which one sends less data? Which one is easier to recover after a missed message?

Connections

Resources

Key Takeaways

PREVIOUS CRDTs and Coordination Avoidance: State-Based CRDTs and Merge Semantics NEXT CRDTs and Coordination Avoidance: Delta-State Synchronization and Anti-Entropy