CRDTs and Coordination Avoidance: Delta-State Synchronization and Anti-Entropy

LESSON

CRDTs and Coordination Avoidance

005 30 min intermediate

CRDTs and Coordination Avoidance: Delta-State Synchronization and Anti-Entropy

Core Insight

Imagine a shared inventory service replicated across three regions. Each region accepts local updates so warehouse workers can keep scanning items even when a wide-area link is slow. A state-based CRDT gives the service a comforting recovery story: when replicas exchange state, each receiver joins the incoming payload with local state, and old or duplicate messages are harmless.

The problem is size. If the inventory object contains millions of item counters, shipping the full state after every local scan is wasteful. Operation-based replication avoids that by sending only the update, but then the system inherits a sharper delivery contract: operations must arrive reliably, duplicates must be controlled, and causal dependencies must be respected.

Delta-state CRDTs sit between those two approaches. A local update produces a small piece of state, called a delta. Replicas later merge that delta with the same join operation used for full state.

The non-obvious idea is that the delta is not an arbitrary patch or command. It is itself joinable CRDT state. That means it can be resent, batched, deduplicated, or merged with full state without changing the convergence rule.

This changes the design decision. Instead of choosing only between "send the whole payload" and "depend on an operation delivery layer," a system can send compact joinable fragments and rely on anti-entropy to repair missed sync. Delta-state synchronization keeps the forgiving merge semantics of state-based CRDTs while reducing the bandwidth pressure that made full-state exchange unattractive.

From Full State To Deltas

A state-based CRDT update has two visible effects. It changes local state, and it gives other replicas something they can later merge.

local update:
  state = update(state)

remote synchronization:
  state = join(state, incoming_state)

The join must be associative, commutative, and idempotent. Because of that, replicas do not need a perfect transport. If a message arrives twice, the second join changes nothing. If messages arrive out of order, the final joined state is the same. If a sync is missed, a later full-state sync can still repair the replica.

The cost is that incoming_state may be large.

replica A state:
  product-1 -> count facts
  product-2 -> count facts
  product-3 -> count facts
  ...
  product-1000000 -> count facts

If a worker increments only product-3, sending the entire object is technically correct but operationally expensive. Delta-state synchronization changes the local update so it returns a small joinable fragment:

local update:
  delta = delta_update(state)
  state = join(state, delta)
  remember delta for later sync

Remote replicas still use the same style of merge:

remote receive:
  state = join(state, delta)

The delta is not a command like increment product-3. It is a small state fragment that represents the new fact created by that command.

For a grow-only counter, the delta might be "replica A's component is now 42." For an observed-remove set, the delta might contain a new add dot, a remove context, or both.

That distinction matters because a delta can be treated like state. It can be duplicated, delayed, batched with other deltas, or joined into a full snapshot. The receiver does not need to replay a command with hidden preconditions; it joins a payload that already fits the CRDT's semilattice.

Delta Intervals

Real systems rarely send each delta immediately to every peer. They queue, batch, acknowledge, and retry. A common shape is a delta interval: a join of several consecutive deltas produced by one replica.

A produces:
  d1, d2, d3, d4

A may send:
  join(d1, d2)
  join(d3, d4)
  join(d1, d2, d3, d4)

Because the interval is also joinable state, the receiver does not care whether it receives one large interval or several smaller ones, as long as it eventually receives enough information to cover the updates that matter. The join absorbs overlap:

B receives join(d1, d2)
B receives join(d2, d3)

effect:
  d2 is present twice, but idempotence makes that harmless

This gives the synchronization layer useful freedom. It can send recent deltas frequently on a fast link, send larger intervals to a slow peer, and fall back to full state when a peer is too far behind or the local delta log has been compacted.

The design still needs bookkeeping. A replica must know which deltas have probably reached which peers, which deltas can be compacted, and when a peer needs a larger repair message. Delta-state CRDTs reduce payload size, but they do not remove the need for an anti-entropy protocol.

Anti-Entropy As Repair

Anti-entropy is the background repair loop. It keeps replicas converging despite missed messages, partitions, restarts, and uneven peer schedules.

It is not a special CRDT data type. It is the sync discipline around the data type.

A simple anti-entropy loop looks like this:

periodically for each peer:
  choose deltas the peer has not acknowledged
  send a delta interval
  record the peer's acknowledgment

on receive:
  join incoming delta interval into local state
  optionally remember the interval for forwarding
  send acknowledgment or causal summary

The loop has two jobs. First, it moves compact recent changes. Second, it provides a repair path when something was missed. If replica B is offline for an hour, A may no longer want to send every individual delta. A can send a larger delta interval, a compacted summary, or a full state snapshot. The receiver still joins whatever it gets.

That forgiving fallback is the main difference from a pure operation stream. If an operation stream loses an operation that later operations depend on, the receiver may need an ordered replay or a causal repair path. With delta-state synchronization, the repair message is still state. A later joined payload can cover facts that earlier messages failed to deliver.

Anti-entropy also shapes topology. A system might use pairwise gossip, hub-and-spoke sync, regional fanout, or opportunistic peer exchange. Delta-state CRDTs make these choices less brittle because messages are mergeable, but topology still affects convergence latency, metadata size, and network load.

Worked Example: Delta Counter Sync

Consider a positive-negative counter used to track inventory adjustments. Each replica keeps two grow-only maps:

P: increments by replica
N: decrements by replica

value = sum(P) - sum(N)

Replica A increments the counter by 5:

before:
  P[A] = 10

local update:
  P[A] = 15

delta:
  P[A] = 15

The delta does not need to include every other replica component. It only needs to include the new fact for A's component. When B receives it:

B.P[A] = max(B.P[A], 15)

If B receives the same delta twice, the second max has no effect. If B first receives a full state where P[A] = 14 and later receives the delta where P[A] = 15, the join still moves B forward. If B already has P[A] = 18, the older delta is absorbed.

Now add anti-entropy. Replica A keeps a short log:

d101: P[A] = 15
d102: N[A] = 2
d103: P[A] = 18

For a nearby peer, A may send every delta quickly. For a peer that was offline, A may send:

join(d101, d102, d103)

or simply send its current compact state. All of those messages are compatible with the same merge rule. The engineering question is not whether convergence is possible; it is which sync strategy gives acceptable bandwidth, memory, and catch-up time.

Bandwidth, Bookkeeping, and Repair Trade-offs

Delta-state synchronization is easy to oversell. A small delta is valuable only when it actually captures the useful change without carrying more metadata than a full payload would have cost.

For a simple counter, deltas are tiny. For a remove-aware set, a delta may need add dots, tombstones, or causal context. For a nested map CRDT, an update may produce a delta for one nested field, but the receiver still needs enough structure to join it into the right place. For a sequence CRDT, the metadata for insertion positions can dominate the user-visible content.

The anti-entropy layer also creates operational pressure:

The central trade-off is bandwidth versus bookkeeping. Smaller messages are useful only if the system can afford the retained delta history, peer knowledge, compaction rules, and repair paths needed to make those messages reliable over time. The useful mental model is that delta-state CRDTs optimize synchronization, not semantics. The convergence law still lives in the join.

Design Review

Suppose a multi-region document service stores user preference maps as CRDTs. A full preference map may contain thousands of keys, but most updates touch one key.

A reasonable delta-state design asks:

Those questions keep the design honest. "We send deltas" is not enough. A production design needs to say what the delta contains, how it is merged, how peers acknowledge it, and how the system repairs gaps.

Failure Modes

Practice

Design delta-state synchronization for a replicated shopping cart.

Specify:

Then compare your design with the operation-based cart from the previous lesson. Which design has a simpler recovery path? Which design sends smaller messages during normal operation? Which one makes causal dependencies more explicit?

Connections

Resources

Key Takeaways

PREVIOUS CRDTs and Coordination Avoidance: Operation-Based CRDTs and Causal Delivery NEXT CRDTs and Coordination Avoidance: Causal Context, Version Vectors, and Dots