CRDTs and Coordination Avoidance: Testing Merge Laws and Counterexample Histories

LESSON

019 30 min intermediate

CRDTs and Coordination Avoidance: Testing Merge Laws and Counterexample Histories

Core Insight

Imagine an offline-first checklist that looks perfect in every normal unit test. Madrid marks one item done. Dublin marks another item done. The replicas sync, and the UI shows both items. Then a mobile client reconnects after a timeout and sends the same state twice. Suddenly the counter for completed work says 3 even though only two tasks were completed.

The bug is not in the screen. It is in the merge. A CRDT is allowed to receive the same information more than once, in different orders, through different paths. If merge is not idempotent, duplicate anti-entropy changes the answer. If merge is not commutative, the result depends on message order. If merge is not associative, tree-shaped fanout can disagree with pairwise sync.

Testing CRDTs therefore starts from algebraic laws, not only example scenarios. For state-based CRDTs, merge should behave like a join: commutative, associative, and idempotent. For operation-based CRDTs, the operation protocol needs its own equivalent contract: stable operation identity, causal preconditions when required, duplicate handling, and the same final effect under allowed delivery histories.

The practical skill is to turn those laws into generated histories. A good test does not only say "replica A and B merge once." It creates edits, partitions replicas, duplicates messages, reorders delivery, compacts state, and then preserves the smallest failing history as a counterexample that a human can understand.

Why Example Tests Miss Merge Bugs

Start with a positive counter. Each replica keeps one component per actor:

state:
  Madrid -> 2
  Dublin -> 1

value:
  3

The correct merge takes the maximum component for each actor:

merge(
  {Madrid: 2},
  {Madrid: 1, Dublin: 1}
)
= {Madrid: 2, Dublin: 1}

That rule is safe because an actor's component only increases. Seeing an older value for Madrid should not reduce it, and seeing the same value again should not add it twice.

Now consider a broken implementation:

broken_merge(a, b):
  for each actor:
    result[actor] = a[actor] + b[actor]

It may pass a naive test:

Madrid:
  {Madrid: 1}

Dublin:
  {Dublin: 1}

broken merge:
  {Madrid: 1, Dublin: 1}
  value = 2

That looks fine because the two states have disjoint actors. The bug appears when anti-entropy resends information:

state x:
  {Madrid: 1}

broken_merge(x, x):
  {Madrid: 2}

The replica counted one increment twice. Distributed systems resend messages all the time. A test suite that never tries duplicate delivery has not tested a CRDT; it has tested a happy-path API.

Merge Laws As Testable Contracts

For a state-based CRDT, merge should satisfy three core laws:

commutative:
  merge(a, b) == merge(b, a)

associative:
  merge(merge(a, b), c) == merge(a, merge(b, c))

idempotent:
  merge(a, a) == a

Commutativity means message order does not matter:

Madrid receives Dublin first
Dublin receives Madrid first
same final state

Associativity means topology does not matter:

(phone -> edge) -> region
phone -> (edge -> region)
same final state

Idempotence means repeated delivery does not matter:

send update once
send update twice
same final state

Those laws are not optional style preferences. They are the reason a state-based CRDT can use gossip, retries, partial sync, and edge fanout without coordination.

A minimal property test can be written as pseudocode:

for many generated states a, b, c:
  assert normalize(merge(a, b)) == normalize(merge(b, a))
  assert normalize(merge(merge(a, b), c)) == normalize(merge(a, merge(b, c)))
  assert normalize(merge(a, a)) == normalize(a)

normalize matters because two internal representations may be equivalent even if their maps are ordered differently or contain compacted metadata.

The trade-off is that law tests are powerful but narrow. They prove that merge has the right shape over generated inputs. They do not prove that the type models the product correctly, that authorization is enforced, or that compaction is safe under every rejoin path.

From States To Histories

Testing only randomly generated states can miss bugs in operation creation. A more realistic test generates a history:

replicas:
  phone
  madrid-edge
  region

events:
  phone increments counter
  phone syncs to madrid-edge
  region increments counter
  madrid-edge syncs to region
  phone receives region state twice

The test runner can simulate network behavior:

actions:
  local_edit(replica, operation)
  send(from, to)
  deliver(message)
  drop(message)
  duplicate(message)
  compact(replica)
  restart(replica)

After enough delivery, all connected replicas should converge:

eventually:
  normalize(phone.state) == normalize(madrid-edge.state)
  normalize(madrid-edge.state) == normalize(region.state)

The runner should also check domain invariants:

completed_count == number of distinct completed task IDs
visible comment replies have their parent comments
inventory never goes below zero without allocated rights
deleted tags do not reappear after observed-remove semantics

This second group is where the earlier coordination lessons matter. A merge law can be correct while the application invariant is still unsafe. For example, two replicas can converge perfectly on "two users claimed the same unique slug." The test should catch that as an invariant violation, not as a convergence failure.

Worked Example: The Counter That Doubles

Use the broken positive counter from above:

broken_merge(a, b):
  result = {}
  for actor in union(keys(a), keys(b)):
    result[actor] = a.get(actor, 0) + b.get(actor, 0)
  return result

Generate this tiny history:

1. Madrid starts at {}
2. Madrid increments once
3. Madrid state is {Madrid: 1}
4. Madrid receives its own state again through sync

The idempotence check fails:

merge({Madrid: 1}, {Madrid: 1})
= {Madrid: 2}

A good property-testing tool shrinks the failing run. If the original failure had ten replicas, twenty increments, and five duplicate messages, shrinking removes everything not needed to reproduce the bug.

The final counterexample should be small enough to paste into a regression test:

counterexample:
  initial: {}
  op: inc(Madrid)
  duplicate_delivery: state {Madrid: 1}
  expected: {Madrid: 1}
  actual: {Madrid: 2}

Then fix the merge:

correct_merge(a, b):
  result = {}
  for actor in union(keys(a), keys(b)):
    result[actor] = max(a.get(actor, 0), b.get(actor, 0))
  return result

The same counterexample now passes:

merge({Madrid: 1}, {Madrid: 1})
= {Madrid: 1}

This is the "aha" loop of CRDT testing: generate an ugly distributed behavior, shrink it to one understandable history, fix the law or invariant, and keep that history forever.

Testing Operation-Based CRDTs

Operation-based CRDTs need a slightly different test shape because the state may not be merged directly. Replicas exchange operations.

For an observed-remove set, an add operation carries a unique dot:

add("urgent"):
  dot = Madrid:41

A remove operation removes the dots the actor has observed:

remove("urgent"):
  removes dots seen for "urgent":
    Madrid:41

The test runner should generate deliveries that include duplicates and reordering, but only allow the delivery patterns promised by the protocol. If the design requires causal delivery, the runner should either enforce it or deliberately test the buffering layer.

Useful properties include:

duplicate operation:
  applying the same op twice has the same effect as once

causal precondition:
  a remove that refers to dot Madrid:41 is not visible before Madrid:41 is known

convergence:
  after all operations are delivered according to the contract,
  all replicas render the same set

observed-remove semantics:
  remove deletes observed adds,
  but not concurrent unseen adds

A counterexample history for an OR-set might look like this:

1. Madrid adds "urgent" with dot M:1
2. Dublin has not seen M:1
3. Dublin removes "urgent"
4. replicas sync

expected:
  "urgent" remains, because Dublin did not observe M:1

buggy implementation:
  "urgent" disappears, because remove stored only the value

That failure is not a syntax problem. It shows that the implementation forgot the semantic difference between "remove what I saw" and "delete every add that may ever exist."

What To Generate

A useful generator should create pressure that real replication creates:

replica count:
  1 to 5 replicas

local operations:
  increments, adds, removes, register writes, text inserts

network events:
  drop, delay, duplicate, reorder, partition, heal

maintenance events:
  snapshot, compact tombstone, restart, resync from cursor

authority events:
  accept, reject, allocate rights, revoke invalid operation

Start small. A two-replica generator with duplicate delivery will catch many severe bugs. Add partitions, compaction, and authorization only when the simple laws are stable.

The oracle can be one of three things:

law oracle:
  merge laws must hold

model oracle:
  compare with a slow, obviously correct reference model

invariant oracle:
  product safety condition must hold after every visible state

The best tests save the seed and the reduced history:

failure:
  seed = 923481
  history = [
    inc(Madrid),
    duplicate_delivery(Madrid -> Dublin)
  ]

Without that saved history, the team only knows "random test failed once." With it, the team has a small distributed story that can be replayed in a debugger.

Failure Modes

Only testing final happy states: A CRDT can pass one merge example and still fail duplicate delivery, reorder, or topology changes.
Forgetting idempotence: Anti-entropy normally resends information; duplicate merge must not change meaning.
Testing raw equality without normalization: Equivalent states may have different internal ordering or compacted representation.
Generating impossible histories: If the protocol requires causal delivery, the test must either model the buffer or label non-causal delivery as an invalid input.
Ignoring product invariants: Convergence does not guarantee uniqueness, bounded inventory, permission safety, or valid document structure.
Dropping counterexample histories: A random failure that cannot be replayed is a weak debugging artifact.
Shrinking away the real cause: Reducers must preserve the failure condition, not just produce a smaller but different symptom.
Skipping compaction and restart: Metadata cleanup and crash recovery often break assumptions that normal in-memory merge tests never touch.

Practice

Write properties for this small OR-set:

state:
  adds: element -> set of dots
  removes: set of dots

visible(element):
  any add dot for element is not in removes

Test these histories:

1. add("urgent") delivered twice
2. add("urgent") and remove("urgent") after observing the add
3. add("urgent") in Madrid while Dublin concurrently removes "urgent" without seeing that add
4. compact tombstones after all replicas have seen the remove
5. old client rejoins with an operation that references a compacted dot

For each one, write:

property:
  what must remain true

counterexample shape:
  what sequence would prove the implementation wrong

repair:
  law fix, metadata fix, authority boundary, or protocol precondition

Then design one generator option that creates duplicate delivery. Run it mentally against a counter, a set, and a rich text insertion. If it feels too easy, add restart after local edit but before sync acknowledgment.

Connections

008.md introduced observed-remove semantics; this lesson turns those semantics into testable histories.
015.md covered tombstone collection; compaction needs tests that include old clients and causal stability.
018.md introduced offline clients and edge replicas; generated histories should include those sync paths, not just two in-memory states.
020.md extends the same mindset into observability, debugging, and repair workflows for real incidents.

Resources

[PAPER] A comprehensive study of Convergent and Commutative Replicated Data Types
- Focus: Revisit the merge laws and CRDT families that property tests should preserve.
[PAPER] An Optimized Conflict-free Replicated Set
- Focus: Study add/remove semantics and why dots matter for testing observed-remove behavior.
[BOOK] Property-Based Testing with PropEr, Erlang, and Elixir
- Focus: Use the testing style, generators, and shrinking ideas for distributed histories.
[DOC] Jepsen: Elle
- Focus: Compare generated histories and anomaly explanation with the counterexample style in this lesson.
[ARTICLE] Testing a distributed system: generating histories
- Focus: Read Jepsen analyses for examples of turning executions into concrete failure stories.

Key Takeaways

CRDT tests should check merge laws: commutativity, associativity, and idempotence for state-based merge.
Generated histories expose bugs that simple examples miss: duplicate delivery, reordered sync, partitions, compaction, and restart.
A counterexample history is most useful when it is small, replayable, and saved as a regression test.
Convergence tests must be paired with invariant tests, because replicas can agree on a state that still violates the product's safety rules.

← Back to CRDTs and Coordination Avoidance

← Back to Distributed Systems

← Back to Learning Hub