CRDTs and Coordination Avoidance: Transactions, Causal Consistency, and Atomic Updates

Lesson 016 · 30 min · intermediate

Core Insight

Imagine a user moves a task from todo to done while offline. In the local app, that feels like one action. Under the hood, a naive model might represent it as two updates:

remove task T from todo
add task T to done

If those two updates arrive at another replica separately, a user may briefly see the task in both columns, or in neither column. The CRDT fields can each be correct and mergeable, while the product action still leaks a half-finished state.

This lesson is about the boundaries between three ideas that are easy to blur. Causal consistency says dependent facts should become visible in causal order: do not show the reply before the comment it replies to. Atomic updates say a group of changes should become visible together: do not show half of a task move. Transactions are a broader family of guarantees: some require coordination, while others, offered by highly available systems, provide only local atomicity and causal visibility.

The non-obvious point is that atomic grouping does not automatically mean global serializable transactions. You can often group mergeable effects into one operation or one delta and keep the fast path local. But if the group must decide a non-mergeable invariant, such as allocating a unique name or preventing oversell when no pre-allocated rights are available, then the group needs a coordination boundary.

The Naive Multi-Field Update

Use the project board from the previous lessons.

board:
  todo -> OR-set of task IDs
  done -> OR-set of task IDs

A task starts in todo:

todo = {T}
done = {}

Madrid moves it to done while disconnected:

remove T from todo
add T to done

Dublin receives only the add first:

todo = {T}
done = {T}

Later it receives the remove:

todo = {}
done = {T}

The final state may be fine, but the intermediate state exposed a broken user promise: a task should not appear in two columns because one human action was delivered in pieces.
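The delivery sequence above can be replayed in a few lines. This is a minimal sketch, not a real CRDT: plain Python sets stand in for OR-sets, and the names `apply_op` and `columns_of` are hypothetical helpers introduced only for illustration.

```python
# Simplified stand-in: plain Python sets instead of real OR-sets.
def apply_op(board, op):
    """Apply one (kind, column, task) update to a board."""
    kind, column, task = op
    if kind == "add":
        board[column].add(task)
    elif kind == "remove":
        board[column].discard(task)
    else:
        raise ValueError(f"unknown op kind: {kind}")

def columns_of(board, task):
    """Return the sorted list of columns that currently show the task."""
    return sorted(col for col, tasks in board.items() if task in tasks)

# Dublin starts with T in todo.
dublin = {"todo": {"T"}, "done": set()}

# Madrid's single user action, split into two independent updates.
remove_op = ("remove", "todo", "T")
add_op = ("add", "done", "T")

# The add arrives first, the remove later.
apply_op(dublin, add_op)
intermediate = columns_of(dublin, "T")   # T is in both columns here
apply_op(dublin, remove_op)
final = columns_of(dublin, "T")          # T is only in done

print(intermediate, final)
```

The final state is the intended one, but any reader who looks at the board between the two deliveries sees the task twice.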

One fix is to change the model:

task.status -> register or workflow CRDT

Now "move task to done" is one field update:

task T:
  status = done

That is often the cleanest answer. If the domain has a single logical fact, store it as one logical fact. Do not split it across two sets and then ask the system to hide the split.

But not every action can be modeled as one field. Creating a comment may update a comments map, a notification feed, a mention index, and last_activity_at. The system needs a way to group related effects.

Causal Consistency

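Causal consistency preserves "happened after because it observed" relationships: if an operation was issued after its author observed another operation, no replica may show the later one without the earlier one.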

Suppose Ana writes a comment and then Bruno replies after seeing it:

C1: Ana writes "Can we ship this?"
R1: Bruno replies "Yes" after seeing C1

A causally consistent replica should not show R1 without C1.

bad visibility:
  R1 visible
  C1 missing

causal visibility:
  if R1 is visible,
  then C1 is visible too

The system can enforce this with dependency metadata:

operation R1:
  dot: B:12
  dependencies:
    A:44  # comment C1

When a replica receives R1, it checks whether it already has the dependencies. If not, it can buffer R1, fetch missing state, or hide it until the dependency arrives.
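The buffering option can be sketched as follows. This is an illustrative sketch under simple assumptions: dots are plain strings, dependencies are a set of dots, and the `Op` and `Replica` classes are hypothetical names rather than part of any particular library.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Op:
    dot: str                       # unique identity of this operation
    payload: str                   # what the user would see
    deps: frozenset = frozenset()  # dots that must be visible first

@dataclass
class Replica:
    applied: set = field(default_factory=set)    # dots already visible
    visible: list = field(default_factory=list)  # payloads in visibility order
    pending: list = field(default_factory=list)  # ops waiting on dependencies

    def receive(self, op):
        # Ignore duplicate deliveries of the same dot.
        if op.dot in self.applied or any(p.dot == op.dot for p in self.pending):
            return
        self.pending.append(op)
        self._drain()

    def _drain(self):
        # Keep releasing buffered ops until no more become ready.
        progressed = True
        while progressed:
            progressed = False
            for op in list(self.pending):
                if op.deps <= self.applied:
                    self.pending.remove(op)
                    self.applied.add(op.dot)
                    self.visible.append(op.payload)
                    progressed = True

dublin = Replica()
dublin.receive(Op("B:12", "R1: Yes", frozenset({"A:44"})))
after_reply_only = list(dublin.visible)   # R1 is buffered, nothing visible
dublin.receive(Op("A:44", "C1: Can we ship this?"))
after_comment = list(dublin.visible)      # C1 first, then R1
```

A production system would also need the repair path for dependencies that never arrive; this sketch only shows the hold-and-release step.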

Causal consistency is not the same as total order. Two comments written independently can appear in different orders at different replicas as long as causally dependent facts do not appear backwards.

concurrent:
  C2 from Madrid
  C3 from Dublin
  no causal dependency between them

allowed:
  Madrid shows C2 then C3
  Dublin shows C3 then C2

That is why causal consistency is attractive in coordination-avoidance designs. It protects many user expectations without forcing every replica to agree on one global sequence of all events.
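One common way to tell "causally ordered" from "concurrent" is to compare version vectors. The sketch below assumes each operation carries a version vector as a dict from replica ID to counter; the function names and the specific counters are illustrative assumptions, not taken from the lesson series.

```python
def dominates(a, b):
    """True if vector a has seen everything that vector b has seen."""
    return all(a.get(replica, 0) >= count for replica, count in b.items())

def relation(a, b):
    """Classify how the operation with vector a relates to the one with b."""
    a_ge, b_ge = dominates(a, b), dominates(b, a)
    if a_ge and b_ge:
        return "equal"
    if b_ge:
        return "before"      # a happened before b
    if a_ge:
        return "after"       # a happened after b
    return "concurrent"      # neither observed the other

# Bruno's reply was written after observing Ana's comment.
c1 = {"A": 44}
r1 = {"A": 44, "B": 12}

# Two independent comments from Madrid and Dublin.
c2 = {"M": 201}
c3 = {"D": 88}

reply_relation = relation(c1, r1)
independent_relation = relation(c2, c3)
```

Only the "before" and "after" cases constrain visibility order. For the "concurrent" case each replica is free to pick its own display order.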

Atomic Update Groups

Atomic visibility means a set of effects appears together.

For a comment creation:

transaction group G:
  add comment C1 to comments map
  add mention notification N1 to Ana's inbox
  update document last_activity_at

The group should be applied as one unit at each replica:

if all effects in G are ready:
  make G visible
else:
  buffer or hide G

In a CRDT system, this group can still be mergeable if each effect is mergeable and the group carries a stable identity:

group_id = Madrid:tx:91
dependencies = causal context observed by the writer
effects:
  comments.add(C1 with dot M:201)
  inbox.add(N1 with dot M:202)
  last_activity.join(timestamp with dot M:203)

Every replica deduplicates group_id. It applies the group only once. It waits for dependencies if the group refers to state that has not arrived yet. The group gives atomic visibility; the individual CRDT effects still define how concurrent groups merge.

This is a different promise from "the whole world agreed on this transaction before it committed." The writer committed locally. Other replicas make the group visible when they can interpret it safely.
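The receive path for such a group might look like the sketch below. It is a simplified model under stated assumptions: the replica is single-threaded, so applying all effects inside one method call stands in for atomic visibility; per-target state is a plain dict from dot to value rather than a real CRDT; and `Group`, `Replica`, and the dot `D:7` are hypothetical names used only for the example.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Group:
    group_id: str
    deps: frozenset    # dots the writer had observed
    effects: tuple     # (target, dot, value) triples

@dataclass
class Replica:
    seen_dots: set = field(default_factory=set)
    applied_groups: set = field(default_factory=set)
    state: dict = field(default_factory=dict)     # target -> {dot: value}
    buffered: dict = field(default_factory=dict)  # group_id -> Group

    def receive(self, group):
        if group.group_id in self.applied_groups:
            return "duplicate"               # deduplicate by group_id
        if not group.deps <= self.seen_dots:
            self.buffered[group.group_id] = group
            return "buffered"                # hide until dependencies arrive
        self._apply(group)
        self._retry_buffered()
        return "applied"

    def _apply(self, group):
        # All effects land in one step; no reader runs in between.
        for target, dot, value in group.effects:
            self.state.setdefault(target, {})[dot] = value
            self.seen_dots.add(dot)
        self.applied_groups.add(group.group_id)
        self.buffered.pop(group.group_id, None)

    def _retry_buffered(self):
        progressed = True
        while progressed:
            progressed = False
            for g in list(self.buffered.values()):
                if g.deps <= self.seen_dots:
                    self._apply(g)
                    progressed = True

comment_group = Group(
    "Madrid:tx:91",
    frozenset({"D:7"}),   # assumed dot of the document the comment attaches to
    (("comments", "M:201", "C1"),
     ("inbox:Ana", "M:202", "N1"),
     ("last_activity", "M:203", "2024-01-01T00:00:00Z")),
)
doc_group = Group("Dublin:tx:3", frozenset(), (("documents", "D:7", "doc"),))

replica = Replica()
first = replica.receive(comment_group)    # dependency D:7 is missing
second = replica.receive(doc_group)       # releases the buffered group too
third = replica.receive(comment_group)    # retry is ignored
```

Note that the replica never exposes a state with the comment but without the notification: either the whole group is hidden in `buffered` or all three effects are in `state`.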

When Atomic Is Not Enough

Atomic grouping prevents partial visibility. It does not turn unsafe invariants into safe ones.

Consider a group that claims a unique workspace slug and creates the workspace:

group:
  reserve slug "atlas"
  create workspace W

If Madrid and Dublin both create an atomic group for atlas while partitioned, each group is internally complete. The merge still violates uniqueness unless the slug claim went through the authority for atlas.

Madrid group:
  atlas -> workspace W1

Dublin group:
  atlas -> workspace W2

merge:
  two complete groups
  same unique slug

Atomicity solved "do not show half a workspace." It did not solve "only one workspace may own this slug."
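The conflict is easy to reproduce. In this minimal sketch, slug claims are modeled as a dict from slug to the set of workspaces claiming it, and merge is a plain union; the helper names are hypothetical and the model ignores everything else a workspace group would carry.

```python
def merge_claims(a, b):
    """Union two replicas' slug claims (slug -> set of workspace IDs)."""
    merged = {}
    for claims in (a, b):
        for slug, owners in claims.items():
            merged.setdefault(slug, set()).update(owners)
    return merged

def uniqueness_violations(claims):
    """Return the slugs that ended up with more than one owner."""
    return {slug: owners for slug, owners in claims.items() if len(owners) > 1}

# Each replica applied its own group completely while partitioned.
madrid = {"atlas": {"W1"}}
dublin = {"atlas": {"W2"}}

merged = merge_claims(madrid, dublin)
violations = uniqueness_violations(merged)
print(violations)
```

The merge is well defined and convergent, yet the invariant is broken. Detecting the violation after the fact is possible; preventing it requires the claim to be decided in one place.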

Use this rule:

atomic group of mergeable effects:
  can often stay highly available

atomic group that consumes rights:
  can stay local only if rights are available

atomic group that decides uniqueness or absence:
  must route to the authority or coordinate

atomic group that changes safety-sensitive access:
  needs remove-wins semantics, epoch boundaries, leases, or coordination

The trade-off is precise. Atomic visibility improves user experience and avoids half-states, but it adds buffering, dependency tracking, deduplication, and retry complexity. Coordination is still required when the group itself makes a non-mergeable decision.
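The rule above can be written as a small routing function. This is a sketch of one possible encoding: the `Effect` categories mirror the four cases in the rule, while the returned route names are hypothetical labels, and the "acquire-rights-first" branch is an assumption about what a system does when local rights run out.

```python
from enum import Enum, auto

class Effect(Enum):
    MERGEABLE = auto()
    CONSUMES_RIGHTS = auto()
    DECIDES_UNIQUENESS = auto()
    CHANGES_ACCESS = auto()

def route(effects, rights_available=False):
    """Pick the write path for a group from the kinds of effects it contains."""
    effects = set(effects)
    if Effect.DECIDES_UNIQUENESS in effects:
        return "authority-or-coordinate"
    if Effect.CHANGES_ACCESS in effects:
        # Remove-wins semantics, epoch boundaries, leases, or coordination.
        return "safety-policy"
    if Effect.CONSUMES_RIGHTS in effects and not rights_available:
        # Assumption: without local rights the group cannot commit locally.
        return "acquire-rights-first"
    return "local"

comment_route = route([Effect.MERGEABLE])
purchase_route = route([Effect.MERGEABLE, Effect.CONSUMES_RIGHTS], rights_available=True)
sold_out_route = route([Effect.CONSUMES_RIGHTS], rights_available=False)
slug_route = route([Effect.MERGEABLE, Effect.DECIDES_UNIQUENESS])
```

The most restrictive effect in the group decides the route: one uniqueness claim is enough to pull an otherwise mergeable group off the local fast path.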

Worked Example: Move And Notify

Design a local action:

move task T to done
notify watchers

A better domain model stores the task's status as one field:

task T:
  status -> workflow register

The notification is derived from the move:

notification N:
  "Task T moved to done"
  depends_on: status update S

The local replica emits one group:

group M:tx:17
dependencies:
  current task version
effects:
  task_status.write(done, dot M:40)
  notifications.add(N, dot M:41, depends_on M:40)

A remote replica handles it like this:

receive group M:tx:17
if dependencies missing:
  buffer group
else:
  apply status write
  apply notification add
  make both visible together

If another replica concurrently moves the same task to blocked, the status register's merge policy still matters:

Madrid:
  status = done

Dublin:
  status = blocked

merge:
  multi-value conflict
  or workflow resolver
  or authority decision

The group did not hide the conflict. It kept each user's action internally coherent, then let the domain policy decide what concurrent moves mean.
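The multi-value option can be sketched as a register that keeps every write not superseded by another. This sketch assumes each write records the complete set of dots its author had observed (its full causal context); the `Write` class, the `merge` function, and the dots `M:39`, `D:9`, and `D:10` are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Write:
    value: str
    dot: str
    observed: frozenset = frozenset()  # every dot the writer had seen

def merge(a, b):
    """Keep the writes that no other write has observed."""
    candidates = set(a) | set(b)
    superseded = set()
    for w in candidates:
        superseded |= w.observed
    return {w for w in candidates if w.dot not in superseded}

# Both replicas saw the original "todo" write, then moved the task concurrently.
madrid = Write("done", "M:40", frozenset({"M:39"}))
dublin = Write("blocked", "D:9", frozenset({"M:39"}))

conflict = merge({madrid}, {dublin})
conflict_values = sorted(w.value for w in conflict)

# A later write that observed both siblings resolves the conflict.
resolved = Write("done", "D:10", frozenset({"M:39", "M:40", "D:9"}))
settled = merge(conflict, {resolved})
settled_values = sorted(w.value for w in settled)
```

Neither concurrent move is silently dropped. The register surfaces both values until a workflow resolver, a user, or an authority writes a value that has observed both.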

Failure Modes

Splitting one logical fact across several fields: A task can appear in two columns if movement is modeled as remove-from-one-set plus add-to-another-set.
Assuming atomic means serializable: Atomic visibility does not create one global order or preserve non-mergeable invariants.
Showing causally dependent data too early: A reply, notification, or index entry can appear before the source fact it depends on.
Missing transaction IDs: Retries can apply the same group twice unless the group has stable identity and deduplication.
Buffering forever: A replica that waits for missing dependencies needs repair, timeout, or rejoin behavior.
Coordinating every group: Many groups contain only mergeable effects and do not need global agreement.
Ignoring derived views: Notifications, indexes, and counters should name their source dependencies or be repairable.
Using wall-clock order as causality: A later timestamp does not prove that one operation observed another.

Practice

Design the write path for this offline action:

Action:
  create a comment
  mention two users
  update last_activity_at
  add notifications

Fill in:

1. Which effects are authoritative source state?
2. Which effects are derived views?
3. What dependencies should the notification carry?
4. What is the transaction or group identity?
5. Which effects can be visible only when the group is complete?
6. Does the action decide any non-mergeable invariant?

Then change the action:

create workspace with unique slug "atlas"

Explain why atomic grouping is not enough, and name the authority or coordination path that must decide the slug claim.

Connections

006.md introduced causal context; this lesson uses it for visibility and dependency tracking.
013.md decomposed domain objects by promise; atomic groups keep related promises from leaking half-states.
015.md showed cleanup boundaries; buffered transactions and missing dependencies need similar repair and rejoin discipline.
017.md applies these ideas to collaborative editing, where sequences and rich text depend heavily on causal and atomic edit semantics.

Resources

[PAPER] Highly Available Transactions: Virtues and Limitations
- Focus: Separate highly available transactional guarantees from guarantees that require coordination.
[PAPER] Coordination Avoidance in Database Systems
- Focus: Recheck which transaction groups preserve invariants without coordination.
[PAPER] Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS
- Focus: Study dependency tracking and causal visibility in geo-replicated storage.
[PAPER] A comprehensive study of Convergent and Commutative Replicated Data Types
- Focus: Connect operation identity, causal context, and merge behavior back to CRDT design.
[BOOK] Designing Data-Intensive Applications
- Focus: Connect causal consistency, transactions, replication lag, and user-visible anomalies.

Key Takeaways

Causal consistency prevents dependent facts from becoming visible in the wrong order.
Atomic update groups prevent users from seeing half of one logical action.
Atomicity and causality do not replace coordination for non-mergeable invariants such as uniqueness or global absence.
A good coordination-avoidance design groups mergeable effects locally and routes only invariant-deciding work to rights, authorities, or coordination.
