CRDTs and Coordination Avoidance: Transactions, Causal Consistency, and Atomic Updates
Core Insight
Imagine a user moves a task from todo to done while offline. In the local app, that feels like one action. Under the hood, a naive model might represent it as two updates:
remove task T from todo
add task T to done
If those two updates arrive at another replica separately, a user may briefly see the task in both columns, or in neither column. The CRDT fields can each be correct and mergeable, while the product action still leaks a half-finished state.
This lesson is about the boundaries between three ideas that are easy to blur. Causal consistency says dependent facts should become visible in causal order: do not show the reply before the comment it replies to. Atomic updates say a group of changes should become visible together: do not show half of a task move. Transactions are a broader family of guarantees; some require coordination, while some highly available systems only provide local atomicity and causal visibility.
The non-obvious point is that atomic grouping does not automatically mean global serializable transactions. You can often group mergeable effects into one operation or one delta and keep the fast path local. But if the group must decide a non-mergeable invariant, such as unique name allocation or no-oversell without rights, then the group needs a coordination boundary.
The Naive Multi-Field Update
Use the project board from the previous lessons.
board:
    todo -> OR-set of task IDs
    done -> OR-set of task IDs
A task starts in todo:
todo = {T}
done = {}
Madrid moves it to done while disconnected:
remove T from todo
add T to done
Dublin receives only the add first:
todo = {T}
done = {T}
Later it receives the remove:
todo = {}
done = {T}
The final state may be fine, but the intermediate state exposed a broken user promise: a task should not appear in two columns because one human action was delivered in pieces.
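The trace above can be replayed in a few lines. This is a minimal sketch, not a real OR-set: the two columns are plain Python sets with no dots or tombstones, and the names `apply_op` and `columns_of` are made up for illustration.

```python
# Simplified stand-in for the two OR-sets: plain sets, no dots or tombstones.
def apply_op(board, op):
    kind, column, task = op
    if kind == "add":
        board[column].add(task)
    else:  # "remove"
        board[column].discard(task)


def columns_of(board, task):
    """Return the sorted list of columns that currently contain the task."""
    return sorted(col for col, tasks in board.items() if task in tasks)


# Dublin's replica before Madrid's move arrives.
dublin = {"todo": {"T"}, "done": set()}
move = [("remove", "todo", "T"), ("add", "done", "T")]

# Deliver the add before the remove, as in the trace above.
apply_op(dublin, move[1])
intermediate = columns_of(dublin, "T")
apply_op(dublin, move[0])
final = columns_of(dublin, "T")

print(intermediate)  # ['done', 'todo']
print(final)         # ['done']
```

The final state converges, but `intermediate` is exactly the two-column state a user could observe between the two deliveries.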
One fix is to change the model:
task.status -> register or workflow CRDT
Now "move task to done" is one field update:
task T:
    status = done
That is often the cleanest answer. If the domain has a single logical fact, store it as one logical fact. Do not split it across two sets and then ask the system to hide the split.
But not every action can be modeled as one field. Creating a comment may update a comments map, a notification feed, a mention index, and last_activity_at. The system needs a way to group related effects.
Causal Consistency
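Causal consistency preserves "happened after because it observed" relationships: if one operation was issued after observing another, no replica shows the later one without the earlier one.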
Suppose Ana writes a comment and then Bruno replies after seeing it:
C1: Ana writes "Can we ship this?"
R1: Bruno replies "Yes" after seeing C1
A causally consistent replica should not show R1 without C1.
bad visibility:
    R1 visible
    C1 missing
causal visibility:
    if R1 is visible,
    then C1 is visible too
The system can enforce this with dependency metadata:
operation R1:
    dot: B:12
    dependencies:
        A:44 # comment C1
When a replica receives R1, it checks whether it already has the dependencies. If not, it can buffer R1, fetch missing state, or hide it until the dependency arrives.
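The buffering option can be sketched as follows. This is one possible shape under simplifying assumptions: operations are plain dictionaries, dots are `(replica, counter)` tuples, and the class name `CausalReplica` is hypothetical. Fetching missing state and hiding are not shown.

```python
class CausalReplica:
    """Makes an operation visible only after all of its dependencies are visible."""

    def __init__(self):
        self.applied = set()   # dots that are already visible
        self.visible = []      # operation names, in the order they became visible
        self.pending = []      # operations still waiting for dependencies

    def receive(self, op):
        self.pending.append(op)
        self._drain()

    def _drain(self):
        # Keep scanning until a full pass releases nothing new.
        progress = True
        while progress:
            progress = False
            for op in list(self.pending):
                if op["deps"] <= self.applied:  # every dependency is visible
                    self.pending.remove(op)
                    if op["dot"] not in self.applied:  # ignore redelivery
                        self.applied.add(op["dot"])
                        self.visible.append(op["name"])
                    progress = True


c1 = {"name": "C1", "dot": ("A", 44), "deps": set()}
r1 = {"name": "R1", "dot": ("B", 12), "deps": {("A", 44)}}

replica = CausalReplica()
replica.receive(r1)                     # the reply arrives first and is buffered
visible_after_reply = list(replica.visible)
replica.receive(c1)                     # the comment arrives and releases the reply

print(visible_after_reply)  # []
print(replica.visible)      # ['C1', 'R1']
```

A production replica would also bound the buffer and trigger repair when a dependency never arrives.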
Causal consistency is not the same as total order. Two comments written independently can appear in different orders at different replicas as long as causally dependent facts do not appear backwards.
concurrent:
    C2 from Madrid
    C3 from Dublin
    no causal dependency between them
allowed:
    Madrid shows C2 then C3
    Dublin shows C3 then C2
That is why causal consistency is attractive in coordination-avoidance designs. It protects many user expectations without forcing every replica to agree on one global sequence of all events.
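One common way to tell "causally after" from "concurrent" is to compare version vectors. The lesson itself uses dots and dependency lists, so the vector representation below is an assumption for illustration: each event carries a map from replica name to the highest counter it had observed, and the function name `relation` is hypothetical.

```python
def relation(a, b):
    """Compare two version vectors (dicts of replica -> counter).

    Returns "before" if a happened before b, "after" if b happened before a,
    "equal" if they describe the same causal history, and "concurrent" otherwise.
    """
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


# Bruno's reply observed Ana's comment, so its vector includes A:44.
c1 = {"A": 44}
r1 = {"A": 44, "B": 12}

# Two comments written independently in Madrid and Dublin.
c2 = {"M": 7}
c3 = {"D": 3}

print(relation(c1, r1))  # before
print(relation(c2, c3))  # concurrent
```

Only the "before" and "after" cases constrain display order. For the concurrent pair, either order is allowed, which is why no wall-clock timestamp appears in the comparison.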
Atomic Update Groups
Atomic visibility means a set of effects appears together.
For a comment creation:
transaction group G:
    add comment C1 to comments map
    add mention notification N1 to Ana's inbox
    update document last_activity_at
The group should be applied as one unit at each replica:
if all effects in G are ready:
    make G visible
else:
    buffer or hide G
In a CRDT system, this group can still be mergeable if each effect is mergeable and the group carries a stable identity:
group_id = Madrid:tx:91
dependencies = causal context observed by the writer
effects:
    comments.add(C1 with dot M:201)
    inbox.add(N1 with dot M:202)
    last_activity.join(M:203)
Every replica deduplicates by group_id, so it applies each group only once. It waits for dependencies if the group refers to state that has not arrived yet. The group gives atomic visibility; the individual CRDT effects still define how concurrent groups merge.
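A rough sketch of that receive path is shown below. It assumes grow-only sets for the comments map and inbox, a max-join over an integer for last activity, and dependencies expressed as a set of dots. The class `GroupReplica` and the group layout are illustrative, not a prescribed format.

```python
class GroupReplica:
    """Applies each group once, and only when its dependencies are present."""

    def __init__(self):
        self.seen_groups = set()   # group_ids already applied
        self.known_dots = set()    # dots of every applied effect
        self.buffered = {}         # group_id -> group waiting for dependencies
        self.state = {"comments": set(), "inbox": set(), "last_activity": 0}

    def receive(self, group):
        gid = group["group_id"]
        if gid in self.seen_groups or gid in self.buffered:
            return "duplicate"     # retries and redelivery are ignored
        self.buffered[gid] = group
        self._drain()
        return "applied" if gid in self.seen_groups else "buffered"

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for gid, group in list(self.buffered.items()):
                if group["dependencies"] <= self.known_dots:
                    self._apply(group)
                    del self.buffered[gid]
                    progress = True

    def _apply(self, group):
        # All effects land in one step, so readers never see a partial group.
        for target, value, dot in group["effects"]:
            if target == "last_activity":
                self.state[target] = max(self.state[target], value)  # join
            else:
                self.state[target].add(value)
            self.known_dots.add(dot)
        self.seen_groups.add(group["group_id"])


g91 = {
    "group_id": "Madrid:tx:91",
    "dependencies": set(),
    "effects": [
        ("comments", "C1", ("M", 201)),
        ("inbox", "N1", ("M", 202)),
        ("last_activity", 203, ("M", 203)),
    ],
}
# A later group that replies to C1, so it depends on dot M:201.
g92 = {
    "group_id": "Dublin:tx:5",
    "dependencies": {("M", 201)},
    "effects": [("comments", "R1", ("D", 9))],
}

replica = GroupReplica()
results = [replica.receive(g92), replica.receive(g91), replica.receive(g91)]
print(results)  # ['buffered', 'applied', 'duplicate']
print(sorted(replica.state["comments"]))  # ['C1', 'R1']
```

In a concurrent implementation, `_apply` would need to run under a lock or publish a new snapshot so that readers cannot observe the loop halfway through.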
This is a different promise from "the whole world agreed on this transaction before it committed." The writer committed locally. Other replicas make the group visible when they can interpret it safely.
When Atomic Is Not Enough
Atomic grouping prevents partial visibility. It does not turn unsafe invariants into safe ones.
Consider a group that claims a unique workspace slug and creates the workspace:
group:
    reserve slug "atlas"
    create workspace W
If Madrid and Dublin both create an atomic group for atlas while partitioned, each group is internally complete. The merge still violates uniqueness unless the slug claim went through the authority for atlas.
Madrid group:
    atlas -> workspace W1
Dublin group:
    atlas -> workspace W2
merge:
    two complete groups
    same unique slug
Atomicity solved "do not show half a workspace." It did not solve "only one workspace may own this slug."
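The merge can be illustrated directly. In this sketch each replica's accepted slug claims are modeled as a plain dictionary, and the helper names `merge_slug_claims` and `uniqueness_violations` are invented for the example.

```python
def merge_slug_claims(*replica_claims):
    """Union the slug claims from several replicas: slug -> set of workspaces."""
    merged = {}
    for claims in replica_claims:
        for slug, workspace in claims.items():
            merged.setdefault(slug, set()).add(workspace)
    return merged


def uniqueness_violations(merged):
    """Return every slug that ended up owned by more than one workspace."""
    return {slug: sorted(ws) for slug, ws in merged.items() if len(ws) > 1}


# Each replica committed a complete, atomic group while partitioned.
madrid = {"atlas": "W1"}
dublin = {"atlas": "W2"}

violations = uniqueness_violations(merge_slug_claims(madrid, dublin))
print(violations)  # {'atlas': ['W1', 'W2']}
```

Both inputs are well-formed on their own; the violation only exists after the merge, which is why no amount of per-group atomicity can prevent it.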
Use this rule:
atomic group of mergeable effects:
    can often stay highly available
atomic group that consumes rights:
    can stay local only if rights are available
atomic group that decides uniqueness or absence:
    must route to the authority or coordinate
atomic group that changes safety-sensitive access:
    needs remove-wins semantics, epoch boundaries, leases, or coordination
The trade-off is precise. Atomic visibility improves user experience and avoids half-states, but it adds buffering, dependency tracking, deduplication, and retry complexity. Coordination is still required when the group itself makes a non-mergeable decision.
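The rule can be expressed as a small routing function. The effect-kind labels and the returned path names below are hypothetical tags chosen for this sketch; a real system would derive them from its schema or invariant declarations.

```python
def route(effect_kinds, rights_available=False):
    """Pick a commit path for a group, given the kinds of effects it contains.

    effect_kinds is a set drawn from these illustrative labels:
    "mergeable", "consumes_rights", "decides_uniqueness", "changes_access".
    """
    if "decides_uniqueness" in effect_kinds:
        return "authority"      # route to the owner of the name, or coordinate
    if "changes_access" in effect_kinds:
        return "guarded"        # remove-wins, epochs, leases, or coordination
    if "consumes_rights" in effect_kinds:
        # Local only while this replica still holds enough rights.
        return "local" if rights_available else "coordinate"
    return "local"              # purely mergeable effects


print(route({"mergeable"}))                                  # local
print(route({"mergeable", "consumes_rights"}))               # coordinate
print(route({"consumes_rights"}, rights_available=True))     # local
print(route({"mergeable", "decides_uniqueness"}))            # authority
```

The strictest effect in the group decides the path: one uniqueness claim sends the whole group to the authority, even if every other effect is mergeable.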
Worked Example: Move And Notify
Design a local action:
move task T to done
notify watchers
A better domain model stores the task's status as one field:
task T:
    status -> workflow register
The notification is derived from the move:
notification N:
    "Task T moved to done"
    depends_on: status update S
The local replica emits one group:
group M:tx:17
    dependencies:
        current task version
    effects:
        task_status.write(done, dot M:40)
        notifications.add(N, dot M:41, depends_on M:40)
A remote replica handles it like this:
receive group M:tx:17
if dependencies missing:
    buffer group
else:
    apply status write
    apply notification add
    make both visible together
If another replica concurrently moves the same task to blocked, the status register's merge policy still matters:
Madrid:
    status = done
Dublin:
    status = blocked
merge:
    multi-value conflict
    or workflow resolver
    or authority decision
The group did not hide the conflict. It kept each user's action internally coherent, then let the domain policy decide what concurrent moves mean.
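The multi-value option can be sketched with a small register that keeps every concurrent write. This is a simplified multi-value register under stated assumptions: the state is a pair of (entries by dot, set of observed dots), and the starting dot `M:39` and Dublin's dots `D:7` and `D:8` are made up for the example. A workflow resolver or authority decision would replace the final read step.

```python
def write(register, dot, value):
    """A new write replaces every entry the writer has observed."""
    _entries, context = register
    return {dot: value}, context | {dot}


def merge(a, b):
    """Keep an entry unless the other side observed it and already replaced it."""
    entries_a, ctx_a = a
    entries_b, ctx_b = b
    kept = {}
    for dot, value in entries_a.items():
        if dot in entries_b or dot not in ctx_b:
            kept[dot] = value
    for dot, value in entries_b.items():
        if dot in entries_a or dot not in ctx_a:
            kept[dot] = value
    return kept, ctx_a | ctx_b


def values(register):
    return sorted(register[0].values())


# Both replicas start from the same observed state.
base = ({("M", 39): "todo"}, {("M", 39)})

madrid = write(base, ("M", 40), "done")     # Madrid's group M:tx:17
dublin = write(base, ("D", 7), "blocked")   # Dublin's concurrent move

merged = merge(madrid, dublin)
print(values(merged))  # ['blocked', 'done']

# A later write that observed both values resolves the conflict.
resolved = write(merged, ("D", 8), "done")
print(values(merge(resolved, madrid)))  # ['done']
```

The old `todo` value disappears because both writers observed it, while the two concurrent moves survive side by side until something that has seen both makes a decision.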
Failure Modes
- Splitting one logical fact across several fields: A task can appear in two columns if movement is modeled as remove-from-one-set plus add-to-another-set.
- Assuming atomic means serializable: Atomic visibility does not create one global order or preserve non-mergeable invariants.
- Showing causally dependent data too early: A reply, notification, or index entry can appear before the source fact it depends on.
- Missing transaction IDs: Retries can apply the same group twice unless the group has stable identity and deduplication.
- Buffering forever: A replica that waits for missing dependencies needs repair, timeout, or rejoin behavior.
- Coordinating every group: Many groups contain only mergeable effects and do not need global agreement.
- Ignoring derived views: Notifications, indexes, and counters should name their source dependencies or be repairable.
- Using wall-clock order as causality: A later timestamp does not prove that one operation observed another.
Practice
Design the write path for this offline action:
Action:
    create a comment
    mention two users
    update last_activity_at
    add notifications
Fill in:
1. Which effects are authoritative source state?
2. Which effects are derived views?
3. What dependencies should the notification carry?
4. What is the transaction or group identity?
5. Which effects can be visible only when the group is complete?
6. Does the action decide any non-mergeable invariant?
Then change the action:
create workspace with unique slug "atlas"
Explain why atomic grouping is not enough, and name the authority or coordination path that must decide the slug claim.
Connections
- 006.md introduced causal context; this lesson uses it for visibility and dependency tracking.
- 013.md decomposed domain objects by promise; atomic groups keep related promises from leaking half-states.
- 015.md showed cleanup boundaries; buffered transactions and missing dependencies need similar repair and rejoin discipline.
- 017.md applies these ideas to collaborative editing, where sequences and rich text depend heavily on causal and atomic edit semantics.
Resources
- [PAPER] Highly Available Transactions: Virtues and Limitations
  - Focus: Separate highly available transactional guarantees from guarantees that require coordination.
- [PAPER] Coordination Avoidance in Database Systems
  - Focus: Recheck which transaction groups preserve invariants without coordination.
- [PAPER] Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS
  - Focus: Study dependency tracking and causal visibility in geo-replicated storage.
- [PAPER] A comprehensive study of Convergent and Commutative Replicated Data Types
  - Focus: Connect operation identity, causal context, and merge behavior back to CRDT design.
- [BOOK] Designing Data-Intensive Applications
  - Focus: Connect causal consistency, transactions, replication lag, and user-visible anomalies.
Key Takeaways
- Causal consistency prevents dependent facts from becoming visible in the wrong order.
- Atomic update groups prevent users from seeing half of one logical action.
- Atomicity and causality do not replace coordination for non-mergeable invariants such as uniqueness or global absence.
- A good coordination-avoidance design groups mergeable effects locally and routes only invariant-deciding work to rights, authorities, or coordination.