CRDTs and Coordination Avoidance: Design Review for Coordination Avoidance Strategy
Core Insight
Imagine a team proposing a multi-region incident-management product. Field teams need to add notes offline, attach photos, assign responders, update checklists, close incidents, enforce permissions, and produce an audit trail. Someone says, "Let's make it CRDT-based so every region stays writable."
That sentence is not a design yet. It is a useful instinct for some parts of the system and a dangerous shortcut for others. Notes, comments, checklist observations, and derived feeds may be good candidates for local merge. Unique audit numbers, permission grants, inventory rights, regulated incident closure, and revocation need stronger boundaries, because a local merge cannot decide them safely.
A coordination-avoidance design review asks a sharper question than "can we use CRDTs?" It asks: for each domain promise, what is the state shape, what invariant must hold, who is allowed to create the fact, how do concurrent updates merge, what can lag safely, and what evidence proves the choice is correct?
The non-obvious point is that the review should reduce coordination without hiding it. A good design routes only the non-mergeable decisions to rights, authorities, epochs, or transactions. It keeps local work fast where merge is safe, and it names the moments where waiting is the honest product behavior.
Start From Promises, Not Data Types
A common mistake is to start with a menu of CRDTs:
comments -> OR-set
status -> register
tags -> OR-set
counter -> PN-counter
That is too early. Start with the promises the product makes to users and operators:
1. A technician can add field notes while offline.
2. A warning tag should not disappear just because another replica was stale.
3. A closed incident must have one audit number.
4. Only authorized roles may close regulated incidents.
5. Search and activity feeds may lag, but must repair.
6. Old clients must not bypass new safety semantics.
Now the design has something to test. Each promise points to a state model and a coordination boundary.
Use this review table:
domain promise:
What must users be able to rely on?
source state:
Which facts are authoritative?
merge rule:
How do concurrent facts combine?
invariant:
What must never be violated?
authority:
Who can decide non-mergeable facts?
lag tolerance:
What can be temporarily stale?
repair path:
How does the system fix bad or missing state?
evidence:
What tests, metrics, or runbooks prove this is working?
The trade-off becomes visible. CRDTs reduce waiting by accepting mergeable facts locally. They also force the team to be explicit about state identity, tombstones, causal context, admission control, and repair.
The Coordination Avoidance Decision Tree
For each write, ask the same sequence of questions.
1. Is the operation monotonic?
If yes, meaning the write only adds facts and never retracts them, it can usually merge without coordination.
2. If not monotonic, is it still invariant-confluent?
If all valid concurrent states merge into a valid state, coordination may be avoidable.
3. Does the write consume a bounded right?
If yes, it can be local only where rights have been allocated.
4. Does the write decide uniqueness or global absence?
If yes, route it to an authority or coordinate.
5. Does the write change trust, permissions, or safety boundaries?
If yes, use an authority, epoch, remove-wins rule, or explicit revocation model.
6. Is the field derived from source facts?
If yes, prefer rebuild and repair over treating it as independent truth.
This is not a slogan. It is a review process that turns vague availability goals into concrete design choices.
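The six questions above can be sketched as a small routing function. This is a minimal illustration, not a prescribed implementation: `WriteProfile`, `Route`, and `route` are hypothetical names, and the check order is a choice of this sketch. It asks the stricter questions first so that a monotonic-looking write that also changes permissions still lands at an authority.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LOCAL_MERGE = "local CRDT"
    LOCAL_WITH_RIGHTS = "local with allocated rights"
    AUTHORITY = "authority-routed"
    DERIVED = "derived and repairable"
    NOT_SUITABLE = "not suitable for coordination avoidance"


@dataclass(frozen=True)
class WriteProfile:
    """Reviewer's answers to the six questions for one kind of write (hypothetical shape)."""
    monotonic: bool = False
    invariant_confluent: bool = False
    consumes_bounded_right: bool = False
    decides_uniqueness_or_absence: bool = False
    changes_trust_or_safety: bool = False
    derived: bool = False


def route(w: WriteProfile) -> Route:
    # Derived fields are rebuilt from source facts, so they never need their own merge rule.
    if w.derived:
        return Route.DERIVED
    # Questions 4 and 5: this sketch collapses epochs, remove-wins rules, and
    # revocation models into a single "authority" bucket.
    if w.changes_trust_or_safety or w.decides_uniqueness_or_absence:
        return Route.AUTHORITY
    # Question 3: local only where rights were allocated ahead of time.
    if w.consumes_bounded_right:
        return Route.LOCAL_WITH_RIGHTS
    # Questions 1 and 2: safe to accept locally and merge later.
    if w.monotonic or w.invariant_confluent:
        return Route.LOCAL_MERGE
    # Nothing above applied: the review has not found a safe local path.
    return Route.NOT_SUITABLE


print(route(WriteProfile(monotonic=True)).value)  # local CRDT
```

The booleans are still judgment calls made by reviewers; the function only makes the consequence of each answer explicit.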
Examples:
add comment:
mergeable if comment IDs are unique and actor is authorized
remove warning tag:
maybe remove-wins or authority-controlled, depending on safety meaning
claim slug "atlas":
authority or allocation needed
decrement inventory:
local only with escrowed rights
close regulated incident:
authority decision, because legal/audit invariant matters
activity feed entry:
derived from accepted source operations
The goal is not to avoid coordination everywhere. The goal is to spend coordination where it buys correctness.
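The "decrement inventory" example relies on escrowed rights. A minimal sketch of that idea follows, assuming a fixed total is split across regions ahead of time. `EscrowShare`, `allocate`, and `transfer` are hypothetical names, and a real transfer between regions would itself need a coordinated exchange that this sketch does not model.

```python
from dataclasses import dataclass


@dataclass
class EscrowShare:
    region: str
    rights: int  # units this region may consume without asking anyone

    def consume(self, n: int) -> bool:
        """Spend local rights; refuse rather than overdraw."""
        if n <= 0 or n > self.rights:
            return False  # caller must wait, or request a transfer
        self.rights -= n
        return True


def allocate(total: int, regions: list) -> dict:
    """Split a global stock into per-region shares that sum to the total."""
    base, extra = divmod(total, len(regions))
    return {
        region: EscrowShare(region, base + (1 if i < extra else 0))
        for i, region in enumerate(regions)
    }


def transfer(src: EscrowShare, dst: EscrowShare, n: int) -> bool:
    """Move unused rights between regions (assumed to be a coordinated step)."""
    if not src.consume(n):
        return False
    dst.rights += n
    return True


shares = allocate(10, ["madrid", "dublin", "lisbon"])
shares["madrid"].consume(4)   # succeeds locally
shares["madrid"].consume(1)   # refused: Madrid has no rights left
```

Because every region can only spend what it holds, the sum of all decrements can never exceed the original total, even during a partition.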
Worked Example: Incident Response App
Review this proposed object:
incident I9:
notes
photos
warning_tags
assigned_owner
status
audit_number
activity_feed
Notes
Notes are append-like. A technician can add one while offline.
operation phone:77
actor: technician-9
effect:
notes.add(note:N77, body:"Valve inspected")
Design:
source state:
OR-map note_id -> rich text body CRDT
merge rule:
unique note IDs merge by map key
body edits merge by text CRDT
authority:
admission checks actor can add notes
lag tolerance:
notes may appear later on other replicas
evidence:
duplicate delivery tests
authorization rejection tests
sync backlog metrics
This is a good CRDT candidate. The hard part is not agreement on one global order. The hard parts are stable identity, authorization, and repair.
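A minimal sketch of the notes design, assuming each note ID is unique to one note and treating the body as a plain string rather than a text CRDT. `AddNote` and `NotesReplica` are hypothetical names. The sketch shows the two evidence items that matter most here: duplicate delivery is harmless, and unauthorized operations are rejected before merge.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AddNote:
    op_id: str    # unique per operation, e.g. "phone:77"
    actor: str
    note_id: str  # stable identity chosen by the writer, assumed globally unique
    body: str


@dataclass
class NotesReplica:
    allowed_actors: set
    notes: dict = field(default_factory=dict)     # note_id -> body
    seen_ops: set = field(default_factory=set)    # makes delivery idempotent
    rejected: list = field(default_factory=list)  # kept so operators can explain rejections

    def apply(self, op: AddNote) -> bool:
        if op.op_id in self.seen_ops:
            return False  # duplicate delivery: already handled
        self.seen_ops.add(op.op_id)
        if op.actor not in self.allowed_actors:
            self.rejected.append(op.op_id)  # admission check happens before merge
            return False
        # Unique note IDs mean inserts from different replicas never collide.
        self.notes.setdefault(op.note_id, op.body)
        return True


op = AddNote("phone:77", "technician-9", "note:N77", "Valve inspected")
replica = NotesReplica(allowed_actors={"technician-9"})
replica.apply(op)
replica.apply(op)  # redelivery after reconnect changes nothing
```

Since each accepted operation only inserts a new key, replicas that receive the same operations in different orders end with the same `notes` map.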
Warning Tags
Warning tags look like a set, but the promise matters.
tag:
"gas-risk"
If a warning tag is safety-critical, stale removal is dangerous. A plain add-wins or remove-wins set is not automatically correct.
Possible design:
source state:
warning_tags as observed-remove set
policy:
adding warning tags can be local
removing warning tags requires safety authority
merge rule:
local adds merge
authority remove supersedes observed dots in an epoch
evidence:
old client cannot remove warning tag
concurrent add and unauthorized remove leaves warning visible
The same data type can be correct or wrong depending on the domain promise. "Tag" is not enough information.
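One way to picture the possible design above is an observed-remove set whose remove path is gated by an admission check. This is a sketch under assumptions: `WarningTags` and the `safety_authorities` field are hypothetical, dots are `(replica, counter)` pairs, and the epoch mechanism is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class WarningTags:
    safety_authorities: frozenset
    adds: dict = field(default_factory=dict)    # tag -> set of (replica, counter) dots
    removed: set = field(default_factory=set)   # dots an authorized remove has observed

    def add(self, tag: str, dot: tuple) -> None:
        """Anyone admitted to the incident may add a warning locally."""
        self.adds.setdefault(tag, set()).add(dot)

    def remove(self, tag: str, actor: str) -> bool:
        """Only a safety authority may remove, and only the dots it has observed."""
        if actor not in self.safety_authorities:
            return False  # stale or unauthorized clients are rejected before merge
        self.removed |= self.adds.get(tag, set())
        return True

    def visible(self) -> set:
        # A tag stays visible while at least one of its dots has not been removed.
        return {tag for tag, dots in self.adds.items() if dots - self.removed}

    def merge(self, other: "WarningTags") -> None:
        for tag, dots in other.adds.items():
            self.adds.setdefault(tag, set()).update(dots)
        self.removed |= other.removed


madrid = WarningTags(frozenset({"safety-lead"}))
madrid.add("gas-risk", ("madrid", 1))
madrid.remove("gas-risk", "old-client")  # rejected, warning stays visible
```

A concurrent add creates a dot the authority has not observed, so the warning survives the merge until an authority removes it again with full knowledge.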
Assigned Owner
Assignment is a workflow fact.
assigned_owner = user:ana
If temporary disagreement is acceptable, use a multi-value register and surface conflict:
Madrid:
assigned_owner = Ana
Dublin:
assigned_owner = Bruno
merge:
conflict: {Ana, Bruno}
If the product requires exactly one active responder for safety or paging, use an authority:
assignment service:
decides owner
emits authoritative assignment operation
The review question is not "register or lock?" It is "what should a user see when two people assign concurrently?"
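If the answer is "show both and let a person resolve it", a multi-value register is one option. The sketch below assumes version vectors keyed by replica name; `OwnerRegister` and `dominates` are hypothetical names, not part of any particular library.

```python
from dataclasses import dataclass, field


def dominates(a: dict, b: dict) -> bool:
    """True if clock a has seen everything clock b has seen."""
    return all(a.get(replica, 0) >= n for replica, n in b.items())


@dataclass
class OwnerRegister:
    replica: str
    versions: list = field(default_factory=list)  # list of (owner, version vector)

    def assign(self, owner: str) -> None:
        # The new write supersedes every version this replica currently knows.
        clock = {}
        for _, seen in self.versions:
            for replica, n in seen.items():
                clock[replica] = max(clock.get(replica, 0), n)
        clock[self.replica] = clock.get(self.replica, 0) + 1
        self.versions = [(owner, clock)]

    def merge(self, other: "OwnerRegister") -> None:
        candidates = self.versions + other.versions
        kept = []
        for owner, clock in candidates:
            superseded = any(dominates(c, clock) and c != clock for _, c in candidates)
            if not superseded and (owner, clock) not in kept:
                kept.append((owner, clock))
        self.versions = kept

    def owners(self) -> set:
        """More than one owner means a visible conflict for the UI to surface."""
        return {owner for owner, _ in self.versions}


madrid, dublin = OwnerRegister("madrid"), OwnerRegister("dublin")
madrid.assign("Ana")
dublin.assign("Bruno")
madrid.merge(dublin)
print(madrid.owners() == {"Ana", "Bruno"})  # True
```

A later assignment made after seeing both values dominates them, so the conflict clears on every replica once that write propagates.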
Status And Audit Number
Closing the incident changes several promises:
close incident:
status = closed
audit_number = unique
edits freeze after close
This is not a good local CRDT write.
authority:
verifies close rights
allocates audit number
emits freeze epoch
replicas:
accept comments before freeze
reject or quarantine late edits after freeze
render close decision when received
The fast path is still useful. Users can add notes and photos locally while waiting. The close action should honestly show "waiting for authority" instead of pretending the local device can decide the audit invariant.
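The close flow can be sketched as one authority plus replicas that render its decision. Several things here are assumptions of the sketch: `CloseAuthority`, `CloseDecision`, and `IncidentReplica` are hypothetical names, and the freeze is modeled as the set of operation IDs the authority had seen at close time, which is a simplified stand-in for a freeze epoch.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional


@dataclass(frozen=True)
class CloseDecision:
    incident: str
    audit_number: int
    frozen_ops: frozenset  # edits the authority had seen when it froze the incident


class CloseAuthority:
    """Single decider for close; a stand-in for a real service."""

    def __init__(self, closers):
        self._closers = set(closers)
        self._next_audit = count(1)
        self._decisions = {}

    def close(self, incident: str, actor: str, seen_ops) -> Optional[CloseDecision]:
        if actor not in self._closers:
            return None  # close rights are verified here, not on the device
        if incident not in self._decisions:
            self._decisions[incident] = CloseDecision(
                incident, next(self._next_audit), frozenset(seen_ops)
            )
        return self._decisions[incident]  # repeated requests get the same decision


@dataclass
class IncidentReplica:
    edits: set = field(default_factory=set)
    close_requested: bool = False
    decision: Optional[CloseDecision] = None

    def status(self) -> str:
        if self.decision is not None:
            return "closed"
        return "waiting for authority" if self.close_requested else "open"

    def accepted(self) -> set:
        if self.decision is None:
            return set(self.edits)
        return self.edits & self.decision.frozen_ops

    def quarantined(self) -> set:
        """Edits outside the freeze are kept for review, not silently dropped."""
        return self.edits - self.accepted()
```

Because acceptance is computed from the decision rather than from arrival order, a replica that receives a late edit before the close and one that receives it after end up with the same accepted and quarantined sets.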
Activity Feed
The feed is derived:
activity_feed:
"Ana added note N77"
"Incident closed by authority"
Do not make feed entries independent source truth unless there is a reason. Derive them from accepted operations:
source operations -> feed projection
Then a missing or duplicated feed entry is a rebuild problem, not a domain conflict.
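A minimal sketch of such a projection, assuming each accepted operation carries a unique ID and some ordering key. `AcceptedOp`, its `seq` field, and `project_feed` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptedOp:
    op_id: str
    seq: int         # hypothetical ordering key assigned when the op was accepted
    actor: str
    kind: str        # "add_note" or "close" in this sketch
    target: str = ""


def render(op: AcceptedOp) -> str:
    if op.kind == "add_note":
        return f"{op.actor} added note {op.target}"
    if op.kind == "close":
        return f"Incident closed by {op.actor}"
    return f"{op.actor} performed {op.kind}"


def project_feed(ops) -> list:
    """Rebuild the whole feed from source operations; duplicates and order do not matter."""
    unique = {op.op_id: op for op in ops}
    ordered = sorted(unique.values(), key=lambda op: (op.seq, op.op_id))
    return [render(op) for op in ordered]


ops = [
    AcceptedOp("phone:77", 1, "Ana", "add_note", "N77"),
    AcceptedOp("auth:1", 2, "authority", "close"),
]
print(project_feed(ops + ops))  # ['Ana added note N77', 'Incident closed by authority']
```

Repair is then a matter of rerunning `project_feed` over the accepted operations and replacing the stored feed.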
Review Checklist
Use this checklist before approving a coordination-avoidance design.
Data identity:
Are operation IDs, dots, object IDs, and actor IDs stable?
Merge laws:
Does merge satisfy the required laws or operation delivery contract?
Invariant safety:
Which invariants are preserved without coordination?
Which are protected by rights or authority?
Admission control:
Who may emit each operation?
What gets rejected before merge?
Causal visibility:
Which operations depend on earlier facts?
What is buffered or hidden until dependencies arrive?
Compaction:
What metadata grows?
What stability signal makes cleanup safe?
Migration:
What happens when old and new clients coexist?
Observability:
Can operators explain missing, rejected, repaired, and conflicting operations?
User experience:
What does the UI show for pending authority, conflict, rejection, and repair?
Evidence:
Which property tests, generated histories, dashboards, and runbooks exist?
The review should produce decisions, not just discussion. For each field, write:
local CRDT
local with allocated rights
authority-routed
derived and repairable
not suitable for coordination avoidance
If a field cannot be classified, the design is not ready.
Architecture Review Output
A useful review output is short and concrete.
Field: notes
Decision: local CRDT
Why: append/edit facts merge by stable ID
Boundary: actor must have note-write capability
Evidence: duplicate delivery, offline reconnect, unauthorized write tests
Field: audit_number
Decision: authority-routed
Why: uniqueness cannot be inferred locally from absence
Boundary: regional audit allocator
Evidence: partition test where two regions attempt close concurrently
Field: warning_tags.remove
Decision: authority or safety-specific remove-wins policy
Why: stale clients must not erase safety warnings
Boundary: safety authority and epoch
Evidence: malicious/stale removal tests
This format prevents two common review failures. First, it stops teams from saying "eventual consistency is okay" without naming what can be stale. Second, it stops teams from saying "we need strong consistency" for everything, when only a few fields need authority.
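The review output can also be kept as data, so that gaps are found mechanically. The sketch below is one possible shape, not a required tool: `ReviewEntry` and `review_gaps` are hypothetical names, and the only rule enforced is that every field has an entry with a decision, a reason, a boundary, and at least one piece of evidence.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewEntry:
    field_name: str
    decision: str = ""
    why: str = ""
    boundary: str = ""
    evidence: list = field(default_factory=list)

    def gaps(self) -> list:
        """Names of the answers this entry is still missing."""
        missing = [
            name for name in ("decision", "why", "boundary")
            if not getattr(self, name).strip()
        ]
        if not self.evidence:
            missing.append("evidence")
        return missing

    def render(self) -> str:
        return "\n".join([
            f"Field: {self.field_name}",
            f"Decision: {self.decision}",
            f"Why: {self.why}",
            f"Boundary: {self.boundary}",
            f"Evidence: {', '.join(self.evidence)}",
        ])


def review_gaps(fields, entries) -> dict:
    """Map each field that is not ready to the reason it is not ready."""
    by_name = {entry.field_name: entry for entry in entries}
    report = {}
    for name in fields:
        entry = by_name.get(name)
        gaps = ["no entry"] if entry is None else entry.gaps()
        if gaps:
            report[name] = gaps
    return report


notes = ReviewEntry(
    "notes", "local CRDT", "append/edit facts merge by stable ID",
    "actor must have note-write capability",
    ["duplicate delivery", "offline reconnect", "unauthorized write tests"],
)
print(review_gaps(["notes", "status"], [notes]))  # {'status': ['no entry']}
```

An empty report does not prove the design is correct; it only shows that nobody skipped a question.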
Failure Modes
- Reviewing data types instead of promises: OR-set or register does not say whether the domain invariant is safe.
- Hiding authorities: A system may claim to be coordination-free while quietly relying on one service for uniqueness, rights, or closure.
- Treating derived views as source state: Feeds, counters, and search indexes should usually rebuild from accepted operations.
- Ignoring trust boundaries: Mergeable unauthorized updates are still invalid.
- No answer for concurrent intent: If two users write conflicting assignments, the UI and workflow need a visible policy.
- No compaction story: Tombstones, dots, and operation logs cannot grow forever without stability signals.
- No migration story: Old clients can corrupt new semantics if semantic fields are silently dropped.
- No operational evidence: A design that cannot be tested, observed, or repaired is not ready for production.
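For the compaction failure mode, one common stability signal is "every replica has acknowledged this operation". The sketch below assumes the full replica membership is known and that each replica reports, per origin, the highest counter it has applied. `stable_frontier` and `compact` are hypothetical names.

```python
def stable_frontier(acks: dict) -> dict:
    """Per-origin counter that every known replica has acknowledged.

    acks maps replica -> {origin: highest counter applied from that origin}.
    A replica missing from acks would make the result unsafe, so membership
    is assumed to be complete.
    """
    origins = set().union(*(seen.keys() for seen in acks.values()))
    return {
        origin: min(seen.get(origin, 0) for seen in acks.values())
        for origin in origins
    }


def compact(tombstones: set, acks: dict) -> set:
    """Keep only tombstones that some replica may still need to see."""
    frontier = stable_frontier(acks)
    return {
        (origin, n) for (origin, n) in tombstones
        if n > frontier.get(origin, 0)
    }


acks = {
    "madrid": {"madrid": 5, "dublin": 3},
    "dublin": {"madrid": 4, "dublin": 3},
    "phone": {"madrid": 2},  # an offline device holds the frontier back
}
tombstones = {("madrid", 1), ("madrid", 3), ("dublin", 2)}
print(sorted(compact(tombstones, acks)))  # [('dublin', 2), ('madrid', 3)]
```

The example also shows the operational cost: a single long-offline device keeps metadata alive, which is why the review asks for an explicit policy on stale replicas.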
Practice
Review this proposal:
Product:
multi-region maintenance board
Fields:
task title
task comments
task status
tags
assigned technician
inventory parts used
audit close number
notification feed
For each field, classify it:
local CRDT:
local with allocated rights:
authority-routed:
derived and repairable:
needs explicit conflict state:
Then choose one risky field and fill out:
domain promise:
invariant:
concurrent update example:
safe merge or authority boundary:
rejection/repair path:
test history:
dashboard signal:
Finally, write one sentence the UI should show when a user performs an action locally but the system must wait for authority.
Connections
- 010.md introduced invariant confluence; this review uses it to decide what can merge safely.
- 011.md introduced escrow and bounded rights; the review uses rights when local work consumes scarce capacity.
- 020.md covered observability and repair; those are required evidence for an approved design.
- 021.md covered trust boundaries; every admitted operation in the design needs that lens.
- 024.md turns this review method into a capstone architecture for a mergeable multi-region application.
Resources
- [PAPER] Coordination Avoidance in Database Systems
- Focus: Use invariant confluence as the basis for deciding where coordination is necessary.
- [PAPER] Highly Available Transactions: Virtues and Limitations
- Focus: Separate highly available guarantees from stronger transactional guarantees.
- [PAPER] Keeping CALM: When Distributed Consistency is Easy
- Focus: Connect monotonic reasoning to coordination avoidance.
- [PAPER] A comprehensive study of Convergent and Commutative Replicated Data Types
- Focus: Recheck the data-type mechanics behind the review decisions.
- [BOOK] Designing Data-Intensive Applications
- Focus: Compare replication, transactions, derived data, and user-visible consistency trade-offs.
Key Takeaways
- Coordination-avoidance review starts from domain promises and invariants, not from a list of CRDT types.
- The best designs keep mergeable facts local and route uniqueness, rights, trust, and safety decisions to explicit boundaries.
- Every field should have a decision: local CRDT, local with rights, authority-routed, derived and repairable, or not suitable.
- A production-ready design includes evidence: merge tests, invariant histories, observability, repair paths, and migration behavior.