CRDTs and Coordination Avoidance: Design Review for Coordination Avoidance Strategy

Lesson 023 | 30 min | intermediate

Core Insight

Imagine a team proposing a multi-region incident-management product. Field teams need to add notes offline, attach photos, assign responders, update checklists, close incidents, enforce permissions, and produce an audit trail. Someone says, "Let's make it CRDT-based so every region stays writable."

That sentence is not a design yet. It is a useful instinct for some parts of the system and a dangerous shortcut for others. Notes, comments, checklist observations, and derived feeds may be good candidates for local merge. Unique audit numbers, permission grants, inventory rights, regulated incident closure, and revocation are different: each depends on an invariant that local merge alone cannot protect.

A coordination-avoidance design review asks a sharper question than "can we use CRDTs?" It asks: for each domain promise, what is the state shape, what invariant must hold, who is allowed to create the fact, how do concurrent updates merge, what can lag safely, and what evidence proves the choice is correct?

The non-obvious point is that the review should reduce coordination without hiding it. A good design routes only the non-mergeable decisions to rights, authorities, epochs, or transactions. It keeps local work fast where merge is safe, and it names the moments where waiting is the honest product behavior.

Start From Promises, Not Data Types

A common mistake is to start with a menu of CRDTs:

comments -> OR-set
status -> register
tags -> OR-set
counter -> PN-counter

That is too early. Start with the promises the product makes to users and operators:

1. A technician can add field notes while offline.
2. A warning tag should not disappear just because another replica was stale.
3. A closed incident must have one audit number.
4. Only authorized roles may close regulated incidents.
5. Search and activity feeds may lag, but must repair.
6. Old clients must not bypass new safety semantics.

Now the design has something to test. Each promise points to a state model and a coordination boundary.

Use this review table:

domain promise:
  What must users be able to rely on?

source state:
  Which facts are authoritative?

merge rule:
  How do concurrent facts combine?

invariant:
  What must never be violated?

authority:
  Who can decide non-mergeable facts?

lag tolerance:
  What can be temporarily stale?

repair path:
  How does the system fix bad or missing state?

evidence:
  What tests, metrics, or runbooks prove this is working?

The trade-off becomes visible. CRDTs reduce waiting by accepting mergeable facts locally. They also force the team to be explicit about state identity, tombstones, causal context, admission control, and repair.

The Coordination Avoidance Decision Tree

For each write, ask the same sequence of questions.

1. Is the operation monotonic?
   If yes, it only adds facts and never retracts them, so it usually merges without coordination.

2. If not monotonic, is it still invariant-confluent?
   If merging any two valid states that diverged concurrently always yields a valid state, coordination may be avoidable.

3. Does the write consume a bounded right?
   If yes, it can be local only where rights have been allocated.

4. Does the write decide uniqueness or global absence?
   If yes, route it to an authority or coordinate.

5. Does the write change trust, permissions, or safety boundaries?
   If yes, use an authority, epoch, remove-wins rule, or explicit revocation model.

6. Is the field derived from source facts?
   If yes, prefer rebuild and repair over treating it as independent truth.

This is not a slogan. It is a review process that turns vague availability goals into concrete design choices.
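The six questions above can be sketched as a small classifier. This is a minimal illustration, not a complete review tool: `WriteProfile`, `Route`, and `classify` are hypothetical names, the answers to each question are assumed to come from a human reviewer, and question 5 is simplified to "route to an authority" even though the text also allows epochs, remove-wins rules, or revocation models. The restrictive questions are checked first so that a write that looks monotonic but also decides uniqueness is not waved through.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LOCAL_CRDT = "local CRDT"
    LOCAL_WITH_RIGHTS = "local with allocated rights"
    AUTHORITY = "authority-routed"
    DERIVED = "derived and repairable"
    NOT_SUITABLE = "not suitable for coordination avoidance"


@dataclass(frozen=True)
class WriteProfile:
    """Reviewer's answers to the six questions for one kind of write."""
    name: str
    monotonic: bool = False
    invariant_confluent: bool = False
    consumes_bounded_right: bool = False
    decides_uniqueness: bool = False
    changes_trust_or_safety: bool = False
    derived: bool = False


def classify(w: WriteProfile) -> Route:
    # Derived fields are rebuilt from source facts, never merged as truth.
    if w.derived:
        return Route.DERIVED
    # Uniqueness, global absence, trust, and safety need an explicit boundary.
    if w.decides_uniqueness or w.changes_trust_or_safety:
        return Route.AUTHORITY
    # Bounded rights can be spent locally once they have been allocated.
    if w.consumes_bounded_right:
        return Route.LOCAL_WITH_RIGHTS
    # Only now is a monotonic or invariant-confluent write accepted locally.
    if w.monotonic or w.invariant_confluent:
        return Route.LOCAL_CRDT
    return Route.NOT_SUITABLE


profiles = [
    WriteProfile("add comment", monotonic=True),
    WriteProfile("claim slug", monotonic=True, decides_uniqueness=True),
    WriteProfile("decrement inventory", consumes_bounded_right=True),
    WriteProfile("activity feed entry", derived=True),
]
for p in profiles:
    print(f"{p.name}: {classify(p).value}")
```

A write with no "yes" answers falls through to `NOT_SUITABLE`, which matches the rule that an unclassifiable field means the design is not ready.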

Examples:

add comment:
  mergeable if comment IDs are unique and actor is authorized

remove warning tag:
  remove-wins or authority-controlled, depending on what the tag means for safety

claim slug "atlas":
  authority or allocation needed

decrement inventory:
  local only with escrowed rights

close regulated incident:
  authority decision, because legal/audit invariant matters

activity feed entry:
  derived from accepted source operations

The goal is not to avoid coordination everywhere. The goal is to spend coordination where it buys correctness.
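The "decrement inventory" example can be illustrated with a small escrow sketch. It assumes the global stock was split into per-region rights up front; `EscrowedStock`, `decrement`, and `transfer_to` are hypothetical names, and the in-memory transfer stands in for what would be a coordinated exchange between regions in a real system.

```python
class EscrowedStock:
    """One region's share of a global stock count (illustrative only)."""

    def __init__(self, region, rights):
        self.region = region
        self.rights = rights    # units this region may still consume locally
        self.consumed = 0

    def decrement(self, n=1):
        # Local decision: no other region is consulted.
        if n > self.rights:
            return False        # caller must ask for a transfer and wait
        self.rights -= n
        self.consumed += n
        return True

    def transfer_to(self, other, n):
        # Moving rights is the only step where two regions must talk.
        if n > self.rights:
            return False
        self.rights -= n
        other.rights += n
        return True


# 10 units in total, split 6 / 4 between two regions.
madrid = EscrowedStock("madrid", 6)
dublin = EscrowedStock("dublin", 4)

first = madrid.decrement(6)       # within Madrid's rights
second = madrid.decrement(1)      # Madrid has run out of rights
moved = dublin.transfer_to(madrid, 2)
third = madrid.decrement(1)       # succeeds after the transfer
print(first, second, moved, third)
```

The invariant "never consume more than the total stock" holds because the sum of remaining rights and consumed units never changes, no matter how the regions interleave.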

Worked Example: Incident Response App

Review this proposed object:

incident I9:
  notes
  photos
  warning_tags
  assigned_owner
  status
  audit_number
  activity_feed

Notes

Notes are append-like. A technician can add one while offline.

operation phone:77
  actor: technician-9
  effect:
    notes.add(note:N77, body:"Valve inspected")

Design:

source state:
  OR-map note_id -> rich text body CRDT

merge rule:
  unique note IDs merge by map key
  body edits merge by text CRDT

authority:
  admission checks actor can add notes

lag tolerance:
  notes may appear later on other replicas

evidence:
  duplicate delivery tests
  authorization rejection tests
  sync backlog metrics

This is a good CRDT candidate. The hard part is not agreement on one global order. The hard part is stable identity, authorization, and repair.
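A minimal sketch of the notes design, under simplifying assumptions: note bodies are plain strings rather than a text CRDT, removals are left out (so this is a grow-only map rather than a full OR-map), and `AddNote` and `NotesReplica` are hypothetical names. It shows the three properties named above: stable identity, admission before merge, and tolerance of duplicate or reordered delivery.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddNote:
    op_id: str     # unique per operation, e.g. "phone:77"
    actor: str
    note_id: str   # stable identity of the note, e.g. "note:N77"
    body: str


class NotesReplica:
    def __init__(self, allowed_actors):
        self.allowed_actors = set(allowed_actors)
        self.notes = {}         # note_id -> body
        self.seen_ops = set()   # op_ids already applied
        self.rejected = []      # kept so operators can explain rejections

    def apply(self, op):
        if op.op_id in self.seen_ops:
            return False        # duplicate delivery is a no-op
        if op.actor not in self.allowed_actors:
            self.rejected.append(op)
            return False        # admission check happens before merge
        self.seen_ops.add(op.op_id)
        self.notes.setdefault(op.note_id, op.body)
        return True


op = AddNote("phone:77", "technician-9", "note:N77", "Valve inspected")
other = AddNote("tablet:3", "technician-9", "note:N78", "Gauge replaced")
spam = AddNote("web:1", "guest-3", "note:N79", "not allowed")

phone = NotesReplica({"technician-9"})
server = NotesReplica({"technician-9"})

for delivered in (op, op, other, spam):   # duplicate and unauthorized ops
    phone.apply(delivered)
for delivered in (other, op):             # different arrival order
    server.apply(delivered)

print(phone.notes == server.notes)
```

Because each note has its own key and each operation has its own ID, the two replicas converge without agreeing on an order.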

Warning Tags

Warning tags look like a set, but the promise matters.

tag:
  "gas-risk"

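If a warning tag is safety-critical, a removal issued by a stale replica is dangerous. Neither a plain add-wins set nor a plain remove-wins set is automatically correct: remove-wins lets a stale or unauthorized remove hide a live warning, and add-wins says nothing about who is allowed to remove.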

Possible design:

source state:
  warning_tags as observed-remove set

policy:
  adding warning tags can be local
  removing warning tags requires safety authority

merge rule:
  local adds merge
  authority remove supersedes observed dots in an epoch

evidence:
  old client cannot remove warning tag
  concurrent add and unauthorized remove leaves warning visible

The same data type can be correct or wrong depending on the domain promise. "Tag" is not enough information.
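The possible design above can be sketched as an observed-remove set whose removes are gated by a safety authority. This is a simplified illustration: `WarningTags` is a hypothetical name, epochs are omitted, and `merge` trusts the remote replica's remove set, whereas a real system would re-run the admission check on received operations.

```python
class WarningTags:
    def __init__(self, replica_id, safety_authorities):
        self.replica_id = replica_id
        self.safety_authorities = frozenset(safety_authorities)
        self.counter = 0
        self.adds = {}         # tag -> set of dots (replica_id, counter)
        self.removed = set()   # dots covered by an authorized remove

    def add(self, tag):
        # Adding a warning is always allowed locally.
        self.counter += 1
        self.adds.setdefault(tag, set()).add((self.replica_id, self.counter))

    def remove(self, tag, actor):
        if actor not in self.safety_authorities:
            return False       # stale or unauthorized clients cannot remove
        # Only the dots this replica has observed are removed.
        self.removed |= self.adds.get(tag, set())
        return True

    def merge(self, other):
        for tag, dots in other.adds.items():
            self.adds.setdefault(tag, set()).update(dots)
        self.removed |= other.removed

    def visible(self):
        return {t for t, dots in self.adds.items() if dots - self.removed}


madrid = WarningTags("madrid", {"safety-officer"})
dublin = WarningTags("dublin", {"safety-officer"})
madrid.add("gas-risk")
dublin.merge(madrid)
rejected = dublin.remove("gas-risk", actor="old-client")
print(rejected, dublin.visible())
```

Because an authorized remove only covers observed dots, a warning that was re-added concurrently on another replica reappears after merge instead of being silently erased.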

Assigned Owner

Assignment is a workflow fact.

assigned_owner = user:ana

If temporary disagreement is acceptable, use a multi-value register and surface conflict:

Madrid:
  assigned_owner = Ana

Dublin:
  assigned_owner = Bruno

merge:
  conflict: {Ana, Bruno}

If the product requires exactly one active responder for safety or paging, use an authority:

assignment service:
  decides owner
  emits authoritative assignment operation

The review question is not "register or lock?" It is "what should a user see when two people assign concurrently?"
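The multi-value register option can be sketched with version vectors. This is an illustrative, in-memory version with hypothetical names (`MVRegister`, `dominates`); it keeps every concurrently written value until a later assignment that has seen all of them supersedes them.

```python
def dominates(a, b):
    """True if version vector a has seen everything recorded in b."""
    return all(a.get(k, 0) >= v for k, v in b.items())


class MVRegister:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.entries = []   # list of (value, version_vector)

    def assign(self, value):
        # The new write has seen every entry currently held here.
        vv = {}
        for _, seen in self.entries:
            for k, v in seen.items():
                vv[k] = max(vv.get(k, 0), v)
        vv[self.replica_id] = vv.get(self.replica_id, 0) + 1
        self.entries = [(value, vv)]

    def merge(self, other):
        combined = self.entries + other.entries
        kept = []
        for value, vv in combined:
            superseded = any(dominates(o, vv) and o != vv for _, o in combined)
            if not superseded and (value, vv) not in kept:
                kept.append((value, vv))
        self.entries = kept

    def values(self):
        return {value for value, _ in self.entries}


madrid, dublin = MVRegister("madrid"), MVRegister("dublin")
madrid.assign("Ana")
dublin.assign("Bruno")
madrid.merge(dublin)
conflict = madrid.values()   # both candidates survive the merge
madrid.assign("Ana")         # a user resolves the conflict explicitly
dublin.merge(madrid)
print(conflict, dublin.values())
```

The register never picks a winner on its own. What the UI does with a two-element result is the product decision the review has to make.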

Status And Audit Number

Closing the incident changes several promises:

close incident:
  status = closed
  audit_number = unique
  edits freeze after close

This is not a good local CRDT write.

authority:
  verifies close rights
  allocates audit number
  emits freeze epoch

replicas:
  accept edits made before freeze
  reject or quarantine late edits after freeze
  render close decision when received

The fast path is still useful. Users can add notes and photos locally while waiting. The close action should honestly show "waiting for authority" instead of pretending the local device can decide the audit invariant.
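A minimal sketch of the authority and replica roles described above. All names are hypothetical, the authority is a single in-memory object, and the freeze epoch is passed in by the caller. A real design would need a durable allocator and its own rule for choosing epochs.

```python
import itertools


class CloseAuthority:
    """Single decision point for closing incidents (illustrative only)."""

    def __init__(self, closers):
        self.closers = frozenset(closers)
        self._numbers = itertools.count(1)
        self.decisions = {}   # incident_id -> decision

    def close(self, incident_id, actor, freeze_epoch):
        if actor not in self.closers:
            return None       # close rights are verified first
        # A repeated or concurrent close returns the original decision,
        # so an incident never receives two audit numbers.
        if incident_id not in self.decisions:
            self.decisions[incident_id] = {
                "incident": incident_id,
                "audit_number": next(self._numbers),
                "freeze_epoch": freeze_epoch,
            }
        return self.decisions[incident_id]


class IncidentReplica:
    def __init__(self):
        self.status = "open"
        self.audit_number = None
        self.freeze_epoch = None
        self.edits = []         # (edit, epoch) pairs accepted so far
        self.quarantined = []   # late edits kept for review, not applied

    def request_close(self):
        # The device cannot decide the audit invariant by itself.
        if self.status == "open":
            self.status = "waiting for authority"

    def apply_edit(self, edit, epoch):
        if self.freeze_epoch is not None and epoch >= self.freeze_epoch:
            self.quarantined.append((edit, epoch))
            return False
        self.edits.append((edit, epoch))
        return True

    def apply_decision(self, decision):
        self.status = "closed"
        self.audit_number = decision["audit_number"]
        self.freeze_epoch = decision["freeze_epoch"]
        # Edits accepted locally but stamped at or after the freeze move out.
        late = [e for e in self.edits if e[1] >= self.freeze_epoch]
        self.edits = [e for e in self.edits if e[1] < self.freeze_epoch]
        self.quarantined.extend(late)


authority = CloseAuthority({"supervisor-1"})
replica = IncidentReplica()
replica.apply_edit("note A", epoch=4)
replica.apply_edit("note B", epoch=5)
replica.request_close()
pending_status = replica.status
decision = authority.close("I9", "supervisor-1", freeze_epoch=5)
replica.apply_decision(decision)
print(pending_status, replica.status, replica.audit_number)
```

Between `request_close` and `apply_decision` the replica reports "waiting for authority", which is the state the UI should show rather than a locally invented close.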

Activity Feed

The feed is derived:

activity_feed:
  "Ana added note N77"
  "Incident closed by authority"

Do not make feed entries independent source truth unless there is a reason. Derive them from accepted operations:

source operations -> feed projection

Then a missing or duplicated feed entry is a rebuild problem, not a domain conflict.
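A small sketch of the feed as a projection. The operation shape (`op_id`, `ts`, `kind`, `actor`, `note_id`) is assumed for illustration, with `ts` standing in for whatever logical ordering the system already has. Rebuilding from the same accepted operations always yields the same feed, regardless of duplicates or arrival order.

```python
def project_feed(accepted_ops):
    """Rebuild the activity feed from accepted source operations."""
    seen = set()
    feed = []
    for op in sorted(accepted_ops, key=lambda o: (o["ts"], o["op_id"])):
        if op["op_id"] in seen:
            continue   # a duplicated operation yields one feed entry
        seen.add(op["op_id"])
        if op["kind"] == "add_note":
            feed.append(f'{op["actor"]} added note {op["note_id"]}')
        elif op["kind"] == "close":
            feed.append("Incident closed by authority")
    return feed


add = {"op_id": "phone:77", "ts": 1, "kind": "add_note",
       "actor": "Ana", "note_id": "N77"}
close = {"op_id": "authority:1", "ts": 2, "kind": "close"}

feed_a = project_feed([add, close])
feed_b = project_feed([close, add, add])   # reordered and duplicated
print(feed_a == feed_b)
```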

Review Checklist

Use this checklist before approving a coordination-avoidance design.

Data identity:
  Are operation IDs, dots, object IDs, and actor IDs stable?

Merge laws:
  Does merge satisfy the required laws or operation delivery contract?

Invariant safety:
  Which invariants are preserved without coordination?
  Which are protected by rights or authority?

Admission control:
  Who may emit each operation?
  What gets rejected before merge?

Causal visibility:
  Which operations depend on earlier facts?
  What is buffered or hidden until dependencies arrive?

Compaction:
  What metadata grows?
  What stability signal makes cleanup safe?

Migration:
  What happens when old and new clients coexist?

Observability:
  Can operators explain missing, rejected, repaired, and conflicting operations?

User experience:
  What does the UI show for pending authority, conflict, rejection, and repair?

Evidence:
  Which property tests, generated histories, dashboards, and runbooks exist?

The review should produce decisions, not just discussion. For each field, write:

local CRDT
local with allocated rights
authority-routed
derived and repairable
not suitable for coordination avoidance

If a field cannot be classified, the design is not ready.
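The readiness rule can be made mechanical with a short check. This is only a sketch: the review is assumed to be a plain mapping from field name to decision string, and `unclassified_fields` is a hypothetical helper.

```python
ALLOWED_DECISIONS = {
    "local CRDT",
    "local with allocated rights",
    "authority-routed",
    "derived and repairable",
    "not suitable for coordination avoidance",
}


def unclassified_fields(review):
    """Return fields whose decision is missing or not one of the five."""
    return sorted(f for f, d in review.items() if d not in ALLOWED_DECISIONS)


review = {
    "notes": "local CRDT",
    "audit_number": "authority-routed",
    "activity_feed": "derived and repairable",
    "assigned_owner": None,           # still under discussion
    "status": "eventually consistent",  # not a decision
}
blocking = unclassified_fields(review)
print("ready" if not blocking else f"not ready: {blocking}")
```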

Architecture Review Output

A useful review output is short and concrete.

Field: notes
Decision: local CRDT
Why: append/edit facts merge by stable ID
Boundary: actor must have note-write capability
Evidence: duplicate delivery, offline reconnect, unauthorized write tests

Field: audit_number
Decision: authority-routed
Why: uniqueness cannot be inferred locally from absence
Boundary: regional audit allocator
Evidence: partition test where two regions attempt close concurrently

Field: warning_tags.remove
Decision: authority or safety-specific remove-wins policy
Why: stale clients must not erase safety warnings
Boundary: safety authority and epoch
Evidence: malicious/stale removal tests

This format prevents two common review failures. First, it stops teams from saying "eventual consistency is okay" without naming what can be stale. Second, it stops teams from saying "we need strong consistency" for everything, when only a few fields need authority.

Failure Modes

Reviewing data types instead of promises: Choosing an OR-set or a register says nothing about whether the domain invariant is safe.
Hiding authorities: A system may claim to be coordination-free while quietly relying on one service for uniqueness, rights, or closure.
Treating derived views as source state: Feeds, counters, and search indexes should usually rebuild from accepted operations.
Ignoring trust boundaries: Mergeable unauthorized updates are still invalid.
No answer for concurrent intent: If two users write conflicting assignments, the UI and workflow need a visible policy.
No compaction story: Tombstones, dots, and operation logs cannot grow forever without stability signals.
No migration story: Old clients can corrupt new semantics if semantic fields are silently dropped.
No operational evidence: A design that cannot be tested, observed, or repaired is not ready for production.

Practice

Review this proposal:

Product:
  multi-region maintenance board

Fields:
  task title
  task comments
  task status
  tags
  assigned technician
  inventory parts used
  audit close number
  notification feed

For each field, classify it:

local CRDT:

local with allocated rights:

authority-routed:

derived and repairable:

needs explicit conflict state:

Then choose one risky field and fill out:

domain promise:

invariant:

concurrent update example:

safe merge or authority boundary:

rejection/repair path:

test history:

dashboard signal:

Finally, write one sentence the UI should show when a user performs an action locally but the system must wait for authority.

Connections

010.md introduced invariant confluence; this review uses it to decide what can merge safely.
011.md introduced escrow and bounded rights; the review uses rights when local work consumes scarce capacity.
020.md covered observability and repair; those are required evidence for an approved design.
021.md covered trust boundaries; every admitted operation in the design needs that lens.
024.md turns this review method into a capstone architecture for a mergeable multi-region application.

Resources

[PAPER] Coordination Avoidance in Database Systems
- Focus: Use invariant confluence as the basis for deciding where coordination is necessary.
[PAPER] Highly Available Transactions: Virtues and Limitations
- Focus: Separate highly available guarantees from stronger transactional guarantees.
[PAPER] Keeping CALM: When Distributed Consistency is Easy
- Focus: Connect monotonic reasoning to coordination avoidance.
[PAPER] A comprehensive study of Convergent and Commutative Replicated Data Types
- Focus: Recheck the data-type mechanics behind the review decisions.
[BOOK] Designing Data-Intensive Applications
- Focus: Compare replication, transactions, derived data, and user-visible consistency trade-offs.

Key Takeaways

Coordination-avoidance review starts from domain promises and invariants, not from a list of CRDT types.
The best designs keep mergeable facts local and route uniqueness, rights, trust, and safety decisions to explicit boundaries.
Every field should have a decision: local CRDT, local with rights, authority-routed, derived and repairable, or not suitable.
A production-ready design includes evidence: merge tests, invariant histories, observability, repair paths, and migration behavior.
