CRDTs and Coordination Avoidance: Design Review for Coordination Avoidance Strategy

Lesson 023, 30 min, intermediate

Core Insight

Imagine a team proposing a multi-region incident-management product. Field teams need to add notes offline, attach photos, assign responders, update checklists, close incidents, enforce permissions, and produce an audit trail. Someone says, "Let's make it CRDT-based so every region stays writable."

That sentence is not a design yet. It is a useful instinct for some parts of the system and a dangerous shortcut for others. Notes, comments, checklist observations, and derived feeds may be good candidates for local merge. Unique audit numbers, permission grants, inventory rights, regulated incident closure, and revocation need stronger boundaries, because each depends on uniqueness, a bounded resource, or a trust decision that a local merge cannot settle.

A coordination-avoidance design review asks a sharper question than "can we use CRDTs?" It asks: for each domain promise, what is the state shape, what invariant must hold, who is allowed to create the fact, how do concurrent updates merge, what can lag safely, and what evidence proves the choice is correct?

The non-obvious point is that the review should reduce coordination without hiding it. A good design routes only the non-mergeable decisions to rights, authorities, epochs, or transactions. It keeps local work fast where merge is safe, and it names the moments where waiting is the honest product behavior.

Start From Promises, Not Data Types

A common mistake is to start with a menu of CRDTs:

comments -> OR-set
status -> register
tags -> OR-set
counter -> PN-counter

That is too early. Start with the promises the product makes to users and operators:

1. A technician can add field notes while offline.
2. A warning tag should not disappear just because another replica was stale.
3. A closed incident must have one audit number.
4. Only authorized roles may close regulated incidents.
5. Search and activity feeds may lag, but must repair.
6. Old clients must not bypass new safety semantics.

Now the design has something to test. Each promise points to a state model and a coordination boundary.

Use this review table:

domain promise:
  What must users be able to rely on?

source state:
  Which facts are authoritative?

merge rule:
  How do concurrent facts combine?

invariant:
  What must never be violated?

authority:
  Who can decide non-mergeable facts?

lag tolerance:
  What can be temporarily stale?

repair path:
  How does the system fix bad or missing state?

evidence:
  What tests, metrics, or runbooks prove this is working?

The trade-off becomes visible. CRDTs reduce waiting by accepting mergeable facts locally. They also force the team to be explicit about state identity, tombstones, causal context, admission control, and repair.

The Coordination Avoidance Decision Tree

For each write, ask the same sequence of questions.

1. Is the operation monotonic?
   If yes (it only adds facts and never retracts one), concurrent copies can usually merge without coordination.

2. If not monotonic, is it still invariant-confluent?
   If all valid concurrent states merge into a valid state, coordination may be avoidable.

3. Does the write consume a bounded right?
   If yes, it can be local only where rights have been allocated.

4. Does the write decide uniqueness or global absence?
   If yes, route it to an authority or coordinate.

5. Does the write change trust, permissions, or safety boundaries?
   If yes, use an authority, epoch, remove-wins rule, or explicit revocation model.

6. Is the field derived from source facts?
   If yes, prefer rebuild and repair over treating it as independent truth.

This is not a slogan. It is a review process that turns vague availability goals into concrete design choices.
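
The six questions can be sketched as a small routing function. This is a minimal illustration, not a prescribed implementation: `WriteProfile`, `Route`, and `route` are hypothetical names, and the evaluation order is an assumption. It checks the strongest boundaries first, so a write that looks monotonic but also decides uniqueness is never routed to local merge.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LOCAL_MERGE = "local merge"
    LOCAL_WITH_RIGHTS = "local with allocated rights"
    AUTHORITY = "authority or coordination"
    DERIVED = "derived, rebuild and repair"


@dataclass(frozen=True)
class WriteProfile:
    """Answers to the six review questions for one kind of write (hypothetical shape)."""
    monotonic: bool = False
    invariant_confluent: bool = False
    consumes_bounded_right: bool = False
    decides_uniqueness_or_absence: bool = False
    changes_trust_boundary: bool = False
    derived_from_source: bool = False


def route(write: WriteProfile) -> Route:
    # Question 6: derived fields are rebuilt, never treated as independent truth.
    if write.derived_from_source:
        return Route.DERIVED
    # Questions 4 and 5: uniqueness, absence, and trust changes need an authority.
    if write.decides_uniqueness_or_absence or write.changes_trust_boundary:
        return Route.AUTHORITY
    # Question 3: bounded rights can be spent locally once allocated.
    if write.consumes_bounded_right:
        return Route.LOCAL_WITH_RIGHTS
    # Questions 1 and 2: monotonic or invariant-confluent writes can merge locally.
    if write.monotonic or write.invariant_confluent:
        return Route.LOCAL_MERGE
    # Assumption: anything unclassified takes the conservative path.
    return Route.AUTHORITY
```

Defaulting unknown writes to the authority path is a design choice in this sketch: an unanswered question costs latency rather than correctness.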

Examples:

add comment:
  mergeable if comment IDs are unique and actor is authorized

remove warning tag:
  maybe remove-wins or authority-controlled, depending on safety meaning

claim slug "atlas":
  authority or allocation needed

decrement inventory:
  local only with escrowed rights

close regulated incident:
  authority decision, because legal/audit invariant matters

activity feed entry:
  derived from accepted source operations

The goal is not to avoid coordination everywhere. The goal is to spend coordination where it buys correctness.
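
The "decrement inventory" case is the least intuitive of these examples, so here is a rough sketch of escrowed rights. `EscrowedStock` is a hypothetical class. Each replica holds a share of the total stock and may spend only that share without talking to anyone. Moving rights between replicas is the coordinated step, modeled here as a plain in-process call for illustration only.

```python
class EscrowedStock:
    """One replica's share of a bounded inventory (hypothetical sketch)."""

    def __init__(self, replica: str, rights: int) -> None:
        self.replica = replica
        self.rights = rights  # units this replica may consume without coordination

    def decrement(self, amount: int) -> bool:
        """Spend local rights; refuse instead of guessing what other replicas hold."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        if amount > self.rights:
            return False  # caller must obtain more rights before retrying
        self.rights -= amount
        return True

    def transfer_to(self, other: "EscrowedStock", amount: int) -> bool:
        """Move unused rights to another replica. In a real system this is the
        coordinated step; here it is a direct call, as an assumption."""
        if amount <= 0 or amount > self.rights:
            return False
        self.rights -= amount
        other.rights += amount
        return True
```

Because the rights across all replicas never sum to more than the real stock, no combination of local decrements can oversell, even during a partition.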

Worked Example: Incident Response App

Review this proposed object:

incident I9:
  notes
  photos
  warning_tags
  assigned_owner
  status
  audit_number
  activity_feed

Notes

Notes are append-like. A technician can add one while offline.

operation phone:77
  actor: technician-9
  effect:
    notes.add(note:N77, body:"Valve inspected")

Design:

source state:
  OR-map note_id -> rich text body CRDT

merge rule:
  unique note IDs merge by map key
  body edits merge by text CRDT

authority:
  admission checks actor can add notes

lag tolerance:
  notes may appear later on other replicas

evidence:
  duplicate delivery tests
  authorization rejection tests
  sync backlog metrics

This is a good CRDT candidate. The hard part is not agreement on one global order. The hard part is stable identity, authorization, and repair.
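
A minimal sketch of the notes design, under simplifying assumptions: the note body is treated as an opaque string instead of a text CRDT, and every replica applies the same admission policy. `AddNote` and `NoteReplica` are hypothetical names. The point is that stable IDs plus an admission check make duplicate and reordered delivery harmless.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AddNote:
    op_id: str    # unique per operation, e.g. "phone:77"
    actor: str
    note_id: str  # stable note identity, e.g. "note:N77"
    body: str


@dataclass
class NoteReplica:
    allowed_actors: set[str]
    notes: dict[str, str] = field(default_factory=dict)
    seen_ops: set[str] = field(default_factory=set)
    rejected: list[str] = field(default_factory=list)

    def apply(self, op: AddNote) -> bool:
        """Apply one delivered operation; safe under duplicates and reordering."""
        if op.op_id in self.seen_ops:
            return False  # duplicate delivery: already handled
        self.seen_ops.add(op.op_id)
        if op.actor not in self.allowed_actors:
            self.rejected.append(op.op_id)  # kept so operators can explain it
            return False
        # Unique note IDs merge by map key; an existing key is never overwritten.
        self.notes.setdefault(op.note_id, op.body)
        return True
```

Recording rejected operation IDs instead of silently dropping them supports the evidence items above: the rejection tests have something to assert on, and operators can explain why a note never appeared.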

Warning Tags

Warning tags look like a set, but the promise matters.

tag:
  "gas-risk"

If a warning tag is safety-critical, stale removal is dangerous. A plain add-wins or remove-wins set is not automatically correct.

Possible design:

source state:
  warning_tags as observed-remove set

policy:
  adding warning tags can be local
  removing warning tags requires safety authority

merge rule:
  local adds merge
  authority remove supersedes observed dots in an epoch

evidence:
  old client cannot remove warning tag
  concurrent add and unauthorized remove leaves warning visible

The same data type can be correct or wrong depending on the domain promise. "Tag" is not enough information.
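
One way to picture this policy is an observed-remove set whose remove path is gated by role. The sketch below is illustrative only: `WarningTags` and the `"safety-authority"` role string are assumptions, and epochs are left out to keep it short. A remove cancels only the add dots it observed, so a concurrent add survives.

```python
from dataclasses import dataclass, field

Dot = tuple[str, int]  # (replica id, per-replica counter), unique per add


@dataclass
class WarningTags:
    adds: dict[str, set[Dot]] = field(default_factory=dict)  # tag -> dots that added it
    removed: set[Dot] = field(default_factory=set)           # dots cancelled by accepted removes

    def add(self, tag: str, dot: Dot) -> None:
        """Adding a warning is always allowed locally."""
        self.adds.setdefault(tag, set()).add(dot)

    def remove(self, tag: str, observed: set[Dot], actor_role: str) -> bool:
        """Only the safety authority may remove, and only the dots it observed."""
        if actor_role != "safety-authority":
            return False  # rejected at admission, before it can reach merge
        self.removed |= observed
        return True

    def visible(self) -> set[str]:
        """A tag is visible while at least one of its add dots is not cancelled."""
        return {tag for tag, dots in self.adds.items() if dots - self.removed}

    def merge(self, other: "WarningTags") -> None:
        """Union both components; the result does not depend on merge order."""
        for tag, dots in other.adds.items():
            self.adds.setdefault(tag, set()).update(dots)
        self.removed |= other.removed
```

In this sketch an unauthorized remove never enters replicated state at all, which is what makes "concurrent add and unauthorized remove leaves warning visible" straightforward to test.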

Assigned Owner

Assignment is a workflow fact.

assigned_owner = user:ana

If temporary disagreement is acceptable, use a multi-value register and surface conflict:

Madrid:
  assigned_owner = Ana

Dublin:
  assigned_owner = Bruno

merge:
  conflict: {Ana, Bruno}

If the product requires exactly one active responder for safety or paging, use an authority:

assignment service:
  decides owner
  emits authoritative assignment operation

The review question is not "register or lock?" It is "what should a user see when two people assign concurrently?"
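
To make the first option concrete, here is a rough sketch of a multi-value register that keeps concurrent assignments side by side until someone resolves them. `OwnerRegister` and `dominates` are hypothetical names, and version vectors are one common way to detect concurrency, not the only one.

```python
from dataclasses import dataclass, field

Clock = dict[str, int]  # version vector: replica id -> counter


def dominates(a: Clock, b: Clock) -> bool:
    """True if a has seen everything b has seen, plus at least one more event."""
    keys = a.keys() | b.keys()
    return (all(a.get(k, 0) >= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) > b.get(k, 0) for k in keys))


@dataclass
class OwnerRegister:
    versions: list[tuple[str, Clock]] = field(default_factory=list)

    def assign(self, replica: str, owner: str) -> None:
        """A new assignment supersedes every version this replica has seen."""
        clock: Clock = {}
        for _, seen in self.versions:
            for k, v in seen.items():
                clock[k] = max(clock.get(k, 0), v)
        clock[replica] = clock.get(replica, 0) + 1
        self.versions = [(owner, clock)]

    def merge(self, other: "OwnerRegister") -> None:
        """Keep every version that no other version dominates."""
        candidates = self.versions + other.versions
        kept: list[tuple[str, Clock]] = []
        for owner, clock in candidates:
            if any(dominates(c, clock) for _, c in candidates):
                continue
            if (owner, clock) not in kept:
                kept.append((owner, clock))
        self.versions = kept

    def owners(self) -> set[str]:
        """More than one owner means a conflict the UI should surface."""
        return {owner for owner, _ in self.versions}
```

When `owners()` returns more than one name, the register is reporting a conflict, not hiding it. A later assignment made after seeing both versions dominates them and collapses the set back to one owner.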

Status And Audit Number

Closing the incident changes several promises:

close incident:
  status = closed
  audit_number = unique
  edits freeze after close

This is not a good local CRDT write.

authority:
  verifies close rights
  allocates audit number
  emits freeze epoch

replicas:
  accept comments before freeze
  reject or quarantine late edits after freeze
  render close decision when received

The fast path is still useful. Users can add notes and photos locally while waiting. The close action should honestly show "waiting for authority" instead of pretending the local device can decide the audit invariant.
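
The split between authority and replicas might look like the following sketch. All names here (`CloseAuthority`, `CloseDecision`, `IncidentReplica`) are hypothetical, the authority is a single in-process object for illustration, and "late" is simplified to mean "applied after this replica received the close decision".

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class CloseDecision:
    incident: str
    audit_number: int
    freeze_epoch: int


class CloseAuthority:
    """Single decision point for closing incidents (illustrative only)."""

    def __init__(self, closers: set[str]) -> None:
        self.closers = closers
        self._audit_numbers = itertools.count(1)
        self._decisions: dict[str, CloseDecision] = {}

    def close(self, incident: str, actor: str, current_epoch: int) -> Optional[CloseDecision]:
        if actor not in self.closers:
            return None  # close rights are verified before anything is allocated
        # Repeated or concurrent requests for one incident share one decision,
        # so the incident can never end up with two audit numbers.
        if incident not in self._decisions:
            self._decisions[incident] = CloseDecision(
                incident, next(self._audit_numbers), current_epoch + 1)
        return self._decisions[incident]


@dataclass
class IncidentReplica:
    epoch: int = 0
    status: str = "open"
    audit_number: Optional[int] = None
    notes: list[str] = field(default_factory=list)
    quarantined: list[str] = field(default_factory=list)

    def add_note(self, body: str) -> None:
        # After the freeze, edits are set aside for review, not merged.
        (self.quarantined if self.status == "closed" else self.notes).append(body)

    def apply_close(self, decision: CloseDecision) -> None:
        self.status = "closed"
        self.audit_number = decision.audit_number
        self.epoch = decision.freeze_epoch
```

Until `apply_close` runs on a replica, its status stays "open", which matches the "waiting for authority" state the UI should show.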

Activity Feed

The feed is derived:

activity_feed:
  "Ana added note N77"
  "Incident closed by authority"

Do not make feed entries independent source truth unless there is a reason. Derive them from accepted operations:

source operations -> feed projection

Then a missing or duplicated feed entry is a rebuild problem, not a domain conflict.
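
A projection of this kind can be sketched as a pure function over accepted operations. The operation shape (`op_id`, `time`, `kind`, `actor`, `note_id`) is an assumption made for illustration. Because the function deduplicates by operation ID and sorts deterministically, rerunning it repairs a damaged feed.

```python
def project_feed(accepted_ops: list[dict]) -> list[str]:
    """Rebuild the activity feed from accepted source operations."""
    seen: set[str] = set()
    feed: list[str] = []
    # Deterministic order: timestamp first, operation ID as a tie-breaker.
    for op in sorted(accepted_ops, key=lambda o: (o["time"], o["op_id"])):
        if op["op_id"] in seen:
            continue  # duplicated source operation produces one entry
        seen.add(op["op_id"])
        if op["kind"] == "add_note":
            feed.append(f'{op["actor"]} added note {op["note_id"]}')
        elif op["kind"] == "close":
            feed.append("Incident closed by authority")
    return feed
```

Running the projection twice over the same operations, in any input order, yields the same feed, so the repair path is simply "rebuild".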

Review Checklist

Use this checklist before approving a coordination-avoidance design.

Data identity:
  Are operation IDs, dots, object IDs, and actor IDs stable?

Merge laws:
  Does merge satisfy the required laws (commutative, associative, and idempotent for state-based merge), or the delivery contract an operation-based design assumes?

Invariant safety:
  Which invariants are preserved without coordination?
  Which are protected by rights or authority?

Admission control:
  Who may emit each operation?
  What gets rejected before merge?

Causal visibility:
  Which operations depend on earlier facts?
  What is buffered or hidden until dependencies arrive?

Compaction:
  What metadata grows?
  What stability signal makes cleanup safe?

Migration:
  What happens when old and new clients coexist?

Observability:
  Can operators explain missing, rejected, repaired, and conflicting operations?

User experience:
  What does the UI show for pending authority, conflict, rejection, and repair?

Evidence:
  Which property tests, generated histories, dashboards, and runbooks exist?
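
For the "Merge laws" item, a small exhaustive check over sample states is often enough to catch a broken merge early. The sketch below is illustrative: `check_merge_laws`, `union_merge`, and `last_argument_wins` are hypothetical names, and a real suite would more likely use generated histories than a fixed sample list.

```python
import itertools
from typing import Callable


def union_merge(a: frozenset, b: frozenset) -> frozenset:
    """Grow-only set merge: satisfies all three laws."""
    return a | b


def last_argument_wins(a: frozenset, b: frozenset) -> frozenset:
    """Deliberately broken merge: the result depends on argument order."""
    return b


def check_merge_laws(merge: Callable, samples: list) -> list[str]:
    """Return a description of every law violation found in the samples."""
    failures: list[str] = []
    for a in samples:
        if merge(a, a) != a:
            failures.append(f"idempotence: {a!r}")
    for a, b in itertools.product(samples, repeat=2):
        if merge(a, b) != merge(b, a):
            failures.append(f"commutativity: {a!r}, {b!r}")
    for a, b, c in itertools.product(samples, repeat=3):
        if merge(merge(a, b), c) != merge(a, merge(b, c)):
            failures.append(f"associativity: {a!r}, {b!r}, {c!r}")
    return failures
```

An empty result for `union_merge` and a list of commutativity failures for `last_argument_wins` are the kind of evidence a reviewer can ask to see.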

The review should produce decisions, not just discussion. For each field, write:

local CRDT
local with allocated rights
authority-routed
derived and repairable
not suitable for coordination avoidance

If a field cannot be classified, the design is not ready.
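
The readiness rule can be made mechanical. In this sketch, `Decision` mirrors the five outcomes listed above, and `unclassified` is a hypothetical helper that reports which fields still block approval. The example classification in the usage note is illustrative, not a recommendation.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    LOCAL_CRDT = "local CRDT"
    LOCAL_WITH_RIGHTS = "local with allocated rights"
    AUTHORITY_ROUTED = "authority-routed"
    DERIVED = "derived and repairable"
    NOT_SUITABLE = "not suitable for coordination avoidance"


def unclassified(fields: dict[str, Optional[Decision]]) -> list[str]:
    """Names of fields with no decision yet; the design is ready only if empty."""
    return sorted(name for name, decision in fields.items() if decision is None)
```

For example, `unclassified({"notes": Decision.LOCAL_CRDT, "status": None})` returns `["status"]`, which tells the reviewers exactly where the discussion is unfinished.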

Architecture Review Output

A useful review output is short and concrete.

Field: notes
Decision: local CRDT
Why: append/edit facts merge by stable ID
Boundary: actor must have note-write capability
Evidence: duplicate delivery, offline reconnect, unauthorized write tests

Field: audit_number
Decision: authority-routed
Why: uniqueness cannot be inferred locally from absence
Boundary: regional audit allocator
Evidence: partition test where two regions attempt close concurrently

Field: warning_tags.remove
Decision: authority or safety-specific remove-wins policy
Why: stale clients must not erase safety warnings
Boundary: safety authority and epoch
Evidence: malicious/stale removal tests

This format prevents two common review failures. First, it stops teams from saying "eventual consistency is okay" without naming what can be stale. Second, it stops teams from saying "we need strong consistency" for everything, when only a few fields need authority.

Practice

Review this proposal:

Product:
  multi-region maintenance board

Fields:
  task title
  task comments
  task status
  tags
  assigned technician
  inventory parts used
  audit close number
  notification feed

For each field, classify it:

local CRDT:

local with allocated rights:

authority-routed:

derived and repairable:

needs explicit conflict state:

Then choose one risky field and fill out:

domain promise:

invariant:

concurrent update example:

safe merge or authority boundary:

rejection/repair path:

test history:

dashboard signal:

Finally, write one sentence the UI should show when a user performs an action locally but the system must wait for authority.
