CRDTs and Coordination Avoidance: Offline-First Clients and Edge Replication

LESSON

018 30 min intermediate

Core Insight

Imagine a field technician opening a maintenance app in a basement with no signal. She edits a checklist, attaches a note, marks one part as replaced, and closes the laptop. Ten minutes later, the device reconnects through a weak mobile link. Meanwhile, an edge replica near the warehouse has accepted another update for the same job.

A server-first design treats the offline edit as a problem: the client could not reach the authority, so the app must block, queue a form submission, or ask the user to retry later. An offline-first design treats the client as a replica. The local device records durable operations, updates the UI immediately, and synchronizes when a path to other replicas exists.

CRDTs help because many edits can merge without asking a central coordinator for permission. But "offline-first" does not mean "every decision is safe offline." The client can create comments, fill checklists, append observations, and edit mergeable fields locally. It still needs authority for unique claims, rights it does not hold, safety-critical state transitions, and policy changes that must not be accepted from an untrusted device.

The design move is to split the system into local-first data that can converge, derived views that can lag and repair, and authoritative decisions that must route through a boundary. Edge replication then becomes useful without becoming magical: it reduces latency and improves availability, while making sync protocol, conflict visibility, identity, and compaction part of the product architecture.

From Remote Form To Local Replica

Start with a simple checklist:

job J42:
  tasks:
    inspect valve
    replace filter
    upload photo

In a remote-form design, the browser sends mutations to a server:

client -> server:
  mark "replace filter" done

server -> client:
  success

If the network is down, the user cannot complete the action. The app may store a pending HTTP request, but the local state is often just an optimistic guess.

An offline-first design stores the action as local replicated state:

local operation:
  op_id: phone:91
  actor: phone
  dependency_context: phone has seen server:144, edge:33
  effect:
    checklist[J42].done.add("replace filter")

The local database applies the effect immediately:

phone visible state:
  replace filter = done
  sync status = not yet shared

Later, the phone sends the operation to an edge node:

phone -> edge:
  phone:91

edge merges:
  done set includes "replace filter"

edge -> phone:
  ack + operations phone was missing

The important shift is not just caching. The device has its own durable operation log or CRDT state. A cache can be thrown away and rebuilt from the server. A local replica is a source of new facts that must be synchronized, deduplicated, authorized, and sometimes repaired.

The Sync Loop

A practical offline-first client usually needs four pieces:

local store:
  current CRDT state or materialized view

outbox:
  local operations not yet acknowledged by enough peers

sync cursor:
  what this replica believes it has received

inbox/apply path:
  remote operations, snapshots, or deltas waiting to merge

The loop is small enough to sketch:

on local edit:
  assign stable op_id
  write operation to outbox
  apply operation to local state
  update UI

on network available:
  send operations missing on peer
  receive operations missing locally
  verify authorization and dependencies
  merge
  mark acknowledged operations
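
The local-edit half of this loop can be sketched in Python. This is a minimal sketch, not a definitive implementation: `Op`, `LocalReplica`, `local_edit`, and `acknowledge` are hypothetical names, and an in-memory list stands in for what would need to be a durable outbox on disk.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Op:
    op_id: str  # stable id such as "phone:91" (actor:counter)
    task: str   # checklist item this operation marks as done


@dataclass
class LocalReplica:
    actor: str
    counter: int = 0
    outbox: list = field(default_factory=list)  # assumption: stand-in for durable storage
    done: set = field(default_factory=set)      # materialized checklist state

    def local_edit(self, task: str) -> Op:
        # 1. Assign a stable op_id before anything else.
        self.counter += 1
        op = Op(op_id=f"{self.actor}:{self.counter}", task=task)
        # 2. Record the operation in the outbox first, so that a crash
        #    after this point can still replay it.
        self.outbox.append(op)
        # 3. Only then apply it to the visible state.
        self.done.add(op.task)
        return op

    def acknowledge(self, op_id: str) -> None:
        # Drop operations that a peer has confirmed.
        self.outbox = [op for op in self.outbox if op.op_id != op_id]


phone = LocalReplica(actor="phone", counter=90)
op = phone.local_edit("replace filter")
print(op.op_id, sorted(phone.done), len(phone.outbox))
# phone:91 ['replace filter'] 1
```

The ordering inside `local_edit` is the point of the sketch: the outbox write comes before the state change, so the visible state never contains an edit that has no record to sync.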

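The client must be able to crash between any two steps. That is why the loop writes the outbox before touching local state: if the client applied an edit to the UI but lost the operation before it reached the outbox, the user would see work that can never sync. Retries are the other side of crash safety. If the client sends an operation twice, the receiver must deduplicate by op_id.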

receiver rule:
  if op_id already seen:
    ignore duplicate
  else:
    validate, merge, remember op_id
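
The receiver rule translates almost line for line into code. The sketch below assumes a hypothetical `Receiver` class whose only authorization check is an allow-list of actors; a real system would validate rights per object and per operation.

```python
class Receiver:
    def __init__(self, allowed_actors):
        self.allowed_actors = set(allowed_actors)  # assumption: toy authorization rule
        self.seen = set()  # op_ids already merged
        self.done = set()  # merged checklist state

    def receive(self, op_id: str, task: str) -> str:
        if op_id in self.seen:
            return "duplicate"  # ignore the retry, state is unchanged
        actor = op_id.split(":", 1)[0]
        if actor not in self.allowed_actors:
            return "rejected"   # validation failed, nothing is merged
        self.done.add(task)     # merge
        self.seen.add(op_id)    # remember op_id
        return "merged"


edge = Receiver(allowed_actors={"phone", "tablet"})
results = [
    edge.receive("phone:91", "replace filter"),
    edge.receive("phone:91", "replace filter"),  # retry after a lost ack
    edge.receive("intruder:1", "upload photo"),
]
print(results)
# ['merged', 'duplicate', 'rejected']
```

Because duplicates are ignored, the sender can retry freely after a timeout without risking a double apply.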

This is where causal context from earlier lessons becomes operational. A client can say "I have seen everything up to these dots" instead of resending the whole document:

client context:
  phone:91
  edge-madrid:450
  server:144

edge sends:
  operations newer than that context
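
One way to compute "operations newer than that context" is to treat the context as a map from actor to the highest counter seen. The sketch below assumes op_ids have the form `actor:counter` and that each actor's counters are contiguous, so a single number per actor summarizes everything seen; `parse_dot` and `missing_ops` are hypothetical helper names.

```python
def parse_dot(op_id: str) -> tuple[str, int]:
    # Split "edge-madrid:450" into ("edge-madrid", 450).
    actor, counter = op_id.rsplit(":", 1)
    return actor, int(counter)


def missing_ops(log: list[str], context: dict[str, int]) -> list[str]:
    """Return the op_ids in `log` that the peer's context does not cover."""
    result = []
    for op_id in log:
        actor, counter = parse_dot(op_id)
        if counter > context.get(actor, 0):
            result.append(op_id)
    return result


edge_log = ["server:144", "server:145", "edge-madrid:450", "edge-madrid:451", "phone:91"]
client_context = {"phone": 91, "edge-madrid": 450, "server": 144}
print(missing_ops(edge_log, client_context))
# ['server:145', 'edge-madrid:451']
```

If counters can have gaps (for example, after selective sync), a single maximum per actor is no longer enough and the context has to track individual dots.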

For large documents or long-lived accounts, the system also needs snapshots:

snapshot:
  state at causal frontier F
  compacted operations before F
  resume by fetching operations after F

Snapshots are not only a performance optimization. They are how a device that has been offline for months can rejoin without replaying every historical edit from the beginning of the product.
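
Resuming from a snapshot can be sketched as "load the state at frontier F, then apply only what lies beyond F". The function below is an illustration under simplifying assumptions: the state is a grow-only set of completed tasks, the frontier maps each actor to its highest compacted counter, and operations arrive in per-actor order. The name `resume_from_snapshot` and the dictionary layout are hypothetical.

```python
def resume_from_snapshot(snapshot: dict, later_ops: list[tuple[str, int, str]]) -> dict:
    state = set(snapshot["state"])
    frontier = dict(snapshot["frontier"])
    for actor, counter, task in later_ops:
        if counter <= frontier.get(actor, 0):
            continue  # already folded into the snapshot
        state.add(task)
        frontier[actor] = counter
    return {"state": state, "frontier": frontier}


snapshot = {
    "state": {"inspect valve"},
    "frontier": {"phone": 90, "server": 144},
}
ops = [
    ("phone", 90, "inspect valve"),    # covered by the snapshot, skipped
    ("phone", 91, "replace filter"),
    ("server", 145, "upload photo"),
]
resumed = resume_from_snapshot(snapshot, ops)
print(sorted(resumed["state"]), resumed["frontier"])
# ['inspect valve', 'replace filter', 'upload photo'] {'phone': 91, 'server': 145}
```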

What Edge Replicas Change

An edge replica is a nearby server-side replica placed close to users, devices, warehouses, regions, or networks. It may be a point of presence, a regional database, or a service running inside a factory network.

Without an edge:

phone in Madrid -> primary in Virginia

Every sync crosses a long wide-area path. The phone can still edit offline, but sharing changes with nearby users or devices is only as fast as the round trip to the distant primary.

With an edge:

phone in Madrid -> Madrid edge -> global sync
tablet in warehouse -> Madrid edge -> global sync

The edge can acknowledge mergeable operations quickly, fan them out to nearby replicas, and continue serving during a wide-area outage. This improves latency and local availability.

It also adds new questions:

1. Which data can the edge accept as authoritative?
2. Which data can the edge cache but not decide?
3. Which operations must be routed to a home region or authority?
4. What happens if two edges accept conflicting work?
5. How does a returning client prove what it already has?

For CRDT-friendly state, the edge can often accept operations locally:

safe local accept:
  append comment
  add checklist item
  mark item observed
  add tag to OR-set
  edit rich text body with sequence CRDT

For non-mergeable decisions, the edge should route or reject:

needs authority:
  claim globally unique job number
  spend inventory rights not allocated to this edge
  grant admin permission
  close safety incident with legal effect

Edges move work closer to users, but they also force the architecture to state explicitly which facts are local, which are mergeable, and which remain authoritative somewhere else.
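
The accept-or-route decision at an edge can be made explicit in code. This is a sketch under stated assumptions: the operation kind names below are invented labels for the examples in the two lists above, and `classify` is a hypothetical function, not part of any library.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT_LOCALLY = "accept locally"
    ROUTE_TO_AUTHORITY = "route to authority"
    REJECT = "reject"


# Hypothetical operation kinds mirroring the lists above.
MERGEABLE = {
    "append_comment", "add_checklist_item", "mark_item_observed",
    "add_tag", "edit_rich_text",
}
NEEDS_AUTHORITY = {
    "claim_job_number", "spend_unallocated_inventory",
    "grant_admin", "close_safety_incident",
}


def classify(kind: str) -> Decision:
    if kind in MERGEABLE:
        return Decision.ACCEPT_LOCALLY
    if kind in NEEDS_AUTHORITY:
        return Decision.ROUTE_TO_AUTHORITY
    return Decision.REJECT  # unknown kinds fail closed


for kind in ("append_comment", "grant_admin", "drop_table"):
    print(kind, "->", classify(kind).value)
# append_comment -> accept locally
# grant_admin -> route to authority
# drop_table -> reject
```

Rejecting unknown kinds by default is a design choice in this sketch: an edge that has not been told an operation is mergeable should not guess.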

Worked Example: Warehouse Job Notes

Design a job note system for technicians. The app supports comments, checklist completion, photo attachments, and one final "job closed" transition.

Use CRDTs for the parts that naturally merge:

job_notes:
  OR-map note_id -> note body CRDT

checklist_done:
  OR-set task_id

attachments:
  OR-set attachment_id

activity_feed:
  derived view from operations
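
To make the OR-set entries concrete, here is a minimal observed-remove set in Python. It is a teaching sketch, not a production CRDT: tombstones are never compacted, and the tags `tablet:3` and `phone:21` are invented for the example. The class name `ORSet` and its methods are hypothetical.

```python
class ORSet:
    def __init__(self):
        self.adds = {}        # element -> set of unique add tags
        self.removed = set()  # add tags that some replica observed and removed

    def add(self, element, tag):
        self.adds.setdefault(element, set()).add(tag)

    def remove(self, element):
        # Remove only the add tags this replica has observed so far.
        self.removed |= self.adds.get(element, set())

    def merge(self, other):
        for element, tags in other.adds.items():
            self.adds.setdefault(element, set()).update(tags)
        self.removed |= other.removed

    def value(self):
        # An element is present if at least one of its add tags survives.
        return {e for e, tags in self.adds.items() if tags - self.removed}


phone, tablet = ORSet(), ORSet()
for replica in (phone, tablet):
    replica.add("inspect valve", "tablet:3")  # both replicas start from the same add

tablet.remove("inspect valve")                # tablet unchecks the item
phone.add("inspect valve", "phone:21")        # phone concurrently re-checks it
phone.add("replace filter", "phone:19")

phone.merge(tablet)
tablet.merge(phone)
print(sorted(phone.value()))
# ['inspect valve', 'replace filter']
```

The concurrent re-check survives because the tablet's remove only covered the tag it had seen. Both replicas reach the same value regardless of merge order.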

A phone can create a note offline:

operation phone:18
  add note N77
  body = "Filter was cracked"
  depends_on: job J42 exists

The phone can also mark a checklist item done:

operation phone:19
  checklist_done.add("replace filter")

Both operations sync to the Madrid edge. Another technician's tablet receives them from the same edge before the global region catches up.

phone -> madrid-edge -> tablet
                      -> global-region later

The edge should not blindly accept the final close if the domain says closing a job consumes a unique audit number and freezes the checklist:

operation phone:20
  close job J42
  audit_number = next()

That operation crosses a coordination boundary. The client might show:

close requested
waiting for authority

If the authority accepts, it emits the authoritative close operation:

operation authority:7001
  job_status.write(closed)
  audit_number.assign(A-2026-0042)
  freeze_epoch = authority:7001

The lesson is not "CRDTs for some fields, transactions for everything else." It is more precise:

merge locally:
  facts where concurrent additions or edits can be joined

route to authority:
  facts that allocate scarce rights, decide uniqueness, or change safety boundaries

derive and repair:
  feeds, counters, search indexes, notifications

That split lets most user work feel instant while the few non-mergeable decisions remain explicit.

Conflict Visibility and Repair

Offline-first systems should not pretend conflicts never happen. They should make the common conflicts merge automatically and the meaningful conflicts visible in a form the user or domain can resolve.

Consider a job title stored as a last-writer-wins register:

phone:
  title = "Replace filter"

tablet:
  title = "Replace pump"

If the system picks the later timestamp, one edit disappears. That may be acceptable for a draft label, but it is risky if the title carries operational meaning.

A multi-value register makes the conflict explicit:

title conflict:
  - Replace filter
  - Replace pump

The edge can sync both values. A user or workflow can resolve them later:

resolution operation:
  title = "Replace pump filter"
  supersedes phone:44, tablet:12

This is still coordination avoidance. The system did not block both technicians while they were disconnected. It preserved enough information to converge to an honest conflict state and then resolve it with a later operation.
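
A multi-value register of this kind can be sketched by letting each write name the writes it supersedes. This is an assumption-laden simplification: real implementations usually derive supersession from causal context rather than from an explicit list, and the class `MVRegister` and the op_id `tablet:13` are hypothetical.

```python
class MVRegister:
    def __init__(self):
        self.writes = {}         # op_id -> value
        self.superseded = set()  # op_ids replaced by a later write

    def write(self, op_id, value, supersedes=()):
        self.writes[op_id] = value
        self.superseded.update(supersedes)

    def merge(self, other):
        self.writes.update(other.writes)
        self.superseded |= other.superseded

    def values(self):
        # Every write that no later write has superseded is still visible.
        return sorted(
            value for op_id, value in self.writes.items()
            if op_id not in self.superseded
        )


phone, tablet, edge = MVRegister(), MVRegister(), MVRegister()
phone.write("phone:44", "Replace filter")
tablet.write("tablet:12", "Replace pump")

edge.merge(phone)
edge.merge(tablet)
conflict = edge.values()
print(conflict)
# ['Replace filter', 'Replace pump']

edge.write("tablet:13", "Replace pump filter", supersedes=["phone:44", "tablet:12"])
print(edge.values())
# ['Replace pump filter']
```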

Repair paths matter too. An edge may accept an operation that a central policy later rejects:

edge accepted:
  add attachment A9

authority later rejects:
  attachment violates retention policy

The repair should be a new fact, not silent history editing:

authority:810:
  revoke attachment A9
  reason = retention_policy

That gives clients a causal explanation and lets derived views clean themselves up.
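
A derived view that honors revocations can be recomputed from the operation log. The sketch below assumes a hypothetical log format of dictionaries with `op_id`, `kind`, and `attachment` fields; the op_ids other than `authority:810` and the attachment `A10` are invented for the example.

```python
def visible_attachments(log):
    """Recompute the attachment view from the full operation log."""
    added = set()
    revoked = {}  # attachment -> reason, kept so clients can explain the removal
    for op in log:
        if op["kind"] == "add_attachment":
            added.add(op["attachment"])
        elif op["kind"] == "revoke_attachment":
            revoked[op["attachment"]] = op["reason"]
    return added - set(revoked), revoked


log = [
    {"op_id": "phone:30", "kind": "add_attachment", "attachment": "A9"},
    {"op_id": "phone:31", "kind": "add_attachment", "attachment": "A10"},
    {"op_id": "authority:810", "kind": "revoke_attachment",
     "attachment": "A9", "reason": "retention_policy"},
]
visible, revoked = visible_attachments(log)
print(sorted(visible), revoked)
# ['A10'] {'A9': 'retention_policy'}
```

The original add operation stays in the log. Only the view changes, and the revocation reason is available to show the user why the attachment disappeared.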

Failure Modes

Confusing cache with replica: A cache can be discarded; an offline client may contain new user facts that must not be lost.
Losing the outbox: If a local edit reaches the UI but not durable sync state, the system has created unrecoverable optimism.
Using wall-clock timestamps as proof: Clocks help with display, not with causality or authorization.
Accepting every edge write as globally valid: Edge acceptance is safe only for operations the edge is allowed to validate or merge.
Hiding meaningful conflicts: Automatic merge is useful when it preserves intent; otherwise it can erase important user work.
Letting old clients rejoin without a plan: Offline-first systems need snapshots, causal cursors, or forced resync boundaries.
Treating derived views as source state: Feeds, counters, and indexes should be recomputable or repairable from authoritative operations.
Skipping authorization on sync: A valid CRDT operation can still be illegal if the actor lacks rights for that object or epoch.

Practice

Design the sync model for an offline-first note app used by field teams.

Use this object:

case C19:
  notes
  tags
  assigned_owner
  photos
  closed_at

Classify each field:

1. Can it be edited locally with a CRDT?
2. Does it need rights allocated in advance?
3. Does it need a central authority?
4. Is it a derived view that can repair later?

Then sketch the client state:

local store:

outbox:

sync cursor:

conflict surface:

snapshot or compaction rule:

Finally, explain what the UI should show when a local "close case" action is waiting for authority while comments and photos continue syncing normally.

Connections

011.md introduced escrow and rights transfer; offline clients can act locally only when they hold the right they need.
016.md explained atomic update groups; offline actions often group local effects before syncing them.
017.md showed collaborative sequence editing; offline-first clients are where those sequence operations become product behavior.
019.md turns this design into tests by checking merge laws and counterexample histories.

Resources

[ARTICLE] Local-first software
- Focus: Study the user-facing goals of local responsiveness, ownership, and collaboration.
[PAPER] Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System
- Focus: Compare conflict detection, tentative updates, and application-level resolution with modern offline-first designs.
[PAPER] Dynamo: Amazon's Highly Available Key-value Store
- Focus: Review always-writable replicas, versioning, and conflict surfacing in a production distributed store.
[DOC] Automerge: The Local-First Database
- Focus: Connect CRDT synchronization, documents, and local-first application structure.
[DOC] Yjs Document Updates
- Focus: Look at compact binary updates, idempotent application, and update exchange.

Key Takeaways

Offline-first clients are replicas, not just caches; their local operations must be durable, deduplicated, authorized, and synchronized.
Edge replicas improve latency and local availability when they accept only the operations they can safely validate or merge.
Coordination avoidance works best when the system separates mergeable local facts, repairable derived views, and authoritative decisions.
A good offline-first design makes conflicts and pending authority visible instead of hiding them behind fragile optimism.
