CRDTs and Coordination Avoidance: Offline-First Clients and Edge Replication
Core Insight
Imagine a field technician opening a maintenance app in a basement with no signal. She edits a checklist, attaches a note, marks one part as replaced, and closes the laptop. Ten minutes later, the device reconnects through a weak mobile link. Meanwhile, an edge replica near the warehouse has accepted another update for the same job.
A server-first design treats the offline edit as a problem: the client could not reach the authority, so the app must block, queue a form submission, or ask the user to retry later. An offline-first design treats the client as a replica. The local device records durable operations, updates the UI immediately, and synchronizes when a path to other replicas exists.
CRDTs help because many edits can merge without asking a central coordinator for permission. But "offline-first" does not mean "every decision is safe offline." The client can create comments, fill checklists, append observations, and edit mergeable fields locally. It still needs authority for unique claims, rights it does not hold, safety-critical state transitions, and policy changes that must not be accepted from an untrusted device.
The design move is to split the system into local-first data that can converge, derived views that can lag and repair, and authoritative decisions that must route through a boundary. Edge replication then becomes useful without becoming magical: it reduces latency and improves availability, while making sync protocol, conflict visibility, identity, and compaction part of the product architecture.
From Remote Form To Local Replica
Start with a simple checklist:
job J42:
  tasks:
    inspect valve
    replace filter
    upload photo
In a remote-form design, the browser sends mutations to a server:
client -> server:
mark "replace filter" done
server -> client:
success
If the network is down, the user cannot complete the action. The app may store a pending HTTP request, but the local state is often just an optimistic guess.
An offline-first design stores the action as local replicated state:
local operation:
  op_id: phone:91
  actor: phone
  dependency_context: phone has seen server:144, edge:33
  effect:
    checklist[J42].done.add("replace filter")
The local database applies the effect immediately:
phone visible state:
replace filter = done
sync status = not yet shared
Later, the phone sends the operation to an edge node:
phone -> edge:
phone:91
edge merges:
done set includes "replace filter"
edge -> phone:
ack + operations phone was missing
The important shift is not just caching. The device has its own durable operation log or CRDT state. A cache can be thrown away and rebuilt from the server. A local replica is a source of new facts that must be synchronized, deduplicated, authorized, and sometimes repaired.
The Sync Loop
A practical offline-first client usually needs four pieces:
local store:
  current CRDT state or materialized view
outbox:
  local operations not yet acknowledged by enough peers
sync cursor:
  what this replica believes it has received
inbox/apply path:
  remote operations, snapshots, or deltas waiting to merge
The loop is small enough to sketch:
on local edit:
  assign stable op_id
  write operation to outbox
  apply operation to local state
  update UI
on network available:
  send operations missing on peer
  receive operations missing locally
  verify authorization and dependencies
  merge
  mark acknowledged operations
The client must be able to crash between any two steps. If it applies an edit to the UI but loses the operation before writing the outbox, the user sees work that cannot sync. If it sends an operation twice, the receiver must deduplicate by op_id.
receiver rule:
  if op_id already seen:
    ignore duplicate
  else:
    validate, merge, remember op_id
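The write path and the receiver rule above can be sketched together in Python. This is a minimal in-memory sketch, not a durable implementation: `Op`, `Replica`, `local_edit`, `receive`, and `ack` are hypothetical names, and a real client would need the outbox write and the state change to be persisted together (for example in one local database transaction, which is an assumption here).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Op:
    op_id: str  # stable id such as "phone:91"; reused verbatim on every retry
    task: str   # checklist task this operation marks as done


@dataclass
class Replica:
    done: set = field(default_factory=set)      # materialized checklist state
    seen: set = field(default_factory=set)      # op_ids already merged
    outbox: list = field(default_factory=list)  # local ops not yet acknowledged

    def local_edit(self, op: Op) -> None:
        # Outbox first, visible state second. In memory the order is only
        # illustrative; durability is assumed to come from the local store.
        self.outbox.append(op)
        self.receive(op)

    def receive(self, op: Op) -> bool:
        # Receiver rule: ignore duplicates, otherwise merge and remember.
        if op.op_id in self.seen:
            return False
        # Authorization and dependency checks would go here (omitted).
        self.done.add(op.task)
        self.seen.add(op.op_id)
        return True

    def ack(self, op_id: str) -> None:
        # Drop an operation from the outbox once a peer has acknowledged it.
        self.outbox = [op for op in self.outbox if op.op_id != op_id]


phone, edge = Replica(), Replica()
op = Op("phone:91", "replace filter")
phone.local_edit(op)

first = edge.receive(op)  # merged on first delivery
retry = edge.receive(op)  # same op_id resent after a lost ack: ignored
phone.ack(op.op_id)
```

Because the edge remembers `op_id`, the phone can resend freely after a timeout without risking a double apply.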
This is where causal context from earlier lessons becomes operational. A client can say "I have seen everything up to these dots" instead of resending the whole document:
client context:
phone:91
edge-madrid:450
server:144
edge sends:
operations newer than that context
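One way to compute "operations newer than that context" is to treat the context as a per-actor high-water mark. The sketch below assumes each actor's operations are numbered consecutively and that a replica has seen every operation up to the counter it reports (no gaps); `parse_dot` and `ops_newer_than` are hypothetical helpers, and the `tablet:7` entry is made up for illustration.

```python
def parse_dot(op_id: str) -> tuple[str, int]:
    # "edge-madrid:450" -> ("edge-madrid", 450)
    actor, _, counter = op_id.rpartition(":")
    return actor, int(counter)


def ops_newer_than(log: list[str], context: dict[str, int]) -> list[str]:
    """Return the op_ids in `log` that the given context does not cover."""
    missing = []
    for op_id in log:
        actor, counter = parse_dot(op_id)
        # An actor absent from the context counts as "nothing seen yet".
        if counter > context.get(actor, 0):
            missing.append(op_id)
    return missing


edge_log = [
    "server:143", "server:144",
    "edge-madrid:450", "edge-madrid:451",
    "tablet:7", "phone:91",
]
client_context = {"phone": 91, "edge-madrid": 450, "server": 144}

to_send = ops_newer_than(edge_log, client_context)
```

Here the edge would send only `edge-madrid:451` and `tablet:7`; everything else is already covered by the phone's context.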
For large documents or long-lived accounts, the system also needs snapshots:
snapshot:
state at causal frontier F
compacted operations before F
resume by fetching operations after F
Snapshots are not only a performance optimization. They are how a device that has been offline for months can rejoin without replaying every historical edit from the beginning of the product.
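The snapshot-and-resume idea can be illustrated with a grow-only "done" set, where compaction is easy because applied operations never need to be undone. This is a sketch under that assumption; `covered`, `replay`, `make_snapshot`, and `resume` are hypothetical names, and operations are modeled as `(actor, counter, task)` tuples.

```python
def covered(op, frontier):
    # An operation is covered when the frontier has reached its counter.
    actor, counter, _task = op
    return counter <= frontier.get(actor, 0)


def replay(ops, state=frozenset()):
    # Fold operations into a done-set, starting from an optional base state.
    return set(state) | {task for _actor, _counter, task in ops}


def make_snapshot(log, frontier):
    # Compact every operation at or before the causal frontier into state.
    compacted = [op for op in log if covered(op, frontier)]
    return {"frontier": dict(frontier), "state": replay(compacted)}


def resume(snapshot, log):
    # A returning device loads the snapshot, then applies only the tail.
    tail = [op for op in log if not covered(op, snapshot["frontier"])]
    return replay(tail, snapshot["state"])


log = [
    ("server", 1, "inspect valve"),
    ("phone", 1, "replace filter"),
    ("tablet", 1, "upload photo"),
]
snapshot = make_snapshot(log, {"server": 1, "phone": 1})
state = resume(snapshot, log)
```

Resuming from the snapshot yields the same state as replaying the full log, which is the property a compaction rule has to preserve.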
What Edge Replicas Change
An edge replica is a nearby server-side replica placed close to users, devices, warehouses, regions, or networks. It may be a point of presence, a regional database, or a service running inside a factory network.
Without an edge:
phone in Madrid -> primary in Virginia
Every sync waits for a long path. The phone can still edit offline, but sharing with nearby users or devices may be slow.
With an edge:
phone in Madrid -> Madrid edge -> global sync
tablet in warehouse -> Madrid edge -> global sync
The edge can acknowledge mergeable operations quickly, fan them out to nearby replicas, and continue serving during a wide-area outage. This improves latency and local availability.
It also adds new questions:
1. Which data can the edge accept as authoritative?
2. Which data can the edge cache but not decide?
3. Which operations must be routed to a home region or authority?
4. What happens if two edges accept conflicting work?
5. How does a returning client prove what it already has?
For CRDT-friendly state, the edge can often accept operations locally:
safe local accept:
append comment
add checklist item
mark item observed
add tag to OR-set
edit rich text body with sequence CRDT
For non-mergeable decisions, the edge should route or reject:
needs authority:
claim globally unique job number
spend inventory rights not allocated to this edge
grant admin permission
close safety incident with legal effect
The trade-off is clear. Edges move work closer to users, but they also force the architecture to say which facts are local, which are mergeable, and which remain authoritative somewhere else.
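The accept-or-route split above could be encoded as an explicit table at the edge. The sketch below is illustrative only: the operation kind strings and the `classify` function are hypothetical, and routing unknown kinds to the authority is an assumed conservative default rather than something the lesson prescribes.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT_LOCALLY = "accept locally"
    ROUTE_TO_AUTHORITY = "route to authority"


# Kinds the edge may merge on its own (names invented for this sketch).
MERGEABLE_KINDS = {
    "append_comment",
    "add_checklist_item",
    "mark_item_observed",
    "add_tag",
    "edit_rich_text",
}


def classify(kind: str) -> Decision:
    if kind in MERGEABLE_KINDS:
        return Decision.ACCEPT_LOCALLY
    # Anything not explicitly listed as mergeable goes to the authority,
    # including kinds this edge has never heard of.
    return Decision.ROUTE_TO_AUTHORITY


comment = classify("append_comment")
grant = classify("grant_admin_permission")
unknown = classify("some_future_kind")
```

Keeping the mergeable set as an allowlist means a new operation kind has to be reviewed before any edge accepts it locally.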
Worked Example: Warehouse Job Notes
Design a job note system for technicians. The app supports comments, checklist completion, photo attachments, and one final "job closed" transition.
Use CRDTs for the parts that naturally merge:
job_notes:
  OR-map note_id -> note body CRDT
checklist_done:
  OR-set task_id
attachments:
  OR-set attachment_id
activity_feed:
  derived view from operations
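For readers who want to see what an OR-set such as `checklist_done` does under concurrency, here is a simplified state-based sketch. It assumes every add carries a unique tag (the op_id) and that tombstones are never garbage-collected; the `ORSet` class and the un-marking scenario are hypothetical, not part of the job note design itself.

```python
class ORSet:
    """Observed-remove set: a remove only cancels the adds it has seen."""

    def __init__(self):
        self.adds = set()     # (element, tag) pairs
        self.removed = set()  # tombstoned (element, tag) pairs

    def add(self, element, tag):
        self.adds.add((element, tag))

    def remove(self, element):
        # Tombstone only the tags observed locally at this moment.
        self.removed |= {(e, t) for (e, t) in self.adds if e == element}

    def merge(self, other):
        # Union of both components is commutative, associative, idempotent.
        self.adds |= other.adds
        self.removed |= other.removed

    def value(self):
        return {e for (e, _t) in self.adds - self.removed}


phone, tablet = ORSet(), ORSet()
phone.add("replace filter", "phone:19")
tablet.merge(phone)

# Concurrently: the tablet un-marks the task, the phone marks it again.
tablet.remove("replace filter")
phone.add("replace filter", "phone:21")

phone.merge(tablet)
tablet.merge(phone)
```

Both replicas converge to a set that still contains "replace filter", because the tablet's remove never observed the add tagged `phone:21`.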
A phone can create a note offline:
operation phone:18
add note N77
body = "Filter was cracked"
depends_on: job J42 exists
The phone can also mark a checklist item done:
operation phone:19
checklist_done.add("replace filter")
Both operations sync to the Madrid edge. Another technician's tablet receives them from the same edge before the global region catches up.
phone -> madrid-edge -> tablet
                     -> global-region later
The edge should not blindly accept the final close if the domain says closing a job consumes a unique audit number and freezes the checklist:
operation phone:20
close job J42
audit_number = next()
That operation crosses a coordination boundary. The client might show:
close requested
waiting for authority
If the authority accepts, it emits the authoritative close operation:
operation authority:7001
job_status.write(closed)
audit_number.assign(A-2026-0042)
freeze_epoch = authority:7001
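On the client, the key property is that a local close request never counts as a close by itself. A minimal sketch of that rule, assuming operations are plain dictionaries and using the hypothetical helper `close_status`:

```python
def close_status(job_id, pending_requests, authoritative_ops):
    # Only an operation emitted by the authority can close the job.
    for op in authoritative_ops:
        if op["job"] == job_id and op["status"] == "closed":
            return "closed"
    # A local request is shown as pending, never as the final state.
    if job_id in pending_requests:
        return "close requested, waiting for authority"
    return "open"


pending = {"J42"}  # the phone has asked to close J42
before = close_status("J42", pending, [])

authority_ops = [
    {"op_id": "authority:7001", "job": "J42", "status": "closed",
     "audit_number": "A-2026-0042"},
]
after = close_status("J42", pending, authority_ops)
```

Comments and checklist edits keep flowing through the normal merge path while the status stays pending.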
The lesson is not "CRDTs for some fields, transactions for everything else." It is more precise:
merge locally:
  facts where concurrent additions or edits can be joined
route to authority:
  facts that allocate scarce rights, decide uniqueness, or change safety boundaries
derive and repair:
  feeds, counters, search indexes, notifications
That split lets most user work feel instant while the few non-mergeable decisions remain explicit.
Conflict Visibility and Repair
Offline-first systems should not pretend conflicts never happen. They should make the common conflicts merge automatically and the meaningful conflicts visible in a form the user or domain can resolve.
Consider a job title stored as a last-writer-wins register:
phone:
title = "Replace filter"
tablet:
title = "Replace pump"
If the system picks the later timestamp, one edit disappears. That may be acceptable for a draft label, but it is risky if the title carries operational meaning.
A multi-value register makes the conflict explicit:
title conflict:
- Replace filter
- Replace pump
The edge can sync both values. A user or workflow can resolve them later:
resolution operation:
title = "Replace pump filter"
supersedes phone:44, tablet:12
This is still coordination avoidance. The system did not block either technician while they were disconnected. It preserved enough information to converge to an honest conflict state and then resolve it with a later operation.
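A multi-value register with explicit supersession can be sketched as follows. This is one simplified construction, not the only one: each write records every op_id the writer already knew, versions are kept forever (no garbage collection), and the `MVRegister` class and the `phone:45` op_id are hypothetical.

```python
class MVRegister:
    def __init__(self):
        # op_id -> (value, op_ids the writer had seen when it wrote)
        self.versions = {}

    def write(self, op_id, value):
        # A new write supersedes everything this replica currently knows.
        self.versions[op_id] = (value, frozenset(self.versions))

    def merge(self, other):
        # Versions are immutable, so a plain union is enough.
        self.versions.update(other.versions)

    def values(self):
        superseded = set()
        for _value, known in self.versions.values():
            superseded |= known
        # Whatever nobody has superseded is a live, possibly conflicting value.
        return sorted(
            value
            for op_id, (value, _known) in self.versions.items()
            if op_id not in superseded
        )


phone, tablet = MVRegister(), MVRegister()
phone.write("phone:44", "Replace filter")  # written while disconnected
tablet.write("tablet:12", "Replace pump")  # concurrent write

phone.merge(tablet)
tablet.merge(phone)
conflict = phone.values()

phone.write("phone:45", "Replace pump filter")  # resolution operation
tablet.merge(phone)
resolved = tablet.values()
```

After the first merge both titles are visible as a conflict; after the resolution operation syncs, only the resolved title remains on both replicas.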
Repair paths matter too. An edge may accept an operation that a central policy later rejects:
edge accepted:
add attachment A9
authority later rejects:
attachment violates retention policy
The repair should be a new fact, not silent history editing:
authority:810:
revoke attachment A9
reason = retention_policy
That gives clients a causal explanation and lets derived views clean themselves up.
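The "new fact, not history editing" rule can be shown with an append-only log and a derived view. In this sketch the dictionary fields, the `visible_attachments` function, the `edge:501` and `edge:502` op_ids, and attachment `A10` are illustrative assumptions.

```python
def visible_attachments(log):
    """Derive the attachment view; the log itself is never rewritten."""
    added, revoked = set(), {}
    for op in log:
        if op["kind"] == "add_attachment":
            added.add(op["attachment"])
        elif op["kind"] == "revoke_attachment":
            # Keep the reason so clients can explain why the item vanished.
            revoked[op["attachment"]] = op["reason"]
    return added - set(revoked), revoked


log = [
    {"op_id": "edge:501", "kind": "add_attachment", "attachment": "A9"},
    {"op_id": "edge:502", "kind": "add_attachment", "attachment": "A10"},
    {"op_id": "authority:810", "kind": "revoke_attachment",
     "attachment": "A9", "reason": "retention_policy"},
]

visible, reasons = visible_attachments(log)
```

The original add stays in the log, so any replica that replays it reaches the same view and can show the user why A9 was removed.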
Failure Modes
- Confusing cache with replica: A cache can be discarded; an offline client may contain new user facts that must not be lost.
- Losing the outbox: If a local edit reaches the UI but not durable sync state, the system has created unrecoverable optimism.
- Using wall-clock timestamps as proof: Clocks help with display, not with causality or authorization.
- Accepting every edge write as globally valid: Edge acceptance is safe only for operations the edge is allowed to validate or merge.
- Hiding meaningful conflicts: Automatic merge is useful when it preserves intent; otherwise it can erase important user work.
- Letting old clients rejoin without a plan: Offline-first systems need snapshots, causal cursors, or forced resync boundaries.
- Treating derived views as source state: Feeds, counters, and indexes should be recomputable or repairable from authoritative operations.
- Skipping authorization on sync: A valid CRDT operation can still be illegal if the actor lacks rights for that object or epoch.
Practice
Design the sync model for an offline-first note app used by field teams.
Use this object:
case C19:
notes
tags
assigned_owner
photos
closed_at
Classify each field:
1. Can it be edited locally with a CRDT?
2. Does it need rights allocated in advance?
3. Does it need a central authority?
4. Is it a derived view that can repair later?
Then sketch the client state:
local store:
outbox:
sync cursor:
conflict surface:
snapshot or compaction rule:
Finally, explain what the UI should show when a local "close case" action is waiting for authority while comments and photos continue syncing normally.
Connections
- 011.md introduced escrow and rights transfer; offline clients can act locally only when they hold the right they need.
- 016.md explained atomic update groups; offline actions often group local effects before syncing them.
- 017.md showed collaborative sequence editing; offline-first clients are where those sequence operations become product behavior.
- 019.md turns this design into tests by checking merge laws and counterexample histories.
Resources
- [ARTICLE] Local-first software
- Focus: Study the user-facing goals of local responsiveness, ownership, and collaboration.
- [PAPER] Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System
- Focus: Compare conflict detection, tentative updates, and application-level resolution with modern offline-first designs.
- [PAPER] Dynamo: Amazon's Highly Available Key-value Store
- Focus: Review always-writable replicas, versioning, and conflict surfacing in a production distributed store.
- [DOC] Automerge: The Local-First Database
- Focus: Connect CRDT synchronization, documents, and local-first application structure.
- [DOC] Yjs Document Updates
- Focus: Look at compact binary updates, idempotent application, and update exchange.
Key Takeaways
- Offline-first clients are replicas, not just caches; their local operations must be durable, deduplicated, authorized, and synchronized.
- Edge replicas improve latency and local availability when they accept only the operations they can safely validate or merge.
- Coordination avoidance works best when the system separates mergeable local facts, repairable derived views, and authoritative decisions.
- A good offline-first design makes conflicts and pending authority visible instead of hiding them behind fragile optimism.