CRDTs and Coordination Avoidance: Observability, Debugging, and Repair Workflows
LESSON
Core Insight
Imagine a support ticket from a customer: "My tablet says the job is closed, but my laptop says it is still open." Both devices have synced recently. No request failed. No database is obviously down. The system is doing exactly what coordination-avoidance systems often do: it is accepting local work, merging later, and exposing a temporary disagreement that needs explanation.
In a leader-based system, debugging often starts with "what did the primary commit?" In a CRDT-based or offline-first system, the better first question is "which facts has each replica seen, and which facts are still missing, hidden, rejected, or waiting for dependencies?" The answer lives in operation IDs, causal frontiers, merge decisions, validation results, and repair records.
Observability for CRDTs is therefore not only CPU graphs and request latency. You need to observe convergence itself. Which replicas disagree? Is the disagreement expected lag, a missing operation, a rejected operation, a compaction mistake, a policy failure, or a bug in merge semantics?
The trade-off is that highly available writes make incidents less binary. Fewer users are blocked, but operators need tools that explain partial knowledge. Repair should usually be another explicit operation, not a manual database edit, so that every replica can learn the correction through the same merge path.
What To Observe In A Mergeable System
A basic service dashboard might show:
- request rate
- error rate
- latency
- database saturation
Those signals matter, but they do not tell you why two replicas render different values. Add CRDT-specific signals:
    replica frontier:
      latest dot or version known from each actor
    operation backlog:
      local operations not yet acknowledged by peers
    dependency wait:
      operations buffered because prerequisites are missing
    merge conflicts:
      visible multi-value fields or domain conflicts
    rejected operations:
      sync updates rejected by authorization, epoch, or schema policy
    repair operations:
      corrections emitted after invalid state or missing data is discovered
A causal frontier is a compact summary of what a replica has seen:
    tablet frontier:
      tablet:91
      madrid-edge:450
      region:144
    laptop frontier:
      laptop:23
      madrid-edge:452
      region:143
This tells you that the laptop has seen more from the edge but less from the region. If the "job closed" operation came from region:144, the laptop may simply be one operation behind.
The useful mental model is a comparison table, not one global timestamp:
    fact:
      job J42 closed by region:144
    tablet:
      has region:144
      renders closed
    laptop:
      has only region:143
      renders open
That is expected lag. The repair is sync, not conflict resolution.
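The frontier comparison above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: it assumes a frontier is a version vector that maps each actor to the highest contiguous counter seen, with counters starting at 1. The `missing_ranges` helper is a hypothetical name introduced for this example.

```python
from typing import Dict, List, Tuple

# Assumption: a frontier maps each actor to the highest contiguous
# counter seen from that actor (a version vector), counting from 1.
Frontier = Dict[str, int]


def missing_ranges(local: Frontier, remote: Frontier) -> List[Tuple[str, int, int]]:
    """List (actor, first, last) counter ranges that remote has seen and local has not."""
    gaps = []
    for actor, remote_seen in sorted(remote.items()):
        local_seen = local.get(actor, 0)  # an absent actor means nothing seen yet
        if remote_seen > local_seen:
            gaps.append((actor, local_seen + 1, remote_seen))
    return gaps


tablet = {"tablet": 91, "madrid-edge": 450, "region": 144}
laptop = {"laptop": 23, "madrid-edge": 452, "region": 143}

laptop_needs = missing_ranges(laptop, tablet)
tablet_needs = missing_ranges(tablet, laptop)

for actor, first, last in laptop_needs:
    print(f"laptop is missing {actor}:{first}..{last}")
```

Under these assumptions the laptop's gap list includes `region:144`, the operation that closed the job, which is why the disagreement reads as lag rather than conflict. Neither replica is ahead overall: the tablet is also missing `madrid-edge:451..452`.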
The Minimum Debug Record
Every operation that can affect replicated state should leave enough information to answer four questions:
1. Who created it?
2. What dependencies did it claim to have seen?
3. What object and field did it affect?
4. Was it accepted, buffered, rejected, compacted, or repaired?
A useful debug record might look like this:
    operation_id: phone:118
    actor: phone-7
    object: job:J42
    effect: checklist_done.add("replace filter")
    dependencies:
      job:J42 exists at region:140
    created_at_wall_time: 2026-06-08T10:14:20Z
    received_by:
      madrid-edge at frontier {phone:118, madrid-edge:450}
      region at frontier {phone:118, region:144}
    validation:
      accepted
Wall-clock time is useful for humans, but it is not the proof of ordering. The dependency context and operation ID are the proof you use for merge reasoning.
For rejected or repaired operations, keep the reason:
    operation_id: phone:121
    effect: job_status.write(closed)
    validation:
      rejected
    reason:
      missing close-job right for epoch E9
    repair:
      authority:7001 emitted close_request_denied(phone:121)
This gives support, operators, and engineers one shared language. The user sees "close request denied." The operator sees the missing right. The engineer can replay the state transition.
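One way to keep such records honest is to give them a fixed shape and check it at write time. The sketch below is an assumption-laden illustration: the `DebugRecord` class and `record_problems` helper are hypothetical names, and the actor, object, and dependency values attached to `phone:121` are filled in for the example only, since the lesson's rejected record does not list them.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

STATUSES = {"accepted", "buffered", "rejected", "compacted", "repaired"}


@dataclass(frozen=True)
class DebugRecord:
    operation_id: str               # who created it, and which of their operations
    actor: str
    object_id: str                  # what object the operation affected
    effect: str                     # what field change it attempted
    dependencies: Tuple[str, ...]   # what it claimed to have seen
    status: str                     # one of STATUSES
    reason: Optional[str] = None    # expected when rejected or repaired


def record_problems(record: DebugRecord) -> List[str]:
    """Return the reasons this record cannot answer the four debug questions."""
    issues = []
    if record.status not in STATUSES:
        issues.append(f"unknown status: {record.status}")
    if record.status in {"rejected", "repaired"} and not record.reason:
        issues.append("rejected or repaired operation has no reason")
    if not record.dependencies:
        issues.append("no dependency context recorded")
    return issues


# Illustrative values: actor, object, and dependencies are assumed here.
silent_rejection = DebugRecord(
    operation_id="phone:121",
    actor="phone-7",
    object_id="job:J42",
    effect="job_status.write(closed)",
    dependencies=("job:J42 exists at region:140",),
    status="rejected",
)
explained_rejection = DebugRecord(
    operation_id="phone:121",
    actor="phone-7",
    object_id="job:J42",
    effect="job_status.write(closed)",
    dependencies=("job:J42 exists at region:140",),
    status="rejected",
    reason="missing close-job right for epoch E9",
)

print(record_problems(silent_rejection))
print(record_problems(explained_rejection))
```

A check like this could run where records are written, so a rejection without a reason is caught before anyone needs it during an incident.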
The trade-off is storage and privacy. Debug records can contain sensitive actor, object, and content references. Keep enough metadata to explain behavior, but redact payloads where possible and apply retention rules deliberately.
Worked Example: The Missing Checklist Item
Suppose the tablet shows:
    job J42:
      inspect valve = done
      replace filter = done

The laptop shows:

    job J42:
      inspect valve = done
      replace filter = not done
Start with the user's fact:
    expected fact:
      "replace filter" was marked done

Find the operation:

    operation:
      phone:118
    effect:
      checklist_done.add("replace filter")

Compare frontiers:

    tablet:
      phone:118 present
    laptop:
      phone:117 only
This is a missing operation, not a merge conflict. The immediate repair is anti-entropy:
    laptop asks tablet or edge:
      send operations after phone:117
    laptop receives:
      phone:118
    laptop renders:
      replace filter = done
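The anti-entropy exchange above can be sketched as a delta request followed by an in-order apply. This is a minimal sketch under stated assumptions: operations carry a per-actor counter, the checklist is a grow-only set, and `phone:117` is assumed (for illustration only) to be the operation that marked "inspect valve" done. `Op`, `ops_after`, and `apply_ops` are hypothetical names.

```python
from typing import Dict, List, NamedTuple, Set


class Op(NamedTuple):
    actor: str
    counter: int
    task: str  # checklist item marked done


def ops_after(log: List[Op], frontier: Dict[str, int]) -> List[Op]:
    """Select the operations in log that the requesting replica has not seen."""
    unseen = [op for op in log if op.counter > frontier.get(op.actor, 0)]
    return sorted(unseen, key=lambda op: (op.actor, op.counter))


def apply_ops(frontier: Dict[str, int], done: Set[str], ops: List[Op]) -> None:
    """Apply each operation that is next in sequence for its actor.

    Duplicates and out-of-order operations are skipped, not applied;
    a gap is left for a later sync round.
    """
    for op in ops:
        if op.counter == frontier.get(op.actor, 0) + 1:
            done.add(op.task)
            frontier[op.actor] = op.counter


# Assumed edge log; phone:117 as "inspect valve" is illustrative.
edge_log = [Op("phone", 117, "inspect valve"), Op("phone", 118, "replace filter")]
laptop_frontier = {"phone": 117}
laptop_done = {"inspect valve"}

delta = ops_after(edge_log, laptop_frontier)
apply_ops(laptop_frontier, laptop_done, delta)
print(sorted(laptop_done), laptop_frontier)
```

Applying the same delta a second time changes nothing, which is the property that makes retrying sync safe.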
Now change the case. The laptop has phone:118, but still renders the item as not done.
    laptop:
      phone:118 present
      visible state disagrees
Now you inspect merge and validation:
    operation phone:118:
      accepted by edge
      rejected by laptop
      reason:
        schema version did not recognize task_id format
This is not sync lag. It is compatibility failure. The repair may be a schema migration, a replay after upgrade, or an explicit repair operation:
    repair operation:
      region:9002
      translate old task id "replace filter"
      to canonical task id task:filter-replacement
      depends_on: phone:118
A third case: both replicas have the operation, both accepted it, but they render differently after compaction.
    tablet:
      compacted tombstone for task old-id
    laptop:
      late operation references old-id
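That points to an unsafe compaction boundary. Resync alone will not fix it, because the tablet has already discarded the tombstone that the laptop's late operation refers to. You may need to restore a snapshot, block unsafe compaction, or force old clients to rejoin through a new epoch.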
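The three cases in this worked example follow a fixed order of questions, which can be written down as a small triage function. This is a sketch of the reasoning only: the `Diagnosis` names and the `diagnose` signature are hypothetical, and a real tool would read these inputs from debug records rather than take them as arguments.

```python
from enum import Enum
from typing import Optional


class Diagnosis(Enum):
    SYNC_LAG = "operation missing: run anti-entropy"
    REJECTED = "operation rejected: check schema, epoch, or authorization"
    SUSPECT_COMPACTION = "operation accepted but view differs: inspect compaction and merge"
    HEALTHY = "replica has the operation and renders the expected fact"


def diagnose(has_operation: bool, validation: Optional[str], renders_expected: bool) -> Diagnosis:
    """Triage one replica's view of one operation, cheapest explanation first."""
    if not has_operation:
        return Diagnosis.SYNC_LAG
    if validation == "rejected":
        return Diagnosis.REJECTED
    if not renders_expected:
        return Diagnosis.SUSPECT_COMPACTION
    return Diagnosis.HEALTHY


# The three laptop cases from the worked example, in order.
print(diagnose(has_operation=False, validation=None, renders_expected=False).value)
print(diagnose(has_operation=True, validation="rejected", renders_expected=False).value)
print(diagnose(has_operation=True, validation="accepted", renders_expected=False).value)
```

The ordering matters: checking for a missing operation first avoids blaming merge logic for what is only lag.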
Repair As Data
Manual edits are tempting during incidents:
UPDATE jobs SET status = 'open' WHERE id = 'J42';
In a replicated mergeable system, that can create a new split. The primary database may now say one thing, while offline clients and edge replicas keep merging old operations.
Prefer repairs that enter the same replicated log or CRDT state:
    repair operation:
      op_id: region:9010
      kind: override job_status
      object: job:J42
      new_status: open
      reason: close operation lacked required right
      supersedes: phone:121
      authorized_by: incident-commander-4
Every replica can receive, validate, and merge the repair:
    if repair authority is valid for object and epoch:
      mark phone:121 as revoked
      render job status from region:9010
      preserve audit trail
    else:
      reject repair
That keeps the system explainable. The status did not mysteriously change; a named repair superseded a named operation.
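The validate-then-merge rule above might look like the following in Python. Treat it as a sketch under assumptions: the `epoch` field and the `grants` set of (authority, object, epoch) triples are hypothetical additions, since the lesson's repair record does not show how authority is encoded. The function is written to be idempotent so that redelivery through sync is harmless.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Repair:
    op_id: str
    object_id: str
    epoch: str          # hypothetical field; the lesson's record does not show one
    new_status: str
    supersedes: str
    authorized_by: str
    reason: str


@dataclass
class JobReplica:
    status: str
    status_source: str                              # operation that produced the status
    revoked: Set[str] = field(default_factory=set)
    applied: Set[str] = field(default_factory=set)
    audit: List[str] = field(default_factory=list)


def apply_repair(replica: JobReplica, repair: Repair, grants: Set[Tuple[str, str, str]]) -> bool:
    """Merge a repair if (authority, object, epoch) is granted. Safe to call twice."""
    if repair.op_id in replica.applied:
        return True  # redelivery must not change state or duplicate audit lines
    if (repair.authorized_by, repair.object_id, repair.epoch) not in grants:
        replica.audit.append(f"rejected {repair.op_id}: authority not valid")
        return False
    replica.revoked.add(repair.supersedes)
    replica.status = repair.new_status
    replica.status_source = repair.op_id
    replica.applied.add(repair.op_id)
    replica.audit.append(f"{repair.op_id} superseded {repair.supersedes}: {repair.reason}")
    return True


grants = {("incident-commander-4", "job:J42", "E9")}
repair = Repair("region:9010", "job:J42", "E9", "open", "phone:121",
                "incident-commander-4", "close operation lacked required right")
laptop = JobReplica(status="closed", status_source="phone:121")
accepted = apply_repair(laptop, repair, grants)
print(accepted, laptop.status, laptop.status_source, laptop.audit)
```

Because every replica runs the same function on the same repair, the correction spreads through the normal sync path, and the audit line records which operation superseded which.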
Some repairs are derived-view rebuilds:
- rebuild search index from accepted operations
- recompute activity feed from source facts
- recalculate completed_count from checklist_done set
Other repairs are semantic:
- resolve multi-value register conflict
- reassign duplicate slug after authority decision
- revoke malicious or unauthorized operation
- emit missing tombstone after failed deletion
Derived repairs can often be automated. Semantic repairs need domain authority and an audit trail.
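A derived repair such as recalculating completed_count can be a pure function over accepted operations. The sketch below assumes operations are available as dictionaries with `kind`, `task`, and `validation` keys; that shape, the `rebuild_checklist` name, and the `phone:119` entry are illustrative, not part of the lesson's data model.

```python
from typing import Dict, Iterable, Set


def rebuild_checklist(operations: Iterable[Dict[str, str]]) -> Set[str]:
    """Recompute the checklist_done set from accepted add operations only."""
    done: Set[str] = set()
    for op in operations:
        if op["validation"] == "accepted" and op["kind"] == "checklist_done.add":
            done.add(op["task"])  # set semantics make redelivered operations harmless
    return done


# Assumed operation log; phone:119 is a hypothetical rejected operation.
operations = [
    {"id": "phone:117", "kind": "checklist_done.add", "task": "inspect valve", "validation": "accepted"},
    {"id": "phone:118", "kind": "checklist_done.add", "task": "replace filter", "validation": "accepted"},
    {"id": "phone:118", "kind": "checklist_done.add", "task": "replace filter", "validation": "accepted"},
    {"id": "phone:119", "kind": "checklist_done.add", "task": "test pressure", "validation": "rejected"},
]

completed_count = len(rebuild_checklist(operations))
print(completed_count)
```

Since the counter is recomputed from source facts rather than patched, running the rebuild again after a later sync gives the right answer without extra bookkeeping.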
Debugging Playbook
When a CRDT-backed workflow looks wrong, avoid starting with "which replica is correct?" Start with "what facts does each replica know?"
1. Name the user-visible symptom.
2. Identify the object, field, and expected fact.
3. Find the operation IDs that should explain that fact.
4. Compare replica frontiers.
5. Check whether missing operations are lag, drop, rejection, or compaction.
6. Check whether the merge result violates a law or product invariant.
7. Repair with sync, replay, rebuild, or an explicit repair operation.
8. Save the incident as a regression history.
The last step connects directly to testing. If production found a failure, turn it into a history like the ones from the previous lesson:
    history:
      phone emits phone:118
      edge accepts phone:118
      laptop rejects phone:118 due to schema version
      laptop renders missing checklist item
    expected:
      old schema buffers or upgrades operation
    actual:
      old schema silently drops operation
Now the team can write a property or scenario test:
old clients must not silently drop unknown but valid operations
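That property can be turned into a scenario test against a small model of the client. This is a sketch, not the real client: the `Client` class, the `task_id_format` field, and the format names `v1` and `v2` are hypothetical stand-ins for whatever schema versioning the system actually uses.

```python
from typing import Dict, List, Set


class Client:
    """Toy replica that buffers operations it cannot parse yet."""

    def __init__(self, known_formats: Set[str]) -> None:
        self.known_formats = set(known_formats)
        self.applied: List[str] = []
        self.buffered: List[Dict[str, str]] = []

    def receive(self, op: Dict[str, str]) -> None:
        if op["task_id_format"] in self.known_formats:
            self.applied.append(op["id"])
        else:
            self.buffered.append(op)  # keep it; never drop silently

    def upgrade(self, new_format: str) -> None:
        """Learn a new format, then replay everything that was waiting."""
        self.known_formats.add(new_format)
        pending, self.buffered = self.buffered, []
        for op in pending:
            self.receive(op)

    def accounted_for(self) -> Set[str]:
        return set(self.applied) | {op["id"] for op in self.buffered}


def test_old_client_does_not_drop_unknown_operation() -> None:
    laptop = Client(known_formats={"v1"})
    laptop.receive({"id": "phone:118", "task_id_format": "v2"})
    # Before upgrade: not applied, but still accounted for.
    assert laptop.applied == []
    assert "phone:118" in laptop.accounted_for()
    # After upgrade: replayed from the buffer.
    laptop.upgrade("v2")
    assert laptop.applied == ["phone:118"]
    assert laptop.buffered == []


test_old_client_does_not_drop_unknown_operation()
```

The invariant worth keeping is the one `accounted_for` expresses: every received operation is either applied or visibly waiting.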
Good observability shortens the path from "a user saw something impossible" to "here is the exact history we must never allow again."
Failure Modes
- Only watching request errors: A merge bug can happen while every HTTP request succeeds.
- Using wall-clock order as causality: Clocks help with incident timelines, but frontiers and dependencies explain replicated state.
- Dropping rejected operations silently: A rejected update needs a reason that clients and operators can see.
- Repairing only one database copy: Manual edits outside the replicated path can be overwritten or contradicted later.
- Treating derived views as source truth: Search, feeds, counters, and read models should be rebuildable from accepted operations.
- Compacting without observability: If you cannot explain which clients are safe past a boundary, tombstone cleanup can break rejoin.
- Hiding conflict state from support tools: Operators need to see multi-value conflicts and pending authority decisions.
- Forgetting privacy in debug records: Operation metadata is valuable, but payloads and actor details need retention and access control.
Practice
Diagnose this incident:
    Symptom:
      phone shows case C19 as closed
      web app shows case C19 as open
    Known facts:
      phone emitted phone:77 close request
      edge accepted phone:77
      region emitted region:500 close denied
      phone has not received region:500
      web app has region:500
Answer:
1. Is this expected lag, a conflict, or an invalid repair?
2. Which operation should the phone receive next?
3. What should the UI show while the denial is missing?
4. What metric or debug record would have made this easier to see?
Then design a repair workflow for this case:
    Bug:
      old clients silently dropped valid checklist operations with a new task_id format

Specify:

    source of truth:
    replay or repair operation:
    client behavior after upgrade:
    regression history:
    dashboard signal:
Connections
- 015.md introduced compaction boundaries; observability needs to show when tombstones and snapshots are safe.
- 018.md treated clients and edges as replicas; this lesson adds the operational records needed to debug them.
- 019.md showed counterexample histories; incidents should become the same kind of replayable history.
- 021.md moves from accidental bad updates to malicious or unauthorized updates, where trust boundaries become part of repair.
Resources
- [PAPER] Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System
  - Focus: Study tentative updates, conflict handling, and application-level resolution in weakly connected replication.
- [PAPER] Dynamo: Amazon's Highly Available Key-value Store
  - Focus: Review version tracking, divergent values, and operational handling of always-writable replicas.
- [DOC] OpenTelemetry Concepts
  - Focus: Use the vocabulary of traces, metrics, and logs while adding CRDT-specific state such as frontiers and operation IDs.
- [DOC] Automerge: Sync
  - Focus: Compare frontiers, missing changes, and synchronization state with the lesson's debug model.
- [ARTICLE] Jepsen Analyses
  - Focus: Read concrete examples of turning distributed anomalies into explainable histories.
Key Takeaways
- CRDT observability must explain partial knowledge: operation IDs, causal frontiers, dependency waits, rejections, conflicts, and repairs.
- Debugging starts by comparing what each replica has seen, not by assuming one copy is automatically authoritative.
- Repairs should usually be explicit replicated operations with reasons, authority, and audit trail.
- Production incidents should become replayable histories so the same merge, sync, or compatibility failure is caught before it repeats.