CRDTs and Coordination Avoidance: Observability, Debugging, and Repair Workflows

Lesson 020 | 30 min | intermediate

Core Insight

Imagine a support ticket from a customer: "My tablet says the job is closed, but my laptop says it is still open." Both devices have synced recently. No request failed. No database is obviously down. The system is doing exactly what coordination-avoidance systems often do: it is accepting local work, merging later, and exposing a temporary disagreement that needs explanation.

In a leader-based system, debugging often starts with "what did the primary commit?" In a CRDT-based or offline-first system, the better first question is "which facts has each replica seen, and which facts are still missing, hidden, rejected, or waiting for dependencies?" The answer lives in operation IDs, causal frontiers, merge decisions, validation results, and repair records.

Observability for CRDTs is therefore not only CPU graphs and request latency. You need to observe convergence itself. Which replicas disagree? Is the disagreement expected lag, a missing operation, a rejected operation, a compaction mistake, a policy failure, or a bug in merge semantics?

The trade-off is that highly available writes make incidents less binary. Fewer users are blocked, but operators need tools that explain partial knowledge. Repair should usually be another explicit operation, not a manual database edit, so that every replica can learn the correction through the same merge path.

What To Observe In A Mergeable System

A basic service dashboard might show:

request rate
error rate
latency
database saturation

Those signals matter, but they do not tell you why two replicas render different values. Add CRDT-specific signals:

replica frontier:
  latest dot or version known from each actor

operation backlog:
  local operations not yet acknowledged by peers

dependency wait:
  operations buffered because prerequisites are missing

merge conflicts:
  visible multi-value fields or domain conflicts

rejected operations:
  sync updates rejected by authorization, epoch, or schema policy

repair operations:
  corrections emitted after invalid state or missing data is discovered
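Two of these signals are simple enough to compute directly. The following is a minimal sketch, assuming each replica tracks the highest local sequence number each peer has acknowledged and keeps buffered operations alongside the dependencies they declared. The function names and the sample numbers are illustrative, not taken from any particular CRDT library.

```python
from typing import Dict, List, Set


def operation_backlog(local_seq: int, peer_acks: Dict[str, int]) -> Dict[str, int]:
    """Per peer, count local operations that peer has not yet acknowledged."""
    return {peer: max(0, local_seq - acked) for peer, acked in peer_acks.items()}


def dependency_waits(buffered: Dict[str, List[str]], applied: Set[str]) -> Dict[str, List[str]]:
    """Per buffered operation, list the prerequisites that are still missing."""
    waits = {}
    for op_id, deps in buffered.items():
        missing = [dep for dep in deps if dep not in applied]
        if missing:
            waits[op_id] = missing
    return waits


# Hypothetical snapshot of one replica, as a dashboard might collect it.
backlog = operation_backlog(local_seq=91, peer_acks={"madrid-edge": 91, "region": 88})
waits = dependency_waits(
    buffered={"phone:119": ["phone:118", "region:140"]},
    applied={"region:140"},
)
print(backlog)
print(waits)
```

A backlog that keeps growing for one peer, or a dependency wait that never clears, is usually a more useful alert than raw request latency.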

A causal frontier is a compact summary of what a replica has seen:

tablet frontier:
  tablet:91
  madrid-edge:450
  region:144

laptop frontier:
  laptop:23
  madrid-edge:452
  region:143

This tells you that the laptop has seen more from the edge but less from the region. If the "job closed" operation came from region:144, the laptop may simply be one operation behind.

The useful mental model is a comparison table, not one global timestamp:

fact:
  job J42 closed by region:144

tablet:
  has region:144
  renders closed

laptop:
  has only region:143
  renders open

That is expected lag. The repair is sync, not conflict resolution.
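The comparison can be automated. This is a minimal sketch, assuming a frontier is a mapping from actor name to the highest contiguous sequence number seen from that actor. The names `has_seen` and `frontier_gaps` are hypothetical helpers written for this lesson.

```python
from typing import Dict, Tuple

Frontier = Dict[str, int]  # actor -> highest contiguous sequence number seen


def has_seen(frontier: Frontier, actor: str, seq: int) -> bool:
    """True if the frontier covers operation actor:seq."""
    return frontier.get(actor, 0) >= seq


def frontier_gaps(a: Frontier, b: Frontier) -> Dict[str, Tuple[int, int]]:
    """Return actor -> (seq known to a, seq known to b) wherever they differ."""
    gaps = {}
    for actor in sorted(set(a) | set(b)):
        seen_a, seen_b = a.get(actor, 0), b.get(actor, 0)
        if seen_a != seen_b:
            gaps[actor] = (seen_a, seen_b)
    return gaps


tablet = {"tablet": 91, "madrid-edge": 450, "region": 144}
laptop = {"laptop": 23, "madrid-edge": 452, "region": 143}

print(frontier_gaps(tablet, laptop))
print(has_seen(tablet, "region", 144), has_seen(laptop, "region", 144))
```

The gap report is symmetric on purpose: neither replica is treated as the reference copy, which matches the "comparison table" model rather than a single global timestamp.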

The Minimum Debug Record

Every operation that can affect replicated state should leave enough information to answer four questions:

1. Who created it?
2. What dependencies did it claim to have seen?
3. What object and field did it affect?
4. Was it accepted, buffered, rejected, compacted, or repaired?

A useful debug record might look like this:

operation_id: phone:118
actor: phone-7
object: job:J42
effect: checklist_done.add("replace filter")
dependencies:
  job:J42 exists at region:140
created_at_wall_time: 2026-06-08T10:14:20Z
received_by:
  madrid-edge at frontier {phone:118, madrid-edge:450}
  region at frontier {phone:118, region:144}
validation:
  accepted

Wall-clock time is useful for humans, but it is not proof of ordering, because device clocks drift and can be set incorrectly. The dependency context and operation ID are the evidence you use for merge reasoning.

For rejected or repaired operations, keep the reason:

operation_id: phone:121
effect: job_status.write(closed)
validation:
  rejected
reason:
  missing close-job right for epoch E9
repair:
  authority:7001 emitted close_request_denied(phone:121)

This gives support, operators, and engineers one shared language. The user sees "close request denied." The operator sees the missing right. The engineer can replay the state transition.
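One way to keep these records consistent is to give them a typed shape that refuses a rejection or repair without a reason. The sketch below is an assumption about how such a record could be modeled. The class names, field names, and the choice to reuse actor `phone-7` for `phone:121` are illustrative and not part of the lesson's system.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    ACCEPTED = "accepted"
    BUFFERED = "buffered"
    REJECTED = "rejected"
    COMPACTED = "compacted"
    REPAIRED = "repaired"


@dataclass
class DebugRecord:
    operation_id: str                 # who created it, and its position for that actor
    actor: str
    object_id: str                    # which object it affected
    effect: str                       # which field and change
    dependencies: List[str] = field(default_factory=list)  # what it claimed to have seen
    status: Status = Status.ACCEPTED  # what happened to it
    reason: Optional[str] = None
    repair: Optional[str] = None

    def __post_init__(self) -> None:
        # A rejection or repair without a reason cannot be explained later.
        if self.status in (Status.REJECTED, Status.REPAIRED) and not self.reason:
            raise ValueError(f"{self.operation_id}: {self.status.value} needs a reason")


denied = DebugRecord(
    operation_id="phone:121",
    actor="phone-7",  # assumed: same device as phone:118
    object_id="job:J42",
    effect="job_status.write(closed)",
    status=Status.REJECTED,
    reason="missing close-job right for epoch E9",
    repair="authority:7001 emitted close_request_denied(phone:121)",
)
print(denied.status.value, "-", denied.reason)
```

Making the reason mandatory at construction time is a small design choice that prevents the "dropped silently" failure mode from entering the log in the first place.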

The trade-off is storage and privacy. Debug records can contain sensitive actor, object, and content references. Keep enough metadata to explain behavior, but redact payloads where possible and apply retention rules deliberately.

Worked Example: The Missing Checklist Item

Suppose the tablet shows:

job J42:
  inspect valve = done
  replace filter = done

The laptop shows:

job J42:
  inspect valve = done
  replace filter = not done

Start with the user's fact:

expected fact:
  "replace filter" was marked done

Find the operation:

operation:
  phone:118
  effect:
    checklist_done.add("replace filter")

Compare frontiers:

tablet:
  phone:118 present

laptop:
  phone:117 only

This is a missing operation, not a merge conflict. The immediate repair is anti-entropy:

laptop asks tablet or edge:
  send operations after phone:117

laptop receives:
  phone:118

laptop renders:
  replace filter = done
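The anti-entropy exchange above can be sketched in a few lines. This assumes operations are applied in per-actor sequence order and that the checklist is a grow-only set. The helper names, and the guess that `phone:117` was the "inspect valve" operation, are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Op = Tuple[str, int, str]  # (actor, sequence number, task marked done)


def ops_missing_from(frontier: Dict[str, int], log: List[Op]) -> List[Op]:
    """Operations in the sender's log that the requester's frontier does not cover."""
    missing = [op for op in log if op[1] > frontier.get(op[0], 0)]
    return sorted(missing, key=lambda op: (op[0], op[1]))


def apply_ops(frontier: Dict[str, int], done: Set[str], ops: List[Op]) -> None:
    """Apply operations in per-actor order; skip duplicates and out-of-order gaps."""
    for actor, seq, task in ops:
        if seq != frontier.get(actor, 0) + 1:
            continue  # a fuller implementation would buffer gaps instead of skipping
        done.add(task)
        frontier[actor] = seq


edge_log = [("phone", 117, "inspect valve"), ("phone", 118, "replace filter")]
laptop_frontier = {"phone": 117}
laptop_done = {"inspect valve"}

missing = ops_missing_from(laptop_frontier, edge_log)
apply_ops(laptop_frontier, laptop_done, missing)
apply_ops(laptop_frontier, laptop_done, missing)  # redelivery changes nothing
print(missing, laptop_frontier, sorted(laptop_done))
```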

Now change the case. The laptop has phone:118, but still renders the item as not done.

laptop:
  phone:118 present
  visible state disagrees

Now you inspect merge and validation:

operation phone:118:
  accepted by edge
  rejected by laptop

reason:
  schema version did not recognize task_id format

This is not sync lag. It is a compatibility failure. The repair may be a schema migration, a replay after upgrade, or an explicit repair operation:

repair operation:
  region:9002
  translate old task id "replace filter"
  to canonical task id task:filter-replacement
  depends_on: phone:118

A third case: both replicas have the operation, both accepted it, but they render differently after compaction.

tablet:
  compacted tombstone for task old-id

laptop:
  late operation references old-id

That points to an unsafe compaction boundary. Resync alone will not repair it, because the compacted replica no longer holds the tombstone the late operation depends on. You need to restore a snapshot, block unsafe compaction, or force old clients to rejoin through a new epoch.
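One way to block unsafe compaction is to check, before dropping a tombstone, that every replica that may still rejoin has already seen the delete. The sketch below assumes the system records the last frontier each known replica reported. The function names and the tombstone position `region:150` are hypothetical.

```python
from typing import Dict, List

Frontier = Dict[str, int]


def unsafe_replicas(tombstone_actor: str, tombstone_seq: int,
                    replica_frontiers: Dict[str, Frontier]) -> List[str]:
    """Replicas whose last reported frontier does not yet cover the tombstone."""
    return sorted(
        name for name, frontier in replica_frontiers.items()
        if frontier.get(tombstone_actor, 0) < tombstone_seq
    )


def safe_to_compact(tombstone_actor: str, tombstone_seq: int,
                    replica_frontiers: Dict[str, Frontier]) -> bool:
    """Compact only when no known replica could still send a late reference."""
    return not unsafe_replicas(tombstone_actor, tombstone_seq, replica_frontiers)


# Hypothetical: the tombstone for task old-id was written as region:150.
reported = {
    "tablet": {"region": 150},
    "laptop": {"region": 143},
}
print(safe_to_compact("region", 150, reported))
print(unsafe_replicas("region", 150, reported))
```

The list of lagging replicas is as important as the boolean: it is the observability signal that tells an operator which clients would break on rejoin.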

Repair As Data

Manual edits are tempting during incidents:

UPDATE jobs SET status = 'open' WHERE id = 'J42';

In a replicated mergeable system, that can create a new split. The primary database may now say one thing, while offline clients and edge replicas keep merging old operations.

Prefer repairs that enter the same replicated log or CRDT state:

repair operation:
  op_id: region:9010
  kind: override job_status
  object: job:J42
  new_status: open
  reason: close operation lacked required right
  supersedes: phone:121
  authorized_by: incident-commander-4

Every replica can receive, validate, and merge the repair:

if repair authority is valid for object and epoch:
  mark phone:121 as revoked
  render job status from region:9010
  preserve audit trail
else:
  reject repair

That keeps the system explainable. The status did not mysteriously change; a named repair superseded a named operation.
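The validate-then-merge rule above can be sketched as follows. This is an illustration under simplifying assumptions: repair authority is modeled as a plain set of names for the current epoch, and a single repair per object is assumed, so concurrent repairs would need their own merge rule. The `Repair` and `Replica` classes are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class Repair:
    op_id: str
    object_id: str
    new_status: str
    supersedes: str
    authorized_by: str
    reason: str


@dataclass
class Replica:
    repair_authorities: Set[str]  # assumed policy: who may repair in this epoch
    status: Dict[str, str] = field(default_factory=dict)
    revoked: Set[str] = field(default_factory=set)
    applied: Set[str] = field(default_factory=set)
    audit: List[str] = field(default_factory=list)

    def apply_repair(self, repair: Repair) -> bool:
        if repair.op_id in self.applied:
            return True  # redelivery is harmless
        if repair.authorized_by not in self.repair_authorities:
            self.audit.append(f"rejected {repair.op_id}: no repair authority")
            return False
        self.revoked.add(repair.supersedes)
        self.status[repair.object_id] = repair.new_status
        self.applied.add(repair.op_id)
        self.audit.append(f"{repair.op_id} superseded {repair.supersedes}: {repair.reason}")
        return True


laptop = Replica(repair_authorities={"incident-commander-4"},
                 status={"job:J42": "closed"})
repair = Repair("region:9010", "job:J42", "open", "phone:121",
                "incident-commander-4", "close operation lacked required right")
laptop.apply_repair(repair)
laptop.apply_repair(repair)  # second delivery changes nothing
print(laptop.status, laptop.revoked, laptop.audit)
```

Because the rejection is also written to the audit list, an invalid repair leaves evidence instead of disappearing.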

Some repairs are derived-view rebuilds:

rebuild search index from accepted operations
recompute activity feed from source facts
recalculate completed_count from checklist_done set

Other repairs are semantic:

resolve multi-value register conflict
reassign duplicate slug after authority decision
revoke malicious or unauthorized operation
emit missing tombstone after failed deletion

Derived repairs can often be automated. Semantic repairs need domain authority and an audit trail.
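A derived repair such as recalculating completed_count can be a pure function over accepted operations. The sketch below assumes each operation is logged with its validation status. The tuple layout and the rejected `tablet:9` entry are made up for illustration.

```python
from typing import Iterable, Set, Tuple

# (operation_id, validation status, task marked done)
ChecklistOp = Tuple[str, str, str]


def rebuild_checklist_done(ops: Iterable[ChecklistOp]) -> Set[str]:
    """Rebuild the checklist_done set from accepted operations only."""
    return {task for _, status, task in ops if status == "accepted"}


def completed_count(ops: Iterable[ChecklistOp]) -> int:
    """Derived counter: never stored as truth, always recomputable."""
    return len(rebuild_checklist_done(ops))


log = [
    ("phone:117", "accepted", "inspect valve"),
    ("phone:118", "accepted", "replace filter"),
    ("phone:118", "accepted", "replace filter"),  # duplicate delivery
    ("tablet:9", "rejected", "drain tank"),       # hypothetical rejected operation
]
print(completed_count(log))
```

Deriving the counter from a set means duplicate deliveries cannot inflate it, which is why this kind of rebuild is safe to automate.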

Debugging Playbook

When a CRDT-backed workflow looks wrong, avoid starting with "which replica is correct?" Start with "what facts does each replica know?"

1. Name the user-visible symptom.
2. Identify the object, field, and expected fact.
3. Find the operation IDs that should explain that fact.
4. Compare replica frontiers.
5. Check whether missing operations are lag, drop, rejection, or compaction.
6. Check whether the merge result violates a merge law (commutativity, associativity, idempotence) or a product invariant.
7. Repair with sync, replay, rebuild, or an explicit repair operation.
8. Save the incident as a regression history.
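Steps 4 and 5 can be sketched as a small classifier. It assumes each replica can report its frontier plus whatever local record it kept for the operation in question. The `diagnose` function and its status strings are hypothetical, and a real system would likely need more cases.

```python
from typing import Dict, Optional


def diagnose(actor: str, seq: int, frontier: Dict[str, int],
             local_status: Optional[str]) -> str:
    """Classify why a replica does not show the effect of operation actor:seq.

    local_status is the replica's own record for the operation:
    None (no record), "accepted", "rejected", or "compacted".
    """
    if frontier.get(actor, 0) < seq:
        return "lag"              # not received yet: sync
    if local_status is None:
        return "drop"             # frontier covers it, but no record survives
    if local_status == "rejected":
        return "rejection"        # look at the validation reason
    if local_status == "compacted":
        return "compaction"       # check the compaction boundary
    return "merge-or-render"      # accepted but wrong: inspect merge and views


print(diagnose("phone", 118, {"phone": 117}, None))
print(diagnose("phone", 118, {"phone": 118}, "rejected"))
print(diagnose("phone", 118, {"phone": 118}, None))
```

The "drop" branch is the dangerous one: the frontier claims the operation was seen, yet nothing explains what happened to it.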

The last step connects directly to testing. If production found a failure, turn it into a history like the ones from the previous lesson:

history:
  phone emits phone:118
  edge accepts phone:118
  laptop rejects phone:118 due to schema version
  laptop renders missing checklist item

expected:
  old schema buffers or upgrades operation

actual:
  old schema silently drops operation

Now the team can write a property or scenario test:

old clients must not silently drop unknown but valid operations
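A scenario test for that property might look like the sketch below. It assumes operations carry a schema version and that a client buffers versions it does not understand, then replays them after upgrade. The `Client` class and the version numbers are invented for this example and do not describe the lesson's actual client.

```python
from typing import List, Set, Tuple


class Client:
    """Toy client that buffers operations from schema versions it does not know."""

    def __init__(self, supported_versions: Set[int]) -> None:
        self.supported = set(supported_versions)
        self.applied: List[str] = []
        self.buffered: List[Tuple[str, int]] = []

    def receive(self, op_id: str, schema_version: int) -> None:
        if schema_version in self.supported:
            self.applied.append(op_id)
        else:
            self.buffered.append((op_id, schema_version))  # keep, never drop

    def upgrade(self, new_version: int) -> None:
        self.supported.add(new_version)
        pending, self.buffered = self.buffered, []
        for op_id, version in pending:
            self.receive(op_id, version)  # replay what was waiting


def test_unknown_operations_are_never_silently_dropped() -> None:
    client = Client({1})
    client.receive("phone:117", 1)
    client.receive("phone:118", 2)  # newer schema than the client understands

    accounted_for = set(client.applied) | {op for op, _ in client.buffered}
    assert accounted_for == {"phone:117", "phone:118"}

    client.upgrade(2)
    assert client.applied == ["phone:117", "phone:118"]
    assert client.buffered == []


test_unknown_operations_are_never_silently_dropped()
```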

Good observability shortens the path from "a user saw something impossible" to "here is the exact history we must never allow again."

Failure Modes

Only watching request errors: A merge bug can happen while every HTTP request succeeds.
Using wall-clock order as causality: Clocks help with incident timelines, but frontiers and dependencies explain replicated state.
Dropping rejected operations silently: A rejected update needs a reason that clients and operators can see.
Repairing only one database copy: Manual edits outside the replicated path can be overwritten or contradicted later.
Treating derived views as source truth: Search, feeds, counters, and read models should be rebuildable from accepted operations.
Compacting without observability: If you cannot explain which clients are safe past a boundary, tombstone cleanup can break rejoin.
Hiding conflict state from support tools: Operators need to see multi-value conflicts and pending authority decisions.
Forgetting privacy in debug records: Operation metadata is valuable, but payloads and actor details need retention and access control.

Practice

Diagnose this incident:

Symptom:
  phone shows case C19 as closed
  web app shows case C19 as open

Known facts:
  phone emitted phone:77 close request
  edge accepted phone:77
  region emitted region:500 close denied
  phone has not received region:500
  web app has region:500

Answer:

1. Is this expected lag, a conflict, or an invalid repair?
2. Which operation should the phone receive next?
3. What should the UI show while the denial is missing?
4. What metric or debug record would have made this easier to see?

Then design a repair workflow for this case:

Bug:
  old clients silently dropped valid checklist operations with a new task_id format

Specify:

source of truth:

replay or repair operation:

client behavior after upgrade:

regression history:

dashboard signal:

Connections

015.md introduced compaction boundaries; observability needs to show when tombstones and snapshots are safe.
018.md treated clients and edges as replicas; this lesson adds the operational records needed to debug them.
019.md showed counterexample histories; incidents should become the same kind of replayable history.
021.md moves from accidental bad updates to malicious or unauthorized updates, where trust boundaries become part of repair.

Resources

[PAPER] Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System
- Focus: Study tentative updates, conflict handling, and application-level resolution in weakly connected replication.
[PAPER] Dynamo: Amazon's Highly Available Key-value Store
- Focus: Review version tracking, divergent values, and operational handling of always-writable replicas.
[DOC] OpenTelemetry Concepts
- Focus: Use the vocabulary of traces, metrics, and logs while adding CRDT-specific state such as frontiers and operation IDs.
[DOC] Automerge: Sync
- Focus: Compare frontiers, missing changes, and synchronization state with the lesson's debug model.
[ARTICLE] Jepsen Analyses
- Focus: Read concrete examples of turning distributed anomalies into explainable histories.

Key Takeaways

CRDT observability must explain partial knowledge: operation IDs, causal frontiers, dependency waits, rejections, conflicts, and repairs.
Debugging starts by comparing what each replica has seen, not by assuming one copy is automatically authoritative.
Repairs should usually be explicit replicated operations with reasons, authority, and audit trail.
Production incidents should become replayable histories so the same merge, sync, or compatibility failure is caught before it repeats.
