CRDTs and Coordination Avoidance: Migration, Compatibility, and Rolling Upgrades

LESSON

CRDTs and Coordination Avoidance

022 30 min intermediate

Core Insight

Imagine a collaborative app that has stored tags as a simple add-only set for years. The product now needs observed-remove semantics, so users can remove a tag on one device without it reappearing from another device's old state. The new client writes dots and tombstones. The old client only knows tags = {"urgent", "vip"}.

If you deploy the new merge code everywhere at once, the data center may be fine, but offline clients and edge replicas are still old. Some of them will wake up next week and send old-format operations. Others may read a new-format snapshot and silently drop fields they do not understand. In a coordination-avoidance system, "rolling upgrade" includes clients that are not rolling on your schedule.

CRDT migration is not just schema migration. It is migration of merge semantics. A new field can change how deletes, compaction, permissions, dependencies, or conflicts behave. The system must preserve convergence while old and new replicas coexist.

The practical design is to make compatibility a protocol, not a hope. Introduce new operation versions, adapters, feature gates, epochs, backfills, and safe read/write phases. A replica should either understand an operation, translate it, buffer it, or reject it with a clear reason. It should not silently apply half of a new semantic rule.

Why Ordinary Schema Migration Is Not Enough

In a single-primary database, a common migration is:

1. add nullable column
2. backfill
3. deploy code that writes the column
4. remove old column later

That pattern is useful, but CRDT systems add more pressure:

offline clients:
  may keep writing old operations for days or months

edge replicas:
  may accept writes before global state catches up

anti-entropy:
  may replay old states or deltas after new code is deployed

merge semantics:
  may change the meaning of old fields

compaction:
  may delete metadata that old clients still reference

Consider the tag migration:

old state:
  tags: set tag

new state:
  adds: tag -> set dot
  removes: set dot

The old representation can tell you that "urgent" is currently visible. It cannot tell you which add dot created it. The new observed-remove representation needs dots so a remove can say exactly which observed additions it removes.

If a new replica converts:

tags = {"urgent"}

into:

adds["urgent"] = {migration:1}
removes = {}

then an old client may later send the old set again:

old client sync:
  tags = {"urgent"}

If the adapter turns that into a fresh add every time, each sync mints a dot that no earlier remove has observed, so a removed tag reappears and may become impossible to remove for good. The migration needs stable translation, not one-off parsing.
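The difference between a fresh-dot adapter and a stable one can be sketched in a few lines. This is a minimal illustration, not the lesson's implementation: the TagsV2 class, the adapter function names, and the adapter:N dot format are assumptions made for the example, while the migration:P:T dot shape follows the backfill described later in this lesson.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class TagsV2:
    adds: dict = field(default_factory=dict)    # tag -> set of add dots
    removes: set = field(default_factory=set)   # dots that have been removed

    def add(self, tag, dot):
        self.adds.setdefault(tag, set()).add(dot)

    def remove(self, tag):
        # Observed-remove: tombstone every add dot seen so far for this tag.
        self.removes |= self.adds.get(tag, set())

    def visible(self):
        return {t for t, dots in self.adds.items() if dots - self.removes}


_fresh = itertools.count(1)


def absorb_fresh(state, old_tags):
    # Unsafe adapter (hypothetical): every old-format sync mints a new dot.
    for tag in old_tags:
        state.add(tag, f"adapter:{next(_fresh)}")


def absorb_stable(state, project_id, old_tags):
    # Stable adapter (hypothetical): the same old fact always maps to the same dot.
    for tag in sorted(old_tags):
        state.add(tag, f"migration:{project_id}:{tag}")


def replay(adapter):
    state = TagsV2()
    adapter(state, {"urgent"})   # initial conversion of the old set
    state.remove("urgent")       # a new client removes the tag
    adapter(state, {"urgent"})   # a stale old client syncs the old set again
    return state.visible()


fresh_result = replay(absorb_fresh)
stable_result = replay(lambda s, tags: absorb_stable(s, "P7", tags))
print(fresh_result)    # the removed tag is visible again
print(stable_result)   # the tag stays removed
```

Note the cost of the stable adapter: a deliberate re-add from an old client is also absorbed into the already-removed dot, so stable translation alone does not give old clients a working re-add. That is one reason old writers still need an explicit policy.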

Versioned Operations

A safe migration begins by making operation shape explicit:

operation:
  op_id: phone:77
  object: project:P7
  kind: tags.add
  schema_version: 1
  effect:
    tag: urgent

New operations carry the new shape:

operation:
  op_id: phone:102
  object: project:P7
  kind: tags.add
  schema_version: 2
  effect:
    tag: urgent
    dot: phone:102

Versioning lets each replica choose a safe path:

if version is supported:
  validate and apply
elif translator exists:
  translate to supported form
elif operation can wait:
  buffer and request upgrade or missing context
else:
  reject with reason
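The four-way choice above can be sketched as a routing function. This is an illustration under stated assumptions: the Operation dataclass, the can_wait flag, the SUPPORTED_VERSIONS set, and the translator table are hypothetical names, and reusing op_id as the dot for a translated version 1 add is one possible stable mapping, not a rule stated by the lesson.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    APPLY = "apply"
    TRANSLATE = "translate"
    BUFFER = "buffer"
    REJECT = "reject"


@dataclass(frozen=True)
class Operation:
    op_id: str
    obj: str
    kind: str
    schema_version: int
    effect: dict
    can_wait: bool = False   # hypothetical flag: may this op be held back?


SUPPORTED_VERSIONS = {2}     # what this replica applies natively


def translate_v1_add(op):
    # Assumption: op_id is stable, so it can serve as the dot of a v1 add.
    effect = {"tag": op.effect["tag"], "dot": op.op_id}
    return Operation(op.op_id, op.obj, op.kind, 2, effect)


TRANSLATORS = {(1, "tags.add"): translate_v1_add}


def route(op):
    """Return (decision, payload): an operation, or a reason string on reject."""
    if op.schema_version in SUPPORTED_VERSIONS:
        return Decision.APPLY, op
    translator = TRANSLATORS.get((op.schema_version, op.kind))
    if translator is not None:
        return Decision.TRANSLATE, translator(op)
    if op.can_wait:
        return Decision.BUFFER, op
    reason = f"unsupported schema_version {op.schema_version} for {op.kind}"
    return Decision.REJECT, reason


v1_add = Operation("phone:77", "project:P7", "tags.add", 1, {"tag": "urgent"})
v3_add = Operation("phone:200", "project:P7", "tags.add", 3, {"tag": "urgent"}, can_wait=True)
v1_other = Operation("phone:78", "project:P7", "tags.rename", 1, {})

for op in (v1_add, v3_add, v1_other):
    decision, payload = route(op)
    print(op.op_id, decision.value, payload)
```

Every path returns an explicit decision, so there is no branch in which an operation is applied partially.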

The dangerous path is silent downgrade:

new operation:
  effect:
    add tag urgent with dot phone:102
    remove dots {tablet:9}

old client sees:
  tag urgent

old client drops:
  remove dots {tablet:9}

The old client has not merely ignored optional metadata. It has changed the meaning of the operation. If a field affects merge semantics, old replicas must not silently discard it.

Use this rule:

optional field:
  can be ignored without changing safety or merge meaning

semantic field:
  must be understood, translated, buffered, or rejected
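An old reader cannot classify a field it has never heard of, so the sender has to say which fields are semantic. One possible sketch, assuming a hypothetical must_understand list carried next to the effect (this list, the exception name, and the remove_dots key are illustrative and not part of the lesson's operation format):

```python
class UnsupportedSemantics(Exception):
    """Raised when an effect carries semantic fields the reader cannot interpret."""


def accept_effect(effect, must_understand, understood):
    # Refuse the whole effect if any semantic field is unknown to this reader.
    missing = sorted(set(must_understand) - set(understood))
    if missing:
        raise UnsupportedSemantics(f"cannot interpret semantic fields: {missing}")
    # Unknown fields that remain are optional by declaration and can be ignored.
    return {k: v for k, v in effect.items() if k in understood}


OLD_CLIENT_UNDERSTANDS = {"tag"}

# An optional field: the old client may ignore display_color.
cosmetic = accept_effect(
    {"tag": "urgent", "display_color": "blue"},
    must_understand=["tag"],
    understood=OLD_CLIENT_UNDERSTANDS,
)
print(cosmetic)

# A semantic field: the old client must not keep the tag and drop the remove.
try:
    accept_effect(
        {"tag": "urgent", "remove_dots": ["tablet:9"]},
        must_understand=["tag", "remove_dots"],
        understood=OLD_CLIENT_UNDERSTANDS,
    )
    outcome = "applied"
except UnsupportedSemantics as exc:
    outcome = f"rejected: {exc}"
print(outcome)
```

The caller can turn the exception into a buffer or reject decision. What matters is that the old client never applies the tag while discarding the remove.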

Compatibility Modes

Rolling upgrades are easier when the system names its mode.

mode 1: read-old-write-old
  every replica understands only old operations

mode 2: read-new-write-old
  new code can read new representation
  writes still use old operation shape

mode 3: read-new-write-both
  new writers emit old and new compatible effects
  validators compare both views

mode 4: read-new-write-new
  new operation shape is allowed
  old clients must be upgraded, routed through adapters, or blocked

mode 5: compact-old
  old metadata is removed only after stability is proven

The exact names matter less than the discipline. You do not want a random client to start writing version 2 operations before edges, validators, repair tools, and support dashboards can explain them.
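Naming the mode also makes it checkable in code. The sketch below assumes that version 1 means the old operation shape and version 2 the new one, that compact-old leaves only version 2 readable, and that a rollout advances one mode at a time. All three are assumptions for illustration.

```python
from enum import IntEnum


class Mode(IntEnum):
    READ_OLD_WRITE_OLD = 1
    READ_NEW_WRITE_OLD = 2
    READ_NEW_WRITE_BOTH = 3
    READ_NEW_WRITE_NEW = 4
    COMPACT_OLD = 5


# Which operation schema versions each mode may read and write (assumed mapping).
READS = {
    Mode.READ_OLD_WRITE_OLD: {1},
    Mode.READ_NEW_WRITE_OLD: {1, 2},
    Mode.READ_NEW_WRITE_BOTH: {1, 2},
    Mode.READ_NEW_WRITE_NEW: {1, 2},
    Mode.COMPACT_OLD: {2},
}
WRITES = {
    Mode.READ_OLD_WRITE_OLD: {1},
    Mode.READ_NEW_WRITE_OLD: {1},
    Mode.READ_NEW_WRITE_BOTH: {1, 2},
    Mode.READ_NEW_WRITE_NEW: {2},
    Mode.COMPACT_OLD: {2},
}


def may_write(mode, schema_version):
    return schema_version in WRITES[mode]


def advance(current, requested):
    # Assumed policy: no skipping phases, so each step can be validated.
    if int(requested) != int(current) + 1:
        raise ValueError(f"cannot move from {current.name} to {Mode(requested).name}")
    return Mode(requested)


mode = Mode.READ_NEW_WRITE_OLD
print(may_write(mode, 2))   # version 2 writes are not yet allowed in this mode
mode = advance(mode, Mode.READ_NEW_WRITE_BOTH)
print(may_write(mode, 2))
```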

Feature gates help:

project P7:
  tag_model = observed_remove
  epoch = tags-v2
  minimum_client = 4.8

When a client below 4.8 tries to write tags, the server or edge can decide:

allow old note edits:
  safe unrelated field

reject old tag writes:
  old client cannot preserve observed-remove semantics

offer upgrade:
  client must resync from tags-v2 snapshot

This is a coordination boundary, but a targeted one. You are not coordinating every note or comment. You are coordinating the moment when an object starts using new semantics.

Worked Example: Migrating Tags To Observed-Remove

Start with old add-only tags:

old project:
  tags = {"urgent"}

The product needs removal:

user action:
  remove "urgent"

The target CRDT is:

adds:
  urgent -> {dot}

removes:
  {dot}

visible:
  urgent is visible if any add dot is not removed

A safe migration can use these phases.

Phase 1: readers learn the new shape.

new reader:
  if tags_v2 exists:
    render tags_v2
  else:
    render old tags

No one writes tags_v2 yet.

Phase 2: backfill stable dots.

for each project P:
  for each old tag T:
    create dot migration:P:T
    tags_v2.adds[T].add(migration:P:T)

The migration dot is deterministic. Running the backfill twice produces the same dot, not a new add.
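The idempotence claim is easy to check in a small sketch. Here tags_v2 is modelled as a plain dict with "adds" and "removes" keys, and the function names are illustrative. The sketch also assumes project IDs and tag names contain no ":" characters; if they can, the dot would need an unambiguous encoding.

```python
def migration_dot(project_id, tag):
    # Derived only from stable inputs, never from time or a counter.
    return f"migration:{project_id}:{tag}"


def backfill(project_id, old_tags, tags_v2):
    adds = tags_v2.setdefault("adds", {})
    tags_v2.setdefault("removes", set())
    for tag in sorted(old_tags):
        adds.setdefault(tag, set()).add(migration_dot(project_id, tag))
    return tags_v2


state = backfill("P7", {"urgent", "vip"}, {})
first_pass = {tag: set(dots) for tag, dots in state["adds"].items()}

# Running the backfill again must not create any new identity.
backfill("P7", {"urgent", "vip"}, state)
second_pass = {tag: set(dots) for tag, dots in state["adds"].items()}

print(first_pass == second_pass)
```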

Phase 3: compare old and new views.

old visible:
  {"urgent"}

new visible:
  {"urgent"}

if different:
  stop rollout and repair before allowing writes
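The comparison step can be expressed as a check that reports which projects block the rollout. A minimal sketch, again modelling tags_v2 as a dict with "adds" and "removes"; the function names and the P8 example, which simulates a project the backfill only partly covered, are hypothetical.

```python
def visible_v2(tags_v2):
    removes = tags_v2.get("removes", set())
    # A tag is visible if at least one of its add dots has not been removed.
    return {tag for tag, dots in tags_v2.get("adds", {}).items() if dots - removes}


def compare_views(old_tags, tags_v2):
    new_visible = visible_v2(tags_v2)
    return {
        "missing_in_v2": sorted(set(old_tags) - new_visible),
        "extra_in_v2": sorted(new_visible - set(old_tags)),
    }


def projects_blocking_rollout(projects):
    blocking = {}
    for project_id, (old_tags, tags_v2) in projects.items():
        diff = compare_views(old_tags, tags_v2)
        if diff["missing_in_v2"] or diff["extra_in_v2"]:
            blocking[project_id] = diff
    return blocking


projects = {
    "P7": ({"urgent"}, {"adds": {"urgent": {"migration:P7:urgent"}}, "removes": set()}),
    # P8 simulates a backfill that missed a tag.
    "P8": ({"urgent", "vip"}, {"adds": {"urgent": {"migration:P8:urgent"}}, "removes": set()}),
}

blocking = projects_blocking_rollout(projects)
print(blocking)
```

An empty result is the condition for moving on to Phase 4. Anything else names the projects to repair first.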

Phase 4: new writes use tags_v2.

new remove:
  removes.add(migration:P:urgent)

Old clients must not keep writing the old tags field for objects in the tags-v2 epoch. If they do, they can resurrect removed tags. The system can reject old tag writes:

rejection:
  old tag write rejected
  reason: project P is in tags-v2 epoch
  action: client must upgrade and resync

Phase 5: compact old state.

safe to remove old tags only when:
  all active writers use tags_v2
  old clients are blocked or forced to resync
  repair tools read tags_v2
  snapshots include tags_v2
  observability dashboards expose tags_v2 conflicts
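The checklist above can be encoded as an explicit precondition so that compaction cannot run on a partial yes. This is only a sketch: the field names paraphrase the five conditions, and how each boolean gets measured is outside its scope.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CompactionChecks:
    all_active_writers_use_v2: bool
    old_clients_blocked_or_resynced: bool
    repair_tools_read_v2: bool
    snapshots_include_v2: bool
    dashboards_expose_v2_conflicts: bool


def compaction_blockers(checks):
    # Returns the names of unmet conditions, in declaration order.
    return [f.name for f in fields(checks) if not getattr(checks, f.name)]


checks = CompactionChecks(
    all_active_writers_use_v2=True,
    old_clients_blocked_or_resynced=False,
    repair_tools_read_v2=True,
    snapshots_include_v2=True,
    dashboards_expose_v2_conflicts=True,
)

blockers = compaction_blockers(checks)
safe_to_compact = not blockers
print(safe_to_compact, blockers)
```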

The small win is concrete: the learner can now explain why "add new field and backfill" is not enough. The backfill needs stable identity, old writers need a policy, and compaction needs a boundary that old clients must resync across before they rejoin.

Rolling Upgrades And Epochs

Some migrations change only representation. Others change authority or safety semantics. Those need epochs.

epoch tags-v1:
  old add-only tags

epoch tags-v2:
  observed-remove tags
  old tag writes rejected
  old clients must resync

An epoch says, "from this boundary onward, operations are interpreted under this rule." It gives validators and repair tools something precise to check:

operation phone:88
  object: project:P7
  epoch: tags-v1
  effect: tags.add("urgent")

current object epoch:
  tags-v2

decision:
  reject, or translate only if translation is safe
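An epoch check of this kind might be sketched as follows. The ordered EPOCHS list, the dict-shaped operation, and the safe_translators table are assumptions. Buffering an operation whose epoch is newer than the one this replica knows for the object is also an assumption, one possible policy rather than something the lesson prescribes.

```python
EPOCHS = ["tags-v1", "tags-v2"]   # ordered oldest to newest


def decide(op, object_epoch, safe_translators):
    op_rank = EPOCHS.index(op["epoch"])
    object_rank = EPOCHS.index(object_epoch)
    if op_rank == object_rank:
        return "apply", op
    if op_rank > object_rank:
        # Assumed policy: this replica has not seen the boundary yet, so wait.
        return "buffer", op
    # The operation predates the object's epoch: translate only if known safe.
    translator = safe_translators.get((op["epoch"], op["kind"]))
    if translator is None:
        reason = f"{op['object']} is in {object_epoch} epoch; {op['epoch']} {op['kind']} rejected"
        return "reject", reason
    return "translate", translator(op)


op = {
    "op_id": "phone:88",
    "object": "project:P7",
    "epoch": "tags-v1",
    "kind": "tags.add",
    "effect": {"tag": "urgent"},
}

# With no translator registered as safe, the stale add is rejected with a reason.
decision, detail = decide(op, "tags-v2", safe_translators={})
print(decision, detail)
```

Keeping the translator table empty by default means that translation has to be enabled deliberately per operation kind, after someone has argued that it is safe.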

Not every field needs a hard epoch. A backward-compatible metadata field may roll out without blocking old clients:

new optional field:
  display_color = blue

old clients:
  ignore color
  still preserve source facts

But a field that changes remove semantics, permission semantics, inventory rights, or compaction safety should get an explicit boundary.

The trade-off is user experience. Blocking old clients can be painful. Allowing them to corrupt new semantics is worse. Good rollout plans make the block narrow and explainable:

you can still add notes
you cannot edit tags until this client upgrades

Compatibility Tests

The previous lesson tested generated histories. Migration tests should generate mixed-version histories.

replicas:
  old client v1
  new client v2
  edge v2
  region v2

events:
  old client writes old tag
  backfill creates migration dot
  new client removes tag
  old client reconnects with stale old state
  edge rejects old tag write in tags-v2 epoch
  all replicas converge

Useful properties:

stable translation:
  translating the same old fact twice produces the same new identity

no silent semantic drop:
  a client that cannot preserve new semantics must not apply only part of the effect

mixed-version convergence:
  all supported versions converge after allowed delivery

epoch enforcement:
  old operations after a boundary are rejected, translated, or quarantined by policy

rollback safety:
  rolling back code does not create writers that corrupt already-migrated objects
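Two of these properties, stable translation and mixed-version convergence, can be checked exhaustively on a small history. The sketch below is state-based and makes several simplifying assumptions: merge is a plain union of adds and removes, old-format state is translated with the deterministic migration dots from the worked example, and every delivery order of four messages is tried instead of sampling random histories. Epoch enforcement and rollback safety would need a richer model.

```python
from itertools import permutations


def empty_state():
    return {"adds": {}, "removes": set()}


def merge(state, incoming):
    # Union merge: commutative, associative, and idempotent.
    for tag, dots in incoming["adds"].items():
        state["adds"].setdefault(tag, set()).update(dots)
    state["removes"].update(incoming["removes"])


def from_old(project_id, old_tags):
    # Stable translation of an old add-only set into v2 state.
    adds = {tag: {f"migration:{project_id}:{tag}"} for tag in old_tags}
    return {"adds": adds, "removes": set()}


def visible(state):
    return {t for t, dots in state["adds"].items() if dots - state["removes"]}


messages = [
    from_old("P7", {"urgent"}),                       # old client state
    {"adds": {"urgent": {"migration:P7:urgent"}},     # new client removes the tag
     "removes": {"migration:P7:urgent"}},
    {"adds": {"vip": {"phone:102"}}, "removes": set()},
    from_old("P7", {"urgent"}),                       # stale old state, resent later
]


def run(order):
    state = empty_state()
    for message in order:
        merge(state, message)
    return state


results = [run(order) for order in permutations(messages)]
converged = all(result == results[0] for result in results)
print(converged, visible(results[0]))
```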

Rollback deserves special attention. If a deployment rolls back from v2 code to v1 code after some objects entered tags-v2, the old code may be unable to interpret current data.

Plan rollback like this:

before enabling v2 writes:
  old code can safely read or refuse v2 objects

after enabling v2 writes:
  rollback must preserve validators that reject unsafe old writes
  or rollback must be blocked until data is returned to a safe state
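The plan can be turned into a pre-rollback check. The sketch below is illustrative only: the validator name reject_old_tag_writes, the can_read_or_refuse_v2 flag, and the shape of the build description are hypothetical, standing in for whatever a real deployment system records about a build.

```python
REQUIRED_VALIDATORS = {"reject_old_tag_writes"}   # hypothetical validator name


def rollback_decision(object_epochs, target_build):
    migrated = sorted(obj for obj, epoch in object_epochs.items() if epoch == "tags-v2")
    if not migrated:
        return {"allowed": True, "reason": "no objects have entered tags-v2"}
    missing = sorted(REQUIRED_VALIDATORS - set(target_build["validators"]))
    if missing or not target_build["can_read_or_refuse_v2"]:
        return {
            "allowed": False,
            "reason": "target build could corrupt migrated objects",
            "migrated_objects": migrated,
            "missing_validators": missing,
        }
    return {"allowed": True, "reason": "target build keeps the required validators"}


object_epochs = {"project:P7": "tags-v2", "project:P9": "tags-v1"}

plain_v1 = {"validators": [], "can_read_or_refuse_v2": False}
guarded_v1 = {"validators": ["reject_old_tag_writes"], "can_read_or_refuse_v2": True}

blocked = rollback_decision(object_epochs, plain_v1)
allowed = rollback_decision(object_epochs, guarded_v1)
print(blocked)
print(allowed)
```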

Rollback is not only redeploying a binary. It is returning to a previous protocol, and the data may already have moved forward.

Failure Modes

Practice

Design a migration for this change:

old field:
  assignees = add-only set of user_id

new field:
  assignees = observed-remove set with dots

Fill in:

1. How will you create deterministic migration dots?
2. Which clients may read old and new state?
3. When may new clients write the new operation shape?
4. What should happen when an old client writes assignees after the object enters v2?
5. Which dashboards or debug records should show rejected old writes?
6. When is it safe to compact the old field?

Then add a rollback scenario:

v2 writes have been enabled for 10 percent of projects.
The team discovers a bug and wants to roll back code.

Explain whether rollback is safe, which validators must remain active, and what data or epochs prevent old code from corrupting migrated projects.

Connections

Resources

Key Takeaways
