CRDTs and Coordination Avoidance: Migration, Compatibility, and Rolling Upgrades
Core Insight
Imagine a collaborative app that has stored tags as a simple add-only set for years. The product now needs observed-remove semantics, so users can remove a tag on one device without it reappearing from another device's old state. The new client writes dots and tombstones. The old client only knows tags = {"urgent", "vip"}.
If you deploy the new merge code everywhere at once, the data center may be fine, but offline clients and edge replicas are still old. Some of them will wake up next week and send old-format operations. Others may read a new-format snapshot and silently drop fields they do not understand. In a coordination-avoidance system, "rolling upgrade" includes clients that are not rolling on your schedule.
CRDT migration is not just schema migration. It is migration of merge semantics. A new field can change how deletes, compaction, permissions, dependencies, or conflicts behave. The system must preserve convergence while old and new replicas coexist.
The practical design is to make compatibility a protocol, not a hope. Introduce new operation versions, adapters, feature gates, epochs, backfills, and safe read/write phases. A replica should either understand an operation, translate it, buffer it, or reject it with a clear reason. It should not silently apply half of a new semantic rule.
Why Ordinary Schema Migration Is Not Enough
In a single-primary database, a common migration is:
1. add nullable column
2. backfill
3. deploy code that writes the column
4. remove old column later
That pattern is useful, but CRDT systems add more pressure:
offline clients:
  may keep writing old operations for days or months
edge replicas:
  may accept writes before global state catches up
anti-entropy:
  may replay old states or deltas after new code is deployed
merge semantics:
  may change the meaning of old fields
compaction:
  may delete metadata that old clients still reference
Consider the tag migration:
old state:
  tags: set tag
new state:
  adds: tag -> set dot
  removes: set dot
The old representation can tell you that "urgent" is currently visible. It cannot tell you which add dot created it. The new observed-remove representation needs dots so a remove can say exactly which observed additions it removes.
If a new replica converts:
tags = {"urgent"}
into:
adds["urgent"] = {migration:1}
removes = {}
then an old client may later send the old set again:
old client sync:
tags = {"urgent"}
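If the adapter turns each such sync into an add with a fresh dot, the tag may become impossible to remove: a remove covers only the dots it has observed, so the next stale sync makes the tag visible again. The migration needs stable translation, where the same old fact always maps to the same dot, not one-off parsing.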
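The difference between a fresh add and a stable translation can be sketched in a few lines of Python. This is a minimal illustration, not the lesson's adapter: the function names and the `migration:<object>:<tag>` dot format are assumptions chosen to match the backfill dots used later in the worked example.

```python
import uuid

def translate_unstable(object_id, old_tags, adds):
    """Anti-pattern: mints a new random dot every time the same old state arrives."""
    for tag in old_tags:
        adds.setdefault(tag, set()).add(f"adapter:{uuid.uuid4()}")

def translate_stable(object_id, old_tags, adds):
    """Derives the dot from (object, tag), so replaying old state adds nothing new."""
    for tag in old_tags:
        adds.setdefault(tag, set()).add(f"migration:{object_id}:{tag}")

def visible(adds, removes):
    """A tag is visible if at least one of its add dots has not been removed."""
    return {tag for tag, dots in adds.items() if dots - removes}

def replay_after_remove(translate):
    """Old state arrives, the tag is removed, then the same old state arrives again."""
    adds, removes = {}, set()
    translate("P7", {"urgent"}, adds)
    removes |= adds["urgent"]          # remove every dot observed so far
    translate("P7", {"urgent"}, adds)  # stale old client syncs the same set
    return visible(adds, removes)

print(replay_after_remove(translate_stable))
print(replay_after_remove(translate_unstable))
```

With the stable translator the replay leaves the tag removed; with the unstable one the tag comes back. Note that a stable dot also means an old client cannot express a genuine re-add after a remove, which is one reason the later phases restrict old tag writes instead of translating them forever.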
Versioned Operations
A safe migration begins by making operation shape explicit:
operation:
  op_id: phone:77
  object: project:P7
  kind: tags.add
  schema_version: 1
  effect:
    tag: urgent
New operations carry the new shape:
operation:
  op_id: phone:102
  object: project:P7
  kind: tags.add
  schema_version: 2
  effect:
    tag: urgent
    dot: phone:102
Versioning lets each replica choose a safe path:
if version is supported:
  validate and apply
elif translator exists:
  translate to supported form
elif operation can wait:
  buffer and request upgrade or missing context
else:
  reject with reason
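The decision ladder above can be sketched as a small routing function. Everything here is illustrative: `route`, `SUPPORTED_VERSIONS`, `TRANSLATORS`, and the choice to reuse the v1 `op_id` as the v2 dot are assumptions, and whether translating old adds is safe at all depends on the epoch policy discussed later.

```python
def translate_v1_add(op):
    """Assumed translator: reuse the stable op_id as the dot of a v1 add."""
    effect = {**op["effect"], "dot": op["op_id"]}
    return {**op, "schema_version": 2, "effect": effect}

SUPPORTED_VERSIONS = {2}             # versions this replica applies directly
TRANSLATORS = {1: translate_v1_add}  # versions it can translate safely

def route(op, can_wait=lambda op: False):
    """Return (action, payload): apply, buffer, or reject. Never a partial apply."""
    version = op["schema_version"]
    if version in SUPPORTED_VERSIONS:
        return "apply", op
    if version in TRANSLATORS:
        return "apply", TRANSLATORS[version](op)
    if can_wait(op):
        return "buffer", op
    return "reject", f"unsupported schema_version {version}"

old_op = {
    "op_id": "phone:77",
    "object": "project:P7",
    "kind": "tags.add",
    "schema_version": 1,
    "effect": {"tag": "urgent"},
}
print(route(old_op))
print(route({**old_op, "schema_version": 3}))
```

The translator returns a new operation rather than mutating its input, so the original v1 record stays available for audit or repair.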
The dangerous path is silent downgrade:
new operation:
  effect:
    add tag urgent with dot phone:102
    remove dots {tablet:9}
old client sees:
  tag urgent
old client drops:
  remove dots {tablet:9}
The old client has not merely ignored optional metadata. It has changed the meaning of the operation. If a field affects merge semantics, old replicas must not silently discard it.
Use this rule:
optional field:
  can be ignored without changing safety or merge meaning
semantic field:
  must be understood, translated, buffered, or rejected
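One way to make this rule checkable is for the writer to declare which fields are semantic, so a reader that does not recognize them knows it must not proceed. The sketch below assumes a hypothetical `semantic_fields` declaration carried with each operation; the lesson does not prescribe this encoding.

```python
def check_fields(effect, semantic_fields, understood):
    """Split the fields this reader does not understand into blocking and ignorable.

    effect:          the operation's effect fields
    semantic_fields: fields the writer declared as affecting merge meaning (assumed)
    understood:      fields this reader knows how to apply
    """
    unknown = set(effect) - set(understood)
    blocking = sorted(unknown & set(semantic_fields))
    ignorable = sorted(unknown - set(semantic_fields))
    return blocking, ignorable

effect = {
    "tag": "urgent",
    "dot": "phone:102",
    "remove_dots": ["tablet:9"],
    "display_color": "blue",
}
blocking, ignorable = check_fields(
    effect,
    semantic_fields=["dot", "remove_dots"],
    understood={"tag"},
)
if blocking:
    print("buffer or reject; cannot preserve:", blocking)
print("safe to ignore:", ignorable)
```

An old reader that only understands `tag` may skip `display_color`, but it must translate, buffer, or reject the operation because of `dot` and `remove_dots`.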
Compatibility Modes
Rolling upgrades are easier when the system names its mode.
mode 1: read-old-write-old
  every replica understands only old operations
mode 2: read-new-write-old
  new code can read the new representation
  writes still use the old operation shape
mode 3: read-new-write-both
  new writers emit both old-shape and new-shape effects that stay compatible
  validators compare both views
mode 4: read-new-write-new
  the new operation shape is allowed
  old clients must be upgraded, routed through adapters, or blocked
mode 5: compact-old
  old metadata is removed only after stability is proven
The exact names matter less than the discipline. You do not want a random client to start writing version 2 operations before edges, validators, repair tools, and support dashboards can explain them.
Feature gates help:
project P7:
  tag_model = observed_remove
  epoch = tags-v2
  minimum_client = 4.8
When a client below 4.8 tries to write tags, the server or edge can decide:
allow old note edits:
  safe unrelated field
reject old tag writes:
  old client cannot preserve observed-remove semantics
offer upgrade:
  client must resync from tags-v2 snapshot
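A gate like this can be evaluated per write. The following is a hedged sketch: the `decide_write` function, the gate dictionary layout, and the dotted version format are assumptions based on the example values above.

```python
def parse_version(text):
    """Parse '4.8' into (4, 8) so that 4.10 compares as newer than 4.8."""
    return tuple(int(part) for part in text.split("."))

def decide_write(gate, client_version, field):
    """Block only the field whose semantics changed; leave unrelated fields alone."""
    if field != "tags":
        return "allow"
    if gate["tag_model"] != "observed_remove":
        return "allow"
    if parse_version(client_version) >= parse_version(gate["minimum_client"]):
        return "allow"
    return "reject: client must upgrade and resync from the tags-v2 snapshot"

gate_p7 = {
    "tag_model": "observed_remove",
    "epoch": "tags-v2",
    "minimum_client": "4.8",
}
print(decide_write(gate_p7, "4.7", "notes"))
print(decide_write(gate_p7, "4.7", "tags"))
print(decide_write(gate_p7, "4.10", "tags"))
```

Comparing versions as integer tuples avoids the string-comparison trap where "4.10" sorts before "4.8".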
This is a coordination boundary, but a targeted one. You are not coordinating every note or comment. You are coordinating the moment when an object starts using new semantics.
Worked Example: Migrating Tags To Observed-Remove
Start with old add-only tags:
old project:
  tags = {"urgent"}
The product needs removal:
user action:
  remove "urgent"
The target CRDT is:
adds:
  urgent -> {dot}
removes:
  {dot}
visible:
  urgent is visible if any add dot is not removed
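The target structure can be sketched as a small state-based type. This is a minimal illustration of the adds/removes/visible rule above, with assumed names (`TagsV2`, `add`, `remove`, `merge`); it keeps removed dots forever and does no compaction.

```python
from dataclasses import dataclass, field

@dataclass
class TagsV2:
    adds: dict = field(default_factory=dict)   # tag -> set of add dots
    removes: set = field(default_factory=set)  # dots that have been removed

    def add(self, tag, dot):
        self.adds.setdefault(tag, set()).add(dot)

    def remove(self, tag):
        # Remove only the add dots this replica has observed for the tag.
        self.removes |= self.adds.get(tag, set())

    def visible(self):
        return {tag for tag, dots in self.adds.items() if dots - self.removes}

    def merge(self, other):
        # Union of adds and union of removes: commutative, associative, idempotent.
        for tag, dots in other.adds.items():
            self.adds.setdefault(tag, set()).update(dots)
        self.removes |= other.removes

phone, tablet = TagsV2(), TagsV2()
phone.add("urgent", "phone:1")
tablet.merge(phone)
tablet.remove("urgent")             # tablet observed only phone:1
phone.add("urgent", "tablet:9")     # concurrent add with a dot tablet never saw
phone.merge(tablet)
tablet.merge(phone)
print(phone.visible(), tablet.visible())
```

Because the remove names only the dots it observed, the concurrent add survives on both replicas after they exchange state.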
A safe migration can use these phases.
Phase 1: readers learn the new shape.
new reader:
  if tags_v2 exists:
    render tags_v2
  else:
    render old tags
No one writes tags_v2 yet.
Phase 2: backfill stable dots.
for each project P:
  for each old tag T:
    create dot migration:P:T
    tags_v2.adds[T].add(migration:P:T)
The migration dot is deterministic. Running the backfill twice produces the same dot, not a new add.
Phase 3: compare old and new views.
old visible:
  {"urgent"}
new visible:
  {"urgent"}
if different:
  stop rollout and repair before allowing writes
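Phases 2 and 3 can be sketched together: a backfill that derives each dot from the project and tag, followed by a view comparison that reports any difference. The function names and the shape of the report are assumptions for illustration.

```python
def backfill(project_id, old_tags, adds):
    """Idempotent backfill: the dot depends only on (project, tag)."""
    for tag in sorted(old_tags):
        adds.setdefault(tag, set()).add(f"migration:{project_id}:{tag}")

def visible(adds, removes):
    return {tag for tag, dots in adds.items() if dots - removes}

def compare_views(old_tags, adds, removes):
    """Return the tags that differ between the old view and the new view."""
    new_tags = visible(adds, removes)
    return {
        "missing_in_new": set(old_tags) - new_tags,
        "extra_in_new": new_tags - set(old_tags),
    }

old_tags = {"urgent", "vip"}
adds, removes = {}, set()
backfill("P7", old_tags, adds)
backfill("P7", old_tags, adds)  # a second run must not create new dots

report = compare_views(old_tags, adds, removes)
if report["missing_in_new"] or report["extra_in_new"]:
    print("stop rollout and repair:", report)
else:
    print("views match; project P7 may proceed to the write phase")
```

Running the backfill twice leaves exactly one dot per tag, which is what makes retries and partial reruns safe.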
Phase 4: new writes use tags_v2.
new remove:
  removes.add(migration:P:urgent)
Old clients must not keep writing the old tags field for objects in the tags-v2 epoch. If they do, they can resurrect removed tags. The system can reject old tag writes:
rejection:
  old tag write rejected
  reason: project P is in tags-v2 epoch
  action: client must upgrade and resync
Phase 5: compact old state.
safe to remove old tags only when:
  all active writers use tags_v2
  old clients are blocked or forced to resync
  repair tools read tags_v2
  snapshots include tags_v2
  observability dashboards expose tags_v2 conflicts
The worked example shows why "add new field and backfill" is not enough. The backfill needs stable identity, old writers need a policy, and compaction needs a rejoin boundary.
Rolling Upgrades And Epochs
Some migrations change only representation. Others change authority or safety semantics. Those need epochs.
epoch tags-v1:
  old add-only tags
epoch tags-v2:
  observed-remove tags
  old tag writes rejected
  old clients must resync
An epoch says, "from this boundary onward, operations are interpreted under this rule." It gives validators and repair tools something precise to check:
operation phone:88
  object: project:P7
  epoch: tags-v1
  effect: tags.add("urgent")
current object epoch:
  tags-v2
decision:
  reject or translate only if safe
Not every field needs a hard epoch. A backward-compatible metadata field may roll out without blocking old clients:
new optional field:
  display_color = blue
old clients:
  ignore color
  still preserve source facts
But a field that changes remove semantics, permission semantics, inventory rights, or compaction safety should get an explicit boundary.
The trade-off is user experience. Blocking old clients can be painful. Allowing them to corrupt new semantics is worse. Good rollout plans make the block narrow and explainable:
you can still add notes
you cannot edit tags until this client upgrades
Compatibility Tests
The previous lesson tested generated histories. Migration tests should generate mixed-version histories.
replicas:
  old client v1
  new client v2
  edge v2
  region v2
events:
  old client writes old tag
  backfill creates migration dot
  new client removes tag
  old client reconnects with stale old state
  edge rejects old tag write in tags-v2 epoch
  all replicas converge
Useful properties:
stable translation:
  translating the same old fact twice produces the same new identity
no silent semantic drop:
  a client that cannot preserve new semantics must not apply only part of the effect
mixed-version convergence:
  all supported versions converge after allowed delivery
epoch enforcement:
  old operations after a boundary are rejected, translated, or quarantined by policy
rollback safety:
  rolling back code does not create writers that corrupt already-migrated objects
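A mixed-version history test can be sketched by delivering the same operations in every order and checking that each order yields the same outcome. This is a small exhaustive version of the idea, assuming a hypothetical operation format and an edge that rejects version 1 tag writes once the object is in the tags-v2 epoch; a real test suite would more likely generate random histories.

```python
from itertools import permutations

def apply_op(state, op):
    adds, removes = state
    if op["kind"] == "add":
        adds.setdefault(op["tag"], set()).add(op["dot"])
    elif op["kind"] == "remove":
        removes.update(op["dots"])

def deliver(ops, epoch="tags-v2"):
    """Apply ops in the given order; v1 tag writes are rejected in tags-v2."""
    state, rejected = ({}, set()), []
    for op in ops:
        if epoch == "tags-v2" and op["schema_version"] == 1:
            rejected.append(op["op_id"])
            continue
        apply_op(state, op)
    return state, rejected

def visible(state):
    adds, removes = state
    return {tag for tag, dots in adds.items() if dots - removes}

HISTORY = [
    {"op_id": "migration:P7:urgent", "schema_version": 2, "kind": "add",
     "tag": "urgent", "dot": "migration:P7:urgent"},           # backfill
    {"op_id": "tablet:9", "schema_version": 2, "kind": "remove",
     "dots": {"migration:P7:urgent"}},                         # new client removes
    {"op_id": "phone:102", "schema_version": 2, "kind": "add",
     "tag": "vip", "dot": "phone:102"},                        # new client adds
    {"op_id": "phone:77", "schema_version": 1, "kind": "add",
     "tag": "urgent"},                                         # stale old client
]

def outcomes(history):
    """Collect the distinct (visible tags, rejected ops) pairs over all orders."""
    seen = set()
    for order in permutations(history):
        state, rejected = deliver(order)
        seen.add((frozenset(visible(state)), tuple(sorted(rejected))))
    return seen

result = outcomes(HISTORY)
assert len(result) == 1, "delivery order changed the outcome"
print(result)
```

The check covers two of the properties at once: every delivery order converges to the same visible tags, and the stale version 1 write is rejected in every order instead of resurrecting the removed tag.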
Rollback deserves special attention. If a deployment rolls back from v2 code to v1 code after some objects entered tags-v2, the old code may be unable to interpret current data.
Plan rollback like this:
before enabling v2 writes:
  old code can safely read or refuse v2 objects
after enabling v2 writes:
  rollback must preserve validators that reject unsafe old writes
  or rollback must be blocked until data is returned to a safe state
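The plan above can be turned into a pre-rollback check. The sketch below is only an illustration: the `rollback_plan` function and the capability flags on the old build are hypothetical names, not part of any real deployment tool.

```python
def rollback_plan(object_epochs, old_build):
    """Decide whether redeploying the old build is safe for the current data.

    object_epochs: object id -> current epoch, e.g. {"P7": "tags-v2"}
    old_build:     capability flags of the build being rolled back to (assumed)
    """
    migrated = sorted(obj for obj, epoch in object_epochs.items() if epoch == "tags-v2")
    if not migrated:
        return "safe", []
    keeps_validators = old_build.get("rejects_old_tag_writes_in_v2", False)
    handles_v2_reads = old_build.get("reads_or_refuses_v2", False)
    if keeps_validators and handles_v2_reads:
        return "safe-with-validators", migrated
    return "blocked", migrated

epochs = {"P1": "tags-v1", "P7": "tags-v2", "P9": "tags-v2"}
print(rollback_plan(epochs, {"reads_or_refuses_v2": True}))
print(rollback_plan(epochs, {"reads_or_refuses_v2": True,
                             "rejects_old_tag_writes_in_v2": True}))
```

The list of migrated objects in the result tells the operator which projects would be exposed if the rollback went ahead without validators.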
Rollback is not only redeploying a binary. It is returning to a previous protocol, and the data may already have moved forward.
Failure Modes
- Silent field dropping: Old clients ignore metadata that changes merge meaning, such as dots, tombstones, epochs, or authorization caveats.
- Unstable backfill identities: Re-running migration creates fresh dots, duplicate adds, or new conflicts.
- Old writers after new semantics: A stale client writes old-format state and resurrects deleted or rejected facts.
- Compacting too early: Old clients or repair tools still need metadata that the new code removed.
- Assuming server rollout equals system rollout: Offline clients and edge replicas may remain old long after servers deploy.
- No rollback plan: Once new operations exist, old code may not be safe to run without validators or adapters.
- Testing only all-old and all-new: The dangerous cases are mixed-version histories.
- Treating compatibility as a UI concern: The merge protocol, validators, snapshots, and repair tools all need compatibility behavior.
Practice
Design a migration for this change:
old field:
  assignees = add-only set of user_id
new field:
  assignees = observed-remove set with dots
Fill in:
1. How will you create deterministic migration dots?
2. Which clients may read old and new state?
3. When may new clients write the new operation shape?
4. What should happen when an old client writes assignees after the object enters v2?
5. Which dashboards or debug records should show rejected old writes?
6. When is it safe to compact the old field?
Then add a rollback scenario:
v2 writes have been enabled for 10 percent of projects.
The team discovers a bug and wants to roll back code.
Explain whether rollback is safe, which validators must remain active, and what data or epochs prevent old code from corrupting migrated projects.
Connections
- 009.md introduced maps, nested CRDTs, and schema evolution; this lesson makes schema evolution operational.
- 015.md covered compaction; migration needs the same discipline before deleting old metadata.
- 020.md showed rejected operations and repair records; compatibility failures must be observable in that same language.
- 021.md introduced trust epochs; migration epochs use a similar boundary for semantic changes.
- 023.md turns these pieces into a design review checklist for deciding where coordination avoidance is appropriate.
Resources
- [ARTICLE] Evolutionary Database Design
  - Focus: Use the expand, migrate, contract mindset, then adapt it to replicated clients and merge semantics.
- [DOC] Protocol Buffers: Updating A Message Type
  - Focus: Study wire compatibility rules and compare them with CRDT semantic compatibility.
- [DOC] Confluent Schema Registry: Schema Evolution
  - Focus: Review compatibility modes and the idea of safe reader/writer evolution.
- [DOC] Automerge: The Local-First Database
  - Focus: Connect document sync and change histories to long-lived client compatibility.
- [PAPER] A comprehensive study of Convergent and Commutative Replicated Data Types
  - Focus: Recheck how dots, tombstones, and merge semantics affect migration safety.
Key Takeaways
- CRDT migrations change both data shape and merge semantics, so compatibility must cover old clients, edge replicas, snapshots, and repair tools.
- Semantic fields such as dots, tombstones, epochs, and authorization caveats must not be silently dropped by old code.
- Safe rolling upgrades use phases: read-new, backfill with stable identity, compare views, gate new writes, and compact only after stability.
- Mixed-version histories and rollback paths are part of the design, not an afterthought.