CRDTs and Coordination Avoidance: Security, Trust Boundaries, and Malicious Updates
LESSON
Core Insight
Imagine an offline-first field app where a technician's tablet can add notes, attach photos, and mark checklist items done. That is exactly the kind of local work CRDTs handle well. Now imagine the tablet is stolen and someone submits another operation: "grant this device admin access" or "close the safety incident and delete the warning tag."
The operation may be perfectly mergeable. It may have a stable ID, causal context, and a deterministic effect. If every replica accepts it, every replica may converge on the same unsafe state. Convergence answers "will replicas agree?" Security asks a different question: "should this operation be allowed into the replicated history at all?"
This lesson separates merge validity from trust validity. A CRDT merge function can handle duplicate delivery, reordering, and offline edits. It does not automatically authenticate the actor, authorize the field being changed, limit metadata growth, prevent replay, or decide whether a device is still trusted after revocation.
The design move is to put an admission boundary before merge and an audit-friendly repair path after merge. Highly available operations can stay local only when the actor holds the right to perform them and the operation is safe for the object's current epoch. Sensitive decisions need an authority, a capability, a remove-wins policy, or a new epoch that invalidates old updates.
Mergeable Does Not Mean Admissible
Use a simple project object:
project P7:
  notes: OR-map note_id -> note body
  tags: OR-set tag
  admins: set of user_id
  status: workflow register
This local note operation is mergeable:
operation phone:42
  actor: technician-9
  effect:
    notes.add(note:N8, "Valve replaced")
If technician-9 is allowed to edit notes on P7, the operation can be accepted locally and synced later.
Now compare an admin grant:
operation phone:43
  actor: technician-9
  effect:
    admins.add(user:mallory)
An add to a set is easy to merge. That does not make it safe. Permission changes decide who can make future updates. If ordinary clients can add admins through an add-wins set, a compromised device can grant lasting authority while offline.
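To make the risk concrete, here is a minimal Python sketch of an observed-remove (add-wins) set. It is not a production CRDT: the `ORSet` class, the tag strings, and the `user:alice` entry are illustrative assumptions. It shows a revoked grant reappearing when an offline device re-adds it, with both replicas converging on the same unsafe membership.

```python
from dataclasses import dataclass, field

@dataclass
class ORSet:
    """Minimal add-wins set: each add carries a unique tag; remove tombstones observed tags."""
    adds: dict = field(default_factory=dict)    # element -> set of add tags
    removed: set = field(default_factory=set)   # tombstoned add tags

    def add(self, element, tag):
        self.adds.setdefault(element, set()).add(tag)

    def remove(self, element):
        # Only tags this replica has already observed can be tombstoned.
        self.removed |= self.adds.get(element, set())

    def merge(self, other):
        for element, tags in other.adds.items():
            self.adds.setdefault(element, set()).update(tags)
        self.removed |= other.removed

    def members(self):
        return {e for e, tags in self.adds.items() if tags - self.removed}

authority, tablet = ORSet(), ORSet()
authority.add("user:alice", "region:1")
tablet.merge(authority)

tablet.add("user:mallory", "tablet:1")   # first malicious grant
authority.merge(tablet)
authority.remove("user:mallory")         # authority revokes what it has seen

tablet.add("user:mallory", "tablet:2")   # stolen tablet re-adds while offline
authority.merge(tablet)
tablet.merge(authority)

print(authority.members() == tablet.members())  # replicas agree
print("user:mallory" in authority.members())    # and the grant is back
```

The merge is doing exactly what it promises. The problem is that nothing asked whether the tablet was entitled to emit either add.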
A replica should ask before merge:
1. Is the actor authentic?
2. Is the operation well formed?
3. Is the actor authorized for this object, field, and epoch?
4. Does the operation consume a right or require an authority?
5. Does the operation obey size, schema, and rate limits?
6. If accepted, can later replicas explain why it was accepted?
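If any of these checks fails, the operation is not admitted into the source state. Question 4 is a routing check rather than a pass/fail one: an operation that needs an authority is forwarded instead of being decided locally. The system may still record a rejection event for debugging and audit: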
rejection:
  rejected_op: phone:43
  reason: actor lacks admin-grant right for project P7
  rejected_by: madrid-edge
That rejection is a fact. The unauthorized admin grant is not.
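The admission boundary can be sketched as a small gate that runs before any merge. This is a simplified illustration under stated assumptions: the `Op` fields, the `RIGHTS` table, and `MAX_PAYLOAD_BYTES` are hypothetical, and `signature_ok` stands in for the result of a real signature check.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    op_id: str
    actor: str
    obj: str
    effect: str          # e.g. "notes.add" or "admins.add"
    signature_ok: bool   # assumed result of a signature check done elsewhere
    payload_bytes: int

# Hypothetical rights table: (actor, object) -> effects the actor may perform.
RIGHTS = {("technician-9", "project:P7"): {"notes.add"}}
MAX_PAYLOAD_BYTES = 4096  # illustrative limit

def admit(op, replica_name):
    """Return (admitted, rejection_record). Runs before any CRDT merge."""
    if not op.signature_ok:
        reason = "signature does not verify"
    elif op.effect not in RIGHTS.get((op.actor, op.obj), set()):
        reason = f"actor lacks {op.effect} right for {op.obj}"
    elif op.payload_bytes > MAX_PAYLOAD_BYTES:
        reason = "payload exceeds size limit"
    else:
        return True, None
    return False, {"rejected_op": op.op_id, "reason": reason, "rejected_by": replica_name}

note = Op("phone:42", "technician-9", "project:P7", "notes.add", True, 14)
grant = Op("phone:43", "technician-9", "project:P7", "admins.add", True, 12)

source_state, rejections = [], []
for op in (note, grant):
    admitted, record = admit(op, "madrid-edge")
    if admitted:
        source_state.append(op.op_id)   # only admitted operations reach merge
    else:
        rejections.append(record)       # the rejection is kept as its own fact
```

Because the checks are deterministic functions of the operation and the policy, any replica holding the same policy reaches the same decision without coordinating.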
Operation Envelopes
Security starts with the operation envelope. The envelope is the metadata around the CRDT effect that lets a replica validate who sent it and what authority it claims.
operation envelope:
  op_id: phone:42
  actor_id: technician-9
  actor_key_id: key:tech9:2026-06
  object: project:P7
  field: notes
  effect: add note N8
  dependency_context: {region:144, phone:41}
  epoch: project:P7:epoch:12
  capability: can_edit_notes until 2026-06-08T18:00Z
  signature: sign(envelope_without_signature)
Authentication checks that the operation really came from the claimed actor or device:
verify signature with actor_key_id
Authorization checks that the actor can perform this effect:
technician-9 can edit notes on project:P7 in epoch 12
technician-9 cannot grant admins
technician-9 cannot close incident after right expired
Schema validation checks that the operation is shaped safely:
note body length <= limit
attachment count <= limit
known field names only
no unknown operation kind applied silently
The signature should cover every envelope field except the signature itself. If a malicious peer can change the object ID, field name, or dependency context after signing, the signature proves very little.
The trade-off is extra metadata and verification work. In return, every replica can reject the same invalid operation for the same reason without coordinating on every harmless edit.
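A minimal sketch of signing the whole envelope, using only the Python standard library. One loud assumption: HMAC with a shared key stands in here for an asymmetric signature scheme. A real deployment would more likely use per-device public keys so that verifiers cannot forge operations. The helper names and the key value are hypothetical.

```python
import hashlib
import hmac
import json

def canonical_bytes(envelope):
    """Serialize every field except the signature in a stable, sorted form."""
    body = {k: v for k, v in envelope.items() if k != "signature"}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")

def sign(envelope, key):
    # Assumption: HMAC-SHA256 as a stand-in for a real asymmetric signature.
    return hmac.new(key, canonical_bytes(envelope), hashlib.sha256).hexdigest()

def verify(envelope, key):
    expected = sign(envelope, key)
    return hmac.compare_digest(expected, envelope.get("signature", ""))

device_key = b"demo-key-for-key:tech9:2026-06"  # illustrative only

envelope = {
    "op_id": "phone:42",
    "actor_id": "technician-9",
    "actor_key_id": "key:tech9:2026-06",
    "object": "project:P7",
    "field": "notes",
    "effect": "add note N8",
    "dependency_context": {"region": 144, "phone": 41},
    "epoch": "project:P7:epoch:12",
    "capability": "can_edit_notes until 2026-06-08T18:00Z",
}
envelope["signature"] = sign(envelope, device_key)

# A peer that rewrites the target field after signing invalidates the signature.
tampered = dict(envelope, field="admins")

print(verify(envelope, device_key))   # original envelope verifies
print(verify(tampered, device_key))   # tampered envelope does not
```

Sorting keys before serializing matters: two replicas must derive the same bytes from the same envelope, or valid operations will fail verification.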
Trust Boundaries In A Replicated Topology
Different replicas deserve different trust.
client device:
  can create local edits for a logged-in user
  may be offline or compromised
edge replica:
  can validate common operations near the user
  may not hold every global secret or policy
regional authority:
  decides rights, unique claims, sensitive status transitions
audit store:
  preserves accepted, rejected, and repaired security-relevant facts
Do not design the merge layer as if every peer is equally honest. A client can be a useful replica and still be untrusted for sensitive decisions.
A practical policy might be:
client may emit:
  add note
  edit own draft comment
  attach photo within quota
  mark checklist observation
edge may accept:
  operations whose capabilities are verifiable locally
  operations within object, size, and rate limits
authority must decide:
  grant admin
  close regulated incident
  allocate unique audit number
  revoke device or key
  change access policy
This keeps the fast path available without turning every replica into an omnipotent authority.
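One way to encode such a policy is a routing table keyed by effect. The sketch below is an assumption-laden illustration: the `Tier` enum, the effect names, and the returned strings are all hypothetical labels, not part of any library.

```python
from enum import IntEnum

class Tier(IntEnum):
    CLIENT = 0
    EDGE = 1
    AUTHORITY = 2

# Hypothetical effect names derived from the policy lists above.
AUTHORITY_ONLY = {
    "admins.grant", "incident.close", "audit_number.allocate",
    "key.revoke", "policy.change",
}
EDGE_ACCEPTABLE = {
    "notes.add", "comment.edit_own_draft", "photo.attach", "checklist.mark",
}

def route(effect, decider):
    """Say what a replica at the given tier should do with an effect."""
    if effect in AUTHORITY_ONLY:
        return "decide" if decider is Tier.AUTHORITY else "forward_to_authority"
    if effect in EDGE_ACCEPTABLE:
        # A client may emit the operation, but admission happens at edge or above.
        return "decide" if decider >= Tier.EDGE else "emit_and_sync"
    # Unknown effects are never applied silently.
    return "reject_unknown"

print(route("notes.add", Tier.EDGE))
print(route("admins.grant", Tier.EDGE))
print(route("notes.add", Tier.CLIENT))
```

Defaulting unknown effects to rejection keeps the table fail-closed: a new sensitive operation has to be classified before any replica will act on it.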
When authority changes, use epochs. An epoch is a named version of the trust boundary:
project:P7 epoch 12:
  key:tech9 allowed to edit notes
project:P7 epoch 13:
  key:tech9 revoked
An operation signed in epoch 12 should not automatically apply in epoch 13:
operation phone:44
  epoch: 12
  effect: add sensitive attachment
current epoch:
  13
decision:
  reject, quarantine, or route to authority
Epochs are coordination boundaries. They make revocation understandable. Without them, old offline operations can keep arriving with rights that no longer exist.
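The epoch comparison can be sketched as a small decision function. The function name, the integer epochs, and the `reviewable_effects` set are assumptions for illustration. Buffering operations from a not-yet-seen epoch is also an assumption of this sketch rather than something the policy above prescribes.

```python
def epoch_decision(op_epoch, current_epoch, effect,
                   reviewable_effects=frozenset({"notes.add"})):
    """Decide what to do with an operation based only on its epoch."""
    if op_epoch == current_epoch:
        return "continue"      # go on to the remaining admission checks
    if op_epoch > current_epoch:
        return "buffer"        # this replica has not learned the new epoch yet
    if effect in reviewable_effects:
        return "quarantine"    # stale but low-risk: hold for review
    return "reject"            # stale and sensitive

print(epoch_decision(12, 13, "attachment.add_sensitive"))
print(epoch_decision(12, 13, "notes.add"))
print(epoch_decision(13, 13, "notes.add"))
```

Routing stale operations to an authority instead of rejecting them is an equally valid choice. What matters is that the stale case has a named outcome rather than falling through to merge.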
Worked Example: Malicious Close Request
Consider a safety incident:
incident I9:
  notes: OR-map
  warning_tags: OR-set
  status: workflow register
The technician has a capability to add notes:
capability:
  actor: technician-9
  object: incident:I9
  allows: notes.add
  expires: 18:00
This operation is accepted:
operation phone:50
  effect: notes.add("Gas valve inspected")
  capability: notes.add
  signature: valid
Now the stolen device emits:
operation phone:51
  effect:
    status.write(closed)
    warning_tags.remove("gas-risk")
The CRDT fields can merge those effects. The security policy should reject them:
validation:
  signature valid
  actor authenticated
  status.close right missing
  warning tag removal requires safety authority
decision:
  reject phone:51
  emit rejection event
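The capability check in this example can be sketched as follows. The `Capability` dataclass, the `missing_rights` helper, and the calendar date attached to the 18:00 expiry are illustrative assumptions. The sketch treats an operation as admissible only when every one of its effects is covered.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Capability:
    actor: str
    obj: str
    allows: frozenset
    expires: datetime

def missing_rights(actor, obj, effects, cap, now):
    """Return the effects the capability does not cover (empty list: all covered)."""
    applies = cap.actor == actor and cap.obj == obj and now < cap.expires
    return sorted(e for e in effects if not (applies and e in cap.allows))

cap = Capability(
    actor="technician-9",
    obj="incident:I9",
    allows=frozenset({"notes.add"}),
    expires=datetime(2026, 6, 8, 18, 0, tzinfo=timezone.utc),  # illustrative date
)
before_expiry = datetime(2026, 6, 8, 12, 0, tzinfo=timezone.utc)
after_expiry = datetime(2026, 6, 8, 19, 0, tzinfo=timezone.utc)

phone_50 = missing_rights("technician-9", "incident:I9",
                          ["notes.add"], cap, before_expiry)
phone_51 = missing_rights("technician-9", "incident:I9",
                          ["status.write", "warning_tags.remove"], cap, before_expiry)
late_note = missing_rights("technician-9", "incident:I9",
                           ["notes.add"], cap, after_expiry)

print(phone_50)   # nothing missing: admit
print(phone_51)   # both effects uncovered: reject the whole operation
print(late_note)  # capability expired: the note is no longer covered
```

Returning the list of missing rights, rather than a bare boolean, gives the rejection event a reason that other replicas and auditors can read.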
A nearby edge can handle the first line of defense:
edge:
  verify signature
  check local capability
  reject missing right
  sync rejection to other replicas
The regional authority may also emit a revocation:
operation region:900
  revoke key:tech9:device-tablet
  new epoch: incident:I9:epoch:14
  reason: stolen device
From epoch 14 onward, old device operations are not automatically trusted. If a late offline note from epoch 13 arrives, the system can quarantine it:
late operation phone:49
  epoch: 13
  effect: notes.add("Photo before shutdown")
decision:
  quarantine for review
  or accept only if policy allows pre-revocation notes
That policy decision is explicit. The CRDT does not hide it inside merge.
Attacks To Test
Security testing for CRDT systems should generate hostile histories, not only random network histories.
forged actor:
  operation claims actor_id admin-1 but signature does not match
replay:
  old valid operation arrives after revocation or epoch change
permission escalation:
  ordinary actor tries to add admin or grant capability
semantic abuse:
  add-wins permission set makes a revoked grant reappear
metadata denial:
  operation creates huge dots, many tombstones, or deep nested state
dependency lie:
  operation claims to have seen a prerequisite it never actually had
schema downgrade:
  old client drops unknown security field and applies effect anyway
derived-view poisoning:
  attacker sends update that corrupts search, feed, or counter if treated as source truth
Each attack should have an expected decision:
accept:
  valid local note within capability
reject:
  invalid signature or missing right
buffer:
  dependency missing but operation may become valid
quarantine:
  operation signed before revocation but arriving after sensitive epoch change
repair:
  previously accepted operation later superseded by authority
The key is to test the admission path and the merge path together. A merge test alone may happily converge on a malicious update. An authorization test alone may miss duplicate delivery, stale epochs, and repair replay.
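A combined test might look like the sketch below. It replays one hostile history in every delivery order, with a duplicate and two inadmissible operations mixed in, and checks that all orders produce the same admitted state and the same rejections. The `ALLOWED` table, the `evil:1` operation, and the grow-only note set are simplifying assumptions, not a full CRDT.

```python
from itertools import permutations

# Hypothetical policy: which (actor, effect) pairs are admissible.
ALLOWED = {("technician-9", "notes.add")}

def apply_history(history):
    """Admit-then-merge over a delivery order; returns (notes, rejected_ids)."""
    notes, rejected = set(), set()
    for op_id, actor, effect, value in history:
        if (actor, effect) not in ALLOWED:
            rejected.add(op_id)       # admission path: never reaches merge
            continue
        notes.add((op_id, value))     # merge path: set union is idempotent
    return frozenset(notes), frozenset(rejected)

history = [
    ("phone:50", "technician-9", "notes.add", "Gas valve inspected"),
    ("phone:51", "technician-9", "status.write", "closed"),            # missing right
    ("phone:50", "technician-9", "notes.add", "Gas valve inspected"),  # duplicate
    ("evil:1", "mallory", "notes.add", "spam"),                        # unknown actor
]

# Every delivery order must yield the same admitted state and rejections.
outcomes = {apply_history(order) for order in permutations(history)}
notes, rejected = next(iter(outcomes))

print(len(outcomes))     # one outcome across all orders
print(sorted(rejected))  # the two hostile operations
```

Exhaustive permutation only scales to tiny histories. For larger ones, a randomized or property-based generator would be the usual substitute.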
Failure Modes
- Treating signatures as authorization: A valid signature proves which key signed an operation, not that the actor may change that field.
- Using add-wins sets for permissions: A stale or malicious permission grant can survive revocation unless policy and epochs are explicit.
- Accepting unknown operation kinds: Old clients should not silently apply effects after dropping security metadata they do not understand.
- Letting clients decide global rights: Offline clients can hold delegated rights, but they should not mint sensitive authority for themselves.
- Forgetting replay: An operation that was valid yesterday may be invalid after revocation, expiry, or epoch change.
- Merging before validation: Once an invalid update becomes source state, repair becomes harder and audit trails get muddy.
- Ignoring resource exhaustion: Malicious updates can attack metadata, tombstones, indexes, and sync bandwidth.
- Dropping rejection evidence: Operators need to know whether an operation was missing, rejected, quarantined, or repaired.
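For the resource-exhaustion failure mode, admission can include cheap structural limits. The following sketch checks nesting depth and dependency-context size. The function names and the limit values are illustrative assumptions. The depth check is iterative so that a deeply nested hostile payload cannot exhaust the validator's own recursion stack.

```python
def exceeds_depth(value, max_depth):
    """True if dicts/lists in value nest deeper than max_depth (top level is depth 1)."""
    stack = [(value, 1)]
    while stack:
        node, depth = stack.pop()
        if isinstance(node, dict):
            children = node.values()
        elif isinstance(node, list):
            children = node
        else:
            continue               # scalars add no depth
        if depth > max_depth:
            return True            # stop early instead of walking the whole payload
        stack.extend((child, depth + 1) for child in children)
    return False

def within_limits(op, max_depth=4, max_deps=64):
    """Structural limits checked before an operation is admitted."""
    if len(op.get("dependency_context", {})) > max_deps:
        return False
    return not exceeds_depth(op.get("effect"), max_depth)

normal_op = {
    "dependency_context": {"region": 144, "phone": 41},
    "effect": {"notes.add": {"id": "N8", "body": "Valve replaced"}},
}

# Hostile payload: a dict nested 1000 levels deep.
nested = {}
for _ in range(1000):
    nested = {"x": nested}
hostile_op = {"dependency_context": {}, "effect": nested}

print(within_limits(normal_op))
print(within_limits(hostile_op))
```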
Practice
Design admission checks for this operation:
operation phone:77
  actor: contractor-3
  object: incident:I9
  epoch: 12
  effect:
    warning_tags.remove("gas-risk")
    status.write(closed)
Fill in:
1. What must be authenticated?
2. What field-level authorization is required?
3. Which part can an edge decide locally?
4. Which part must route to an authority?
5. What should be logged if the operation is rejected?
6. What should happen if the same operation arrives again after epoch 13?
Then change the operation:
effect:
  notes.add("Valve inspected")
Explain which validation checks remain, and why this second operation may be accepted locally when the close request cannot.
Connections
- 011.md introduced rights transfer; security checks decide whether an actor actually holds the right it spends.
- 018.md treated offline clients and edges as replicas; this lesson marks which of those replicas are trusted for which decisions.
- 020.md showed rejected operations and repairs; malicious updates need the same visibility plus stronger audit trails.
- 022.md continues with rolling upgrades, where old clients must not accidentally bypass new security fields.
Resources
- [PAPER] Zanzibar: Google's Consistent, Global Authorization System
  - Focus: Study relationship-based authorization and why some permission decisions need a clear authority.
- [PAPER] Macaroons: Cookies with Contextual Caveats for Decentralized Authorization in the Cloud
  - Focus: Look at delegated capabilities, caveats, and bounded authority.
- [RFC] JSON Web Signature (JWS)
  - Focus: Connect signed operation envelopes to authenticity and tamper detection.
- [GUIDE] OWASP API Security Top 10 2023: Broken Object Level Authorization
  - Focus: Relate object-level authorization failures to per-object replicated operations.
- [PAPER] A comprehensive study of Convergent and Commutative Replicated Data Types
  - Focus: Revisit CRDT assumptions and notice where trust and authorization sit outside pure merge laws.
Key Takeaways
- A CRDT operation can be mergeable and still be inadmissible because the actor lacks authority.
- Replicas should validate signatures, object rights, epochs, schema, and resource limits before admitting an update into source state.
- Offline clients and edges can remain useful replicas without being trusted to decide every sensitive fact.
- Security incidents should be handled with explicit rejection, quarantine, revocation, or repair operations so the whole system can explain the outcome.