CRDTs and Coordination Avoidance: Security, Trust Boundaries, and Malicious Updates

Lesson 021, 30 min, intermediate

Core Insight

Imagine an offline-first field app where a technician's tablet can add notes, attach photos, and mark checklist items done. That is exactly the kind of local work CRDTs handle well. Now imagine the tablet is stolen and someone submits another operation: "grant this device admin access" or "close the safety incident and delete the warning tag."

The operation may be perfectly mergeable. It may have a stable ID, causal context, and a deterministic effect. If every replica accepts it, every replica may converge on the same unsafe state. Convergence answers "will replicas agree?" Security asks a different question: "should this operation be allowed into the replicated history at all?"

This lesson separates merge validity from trust validity. A CRDT merge function can handle duplicate delivery, reordering, and offline edits. It does not automatically authenticate the actor, authorize the field being changed, limit metadata growth, prevent replay, or decide whether a device is still trusted after revocation.

The design move is to put an admission boundary before merge and an audit-friendly repair path after merge. Highly available operations can stay local only when the actor holds the right to perform them and the operation is safe for the object's current epoch. Sensitive decisions need an authority, a capability, a remove-wins policy, or a new epoch that invalidates old updates.

Mergeable Does Not Mean Admissible

Use a simple project object:

project P7:
  notes: OR-map note_id -> note body
  tags: OR-set tag
  admins: set of user_id
  status: workflow register

This local note operation is mergeable:

operation phone:42
  actor: technician-9
  effect:
    notes.add(note:N8, "Valve replaced")

If technician-9 is allowed to edit notes on P7, the operation can be accepted locally and synced later.

Now compare an admin grant:

operation phone:43
  actor: technician-9
  effect:
    admins.add(user:mallory)

An add to a set is easy to merge. That does not make it safe. Permission changes decide who can make future updates. If ordinary clients can add admins through an add-wins set, a compromised device can grant lasting authority while offline.
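To make the risk concrete, here is a minimal sketch of an observed-remove set with add-wins behavior. The `ORSet` class and the tag names `phone:43` and `phone:44` are illustrative assumptions, not a production CRDT. It shows a revocation that only covers the grant it has observed, so a concurrent offline re-grant survives the merge.

```python
class ORSet:
    """Toy observed-remove set: every add carries a unique tag."""

    def __init__(self):
        self.adds = set()      # (element, tag) pairs ever added
        self.removed = set()   # (element, tag) pairs tombstoned

    def add(self, element, tag):
        self.adds.add((element, tag))

    def remove(self, element):
        # A remove tombstones only the tags this replica has observed.
        self.removed |= {(e, t) for (e, t) in self.adds if e == element}

    def merge(self, other):
        self.adds |= other.adds
        self.removed |= other.removed

    def elements(self):
        return {e for (e, _tag) in self.adds - self.removed}


authority, tablet = ORSet(), ORSet()

tablet.add("user:mallory", "phone:43")    # malicious grant
authority.merge(tablet)
authority.remove("user:mallory")          # revocation observes only phone:43

tablet.add("user:mallory", "phone:44")    # concurrent re-grant while offline
authority.merge(tablet)
tablet.merge(authority)

print(authority.elements())               # mallory is an admin again
print(authority.elements() == tablet.elements())
```

Both replicas converge, and both converge on a state where the revoked grant is back. The merge is behaving as designed; the problem is that the grant was admitted at all.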

A replica should ask before merge:

1. Is the actor authentic?
2. Is the operation well formed?
3. Is the actor authorized for this object, field, and epoch?
4. Does the operation consume a right or require an authority?
5. Does the operation obey size, schema, and rate limits?
6. If accepted, can later replicas explain why it was accepted?

If any of these checks fails, the operation is not admitted into the source state. The system may still record a rejection event for debugging and audit:

rejection:
  rejected_op: phone:43
  reason: actor lacks admin-grant right for project P7
  rejected_by: madrid-edge

That rejection is a fact. The unauthorized admin grant is not.
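One way to picture the admission boundary is a small gate that runs before any merge. The sketch below is an assumption-heavy illustration: the `Operation` fields, the `RIGHTS` table, and the `signature_ok` flag (a stand-in for real signature verification) are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    op_id: str
    actor: str
    obj: str
    field_name: str
    effect: str
    signature_ok: bool  # stand-in for a real signature check


# Hypothetical rights table: (actor, object, field) -> allowed effects.
RIGHTS = {("technician-9", "project:P7", "notes"): {"add"}}


def admit(op, replica):
    """Decide before merge. Returns (decision, rejection_record_or_None)."""
    if not op.signature_ok:
        reason = "signature does not verify"
    elif op.effect not in RIGHTS.get((op.actor, op.obj, op.field_name), set()):
        reason = f"actor lacks {op.field_name}.{op.effect} right for {op.obj}"
    else:
        return "accept", None
    return "reject", {"rejected_op": op.op_id, "reason": reason, "rejected_by": replica}


source_state, audit_log = [], []
incoming = [
    Operation("phone:42", "technician-9", "project:P7", "notes", "add", True),
    Operation("phone:43", "technician-9", "project:P7", "admins", "add", True),
]
for op in incoming:
    decision, rejection = admit(op, "madrid-edge")
    if decision == "accept":
        source_state.append(op.op_id)   # only admitted ops reach the merge layer
    else:
        audit_log.append(rejection)     # the rejection is recorded as its own fact

print(source_state)
print(audit_log)
```

The admin grant never enters `source_state`, while the audit log keeps a record that it was attempted and why it was refused.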

Operation Envelopes

Security starts with the operation envelope. The envelope is the metadata around the CRDT effect that lets a replica validate who sent it and what authority it claims.

operation envelope:
  op_id: phone:42
  actor_id: technician-9
  actor_key_id: key:tech9:2026-06
  object: project:P7
  field: notes
  effect: add note N8
  dependency_context: {region:144, phone:41}
  epoch: project:P7:epoch:12
  capability: can_edit_notes until 2026-06-08T18:00Z
  signature: sign(envelope_without_signature)

Authentication checks that the operation really came from the claimed actor or device:

verify signature with actor_key_id

Authorization checks that the actor can perform this effect:

technician-9 can edit notes on project:P7 in epoch 12
technician-9 cannot grant admins
technician-9 cannot close incident after right expired

Schema validation checks that the operation is shaped safely:

note body length <= limit
attachment count <= limit
known field names only
no unknown operation kind applied silently
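The schema checks above can be sketched as a function that returns every violation it finds. The limits, field names, and the `schema_errors` helper are assumptions chosen for illustration; a real system would derive them from its own schema.

```python
MAX_NOTE_BYTES = 4096   # assumed limit
MAX_ATTACHMENTS = 10    # assumed limit
KNOWN_FIELDS = {"op_id", "actor_id", "object", "field", "kind", "body", "attachments"}
KNOWN_KINDS = {"notes.add"}


def schema_errors(op: dict) -> list:
    """Return a list of schema violations; an empty list means the shape is safe."""
    errors = []
    unknown = set(op) - KNOWN_FIELDS
    if unknown:
        errors.append(f"unknown field names: {sorted(unknown)}")
    if op.get("kind") not in KNOWN_KINDS:
        # Unknown kinds are refused, never applied silently.
        errors.append(f"unknown operation kind: {op.get('kind')!r}")
    if len(op.get("body", "").encode("utf-8")) > MAX_NOTE_BYTES:
        errors.append("note body exceeds limit")
    if len(op.get("attachments", [])) > MAX_ATTACHMENTS:
        errors.append("attachment count exceeds limit")
    return errors


good = {"op_id": "phone:42", "actor_id": "technician-9", "object": "project:P7",
        "field": "notes", "kind": "notes.add", "body": "Valve replaced"}
bad = {**good, "kind": "admins.add", "debug": True, "body": "x" * 5000}

print(schema_errors(good))
print(schema_errors(bad))
```

Returning all violations rather than stopping at the first keeps rejection records informative without changing the decision.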

The signature should cover the entire envelope, not only the effect. If a malicious peer can change the object ID, field name, epoch, or dependency context after signing, the signature proves very little.

The trade-off is extra metadata and verification work. In return, every replica can reject the same invalid operation for the same reason without coordinating on every harmless edit.
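The sketch below illustrates signing over a canonical form of the whole envelope. It uses HMAC-SHA256 from the Python standard library purely as a stand-in: a real deployment would more likely use an asymmetric signature so that verifiers do not hold the signing key. The helper names `canonical_bytes`, `sign`, and `verify` and the demo key are assumptions.

```python
import hashlib
import hmac
import json


def canonical_bytes(envelope: dict) -> bytes:
    """Serialize every field except the signature in a stable order."""
    unsigned = {k: v for k, v in envelope.items() if k != "signature"}
    return json.dumps(unsigned, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(envelope: dict, key: bytes) -> dict:
    tag = hmac.new(key, canonical_bytes(envelope), hashlib.sha256).hexdigest()
    return {**envelope, "signature": tag}


def verify(envelope: dict, key: bytes) -> bool:
    expected = hmac.new(key, canonical_bytes(envelope), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, envelope.get("signature", ""))


key = b"demo-key-for-key:tech9:2026-06"   # hypothetical key material
signed = sign({
    "op_id": "phone:42",
    "actor_id": "technician-9",
    "actor_key_id": "key:tech9:2026-06",
    "object": "project:P7",
    "field": "notes",
    "effect": "add note N8",
    "dependency_context": {"region": 144, "phone": 41},
    "epoch": "project:P7:epoch:12",
}, key)

tampered = {**signed, "field": "admins"}   # peer rewrites the field after signing

print(verify(signed, key))     # True
print(verify(tampered, key))   # False
```

Because the field name is part of the signed bytes, redirecting the operation to a different field invalidates the signature instead of silently succeeding.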

Trust Boundaries In A Replicated Topology

Different replicas deserve different trust.

client device:
  can create local edits for a logged-in user
  may be offline or compromised

edge replica:
  can validate common operations near the user
  may not hold every global secret or policy

regional authority:
  decides rights, unique claims, sensitive status transitions

audit store:
  preserves accepted, rejected, and repaired security-relevant facts

Do not design the merge layer as if every peer is equally honest. A client can be a useful replica and still be untrusted for sensitive decisions.

A practical policy might be:

client may emit:
  add note
  edit own draft comment
  attach photo within quota
  mark checklist observation

edge may accept:
  operations whose capabilities are verifiable locally
  operations within object, size, and rate limits

authority must decide:
  grant admin
  close regulated incident
  allocate unique audit number
  revoke device or key
  change access policy

This keeps the fast path available without turning every replica into an omnipotent authority.
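A policy like the one above can be written down as a small routing table. This is a sketch only: the operation kind strings and the `route` function are hypothetical names that mirror the lists in this section.

```python
from enum import Enum


class Route(Enum):
    EDGE = "edge may accept after local checks"
    AUTHORITY = "route to regional authority"
    REJECT = "unknown kind, reject"


# Kinds mirror the example policy; the exact strings are assumptions.
EDGE_KINDS = {
    "add_note", "edit_own_draft_comment", "attach_photo", "mark_checklist_observation",
}
AUTHORITY_KINDS = {
    "grant_admin", "close_regulated_incident", "allocate_audit_number",
    "revoke_key", "change_access_policy",
}


def route(kind: str) -> Route:
    """Pick who is trusted to decide an operation of this kind."""
    if kind in EDGE_KINDS:
        return Route.EDGE
    if kind in AUTHORITY_KINDS:
        return Route.AUTHORITY
    return Route.REJECT   # default-deny anything the policy does not name


for kind in ("add_note", "grant_admin", "drop_all_tables"):
    print(kind, "->", route(kind).value)
```

The default-deny branch matters as much as the two lists: a kind the policy has never heard of should not fall through to the fast path.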

When authority changes, use epochs. An epoch is a named version of the trust boundary:

project:P7 epoch 12:
  key:tech9 allowed to edit notes

project:P7 epoch 13:
  key:tech9 revoked

An operation signed in epoch 12 should not automatically apply in epoch 13:

operation phone:44
  epoch: 12
  effect: add sensitive attachment

current epoch:
  13

decision:
  reject, quarantine, or route to authority

Epochs are coordination boundaries. They make revocation understandable. Without them, old offline operations can keep arriving with rights that no longer exist.
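The epoch check can be sketched as a small decision function. The mapping from conditions to outcomes below is one possible policy, stated as an assumption; the lesson only requires that the choice between reject, quarantine, and routing to an authority be explicit. The `epoch_decision` name and its parameters are hypothetical.

```python
def epoch_decision(op_epoch: int, current_epoch: int,
                   key_revoked: bool, sensitive: bool) -> str:
    """Decide what to do with an operation based on its epoch (assumed policy)."""
    if op_epoch > current_epoch:
        # Assumption: this replica has not yet learned about the newer epoch.
        return "buffer"
    if op_epoch == current_epoch:
        return "reject" if key_revoked else "continue"
    # Stale epoch: the operation was signed under an older trust boundary.
    if key_revoked and sensitive:
        return "reject"
    if key_revoked:
        return "quarantine"
    return "route to authority"


# operation phone:44: signed in epoch 12, sensitive attachment, key revoked in 13
print(epoch_decision(12, 13, key_revoked=True, sensitive=True))
# a harmless late note from the same revoked key
print(epoch_decision(12, 13, key_revoked=True, sensitive=False))
```

"continue" means only that the epoch check passed; signature, authorization, and schema checks still apply.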

Worked Example: Malicious Close Request

Consider a safety incident:

incident I9:
  notes: OR-map
  warning_tags: OR-set
  status: workflow register

The technician has a capability to add notes:

capability:
  actor: technician-9
  object: incident:I9
  allows: notes.add
  expires: 18:00

This operation is accepted:

operation phone:50
  effect: notes.add("Gas valve inspected")
  capability: notes.add
  signature: valid

Now the stolen device emits:

operation phone:51
  effect:
    status.write(closed)
    warning_tags.remove("gas-risk")

The CRDT fields can merge those effects. The security policy should reject them:

validation:
  signature valid
  actor authenticated
  status.close right missing
  warning tag removal requires safety authority

decision:
  reject phone:51
  emit rejection event

A nearby edge can serve as the first line of defense:

edge:
  verify signature
  check local capability
  reject missing right
  sync rejection to other replicas

The regional authority may also emit a revocation:

operation region:900
  revoke key:tech9:device-tablet
  new epoch: incident:I9:epoch:14
  reason: stolen device

From epoch 14 onward, old device operations are not automatically trusted. If a late offline note from epoch 13 arrives, the system can quarantine it:

late operation phone:49
  epoch: 13
  effect: notes.add("Photo before shutdown")

decision:
  quarantine for review
  or accept only if policy allows pre-revocation notes

That policy decision is explicit. The CRDT does not hide it inside merge.
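The capability check from this example can be sketched as follows. The `Capability` dataclass and `missing_rights` helper are hypothetical, and the date attached to the 18:00 expiry is an assumption borrowed from the envelope example earlier in the lesson. An operation with several effects is treated as all-or-nothing: any uncovered effect rejects the whole operation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Capability:
    actor: str
    obj: str
    allows: frozenset
    expires: datetime


def missing_rights(cap, actor, obj, effects, now):
    """Return the effects the capability does not cover; empty means admissible."""
    if cap.actor != actor or cap.obj != obj or now >= cap.expires:
        return list(effects)
    return [e for e in effects if e not in cap.allows]


cap = Capability(
    actor="technician-9",
    obj="incident:I9",
    allows=frozenset({"notes.add"}),
    expires=datetime(2026, 6, 8, 18, 0, tzinfo=timezone.utc),  # assumed date
)
afternoon = datetime(2026, 6, 8, 15, 0, tzinfo=timezone.utc)
evening = datetime(2026, 6, 8, 19, 0, tzinfo=timezone.utc)

# phone:50: note within the capability
print(missing_rights(cap, "technician-9", "incident:I9", ["notes.add"], afternoon))
# phone:51: close request from the stolen device
print(missing_rights(cap, "technician-9", "incident:I9",
                     ["status.write", "warning_tags.remove"], afternoon))
# the same note replayed after the capability expired
print(missing_rights(cap, "technician-9", "incident:I9", ["notes.add"], evening))
```

The returned list doubles as the reason text for a rejection event, which keeps the audit record specific about which right was missing.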

Attacks To Test

Security testing for CRDT systems should generate hostile histories, not only random network histories.

forged actor:
  operation claims actor_id admin-1 but signature does not match

replay:
  old valid operation arrives after revocation or epoch change

permission escalation:
  ordinary actor tries to add admin or grant capability

semantic abuse:
  add-wins permission set makes a revoked grant reappear

metadata denial:
  operation creates huge dots, many tombstones, or deep nested state

dependency lie:
  operation claims to have seen a prerequisite it never actually had

schema downgrade:
  old client drops unknown security field and applies effect anyway

derived-view poisoning:
  attacker sends update that corrupts search, feed, or counter if treated as source truth

Each attack should have an expected decision:

accept:
  valid local note within capability

reject:
  invalid signature or missing right

buffer:
  dependency missing but operation may become valid

quarantine:
  operation signed before revocation but arriving after sensitive epoch change

repair:
  previously accepted operation later superseded by authority

The key is to test the admission path and the merge path together. A merge test alone may happily converge on a malicious update. An authorization test alone may miss duplicate delivery, stale epochs, and replayed repair operations.
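A table-driven test is one way to pair each hostile history with its expected decision. The `decide` function below is a deliberately tiny stand-in for a real admission path, and every name and value in it is an assumption made for the sketch. The final loop delivers the whole history twice to check that admission plus merge stays idempotent.

```python
KNOWN_DEPS = {"phone:41"}
RIGHTS = {"technician-9": {"notes.add"}}
CURRENT_EPOCH = 13


def decide(op: dict) -> str:
    """Toy admission path: signature, rights, epoch, then dependencies."""
    if not op["signature_ok"]:
        return "reject"
    if op["effect"] not in RIGHTS.get(op["actor"], set()):
        return "reject"
    if op["epoch"] < CURRENT_EPOCH:
        return "quarantine"
    if not set(op["deps"]) <= KNOWN_DEPS:
        return "buffer"
    return "accept"


def make_op(**overrides) -> dict:
    base = {"op_id": "phone:42", "actor": "technician-9", "effect": "notes.add",
            "epoch": 13, "deps": ["phone:41"], "signature_ok": True}
    return {**base, **overrides}


HOSTILE_CASES = [
    ("valid local note", make_op(), "accept"),
    ("forged actor", make_op(actor="admin-1", signature_ok=False), "reject"),
    ("permission escalation", make_op(effect="admins.add"), "reject"),
    ("replay after epoch change", make_op(epoch=12), "quarantine"),
    ("missing dependency", make_op(deps=["phone:99"]), "buffer"),
]

failures = [(name, decide(o), want)
            for name, o, want in HOSTILE_CASES if decide(o) != want]

# Deliver the whole history twice: admitted state must not change.
state = set()
for _round in range(2):
    for _name, o, _want in HOSTILE_CASES:
        if decide(o) == "accept":
            state.add(o["op_id"])

print(failures)
print(state)
```

An empty `failures` list means every hostile case got its expected decision, and `state` holding a single operation ID shows that duplicate delivery did not admit anything twice.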

Failure Modes

Treating signatures as authorization: A valid signature proves which key signed an operation, not that the actor may change that field.
Using add-wins sets for permissions: A stale or malicious permission grant can survive revocation unless policy and epochs are explicit.
Accepting unknown operation kinds: Old clients should not silently apply effects after dropping security metadata they do not understand.
Letting clients decide global rights: Offline clients can hold delegated rights, but they should not mint sensitive authority for themselves.
Forgetting replay: An operation that was valid yesterday may be invalid after revocation, expiry, or epoch change.
Merging before validation: Once an invalid update becomes source state, repair becomes harder and audit trails get muddy.
Ignoring resource exhaustion: Malicious updates can attack metadata, tombstones, indexes, and sync bandwidth.
Dropping rejection evidence: Operators need to know whether an operation was missing, rejected, quarantined, or repaired.

Practice

Design admission checks for this operation:

operation phone:77
  actor: contractor-3
  object: incident:I9
  epoch: 12
  effect:
    warning_tags.remove("gas-risk")
    status.write(closed)

Fill in:

1. What must be authenticated?
2. What field-level authorization is required?
3. Which part can an edge decide locally?
4. Which part must route to an authority?
5. What should be logged if the operation is rejected?
6. What should happen if the same operation arrives again after epoch 13?

Then change the operation:

effect:
  notes.add("Valve inspected")

Explain which validation checks remain, and why this second operation may be accepted locally when the close request cannot.

Connections

011.md introduced rights transfer; security checks decide whether an actor actually holds the right it spends.
018.md treated offline clients and edges as replicas; this lesson marks which of those replicas are trusted for which decisions.
020.md showed rejected operations and repairs; malicious updates need the same visibility plus stronger audit trails.
022.md continues with rolling upgrades, where old clients must not accidentally bypass new security fields.

Resources

[PAPER] Zanzibar: Google's Consistent, Global Authorization System
- Focus: Study relationship-based authorization and why some permission decisions need a clear authority.
[PAPER] Macaroons: Cookies with Contextual Caveats for Decentralized Authorization in the Cloud
- Focus: Look at delegated capabilities, caveats, and bounded authority.
[RFC] JSON Web Signature (JWS)
- Focus: Connect signed operation envelopes to authenticity and tamper detection.
[GUIDE] OWASP API Security Top 10 2023: Broken Object Level Authorization
- Focus: Relate object-level authorization failures to per-object replicated operations.
[PAPER] A comprehensive study of Convergent and Commutative Replicated Data Types
- Focus: Revisit CRDT assumptions and notice where trust and authorization sit outside pure merge laws.

Key Takeaways

A CRDT operation can be mergeable and still be inadmissible because the actor lacks authority.
Replicas should validate signatures, object rights, epochs, schema, and resource limits before admitting an update into source state.
Offline clients and edges can remain useful replicas without being trusted to decide every sensitive fact.
Security incidents should be handled with explicit rejection, quarantine, revocation, or repair operations so the whole system can explain the outcome.
