CRDTs and Coordination Avoidance: Escrow, Bounded Counters, and Rights Transfer

LESSON

CRDTs and Coordination Avoidance

011 30 min intermediate

CRDTs and Coordination Avoidance: Escrow, Bounded Counters, and Rights Transfer

Core Insight

Imagine a ticketing service with 100 seats left for a workshop. The previous lesson showed why a plain replicated counter is dangerous: Madrid can sell 70 seats while Dublin sells 50 seats during a network partition, and both local replicas may think they are safe until the merged state says 120 seats were sold.

Escrow changes the shape of the problem. Instead of letting every region spend from one invisible global pool, the system divides the remaining capacity into local rights. Madrid might hold 60 sale rights, Dublin 30, and Lisbon 10. A region can sell locally only while it has rights left. If it runs out, it must obtain more rights or stop accepting that operation.

A bounded counter is the CRDT-style version of this idea for numeric limits. It still merges asynchronously and converges, but it stores more than "how many happened." It stores who owns rights, who consumed rights, and which rights were transferred. The invariant is protected because no local decrement can consume rights that were not allocated to that replica.

The trade-off is attractive but real. Escrow removes coordination from the common operation path when rights are available, but it moves coordination pressure to allocation, transfer, rebalancing, and recovery. The learner's useful mental model is: spend locally, rebalance occasionally, and treat rights as correctness state, not just capacity hints.

The Naive Counter Fails At The Bound

Start with a non-negative inventory:

stock = 100
invariant:
  stock >= 0

Selling one item decrements stock. Restocking increments stock.

A PN-counter can merge increments and decrements without losing updates:

value = initial + sum(increments) - sum(decrements)

That is good for convergence, but not enough for the invariant. During a partition:

Madrid sees stock = 100
Madrid sells 70

Dublin sees stock = 100
Dublin sells 50

merged stock = 100 - 70 - 50 = -20

The problem is not that the counter failed to converge. It converged perfectly to a bad value. The invariant was about the global sum, and each region spent as if it owned the whole remaining stock.

The naive intuition says "check before decrement." Escrow says "checking is local only if the replica owns enough rights to make the check meaningful."

Rights Make The Bound Local

For a lower-bounded counter, such as stock >= 0, a right is permission to decrement once.

If stock is 100, there are 100 decrement rights:

total rights = stock - lower_bound
             = 100 - 0
             = 100

Now distribute rights:

Madrid: 60 rights
Dublin: 30 rights
Lisbon: 10 rights

Each region can sell only from its local rights:

Madrid sells 1:
  Madrid rights: 60 -> 59
  sale succeeds locally

Madrid tries to sell 61 while holding 60:
  sale fails locally, or waits for rights transfer

This is the small win. The local replica no longer needs a fresh global stock value for every sale. It only needs to know its own rights balance.

The invariant follows from accounting:

global decrements <= total allocated rights
total allocated rights <= stock - lower_bound
therefore stock never crosses the bound

The system can still be available for operations covered by local rights. It is not magically available for every possible operation. If one region burns through its rights, it has reached a real boundary.

Bounded Counter State

A bounded counter needs to track rights in mergeable state. One useful teaching model stores two pieces of information:

transfers[from][to] = rights transferred from one replica to another
used[replica]       = rights consumed by that replica

Initial allocation can be represented as self-transfers:

transfers[Madrid][Madrid] = 60
transfers[Dublin][Dublin] = 30
transfers[Lisbon][Lisbon] = 10

Madrid's local rights are:

rights(Madrid) =
  rights granted to Madrid
  - rights Madrid transferred away
  - rights Madrid already used

In a compact formula:

rights(i) =
  sum(transfers[*][i])
  - sum(transfers[i][* where * != i])
  - used[i]

A local sale at Madrid increments used[Madrid] only if rights(Madrid) is high enough. A rights transfer from Dublin to Madrid increments transfers[Dublin][Madrid] only if Dublin has enough rights to give away.

The merge rule is component-wise maximum, as with many state-based counters:

merge used:
  used[i] = max(local.used[i], remote.used[i])

merge transfers:
  transfers[i][j] = max(local.transfers[i][j], remote.transfers[i][j])

The maximum is important because state may arrive twice or out of order. Seeing the same sale state twice should not spend the right twice.

Worked Example: Rights Transfer

Start with:

Madrid: 60 rights
Dublin: 30 rights
Lisbon: 10 rights

Madrid sells 55 seats:

used[Madrid] = 55
rights(Madrid) = 60 - 55 = 5

Dublin sells 5 seats:

used[Dublin] = 5
rights(Dublin) = 30 - 5 = 25

Madrid now receives a burst of demand for 20 more seats. It has only 5 rights. It cannot safely accept all 20 locally.

Madrid can ask Dublin for 15 rights. If Dublin can spare them, Dublin records:

transfers[Dublin][Madrid] += 15

After that transfer reaches Madrid, Madrid's rights become:

initial grants to Madrid:          60
rights received from Dublin:      +15
rights used by Madrid so far:     -55

rights(Madrid) = 20

Madrid can now sell 20 more seats locally:

used[Madrid] = 75
rights(Madrid) = 0

Check the global picture:

Madrid used 75
Dublin used 5
Lisbon used 0

total sold = 80
stock left = 20
invariant holds

The transfer did not create new capacity. It moved permission to spend capacity from one replica to another. That is the heart of escrow.

What Happens During Partitions

Escrow gives a precise answer during a network partition:

If the replica has rights:
  the bounded operation can proceed locally.

If the replica lacks rights:
  the operation must fail, wait, route elsewhere, or use a coordinated path.

That behavior is useful because it is predictable. The system does not oversell just because a link is down. It also does not pretend availability is unlimited.

Operators now have a new tuning problem: rights placement.

too many rights in one region:
  other regions may block even though global capacity remains

too evenly spread:
  every regional demand spike may need transfers

too much transfer churn:
  the system spends coordination effort rebalancing rights

Rights allocation should follow traffic, locality, and risk. A region with predictable demand can hold more rights. A low-traffic edge replica may hold a small reserve. A safety-critical operation may choose smaller local grants and accept more coordination.

Restocks, Upper Bounds, And Direction

The same idea works for different numeric invariants, but the direction matters.

For a lower bound:

stock >= 0

decrement consumes rights
increment creates rights

For an upper bound:

issued_coupons <= 1000

increment consumes rights
decrement may release rights

This is where bounded counters differ from ordinary counters. The operation is not judged only by arithmetic. It is judged by whether it consumes scarce permission under the invariant.

Some operations can create rights. If the ticketing service releases 30 more seats, that restock increases the available rights budget. The system still needs to decide where the new rights go:

restock creates 30 rights
allocate:
  Madrid +20
  Dublin +5
  Lisbon +5

That allocation decision may itself require coordination or an authority. The point is not that bounded counters remove all coordination. They keep the frequent bounded operation local when the rights have already been placed.

Failure Modes

Practice

Design a bounded counter for a workshop with 200 seats across three regions:

Madrid expected demand: 120
Dublin expected demand: 50
Lisbon expected demand: 30

Choose an initial rights allocation. Then answer:

1. What happens if Madrid sells out its local rights first?
2. Which region should it request rights from?
3. What user-visible behavior happens while the transfer is unavailable?
4. What metric would tell operators that rights are placed badly?

The goal is not to find one perfect allocation. The goal is to see that bounded counters turn "global invariant on every write" into "local spend plus explicit rebalancing."

Connections

Resources

Key Takeaways

PREVIOUS CRDTs and Coordination Avoidance: Invariant Confluence, CALM, and Monotonic Programs NEXT CRDTs and Coordination Avoidance: Uniqueness, Allocation, and Coordination Requirements