Conflict Resolution and Convergence Policies

LESSON

Consistency and Replication

009 30 min advanced

The core idea: Conflict resolution is the rule that decides which user intent survives when replicas accept concurrent writes, so it must be designed from domain invariants rather than treated as storage cleanup.

Core Insight

Imagine Harbor Point's leaderless store after a regional network flap. A trader in Lisbon extends reservation hold H-8821 for issuer MUNI-77 because the client is still negotiating. One second later, the Baltimore approval workflow confirms that same hold as reservation R-88421 after risk and payment checks pass. Both writes start from version 18 because neither side has seen the other's update yet.

When replicas repair, they do not have a simple old value and new value. They have two children of the same parent:

v18  status=active, expires_at=09:35
|- v19-lis  status=active, expires_at=09:40
`- v19-iad  status=confirmed, reservation_id=R-88421

That is a conflict. The system accepted two writes that did not causally supersede each other. Copying one version to every replica will make the cluster converge, but convergence alone does not mean the business outcome is correct.

The trade-off is that easy policies are often semantically dangerous. Last-writer-wins is simple and fast, but it can erase a confirmed reservation because one clock timestamp happened to be later. Domain-specific resolution protects invariants, but it takes more metadata, tests, client behavior, and operational visibility.

Detect Concurrency Before Resolving It

A replica can disagree with another replica for two very different reasons. One value may be stale, or two values may be concurrent.

stale case:
v18 -> v19

concurrent case:
v18 -> v19-a
v18 -> v19-b

In the stale case, the newer descendant can replace the older value. In the concurrent case, neither branch saw the other, so there is no "newer" value to pick and the resolver must apply a conflict policy.

Harbor Point therefore stores version ancestry for each entity. The exact representation could be a revision tree, vector-clock-style metadata, origin-plus-counter pairs, or another causal marker. The important capability is comparison:

def classify(local, incoming):
    if incoming.version.descends_from(local.version):
        return "incoming_supersedes_local"
    if local.version.descends_from(incoming.version):
        return "incoming_is_stale"
    return "concurrent_conflict"

Wall-clock timestamps are not enough for this classification. A timestamp can tell you which write was stamped later by some node. It cannot prove that one writer observed the other's value before writing. Even a well-synchronized clock is not causality.

This matters because the wrong classification applies the wrong policy. Treating a concurrent write as merely stale can erase real user intent. Treating a stale retry as a conflict can force unnecessary manual review. Good convergence starts with honest metadata.
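As one illustration of such a causal marker, the sketch below uses a small vector clock: one write counter per origin node. The class name `VectorClock`, the node names `lis` and `iad`, and the counter values are assumptions made for this example, not Harbor Point's actual representation. `descends_from` is treated here as "descends from or equals", so a redelivered identical version is not misread as a conflict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorClock:
    """Hypothetical causal marker: one write counter per origin node."""
    counters: tuple = ()  # sorted (node, count) pairs

    @staticmethod
    def of(**counts):
        return VectorClock(tuple(sorted(counts.items())))

    def descends_from(self, other):
        """True if self has seen every write other has seen (or is equal)."""
        mine = dict(self.counters)
        return all(mine.get(node, 0) >= n for node, n in other.counters)


def classify(local, incoming):
    """Same three outcomes as the classifier above, applied to bare clocks."""
    if incoming.descends_from(local):
        return "incoming_supersedes_local"
    if local.descends_from(incoming):
        return "incoming_is_stale"
    return "concurrent_conflict"


v18 = VectorClock.of(lis=9, iad=9)
v19_lis = VectorClock.of(lis=10, iad=9)   # one site wrote on top of v18
v19_iad = VectorClock.of(lis=9, iad=10)   # the other site also wrote on top of v18

print(classify(v18, v19_lis))      # incoming_supersedes_local
print(classify(v19_lis, v18))      # incoming_is_stale
print(classify(v19_lis, v19_iad))  # concurrent_conflict
```

Neither sibling's counters dominate the other's, which is exactly the information a single wall-clock timestamp cannot carry.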

Match Policy to the Invariant

Harbor Point uses more than one conflict policy because not all fields mean the same thing.

Data shape                         Possible policy             Why
---------------------------------  --------------------------  -------------------------
trader dashboard layout            last-writer-wins            replaceable preference
set of desks that viewed a hold    set union                   additive membership
reservation hold lifecycle         domain resolver / reject    state machine invariant
audit notes                        preserve siblings           human review may matter

Last-writer-wins is acceptable for the dashboard layout because losing an older layout preference is annoying, not financially dangerous. It is a poor default for reservation state. If confirmed can be overwritten by active merely because a later timestamp wins, the database has converged to a false business truth.
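To make the hazard concrete, here is a minimal sketch of last-writer-wins applied to the hold. The `Write` type, the timestamps, and the two-second clock skew are invented for illustration. The confirmation really happens one second after the extension, but the node that accepted the extension stamps it with a fast clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    status: str
    stamped_at: float  # wall-clock seconds, as stamped by the accepting node
    origin: str


def last_writer_wins(a, b):
    """Keep the write with the larger timestamp; break ties by origin name."""
    return max(a, b, key=lambda w: (w.stamped_at, w.origin))


# Assumed timeline: extension at t=100, confirmation at t=101,
# but the Lisbon node's clock runs 2 seconds fast.
extend = Write(status="active", stamped_at=100.0 + 2.0, origin="lis")
confirm = Write(status="confirmed", stamped_at=101.0, origin="iad")

winner = last_writer_wins(extend, confirm)
print(winner.status)  # active
```

Every replica that applies this rule ends up with the same value, so the cluster converges, yet the confirmed reservation is gone.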

For the hold lifecycle, Harbor Point encodes a domain rule:

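def resolve_hold(left, right):
    # Causally ordered versions are not conflicts: keep the descendant.
    if left.version.descends_from(right.version):
        return left
    if right.version.descends_from(left.version):
        return right

    # From here on, the two branches are concurrent.
    states = {left.status, right.status}

    # confirmed is absorbing: a concurrent branch must never overwrite it.
    if "confirmed" in states:
        return choose_confirmed_branch(left, right)

    # Two concurrent extensions of the same hold: keep the later expiry.
    if left.status == right.status == "active":
        return branch_with_later_expiry(left, right)

    # Anything else (for example released vs. active) is not guessed.
    return ConflictRequiresRetry([left, right])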

The policy is not arbitrary. confirmed is an absorbing state because a downstream workflow has already allocated the reservation and produced audit records. Two concurrent active extensions can be merged by keeping the later expiry only if both branches refer to the same hold and the business allows monotonic extension. A conflict between released and active should not be guessed; it should be retried or reviewed.

The invariant drives the policy. The storage system can provide version ancestry and sibling branches, but the domain decides which outcome is safe.
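The helpers that resolve_hold relies on are not defined in this lesson. The sketch below fills them in under stated assumptions: the `Hold` fields, the zero-padded "HH:MM" expiry strings, and the choice to raise `ConflictRequiresRetry` as an exception (instead of returning it) are all illustrative, and the function only covers branches that have already been classified as concurrent.

```python
from dataclasses import dataclass
from typing import Optional


class ConflictRequiresRetry(Exception):
    """Raised when no safe automatic resolution exists."""
    def __init__(self, branches):
        super().__init__("concurrent hold branches need retry or review")
        self.branches = branches


@dataclass(frozen=True)
class Hold:
    hold_id: str
    status: str                       # "active", "confirmed", or "released"
    expires_at: str = ""              # zero-padded "HH:MM" so strings order correctly
    reservation_id: Optional[str] = None


def choose_confirmed_branch(left, right):
    confirmed = [h for h in (left, right) if h.status == "confirmed"]
    # Two confirmations with different reservation ids cannot be merged safely.
    if len({h.reservation_id for h in confirmed}) > 1:
        raise ConflictRequiresRetry([left, right])
    return confirmed[0]


def branch_with_later_expiry(left, right):
    # Monotonic extension only makes sense for the same hold.
    if left.hold_id != right.hold_id:
        raise ConflictRequiresRetry([left, right])
    return max(left, right, key=lambda h: h.expires_at)


def resolve_concurrent(left, right):
    """Resolve two branches already known to be concurrent."""
    states = {left.status, right.status}
    if "confirmed" in states:
        return choose_confirmed_branch(left, right)
    if states == {"active"}:
        return branch_with_later_expiry(left, right)
    raise ConflictRequiresRetry([left, right])


lis = Hold("H-8821", "active", expires_at="09:40")
iad = Hold("H-8821", "confirmed", reservation_id="R-88421")
print(resolve_concurrent(lis, iad).status)  # confirmed
```

The result does not depend on argument order or on any timestamp, which is what lets every replica reach the same answer independently.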

Worked Example: Three Outcomes

When replicas converge after conflict, the resolver has three broad options.

Outcome             Example                          Trade-off
------------------  -------------------------------  ------------------------------------
choose one branch   dashboard layout LWW             fast, but may lose intent
merge branches      set of involved desks            preserves compatible facts
surface conflict    hold lifecycle mismatch          safer, but pushes work upward

Choosing one branch is attractive when values are replaceable. It keeps APIs simple and latency low. The risk is silent loss when the data was not actually replaceable.

Merging branches is attractive when the data type is naturally composable. Sets, counters, and append-only histories can often merge safely if the merge operation preserves the invariant. This is the world where CRDT-style thinking helps.
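A short sketch of two such composable merges follows: set union for the desks that viewed a hold, and a grow-only counter that keeps the per-node maximum. The desk names and counts are made up for the example. What makes these merges safe is that they are commutative and idempotent, so replicas can apply them in any order, any number of times, and still agree.

```python
def merge_viewers(left, right):
    """Set union: membership is additive, so nothing is lost."""
    return left | right


def merge_gcounter(left, right):
    """Grow-only counter: each node owns one slot; keep the larger value per slot."""
    return {node: max(left.get(node, 0), right.get(node, 0))
            for node in left.keys() | right.keys()}


def gcounter_value(counter):
    return sum(counter.values())


# Two replicas saw different desks view the same hold.
viewers_a = frozenset({"rates-desk", "muni-desk"})
viewers_b = frozenset({"muni-desk", "risk-desk"})
print(sorted(merge_viewers(viewers_a, viewers_b)))
# ['muni-desk', 'rates-desk', 'risk-desk']

# Two replicas hold partially overlapping view counts.
views_a = {"lis": 3, "iad": 1}
views_b = {"lis": 2, "iad": 4}
print(gcounter_value(merge_gcounter(views_a, views_b)))  # 7
```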

Surfacing the conflict is appropriate when the system lacks enough information to merge safely. That may mean rejecting a conditional write, asking the client to refetch, putting an item into a review queue, or running a compensating workflow.

For Harbor Point's confirmed reservation conflict, surfacing or domain resolution is better than pretending the latest timestamp knows business law. The system should preserve evidence, protect the confirmed state, and make any ambiguous branch visible to the workflow that owns the invariant.

API and Operations Impact

Conflict resolution is not hidden database plumbing. It changes the API contract.

Harbor Point exposes different write styles:

API style                    Use case                         Conflict behavior
---------------------------  -------------------------------  --------------------------
blind overwrite              replaceable preferences          LWW permitted
conditional write            lifecycle state changes          requires expected version
merge-aware command          additive or monotonic updates    resolver tested by domain
manual/retry path            unsafe lifecycle conflicts       conflict surfaced

A conditional lifecycle write might require If-Match: v18. If the stored version is no longer v18, for example because a concurrent branch was accepted first, the API rejects the write and forces the caller to refetch. That is not a nuisance; it is how the API avoids pretending that every concurrent decision can be silently merged.
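The conditional-write contract can be sketched as a compare-and-set on the version. This is a deliberately simplified, single-node model with integer versions; `HoldStore`, `put_if_match`, and `VersionMismatch` are hypothetical names, and an HTTP layer in front of it would typically translate the mismatch into a 412 Precondition Failed response.

```python
class VersionMismatch(Exception):
    """The caller's expected version is not the current one."""
    def __init__(self, expected, actual):
        super().__init__(f"expected v{expected}, store is at v{actual}")
        self.expected = expected
        self.actual = actual


class HoldStore:
    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def seed(self, key, version, value):
        self._rows[key] = (version, value)

    def get(self, key):
        return self._rows[key]

    def put_if_match(self, key, expected_version, value):
        """Write only if the caller has seen the current version."""
        current_version, _ = self._rows.get(key, (0, None))
        if current_version != expected_version:
            raise VersionMismatch(expected_version, current_version)
        new_version = current_version + 1
        self._rows[key] = (new_version, value)
        return new_version


store = HoldStore()
store.seed("H-8821", 18, {"status": "active", "expires_at": "09:35"})

# The confirmation arrives first and succeeds against v18.
store.put_if_match("H-8821", 18, {"status": "confirmed", "reservation_id": "R-88421"})

# The extension was also prepared against v18, so it is rejected.
try:
    store.put_if_match("H-8821", 18, {"status": "active", "expires_at": "09:40"})
except VersionMismatch as exc:
    print(exc)  # expected v18, store is at v19
```

After the rejection, the caller refetches, sees the confirmed state, and can decide whether an extension still makes sense.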

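Operations need evidence too. A replicated store should track, at a minimum, how often concurrent writes were overwritten, merged, or rejected, so that auto-resolution cannot hide semantic loss.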

Aggressive auto-resolution reduces retries and keeps latency low, but it can hide semantic loss. Conservative rejection protects invariants, but it pushes complexity into clients and operational queues. The right boundary depends on the cost of losing user intent versus the cost of asking for human or workflow help.
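One way to give operators that evidence is to count resolution outcomes per entity type, so the share of silently overwritten writes is visible next to the share that was merged or rejected. The `ResolutionMetrics` class and its outcome names are assumptions made for this sketch, not an existing metrics API.

```python
from collections import Counter

OUTCOMES = {"overwritten", "merged", "rejected"}


class ResolutionMetrics:
    """Hypothetical in-process tally of how conflicts were resolved."""

    def __init__(self):
        self.outcomes = Counter()

    def record(self, entity_type, outcome):
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        self.outcomes[(entity_type, outcome)] += 1

    def rate(self, entity_type, outcome):
        """Fraction of this entity type's conflicts that ended in `outcome`."""
        total = sum(n for (etype, _), n in self.outcomes.items()
                    if etype == entity_type)
        return self.outcomes[(entity_type, outcome)] / total if total else 0.0


metrics = ResolutionMetrics()
metrics.record("dashboard_layout", "overwritten")
metrics.record("hold", "rejected")
metrics.record("hold", "rejected")
metrics.record("hold", "merged")
metrics.record("hold", "overwritten")

print(metrics.rate("hold", "overwritten"))  # 0.25
```

A non-zero overwrite rate on an entity type that is supposed to use conditional writes is the kind of signal worth alerting on.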

Failure Modes

Treating last-writer-wins as harmless convergence. LWW makes replicas agree, but it may agree on the wrong business result. Use it only for data that is genuinely replaceable.

Merging fields instead of preserving invariants. A field-level merge can create an impossible entity state. Test conflict policies against real state-machine transitions and side effects.
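The sketch below shows how an innocent-looking field-level merge can produce such a state. The merge rule (take the other side's value unless it is None) and the invariant (a reservation_id exists exactly when the hold is confirmed) are assumptions chosen for illustration.

```python
def naive_field_merge(left, right):
    """Per field, keep right's value unless it is None."""
    merged = dict(left)
    for field, value in right.items():
        if value is not None:
            merged[field] = value
    return merged


def violates_invariant(hold):
    """Assumed invariant: reservation_id is set if and only if status is confirmed."""
    has_reservation = hold.get("reservation_id") is not None
    return has_reservation != (hold["status"] == "confirmed")


confirmed = {"status": "confirmed", "expires_at": None, "reservation_id": "R-88421"}
extended = {"status": "active", "expires_at": "09:40", "reservation_id": None}

merged = naive_field_merge(confirmed, extended)
print(merged["status"], merged["reservation_id"])  # active R-88421
print(violates_invariant(merged))                  # True
```

Each input is valid on its own; only the merged entity, an active hold that already carries a reservation id, is impossible. A policy test suite should assert the invariant on every merge result, not just on the inputs.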

Using timestamps as causality. Timestamp order is not the same as happened-before order. Keep ancestry metadata when concurrent writes matter.

Hiding conflicts from APIs and operators. If clients, metrics, and runbooks cannot tell whether a write was overwritten, merged, or rejected, the system will be hard to trust during failures.

Key Takeaways

  1. A conflict is a concurrency fact: two writes were accepted without either causally superseding the other.
  2. Conflict policy is domain logic. Choosing, merging, or rejecting versions decides which user intent survives.
  3. Convergence is not enough; replicas can converge to an invalid business state if the resolver ignores invariants.
  4. APIs and operations must expose conflict behavior through conditional writes, merge-aware commands, metrics, and review paths.