Conflict Resolution Policies in Distributed Stores

047 30 min advanced

Day 476: Conflict Resolution Policies in Distributed Stores

The core idea: Once a distributed store accepts concurrent writes, conflict resolution is no longer storage plumbing; it is the rule that decides which user intent survives, which intent is merged, and which must be rejected.

Today's "Aha!" Moment

In 046.md, Harbor Point gave its Lisbon sales desk permission to read slightly stale cabin-hold data because search results could tolerate a small freshness budget. That decision pays off in latency, but it creates a follow-on problem the team cannot dodge. At 14:04:52, a Lisbon agent extends hold H-8821 on cabin C14 from 14:10 to 14:15 because the customer is still on the phone. At 14:04:53, the Baltimore payment service turns that same hold into confirmed booking B-3107 after the card authorization clears. Both writes start from version 18 of the hold record because neither region has seen the other's update yet.

When the replicas reconcile, they do not have "one newer value and one older value." They have two children of the same parent:

v18  status=active, expires_at=14:10
|- v19-lis  status=active, expires_at=14:15
\- v19-iad  status=confirmed, booking_id=B-3107

That is the non-obvious insight: a conflict is not merely two replicas disagreeing. A conflict is the system admitting two writes that neither causally supersedes the other. From that moment on, "last write wins" is not a neutral cleanup step. It is a product decision to let timestamp order decide whether Harbor Point keeps a confirmed booking or silently falls back to an extended hold.
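A minimal sketch of why timestamp order is not neutral, using the two writes from the example. The `Write` class, the calendar date, and the two-second clock skew are assumptions made for illustration; they are not part of Harbor Point's system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Write:
    origin: str           # region that accepted the write
    status: str           # hold status the write produces
    stamped_at: datetime  # wall-clock time recorded by that region


def last_write_wins(a: Write, b: Write) -> Write:
    """Keep the write with the later wall-clock stamp (ties go to `a`)."""
    return b if b.stamped_at > a.stamped_at else a


# The date is arbitrary; only the times come from the scenario.
lisbon = Write("lis", "active", datetime(2024, 1, 1, 14, 4, 52))
baltimore = Write("iad", "confirmed", datetime(2024, 1, 1, 14, 4, 53))

# With accurate clocks the confirmation happens to carry the later stamp.
winner = last_write_wins(lisbon, baltimore)

# Hypothetical skew: Lisbon's clock runs two seconds fast.
skewed_lisbon = Write("lis", "active", lisbon.stamped_at + timedelta(seconds=2))
skewed_winner = last_write_wins(skewed_lisbon, baltimore)

print(winner.status)         # confirmed
print(skewed_winner.status)  # active
```

Under the assumed skew the same rule discards the paid booking, even though nothing about either user's intent changed.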

Teams get hurt when they treat conflict resolution as an implementation detail hidden inside the database. The policy leaks straight into support load, compensation workflows, and customer trust. If the policy is wrong, the database will still converge, but it will converge to the wrong truth.

Why This Matters

Distributed stores accumulate conflicts anywhere coordination is relaxed: multi-leader replication, leaderless quorums, offline clients, materializations fed by change data capture (CDC), and even async workflows that write back into the same entity from different regions. The store must eventually choose among three broad outcomes: keep one version, merge versions, or surface the conflict to a higher layer. Each option trades off latency, availability, and semantic safety differently.

Harbor Point's hold record makes the pressure visible. Some fields can be overwritten cheaply. An agent's draft note or a UI preference can often tolerate a last-writer-wins rule because losing one intermediate value does not break the booking system. A payment-backed state transition cannot. If a conflict policy lets a later timestamp erase status=confirmed, the database has preserved convergence while destroying revenue and auditability.

The production consequence is that conflict policy must be designed from invariants outward. Ask what property has to remain true after replicas diverge and heal. If the answer is "there can be only one confirmed owner of cabin C14," then a generic merge rule is not enough. If the answer is "combine the set of sales channels that touched this hold," a merge may be perfect. The important step is making that choice explicit before the incident makes it for you.

Core Walkthrough

Part 1: Detect concurrency before trying to resolve it

A store cannot resolve a conflict correctly if it cannot distinguish "stale write arrived late" from "two writes were concurrent." Harbor Point attaches revision metadata to each hold update. In a single-leader system, the metadata might be a log position. In a multi-leader or occasionally disconnected system, it is usually some per-entity ancestry marker such as a revision tree, vector-clock-style version, or origin-plus-counter pair that lets the store compare descendants.

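The comparison rule is simple: if the incoming version descends from the local version, the incoming write supersedes it; if the local version descends from the incoming version, the incoming write is stale; if neither descends from the other, the two writes are concurrent and form a conflict.

That logic looks like this: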

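def classify(local_version, incoming_version):
    # Incoming was written on top of what this replica holds: safe to apply.
    if incoming_version.descends_from(local_version):
        return "incoming-wins"
    # This replica already holds a descendant of incoming: it arrived late.
    if local_version.descends_from(incoming_version):
        return "stale-write"
    # Neither writer saw the other's result: a true conflict for the resolver.
    return "concurrent-conflict"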

The key point is that timestamps alone do not answer this question. Two clocks can disagree, and even perfectly synchronized clocks only tell you when an update was stamped, not whether one writer had seen the other writer's result. Harbor Point can use timestamps as one input to a resolver, but not as its entire notion of causality.
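One way `descends_from` could be implemented is with vector clocks, where each version records how many updates it has seen from each replica. The sketch below is an assumption about representation (plain dictionaries with made-up counter values), not Harbor Point's actual metadata format. It also adds a "duplicate" outcome for identical clocks, which the three-way classification above does not mention.

```python
from typing import Dict

VectorClock = Dict[str, int]  # replica id -> number of updates seen from it


def dominates(a: VectorClock, b: VectorClock) -> bool:
    """True if `a` has seen at least as many updates as `b` from every replica."""
    return all(a.get(node, 0) >= b.get(node, 0) for node in set(a) | set(b))


def classify(local: VectorClock, incoming: VectorClock) -> str:
    incoming_ge = dominates(incoming, local)
    local_ge = dominates(local, incoming)
    if incoming_ge and local_ge:
        return "duplicate"            # same version delivered twice
    if incoming_ge:
        return "incoming-wins"        # incoming descends from local
    if local_ge:
        return "stale-write"          # local descends from incoming
    return "concurrent-conflict"      # neither has seen the other


# Hypothetical counters for the hold's three versions.
v18 = {"lis": 10, "iad": 8}
v19_lis = {"lis": 11, "iad": 8}   # Lisbon wrote on top of v18
v19_iad = {"lis": 10, "iad": 9}   # Baltimore wrote on top of v18

print(classify(v18, v19_lis))      # incoming-wins
print(classify(v19_lis, v18))      # stale-write
print(classify(v19_lis, v19_iad))  # concurrent-conflict
```

No timestamp appears anywhere in the comparison; the conflict is detected purely from what each writer had seen.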

This detection step matters because each downstream policy assumes different information. Last-writer-wins assumes a total order exists, even if it is synthetic. Merge functions assume the entity can be decomposed into fields or operations that combine safely. Rejection assumes the application can make progress by retrying from a newer version. If concurrency is misclassified, the store applies the wrong semantics before anyone sees the mistake.

Part 2: Match the resolution policy to the invariant, not to convenience

Harbor Point eventually uses three different policies inside the same reservation domain because the data does not all mean the same thing:

| Data shape | Policy | Why it fits | Where it fails |
| --- | --- | --- | --- |
| Agent draft note on a hold | Last-writer-wins | The note is replaceable text, and losing an older draft is acceptable | It is unsafe for payment or inventory state |
| Set of agencies that touched the hold | Merge by set union | Membership is additive, so combining both branches preserves meaning | It would be wrong for a single-valued field such as cabin owner |
| Hold lifecycle state: active, released, confirmed | Domain-specific resolver with rejection paths | The meaning depends on a state machine and side effects such as payment capture | It requires more code, more tests, and sometimes a manual or retried path |
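The two simpler policies can be sketched in a few lines. The `NoteVersion` type, its `stamp` field, the origin tie-breaker, and the agency names are hypothetical; the point is only that both resolvers give the same answer regardless of which replica runs them or in which order the branches arrive.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class NoteVersion:
    text: str
    stamp: int    # assumed synthetic order, e.g. a hybrid logical timestamp
    origin: str   # tie-breaker so every replica picks the same winner


def resolve_note(a: NoteVersion, b: NoteVersion) -> NoteVersion:
    """Last-writer-wins: the larger (stamp, origin) pair survives; the other draft is dropped."""
    return max(a, b, key=lambda v: (v.stamp, v.origin))


def resolve_agencies(a: FrozenSet[str], b: FrozenSet[str]) -> FrozenSet[str]:
    """Set union: membership is additive, so neither branch loses entries."""
    return a | b


lis_note = NoteVersion("customer wants balcony", stamp=41, origin="lis")
iad_note = NoteVersion("customer paid by card", stamp=41, origin="iad")
kept_note = resolve_note(lis_note, iad_note)

agencies = resolve_agencies(frozenset({"lisbon-desk"}), frozenset({"web", "lisbon-desk"}))

print(kept_note.origin)   # lis
print(sorted(agencies))   # ['lisbon-desk', 'web']
```

The note resolver silently discards one draft, which is acceptable only because the table classifies that field as replaceable.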

For hold H-8821, Harbor Point encodes the lifecycle rule directly:

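def resolve_hold(left, right):
    # Causal descent first: a descendant already includes its ancestor.
    if left.version.descends_from(right.version):
        return left
    if right.version.descends_from(left.version):
        return right

    # From here on the two branches are concurrent.
    states = {left.status, right.status}

    # A payment-backed confirmation outranks any other lifecycle state.
    if "confirmed" in states:
        return choose_confirmed_branch(left, right)

    # Two concurrent extensions: keep the longer hold.
    if left.status == right.status == "active":
        return branch_with_later_expiry(left, right)

    # Anything else is surfaced so the caller can retry from a newer version.
    return ConflictRequiresRetry([left, right])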

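The important detail is not the exact function names. It is the semantic precedence: causal descent is settled first; among concurrent branches, a confirmed booking outranks an active or released hold; two active branches keep the later expiry; and every other combination is returned to the caller for a retry.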

This is why last-writer-wins is often too blunt. It answers every conflict with "pick one timestamp," even when the real question is "which state transition is legally allowed after payment, release, or expiry?" A merge function can be equally dangerous if it combines fields that look composable but break the entity-level invariant. For example, taking the max of two expires_at values is only safe when both branches refer to the same hold and no other process has already consumed the cabin.
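For readers who want to run the lifecycle rule, here is a self-contained sketch. It makes several assumptions that the lesson does not state: ancestry is modeled as a set of version ids on each `Hold`, active holds always carry an `expires_at`, and two concurrent confirmations are escalated rather than auto-resolved. The helper names from the snippet above are inlined instead of reproduced.

```python
from dataclasses import dataclass
from datetime import time
from typing import FrozenSet, Optional, Tuple, Union


@dataclass(frozen=True)
class Hold:
    version: str
    ancestors: FrozenSet[str]           # every version id this one was built on
    status: str                         # "active", "released", or "confirmed"
    expires_at: Optional[time] = None   # assumed present on active holds
    booking_id: Optional[str] = None

    def descends_from(self, other: "Hold") -> bool:
        return other.version in self.ancestors


@dataclass(frozen=True)
class ConflictRequiresRetry:
    branches: Tuple[Hold, ...]


def resolve_hold(left: Hold, right: Hold) -> Union[Hold, ConflictRequiresRetry]:
    # 1. Causality first: a descendant already incorporates its ancestor.
    if left.descends_from(right):
        return left
    if right.descends_from(left):
        return right
    confirmed = [h for h in (left, right) if h.status == "confirmed"]
    # 2. Exactly one confirmed branch: the payment-backed transition survives.
    if len(confirmed) == 1:
        return confirmed[0]
    # 3. Two concurrent extensions: keep the later expiry.
    if not confirmed and left.status == right.status == "active":
        return max(left, right, key=lambda h: h.expires_at)
    # 4. Everything else (two confirmations, active vs released) is escalated.
    return ConflictRequiresRetry((left, right))


v18 = Hold("v18", frozenset(), "active", expires_at=time(14, 10))
v19_lis = Hold("v19-lis", frozenset({"v18"}), "active", expires_at=time(14, 15))
v19_iad = Hold("v19-iad", frozenset({"v18"}), "confirmed", booking_id="B-3107")

resolved = resolve_hold(v19_lis, v19_iad)
print(resolved.status, resolved.booking_id)  # confirmed B-3107
```

The result is the same whichever branch is passed as `left`, which is what lets both regions converge without coordinating.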

Part 3: Resolution policy changes APIs and operations

Conflict resolution does not stop at storage internals. It changes what the API is allowed to promise. Harbor Point exposes three write styles, so that callers can tell whether a write may be overwritten, merged, or rejected.

That API split is not extra ceremony. It prevents clients from assuming that every conflict is silently handled in their favor. The next lesson, 048.md, turns this into a broader consistency question: what exact semantic contract does an API make when it serves stale data, retries a write, or rejects a concurrent update?
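As one illustration of a write style that surfaces conflicts, a conditional write can require the caller to name the version it read. The `HoldStore` class, the `put_if_version` method, and the integer versions below are hypothetical stand-ins for a single replica, not Harbor Point's API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class WriteResult:
    accepted: bool
    current_version: int
    reason: Optional[str] = None


class HoldStore:
    """In-memory stand-in for one replica; versions are plain integers starting at 0."""

    def __init__(self) -> None:
        self._status: Dict[str, str] = {}
        self._version: Dict[str, int] = {}

    def put_if_version(self, hold_id: str, status: str, expected_version: int) -> WriteResult:
        current = self._version.get(hold_id, 0)
        if current != expected_version:
            # Surface the conflict to the caller instead of resolving it silently.
            return WriteResult(False, current, "version-mismatch")
        self._status[hold_id] = status
        self._version[hold_id] = current + 1
        return WriteResult(True, current + 1)


store = HoldStore()
first = store.put_if_version("H-8821", "active", expected_version=0)
second = store.put_if_version("H-8821", "confirmed", expected_version=0)

print(first.accepted, first.current_version)    # True 1
print(second.accepted, second.current_version)  # False 1
```

The second caller learns that its snapshot was stale and must re-read before retrying, which is exactly the cost that conservative rejection pushes onto clients.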

Operations change as well. Harbor Point no longer watches only "replication lag" and "write errors." It tracks:

This is the real trade-off surface. Aggressive auto-resolution keeps latency low and reduces client retries, but it can hide semantic loss until finance or support notices. Conservative rejection preserves invariants more reliably, but it pushes complexity into clients, queues, and operational workflows. Good systems choose that boundary on purpose.
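One hypothetical way to make that boundary measurable is to count resolution outcomes per policy. The class, policy labels, and outcome labels below are illustrative assumptions rather than Harbor Point's actual metrics.

```python
from collections import Counter


class ConflictMetrics:
    """Counts how conflicts were settled, keyed by (policy, outcome)."""

    def __init__(self) -> None:
        self.outcomes: Counter = Counter()

    def record(self, policy: str, outcome: str) -> None:
        self.outcomes[(policy, outcome)] += 1

    def rejection_rate(self, policy: str) -> float:
        """Share of a policy's conflicts that were pushed back to the caller."""
        total = sum(n for (p, _), n in self.outcomes.items() if p == policy)
        rejected = self.outcomes[(policy, "rejected")]
        return rejected / total if total else 0.0


metrics = ConflictMetrics()
for outcome in ["auto-resolved", "auto-resolved", "auto-resolved", "rejected"]:
    metrics.record("lifecycle", outcome)
metrics.record("note", "auto-resolved")

print(metrics.rejection_rate("lifecycle"))  # 0.25
print(metrics.rejection_rate("note"))       # 0.0
```

A rising rejection rate signals growing client-side retry cost, while a rate near zero on a payment-backed field may mean semantic loss is being hidden by auto-resolution.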

Failure Modes and Misconceptions

Connections

Connection 1: 046.md showed how stale reads create the preconditions for concurrent writes

Once Harbor Point allowed slightly stale regional reads for holds, two actors could make valid decisions from different snapshots. This lesson is the consequence: when both writes are accepted, the system needs an explicit rule for which intent survives.

Connection 2: 044.md explains why some invariants cannot be repaired by merge policy alone

Conflict resolution can decide between concurrent versions of one entity, but it cannot magically enforce "exactly one winner" for cabin ownership after independent confirmations. Some domains still need stronger coordination around the invariant itself.
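A small sketch of why a per-entity resolver cannot protect a cross-entity invariant. The second hold `H-9000` and booking `B-4000` are invented for illustration: each hold record resolves cleanly on its own, yet together they violate "one confirmed owner per cabin."

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class ConfirmedHold(NamedTuple):
    hold_id: str
    cabin: str
    booking_id: str


def double_booked_cabins(holds: Iterable[ConfirmedHold]) -> Dict[str, List[str]]:
    """Return each cabin claimed by more than one confirmed booking."""
    by_cabin: Dict[str, List[str]] = defaultdict(list)
    for hold in holds:
        by_cabin[hold.cabin].append(hold.booking_id)
    return {cabin: ids for cabin, ids in by_cabin.items() if len(ids) > 1}


# Each record is the already-resolved state of a different hold entity.
resolved_holds = [
    ConfirmedHold("H-8821", "C14", "B-3107"),
    ConfirmedHold("H-9000", "C14", "B-4000"),  # hypothetical confirmation from another region
    ConfirmedHold("H-9001", "C15", "B-4001"),  # hypothetical, no conflict
]

violations = double_booked_cabins(resolved_holds)
print(violations)  # {'C14': ['B-3107', 'B-4000']}
```

A check like this can only detect the violation after the fact; preventing it requires coordinating the confirmations themselves.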

Connection 3: 048.md turns storage policy into API semantics

A resolver is only half the story. The next lesson looks at how systems expose these choices to clients through stale-read contracts, conditional writes, and consistency levels that shape user-visible behavior.

Key Insights

  1. A conflict is a concurrency fact, not just a value mismatch - The store needs version ancestry to tell whether one write superseded another or whether both were accepted independently.
  2. Conflict policy is domain logic in disguise - Last-writer-wins, merge functions, and rejection paths each encode a different answer about which user intent matters most.
  3. Merge is only correct when it preserves the real invariant - Safe field composition does not automatically imply a safe entity state.
  4. The policy must be visible above the database - APIs, retries, observability, and compensation workflows all need to know whether a write is overwritten, merged, or rejected.