Day 476: Conflict Resolution Policies in Distributed Stores

The core idea: Once a distributed store accepts concurrent writes, conflict resolution is no longer storage plumbing; it is the rule that decides which user intent survives, which intent is merged, and which must be rejected.

Today's "Aha!" Moment

In 046.md, Harbor Point gave its Lisbon sales desk permission to read slightly stale cabin-hold data because search results could tolerate a small freshness budget. That decision pays off in latency, but it creates a follow-on problem the team cannot dodge. At 14:04:52, a Lisbon agent extends hold H-8821 on cabin C14 from 14:10 to 14:15 because the customer is still on the phone. At 14:04:53, the Baltimore payment service turns that same hold into confirmed booking B-3107 after the card authorization clears. Both writes start from version 18 of the hold record because neither region has seen the other's update yet.

When the replicas reconcile, they do not have "one newer value and one older value." They have two children of the same parent:

v18  status=active, expires_at=14:10
|- v19-lis  status=active, expires_at=14:15
\- v19-iad  status=confirmed, booking_id=B-3107

That is the non-obvious insight: a conflict is not merely two replicas disagreeing. A conflict is the system admitting two writes, neither of which causally supersedes the other. From that moment on, "last write wins" is not a neutral cleanup step. It is a product decision to let timestamp order decide whether Harbor Point keeps a confirmed booking or silently falls back to an extended hold.

Teams get hurt when they treat conflict resolution as an implementation detail hidden inside the database. The policy leaks straight into support load, compensation workflows, and customer trust. If the policy is wrong, the database will still converge, but it will converge to the wrong truth.

Why This Matters

Distributed stores accumulate conflicts anywhere coordination is relaxed: multi-leader replication, leaderless quorums, offline clients, CDC-fed materializations, and even async workflows that write back into the same entity from different regions. The store must eventually choose among three broad outcomes: keep one version, merge versions, or surface the conflict to a higher layer. Each option trades latency, availability, and semantic safety against one another in a different way.

Harbor Point's hold record makes the pressure visible. Some fields can be overwritten cheaply. An agent's draft note or a UI preference can often tolerate a last-writer-wins rule because losing one intermediate value does not break the booking system. A payment-backed state transition cannot. If a conflict policy lets a later timestamp erase status=confirmed, the database has preserved convergence while destroying revenue and auditability.

The production consequence is that conflict policy must be designed from invariants outward. Ask what property has to remain true after replicas diverge and heal. If the answer is "there can be only one confirmed owner of cabin C14," then a generic merge rule is not enough. If the answer is "combine the set of sales channels that touched this hold," a merge may be perfect. The important step is making that choice explicit before the incident makes it for you.

Core Walkthrough

Part 1: Detect concurrency before trying to resolve it

A store cannot resolve a conflict correctly if it cannot distinguish "stale write arrived late" from "two writes were concurrent." Harbor Point attaches revision metadata to each hold update. In a single-leader system, the metadata might be a log position. In a multi-leader or occasionally disconnected system, it is usually some per-entity ancestry marker such as a revision tree, vector-clock-style version, or origin-plus-counter pair that lets the store compare descendants.

The comparison rule is simple:

If the incoming version descends from the current one, the incoming write is the new truth.
If the current version descends from the incoming one, the incoming write is stale and should not overwrite the newer state.
If neither descends from the other, the writes are concurrent and the conflict policy must run.

That logic looks like this:

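def classify(local_version, incoming_version):
    # The incoming write has seen the local state: accept it.
    if incoming_version.descends_from(local_version):
        return "incoming-wins"
    # The local state has already seen the incoming write: drop it as stale.
    if local_version.descends_from(incoming_version):
        return "stale-write"
    # Neither has seen the other: the conflict policy must decide.
    return "concurrent-conflict"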

The key point is that timestamps alone do not answer this question. Two clocks can disagree, and even perfectly synchronized clocks only tell you when an update was stamped, not whether one writer had seen the other writer's result. Harbor Point can use timestamps as one input to a resolver, but not as its entire notion of causality.
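One way to make descends_from concrete is a vector-clock-style version. The following is a minimal sketch, not Harbor Point's actual metadata format: the VectorClock class, the region keys lis and iad, and the starting counters are assumptions chosen only to illustrate the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorClock:
    """Per-entity version: one write counter per origin (hypothetical names)."""
    counters: tuple = ()  # sorted tuple of (origin, count) pairs

    @staticmethod
    def of(**counts):
        return VectorClock(tuple(sorted(counts.items())))

    def bump(self, origin):
        """Version produced when `origin` writes on top of this one."""
        counts = dict(self.counters)
        counts[origin] = counts.get(origin, 0) + 1
        return VectorClock(tuple(sorted(counts.items())))

    def descends_from(self, other):
        """True if this version has seen every write `other` has seen.

        Equal versions descend from each other, so replaying the same
        write is classified as "incoming-wins" below.
        """
        mine = dict(self.counters)
        return all(mine.get(origin, 0) >= count for origin, count in other.counters)


def classify(local_version, incoming_version):
    if incoming_version.descends_from(local_version):
        return "incoming-wins"
    if local_version.descends_from(incoming_version):
        return "stale-write"
    return "concurrent-conflict"


v18 = VectorClock.of(lis=10, iad=8)   # illustrative counters for "version 18"
v19_lis = v18.bump("lis")             # Lisbon extends the hold
v19_iad = v18.bump("iad")             # Baltimore confirms the booking

print(classify(v18, v19_lis))         # descendant of the local version
print(classify(v19_lis, v19_iad))     # siblings of the same parent
```

In this sketch the two v19 branches are classified as concurrent even though their wall-clock stamps (14:04:52 and 14:04:53) would suggest a clean order, which is the gap between timestamp order and causality.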

This detection step matters because each downstream policy assumes different information. Last-writer-wins assumes a total order exists, even if it is synthetic. Merge functions assume the entity can be decomposed into fields or operations that combine safely. Rejection assumes the application can make progress by retrying from a newer version. If concurrency is misclassified, the store applies the wrong semantics before anyone sees the mistake.

Part 2: Match the resolution policy to the invariant, not to convenience

Harbor Point eventually uses three different policies inside the same reservation domain because the data does not all mean the same thing:

| Data shape | Policy | Why it fits | Where it fails |
| --- | --- | --- | --- |
| Agent draft note on a hold | Last-writer-wins | The note is replaceable text, and losing an older draft is acceptable | It is unsafe for payment or inventory state |
| Set of agencies that touched the hold | Merge by set union | Membership is additive, so combining both branches preserves meaning | It would be wrong for a single-valued field such as cabin owner |
| Hold lifecycle state: `active`, `released`, `confirmed` | Domain-specific resolver with rejection paths | The meaning depends on a state machine and side effects such as payment capture | It requires more code, more tests, and sometimes a manual or retried path |
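The first two policies are generic enough to sketch in a few lines. This is an illustration under assumptions: the Stamped wrapper, the field names draft_note and agencies, and the FIELD_POLICIES registry are hypothetical and not part of any real store's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stamped:
    """A field value plus the write metadata LWW needs (hypothetical shape)."""
    value: str
    timestamp: float   # writer's wall clock; may be skewed
    origin: str        # tiebreaker so every replica picks the same winner


def lww(left, right):
    """Last-writer-wins: keep one value and silently drop the other."""
    return max(left, right, key=lambda s: (s.timestamp, s.origin))


def union(left, right):
    """Set union: additive, so neither branch loses a member."""
    return left | right


# Policy is chosen per field, not per store.
FIELD_POLICIES = {"draft_note": lww, "agencies": union}


def merge_fields(left, right):
    """Merge fields that have a registered policy; refuse everything else.

    Assumes both branches carry the same field names.
    """
    merged = {}
    for name in left:
        if name not in FIELD_POLICIES:
            raise KeyError(f"no generic merge policy for field {name!r}")
        merged[name] = FIELD_POLICIES[name](left[name], right[name])
    return merged


lis = {"draft_note": Stamped("customer still on phone", 1000.0, "lis"),
       "agencies": frozenset({"lisbon-desk"})}
iad = {"draft_note": Stamped("card cleared", 1001.0, "iad"),
       "agencies": frozenset({"web"})}
merged = merge_fields(lis, iad)
```

Refusing unregistered fields is a deliberate choice in this sketch: a field such as status has no entry, so it cannot fall through to a generic merge by accident.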

For hold H-8821, Harbor Point encodes the lifecycle rule directly:

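def resolve_hold(left, right):
    # Causally ordered versions are not conflicts: the descendant wins.
    if left.version.descends_from(right.version):
        return left
    if right.version.descends_from(left.version):
        return right

    # From here on, the two branches are concurrent.
    states = {left.status, right.status}

    # A confirmed booking takes precedence over any concurrent hold change.
    if "confirmed" in states:
        return choose_confirmed_branch(left, right)

    # Two concurrent extensions of an active hold: keep the later expiry.
    if left.status == right.status == "active":
        return branch_with_later_expiry(left, right)

    # Anything else (for example released vs. active) is not merged automatically.
    return ConflictRequiresRetry([left, right])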

The important detail is not the exact function names. It is the semantic precedence:

`confirmed` is an absorbing state because Harbor Point has already captured payment and issued booking B-3107.
Two concurrent `active` extensions can be merged by taking the later expiry only because both branches represent the same hold identity and the business rule allows a monotonic extension.
Some combinations are not merged at all. If one branch says `released` and another says `active`, Harbor Point forces a retry or review because blindly choosing one branch could reopen a hold that another workflow intentionally ended.

This is why last-writer-wins is often too blunt. It answers every conflict with "pick one timestamp," even when the real question is "which state transition is legally allowed after payment, release, or expiry?" A merge function can be equally dangerous if it combines fields that look composable but break the entity-level invariant. For example, taking the max of two expires_at values is only safe when both branches refer to the same hold and no other process has already consumed the cabin.
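The helpers named in resolve_hold are not defined in this lesson. The sketch below shows one plausible shape for them, with the guards described above made explicit. The Hold fields and the behavior for two different confirmed bookings are assumptions; the check that no other process has already consumed the cabin needs information outside the hold record and is not modeled here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ConflictRequiresRetry:
    """Returned (not raised) when no branch can be chosen safely."""
    branches: list


@dataclass(frozen=True)
class Hold:
    hold_id: str
    status: str
    expires_at: Optional[datetime] = None   # assumed to be set on active holds
    booking_id: Optional[str] = None


def choose_confirmed_branch(left, right):
    confirmed = [h for h in (left, right) if h.status == "confirmed"]
    # Two confirmations with different bookings cannot be merged away.
    if len(confirmed) == 2 and left.booking_id != right.booking_id:
        return ConflictRequiresRetry([left, right])
    return confirmed[0]


def branch_with_later_expiry(left, right):
    # Taking the max is only meaningful for the same hold identity.
    if left.hold_id != right.hold_id:
        return ConflictRequiresRetry([left, right])
    return max(left, right, key=lambda h: h.expires_at)
```

Returning a conflict value for two distinct confirmed bookings is one possible choice; a real system might instead route that case to a compensation workflow.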

Part 3: Resolution policy changes APIs and operations

Conflict resolution does not stop at storage internals. It changes what the API is allowed to promise. Harbor Point exposes three write styles to make the policy visible:

Blind overwrite endpoints are reserved for replaceable fields such as internal notes. They behave like last-writer-wins and say so clearly.
Conditional writes require an If-Match revision for lifecycle changes. If the caller writes against version 18 but the record has already moved past it, whether to a descendant or to a concurrent branch, the API rejects the request and forces the client to refetch.
Merge-aware commands exist only where Harbor Point has a tested domain merge, such as extending an already active hold under the same idempotency key.
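The difference between the first two write styles can be sketched with an in-memory store. This is a simplified, single-process illustration: the HoldStore class, its method names, and the integer revision are assumptions, and a multi-region system would compare ancestry markers rather than a single counter. Merge-aware commands are not modeled.

```python
class PreconditionFailed(Exception):
    """Mirrors HTTP 412: the If-Match revision is no longer current."""


class HoldStore:
    """In-memory stand-in for the hold API (hypothetical)."""

    def __init__(self, records):
        self._records = dict(records)  # hold_id -> (revision, fields)

    def get(self, hold_id):
        return self._records[hold_id]

    def put_note(self, hold_id, note):
        """Blind overwrite: no revision check, so the last writer wins."""
        revision, fields = self._records[hold_id]
        self._records[hold_id] = (revision + 1, {**fields, "note": note})

    def put_status(self, hold_id, status, if_match):
        """Conditional write: applied only if the caller saw the current revision."""
        revision, fields = self._records[hold_id]
        if if_match != revision:
            raise PreconditionFailed(f"If-Match {if_match}, current revision {revision}")
        self._records[hold_id] = (revision + 1, {**fields, "status": status})
        return revision + 1


store = HoldStore({"H-8821": (18, {"status": "active", "note": ""})})
new_revision = store.put_status("H-8821", "confirmed", if_match=18)  # accepted

try:
    # A second writer still holding revision 18 is rejected, not merged.
    store.put_status("H-8821", "released", if_match=18)
    rejected = False
except PreconditionFailed:
    rejected = True
    current_revision, current_fields = store.get("H-8821")  # refetch, then decide
```

The rejected caller learns that its snapshot was stale and has to decide again from the confirmed state, instead of having its release applied silently on top of a booking.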

That API split is not extra ceremony. It prevents clients from assuming that every conflict is silently handled in their favor. The next lesson, 048.md, turns this into a broader consistency question: what exact semantic contract does an API make when it serves stale data, retries a write, or rejects a concurrent update?

Operations change as well. Harbor Point no longer watches only "replication lag" and "write errors." It tracks:

how many conflicts were auto-resolved by policy,
how many were rejected back to clients,
how many entered a compensation or review queue,
which policies fired for each entity type, and
whether timestamp-based fallbacks correlate with clock-skew alarms.
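A rough sketch of how those counters might be kept follows. The ConflictMetrics class, the outcome labels, and the entity and policy names are all hypothetical; a production system would more likely emit them as labeled counters to its existing metrics pipeline, and correlating fallbacks with clock-skew alarms would happen there rather than in this code.

```python
from collections import Counter


class ConflictMetrics:
    OUTCOMES = ("auto_resolved", "rejected", "queued_for_review")

    def __init__(self):
        self.counts = Counter()               # (entity_type, policy, outcome) -> n
        self.timestamp_fallbacks = Counter()  # entity_type -> n

    def record(self, entity_type, policy, outcome, timestamp_fallback=False):
        if outcome not in self.OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        self.counts[(entity_type, policy, outcome)] += 1
        if timestamp_fallback:
            self.timestamp_fallbacks[entity_type] += 1

    def rate(self, entity_type, outcome):
        """Share of this entity type's conflicts that ended in `outcome`."""
        total = sum(n for (etype, _, _), n in self.counts.items()
                    if etype == entity_type)
        hits = sum(n for (etype, _, out), n in self.counts.items()
                   if etype == entity_type and out == outcome)
        return hits / total if total else 0.0


metrics = ConflictMetrics()
metrics.record("hold_note", "lww", "auto_resolved", timestamp_fallback=True)
metrics.record("hold_lifecycle", "domain_resolver", "auto_resolved")
metrics.record("hold_lifecycle", "domain_resolver", "rejected")
```

Keying by entity type and policy keeps a spike in rejected lifecycle conflicts from being averaged away by a large volume of harmless note overwrites.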

This is the real trade-off surface. Aggressive auto-resolution keeps latency low and reduces client retries, but it can hide semantic loss until finance or support notices. Conservative rejection preserves invariants more reliably, but it pushes complexity into clients, queues, and operational workflows. Good systems choose that boundary on purpose.

Failure Modes and Misconceptions

Issue: "Last-writer-wins is the safe default because the system always converges."
- Why it is tempting: It is simple to implement and removes sibling versions quickly.
- Corrective mental model: Convergence only means replicas agree. It does not mean they agree on the right business outcome.
- Operational fix: Use LWW only for state that is genuinely replaceable and harmless to overwrite.

Issue: "If a merge function composes field values, it preserves the entity invariant."
- Why it is tempting: Field-level merges look elegant and eliminate retries.
- Corrective mental model: Invariants live at the entity or workflow level. A valid merge of fields can still create an impossible state transition.
- Operational fix: Test conflict policies against real state-machine transitions and side effects, not just against JSON shape compatibility.

Issue: "Clock synchronization is enough to make timestamp ordering trustworthy."
- Why it is tempting: Better NTP reduces obvious skew, so teams assume the ordering problem is solved.
- Corrective mental model: Timestamp order is not causality. A well-synchronized clock still cannot prove one writer observed another writer's change.
- Operational fix: Keep explicit version ancestry and use timestamps only as a narrow tiebreaker when the domain permits it.

Issue: "Conflicts are rare, so manual review is fine."
- Why it is tempting: Under normal traffic, concurrent branches may be uncommon.
- Corrective mental model: Conflict bursts happen exactly when the system is stressed by retries, partitions, offline clients, or regional failover.
- Operational fix: Design the resolver, rejection path, and review queue before the burst arrives, and instrument the rate per entity type.
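Testing a policy against state-machine transitions can start as a small table-driven check over every pair of concurrent statuses. The sketch below is an assumption-laden illustration: resolve_status is a status-only stand-in that mirrors the precedence of resolve_hold, and the three expectations are one reading of Harbor Point's rules, not an exhaustive specification.

```python
from itertools import product

STATUSES = ("active", "released", "confirmed")
RETRY = "retry"


def resolve_status(left, right):
    """Status-only stand-in for the lifecycle resolver's precedence."""
    if "confirmed" in (left, right):
        return "confirmed"
    if left == right == "active":
        return "active"
    return RETRY


def naive_lww(left, right):
    """Pretend the right-hand branch always carries the later timestamp."""
    return right


def check_policy(resolve):
    """List every expectation the policy violates across concurrent status pairs."""
    violations = []
    for left, right in product(STATUSES, repeat=2):
        outcome = resolve(left, right)
        if outcome != resolve(right, left):
            violations.append((left, right, "outcome depends on argument order"))
        if "confirmed" in (left, right) and outcome != "confirmed":
            violations.append((left, right, "confirmed was lost"))
        if {left, right} == {"active", "released"} and outcome != RETRY:
            violations.append((left, right, "released vs active resolved silently"))
    return violations


print(check_policy(resolve_status))   # no violations for the domain resolver
print(len(check_policy(naive_lww)))   # LWW breaks several expectations
```

A check like this only covers status precedence. Side effects such as payment capture still need tests against the real workflow.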

Connections

Connection 1: 046.md showed how stale reads create the preconditions for concurrent writes

Once Harbor Point allowed slightly stale regional reads for holds, two actors could make valid decisions from different snapshots. This lesson is the consequence: when both writes are accepted, the system needs an explicit rule for which intent survives.

Connection 2: 044.md explains why some invariants cannot be repaired by merge policy alone

Conflict resolution can decide between concurrent versions of one entity, but it cannot magically enforce "exactly one winner" for cabin ownership after independent confirmations. Some domains still need stronger coordination around the invariant itself.

Connection 3: 048.md turns storage policy into API semantics

A resolver is only half the story. The next lesson looks at how systems expose these choices to clients through stale-read contracts, conditional writes, and consistency levels that shape user-visible behavior.

Resources

[PAPER] Dynamo: Amazon's Highly Available Key-value Store
- Focus: Read the sections on vector clocks, sibling versions, and application-assisted conflict resolution to see why high availability pushes semantics upward.

[DOC] Apache CouchDB Documentation: Conflicts
- Focus: Study revision trees and how CouchDB surfaces conflicts instead of pretending timestamp order always captures intent.

[PAPER] A Comprehensive Study of Convergent and Commutative Replicated Data Types
- Focus: Use this as the reference for when merge-based resolution is mathematically sound and when CRDT-style convergence is a better fit than ad hoc field merging.

[BOOK] Designing Data-Intensive Applications
- Focus: Revisit the replication chapter with attention to multi-leader conflicts, causal metadata, and why LWW is often data loss with a cleaner dashboard.

Key Insights

A conflict is a concurrency fact, not just a value mismatch - The store needs version ancestry to tell whether one write superseded another or whether both were accepted independently.
Conflict policy is domain logic in disguise - Last-writer-wins, merge functions, and rejection paths each encode a different answer about which user intent matters most.
Merge is only correct when it preserves the real invariant - Safe field composition does not automatically imply a safe entity state.
The policy must be visible above the database - APIs, retries, observability, and compensation workflows all need to know whether a write is overwritten, merged, or rejected.
