Day 484: Snapshot Isolation and Anomaly Boundaries

The core idea: Snapshot isolation gives each transaction a stable MVCC view and rejects direct write-write collisions, but it does not automatically protect invariants that depend on a set of rows rather than one row version.

Today's "Aha!" Moment

In 052.md, Harbor Point finally gave POST /bookings/confirm a real transaction boundary. That fixed the obvious failure mode: cabin state, loyalty debit, promo decrement, and booking record now commit together instead of leaking half-finished work. Product then adds a new rule for disruption handling: every Atlantic sailing must keep at least one unsold suite in reserve until departure day so operations can reaccommodate stranded passengers.

At 09:00, two agents open separate upgrade confirmations. Each transaction reads the same snapshot and sees two unsold suites, S12 and S14. Agent one books S12; agent two books S14. Because the transactions write different rows, a snapshot-isolated database can let both commits succeed. No dirty reads occurred. No partial transaction leaked. Each agent saw a perfectly consistent picture. Harbor Point still broke the policy.

That is the useful boundary to internalize. Snapshot isolation is not "weak" because it shows inconsistent data; it is strong at giving each transaction a coherent point-in-time view. The gap is elsewhere: it usually notices direct write-write conflicts, not every cross-row business rule. The moment an invariant sounds like "at least one," "no more than N across this set," or "none exists matching this predicate," you have to ask whether the database will create a conflict for that rule or whether two valid local decisions can still combine into an invalid global outcome.

Why This Matters

Snapshot isolation is popular because it buys a lot of concurrency for OLTP systems. Readers do not block writers the way lock-heavy schemes do, long-running reports can read a stable picture, and many hot paths stop seeing spurious non-repeatable reads. For Harbor Point, that means customer-service agents can keep confirming bookings during a flash sale without every read waiting behind unrelated updates.

The production problem starts when teams confuse "transactional" with "serializable." A transactional booking flow under snapshot isolation can still violate invariants that live across multiple rows, multiple index entries, or the absence of a row. The database is honoring its contract; the team is asking it to enforce a stronger one than it promised. That mismatch usually appears only after a product rule changes from single-record correctness to set-level correctness.

Knowing the anomaly boundary changes design decisions. If the rule is row-local, snapshot isolation may be the right trade-off: high concurrency, simple recovery, and predictable latency. If the rule is predicate-based, the options change: use serializable isolation, materialize a shared conflict record, take explicit locks, or redesign the invariant into escrow-like quotas. The key is to choose consciously instead of treating a missing anomaly as a lucky benchmark result.

Core Walkthrough

Part 1: What snapshot isolation actually gives you

Under classic snapshot isolation, each transaction starts by selecting a visible snapshot of committed versions. Reads come from that fixed snapshot; later commits from other transactions do not suddenly appear halfway through the transaction. Writes create new versions that stay private until commit. At commit time, the engine usually applies a first-committer-wins rule for rows both transactions tried to modify.

For Harbor Point, the booking database behaves roughly like this:

1. T1 starts and receives snapshot S100.
2. T2 starts and receives snapshot S101, which is still effectively the same business state.
3. T1 reads suites S12 and S14 as available.
4. T2 reads suites S12 and S14 as available.
5. T1 updates S12 and commits.
6. T2 updates S14 and commits.
7. Both commits succeed because the write sets do not overlap.

The important detail is that snapshot isolation gives stability, not omniscience. T2 never sees a dirty intermediate write from T1. T2 also never rereads the table and notices that the world changed before commit. That fixed snapshot is precisely why analytics, long-running reads, and many request flows behave more predictably. It is also why a transaction can make a correct decision relative to an old snapshot and still help create an incorrect final state.

A useful working rule is this: snapshot isolation protects you best when the invariant collapses to the same versioned object that the transaction writes. If two concurrent transactions must update the exact same cabin row, one of them will usually abort. If the invariant lives across a set of rows and each transaction can satisfy its local check while touching a different member of that set, snapshot isolation may never notice the conflict.
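The seven-step timeline above can be sketched as a toy in-memory model. This is an illustrative simplification, not a real engine: the `SnapshotStore` and `Txn` classes and the commit-log check are invented here to show first-committer-wins behavior, assuming conflicts are detected only on overlapping write sets.

```python
# Toy MVCC sketch (illustrative, not a real engine): each transaction reads
# from the snapshot taken at start, and conflicts only when a row it wrote
# was committed by another transaction after its snapshot was taken.

class SnapshotStore:
    def __init__(self, rows):
        self.committed = dict(rows)   # row -> current committed value
        self.commit_log = []          # (txn_id, row) pairs, in commit order

    def begin(self, txn_id):
        return Txn(self, txn_id, dict(self.committed))

class Txn:
    def __init__(self, store, txn_id, snapshot):
        self.store, self.txn_id = store, txn_id
        self.snapshot = snapshot                  # fixed view for the whole txn
        self.writes = {}                          # private until commit
        self.start_len = len(store.commit_log)    # where our snapshot was cut

    def read(self, row):
        return self.writes.get(row, self.snapshot[row])

    def write(self, row, value):
        self.writes[row] = value

    def commit(self):
        # First-committer-wins: abort if any row we wrote was committed
        # by someone else after our snapshot was taken.
        for _, row in self.store.commit_log[self.start_len:]:
            if row in self.writes:
                raise RuntimeError(f"{self.txn_id}: write-write conflict on {row}")
        for row, value in self.writes.items():
            self.store.committed[row] = value
            self.store.commit_log.append((self.txn_id, row))

store = SnapshotStore({"S12": "available", "S14": "available"})
t1 = store.begin("T1")                   # snapshot S100
t2 = store.begin("T2")                   # snapshot S101, same business state
t1.write("S12", "booked"); t1.commit()   # succeeds
t2.write("S14", "booked"); t2.commit()   # also succeeds: disjoint write sets
print(store.committed)                   # both suites booked, no reserve left
```

Rerunning the ending with both transactions targeting the same row makes `commit` raise, which is the row-local case where snapshot isolation does protect you.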

Part 2: The write-skew boundary in Harbor Point's reserve-suite policy

Harbor Point stores each suite as its own row:

SELECT cabin_id
FROM cabins
WHERE voyage_id = 9001
  AND class = 'suite'
  AND status = 'available';

The application rule is simple: reject a sale if fewer than two suites are available, because booking one of the remaining two would consume the last reserve cabin. The confirmation code therefore does something like this:

def confirm_suite_upgrade(tx, voyage_id, booking_id):
    # Read from this transaction's snapshot: a stable, point-in-time view.
    available = tx.query_available_suites(voyage_id)

    # Policy check: a sale must leave at least one suite in reserve.
    if len(available) < 2:
        raise ReservePolicyViolation()

    # Book one suite; the write stays private until commit.
    chosen = choose_suite(available)
    tx.mark_booked(chosen.cabin_id, booking_id)

Nothing is obviously wrong with the code. Each transaction reads a stable snapshot, checks the policy, and books one suite. The anomaly appears because both transactions read the same pre-commit snapshot and then update different rows. This is the classic write-skew shape: the dangerous dependency is through the predicate "there must remain at least one suite," not through a shared row version.
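The shape can be reproduced without a database at all. The sketch below is hypothetical and deliberately minimal: each agent's snapshot is a plain copy of the committed state, each write set is a dict, and "commit" is a merge, which is enough to show both policy checks passing and the invariant still breaking.

```python
# Hypothetical sketch of the 09:00 incident: both agents validate the
# reserve policy against the same pre-commit snapshot, then write
# different rows, so neither transaction sees a conflict.

committed = {"S12": "available", "S14": "available"}

def confirm_suite_upgrade(snapshot, writes, cabin_id):
    available = [c for c, s in snapshot.items() if s == "available"]
    if len(available) < 2:           # policy: keep one suite in reserve
        raise RuntimeError("ReservePolicyViolation")
    writes[cabin_id] = "booked"

snap1 = dict(committed)   # T1's snapshot
snap2 = dict(committed)   # T2's snapshot: same business state
w1, w2 = {}, {}

confirm_suite_upgrade(snap1, w1, "S12")   # sees two available -> allowed
confirm_suite_upgrade(snap2, w2, "S14")   # also sees two -> allowed

committed.update(w1)      # disjoint write sets: both commits succeed
committed.update(w2)
reserve = [c for c, s in committed.items() if s == "available"]
print(len(reserve))       # 0 -- the "at least one suite" invariant is broken
```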

That distinction is the boundary in the lesson title. Snapshot isolation usually prevents:

1. Dirty reads: a transaction never sees another transaction's uncommitted writes.
2. Non-repeatable reads: every read in a transaction comes from the same fixed snapshot.
3. Lost updates on a single row: when two concurrent transactions modify the same row, first-committer-wins aborts one of them.

Snapshot isolation does not automatically prevent:

1. Write skew: two transactions read overlapping data to check a predicate, then write disjoint rows, and both commit.
2. Predicate anomalies involving absent rows: a rule like "no other booking exists matching this condition" can be violated by an insert the snapshot never shows.

If Harbor Point changes the schema so every suite sale must also decrement a single reserve_suite_budget row for that voyage, the anomaly boundary changes. Now both transactions contend on the same record and snapshot isolation will surface a direct conflict. The invariant became enforceable not because the database got smarter, but because the data model materialized the business rule into one object the concurrency control system can see.
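Extending the toy model shows why the guard row works. The `reserve_suite_budget` key below stands in for the single row the text describes; the names and the simplified first-committer-wins check are assumptions of this sketch, not a real engine's behavior.

```python
# Sketch of materializing the reserve rule into one guard row: every suite
# sale must also decrement the shared reserve_suite_budget entry, so two
# concurrent sales now produce a direct write-write conflict.

committed = {"S12": "available", "S14": "available", "reserve_suite_budget": 1}
commit_log = []

def commit(txn_id, start_len, writes):
    # First-committer-wins on overlapping write sets.
    for _, row in commit_log[start_len:]:
        if row in writes:
            raise RuntimeError(f"{txn_id} aborted: conflict on {row}")
    for row, value in writes.items():
        committed[row] = value
        commit_log.append((txn_id, row))

start = len(commit_log)
snapshot = dict(committed)     # both agents start from the same state

# Each sale books a different suite but spends the same shared budget.
w1 = {"S12": "booked", "reserve_suite_budget": snapshot["reserve_suite_budget"] - 1}
w2 = {"S14": "booked", "reserve_suite_budget": snapshot["reserve_suite_budget"] - 1}

commit("T1", start, w1)        # succeeds; budget drops to 0
try:
    commit("T2", start, w2)    # aborts: both wrote the budget row
except RuntimeError as e:
    print(e)
```

When T2 retries, its fresh snapshot shows a budget of zero, so the application-level check refuses the sale and the reserve suite survives.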

Part 3: Choosing the right response when snapshot isolation is not enough

Once the anomaly is visible, the temptation is to say "use serializable everywhere." That can work, but it is not the only answer and it is not free. Serializable isolation closes the gap by detecting or preventing dangerous read-write cycles, which means more aborts, heavier locking or validation, and more pressure on hot predicates. For Harbor Point's busiest sailings, that may be acceptable for booking confirmation and unnecessary for back-office reporting.

The practical options are narrower and more mechanical than the slogan suggests:

1. Raise isolation to serializable, but only for the transactions that carry the predicate invariant, such as booking confirmation.
2. Materialize the invariant into a single shared record, such as a per-voyage reserve_suite_budget row, so conflicting sales collide on one object the engine can see.
3. Take explicit locks on the rows the predicate reads, accepting the blocking that implies on hot sailings.
4. Redesign the invariant into escrow-like quotas, where each transaction spends from a pre-allocated budget instead of re-checking a global condition at read time.

The trade-off is straightforward. Snapshot isolation often gives better throughput and less blocking because it avoids turning every read into a lock negotiation. Serializable techniques reduce anomaly space, but they spend coordination budget through retries, blocking, predicate locks, or conflict tracking. Production engineering is deciding where that budget belongs. Harbor Point should spend it on the customer-visible booking promise, not on every dashboard or audit query that merely needs a stable historical view.
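One mechanical consequence of spending that budget on serializable isolation is that the application must expect aborts and retry. The wrapper below is a minimal sketch: `SerializationFailure` and `flaky_booking` are invented stand-ins for a driver's serialization-conflict error and a booking transaction on a hot voyage.

```python
import random
import time

class SerializationFailure(Exception):
    """Stand-in for a database driver's serialization-conflict error."""

def run_with_retries(do_txn, attempts=5):
    # Retry only on serialization conflicts; other errors propagate.
    for attempt in range(attempts):
        try:
            return do_txn()
        except SerializationFailure:
            if attempt == attempts - 1:
                raise
            # Jittered exponential backoff keeps retries on a hot
            # predicate from re-colliding in lockstep.
            time.sleep(random.uniform(0, 0.01 * (2 ** attempt)))

calls = {"n": 0}
def flaky_booking():
    calls["n"] += 1
    if calls["n"] < 3:
        raise SerializationFailure()   # simulated aborts on a hot voyage
    return "booked"

print(run_with_retries(flaky_booking))  # booked (after two retries)
```

The retry loop is part of the coordination budget: it trades latency on contended sailings for the guarantee that committed histories are serializable.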

Failure Modes and Misconceptions

1. "It ran in a transaction, so it is correct." Atomicity and isolation are separate promises; a perfectly atomic flow under snapshot isolation can still admit write skew.
2. "Snapshot isolation is weak because it shows inconsistent data." It does not; each transaction sees a coherent point-in-time view. The gap is in which conflicts the engine detects, not in read consistency.
3. "We load-tested the flow and never saw the anomaly." Write skew requires a specific interleaving on a specific predicate; a missing anomaly is a lucky benchmark result, not evidence of safety.
4. "We can just re-check the condition right before commit." Under snapshot isolation the re-check still reads the original snapshot, so it cannot observe a concurrent transaction's commit.

Connections

Connection 1: 052.md separated transaction boundaries from side effects

The previous lesson established that Harbor Point needs one atomic commit decision for booking state. This lesson narrows the question: even with a real transaction, what concurrent histories is the database still allowed to accept?

Connection 2: 051.md matters because sharding only solved ownership, not anomaly shape

Keeping decisive writes inside one shard made the booking path easier to reason about, but snapshot isolation shows that a single shard can still admit invalid multi-row outcomes when the invariant is predicate-based.

Connection 3: 054.md is the natural next step

Once you can name write skew and other snapshot-isolation boundaries, the next design question is how real systems approximate serializable behavior without turning every workload into one global lock.

Key Takeaways

  1. Snapshot isolation gives each transaction a stable committed snapshot and typically rejects direct write-write collisions, which is why it performs well on many OLTP paths.
  2. The main anomaly boundary is cross-row or predicate-based invariants: two transactions can each make a locally valid decision and still violate a global rule together.
  3. When snapshot isolation is too weak, the fix is to create a real conflict surface with serializable isolation, explicit locks, or a materialized guard record rather than relying on blind retries.
  4. The question is never "is snapshot isolation good or bad?"; it is "does this invariant map to the conflicts snapshot isolation actually knows how to see?"
