LESSON
Day 484: Snapshot Isolation and Anomaly Boundaries
The core idea: Snapshot isolation gives each transaction a stable MVCC view and rejects direct write-write collisions, but it does not automatically protect invariants that depend on a set of rows rather than one row version.
Today's "Aha!" Moment
In 052.md, Harbor Point finally gave POST /bookings/confirm a real transaction boundary. That fixed the obvious failure mode: cabin state, loyalty debit, promo decrement, and booking record now commit together instead of leaking half-finished work. Product then adds a new rule for disruption handling: every Atlantic sailing must keep at least one unsold suite in reserve until departure day so operations can reaccommodate stranded passengers.
At 09:00, two agents open separate upgrade confirmations. Each transaction reads the same snapshot and sees two unsold suites, S12 and S14. Agent one books S12; agent two books S14. Because the transactions write different rows, a snapshot-isolated database can let both commits succeed. No dirty reads occurred. No partial transaction leaked. Each agent saw a perfectly consistent picture. Harbor Point still broke the policy.
That is the useful boundary to internalize. Snapshot isolation is not "weak" because it shows inconsistent data; it is strong at giving each transaction a coherent point-in-time view. The gap is elsewhere: it usually notices direct write-write conflicts, not every cross-row business rule. The moment an invariant sounds like "at least one," "no more than N across this set," or "none exists matching this predicate," you have to ask whether the database will create a conflict for that rule or whether two valid local decisions can still combine into an invalid global outcome.
Why This Matters
Snapshot isolation is popular because it buys a lot of concurrency for OLTP systems. Readers do not block writers the way lock-heavy schemes do, long-running reports can read a stable picture, and many hot paths stop seeing spurious non-repeatable reads. For Harbor Point, that means customer-service agents can keep confirming bookings during a flash sale without every read waiting behind unrelated updates.
The production problem starts when teams confuse "transactional" with "serializable." A transactional booking flow under snapshot isolation can still violate invariants that live across multiple rows, multiple index entries, or the absence of a row. The database is honoring its contract; the team is asking it to enforce a stronger one than it promised. That mismatch usually appears only after a product rule changes from single-record correctness to set-level correctness.
Knowing the anomaly boundary changes design decisions. If the rule is row-local, snapshot isolation may be the right trade-off: high concurrency, simple recovery, and predictable latency. If the rule is predicate-based, the options change: use serializable isolation, materialize a shared conflict record, take explicit locks, or redesign the invariant into escrow-like quotas. The key is to choose consciously instead of treating a missing anomaly as a lucky benchmark result.
Core Walkthrough
Part 1: What snapshot isolation actually gives you
Under classic snapshot isolation, each transaction starts by selecting a visible snapshot of committed versions. Reads come from that fixed snapshot; later commits from other transactions do not suddenly appear halfway through the transaction. Writes create new versions that stay private until commit. At commit time, the engine usually applies a first-committer-wins rule for rows both transactions tried to modify.
For Harbor Point, the booking database behaves roughly like this:
1. T1 starts and receives snapshot S100.
2. T2 starts and receives snapshot S101, which is still effectively the same business state.
3. T1 reads suites S12 and S14 as available.
4. T2 reads suites S12 and S14 as available.
5. T1 updates S12 and commits.
6. T2 updates S14 and commits.
7. Both commits succeed because the write sets do not overlap.
The important detail is that snapshot isolation gives stability, not omniscience. T2 never sees a dirty intermediate write from T1. T2 also never rereads the table and notices that the world changed before commit. That fixed snapshot is precisely why analytics, long-running reads, and many request flows behave more predictably. It is also why a transaction can make a correct decision relative to an old snapshot and still help create an incorrect final state.
A useful working rule is this: snapshot isolation protects you best when the invariant collapses to the same versioned object that the transaction writes. If two concurrent transactions must update the exact same cabin row, one of them will usually abort. If the invariant lives across a set of rows and each transaction can satisfy its local check while touching a different member of that set, snapshot isolation may never notice the conflict.
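That working rule can be made concrete with a toy model. The sketch below is illustrative only, not any real engine's API: it keeps one committed version per row and applies first-committer-wins validation at commit, so overlapping write sets conflict while disjoint ones sail through.

```python
# Toy versioned store with first-committer-wins commit validation.
# Illustrative model only; real engines differ in many details.

class Store:
    def __init__(self, rows):
        self.committed = {k: (0, v) for k, v in rows.items()}  # row -> (version, value)

    def begin(self):
        # A transaction is a frozen snapshot plus a private write set.
        return {"snap": dict(self.committed), "writes": {}}

    def commit(self, txn):
        # Abort if any written row gained a newer version since the snapshot.
        if any(self.committed[r][0] != txn["snap"][r][0] for r in txn["writes"]):
            return False
        for row, value in txn["writes"].items():
            self.committed[row] = (self.committed[row][0] + 1, value)
        return True

# Overlapping write sets: both transactions book the SAME suite.
store = Store({"S12": "available", "S14": "available"})
t1, t2 = store.begin(), store.begin()
t1["writes"]["S12"] = "booked"
t2["writes"]["S12"] = "booked"
same_row = (store.commit(t1), store.commit(t2))
print(same_row)    # (True, False) -- the second committer aborts

# Disjoint write sets: different suites, so no conflict is ever visible.
store = Store({"S12": "available", "S14": "available"})
t1, t2 = store.begin(), store.begin()
t1["writes"]["S12"] = "booked"
t2["writes"]["S14"] = "booked"
disjoint = (store.commit(t1), store.commit(t2))
print(disjoint)    # (True, True) -- both commit; the reserve policy can break
```

The second scenario is exactly the 09:00 incident from the walkthrough: two locally valid commits, no visible conflict, one broken global rule.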
Part 2: The write-skew boundary in Harbor Point's reserve-suite policy
Harbor Point stores each suite as its own row:
SELECT cabin_id
FROM cabins
WHERE voyage_id = 9001
  AND class = 'suite'
  AND status = 'available';
The application rule is simple: reject a sale if fewer than two suites are available, because selling when only one suite remains would consume the reserve cabin. The confirmation code therefore does something like this:
def confirm_suite_upgrade(tx, voyage_id, booking_id):
    # Read the available suites from this transaction's snapshot.
    available = tx.query_available_suites(voyage_id)
    if len(available) < 2:
        # Selling now would consume the reserve suite.
        raise ReservePolicyViolation()
    chosen = choose_suite(available)
    tx.mark_booked(chosen.cabin_id, booking_id)
Nothing is obviously wrong with the code. Each transaction reads a stable snapshot, checks the policy, and books one suite. The anomaly appears because both transactions read the same pre-commit snapshot and then update different rows. This is the classic write-skew shape: the dangerous dependency is through the predicate "there must remain at least one suite," not through a shared row version.
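The write-skew shape can be reproduced in miniature. In this sketch (hypothetical, in-memory only), both "transactions" evaluate the policy against the same frozen snapshot, so each local check passes while the combined outcome breaks the rule:

```python
# Both transactions read one frozen snapshot; neither sees the other's booking.
snapshot = {"S12": "available", "S14": "available"}
committed_bookings = []

def confirm_suite_upgrade(snapshot, chosen):
    available = [c for c, s in snapshot.items() if s == "available"]
    if len(available) < 2:                  # the reserve-policy check from the lesson
        raise RuntimeError("ReservePolicyViolation")
    committed_bookings.append(chosen)       # the write lands on a different row each time

confirm_suite_upgrade(snapshot, "S12")      # agent one: sees 2 available, check passes
confirm_suite_upgrade(snapshot, "S14")      # agent two: same snapshot, check passes too
print(committed_bookings)                   # ['S12', 'S14'] -- zero suites remain
```

Neither call could have detected the problem locally; the dependency runs through the predicate, not through any row either one wrote.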
That distinction is the boundary in the lesson title. Snapshot isolation usually prevents:
- dirty reads and half-committed visibility
- non-repeatable reads inside one transaction's snapshot
- many direct lost updates where two transactions overwrite the same row version
Snapshot isolation does not automatically prevent:
- write skew on disjoint rows that jointly enforce one invariant
- predicate-based anomalies such as "no row matching X may exist" unless the engine turns the predicate into a real conflict
- stale decisions in long transactions that never revalidate business rules against newer commits
If Harbor Point changes the schema so every suite sale must also decrement a single reserve_suite_budget row for that voyage, the anomaly boundary changes. Now both transactions contend on the same record and snapshot isolation will surface a direct conflict. The invariant became enforceable not because the database got smarter, but because the data model materialized the business rule into one object the concurrency control system can see.
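The guard-row idea can be sketched with the same kind of toy versioned store (hypothetical names and model, not a real engine): once every sale must rewrite the single reserve_suite_budget row, the write sets overlap and first-committer-wins fires.

```python
# Hypothetical sketch: materializing the reserve policy into one guard row.
# committed maps row -> (version, value); commit re-validates versions.

committed = {"budget:9001": (0, 2), "S12": (0, "available"), "S14": (0, "available")}

def begin():
    return {"snap": dict(committed), "writes": {}}

def commit(txn):
    # First-committer-wins: abort if any written row moved past our snapshot.
    if any(committed[r][0] != txn["snap"][r][0] for r in txn["writes"]):
        return False
    for row, value in txn["writes"].items():
        committed[row] = (committed[row][0] + 1, value)
    return True

t1, t2 = begin(), begin()

# Every suite sale now decrements the shared guard row AND books its suite.
t1["writes"].update({"budget:9001": t1["snap"]["budget:9001"][1] - 1, "S12": "booked"})
t2["writes"].update({"budget:9001": t2["snap"]["budget:9001"][1] - 1, "S14": "booked"})

ok1, ok2 = commit(t1), commit(t2)
print(ok1, ok2)   # True False -- the overlapping guard row forces t2 to abort
```

Note that t2's abort also rolls back its suite booking: S14 stays available, which is the whole point of routing every sale through one contended object.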
Part 3: Choosing the right response when snapshot isolation is not enough
Once the anomaly is visible, the temptation is to say "use serializable everywhere." That can work, but it is not the only answer and it is not free. Serializable isolation closes the gap by detecting or preventing dangerous read-write cycles, which means more aborts, heavier locking or validation, and more pressure on hot predicates. For Harbor Point's busiest sailings, that may be acceptable for booking confirmation and unnecessary for back-office reporting.
The practical options are narrower and more mechanical than the slogan suggests:
- keep snapshot isolation when the critical invariant is row-local and direct write conflicts are exactly the signal you want
- use serializable isolation when the invariant genuinely spans a predicate and the path is important enough to pay extra coordination cost
- materialize the invariant into a shared row or quota bucket when you want snapshot isolation's performance but need the business rule to create an explicit conflict
- move some checks into asynchronous reconciliation only when the product can tolerate temporary violation and repair
The trade-off is straightforward. Snapshot isolation often gives better throughput and less blocking because it avoids turning every read into a lock negotiation. Serializable techniques reduce anomaly space, but they spend coordination budget through retries, blocking, predicate locks, or conflict tracking. Production engineering is deciding where that budget belongs. Harbor Point should spend it on the customer-visible booking promise, not on every dashboard or audit query that merely needs a stable historical view.
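When the serializable option is chosen, the application must expect aborts and retry. The sketch below is a hedged illustration; SerializationError, run_confirmation, and the retry budget are assumed names, not any specific driver's API (PostgreSQL, for example, signals this case with SQLSTATE 40001).

```python
# Illustrative retry wrapper for a booking path run at SERIALIZABLE.
# Assumed names throughout; adapt to your driver's actual error types.

class SerializationError(Exception):
    """Stands in for a driver's serialization-failure error."""

def with_retries(fn, attempts=3):
    # Retrying is correct HERE because serializable detection turns the
    # dangerous predicate into a real conflict; under plain snapshot
    # isolation the same retry could simply recreate the write skew.
    last = None
    for _ in range(attempts):
        try:
            return fn()
        except SerializationError as exc:
            last = exc
    raise last

calls = {"n": 0}

def run_confirmation():
    calls["n"] += 1
    if calls["n"] == 1:                 # first attempt loses the conflict
        raise SerializationError("could not serialize access")
    return "booked"                     # the retry sees the newer committed state

result = with_retries(run_confirmation)
print(result)                           # booked
```

The retry budget is a real operational knob: on hot predicates it directly trades abort-driven latency against the invariant's strength, which is why the lesson says to spend it only on paths that carry the customer-visible promise.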
Failure Modes and Misconceptions
- Issue: "Snapshot isolation means transactions behave as if they ran one at a time."
- Why it is tempting: Each transaction reads a clean, stable picture, so the result feels serial.
- Corrective mental model: Snapshot isolation guarantees point-in-time visibility plus conflict checks on overlapping writes, not a single serial order for all predicates.
- Operational fix: List the invariants that span sets of rows and test them under concurrent commits, not only under single-request correctness tests.
- Issue: "If both transactions commit, the invariant must have been safe."
- Why it is tempting: Commit success looks like database approval.
- Corrective mental model: Commit only proves the schedule satisfied the isolation level the engine implements. It does not prove the schedule satisfied your business rule.
- Operational fix: Translate critical rules into concrete conflict points or use serializable isolation where the predicate itself must be protected.
- Issue: "Retries solve snapshot-isolation anomalies."
- Why it is tempting: Retries are the right answer for deadlocks and direct write conflicts.
- Corrective mental model: Retrying the same predicate-based logic under snapshot isolation can recreate the same write skew if nothing about the conflict structure changes.
- Operational fix: Change the concurrency contract by adding a shared guard row, stronger isolation, or an explicit lock on the predicate set.
- Issue: "The fix is always to lock every row the query touched."
- Why it is tempting: It seems like the safest possible move.
- Corrective mental model: Blanket locking can destroy the throughput advantage that made snapshot isolation attractive in the first place.
- Operational fix: Lock or serialize only the part of the state that actually carries the invariant, and measure abort rate and tail latency after the change.
Connections
Connection 1: 052.md separated transaction boundaries from side effects
The previous lesson established that Harbor Point needs one atomic commit decision for booking state. This lesson narrows the question: even with a real transaction, what concurrent histories is the database still allowed to accept?
Connection 2: 051.md matters because sharding only solved ownership, not anomaly shape
Keeping decisive writes inside one shard made the booking path easier to reason about, but snapshot isolation shows that a single shard can still admit invalid multi-row outcomes when the invariant is predicate-based.
Connection 3: 054.md is the natural next step
Once you can name write skew and other snapshot-isolation boundaries, the next design question is how real systems approximate serializable behavior without turning every workload into one global lock.
Resources
- [BOOK] Designing Data-Intensive Applications
- Focus: Read the chapters on transactions and multi-version concurrency control together; they explain why a stable snapshot is not the same thing as serial execution.
- [PAPER] A Critique of ANSI SQL Isolation Levels
- Focus: Use it to separate named isolation levels from the actual anomaly classes they do or do not rule out.
- [PAPER] Making Snapshot Isolation Serializable
- Focus: Pay attention to dangerous structures and why serializable snapshot isolation adds conflict detection instead of merely "stronger locks everywhere."
- [DOC] PostgreSQL Documentation: Transaction Isolation
- Focus: Map the formal concepts to an engine engineers actually operate, including which anomalies PostgreSQL prevents at each level.
Key Takeaways
- Snapshot isolation gives each transaction a stable committed snapshot and typically rejects direct write-write collisions, which is why it performs well on many OLTP paths.
- The main anomaly boundary is cross-row or predicate-based invariants: two transactions can each make a locally valid decision and still violate a global rule together.
- When snapshot isolation is too weak, the fix is to create a real conflict surface with serializable isolation, explicit locks, or a materialized guard record rather than relying on blind retries.
- The question is never "is snapshot isolation good or bad?"; it is "does this invariant map to the conflicts snapshot isolation actually knows how to see?"