Timestamp Ordering and Optimistic Concurrency


Database Engine Internals and Implementation


Day 406: Timestamp Ordering and Optimistic Concurrency

The core idea: Once MVCC stops readers from waiting on every writer, the database still needs a way to decide which overlapping transactions can commit together; timestamp ordering and optimistic concurrency control do that by choosing a serial order and aborting work that no longer fits it.

Today's "Aha!" Moment

In 05.md, Harbor Point's risk dashboard could keep reading an older committed version of MUNI-77 exposure instead of blocking behind every trader update. That solved one kind of waiting, but it did not solve the harder question: what if two transactions each read a valid snapshot and still make a jointly invalid decision?

The bond desk hits that problem at 09:31. The issuer cap for MUNI-77 is 10.0M, and open reservations currently total 9.2M. Trader R1 tries to reserve 0.5M for one client, while trader R2 tries to reserve another 0.5M for a different client. Each transaction reads the same snapshot, sees enough headroom, and inserts a different reservation row. If the engine only checks whether they touched the same physical row, both can appear safe and the desk ends up at 10.2M, violating the issuer limit even though neither transaction looked wrong in isolation.
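The jointly invalid outcome takes only a few lines to reproduce. The sketch below is illustrative, not an engine API: amounts are in units of 0.1M so the arithmetic stays exact, and `begin` / `cap_check` are hypothetical helpers standing in for snapshot reads.

```python
# Units of 0.1M so the arithmetic stays exact: cap 10.0M = 100 units.
CAP = 100
committed = [92]            # open reservations already total 9.2M

def begin():
    """Hypothetical snapshot read: copy the committed state."""
    return list(committed)

def cap_check(snapshot, amount):
    return sum(snapshot) + amount <= CAP

# R1 and R2 read the same snapshot and reach the same conclusion.
snap_r1, snap_r2 = begin(), begin()
assert cap_check(snap_r1, 5)        # 9.2M + 0.5M fits under the cap
assert cap_check(snap_r2, 5)        # same snapshot, same headroom

# With only physical-row conflict checks, both inserts commit:
committed += [5, 5]
print(sum(committed) / 10)          # 10.2 -- the issuer cap is broken
```

Neither transaction wrote a row the other read, which is exactly why row-level conflict detection alone misses this.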

That is the mental shift for this lesson. MVCC visibility explains what each transaction was allowed to read. Timestamp ordering and OCC explain whether the system can still pretend those reads and writes happened in one serial order. If not, one transaction has to restart. Production systems vary in the details, but they all rely on the same contract: speculative work is cheap only if the engine can reject it before the bad history becomes durable.

This is also the bridge to 07.md. A successful validation only says, "this transaction may commit." The next lesson asks how the engine makes that commit order crash-safe with WAL sequence numbers, page state, and flush ordering.

Why This Matters

Harbor Point cannot protect the issuer cap with a vague promise that "concurrent transactions usually work out." The desk needs high throughput during the market-open burst, and it also needs a serializable answer to a business rule that spans many rows. Locking everything pessimistically would prevent mistakes, but it would also put every reservation request into a wait queue around a hot issuer. That is a poor fit when most reservations do not conflict and the desk cares about latency as much as correctness.

Timestamp ordering and OCC offer a different bargain. Let transactions read, compute, and even stage writes without grabbing every lock up front. Then, at carefully chosen points, compare what they observed with a chosen serialization order. If another committed transaction invalidated the assumptions behind the work, abort and retry instead of letting the invariant drift.

The production relevance is immediate. When Harbor Point sees a sudden spike in abort retries on MUNI-77, that is not a random application bug. It usually means contention is high enough that speculative execution is wasting CPU and tail latency. When validation is implemented too weakly, the symptom is worse: the system looks fast and silently admits impossible histories. This lesson is about telling those two failure modes apart.

Learning Objectives

By the end of this session, you will be able to:

  1. Explain how timestamp-based serial order constrains concurrency - Describe why a transaction may need to abort even after reading a valid snapshot.
  2. Trace the OCC lifecycle from read phase to commit validation - Follow Harbor Point's reservation transactions through speculative execution, conflict checking, and retry.
  3. Reason about production trade-offs - Decide when optimistic schemes outperform lock-heavy coordination and when contention turns retries into the dominant cost.

Core Concepts Explained

Concept 1: Timestamp ordering is a claim about which transaction must appear first

Harbor Point's R1 and R2 problem is not "who touched the row last?" It is "can the final history still be explained as if one transaction happened entirely before the other?" Timestamp ordering answers that by assigning each transaction a position in a serial order, often with a start timestamp, a commit timestamp, or both. Once the engine makes that choice, every read and write has to be consistent with it.

In the textbook form, each logical item keeps metadata such as read_ts(X) and write_ts(X). If transaction T=105 wants to read item X, but write_ts(X)=110, then a newer transaction has already written a value that should appear after T in serial order. Letting T read now would mean an "older" transaction saw the future, so T must abort. If T wants to write X, but some newer transaction has already read or written X, the same problem appears in the other direction: the write would arrive too late to fit the chosen ordering.
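Those rules are short enough to sketch directly. The `Item`, `read`, and `write` names below are illustrative, a minimal model of the textbook protocol for a single item rather than a production implementation:

```python
# Minimal sketch of textbook timestamp ordering on one item.
# Each item carries read_ts/write_ts; out-of-order access aborts.

class Item:
    def __init__(self):
        self.value = None
        self.read_ts = 0    # newest transaction that read this item
        self.write_ts = 0   # newest transaction that wrote this item

class Abort(Exception):
    pass

def read(item, ts):
    # A newer transaction already wrote X: letting ts read now would
    # show an "older" transaction the future, so ts must restart.
    if ts < item.write_ts:
        raise Abort(f"T{ts} reads too late: write_ts={item.write_ts}")
    item.read_ts = max(item.read_ts, ts)
    return item.value

def write(item, ts, value):
    # A newer transaction already read or wrote X: this write would
    # arrive too late to fit the chosen serial order.
    if ts < item.read_ts or ts < item.write_ts:
        raise Abort(f"T{ts} writes too late")
    item.write_ts = ts
    item.value = value

x = Item()
write(x, 110, "newer exposure state")   # T=110 writes first
try:
    read(x, 105)                        # T=105 must appear earlier...
except Abort as e:
    print(e)                            # ...so its late read aborts
```

The losing transaction restarts with a fresh timestamp instead of blocking, which is the scheme's core trade against lock-based waiting.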

For Harbor Point, imagine the issuer exposure aggregate or the relevant index range for open reservations carries version metadata that effectively says, "a transaction newer than R1 has already consumed this state." If R1 now tries to write based on an older assumption, timestamp ordering rejects it. The database is not being conservative for style points; it is refusing a history that cannot be serialized without contradiction.

That explains both the power and the limit of basic timestamp ordering. It gives a clean rule for conflicts on tracked items, and it can avoid long lock waits because the losing transaction restarts instead of blocking. But real invariants often live on predicates and ranges, not single rows. Harbor Point's issuer cap depends on "all open reservations for MUNI-77," so a production engine needs index-range tracking, predicate locking, dependency tracking, or another way to represent the fact that a new row in that range changes what the transaction was allowed to conclude.

Concept 2: OCC lets transactions run first and proves safety at validation time

Optimistic concurrency control takes the same serial-order goal and turns it into an execution strategy. Instead of checking every operation against the global order immediately, the engine lets the transaction do most of its work speculatively. The classic shape is read phase, validation phase, then write phase.

For Harbor Point, R1 reads the current open reservations for issuer MUNI-77, computes that another 0.5M reservation would still fit under the cap, and prepares its insert in a private workspace. It has not changed the durable table yet. While R1 is thinking, R2 does the same thing. Both transactions look locally correct because each is reading a consistent snapshot and staging private writes.

At commit time, the engine chooses a serialization point and asks a narrower question than "did anything happen?" It asks, "did any transaction that committed since my snapshot change something that would have changed my decision?" For a simple item update, that might mean a read-set or write-set intersection. For Harbor Point's issuer-cap rule, the validation surface is broader: a newly inserted reservation in the MUNI-77 range changes the result of the aggregate query, so it conflicts even if the two traders inserted different rows.

One useful way to picture it is this:

R1: read snapshot S100 -> compute reservation -> validate against commits since S100 -> commit C110
R2: read snapshot S101 -> compute reservation -> validate against commits since S101 -> sees R1 changed MUNI-77 exposure range -> abort/retry

The crucial detail is that OCC does not mean "ignore conflicts until the end and hope." It means "record enough about what I depended on that I can detect whether my chosen serial position is still legal." If the engine only tracks exact rows and not the range or predicate behind Harbor Point's cap check, validation is too weak and the invariant can still break. Correct OCC therefore depends as much on what the read set means as on when validation runs.
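Under those assumptions, backward validation with a predicate-level read set might look like this sketch. `Txn`, `commit_log`, and the issuer-string predicates are all illustrative stand-ins, not a real engine's data structures:

```python
# Sketch of OCC backward validation where the read set records the
# predicate a transaction depended on ("open reservations for
# MUNI-77"), not just the row IDs it happened to touch.

commit_log = []       # (commit_ts, issuer) for each committed insert
next_ts = [100]       # mutable cell standing in for a timestamp oracle

class Txn:
    def __init__(self):
        self.start_ts = next_ts[0]
        self.read_predicates = set()   # issuer ranges we scanned
        self.staged = []               # private, not-yet-durable writes

    def read_open_reservations(self, issuer):
        self.read_predicates.add(issuer)
        return [i for ts, i in commit_log if i == issuer]

    def stage_insert(self, issuer, amount):
        self.staged.append((issuer, amount))

    def try_commit(self):
        # Backward validation: did any commit since our snapshot touch
        # a range our decision depended on?
        for commit_ts, issuer in commit_log:
            if commit_ts > self.start_ts and issuer in self.read_predicates:
                return False               # conflict -> abort and retry
        next_ts[0] += 1
        for issuer, _ in self.staged:
            commit_log.append((next_ts[0], issuer))
        return True

r1, r2 = Txn(), Txn()                  # both snapshot at ts=100
r1.read_open_reservations("MUNI-77"); r1.stage_insert("MUNI-77", 0.5)
r2.read_open_reservations("MUNI-77"); r2.stage_insert("MUNI-77", 0.5)
print(r1.try_commit())                 # True: nothing newer conflicts
print(r2.try_commit())                 # False: R1 changed the range
```

If `read_predicates` tracked only exact rows, both commits would succeed and the cap would break, which is the weak-validation failure described above.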

Concept 3: Optimism wins on low-conflict workloads and falls apart when retries become the queue

Harbor Point prefers optimism during normal trading because most reservations are for different issuers and different clients. In that regime, locking every candidate row or index range up front would create unnecessary waits. OCC keeps the hot path short: read from a snapshot, compute privately, validate quickly, and write once. Abort rates stay low, CPU does useful work, and throughput is good.

The failure mode is not subtle. If many traders hammer MUNI-77 at once, the system can degrade from "little waiting" to "lots of wasted work." Transactions finish their read phase, burn application and database CPU, reach validation, and then lose because a slightly earlier transaction already committed a conflicting change. A pessimistic system pays in blocking time; an optimistic system pays in retries. Under enough contention, retries become a hidden queue with worse tail latency than explicit locks because the discarded work has already consumed resources.

That is why production systems rarely use one pure textbook algorithm. They mix MVCC snapshots with optimistic validation, attach commit timestamps to version chains, or fall back to stronger locking on the hottest keys and ranges. Engineers then watch concrete signals: abort rate, retry budget, hot-key skew, validation latency, and whether retries amplify upstream traffic. Harbor Point does not care whether a whitepaper called the design "timestamp ordering" or "serializable OCC." It cares whether the issuer-cap invariant holds and whether the desk can still trade through the open.

This is the point where durability re-enters the story. Once R1 wins validation, the engine still has to assign a commit record, establish a durable order, and make sure crash recovery reproduces the same winner. That is exactly why 07.md follows next: validation decides which history is allowed, while WAL decides how that history survives a crash.

Troubleshooting

Issue: Harbor Point sees repeated transaction retries on one issuer even though average database CPU is not maxed out.

Why it happens / is confusing: OCC pushes contention cost into aborted work, so the database can look superficially healthy while many transactions are doing speculative reads that never commit.

Clarification / Fix: Inspect retry counts, abort reasons, and hot-key distribution alongside CPU. If conflicts cluster on MUNI-77, shard the invariant differently, serialize that path more explicitly, or reduce how much work happens before validation.

Issue: Two concurrent reservations inserted different rows, yet the issuer cap was still violated after both committed.

Why it happens / is confusing: Validation may be tracking exact row conflicts but not the predicate or index range behind the cap check, so the engine misses a write-skew-style dependency.

Clarification / Fix: Validate the actual read dependency, not just touched row IDs. In practice that can mean predicate locks, index-range tracking, materialized aggregate rows, or a serializable isolation mechanism that records read-write dependencies.
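One of those fixes, a materialized aggregate row, can be sketched as a single exposure value updated with a compare-and-swap loop. All names are illustrative, and the lock below merely stands in for an atomic CAS primitive:

```python
import threading

# Materialize issuer exposure as one aggregate value so the cap check
# and the update serialize on a single item instead of a predicate.

CAP = 100                      # 10.0M cap, in units of 0.1M

class Exposure:
    def __init__(self, total):
        self.total = total
        self._lock = threading.Lock()   # stand-in for an atomic CAS

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self.total != expected:
                return False            # someone else won the race
            self.total = new
            return True

def reserve(exposure, amount):
    """Optimistic loop: read, re-check the cap, CAS, retry on races."""
    while True:
        seen = exposure.total
        if seen + amount > CAP:
            return False                # cap would be violated: reject
        if exposure.compare_and_swap(seen, seen + amount):
            return True                 # our serial position was legal

muni77 = Exposure(92)                   # 9.2M already reserved
print(reserve(muni77, 5))               # True  -> exposure now 9.7M
print(reserve(muni77, 5))               # False -> would exceed 10.0M
print(muni77.total)                     # 97
```

The trade-off is that the aggregate row becomes the hot item, so this converts a missed predicate dependency into explicit contention the engine can actually see.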

Issue: Engineers assume timestamp ordering removes the need for waiting entirely.

Why it happens / is confusing: The algorithm often aborts instead of blocking on logical conflicts, but engines still wait on latches, log flushes, index-page access, and sometimes explicit locks used alongside optimistic schemes.

Clarification / Fix: Separate concurrency-control policy from the rest of the storage stack. A transaction may be optimistic about logical conflicts and still wait on physical structures or durability boundaries.

Issue: Retry logic "fixes" failed validations, but client latency becomes unpredictable during bursts.

Why it happens / is confusing: Automatic retries preserve correctness, but they can multiply load when every retry replays the same expensive read and compute path against the same hot contention point.

Clarification / Fix: Bound retries, add jitter or queueing upstream, and measure end-to-end request cost rather than counting only successful commit latency.
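A bounded-retry wrapper with jittered backoff might look like this sketch. `run_txn` and the `Conflict` error are hypothetical stand-ins for whatever the client library raises on failed validation:

```python
import random
import time

class Conflict(Exception):
    pass

def with_retries(run_txn, max_attempts=4, base_delay=0.01):
    """Retry a transaction with a hard budget and jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return run_txn()
        except Conflict:
            if attempt == max_attempts - 1:
                raise                   # budget exhausted: surface it
            # Exponential backoff with full jitter spreads retry bursts
            # out in time instead of replaying them in lockstep.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))

attempts = []
def flaky_txn():
    attempts.append(1)
    if len(attempts) < 3:
        raise Conflict("validation lost to an earlier commit")
    return "committed"

print(with_retries(flaky_txn))          # "committed" on the 3rd try
print(len(attempts))                    # 3
```

The hard cap matters as much as the jitter: surfacing the final `Conflict` lets upstream queueing absorb the burst instead of letting retries multiply load invisibly.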

Advanced Connections

Connection 1: MVCC snapshots ↔ optimistic validation

The previous lesson's snapshot rules tell Harbor Point what each transaction may read without blocking. OCC adds the missing serializability check: after reading from a legal snapshot, is the transaction still compatible with everything that committed in the meantime? Many production engines combine both ideas because snapshots make speculative reads cheap, while validation prevents those reads from justifying an impossible final history.

Connection 2: Timestamp ordering ↔ serializable transaction systems

FoundationDB, Spanner-style designs, and serializable MVCC engines all need a way to map overlapping real execution into a legal serial order. Some expose that as optimistic retries, some as dependency tracking, and some as timestamp-based commit rules. The common thread is not the API shape but the mechanism: commit is allowed only if the engine can still assign the transaction a position in a history that preserves the invariant it read.


Key Insights

  1. A valid snapshot is not a final correctness proof - Transactions can each read legal data and still conflict once the engine tries to serialize their combined effects.
  2. Timestamp ordering chooses who is allowed to be "earlier" - Reads and writes that violate that order must abort because they imply an impossible history.
  3. OCC turns contention into retries instead of waits - That is excellent on low-conflict workloads and dangerous when hot keys or hot ranges make speculative work mostly disposable.