Day 423: Distributed Transactions and Two-Phase Commit

The core idea: Two-phase commit lets one logical transaction span multiple shards by separating "can you commit?" from "commit now," but that atomic outcome costs durable coordination, held locks, and blocking whenever the coordinator cannot finish the decision.

Today's "Aha!" Moment

In 06.md, Harbor Point learned how to move shard ownership without creating two writers for the same bucket. That solved the placement problem. It did not solve the business-operation problem. A reservation for CA-MUNI-77 still has to do two writes that live on different shards: decrement available inventory on bucket CA/3, and increment the desk's aggregate exposure on shard GLOBAL-RISK. If only one write lands, the platform either oversells inventory or rejects later trades against a risk number that never actually became true.

The tempting implementation is "write shard A, then write shard B, and retry if the second call fails." That is not atomicity. It is wishful sequencing. Once Harbor Point commits the first write, the system has already exposed a partial outcome that other readers and writers can build on. Retries help with transient communication failures, but they do not erase the fact that one shard may already have committed durable state while another has not.

Two-phase commit changes the shape of the problem. The important event is not the final COMMIT message. The important event is that each participant durably records a prepared state saying, in effect, "I have validated this transaction, reserved the resources it touches, and I promise not to contradict this outcome unless I later hear the global decision." The coordinator only announces COMMIT after every participant has made that promise. That is why 2PC can deliver an all-or-nothing outcome across shards, and also why it can leave the system waiting if the coordinator disappears at exactly the wrong moment.

That trade-off is the bridge to the next lesson, 08.md. When Harbor Point needs strict atomicity inside a tightly controlled data system, 2PC can be the right tool. When the workflow stretches across looser service boundaries or external side effects, the blocking cost becomes the main design constraint and the system usually reaches for sagas or an outbox instead.

Why This Matters

Harbor Point's reservation engine is not doing distributed transactions for elegance. It is doing them because the product exposes invariants that cross shard boundaries. A reservation is valid only if inventory and risk move together. If a trader sees inventory drop but risk not increase, the next decision is made against a lie. If risk increases but inventory does not drop, desks start getting blocked by phantom exposure. In both directions, the bug is not merely technical inconsistency. It changes the trading behavior of the whole platform.

Two-phase commit matters because local ACID stops at the shard boundary. Each shard can protect its own rows with a normal transaction, but no shard can unilaterally promise what the other shard will do. The platform therefore needs a coordination protocol that answers one concrete question: can these participants either all commit this business change, or all back away from it, even if failures happen between messages?

This lesson sits directly after rebalancing for a reason. 06.md established that each logical bucket needs one clear owner at a time. 2PC assumes that ownership discipline already exists, then adds a second layer: once the coordinator chooses the participants for a transaction, they must hold enough state and lock enough data to honor one global decision. The gain is stronger cross-shard correctness. The cost is longer-lived locks, more operational recovery work, and reduced availability whenever one participant or the coordinator is unhealthy.

Learning Objectives

By the end of this session, you will be able to:

  1. Explain why local transactions are insufficient for Harbor Point's cross-shard reservation flow - Show exactly where partial commit appears when one business operation touches CA/3 and GLOBAL-RISK.
  2. Trace the internal 2PC state machine - Follow coordinator and participant behavior through prepare, durable decision logging, final commit or abort, and recovery.
  3. Evaluate when 2PC is worth its operational cost - Reason about blocking, coordinator failure, lock hold time, and why many systems restrict 2PC to narrow, high-value invariants.

Core Concepts Explained

Concept 1: Cross-shard invariants need one commit decision, not two unrelated local commits

Harbor Point's reservation request for CA-MUNI-77 looks simple from the API edge: reserve $2M of inventory for desk ALPHA. Under the hood it touches two different authorities. Bucket CA/3 owns the issuer-level inventory rows. Shard GLOBAL-RISK owns the desk's aggregate exposure rows. Each shard can run a normal transaction over its own data, but neither shard can see the other's locks, WAL records, or post-commit outcome.

If Harbor Point updates CA/3 first and GLOBAL-RISK second, the failure window is obvious. Suppose the inventory shard commits, then the coordinator process crashes before the risk shard even receives its write. The system wakes back up with inventory reduced and desk exposure unchanged. A retry may double-apply the second step, or an operator may need a manual repair job to discover which reservations are half-finished. The same bug appears in reverse if risk commits and inventory does not.
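
As a rough sketch of that failure window (the dict-based shards and field names below are illustrative stand-ins, not Harbor Point's real storage API), the naive sequencing looks like this:

# Naive "write shard A, then shard B, retry on failure" approach.
inventory_shard = {"CA-MUNI-77": 10_000_000}   # available inventory owned by CA/3
risk_shard = {"ALPHA": 0}                      # desk exposure owned by GLOBAL-RISK

def reserve_naively(issuer: str, desk: str, amount: int) -> None:
    inventory_shard[issuer] -= amount   # first local commit lands on CA/3
    # A crash or lost connection here exposes a partial outcome: inventory is
    # already reduced while desk exposure has not moved, and a blind retry of
    # the whole operation can double-apply whichever step already committed.
    risk_shard[desk] += amount          # second local commit lands on GLOBAL-RISK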

The key point is that "both shards support transactions" does not mean "the combined workflow is transactional." Local ACID protects one storage engine's write set. A distributed transaction appears the moment one invariant spans multiple independent commit logs. Harbor Point therefore needs a component that speaks to both shards, assigns one transaction identifier, and refuses to let either shard finalize early.

That component is the transaction coordinator. Before it starts, it resolves the participating shard owners under one placement epoch. This is where the previous lesson still matters: if CA/3 is in the middle of a rebalance, Harbor Point should either pin the transaction to the current owner set or abort and retry against fresh metadata. 2PC assumes the coordinator knows which participants are authoritative for the full lifetime of the transaction.

The trade-off begins right here. One global decision prevents split business state, but it also couples the liveness of the transaction to every participant. The reservation cannot finish faster than the slowest shard that must vote on it.
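
A minimal coordinator skeleton under those assumptions might look like the sketch below; the placement object, its current_epoch() and lookup_owner() helpers, and the participant names are hypothetical, but the shape is the point: one transaction id and one participant set resolved under one placement epoch before any shard is asked to prepare.

import uuid
from dataclasses import dataclass

@dataclass
class Txn:
    txn_id: str
    epoch: int
    participants: dict   # shard name -> owning node for this epoch

def begin_reservation(placement, shards: list[str]) -> Txn:
    # Resolve every participant under one placement epoch; if the epoch moves
    # before the transaction finishes, abort and retry against fresh metadata.
    epoch = placement.current_epoch()
    owners = {shard: placement.lookup_owner(shard, epoch) for shard in shards}
    return Txn(txn_id=f"txn-{uuid.uuid4().hex[:8]}", epoch=epoch, participants=owners)

# Example: txn = begin_reservation(placement, ["CA/3", "GLOBAL-RISK"])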

Concept 2: The prepare phase works because participants turn intent into a durable promise

For Harbor Point's reservation, the coordinator sends a PREPARE(txn-8841) request to both CA/3 and GLOBAL-RISK. Each message includes the exact mutation the participant would perform if the transaction commits: reduce available inventory by $2M on one shard, increase desk exposure by $2M on the other. The participant does not commit yet. It first checks that the mutation is valid against current state and that no conflicting transaction already holds the same rows.

If validation succeeds, the participant writes a prepared record to durable storage before replying YES. That record contains enough information to either commit or abort after a crash. The participant also keeps the affected rows or keys locked so some other transaction cannot sneak in and change the premise underneath the coordinator. Once a participant has voted YES, it is no longer free to forget the transaction just because the coordinator becomes slow. The durable prepared record is the promise that makes a later global commit possible.

client
  |
  v
coordinator
  |-- PREPARE txn-8841 --> CA/3
  |-- PREPARE txn-8841 --> GLOBAL-RISK

CA/3:
  validate inventory >= 2M
  write PREPARED(txn-8841, redo/undo)
  lock issuer row
  reply YES

GLOBAL-RISK:
  validate limit not exceeded
  write PREPARED(txn-8841, redo/undo)
  lock desk exposure row
  reply YES

Only after every participant replies YES does the coordinator make the global decision. It writes COMMIT(txn-8841) to its own durable log, then sends COMMIT to each participant. If any participant votes NO, the coordinator durably records ABORT and tells everyone to roll back. The important ordering is subtle but critical: the coordinator must persist its final decision before it tells participants to act on that decision. Otherwise a crash can erase the only authoritative answer to "was this transaction committed?"

def prepare_participant(txn):
    # Phase 1 on the participant: validate, then turn intent into a durable promise.
    if not constraints_hold(txn) or has_conflict(txn):
        log.write(("vote_no", txn.id))   # record the refusal so recovery knows the vote
        return "NO"
    lock_keys(txn.write_set)             # hold the rows until the global decision arrives
    log.write(("prepared", txn.id, txn.redo, txn.undo))  # must be durable before voting
    return "YES"                         # the promise: act only on the coordinator's decision

After a participant receives COMMIT, it applies the change, writes a commit record, releases locks, and acknowledges completion. If it receives ABORT, it discards the prepared intent and releases locks without making the mutation visible. The mechanism is boring by design: durable intent first, one durable decision second, visible effects last. That ordering is what converts two local storage engines into one all-or-nothing commit path.
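
In the same sketch style as prepare_participant above (apply_redo, unlock_keys, and the log object are illustrative helpers), the participant's handling of the final decision could look like:

def finish_participant(txn, decision):
    # Phase 2 on the participant: act only on the coordinator's durable decision.
    if decision == "COMMIT":
        apply_redo(txn.redo)              # make the prepared mutation visible
        log.write(("committed", txn.id))  # completion record for recovery
    else:
        log.write(("aborted", txn.id))    # discard the prepared intent; nothing becomes visible
    unlock_keys(txn.write_set)            # release locks only after the outcome is known
    return "ACK"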

Concept 3: 2PC is valuable precisely where its blocking behavior is acceptable

The happy path is not the reason engineers fear 2PC. The difficult path is the one where every participant voted YES, but the coordinator crashes before some participants learn the final answer. Those participants are now in doubt. They have promised to honor a future global decision, so they cannot safely commit on their own, and they cannot safely abort on their own either. Until Harbor Point recovers the coordinator log or learns the decision from a surviving replica of that log, the locked rows may remain unavailable to other work.
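
A small recovery sketch makes the blocking concrete; query_coordinator_log() is an assumed lookup against the coordinator's durable decision record, and finish_participant() is the sketch from the previous concept.

def recover_participant(prepared_txns, query_coordinator_log):
    # On restart, every prepared-but-undecided transaction is in doubt.
    still_in_doubt = []
    for txn in prepared_txns:
        decision = query_coordinator_log(txn.id)   # None if the decision is unreachable
        if decision is None:
            still_in_doubt.append(txn)             # locks stay held; no unilateral commit or abort
        else:
            finish_participant(txn, decision)      # safe once the durable decision is known
    return still_in_doubt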

This is why 2PC is often described as a blocking protocol. The problem is not that the messages are slow in the normal case. The problem is that progress depends on the decision record staying recoverable. If the coordinator is backed by one fragile process and one local disk, Harbor Point has turned a cross-shard invariant into a single-point-of-waiting. If the coordinator state is replicated more robustly, the system reduces that risk, but it has moved closer to a consensus-backed transaction manager with all the operational complexity that implies.

The protocol also stretches lock hold time. CA/3 and GLOBAL-RISK keep resources reserved from PREPARE until they learn the outcome. Under bursty load, that can inflate tail latency or create deadlock pressure even when every transaction eventually succeeds. Harbor Point should therefore reserve 2PC for cases where the invariant is genuinely atomic and the participant set is small, predictable, and inside infrastructure the team controls closely.

That is also why the next lesson exists. The moment Harbor Point wants the same workflow to include email notifications, third-party settlement rails, or loosely coupled microservices that cannot stay prepared for long, 2PC becomes the wrong abstraction. The system usually shifts to local commits plus compensation, accepting that "eventually repaired" is a different contract from "never partially committed."

Troubleshooting

Issue: Prepared transactions accumulate and block new reservations on CA/3 even though CPU usage looks normal.

Why it happens / is confusing: The expensive part of 2PC is often not compute. It is time spent waiting for a final decision while locks remain held. A coordinator stall or a slow participant can make the system feel idle and stuck at the same time.

Clarification / Fix: Monitor prepared-transaction age, lock wait time, and count of in-doubt transactions. Recovery alerts should trigger on long-lived prepared state, not only on process death.
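
One illustrative way to express that alert, assuming each prepared transaction records a prepared_at timestamp (the threshold is arbitrary, not a recommendation):

import time

def long_lived_prepared(prepared_txns, max_age_seconds=30):
    # Flag transactions that have held locks in the prepared state too long,
    # even though CPU usage and process health look normal.
    now = time.time()
    return [t for t in prepared_txns if now - t.prepared_at > max_age_seconds]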

Issue: The client times out, retries the reservation, and the desk exposure is incremented twice.

Why it happens / is confusing: A timeout at the client does not tell you whether the coordinator already committed. Without a transaction identifier that survives retries, Harbor Point cannot distinguish "same transaction retry" from "new reservation request."

Clarification / Fix: Make the coordinator idempotent on a stable transaction key. Client retries should ask for the outcome of txn-8841, not start a new anonymous transaction.
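
A minimal sketch of that idempotency, assuming the client supplies a stable reservation key and the coordinator keeps a durable outcome table keyed by it:

def reserve_idempotent(reservation_key, outcomes, run_two_phase_commit):
    # A retry with the same key returns the recorded outcome of the original
    # transaction instead of starting a new anonymous one.
    if reservation_key in outcomes:
        return outcomes[reservation_key]
    outcomes[reservation_key] = run_two_phase_commit(reservation_key)
    return outcomes[reservation_key]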

Issue: A rebalance starts while a distributed transaction is preparing, and one participant rejects the final commit as stale.

Why it happens / is confusing: The coordinator chose participants from one placement epoch, but ownership changed before the transaction finished. The transaction is now trying to complete against a shard that no longer believes it is authoritative.

Clarification / Fix: Pin routing to a placement epoch for the duration of the transaction or abort and retry when ownership changes. Rebalancing and distributed transactions must share the same authority model.
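
A sketch of the abort-and-retry variant, reusing the hypothetical placement helper and Txn epoch field from the Concept 1 sketch:

def check_epoch_before_finishing(txn, placement):
    # Refuse to complete the transaction if shard ownership changed since the
    # participants were chosen; the caller aborts and retries with fresh metadata.
    if placement.current_epoch() != txn.epoch:
        return "ABORT_AND_RETRY"
    return "PROCEED"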

Issue: The team tries to include email delivery or a card network call inside the 2PC workflow.

Why it happens / is confusing: 2PC works best when every participant can hold prepared state durably and follow the same recovery protocol. External side effects usually cannot do that, and some of them are irreversible once observed.

Clarification / Fix: Keep the atomic core inside storage boundaries you control, then publish downstream work through an outbox or saga-style flow. That is the transition into 08.md.

Advanced Connections

Connection 1: 06.md defines who owns the shard; 2PC assumes that ownership is stable long enough to coordinate across shards

Rebalancing introduced placement epochs so Harbor Point could transfer authority for CA/3 safely. A distributed transaction depends on the same idea. The coordinator cannot ask one owner to prepare and another owner to commit. Stable shard ownership is therefore a prerequisite for atomic cross-shard coordination, not an unrelated storage concern.

Connection 2: 08.md addresses the cases where atomic prepare state is too expensive to hold

Two-phase commit resolves failure before state becomes visible. Sagas and the outbox pattern resolve failure after some local state has already committed. That difference is the central design choice for cross-service consistency: wait for one global decision now, or allow partial progress and compensate later.


Key Insights

  1. A distributed transaction starts when one invariant spans multiple commit logs - Local transactions do not compose into atomicity just because each shard is ACID on its own.
  2. Prepared state is the heart of 2PC - Participants must durably promise a future outcome before the coordinator can safely choose one global decision.
  3. 2PC spends availability and lock time to buy atomicity - It is the right tool for narrow, high-value invariants, not a default workflow pattern for every cross-service action.