LESSON
Day 486: Distributed Transactions and 2PC Limits
The core idea: Two-phase commit gives multiple transactional systems one atomic commit decision, but it does that by forcing participants into a durable prepared state that can block progress when the coordinator or network fails.
Today's "Aha!" Moment
In 054.md, Harbor Point learned how to keep one database serializable enough to protect the "leave one suite unsold" rule. That solved a local concurrency problem. It did not solve the next production problem that arrived after the company split its system boundaries: confirming a suite upgrade now has to touch booking-db, which owns cabin assignment, and ledger-db, which owns the receivable that finance reconciles at the end of the day. If one database commits and the other does not, the company either promised a cabin without charging for it or charged a guest for an upgrade that operations never honored.
The non-obvious insight is that distributed transactions are not mainly about making many systems feel like one giant database. They are about making one yes-or-no commit decision durable across several resource managers. Two-phase commit, or 2PC, works by asking each participant to do all of its local transaction work up to the edge of commit, record that prepared state durably, and then wait for the coordinator's final decision. Once a participant votes "yes," it has given up the right to decide alone.
That is why 2PC feels so different from ordinary retry logic. Retries are cheap because a component is free to start over. A prepared participant is not free. It may be holding row locks, index locks, or MVCC cleanup state while it waits. The protocol buys atomicity, but the trade-off is coordination latency, operational coupling, and the possibility of blocking when the decision maker disappears at exactly the wrong moment.
Why This Matters
Harbor Point can reconcile a few inconsistent upgrades by hand during a quiet week. It cannot run that way during peak summer sailings, when the booking team is confirming hundreds of last-minute cabin changes and finance is closing revenue at the same time. If booking-db says suite S12 is sold while ledger-db is missing the receivable entry, support has to investigate which screen the agent saw, whether the guest was promised the room, and whether a compensating charge is still legally acceptable. Those are not "eventual consistency" inconveniences. They are customer-facing and audit-facing failures.
This is the moment when distributed transactions look appealing. The business invariant is concrete: either the upgrade exists in both places or it exists in neither. 2PC can enforce that invariant if both databases support prepare/commit semantics. But the production cost is equally concrete. Every successful transaction now needs an extra coordination round, every participant must fsync a prepared record before voting yes, and any slow or crashed coordinator can strand prepared work. Engineers need to recognize both sides of that bargain before they adopt it.
Core Walkthrough
Part 1: Why local transactions are not enough
Harbor Point's upgrade flow now crosses service ownership boundaries:
- booking-db changes cabins.status from available to booked
- ledger-db inserts a receivable row that finance later settles
If the application tries to do that with two unrelated local transactions, it gets a dangerous gap between the commits:
1. Begin local transaction in booking-db.
2. Mark suite S12 as booked.
3. Commit booking-db.
4. Begin local transaction in ledger-db.
5. Network timeout or process crash before ledger-db commits.
After step 3, the guest may already see the upgrade in the booking portal. After step 5, finance has no matching receivable. Running each database at SERIALIZABLE would not fix that split. Serializability controls concurrency inside one participant; it does not create atomic commit across multiple participants.
That distinction matters operationally. When Harbor Point discusses 2PC, it is not replacing local concurrency control from 054.md. Each participant still needs its own correct local transaction semantics. 2PC sits one layer above that and answers a narrower question: after every participant has done its local work, do they all cross the commit line together or not at all?
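The gap between the two commits can be made concrete with a small simulation. This is a sketch, not real database drivers: the dicts stand in for booking-db and ledger-db, and the exception stands in for a process crash between steps 3 and 5.

```python
# Sketch: two independent local commits are not atomic. The dicts and the
# crash flag are hypothetical stand-ins for booking-db and ledger-db.

class CrashBetweenCommits(Exception):
    """Simulates a process crash after the first commit."""

def upgrade_suite(booking_db, ledger_db, crash_after_booking=False):
    # Steps 1-3: the local transaction in booking-db commits on its own.
    booking_db["S12"] = "booked"          # durable the moment it commits
    if crash_after_booking:
        raise CrashBetweenCommits()       # steps 4-5 never run
    # Steps 4-5: the local transaction in ledger-db.
    ledger_db["txn-8841"] = {"charge": "suite-upgrade"}

booking_db, ledger_db = {"S12": "available"}, {}
try:
    upgrade_suite(booking_db, ledger_db, crash_after_booking=True)
except CrashBetweenCommits:
    pass

# The guest sees a booked suite; finance sees no receivable.
print(booking_db["S12"])        # booked
print("txn-8841" in ledger_db)  # False
```

No isolation level inside either dict-database could prevent this outcome, because the failure sits between the two commits, not inside either of them.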
Part 2: What 2PC asks each participant to do
In Harbor Point's case, the coordinator is the upgrade service. It assigns a transaction id such as txn-8841 and talks to both databases using a transaction manager or XA-capable driver.
The protocol looks like this:
Coordinator                  booking-db                ledger-db
     |                           |                         |
     |---- PREPARE txn-8841 ---->|                         |
     |                           |-- validate +            |
     |                           |   write prepare record  |
     |<--------- YES ------------|                         |
     |---- PREPARE txn-8841 ------------------------------>|
     |                           |                         |-- validate +
     |                           |                         |   write prepare record
     |<------------------------------------ YES -----------|
     |                                                     |
     |-- write durable COMMIT decision to coordinator log  |
     |                                                     |
     |---- COMMIT txn-8841 ----->|                         |
     |                           |-- make changes visible  |
     |<--------- ACK ------------|                         |
     |---- COMMIT txn-8841 ------------------------------->|
     |<------------------------------------ ACK -----------|
Three details make the mechanism work.
First, PREPARE is not a vote about intention. It is a vote backed by durable state. Before booking-db says yes, it has already done the local writes, checked constraints, and stored enough log information to either commit or roll back later. In PostgreSQL terms, a prepared transaction becomes visible in pg_prepared_xacts; its locks and transactional state survive client disconnects until COMMIT PREPARED or ROLLBACK PREPARED.
Second, the coordinator must persist its own decision before telling anyone to commit. If Harbor Point's upgrade service sent COMMIT optimistically and crashed before durably recording that choice, recovery could not tell whether the real outcome should be commit or abort. The coordinator's decision log is therefore part of the protocol, not a convenience.
Third, a "no" vote is much cheaper than a "yes" vote. If either participant rejects during prepare because of a constraint failure, timeout, or local crash, the coordinator can choose abort and tell everyone to roll back. The hard case is after both participants have said yes. At that point, neither participant can safely invent its own outcome without risking a split decision.
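The protocol above can be condensed into a toy coordinator. This is a single-process sketch under loud assumptions: "durable" state is modeled with in-memory dicts that a real system would fsync, and names like Participant, run_2pc, and decision_log are illustrative, not a real transaction-manager API.

```python
# Toy 2PC: each participant stages writes durably at PREPARE, and the
# coordinator records its decision BEFORE telling anyone to commit.

class Participant:
    def __init__(self, name):
        self.name = name
        self.prepared = {}   # txn_id -> staged writes; survives "disconnects"
        self.data = {}       # committed, visible state

    def prepare(self, txn_id, writes):
        # Validate and durably stage the writes, then vote yes.
        self.prepared[txn_id] = writes
        return "YES"

    def commit(self, txn_id):
        self.data.update(self.prepared.pop(txn_id))

    def rollback(self, txn_id):
        self.prepared.pop(txn_id, None)

def run_2pc(txn_id, participants_with_writes, decision_log):
    votes = [p.prepare(txn_id, w) for p, w in participants_with_writes]
    decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"
    decision_log[txn_id] = decision   # persisted before any COMMIT is sent
    for p, _ in participants_with_writes:
        p.commit(txn_id) if decision == "COMMIT" else p.rollback(txn_id)
    return decision

booking, ledger = Participant("booking-db"), Participant("ledger-db")
log = {}
run_2pc("txn-8841",
        [(booking, {"S12": "booked"}),
         (ledger, {"txn-8841": "receivable"})],
        log)
print(booking.data, ledger.data, log)
```

Note the ordering: the decision lands in decision_log before the commit loop runs, which is exactly the second detail above. If this process crashed inside the loop, recovery could replay the logged decision to the remaining participant.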
Part 3: Where the limits show up in production
The easiest way to understand 2PC limits is to look at the Harbor Point alert that wakes the on-call engineer: prepared_transactions > 0 for 10 minutes. That alert means the protocol has reached the state where local work is done but the final decision has not finished propagating.
The failure modes behind that alert are the real design lesson:
| Limit | What Harbor Point observes | Why it happens |
|---|---|---|
| Blocking after coordinator failure | suite rows stay locked, agents see timeouts, finance rows remain invisible | prepared participants cannot unilaterally decide commit or abort without risking disagreement |
| Latency coupled to the slowest participant | a healthy booking-db still waits on a slow ledger-db fsync | prepare and final decision both require all yes-voting participants to stay in the protocol |
| Availability drops with every required participant | one unavailable database prevents the whole upgrade from committing | atomicity requires all participants to reach a common decision |
| Narrow applicability | card processor calls, emails, and human approval steps do not fit | those systems usually cannot hold a recoverable prepared state and wait for COMMIT |
Blocking is the famous 2PC limitation because it is the place where engineers discover what "prepared" really means. Suppose both databases have voted yes, the coordinator has written nothing durable yet, and then a network partition isolates booking-db. booking-db cannot assume abort, because ledger-db might already have learned commit. It also cannot assume commit, because the coordinator might still choose abort after detecting a failure elsewhere. So it waits. That waiting is not a bug in one implementation. It is the consequence of choosing atomic commit without a fault-tolerant decision process.
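The blocking rule can be stated as a tiny piece of recovery logic. This is a hedged sketch: resolve_in_doubt and the log format are hypothetical, not a real recovery API, but they capture the only move a prepared participant is allowed to make.

```python
# Sketch: after voting yes, a participant resolves a transaction only by
# learning the coordinator's durably logged decision. Guessing risks a
# split outcome where one side commits and the other aborts.

def resolve_in_doubt(txn_id, decision_log):
    """What recovery may do for a prepared-but-undecided transaction."""
    if txn_id not in decision_log:
        # No durable decision exists yet: the participant must keep
        # waiting (blocking), holding its locks and prepared state.
        return "STILL_IN_DOUBT"
    return decision_log[txn_id]   # "COMMIT" or "ABORT", replayed safely

print(resolve_in_doubt("txn-8841", {}))                        # STILL_IN_DOUBT
print(resolve_in_doubt("txn-8841", {"txn-8841": "COMMIT"}))    # COMMIT
```

The first call is the on-call alert from above: prepared work exists, no decision is readable, and nothing local can break the tie.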
This is also why 2PC is a poor fit for long-running business workflows. If Harbor Point tried to include an external card processor, a fraud review, and an email confirmation in the same distributed transaction, the "prepared" window would become enormous or impossible to represent. Real payment gateways do not expose "prepare a charge and wait indefinitely for commit." They expose authorization, capture, void, and refund semantics. Those require sagas, compensating actions, or outbox-driven orchestration, not 2PC.
There are still cases where 2PC is the right tool. If both participants are transactional databases inside one datacenter boundary, the invariant is financially important, and the transactions are short, 2PC may be the simplest correct choice. But that decision is defensible only if Harbor Point also budgets for coordinator recovery, prepared-transaction monitoring, timeout handling, and a clear rule that no irreversible side effect is emitted before the final commit decision is durable.
Failure Modes and Misconceptions
- Issue: "2PC makes several services behave like one fully available database."
  - Why it is tempting: The word "transaction" suggests strong correctness and familiar local-database behavior.
  - Corrective mental model: 2PC gives atomic commit, not magically high availability; it coordinates a shared decision, and that decision can block when the coordinator or network disappears.
  - Operational fix: Track coordinator health, participant prepare latency, and prepared-transaction age as first-class production signals.
- Issue: "A participant that voted yes can time out and roll itself back."
  - Why it is tempting: Local retry habits teach engineers that timeouts usually mean "start over."
  - Corrective mental model: After PREPARE, the participant has promised to follow the final outcome and must preserve the option to commit later.
  - Operational fix: Make recovery procedures able to reload the coordinator's durable decision log and resolve in-doubt transactions explicitly.
- Issue: "If every step is idempotent, 2PC is unnecessary."
  - Why it is tempting: Idempotency is a powerful tool for retries and duplicate delivery.
  - Corrective mental model: Idempotency helps when an operation repeats; it does not answer the atomicity question of whether two resource managers reached the same final decision.
  - Operational fix: Use idempotency together with 2PC for safe retries around the coordinator, or use sagas when atomic commit across participants is not required or not supported.
- Issue: "2PC is fine for external APIs as long as timeouts are generous."
  - Why it is tempting: Longer timeouts can hide transient failures in testing.
  - Corrective mental model: The protocol assumes participants can durably prepare and later honor COMMIT or ROLLBACK; most external APIs cannot do that.
  - Operational fix: Keep 2PC limited to transactional resource managers and move external side effects to patterns such as an outbox, reservation protocol, or saga.
Connections
Connection 1: 054.md explained local correctness; this lesson explains cross-resource atomicity
Harbor Point still needs serializable or otherwise correct local transactions inside each participant. 2PC does not prevent write skew or bad predicate handling inside booking-db; it only coordinates the final commit decision after local transaction logic has already run.
Connection 2: 056.md picks up exactly where 2PC breaks
Once you see that prepared participants block because no one can safely invent the final decision, the next question is unavoidable: what fault model would let a commit decision survive crashes and partitions without a single fragile coordinator log? That is the door into consensus and replicated decision-making.
Connection 3: ../event-driven-and-streaming/003.md is the alternative for workflows that cannot prepare
When Harbor Point needs to involve a payment gateway or send external notifications, the right comparison is not "2PC versus nothing." It is "2PC for XA-capable resource managers" versus "saga-style orchestration with compensations for systems that cannot join an atomic commit protocol."
Resources
- [BOOK] Designing Data-Intensive Applications
- Focus: Read the transaction chapters together to separate local isolation, distributed atomic commit, and consensus instead of treating them as one interchangeable feature.
- [DOC] PostgreSQL Documentation: Two-Phase Transactions
  - Focus: Pay attention to what PostgreSQL keeps durable during PREPARE TRANSACTION, how prepared transactions survive disconnects, and why they need active cleanup.
- [DOC] MySQL 8.4 Reference Manual: XA Transactions
  - Focus: Use it to see how a mainstream engine exposes 2PC semantics, including XA PREPARE, recovery, and in-doubt transaction handling.
- [PAPER] Spanner: Google's Globally Distributed Database
- Focus: Read the transaction section to see how 2PC is often paired with replicated metadata and precise time assumptions in real distributed databases.
Key Takeaways
- 2PC solves one specific problem: making several transactional participants reach the same commit or abort decision.
- The prepared state is the heart of the protocol and the source of its cost; once a participant votes yes, it may have to wait while holding scarce resources.
- Local serializability and distributed atomic commit are different layers; correct participants can still produce an inconsistent workflow if their commit decisions are not coordinated.
- Use 2PC only when the invariant justifies the latency and blocking trade-off and every participant can actually implement prepare/commit semantics.