Distributed Transactions: 2PC, Sagas, and Practical Limits

LESSON

Consistency and Replication

015 30 min intermediate

Day 286: Distributed Transactions: 2PC, Sagas, and Practical Limits

The core idea: once one business operation spans multiple shards or services, local ACID is no longer enough. You must choose how to coordinate success and failure across multiple owners, and every choice pays in blocking, latency, complexity, or compensation logic.


Today's "Aha!" Moment

The insight: Distributed transactions are not one single technique. 2PC and sagas solve different problems with very different guarantees. 2PC aims for atomic commit across participants. Sagas aim for eventual business completion through a sequence of local commits plus compensations.

Why this matters: Teams often reach for one label without clarifying the invariant they actually need. That is where pain starts. If you need strict all-or-nothing semantics, sagas may be too weak. If you use 2PC everywhere, you may buy more blocking and fragility than the workflow can afford.

Concrete anchor: An order workflow touches payment, inventory, and shipment. If payment succeeds and inventory fails, what should happen? Should the whole transaction block until everyone agrees? Or should the system commit local steps and then compensate if later steps fail?

The practical sentence to remember:
Distributed transactions are about choosing where failure is resolved: before commit with coordination, or after partial success with compensation.


Why This Matters

The problem: Sharding and service boundaries improve scale and ownership, but they also destroy the comfort of one local transaction. As soon as one operation spans multiple owners, you must answer:

Without this model:

With this model:

Operational payoff: Better workflow design, fewer stuck transactions, clearer compensations, and fewer incidents where "partial success" was treated as if it were atomic success.


Learning Objectives

By the end of this lesson, you should be able to:

  1. Explain why distributed transactions appear once one logical operation spans multiple independent owners.
  2. Describe how 2PC and sagas differ in guarantees, failure handling, and operational cost.
  3. Reason about practical limits around blocking, coordinator failure, compensation, and external side effects.

Core Concepts Explained

Concept 1: Why Cross-Boundary Transactions Are Hard

Concrete example / mini-scenario: A booking operation must reserve a seat, charge a card, and write an order record. Each part lives in a different storage boundary.

Intuition: Local transactions are easy by comparison because one engine owns the write set. Distributed transactions are hard because no single participant can unilaterally guarantee the outcome for everyone else.

What makes this difficult:

The core question:
Can all participants either commit or abort together, and if not, what is the fallback story?

That question creates the main split:

Practical framing: Before choosing a protocol, ask whether the invariant is truly atomic or whether business compensation is acceptable.


Concept 2: 2PC Buys Atomicity by Holding Everyone Together

Concrete example / mini-scenario: A coordinator asks payment, inventory, and orders to prepare. Each participant says "I am ready to commit and have durably recorded that readiness." Only after every participant votes yes does the coordinator tell them to commit.

Intuition: 2PC tries to produce one atomic yes/no outcome across participants.

The two phases:

  1. Prepare / vote

    • Coordinator asks whether each participant can commit
    • Participants durably record their prepared state and vote yes/no
  2. Commit / abort

    • If all vote yes, coordinator instructs commit
    • Otherwise, coordinator instructs abort

What this buys you:

What it costs you:

The hard truth: 2PC is not magic distributed ACID. It is coordinated waiting with failure bookkeeping. It can be the right choice, but only when the invariant is important enough to justify that cost.


Concept 3: Sagas Accept Partial Commit and Repair Later

Concrete example / mini-scenario: An order workflow first creates the order, then reserves inventory, then charges payment, then books shipping. If shipping fails, the system may refund payment and release inventory rather than trying to pretend nothing ever happened.

Intuition: A saga decomposes one large transaction into a series of local commits, each with a possible compensating action.

How it works:

  1. Run local transaction A
  2. If A succeeds, run local transaction B
  3. If B succeeds, run C, and so on
  4. If step N fails, execute compensations for prior successful steps as needed

What this buys you:

What it does not buy you:

This is the critical question:
Can you meaningfully compensate the earlier steps?

If not, calling it a saga does not solve the problem. It just renames it.


Practical Limits and Decision Trade-offs

Concrete example / mini-scenario: A team wants strict correctness, low latency, high availability, and simple operations for a cross-shard workflow.

Intuition: Distributed transactions force you to choose which pain you are willing to own.

Main trade-offs:

  1. 2PC

    • Stronger atomicity
    • More blocking, lower availability under participant/coordinator trouble
  2. Sagas

    • Better availability and decomposition
    • Weaker atomicity and more business-level compensation complexity
  3. Try to avoid crossing the boundary

    • Often the best answer is design, not protocol
    • Co-locate data or reshape ownership so the transaction becomes local

Typical good practice:

Red flag:


Troubleshooting

Issue: The system is correct, but cross-service workflows become slow or unavailable during partial outages.

Why it happens: Atomic coordination is coupling the liveness of the whole workflow to the slowest or least available participant.

Clarification / Fix: Revisit whether the invariant truly requires atomic commit across all parties.

Issue: The saga "works," but user-visible state looks inconsistent for a while.

Why it happens: Sagas explicitly allow intermediate committed states before later compensation brings the system back toward a valid business outcome.

Clarification / Fix: Make that window explicit in product and operational design. Eventual compensation is not invisible atomicity.

Issue: Compensation fails or is incomplete.

Why it happens: Compensating actions are often harder than forward actions, especially when side effects are irreversible or externally visible.

Clarification / Fix: Treat compensation logic as first-class production code, not as a theoretical appendix.

Issue: A coordinator crash leaves participants in uncertain states.

Why it happens: Prepared participants may be waiting for the final global decision.

Clarification / Fix: This is a structural property of coordinated commit protocols. Plan coordinator durability and recovery accordingly.


Advanced Connections

Connection 1: Distributed Transactions <-> Sharding

The bridge: Sharding creates multiple authority domains. Distributed transactions appear the moment one business operation crosses those boundaries.

Why this matters: The previous lesson was not just about scale. It was also about creating the conditions that make coordination harder.

Connection 2: Distributed Transactions <-> Consistency Models

The bridge: Transaction protocols define how one operation spans multiple owners. Consistency models later define what clients are allowed to observe across replicas and time.

Why this matters: Both are really about legal histories under failure and concurrency, but at different scopes.


Resources

Suggested Resources


Key Insights

  1. Distributed transactions are about cross-owner invariants, not just network calls.
  2. 2PC and sagas solve different problems and should not be treated as interchangeable branding.
  3. The best distributed transaction is often the one you avoid by redesigning ownership so the critical invariant stays local.

PREVIOUS Sharding Strategies and Rebalancing in Production NEXT Consistency Models: Strong, Causal, and Eventual Guarantees

← Back to Consistency and Replication

← Back to Learning Hub