Day 019: Saga Pattern and Distributed Workflows

A saga keeps distributed work moving by making progress step by step and handling failure with explicit compensation instead of pretending one global rollback still exists.

Today's "Aha!" Moment

Imagine the order platform from the last two lessons. A customer places an order, inventory must be reserved, payment must be captured, and shipping must be scheduled. In a monolith with one database, you might dream of one transaction around everything. In a real distributed system, those steps usually belong to different services with different data stores and different failure modes.

That changes the question completely. You are no longer asking, "How do I make this one atomic transaction?" You are asking, "How do I move this workflow forward safely when each service can only commit its own local work, and any step may fail after earlier steps already succeeded?"

That is what a saga is for. It turns one big business action into a sequence of local commits plus compensating actions if later steps fail. The system does not rewind time. It makes the recovery path explicit. If payment succeeds but shipping cannot be created, the answer is not a magic distributed rollback. The answer is a new business action such as refunding payment and releasing reserved stock.

Signals that a saga is the right fit:

- one business action crosses multiple service boundaries
- each participant owns its own state and commits locally
- partial success is possible and must be handled deliberately
- business consistency matters, but one shared ACID transaction is no longer realistic

The common mistake is to think of sagas as "weaker transactions." That framing is misleading. A saga is not trying to imitate one database transaction imperfectly. It is solving a different problem: long-running coordination across ownership boundaries.

Why This Matters

Distributed products still have workflows that users experience as one thing: placing an order, booking a trip, starting a subscription, onboarding a tenant. But under the hood those workflows touch services with separate sources of truth, separate latency, and separate failure behavior. If the design does not model that explicitly, partial failure leaks out as confusion, duplicates, phantom reservations, or manual repair work.

Sagas matter because they force teams to admit that business workflows are not only about the happy path. They are about progress, compensating recovery, idempotency, and visibility when the system has already done some work and cannot pretend it did nothing.

This also makes sagas one of the most operationally important patterns in event-driven systems. The happy path is easy to sketch. The real value is in making failure handling, compensation, and workflow state explicit enough that the system can recover without humans guessing what happened.

Learning Objectives

By the end of this session, you will be able to:

Explain why sagas exist - Describe why distributed workflows use local commits plus compensation instead of one global transaction.
Compare choreography and orchestration clearly - Explain how each control style coordinates steps and where each becomes painful.
Recognize the real design surface - Treat compensation, idempotency, and workflow visibility as first-class concerns rather than as afterthoughts.

Core Concepts Explained

Concept 1: A Saga Replaces Global Atomicity with Explicit Stepwise Progress

Use a simple order workflow:

PlaceOrder
  -> ReserveInventory
  -> CapturePayment
  -> CreateShipment

In a distributed system, each step usually commits to a different local store. Inventory can reserve stock. Payments can capture money. Shipping can create a label. None of them owns a global transaction across all three.

That is the starting point for sagas. The workflow is treated as a sequence of durable local actions, not as one all-or-nothing database operation.

ReserveInventory succeeded
CapturePayment succeeded
CreateShipment failed
=> compensate:
   RefundPayment
   ReleaseInventory

This is the crucial mental shift: compensation is not rollback. It is new business work performed after earlier work has already happened.
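The failure scenario above can be sketched as a small in-process runner. This is a minimal illustration, not a production design: `SagaStep`, `run_saga`, and the `CancelShipment` compensation are hypothetical names, the lambdas stand in for remote service calls, and the sketch assumes compensations themselves never fail.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]       # local commit inside one service
    compensate: Callable[[], None]   # new business action that offsets it


def run_saga(steps: List[SagaStep], log: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception as exc:
            log.append(f"{step.name} failed: {exc}")
            # Only steps that already committed need compensation.
            for done in reversed(completed):
                done.compensate()
                log.append(f"compensated {done.name}")
            return False
        completed.append(step)
        log.append(f"{step.name} succeeded")
    return True


def fail_shipment() -> None:
    # Simulated failure of the last step.
    raise RuntimeError("carrier unavailable")


events: List[str] = []   # stand-in for side effects visible in each service
log: List[str] = []
steps = [
    SagaStep("ReserveInventory",
             lambda: events.append("InventoryReserved"),
             lambda: events.append("ReleaseInventory")),
    SagaStep("CapturePayment",
             lambda: events.append("PaymentCaptured"),
             lambda: events.append("RefundPayment")),
    SagaStep("CreateShipment",
             fail_shipment,
             lambda: events.append("CancelShipment")),
]
ok = run_saga(steps, log)
```

After the run, `events` still contains the original reservation and capture, followed by the refund and the release. Nothing was erased; the compensations were appended as new work.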

That is why sagas fit long-running workflows. They allow progress across service boundaries without demanding that every participant freeze its state inside a global lock-step commit protocol.

The trade-off is direct. You gain autonomy and availability across services. You pay by making partial completion visible and by needing semantic recovery steps instead of simple technical rollback.

Concept 2: Choreography and Orchestration Are Two Different Ways to Drive the Saga

Once you accept the saga model, the next question is who drives it.

In choreography, services react to events. Inventory emits InventoryReserved, payment reacts and emits PaymentCaptured, shipping reacts later, and so on. No single service owns the whole picture.
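A rough sketch of that event chain, assuming a toy synchronous in-memory bus in place of a durable broker. `EventBus` is a hypothetical helper, and the `OrderPlaced` and `ShipmentCreated` event names are assumptions added for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, str]], None]


class EventBus:
    """Toy synchronous bus; a real system would use a durable message broker."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []   # event types in publication order

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict[str, str]) -> None:
        self.history.append(event_type)
        for handler in self._handlers[event_type]:
            handler(payload)


bus = EventBus()

# Each service knows only the event it reacts to and the event it emits.
bus.subscribe("OrderPlaced", lambda e: bus.publish("InventoryReserved", e))       # inventory
bus.subscribe("InventoryReserved", lambda e: bus.publish("PaymentCaptured", e))   # payment
bus.subscribe("PaymentCaptured", lambda e: bus.publish("ShipmentCreated", e))     # shipping

bus.publish("OrderPlaced", {"order_id": "order-1001"})
```

Notice that the overall sequence is not written down anywhere in the code. It only emerges from the subscriptions, which is exactly why the full flow is harder to see in this style.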

In orchestration, one workflow coordinator keeps the state machine explicit. It tells inventory to reserve, then payment to charge, then shipping to proceed, and it decides when compensation should start.
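A minimal sketch of such a coordinator, under stated assumptions: `OrderSagaOrchestrator` and the participant stubs are hypothetical, participants are plain functions returning success or failure rather than remote calls, and state lives in memory instead of a durable store.

```python
from typing import Callable, Dict, List

# Hypothetical participant stub: takes an order ID and reports success.
Participant = Callable[[str], bool]


class OrderSagaOrchestrator:
    """Central coordinator: sends commands in order and decides when to compensate."""

    # (command, compensation) pairs; the compensation names are illustrative.
    PLAN = [
        ("ReserveInventory", "ReleaseInventory"),
        ("CapturePayment", "RefundPayment"),
        ("CreateShipment", "CancelShipment"),
    ]

    def __init__(self, participants: Dict[str, Participant]) -> None:
        self.participants = participants
        self.state = "STARTED"
        self.sent: List[str] = []   # every command issued, in order

    def _send(self, command: str, order_id: str) -> bool:
        self.sent.append(command)
        return self.participants[command](order_id)

    def run(self, order_id: str) -> str:
        undo: List[str] = []
        for command, compensation in self.PLAN:
            if not self._send(command, order_id):
                self.state = "COMPENSATING"
                # Compensation results are ignored here; real code must retry them.
                for comp in reversed(undo):
                    self._send(comp, order_id)
                self.state = "COMPENSATED"
                return self.state
            undo.append(compensation)
        self.state = "COMPLETED"
        return self.state


participants: Dict[str, Participant] = {
    "ReserveInventory": lambda order_id: True,
    "CapturePayment": lambda order_id: False,   # simulate a declined payment
    "CreateShipment": lambda order_id: True,
    "ReleaseInventory": lambda order_id: True,
    "RefundPayment": lambda order_id: True,
    "CancelShipment": lambda order_id: True,
}
orchestrator = OrderSagaOrchestrator(participants)
final_state = orchestrator.run("order-1001")
```

Here the sequence and the compensation decision both live in one place, so `orchestrator.state` and `orchestrator.sent` give an operator a single spot to inspect. The cost is that this coordinator must itself be durable and reliable.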

A compact comparison helps:

choreography:
  event-driven reactions
  more local autonomy
  harder to see full flow

orchestration:
  explicit workflow owner
  clearer sequencing and visibility
  more central coordination responsibility

Neither style is always superior. Choreography can feel elegant for simpler flows and keeps services loosely coupled at the control plane. Orchestration becomes attractive when the workflow is long, failure paths are complex, and operators need one place to inspect the saga's current state.

The practical mistake is to choose choreography because it looks "more distributed" even when the business process really needs a clear owner. A hidden workflow is not a simpler workflow. It is often just a harder one to debug.

The trade-off is that choreography preserves local autonomy but can blur the global process, while orchestration clarifies the process but introduces an explicit coordinator whose durability and reliability now matter.

Concept 3: Compensation, Idempotency, and Visibility Are the Real Hard Parts

The happy path of a saga is rarely the difficult part. The hard part starts when something fails after other steps have already succeeded.

For the order flow, that means answering design questions like:

- If payment capture is retried, how do we avoid charging twice?
- If compensation is triggered twice, how do we avoid refunding twice?
- If shipping times out, do we know whether it failed, succeeded without a reply, or merely responded late?
- What should the customer and operator see while the saga is incomplete?
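One common answer to the first two questions is an idempotency key derived from the saga ID and the step name. The sketch below is an assumption-laden illustration: `PaymentService` is a hypothetical in-memory participant, and a real service would need to record the key and apply the money movement in the same local transaction.

```python
from typing import Dict


class PaymentService:
    """Hypothetical in-memory payment participant with idempotent handlers."""

    def __init__(self) -> None:
        self.balance_cents = 0
        self._results: Dict[str, str] = {}   # idempotency key -> stored outcome

    def _once(self, key: str, delta_cents: int, outcome: str) -> str:
        if key in self._results:
            # Duplicate delivery: replay the stored outcome, apply no side effect.
            return self._results[key]
        self.balance_cents += delta_cents
        self._results[key] = outcome
        return outcome

    def capture(self, saga_id: str, amount_cents: int) -> str:
        return self._once(f"{saga_id}:capture", amount_cents, "captured")

    def refund(self, saga_id: str, amount_cents: int) -> str:
        return self._once(f"{saga_id}:refund", -amount_cents, "refunded")


payments = PaymentService()
payments.capture("saga-42", 2500)
payments.capture("saga-42", 2500)    # retry after a timeout
after_capture = payments.balance_cents

payments.refund("saga-42", 2500)
payments.refund("saga-42", 2500)     # compensation triggered twice
after_refund = payments.balance_cents
```

The same mechanism also softens the timeout question: if the caller cannot tell whether a step failed or just answered late, it can safely retry with the same key.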

This is why production sagas need more than a sequence diagram. They need:

- idempotent step handlers
- durable workflow state
- correlation IDs across steps
- explicit status transitions
- clear compensating actions
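As one possible shape for workflow state with explicit transitions, consider the sketch below. The status names, the `ALLOWED` table, and `SagaRecord` are all hypothetical, and the record is kept in memory here; in practice it would be persisted so it survives a coordinator restart.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Illustrative status graph for the order flow; the names are assumptions.
ALLOWED: Dict[str, Set[str]] = {
    "PENDING": {"INVENTORY_RESERVED", "FAILED"},
    "INVENTORY_RESERVED": {"PAYMENT_CAPTURED", "COMPENSATING"},
    "PAYMENT_CAPTURED": {"COMPLETED", "COMPENSATING"},
    "COMPENSATING": {"COMPENSATED"},
    "COMPLETED": set(),
    "COMPENSATED": set(),
    "FAILED": set(),
}


@dataclass
class SagaRecord:
    correlation_id: str   # carried on every command and event of this saga
    status: str = "PENDING"
    history: List[str] = field(default_factory=lambda: ["PENDING"])

    def transition(self, new_status: str) -> None:
        """Apply a status change only if the graph allows it."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"{self.status} -> {new_status} is not allowed")
        self.status = new_status
        self.history.append(new_status)


record = SagaRecord("order-1001")
record.transition("INVENTORY_RESERVED")
record.transition("PAYMENT_CAPTURED")
record.transition("COMPENSATING")    # shipment creation failed
record.transition("COMPENSATED")

try:
    record.transition("COMPLETED")   # a late success reply must not revive the saga
    rejected = False
except ValueError:
    rejected = True
```

With a record like this, an operator can read `record.history` to see which steps committed and which recovery path ran, instead of reconstructing it from scattered service logs.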

Without those, a saga degenerates into distributed guesswork. You know something went wrong, but you cannot tell what the system believes has already happened or which compensations are still safe to run.

The key lesson is that failure visibility is part of correctness here. A business workflow is only trustworthy if the system can explain which steps committed, which compensations ran, and what state the workflow is currently in.

The trade-off is that sagas let distributed workflows stay available and modular, but they demand more operational discipline than a local transaction. You are trading one kind of coordination pain for another, and that trade only works if compensation and observability are designed well.

Troubleshooting

Issue: "Compensation means the system becomes exactly as if nothing happened."
Why it happens / is confusing: People import the mental model of database rollback into a distributed business workflow.
Clarification / Fix: Compensation is a new business action, such as refunding or releasing inventory. It restores business consistency, but it does not erase history or guarantee that no side effects were ever visible.

Issue: "A saga diagram is enough; the rest is implementation detail."
Why it happens / is confusing: The happy path is easy to draw, so the operational work looks secondary.
Clarification / Fix: Retries, idempotency, status tracking, and visibility are the pattern. Without them, the saga is not really designed.

Issue: "Choreography is always better because it avoids a central coordinator."
Why it happens / is confusing: It sounds more decentralized and therefore more elegant.
Clarification / Fix: For complex workflows, choreography can hide ownership and make recovery harder to reason about. Choose the style that keeps the workflow understandable and operable.

Advanced Connections

Connection 1: Local Transactions <-> Distributed Business Consistency

The parallel: A local database transaction and a saga both protect business correctness, but they do so under very different ownership and failure assumptions.

Real-world case: A monolith may complete order placement inside one transaction, while a service-based platform coordinates order, payment, and shipping through a saga because ownership is split.

Connection 2: Workflow Engines <-> Orchestrated Sagas

The parallel: Workflow systems make long-running state, retries, and compensations explicit instead of scattering them across service handlers.

Real-world case: Tools such as Temporal or cloud workflow engines are often used because they give sagas durable execution history and operational visibility.

Resources

Optional Deepening Resources

[ARTICLE] Microservices.io - Saga Pattern
- Link: https://microservices.io/patterns/data/saga.html
- Focus: Read it for the canonical shape of the pattern and why sagas appear when data ownership is split.
[ARTICLE] Microsoft Azure Architecture - Saga Pattern
- Link: https://learn.microsoft.com/en-us/azure/architecture/reference-architectures/saga/saga
- Focus: Use it for the practical comparison between choreography and orchestration.
[DOC] Temporal - Compensating Actions, Part of a Complete Breakfast with Sagas
- Link: https://temporal.io/blog/compensating-actions-part-of-a-complete-breakfast-with-sagas
- Focus: Read it for compensation-oriented implementation concerns and why happy-path thinking is not enough.

Key Insights

Sagas are about stepwise progress, not global rollback - They coordinate distributed workflows through local commits and compensating actions.
Control style changes operability - Choreography and orchestration differ mainly in workflow visibility, ownership, and debugging cost.
The real work is in failure handling - Idempotency, compensation, and durable workflow state matter more than the happy path sketch.

Knowledge Check (Test Questions)

1. Why do sagas appear in many distributed systems?
- A) Because they restore one shared ACID transaction across all services.
- B) Because they coordinate multi-step business workflows when each participant can commit only local state.
- C) Because they remove the need for compensation logic.
2. What is the main difference between choreography and orchestration in a saga?
- A) Choreography coordinates through event reactions, while orchestration uses an explicit workflow owner.
- B) Choreography guarantees stronger consistency than orchestration.
- C) Orchestration means events cannot exist anywhere in the system.
3. Why is compensation not the same thing as rollback?
- A) Because compensation is a new business action that handles already-committed work rather than erasing history.
- B) Because compensation only works inside one database.
- C) Because rollback is always available across services.
Answers

1. B: Sagas are useful when one business action crosses several ownership boundaries and each service can commit only its own local transaction.

2. A: Choreography distributes control through event reactions, while orchestration keeps the workflow explicit in one coordinator or engine.

3. A: Compensation preserves business consistency after partial progress, but it does not pretend that earlier actions never happened.
