Fault Tolerance, Retries, and Idempotency

Core Insight

Imagine buying a book online. You press Pay once. The page waits, then says the payment could not be confirmed. Behind the page, the payment service may already have sent the charge to the card provider. The provider may have accepted it. The response may have arrived late, been dropped, or reached a caller that had already stopped waiting.

The store now has a dangerous choice. If it blindly sends another charge, the customer may pay twice. If it refuses to do anything else, the order may stay stuck even though the customer still wanted the book. The system must keep a product promise while evidence is incomplete: one customer intent should produce at most one successful charge, and the order should not disappear just because a response was late.

Fault tolerance means the workflow keeps its important promise when ordinary faults interrupt the happy path. The faults here are not exotic: timeout, process restart, duplicate message, lost response, slow queue, provider callback arriving late, or worker crash after writing one record.

Retries are useful because many faults are temporary. Idempotency is what makes retries safe for side-effecting work. An operation is idempotent when repeating the same logical request does not repeat the side effect. For a payment, retrying pay:order-42 should return or continue the same payment attempt, not create a fresh charge every time the caller gets nervous.

The Failure Window

Use a checkout workflow with one side effect: charging a card.

customer intent:
  buy order-42 once

gateway:
  receives Pay click
  calls payment service
  waits for response

payment service:
  records payment attempt
  calls external card provider
  stores provider result

The risky window is the gap between "the side effect may have happened" and "the caller received a reliable answer."

gateway -> payment: authorize order-42
payment -> provider: charge card
provider -> payment: accepted
payment -> database: store charged

gateway timeout fires before response arrives

The gateway's local fact is real: it did not receive a timely answer. The payment service's local fact may also be real: it charged the card and stored the result. A fault-tolerant design must protect both realities without inventing a global truth too early.

A naive retry looks like this:

attempt 1:
  gateway -> payment: charge order-42
  gateway <- timeout

attempt 2:
  gateway -> payment: charge order-42

result:
  two independent charge attempts may exist

The retry improved liveness, because the system tried again. It damaged safety, because the repeated message may create a duplicate side effect. Fault tolerance is not just "try harder." It is "try again without breaking the promise."
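The naive path can be reproduced in a few lines. This is only a sketch: NaiveProvider and naive_pay are hypothetical names, and the lost response is simulated by raising TimeoutError after the charge has already been recorded.

```python
class NaiveProvider:
    """Hypothetical stand-in for a card provider; it only records charges."""

    def __init__(self):
        self.charges = []

    def charge(self, order_id, amount_cents):
        self.charges.append((order_id, amount_cents))
        return f"ch_{len(self.charges)}"


def naive_pay(provider, order_id, amount_cents, response_lost):
    """Charge the card, then simulate losing the response on the way back."""
    provider_id = provider.charge(order_id, amount_cents)
    if response_lost:
        raise TimeoutError("gateway stopped waiting")
    return provider_id


provider = NaiveProvider()
try:
    naive_pay(provider, "order-42", 2900, response_lost=True)   # attempt 1
except TimeoutError:
    naive_pay(provider, "order-42", 2900, response_lost=False)  # attempt 2

print(len(provider.charges))  # 2: one customer intent, two charges
```

Nothing in the second call tells the provider that it belongs to the same customer intent as the first, so the provider has no way to refuse it.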

Idempotency As Receiver Memory

The central mechanism is receiver-side memory tied to stable operation identity.

The caller sends an idempotency key that names the logical operation:

idempotency_key = pay:order-42
amount = 29.00
currency = EUR
customer = customer-17

The receiver stores the key before or as part of starting the side effect. A simplified table might look like this:

payment_attempts

key           request_hash  status       provider_id  response
pay:order-42  h7a9          running      null         null

When the same key arrives again, the receiver does not simply run the charge again. It checks the record.

if key is new:
  create payment_attempt record
  start provider charge

if key exists and request_hash matches:
  if status is running: return pending
  if status is succeeded: return stored success response
  if status is failed_final: return stored failure response

if key exists and request_hash differs:
  reject as idempotency conflict

The request hash matters. If a client reuses pay:order-42 but changes the amount from 29.00 to 290.00, the receiver should not treat that as the same operation. Stable identity protects repeated attempts of the same intent. It should not hide a changed request.
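The lookup rules above might be sketched as follows. The class and function names are illustrative, charge_fn is a hypothetical provider call, and the dict stands in for a durable table only to keep the example short; a real receiver would need storage that survives restarts.

```python
import hashlib
import json


def request_hash(fields):
    """Hash the fields that define the logical request, in a stable order."""
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IdempotencyConflict(Exception):
    """Same key, different request contents."""


class PaymentReceiver:
    def __init__(self, charge_fn):
        self.attempts = {}          # stand-in for a durable table (assumption)
        self.charge_fn = charge_fn  # hypothetical provider call

    def handle(self, key, fields):
        digest = request_hash(fields)
        record = self.attempts.get(key)
        if record is None:
            # New key: remember it before starting the side effect.
            record = {"request_hash": digest, "status": "running", "response": None}
            self.attempts[key] = record
            record["response"] = self.charge_fn(fields)
            record["status"] = "succeeded"
            return record["response"]
        if record["request_hash"] != digest:
            raise IdempotencyConflict(key)
        if record["status"] == "running":
            return "pending"
        return record["response"]  # stored response for a finished attempt
```

If charge_fn raises, the record stays in running, which matches the idea that the outcome is unknown rather than failed. Calling handle twice with pay:order-42 and identical fields invokes charge_fn once; changing the amount under the same key raises IdempotencyConflict.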

This is why a trace id is not an idempotency key. A trace id is useful for following one attempt through logs. It may change on every retry. An idempotency key must stay stable across retries of the same logical operation and must be checked by the side-effect owner before the side effect is applied again.

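Idempotency is therefore not a magic HTTP header. It is a contract: the caller keeps the key stable across retries of the same logical operation, and the side-effect owner records the key durably and checks it before applying the side effect again.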

Without durable receiver memory, the key is just decoration.

There is one more detail that often gets missed: the first write of the idempotency record must be protected from races. If two retry attempts for pay:order-42 arrive at nearly the same time, both should not be able to look for the key, find nothing, and start two provider charges. The payment service needs an atomic create-or-read step, usually enforced by a unique constraint, transaction, compare-and-set operation, or other owner-side guard.

The shape is:

try to create key=pay:order-42
if create succeeds:
  this request owns the first execution
if create fails because key already exists:
  read the existing attempt and follow its status

That small guard is where the promise becomes real. Idempotency is not only about choosing a nice key; it is about making the side-effect owner refuse to start two independent executions for the same key, even under concurrent retries.
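One way to get the atomic create-or-read step is a unique constraint. The sketch below uses Python's built-in sqlite3 module with an in-memory database purely for illustration; the table layout follows the simplified payment_attempts table above, and create_or_read is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payment_attempts ("
    " key TEXT PRIMARY KEY,"
    " request_hash TEXT NOT NULL,"
    " status TEXT NOT NULL)"
)


def create_or_read(conn, key, request_hash):
    """Return (owns_first_execution, status, stored_hash) for this key."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payment_attempts (key, request_hash, status)"
                " VALUES (?, ?, 'running')",
                (key, request_hash),
            )
        return True, "running", request_hash
    except sqlite3.IntegrityError:
        # The key already exists: follow the existing attempt instead.
        status, stored_hash = conn.execute(
            "SELECT status, request_hash FROM payment_attempts WHERE key = ?",
            (key,),
        ).fetchone()
        return False, status, stored_hash


first = create_or_read(conn, "pay:order-42", "h7a9")
second = create_or_read(conn, "pay:order-42", "h7a9")
print(first[0], second[0])  # True False
```

Only the caller that gets True should start the provider charge. The database decides the winner, so two concurrent retries cannot both conclude that the key is new.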

A Worked Retry Path

Now replay the book purchase with an idempotency-aware payment service.

attempt 1 at 0 ms:
  gateway -> payment: authorize key=pay:order-42
  payment stores key with status=running
  payment calls provider
  gateway timeout at 800 ms

provider response at 1100 ms:
  provider -> payment: accepted provider_id=ch_9
  payment stores status=succeeded, response=charged

attempt 2 at 1300 ms:
  gateway -> payment: authorize key=pay:order-42
  payment finds existing succeeded key
  payment returns stored charged response

The second attempt did not perform a second charge. It converted a missing response into a recovered response.

There is also an important in-progress case:

attempt 2 arrives while provider call is still running:
  payment finds key=pay:order-42 with status=running
  payment returns payment_pending

The caller can then show a pending state or schedule a later check. It should not keep creating new attempts. It should not tell the user that payment failed unless the payment owner has durable evidence that the attempt failed safely.

This worked path shows the difference between fault tolerance and blind repetition. A fault-tolerant workflow keeps an operation alive across uncertain boundaries, but keeps the side effect tied to one identity.
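From the caller's side, the same path could look like the sketch below. checkout_status and call_payment are hypothetical names, and the three answer strings are assumptions chosen to mirror the states used in this lesson.

```python
def checkout_status(call_payment, key, max_attempts=2):
    """Map payment-service answers to what the user is told.

    call_payment(key) is a hypothetical client call that returns
    "charged", "payment_pending", or "failed_final", or raises TimeoutError.
    """
    for _ in range(max_attempts):
        try:
            answer = call_payment(key)   # same key on every attempt
        except TimeoutError:
            continue                     # outcome unknown, not failed
        if answer == "charged":
            return "order_confirmed"
        if answer == "failed_final":
            return "payment_failed"      # the owner has durable evidence
        return "payment_pending"
    return "payment_pending"             # still unknown after live retries


# Replay the worked path: a timeout first, then the stored response.
script = iter([TimeoutError(), "charged"])
seen_keys = []


def scripted_call(key):
    seen_keys.append(key)
    step = next(script)
    if isinstance(step, Exception):
        raise step
    return step


print(checkout_status(scripted_call, "pay:order-42"))  # order_confirmed
```

The function never reports a failure on a timeout alone; it only does so when the payment owner answers failed_final.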

Retry Policy: Which Faults Deserve Another Attempt?

Retries recover from temporary faults, but they are not free. A retry adds load to a dependency that may already be slow. If many callers retry at the same time, retries can turn a small slowdown into a wider outage.

A useful retry policy answers five questions.

First: what identity makes a retry safe? For payment, the answer is the idempotency key owned by the payment service.

Second: which failures are retryable? A timeout, connection reset, 503 Service Unavailable, or 429 Too Many Requests may be retryable with care. A validation error, bad currency, expired card, or idempotency conflict usually should not be retried unchanged.

Third: how many attempts fit inside the deadline? The caller should not retry forever inside one user request.

Fourth: how are attempts spaced? Backoff waits longer between attempts. Jitter adds randomness so many callers do not retry in lockstep.

Fifth: what state appears when live retries stop? If the outcome remains unknown, the workflow needs a named state such as payment_pending or needs_reconciliation.

A compact policy might be:

payment live deadline: 3000 ms

attempt 1 at 0 ms
attempt 2 after 200 ms plus jitter
attempt 3 after 600 ms plus jitter
then stop live retries

if outcome still unknown:
  mark order payment_pending
  schedule reconciliation
  show user an honest pending state

The trade-off is explicit. The system tries to recover quickly from temporary faults, but it protects the dependency and the user promise by stopping live retries before they become uncontrolled load.
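A minimal sketch of such a policy follows. It assumes the 200 ms and 600 ms values are waits between consecutive attempts, uses a hypothetical RetryableError to mark faults worth retrying, and takes sleep and clock as parameters so the loop can be tested without real waiting.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for timeouts, resets, 503s, and 429s."""


def call_with_retries(operation, deadline_s=3.0, base_delays_s=(0.2, 0.6),
                      max_jitter_s=0.1, sleep=time.sleep, clock=time.monotonic):
    """Run operation up to len(base_delays_s) + 1 times inside the deadline.

    Returns ("ok", result), or ("unknown", None) when live retries stop.
    Non-retryable exceptions propagate to the caller unchanged.
    """
    start = clock()
    delays = list(base_delays_s) + [None]  # no wait after the last attempt
    for delay in delays:
        try:
            return "ok", operation()
        except RetryableError:
            if delay is None:
                break
            wait = delay + random.uniform(0, max_jitter_s)
            if clock() - start + wait >= deadline_s:
                break  # the next attempt would not fit in the deadline
            sleep(wait)
    return "unknown", None
```

On "unknown", the caller would mark the order payment_pending and schedule reconciliation instead of looping again. The operation passed in must carry the same idempotency key on every attempt, otherwise this loop recreates the naive retry.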

Repair After The Live Request

Some failures outlast the browser request. The user closes the tab. The provider callback arrives ten minutes later. A worker restarts after charging but before publishing an OrderPaid event. A queue delivers a message twice.

That is why fault-tolerant workflows need durable states. A boolean such as paid=true or paid=false is often too small. The workflow may need states that separate evidence from the final product promise:

new
  -> authorizing
  -> charged
  -> order_confirmed

authorizing
  -> payment_pending
  -> failed_safely

payment_pending
  -> charged
  -> failed_safely
  -> needs_human_review

payment_pending is not a failure to finish the design. It is the design admitting that evidence may arrive late. A reconciliation worker can later ask the payment service, read provider receipts, compare order records, and move the order to the next safe state.
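The state list can be encoded as a small transition table. This sketch reads the first group as the happy-path chain and the other two groups as alternatives; that reading, and the names ALLOWED and transition, are assumptions.

```python
# Allowed next states for each durable order state.
ALLOWED = {
    "new": {"authorizing"},
    "authorizing": {"charged", "payment_pending", "failed_safely"},
    "payment_pending": {"charged", "failed_safely", "needs_human_review"},
    "charged": {"order_confirmed"},
}


def transition(current, target):
    """Return the new state, or raise if the move is not allowed."""
    if target == current:
        return current  # repeated delivery of the same evidence is harmless
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target


state = transition("authorizing", "payment_pending")
state = transition(state, "charged")
print(state)  # charged
```

Rejecting unknown moves keeps a late or duplicated message from dragging a confirmed order back into an earlier state.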

This repair loop should also be idempotent. If the reconciliation worker runs twice for pay:order-42, the second run should observe the same final state or do harmless work. Otherwise the repair path can create the same duplicate side effects that the live path avoided.

The practical rule is: every path that can repeat should either be read-only, idempotent, or guarded by a durable owner that recognizes repeated work.
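A reconciliation step that follows this rule might look like the sketch below. The order dict, lookup_payment, and publish are hypothetical; the point is that the function only reads evidence and moves forward from the state it finds.

```python
def reconcile(order, lookup_payment, publish):
    """Move one order forward from evidence; safe to run more than once.

    order is a dict with "key", "state", and "published". lookup_payment
    and publish are hypothetical hooks for the payment owner and event bus.
    """
    if order["state"] == "payment_pending":
        evidence = lookup_payment(order["key"])  # read-only query
        if evidence == "succeeded":
            order["state"] = "charged"
        elif evidence == "failed_final":
            order["state"] = "failed_safely"
        # Anything else: leave the order pending for a later run.
    if order["state"] == "charged" and not order["published"]:
        publish("OrderPaid", order["key"])
        order["published"] = True
    return order["state"]


events = []
order = {"key": "pay:order-42", "state": "payment_pending", "published": False}
for _ in range(2):  # the worker runs twice
    reconcile(order, lambda key: "succeeded", lambda name, key: events.append((name, key)))

print(order["state"], len(events))  # charged 1
```

A crash between publish and setting the flag could still emit OrderPaid twice, so consumers of that event would need their own duplicate check.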

Failure Modes And Limits

The first failure mode is duplicate side effects. Retrying without stable identity can charge a card twice, reserve two seats, send duplicate notifications, or create multiple shipments.

The second failure mode is a retry storm. If callers time out and immediately retry, they multiply traffic against a dependency that may already be saturated. Backoff, jitter, deadlines, and circuit breakers reduce this risk, but the safest retry is still one that is bounded by a deadline and ends in a named product state such as payment_pending.

The third failure mode is weak receiver memory. If the payment service stores idempotency keys only in process memory, a restart can forget an in-flight attempt and allow a duplicate. The receiver's memory must survive the failures the system claims to tolerate.

The fourth failure mode is expiring keys too early. Keeping idempotency records forever can be expensive, but expiring them while late retries, provider callbacks, or human repair are still possible reopens the duplicate window.

Idempotency also has limits. It protects repeated execution of one logical operation. It does not decide global ordering across many operations. It does not elect a leader. It does not guarantee that all replicas agree. Those problems lead into coordination and consensus in the next lesson.

Operational Signals

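Fault-tolerant retry design needs evidence in production. Useful signals include the number and age of orders in payment_pending, the rate of idempotency conflicts, and the rate of duplicate key hits, especially during provider latency.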

These signals tell you whether retries are healing temporary faults or hiding a deeper problem. If payment_pending grows for hours, repair is not keeping up. If idempotency conflicts spike, clients may be reusing keys incorrectly. If duplicate key hits are common during provider latency, idempotency may be protecting users from duplicate charges exactly as intended.
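As a rough sketch, these signals can be derived from a few counters and the pending orders themselves. The counter names, the pending_since_s field, and the one-hour threshold are all assumptions for illustration, not recommended values.

```python
from collections import Counter

signals = Counter()  # hypothetical in-process counters


def record_key_lookup(outcome):
    """outcome: "new_key", "duplicate_key_hit", or "idempotency_conflict"."""
    signals[outcome] += 1


def duplicate_hit_ratio(counts):
    """Share of payment requests that were retries of a known key."""
    total = counts["new_key"] + counts["duplicate_key_hit"]
    return counts["duplicate_key_hit"] / total if total else 0.0


def stale_pending(orders, now_s, max_age_s=3600):
    """Keys of orders stuck in payment_pending longer than max_age_s."""
    return [
        o["key"] for o in orders
        if o["state"] == "payment_pending"
        and now_s - o["pending_since_s"] > max_age_s
    ]
```

A rising duplicate-hit ratio during provider latency is expected. A growing stale_pending list suggests that repair is falling behind.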

Practice Prompt

Pick one workflow with a visible side effect: charging a card, sending a notification, creating an account, reserving a seat, or submitting a form. Write:

user-visible promise:
side effect owner:
idempotency key:
fields included in the request hash:
retryable failures:
non-retryable failures:
live retry budget:
state after live retries stop:
repair evidence:
signal that would show retries are causing harm:

If the idempotency key changes on every attempt, it is not protecting the logical operation. If the receiver does not store the key durably, the system is only pretending to be idempotent.
