Day 491: Idempotency and Retry-Safe APIs

The core idea: A retry-safe API gives one business operation one durable identity, so a client that cannot tell whether the first attempt succeeded can retry without creating a second effect.

Today's "Aha!" Moment

In 059.md, Harbor Point learned how to place commits in a defensible global order. That answered "what happened first?" but not the question users actually face on bad networks: "did my booking go through, or should I try again?" A guest submits POST /bookings for cabin S12, the service reserves the cabin and authorizes the deposit, and then the phone drops off the network before the 201 Created response arrives. The guest retries from the hotel Wi-Fi thirty seconds later.

The non-obvious point is that idempotency is not just "drop duplicate packets." The server has to define what counts as the same intent. Harbor Point wants "book this cabin for this sailing for this guest" to execute once even if the request is delivered twice, but it does not want to block the same guest from intentionally booking a different excursion five minutes later. That means the API needs an operation identity that survives retries and a durable record that ties that identity to the original outcome.

Once you see that, retries stop looking like a networking detail and start looking like part of the write contract. The booking schema, the payment call, the outbox event, and the API documentation all need to agree on the same operation boundary. The trade-off is straightforward: Harbor Point adds state, retention rules, and conflict handling to the write path so clients can retry aggressively without risking double charges or duplicate reservations.

Why This Matters

Harbor Point's booking traffic is bursty and messy. Mobile clients switch between ship Wi-Fi and cellular links, the edge layer retries some upstream failures automatically, and customer-service agents sometimes resubmit a request after a timeout because the admin panel shows no confirmation. If the API treats every arrival as a fresh command, the same human action can fan out into two bookings, two payment authorizations, and two confirmation emails.

Those are not cosmetic bugs. Duplicate side effects create inventory corruption, payment disputes, and support work that is hard to unwind because every subsystem can tell a different story. The booking database may show one final booking after a compensating cleanup, while the payment provider still holds two authorizations and the email system has already sent two receipts. A production team cannot paper over that with "just don't retry on POST."

An idempotency contract changes the failure mode. Instead of forcing the client to guess whether the first attempt committed, the server says: if you retry the same operation with the same key, you will either get the original answer back or a clear signal that the first attempt is still in flight. That promise costs extra writes and operational care, but it is usually cheaper than teaching every client and every operator to reason correctly about ambiguous outcomes.

Core Walkthrough

Part 1: Grounded Situation

Keep one Harbor Point flow in view. The mobile app sends:

POST /bookings
Idempotency-Key: hp-booking-8841-s12-2026-07-14

The request reaches booking-api, which asks booking-authority to reserve cabin S12, records a pending booking row, and calls the payment provider to authorize a deposit. All of that succeeds, but the API pod crashes before the response gets back to the guest. A retry lands on a different pod with the same header and the same JSON body.

If Harbor Point generates a fresh booking ID on every attempt and has no durable idempotency record, the second pod has no evidence that the first request already crossed the commit point. It may reserve a second cabin, create a second payment authorization, or fail with a vague "duplicate" message after the business effect has already happened. None of those outcomes are retry-safe because the API boundary does not remember the first decision.

A unique constraint on one table is not enough. Even if bookings(cabin_id, sailing_id) prevents a second booking row, the payment authorization and confirmation email may already have happened twice. Retry safety has to cover the whole operation the client thinks it is performing, not just one insert.

Part 2: Mechanism

Harbor Point needs an idempotency ledger keyed by operation identity, not by transport delivery. A practical record looks like this:

(tenant, endpoint, idempotency_key) -> {
  request_hash,
  status,          # in_progress | completed | failed_retriable
  resource_id,
  response_code,
  response_body,
  expires_at
}

The request_hash matters because clients sometimes accidentally reuse keys. If the same key comes back with a different booking body, the correct answer is not "close enough." It is a client error, because the server can no longer tell which operation the key represents.
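
A minimal sketch of how such a hash could be computed, assuming the request body is JSON. The field names in SEMANTIC_FIELDS and the sample values are hypothetical; the point is only that the hash covers the fields that define the booking intent and ignores incidental ones such as client timestamps.

```python
import hashlib
import json

# Hypothetical fields that define the booking intent. Anything else in the
# body (client timestamps, tracing metadata) is ignored for comparison.
SEMANTIC_FIELDS = ("guest_id", "sailing_id", "cabin_id", "deposit_cents")


def request_hash(body: dict) -> str:
    """Hash only the fields that define the business intent."""
    semantic = {name: body.get(name) for name in SEMANTIC_FIELDS}
    # sort_keys and fixed separators make the encoding independent of
    # key order and whitespace in the original request.
    canonical = json.dumps(semantic, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first_attempt = {
    "guest_id": "g-8841",
    "sailing_id": "2026-07-14",
    "cabin_id": "S12",
    "deposit_cents": 20000,
    "client_sent_at": "2026-05-01T10:00:00Z",
}
# The retry differs only in key order and in a non-semantic timestamp.
retry = {
    "cabin_id": "S12",
    "deposit_cents": 20000,
    "guest_id": "g-8841",
    "sailing_id": "2026-07-14",
    "client_sent_at": "2026-05-01T10:00:30Z",
}
# Reusing the key for a different cabin must produce a different hash.
different_intent = dict(first_attempt, cabin_id="S14")

same = request_hash(first_attempt) == request_hash(retry)
mismatch = request_hash(first_attempt) != request_hash(different_intent)
```

Which fields count as semantic is a design decision for the API owner; hashing the raw bytes instead would reject harmless retries whose serialization differs.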

The write path usually follows four steps:

1. Normalize the request into the semantic payload Harbor Point cares about and hash it.
2. Insert the idempotency record with status = in_progress under a unique constraint.
3. Perform the business work and store the canonical outcome.
4. Mark the record completed in the same durable boundary that commits the business result.
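
The claim in the second step can be sketched with a unique constraint doing the arbitration. This is an illustration under assumptions, not Harbor Point's implementation: it uses an in-memory SQLite database, a trimmed-down ledger table, and a hypothetical claim_key helper that returns a plain string.

```python
import sqlite3


def open_ledger() -> sqlite3.Connection:
    """Create an in-memory ledger; the primary key is the operation identity."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE idempotency_ledger (
            tenant          TEXT NOT NULL,
            endpoint        TEXT NOT NULL,
            idempotency_key TEXT NOT NULL,
            request_hash    TEXT NOT NULL,
            status          TEXT NOT NULL,
            PRIMARY KEY (tenant, endpoint, idempotency_key)
        )
        """
    )
    return conn


def claim_key(conn, tenant, endpoint, key, request_hash) -> str:
    """Return 'claimed', 'hash_mismatch', or the status of the existing record."""
    try:
        with conn:  # commits on success, rolls back if the insert fails
            conn.execute(
                "INSERT INTO idempotency_ledger VALUES (?, ?, ?, ?, 'in_progress')",
                (tenant, endpoint, key, request_hash),
            )
        return "claimed"
    except sqlite3.IntegrityError:
        # Someone already owns this key: compare payload hashes before replaying.
        stored_hash, status = conn.execute(
            "SELECT request_hash, status FROM idempotency_ledger "
            "WHERE tenant = ? AND endpoint = ? AND idempotency_key = ?",
            (tenant, endpoint, key),
        ).fetchone()
        if stored_hash != request_hash:
            return "hash_mismatch"
        return status


ledger = open_ledger()
scope = ("harbor-point", "POST /bookings")
first = claim_key(ledger, *scope, "hp-booking-8841-s12-2026-07-14", "hash-a")
retry = claim_key(ledger, *scope, "hp-booking-8841-s12-2026-07-14", "hash-a")
reuse = claim_key(ledger, *scope, "hp-booking-8841-s12-2026-07-14", "hash-b")
```

Here first is "claimed", retry is "in_progress", and reuse is "hash_mismatch". Letting the insert fail on the constraint, rather than reading first and inserting second, avoids a race between two pods that see the key at the same moment.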

That gives the second pod somewhere authoritative to look. In pseudocode:

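# Step 2: insert-or-fetch the ledger record under the unique constraint.
record = claim_key(scope, key, request_hash)

# Same key, different payload: the key no longer identifies one operation.
if record.state == "hash_mismatch":
    return error(422, "Idempotency key reused with different payload")
# The first attempt already committed: replay its saved response.
if record.state == "completed":
    return record.saved_response
# Another attempt holds the claim: tell the client to wait, not to re-execute.
if record.state == "in_progress" and not record.claimed_by_this_attempt:
    return error(409, "Original request still running", retry_after=2)

# Steps 3 and 4: the business result and the completed marker commit together.
with transaction():
    booking = reserve_cabin(...)
    payment = authorize_deposit(provider_key=key)  # provider dedupes on the same key
    enqueue_outbox(event_id=key, topic="booking-confirmed", booking_id=booking.id)
    save_completed_response(key, booking, payment)

return created(booking)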

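The important detail is not the syntax. It is the atomicity boundary. If Harbor Point marks the idempotency key as completed in the same transaction that commits the booking, then a retry after a crash can replay the saved 201 response instead of running the business logic again. If it marks the key completed in a separate write after the booking commits, a crash between the two writes leaves a committed booking behind a record that still says in_progress. And if it first stores the key only after the side effects, the crash window stays open and duplicates slip through.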
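
A minimal sketch of that boundary, assuming the booking row and the ledger row live in the same SQLite database so one transaction can cover both. The table layout, the finish_booking helper, and the SimulatedCrash exception are hypothetical, and the payment call is left out to keep the focus on the commit.

```python
import json
import sqlite3

KEY = "hp-booking-8841-s12-2026-07-14"

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE bookings (
        id INTEGER PRIMARY KEY, cabin_id TEXT, sailing_id TEXT,
        UNIQUE (cabin_id, sailing_id)
    );
    CREATE TABLE idempotency_ledger (
        idempotency_key TEXT PRIMARY KEY, status TEXT NOT NULL,
        response_code INTEGER, response_body TEXT
    );
    INSERT INTO idempotency_ledger (idempotency_key, status)
    VALUES ('hp-booking-8841-s12-2026-07-14', 'in_progress');
    """
)


class SimulatedCrash(Exception):
    """Stands in for the pod dying before the transaction commits."""


def finish_booking(key, cabin_id, sailing_id, crash_before_commit=False):
    with conn:  # one transaction: both writes commit together or not at all
        cur = conn.execute(
            "INSERT INTO bookings (cabin_id, sailing_id) VALUES (?, ?)",
            (cabin_id, sailing_id),
        )
        body = json.dumps({"booking_id": cur.lastrowid, "cabin_id": cabin_id})
        conn.execute(
            "UPDATE idempotency_ledger SET status = 'completed', "
            "response_code = 201, response_body = ? WHERE idempotency_key = ?",
            (body, key),
        )
        if crash_before_commit:
            raise SimulatedCrash("pod died before commit")
    return 201, body


def ledger_status(key):
    return conn.execute(
        "SELECT status FROM idempotency_ledger WHERE idempotency_key = ?", (key,)
    ).fetchone()[0]


def booking_count():
    return conn.execute("SELECT COUNT(*) FROM bookings").fetchone()[0]


try:
    finish_booking(KEY, "S12", "2026-07-14", crash_before_commit=True)
except SimulatedCrash:
    pass
after_crash = (booking_count(), ledger_status(KEY))

code, body = finish_booking(KEY, "S12", "2026-07-14")
after_commit = (booking_count(), ledger_status(KEY))
```

After the simulated crash neither write is visible, so the ledger still says in_progress and no booking exists. After the successful run both are visible together. There is no state in which a booking is committed while the ledger disagrees.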

External side effects need the same discipline. The payment provider should receive the same operation reference so two API attempts collapse into one authorization on the provider side as well. The confirmation email should come from an outbox event keyed by the booking operation, not from "send email now" logic inside the HTTP handler. Otherwise Harbor Point makes the HTTP layer idempotent while downstream systems still duplicate the effect.
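
The same idea can be sketched for the two downstream effects. Both classes below are in-memory stand-ins invented for illustration: FakePaymentProvider models a provider that deduplicates on a caller-supplied reference, and Outbox models a table with a unique event ID. Neither reflects a real provider's API.

```python
class FakePaymentProvider:
    """Stand-in for a provider that deduplicates on a caller-supplied reference."""

    def __init__(self):
        self._by_reference = {}

    def authorize(self, reference: str, amount_cents: int) -> dict:
        # A repeated reference returns the original authorization
        # instead of creating a second hold on the card.
        if reference not in self._by_reference:
            auth_id = f"auth-{len(self._by_reference) + 1}"
            self._by_reference[reference] = {
                "auth_id": auth_id,
                "amount_cents": amount_cents,
            }
        return self._by_reference[reference]

    def authorization_count(self) -> int:
        return len(self._by_reference)


class Outbox:
    """Stand-in for an outbox table with a unique constraint on event_id."""

    def __init__(self):
        self._events = {}

    def enqueue(self, event_id: str, topic: str, payload: dict) -> bool:
        if event_id in self._events:
            return False  # already queued by an earlier attempt
        self._events[event_id] = (topic, payload)
        return True

    def pending(self) -> list:
        return list(self._events.values())


provider = FakePaymentProvider()
outbox = Outbox()
key = "hp-booking-8841-s12-2026-07-14"

# Two API attempts for the same operation reuse the same reference everywhere.
results = []
for _attempt in range(2):
    auth = provider.authorize(reference=key, amount_cents=20000)
    queued = outbox.enqueue(key, "booking-confirmed", {"cabin_id": "S12"})
    results.append((auth["auth_id"], queued))
```

Both attempts see the same auth_id, and only the first enqueue returns True, so one confirmation email is sent. Whether a real provider offers this kind of deduplication, and for how long it remembers a reference, has to be checked against that provider's documentation.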

Part 3: Implications and Trade-offs

This design buys Harbor Point a much better client contract. Mobile apps can retry after timeouts instead of forcing the guest to call support. Edge proxies can use conservative retry policies without gambling on duplicate charges. Operators can search by idempotency key during incidents and see whether an operation is still running, finished successfully, or was rejected because the key was misused.

The costs are concrete. The service now owns an extra table or key-value store, a response retention policy, and a cleanup process. The retention window has to outlive realistic client behavior: if Harbor Point expires keys after ten minutes but shipboard connectivity drops for an hour, a perfectly valid retry can become a second booking. The scope also matters. A key scoped too narrowly, such as one regenerated on every HTTP attempt, allows duplicates; a key scoped too broadly, such as one shared by everything a guest does in a session, blocks legitimate repeat actions. "One key per HTTP request" is often wrong. The real unit is "one key per business intent."
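
On the client side, "one key per business intent" means generating the key once and reusing it on every attempt. The sketch below is a toy under assumptions: FlakyBookingServer is an invented in-memory server that performs the work but loses its first response, and book_with_retries is a hypothetical client helper.

```python
import uuid


class FlakyBookingServer:
    """Toy server: does the work on first sight of a key, then loses one response."""

    def __init__(self):
        self.saved_responses = {}
        self.bookings_created = 0
        self._drop_next_response = True

    def post_booking(self, idempotency_key: str, body: dict):
        if idempotency_key not in self.saved_responses:
            self.bookings_created += 1
            booking = {"booking_id": f"bk-{self.bookings_created}", **body}
            self.saved_responses[idempotency_key] = (201, booking)
        if self._drop_next_response:
            self._drop_next_response = False
            raise TimeoutError("response lost on the way back")
        return self.saved_responses[idempotency_key]


def book_with_retries(send, body: dict, max_attempts: int = 3):
    key = str(uuid.uuid4())  # generated once per intent, not once per attempt
    for _ in range(max_attempts):
        try:
            return send(key, body)
        except TimeoutError:
            continue  # outcome unknown: retry with the same key
    raise TimeoutError("no response after retries")


# Correct client: one key across attempts, one booking.
server = FlakyBookingServer()
status, booking = book_with_retries(server.post_booking, {"cabin_id": "S12"})

# Naive client: a fresh key per HTTP request turns the retry into a new command.
naive_server = FlakyBookingServer()
try:
    naive_server.post_booking(str(uuid.uuid4()), {"cabin_id": "S12"})
except TimeoutError:
    naive_server.post_booking(str(uuid.uuid4()), {"cabin_id": "S12"})
```

The first server ends with one booking, the second with two, even though the guest made the same single decision in both cases.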

There is also a subtle response-design choice. Some teams store and replay the exact original HTTP body; others store the resource identifier and regenerate the response from current state. Replaying the exact body is simple and makes duplicates invisible to clients, but it couples storage to response shape. Regenerating from a resource pointer tolerates schema evolution, but only if the current representation still matches what a retry should mean. Harbor Point has to choose which trade-off fits its API stability expectations.
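
The difference between the two strategies shows up once the resource changes after the first response. A small sketch, with a hypothetical in-memory bookings store and ledger record:

```python
import json

# Hypothetical current-state store and the ledger record saved at commit time.
bookings = {"bk-1": {"id": "bk-1", "cabin_id": "S12", "status": "confirmed"}}
ledger_record = {
    "resource_id": "bk-1",
    "response_code": 201,
    "response_body": json.dumps(bookings["bk-1"], sort_keys=True),
}


def replay_exact(record: dict):
    """Return the response exactly as the first attempt produced it."""
    return record["response_code"], json.loads(record["response_body"])


def regenerate(record: dict, store: dict):
    """Rebuild the response from the resource's current state."""
    return record["response_code"], dict(store[record["resource_id"]])


# The guest cancels between the first attempt and a late retry.
bookings["bk-1"]["status"] = "cancelled"

replayed = replay_exact(ledger_record)
rebuilt = regenerate(ledger_record, bookings)
```

Here replayed still reports status "confirmed", which is what the original request produced, while rebuilt reports "cancelled" under a 201 code. Neither is wrong in general; the choice depends on whether a retry should mean "tell me what happened then" or "tell me what is true now".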

Finally, idempotency is not the same as exactly-once execution. Harbor Point is still running on at-least-once delivery paths, crash-prone servers, and external systems with their own semantics. Idempotency makes duplicate requests converge on one logical result. It does not prove the operation only executed once in every subsystem. That distinction is what the next lesson, 061.md, makes explicit.

Failure Modes and Misconceptions

- "POST cannot be retry-safe because POST is not idempotent in HTTP." HTTP method semantics and application semantics are different layers. Harbor Point is free to make POST /bookings retry-safe by introducing an explicit idempotency contract for that endpoint.
- "A unique constraint on the booking table solves the problem." It only solves one part of the operation. Payments, emails, inventory side effects, and outbox events can still duplicate unless they share the same operation identity.
- "If the key already exists, just return 200." A duplicate request needs the original outcome, not a generic success code. Clients often need the real booking ID, payment status, and error body that the first attempt produced.
- "Short key retention keeps storage cheap, so it is always better." Aggressive expiry turns legitimate retries into new commands. Retention should be tied to client retry windows, offline behavior, and support workflows, not just storage convenience.
- "Idempotency means the handler may safely run twice." The goal is stricter than that. The client should observe one logical booking result even if multiple deliveries happen. Achieving that usually requires concurrency control, downstream dedupe, and durable response replay.

Connections

Connection 1: 059.md explains why retries are ambiguous in the first place

The previous lesson showed how Harbor Point decides whether a commit belongs before or after another event. Idempotency uses that ordered history at the API edge: once the first booking attempt has committed, later retries should map back to that same committed result instead of creating a new one.

Connection 2: 061.md separates retry safety from exactly-once claims

This lesson gives Harbor Point a practical contract for duplicate requests. The next lesson tightens the language around delivery guarantees and shows why "I added idempotency keys" is still not the same as "the system processes each event exactly once."

Connection 3: ../event-driven-and-streaming/003.md applies the same idea to multi-step workflows

A saga step that compensates or retries also needs a stable operation identity. The same design move appears there: durable state plus replay-safe side effects beats hoping distributed retries never collide.

Resources

[DOC] HTTP Semantics, RFC 9110
- Link: https://www.rfc-editor.org/rfc/rfc9110#section-9.2.2
- Focus: Distinguish HTTP's built-in idempotent methods from the application-level contract Harbor Point has to add for POST /bookings.
[ARTICLE] Making retries safe with idempotent APIs
- Link: https://aws.amazon.com/builders-library/making-retries-safe-with-idempotent-APIs/
- Focus: Read how client request identifiers, late arrivals, and semantic equivalence shape a production API design.
[DOC] Stripe API: Idempotent requests
- Link: https://docs.stripe.com/api/idempotent_requests
- Focus: Notice how a public API stores request parameters, replays prior results, and rejects mismatched reuse of the same key.
[BOOK] Designing Data-Intensive Applications
- Link: https://dataintensive.net/
- Focus: Use the chapters on unreliable networks and exactly-once semantics to place idempotency inside the larger failure model of distributed systems.

Key Takeaways

- Retry safety starts with a stable identity for one business intent, not with a best-effort comparison of two HTTP payloads.
- The idempotency record has to be durable and tied to the committed outcome, otherwise crashes leave the duplicate-execution window open.
- A retry-safe API is only as strong as its downstream side effects, so payment calls and outbox consumers need the same operation identity.
- Idempotency lets clients recover from ambiguous outcomes safely, but it does not by itself deliver exactly-once execution across the whole system.
