Foundations Review and Capstone Synthesis
Core Insight
Imagine you are asked to review the design for a regional checkout workflow. The product wants fast checkout, no duplicate charges, useful behavior during payment trouble, and enough evidence for support to explain what happened after an incident.
That request is too large for one magic tool. A queue can decouple work, but it can also hide overload. Replication can improve locality, but it can expose stale reads. Retries can recover temporary failure, but they can duplicate side effects. A dashboard can show symptoms, but only durable evidence can prove what happened to one order.
The capstone move is to turn the track into an architecture decision record. An architecture decision record, or ADR, is a short design note that names context, decision, consequences, and open risks. Here, the ADR should explain one distributed workflow through promises, boundaries, failure modes, mechanisms, and evidence.
The standard is not perfect architecture. The standard is reviewable architecture: another engineer should be able to see what the system promises, what it refuses to promise, and what evidence will show whether the design is working.
Capstone Scenario
Design a checkout workflow for a store that runs in two regions.
The product requirements are:
users can:
  add items to a cart
  place an order
  pay at most once for one order
  see whether the order is confirmed or pending
operators can:
  pause risky payment attempts
  repair pending orders
  explain one user's checkout path after the fact
The system has these parts:
browser
-> API gateway
-> cart service
-> inventory owner
-> payment service
-> order store
-> confirmation queue
-> observability and incident tools
The design must handle normal operation, timeout ambiguity, retry pressure, payment-provider slowness, regional partition, schema changes, and recovery after a partial incident.
Do not try to solve the whole world. Write a narrow decision for this workflow.
Architecture Decision Shape
Use this ADR structure.
Title:
  Checkout order placement under timeout, overload, and payment uncertainty
Context:
  What the workflow must promise.
  Which boundaries create uncertainty.
  Which failures are expected.
Decision:
  The mechanisms chosen.
  The states exposed to users and operators.
  The evidence retained for repair and review.
Consequences:
  What gets safer.
  What gets slower, more complex, or less available.
  What remains out of scope.
A strong ADR names the promise first:
One customer intent creates at most one successful payment
and one clear order outcome.
Then it connects mechanisms to that promise:
- Use an idempotency key such as pay:order_id for payment attempts.
- Let one order owner decide final order state.
- Show payment_pending when payment evidence and order evidence disagree.
- Apply retry budgets and backoff so payment trouble does not become a retry storm.
- Keep cart reads available during payment-degraded mode.
- Record correlation ids, operation ids, durable events, and repair status.
Each mechanism should answer a failure, not decorate the design.
Walk The Workflow Twice
First, write the product story:
The user places order-42.
The system either confirms it, marks it pending, or refuses safely.
The user is not charged twice for the same intent.
Then write the evidence story:
gateway received checkout request checkout-abc
cart snapshot version 18 was used
inventory reserved sku-9
payment operation pay:order-42 timed out at the gateway
provider later accepted pay:order-42
order_confirmed event is missing
repair job marked order-42 payment_pending
The gap between those stories is the distributed design. The product story wants one clean outcome. The evidence story may be incomplete, late, or contradictory. The ADR must say what state the system enters when the evidence is not enough for a final answer.
Useful states might include:
order_intent_saved
inventory_reserved
payment_authorizing
payment_pending
order_confirmed
needs_reconciliation
failed_safely
The names matter because they prevent the system from pretending that unknown means failed or that partial success means confirmed.
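To make the distinction between "unknown" and "failed" concrete, here is a minimal sketch of how available evidence might map to a state. The `Evidence` type, the `classify` function, and the exact mapping are assumptions for illustration; the lesson does not prescribe them. The sketch covers only the states reached once payment has been attempted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OrderState(Enum):
    PAYMENT_PENDING = "payment_pending"
    ORDER_CONFIRMED = "order_confirmed"
    NEEDS_RECONCILIATION = "needs_reconciliation"
    FAILED_SAFELY = "failed_safely"


@dataclass(frozen=True)
class Evidence:
    # None means "no durable record yet", which is different from False.
    payment_authorized: Optional[bool]
    inventory_reserved: Optional[bool]


def classify(e: Evidence) -> OrderState:
    """Hypothetical mapping from durable evidence to an honest order state."""
    if e.payment_authorized is None:
        return OrderState.PAYMENT_PENDING  # unknown is not failed
    if e.payment_authorized is False:
        return OrderState.FAILED_SAFELY  # durable proof that no charge happened
    if e.inventory_reserved is True:
        return OrderState.ORDER_CONFIRMED
    # Payment succeeded, but inventory is missing or unknown.
    return OrderState.NEEDS_RECONCILIATION
```

Using `Optional[bool]` rather than `bool` is the point of the sketch: a plain boolean would force a timeout to be read as a failure.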
Design Review Checklist
Before calling the capstone done, check these decisions.
Boundaries and ownership
- Which service owns cart state, inventory reservation, payment attempt, and final order state?
- Which facts are local, and which facts are inferred from messages?
Retries and side effects
- Which operations need stable identity?
- Which receiver stores idempotency records?
- What happens when a retry arrives after the first attempt succeeded?
Consistency and placement
- Which reads may be stale?
- Which reads must reflect the user's own write?
- Which region may accept order writes during partition?
Load and degraded modes
- What closes admission during provider trouble?
- Which optional work is shed first?
- What smaller promise remains useful to users?
Contracts and evolution
- Which messages cross queues or services?
- What old and new message versions may coexist during deploy?
- What happens to replayed messages?
Evidence and recovery
- Which ids join logs, traces, events, and records?
- What proves payment happened?
- What proves the order was confirmed?
- What evidence is required before leaving degraded mode?
The trade-off should be visible in each section. For example, waiting for stronger payment evidence may increase latency, but it prevents false confirmation. Serving cart reads from a nearby replica may improve responsiveness, but the product must state the freshness boundary.
Failure Modes To Name
A capstone answer should explicitly handle these failures:
- Gateway times out after payment provider accepts the charge.
- User retries the checkout from the browser.
- Payment provider becomes slow and causes retry pressure.
- Confirmation queue grows faster than workers can drain it.
- A regional partition prevents order owners from coordinating.
- A new event field is deployed while old consumers still exist.
- Support asks why one user's order is pending after the incident.
You do not need one mechanism per failure. A good mechanism can cover several. For example, stable operation identity helps timeout ambiguity and retries. Durable events help repair and support investigation. Admission control helps overload and degraded mode.
The weak answer is a list of technologies. The strong answer explains how each technology changes what the system can safely promise.
Capstone Deliverable
Write a short ADR with these headings:
# ADR: Checkout Order Placement Under Partial Failure
## Context
## Decision
## Consequences
## Evidence To Retain
## Open Risks
Keep it concise. The goal is not to specify every endpoint. The goal is to show that you can reason from promise to mechanism to trade-off.
Your ADR is ready when a reviewer can answer:
- What promise is protected?
- What uncertainty remains?
- What state is exposed while evidence is incomplete?
- What work is rejected, retried, repaired, or degraded?
- What evidence proves the workflow is safe after an incident?
Worked ADR: Checkout Order Placement Under Partial Failure
The following is a deliberately narrow answer to the scenario. It is not a universal checkout architecture. Its value is that every mechanism has a named promise, boundary, and cost.
Context
The store accepts orders in two regions. Users expect one purchase intent to create at most one charge and one final order outcome. Payment is external, so the gateway can time out before it knows whether the provider accepted the operation. Order confirmation may continue asynchronously. Carts can tolerate a short stale read; payment and final order state cannot silently fork.
user intent: order-42
payment operation: pay:order-42
official order owner: order shard for order-42
durable evidence: order record, payment record, outbox events, repair record
The important boundary is between a caller's timeout and the provider's outcome. A timeout proves that one caller stopped waiting. It does not prove payment failed. The design needs an honest intermediate state instead of forcing that uncertainty into “success” or “failure.”
Decision
The gateway creates order-42 and the stable payment operation pay:order-42 before calling the payment service. The payment receiver stores outcomes by that operation id, so a retry from the browser can ask about the same intent rather than create another charge.
browser -> gateway: place order, idempotency key = order-42
gateway -> order owner: save order_intent
gateway -> payment: authorize pay:order-42
payment result known:
  write durable authorization result
  publish payment_authorized through an outbox
payment result unknown at deadline:
  write payment_pending
  expose a status lookup; do not issue uncontrolled retries
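The receiver-side idempotency described in this decision can be sketched as follows. This is an illustration under stated assumptions: `PaymentReceiver`, `fake_provider`, and the in-memory dictionary are hypothetical stand-ins for a durable idempotency table and a real provider client.

```python
from typing import Callable, Dict, List


class PaymentReceiver:
    """Keeps one outcome per operation id (in-memory stand-in for a durable table)."""

    def __init__(self, call_provider: Callable[[str, int], str]) -> None:
        self._call_provider = call_provider
        self._outcomes: Dict[str, str] = {}

    def authorize(self, operation_id: str, amount_minor: int) -> str:
        # A retry with the same operation id gets the recorded outcome back.
        if operation_id in self._outcomes:
            return self._outcomes[operation_id]
        outcome = self._call_provider(operation_id, amount_minor)
        self._outcomes[operation_id] = outcome
        return outcome


provider_calls: List[str] = []


def fake_provider(operation_id: str, amount_minor: int) -> str:
    # Hypothetical provider that always authorizes and records each call.
    provider_calls.append(operation_id)
    return "authorized"


receiver = PaymentReceiver(fake_provider)
first = receiver.authorize("pay:order-42", 1999)
retry = receiver.authorize("pay:order-42", 1999)  # does not reach the provider
```

The sketch records an outcome only after the provider returns. A production receiver would likely also persist an in-progress marker before the call, so that a timeout or a concurrent retry finds a pending record instead of starting a second attempt.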
Only the order owner may make the final transition to order_confirmed or failed_safely. It consumes payment evidence and inventory evidence through durable events. A confirmation worker may retry delivery, but its state transition is idempotent: seeing the same payment_authorized event twice does not create two orders.
order state machine:
order_intent_saved
-> inventory_reserved
-> payment_authorizing
-> payment_pending (gateway lacks enough evidence)
-> order_confirmed (one durable payment and inventory decision)
-> failed_safely (no payment, or a reconciled compensation)
-> needs_reconciliation (evidence conflicts or is incomplete)
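One way to enforce this state machine is a transition table that treats a repeated event as a no-op. The edges in `ALLOWED` below are an assumed reading of the list above, not a specification from the lesson.

```python
from typing import Dict, Set

# Assumed edges; terminal states have no outgoing transitions.
ALLOWED: Dict[str, Set[str]] = {
    "order_intent_saved": {"inventory_reserved", "failed_safely"},
    "inventory_reserved": {"payment_authorizing", "failed_safely"},
    "payment_authorizing": {
        "payment_pending", "order_confirmed", "failed_safely", "needs_reconciliation",
    },
    "payment_pending": {"order_confirmed", "failed_safely", "needs_reconciliation"},
    "needs_reconciliation": {"order_confirmed", "failed_safely"},
    "order_confirmed": set(),
    "failed_safely": set(),
}


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if current == target:
        return current  # duplicate event delivery: idempotent no-op
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

Because terminal states have empty edge sets, a late or replayed event cannot move a confirmed order to another outcome; it either matches the current state or is rejected for review.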
For normal reads, carts and product details may come from nearby replicas with an explicit freshness boundary. After a customer places an order, the status endpoint carries the order version or routes to the owner so the customer's session does not move backward from payment_pending to an earlier “no order” view. During a regional partition, a region without the order owner's coordination path does not independently confirm final orders. It can preserve cart state and order intent, then show a pending or unavailable outcome.
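The "carry the order version or route to the owner" rule can be sketched as a small read path. `OrderView`, `read_status`, and the `min_version` token are hypothetical names; the assumption is that the session remembers the highest order version it has already seen.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class OrderView:
    state: str
    version: int


def read_status(
    min_version: int,
    replica: Optional[OrderView],
    owner_read: Callable[[], OrderView],
) -> OrderView:
    """Serve from the replica only if it has caught up to what the session saw."""
    if replica is not None and replica.version >= min_version:
        return replica
    # Replica is missing or stale for this session: pay the cost of an owner read.
    return owner_read()
```

The trade-off is explicit: a stale replica costs a slower owner read, but the customer never sees the order disappear or move backward.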
Walk The Normal Path
In the ordinary case, the user submits once. The gateway records intent, reserves inventory, sends one payment operation, and receives an authorization. The payment service records that authorization durably before publishing an event. The order owner records the final decision and emits order_confirmed through an outbox for email and analytics.
1. gateway: order-42 intent saved
2. inventory owner: sku-9 reserved for order-42
3. payment: pay:order-42 authorized
4. order owner: order-42 confirmed
5. outbox: order_confirmed delivered to subscribers
The response can say “confirmed” only after step 4. Email delivery may lag because it is not the authority for order state. If an email queue is delayed, the order remains confirmed and the operator has a separate, repairable notification problem.
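The outbox step in this path can be illustrated with Python's built-in `sqlite3` module. The table layout and the `confirm_order` and `relay` functions are assumptions for the sketch; the property being shown is that the state change and the outgoing event commit in one local transaction, while delivery happens afterward.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, state TEXT NOT NULL)")
db.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " payload TEXT NOT NULL, delivered INTEGER NOT NULL DEFAULT 0)"
)
db.execute("INSERT INTO orders VALUES ('order-42', 'payment_authorizing')")
db.commit()


def confirm_order(conn: sqlite3.Connection, order_id: str) -> None:
    # The state change and the outbox row commit together or not at all.
    with conn:
        conn.execute(
            "UPDATE orders SET state = 'order_confirmed' WHERE order_id = ?",
            (order_id,),
        )
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "order_confirmed", "order_id": order_id}),),
        )


def relay(conn: sqlite3.Connection, publish) -> int:
    # At-least-once delivery: a crash after publish but before the UPDATE re-sends.
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE delivered = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET delivered = 1 WHERE id = ?", (row_id,))
    return len(rows)


sent = []
confirm_order(db, "order-42")
relay(db, sent.append)
```

Because the relay is at-least-once, subscribers such as email still need to tolerate duplicates, which is why the confirmation transition itself is kept idempotent.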
Walk The Ambiguous Timeout Path
Now the provider accepts pay:order-42 after the gateway's deadline. The gateway records payment_pending and tells the user that the result is being confirmed. The user retries, but the retry carries the same idempotency key and therefore maps to the same operation, pay:order-42; the payment service returns the recorded outcome or pending status rather than starting a second charge.
gateway timeout -> payment_pending
provider later records authorization -> payment_authorized event
order worker sees event -> order_confirmed
status endpoint returns confirmed or pending, never two charges
If inventory cannot be honored by the time payment evidence arrives, the repair worker records needs_reconciliation, releases what it safely can, and voids or refunds the authorization according to provider rules. The system has not made the user happy in every path, but it has avoided lying about the outcome and retained the evidence needed to repair it.
Overload And Degraded Operation
The payment provider may become slow for many orders. The system watches provider latency, timeout rate, retry-budget use, queue age, and authorizations lacking final order decisions. When the evidence crosses a threshold, the payment-degraded playbook activates.
preserve:
  cart reads, order intent, order-status lookup, reconciliation
degrade:
  recommendations, nonessential email, immediate payment confirmation
admission control:
  bound new payment attempts and return Retry-After or payment_unavailable
The confirmation queue has a capacity and age budget. Workers have a concurrency limit so they do not overload the order store while it is slow. Retries use backoff, jitter, deadlines, and a shared budget. A message that is too old to be useful is discarded only if it has no unresolved side effect; a payment authorization instead enters reconciliation. This is the cost of preserving trustworthy state transitions under pressure.
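The retry policy named here (backoff, jitter, deadlines, and a shared budget) can be sketched as below. The class and function names, the ratio-based budget, and the default `base` and `cap` values are assumptions for illustration rather than values from the lesson.

```python
import random
from typing import Optional


class RetryBudget:
    """Shared budget: total retries may not exceed ratio * first attempts."""

    def __init__(self, ratio: float) -> None:
        self.ratio = ratio
        self.first_attempts = 0
        self.retries = 0

    def record_first_attempt(self) -> None:
        self.first_attempts += 1

    def try_spend(self) -> bool:
        if self.retries + 1 > self.first_attempts * self.ratio:
            return False
        self.retries += 1
        return True


def backoff_delay(attempt: int, base: float, cap: float, rng: random.Random) -> float:
    """Full jitter: a uniform delay between 0 and min(cap, base * 2**attempt)."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def next_delay(
    attempt: int,
    elapsed: float,
    deadline: float,
    budget: RetryBudget,
    rng: random.Random,
) -> Optional[float]:
    """Return the wait before the next retry, or None when the caller should stop."""
    delay = backoff_delay(attempt, base=0.1, cap=2.0, rng=rng)
    if elapsed + delay >= deadline:
        return None  # the retry could not finish inside the caller's deadline
    if not budget.try_spend():
        return None  # the shared budget is exhausted; shed the retry
    return delay
```

The deadline is checked before the budget is spent, so a retry that would be abandoned anyway does not consume capacity that another request could use.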
Contracts And Operational Evidence
The events payment_authorized, order_confirmed, and payment_reconciled have versioned schemas. Readers are deployed before writers depend on a new field, and consumers can process retained old messages for the documented replay window. Fields such as amount, currency, and provider reference have explicit meanings; a type-compatible change that changes meaning is treated as a new contract.
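A tolerant reader for payment_authorized might look like the sketch below. The field names, the `schema_version` key, and the idea that `provider_reference` arrived in version 2 are all hypothetical; the point is that a consumer deployed first can read both old and new messages during a rollout or replay.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

SUPPORTED_VERSIONS = (1, 2)


@dataclass(frozen=True)
class PaymentAuthorized:
    payment_operation: str
    amount_minor: int  # amount in the currency's minor unit, e.g. cents
    currency: str
    provider_reference: Optional[str]  # assumed to be new in version 2


def parse_payment_authorized(msg: Mapping[str, Any]) -> PaymentAuthorized:
    """Accept every supported version; reject unknown ones instead of guessing."""
    version = msg.get("schema_version", 1)
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported payment_authorized version: {version}")
    return PaymentAuthorized(
        payment_operation=msg["payment_operation"],
        amount_minor=msg["amount_minor"],
        currency=msg["currency"],
        provider_reference=msg.get("provider_reference"),  # absent in old messages
    )
```

Rejecting an unknown version is deliberate: a message the reader cannot interpret should go to a dead-letter or reconciliation path rather than be parsed with old assumptions.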
Every boundary retains joinable identity:
request_id: one gateway attempt
trace_id: one observed execution path
order_id: durable business object
payment_operation: one external side effect
message_id: one queue delivery
Logs explain local outcomes, traces show cross-service waiting, metrics expose systemic pressure, and durable records establish payment, order, and repair state. Support can answer “why is order-42 pending?” without reconstructing the incident from memory.
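As an illustration of joinable identity, the sketch below rebuilds one order's timeline from durable events and reports which expected facts are missing. The event records, the `seq` field, and the `explain` helper are hypothetical; they mirror the order-42 evidence story from earlier in the lesson.

```python
from typing import Any, Dict, List

EVENTS: List[Dict[str, Any]] = [
    {"seq": 1, "order_id": "order-42", "request_id": "checkout-abc",
     "event": "checkout_received"},
    {"seq": 2, "order_id": "order-42", "event": "inventory_reserved"},
    {"seq": 3, "order_id": "order-42", "payment_operation": "pay:order-42",
     "event": "gateway_timeout"},
    {"seq": 4, "order_id": "order-7", "event": "order_confirmed"},
    {"seq": 5, "order_id": "order-42", "payment_operation": "pay:order-42",
     "event": "payment_authorized"},
]

EXPECTED_FOR_CONFIRMATION = ("payment_authorized", "order_confirmed")


def explain(order_id: str, events: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Join events on order_id and list the facts still missing for confirmation."""
    ordered = sorted(events, key=lambda e: e["seq"])
    timeline = [e["event"] for e in ordered if e["order_id"] == order_id]
    missing = [name for name in EXPECTED_FOR_CONFIRMATION if name not in timeline]
    return {"timeline": timeline, "missing": missing}


report = explain("order-42", EVENTS)
```

For order-42 the report shows that payment was authorized after a gateway timeout and that order_confirmed never arrived, which is the answer support needs.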
Consequences And Open Risks
The design improves duplicate-charge protection, gives users an honest pending state, and keeps a repair path when evidence arrives late. It costs additional state transitions, outbox processing, idempotency storage, reconciliation logic, and some latency before confirmation. During partitions or provider trouble, some checkouts are intentionally unavailable or pending rather than falsely confirmed.
Open risks remain: inventory reservation expiry must match payment reconciliation timing; provider idempotency guarantees must be verified; a regional outage may delay final order confirmation; and schema retirement requires replay evidence. These are not reasons to avoid the design. They are the risks that a reviewer can now test, monitor, and assign to an owner.
Decision Trace During A Regional Partition
Assume the region hosting the order owner becomes unreachable from the gateway's region just after an order intent is saved. The local gateway might still have a cart replica and a reachable payment provider. Confirming the order locally would create a second authority for the final order state. The design refuses that shortcut.
gateway can prove:
  the customer submitted an intent
  its local region cannot reach the order owner
gateway cannot prove:
  whether another region already finalized order-42
  whether inventory and payment can still be combined safely
safe response:
  preserve idempotent intent when allowed
  return pending or unavailable
  do not confirm final order state
When communication returns, the order owner reads the durable intent and the payment operation record, then performs the same idempotent transition or reconciliation path as any delayed workflow. This costs availability for final confirmation during the partition, but it protects the promise that one order has one official outcome.
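The safe response during a partition can be reduced to a small decision function. `CheckoutResponse` and `respond` are hypothetical names, and the three boolean inputs are a simplification of what a gateway would actually know.

```python
from enum import Enum


class CheckoutResponse(Enum):
    CONFIRMED = "confirmed"
    PENDING = "pending"
    UNAVAILABLE = "unavailable"


def respond(owner_reachable: bool, owner_confirmed: bool, intent_saved: bool) -> CheckoutResponse:
    """Only a reachable owner's decision can produce a confirmed response."""
    if owner_reachable and owner_confirmed:
        return CheckoutResponse.CONFIRMED
    if intent_saved:
        # The intent is durable, so the owner can finish the workflow later.
        return CheckoutResponse.PENDING
    return CheckoutResponse.UNAVAILABLE
```

No combination of local facts returns `CONFIRMED` while the owner is unreachable, which is the refusal the design describes.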
The recovery review checks more than network health. It samples order and payment agreement, checks the age of pending intents, verifies that old-epoch writers cannot make new final decisions, and ramps admission gradually. A green latency graph without reconciled records is insufficient evidence to restore normal promises.
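The check that old-epoch writers cannot make new final decisions is commonly implemented with a fencing token. The sketch below assumes each order owner is assigned a monotonically increasing epoch when it takes ownership; `OrderRecord` and `finalize` are hypothetical names.

```python
class OrderRecord:
    """In-memory stand-in for a stored order that remembers the highest epoch seen."""

    def __init__(self) -> None:
        self.state = "payment_pending"
        self.epoch = 0

    def finalize(self, new_state: str, writer_epoch: int) -> bool:
        # Reject writers whose ownership predates the current epoch.
        if writer_epoch < self.epoch:
            return False
        self.epoch = writer_epoch
        self.state = new_state
        return True


record = OrderRecord()
accepted = record.finalize("order_confirmed", writer_epoch=2)
stale = record.finalize("failed_safely", writer_epoch=1)  # owner from before the partition
```

The sketch checks only the epoch. A fuller version would combine it with the order state machine so that a current-epoch writer also cannot overwrite a terminal state.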
The status endpoint exposes the reconciliation state and a stable order reference, so the customer does not create a new intent merely because one browser request ended. Each manual repair records the operator, evidence, decision, and provider reference. That audit trail makes later refunds, support conversations, and incident review part of the same architecture rather than a separate spreadsheet process. The trail also reveals recurring ambiguity patterns, which should become product and reliability improvements rather than permanent manual work.
Readiness Check
Close the lesson and sketch your own ADR for a different workflow: password reset, file upload, seat booking, message send, or account deletion. Include one normal path, one ambiguous timeout, one overload response, and one recovery query. If a reviewer cannot identify the owner of the final fact, the operation id that survives retries, the honest intermediate state, and the evidence for exit from degraded mode, revise the design before choosing more technology.
Resources
- [ARTICLE] Architecture Decision Records
- Focus: A lightweight format for making design decisions reviewable.
- [BOOK] Designing Data-Intensive Applications
- Focus: Use the replication, consistency, partitioning, and reliability chapters as capstone references.
- [BOOK] Site Reliability Engineering: Emergency Response
- Focus: Connect degraded modes, recovery criteria, and incident evidence to operational design.
Key Takeaways
- The capstone is an architecture decision, not a vocabulary inventory.
- Start from the user-visible promise, then name boundaries, evidence, mechanisms, and trade-offs.
- A good design has honest intermediate states for unknown, pending, degraded, and repaired outcomes.
- Reviewable distributed architecture explains what it will do when evidence is late, partial, or contradictory.