Foundations Review Check

LESSON

Distributed Systems Foundations

015 35 min beginner REVIEW

Core Insight

Imagine a support ticket that says: "Checkout timed out, I clicked again, and now the order appears twice in one dashboard but only once in another." That one sentence touches almost the whole foundation layer.

There is a timeout, but the timeout does not prove the first attempt failed. There is a retry, but the retry may be unsafe without stable operation identity. There are two dashboards, which means replicas, read paths, or contracts may disagree. There is a user-visible promise: do not charge or order twice when the user meant to buy once.

This review check is not a vocabulary recital. It asks whether the concepts have become design judgment. A strong answer separates local facts from remote inference, names the promise at risk, chooses a mechanism that protects it, and explains the trade-off.

Use the check as a rehearsal for the capstone. The capstone will ask you to write a small architecture decision. This review asks whether you can recognize the pressure points before you choose the architecture.

How To Answer Review Scenarios

For each scenario, write four short lines before choosing a mechanism:

promise:
local evidence:
missing evidence:
trade-off:

Then name the mechanism that fits the pressure. If the problem is duplicate side effects, idempotency and operation identity may matter. If the problem is one official decision, consensus or ownership may matter. If the problem is stale reads, a consistency guarantee may matter. If the problem is overload, backpressure and admission control may matter.

This order prevents a common mistake: choosing a familiar tool before the design has named what must remain true.
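One way to make that ordering hard to skip is to refuse a mechanism until the four lines are filled in. The sketch below is only an illustration: the class name `ReviewAnswer` and its method are hypothetical, not part of any course tooling.

```python
from dataclasses import dataclass


@dataclass
class ReviewAnswer:
    """The four worksheet lines, plus the mechanism chosen afterwards."""
    promise: str
    local_evidence: str
    missing_evidence: str
    trade_off: str
    mechanism: str = ""

    def choose_mechanism(self, mechanism: str) -> None:
        # Refuse to name a mechanism until all four lines are filled in.
        required = ("promise", "local_evidence", "missing_evidence", "trade_off")
        blanks = [name for name in required if not getattr(self, name).strip()]
        if blanks:
            raise ValueError("fill in before choosing a mechanism: " + ", ".join(blanks))
        self.mechanism = mechanism
```

Calling `choose_mechanism` on an answer with an empty `promise` raises, which mirrors the rule in prose: name what must remain true first, then pick the tool.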

Worked Review: Duplicate Checkout

Start with this incident:

user intent:
  buy order-42 once

web:
  sent payment request pay:order-42
  timed out after 2 seconds
  user clicked again

payment:
  provider accepted the first attempt

orders:
  one dashboard sees pending
  another dashboard sees confirmed

The promise is: one customer intent should create at most one successful charge and one clear order outcome.

The local evidence is split. The web service knows it timed out. The payment service knows a provider accepted an attempt. The order service has not produced one consistent view. None of those facts alone is the whole truth.

The missing evidence is the joined workflow state. Did both clicks carry the same idempotency key? Which service owns the official order state? Did the confirmation event publish? Which dashboard read from which replica? Is repair running?

A strong design answer could say:

Use one payment operation id for order-42.
Make payment idempotent at the receiver.
Expose payment_pending while order and payment evidence disagree.
Reconcile payment and order records before confirming or refunding.
Instrument the path with order_id, payment_operation, trace id, and durable events.

That answer does not pretend the timeout proves failure. It protects the promise while the system gathers enough evidence to repair.
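The first two lines of that answer can be sketched as a receiver that remembers the result of each operation id. This is a minimal in-memory illustration under stated assumptions: `PaymentReceiver`, the amount, and the result fields are hypothetical, and a real receiver would store results durably and guard against concurrent requests.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentReceiver:
    """Toy in-memory receiver; a real one would persist results durably."""
    results: dict = field(default_factory=dict)  # operation id -> stored result
    charges_made: int = 0

    def charge(self, operation_id: str, amount_cents: int) -> dict:
        # A repeated operation id returns the stored result instead of charging again.
        if operation_id in self.results:
            return self.results[operation_id]
        self.charges_made += 1  # stands in for the real provider call
        result = {
            "operation_id": operation_id,
            "amount_cents": amount_cents,
            "status": "accepted",
        }
        self.results[operation_id] = result
        return result


receiver = PaymentReceiver()
first = receiver.charge("pay:order-42", 1999)   # the caller timed out before seeing this
second = receiver.charge("pay:order-42", 1999)  # the user's second click, same operation id
```

Both calls return the same stored result and only one charge is made. The protection depends on both clicks carrying the same operation id, which is why the missing-evidence questions above ask about it.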

Coverage Map

Use this map to check whether you can transfer each concept into a design decision.

The goal is not to use every concept in every answer. The goal is to notice which concept is the pressure point.

Readiness Signals

You are ready for the capstone when you can read a distributed workflow and answer these questions without reaching for a slogan:

If an answer says "just retry," "just add replicas," "just use a queue," or "just make a dashboard," slow down. Those are mechanisms. The review standard is to connect each mechanism to the promise and trade-off it protects.

Timed Design Drill: The Same Incident, Three Questions

Return to the duplicate-checkout report. Before proposing a fix, separate three questions that are often collapsed.

What happened to the first payment operation?
What outcome may the user safely see now?
What work is safe while the system is uncertain?

The timeout answers none of these by itself. It proves only that one caller stopped waiting. A durable provider record might later show that pay:order-42 authorized. A different dashboard might be stale because it read from a lagging replica. A queue may contain a confirmation job that is delayed rather than missing. These are different kinds of uncertainty and need different evidence.

A disciplined response can therefore be small and specific:

promise:
  one intent creates at most one charge and one final order outcome

evidence now:
  gateway timed out; provider result is unknown; order is not confirmed

safe user state:
  payment_pending, with a status lookup rather than another charge attempt

mechanism:
  stable payment operation id, idempotent receiver, durable reconciliation record

trade-off:
  slower confirmation and a visible pending state in exchange for avoiding duplicates
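The safe user state line can be read as a small decision function over the joined evidence. The following is a sketch, assuming a hypothetical three-value provider result (`None` while unknown, otherwise "accepted" or "declined"); the state names other than payment_pending are illustrative.

```python
from enum import Enum
from typing import Optional


class UserState(Enum):
    PAYMENT_PENDING = "payment_pending"
    CONFIRMED = "confirmed"
    FAILED = "failed"


def safe_user_state(provider_result: Optional[str], order_confirmed: bool) -> UserState:
    """Map joined evidence to a user-visible state.

    provider_result is None while unknown (for example after a timeout),
    otherwise "accepted" or "declined".
    """
    if provider_result == "accepted" and order_confirmed:
        return UserState.CONFIRMED
    if provider_result == "declined" and not order_confirmed:
        return UserState.FAILED
    # Unknown or disagreeing evidence: stay pending until reconciliation runs.
    return UserState.PAYMENT_PENDING
```

A timeout maps to `safe_user_state(None, False)`, which stays pending. The function never treats missing evidence as failure, and that is the trade-off named above: slower confirmation in exchange for no duplicate charge.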

Now add load. If provider latency rises for many users, the same design needs admission control and a retry budget. Otherwise retries that are each sensible on their own add up to an unsafe collective storm. The degraded mode may keep carts readable and persist order intent while pausing new payment attempts. This is not a separate concern from correctness: it preserves the capacity needed for reconciliation and keeps ambiguous evidence from accumulating faster than the system can process it.
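A retry budget can be sketched as a counter that permits retries only up to a fraction of first attempts. This is a simplified illustration: the class name, the 0.25 ratio, and the absence of a time window are all assumptions, and production budgets usually decay or reset over time.

```python
class RetryBudget:
    """Allow retries only while they stay under a fraction of first attempts."""

    def __init__(self, ratio: float = 0.1, min_retries: int = 1):
        self.ratio = ratio
        self.min_retries = min_retries
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_retry(self) -> bool:
        allowed = max(self.min_retries, int(self.requests * self.ratio))
        if self.retries < allowed:
            self.retries += 1
            return True
        return False  # budget spent: show payment_pending instead of retrying


budget = RetryBudget(ratio=0.25)
for _ in range(8):
    budget.record_request()
grants = [budget.try_retry() for _ in range(5)]  # two retries granted, then refused
```

When the budget refuses, the caller falls back to the pending state and a status lookup rather than a new payment attempt, so the provider sees bounded extra load even when every client wants to retry.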

Finally add a deployment. If the repair worker now expects a newer event shape than the one an old dead-letter message carries, it must still understand both shapes or route the old record through a documented repair path. Recovery often reads historical data, so contract compatibility is part of incident readiness.
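Reading both shapes can be as small as one normalizing function in the repair worker. The sketch below assumes two hypothetical event shapes: an old one with no version field and a newer one marked `version: 2`. The field names are invented for illustration.

```python
def parse_confirmation(event: dict) -> dict:
    """Normalize old and new event shapes into one internal record."""
    version = event.get("version", 1)  # assumption: old events carry no version field
    if version == 1:
        return {"order_id": event["order"], "operation_id": event["payment_op"]}
    if version == 2:
        payment = event["payment"]
        return {"order_id": event["order_id"], "operation_id": payment["operation_id"]}
    # Unknown shape: do not guess; send it to the documented manual repair path.
    raise ValueError(f"unsupported event version {version!r}; route to manual repair")


old_event = {"order": "order-42", "payment_op": "pay:order-42"}
new_event = {
    "version": 2,
    "order_id": "order-42",
    "payment": {"operation_id": "pay:order-42"},
}
```

Both events normalize to the same record, so an old dead-letter message and a fresh event can flow through the same reconciliation code. An unrecognized version raises instead of being silently dropped.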

The transfer test is simple: for any scenario, name the protected promise, the evidence already held, the evidence still missing, and the next action that remains safe under that uncertainty. That sequence is more durable than memorizing individual distributed-systems slogans.

Self-Check Before The Capstone

For each mechanism you propose, ask one last pair of questions: what exact failure does it prevent, and what new cost or limit does it introduce? An idempotency key prevents duplicate processing of one named operation, but it does not decide which replica may own a final order state. A queue decouples work, but it needs deadlines and backpressure. A replica improves locality, but a read policy must say whether stale data is acceptable. A playbook makes degraded behavior deliberate, but it depends on controls and evidence that exist before the incident.

If you can name both the protection and the limit, you are ready to turn the review answers into the capstone's architecture decision.

Practice Prompt

Pick one of these scenarios and write a five-line design review:

scenario A:
  A password change says saved, but another device accepts the old password.

scenario B:
  A payment request times out, then a retry creates two provider records.

scenario C:
  A queue grows for thirty minutes while workers keep retrying a slow dependency.

scenario D:
  A new producer emits an event that an old consumer rejects during rollout.

Use this shape:

promise:
evidence:
uncertainty:
mechanism:
trade-off:

The capstone will expand the same shape into an architecture decision record.
