Foundations Review Check
Core Insight
Imagine a support ticket that says: "Checkout timed out, I clicked again, and now the order appears twice in one dashboard but only once in another." That one sentence touches almost the whole foundation layer.
There is a timeout, but the timeout does not prove the first attempt failed. There is a retry, but the retry may be unsafe without stable operation identity. There are two dashboards, which means replicas, read paths, or contracts may disagree. There is a user-visible promise: do not charge or order twice when the user meant to buy once.
This review check is not a vocabulary recital. It asks whether the concepts have become design judgment. A strong answer separates local facts from remote inference, names the promise at risk, chooses a mechanism that protects it, and explains the trade-off.
Use the check as a rehearsal for the capstone. The capstone will ask you to write a small architecture decision. This review asks whether you can recognize the pressure points before you choose the architecture.
How To Answer Review Scenarios
For each scenario, write four short lines before choosing a mechanism:
promise:
local evidence:
missing evidence:
trade-off:
Then name the mechanism that fits the pressure. If the problem is duplicate side effects, idempotency and operation identity may matter. If the problem is one official decision, consensus or ownership may matter. If the problem is stale reads, a consistency guarantee may matter. If the problem is overload, backpressure and admission control may matter.
This order prevents a common mistake: choosing a familiar tool before the design has named what must remain true.
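The ordering above can be sketched in a few lines of Python. This is only an illustration: `ReviewNote`, `CANDIDATE_MECHANISMS`, and `candidate_mechanisms` are hypothetical names, and the lookup table just restates the pairings from this section. The point it shows is that a mechanism is not offered until all four lines are filled in.

```python
from dataclasses import dataclass

# Hypothetical lookup table restating the pressure/mechanism pairings above.
CANDIDATE_MECHANISMS = {
    "duplicate side effects": ["idempotency", "operation identity"],
    "one official decision": ["consensus", "ownership"],
    "stale reads": ["consistency guarantee"],
    "overload": ["backpressure", "admission control"],
}


@dataclass
class ReviewNote:
    """The four short lines written before any mechanism is chosen."""
    promise: str = ""
    local_evidence: str = ""
    missing_evidence: str = ""
    trade_off: str = ""

    def is_complete(self) -> bool:
        fields = (self.promise, self.local_evidence,
                  self.missing_evidence, self.trade_off)
        return all(text.strip() for text in fields)


def candidate_mechanisms(note: ReviewNote, pressure: str) -> list:
    """Return candidate mechanisms only once the note names what must hold."""
    if not note.is_complete():
        raise ValueError("fill in all four lines before choosing a mechanism")
    # Unknown pressures return no candidates rather than a familiar default.
    return CANDIDATE_MECHANISMS.get(pressure, [])
```

The candidates are a starting point for judgment, not an answer; the text says each mechanism "may matter," and the table keeps that hedge by returning a list to consider.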
Worked Review: Duplicate Checkout
Start with this incident:
user intent:
  buy order-42 once
web:
  sent payment request pay:order-42
  timed out after 2 seconds
  user clicked again
payment:
  provider accepted the first attempt
orders:
  one dashboard sees pending
  another dashboard sees confirmed
The promise is: one customer intent should create at most one successful charge and one clear order outcome.
The local evidence is split. The web service knows it timed out. The payment service knows a provider accepted an attempt. The order service has not produced one consistent view. None of those facts alone is the whole truth.
The missing evidence is the joined workflow state. Did both clicks carry the same idempotency key? Which service owns the official order state? Did the confirmation event publish? Which dashboard read from which replica? Is repair running?
A strong design answer could say:
Use one payment operation id for order-42.
Make payment idempotent at the receiver.
Expose payment_pending while order and payment evidence disagree.
Reconcile payment and order records before confirming or refunding.
Instrument the path with order_id, payment_operation, trace id, and durable events.
That answer does not pretend the timeout proves failure. It protects the promise while the system gathers enough evidence to repair.
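To make "idempotent at the receiver" concrete, here is a minimal sketch. It assumes an in-memory dictionary where a real payment service would need durable storage and an atomic insert; `PaymentReceiver`, `charge`, and the `amount_cents` field are hypothetical names chosen for illustration, not part of any real payment API.

```python
class PaymentReceiver:
    """In-memory stand-in for a payment service (illustration only)."""

    def __init__(self):
        self._results = {}     # operation_id -> recorded outcome
        self.charges_made = 0  # counts real side effects, for illustration

    def charge(self, operation_id: str, amount_cents: int) -> dict:
        existing = self._results.get(operation_id)
        if existing is not None:
            # Same logical operation seen again: return the recorded outcome
            # instead of charging a second time.
            if existing["amount_cents"] != amount_cents:
                raise ValueError("operation id reused with a different amount")
            return existing

        # First time this operation id is seen: perform the side effect once
        # and record the outcome under the operation id.
        self.charges_made += 1
        result = {
            "operation_id": operation_id,
            "amount_cents": amount_cents,
            "status": "authorized",
        }
        self._results[operation_id] = result
        return result
```

With this shape, a second click that carries the same id is harmless:

```python
receiver = PaymentReceiver()
first = receiver.charge("pay:order-42", 1999)
second = receiver.charge("pay:order-42", 1999)  # the retry after the timeout
# first == second, and receiver.charges_made is 1
```

The sketch only works if both clicks carry the same operation id, which is why the design answer names a single payment operation id for order-42 first.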
Coverage Map
Use this map to check whether you can transfer each concept into a design decision.
- Network boundaries: what can this component know from messages, timeouts, and records?
- Retries and idempotency: what makes repeated attempts one logical operation?
- Consensus and quorums: which decision must not fork?
- Clocks and causality: which events depend on which earlier events?
- CAP and PACELC: what does this data path do during partition and on a normal day?
- Placement and replication: who owns writes, which copies may answer reads, and how do they repair?
- Consistency models: what may users observe after writes?
- Gossip and membership: which soft state can be temporarily stale?
- Observability: what evidence lets humans reconstruct the workflow?
- Backpressure: what work should slow, reject, or degrade under load?
- Contracts: which old and new message versions must coexist?
- Degraded modes: what smaller promise is safe when the full promise is not?
The goal is not to use every concept in every answer. The goal is to notice which concept is the pressure point.
Readiness Signals
You are ready for the capstone when you can read a distributed workflow and answer these questions without reaching for a slogan:
- What is the user-visible promise?
- Which component owns each official fact?
- What does a timeout prove, and what does it not prove?
- Which operation identity survives retries?
- Which reads may be stale, and for how long?
- What happens during overload or partition?
- What message versions may coexist?
- What evidence proves recovery is safe?
If an answer says "just retry," "just add replicas," "just use a queue," or "just make a dashboard," slow down. Those are mechanisms. The review standard is to connect each mechanism to the promise and trade-off it protects.
Timed Design Drill: The Same Incident, Three Questions
Return to the duplicate-checkout report. Before proposing a fix, separate three questions that are often collapsed.
- What happened to the first payment operation?
- What outcome may the user safely see now?
- What work is safe while the system is uncertain?
The timeout answers none of these by itself. It proves only that one caller stopped waiting. A durable provider record might later show that pay:order-42 was authorized. A different dashboard might be stale because it read from a lagging replica. A queue may contain a confirmation job that is delayed rather than missing. These are different kinds of uncertainty, and each needs different evidence.
A disciplined response can therefore be small and specific:
promise:
  one intent creates at most one charge and one final order outcome
evidence now:
  gateway timed out; provider result is unknown; order is not confirmed
safe user state:
  payment_pending, with a status lookup rather than another charge attempt
mechanism:
  stable payment operation id, idempotent receiver, durable reconciliation record
trade-off:
  slower confirmation and a visible pending state in exchange for avoiding duplicates
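One way to read the "safe user state" line is as a function of the evidence held so far. The sketch below is an assumption-laden illustration: `ProviderResult`, `safe_user_state`, and the `confirmed` and `payment_failed` state names are hypothetical; only `payment_pending` comes from the response above. Any combination of evidence that is unknown or contradictory falls back to the pending state.

```python
from enum import Enum


class ProviderResult(Enum):
    UNKNOWN = "unknown"        # e.g. the caller timed out before an answer
    AUTHORIZED = "authorized"
    DECLINED = "declined"


def safe_user_state(provider_result: ProviderResult, order_confirmed: bool) -> str:
    """Pick the strongest state the current evidence supports."""
    if provider_result is ProviderResult.AUTHORIZED and order_confirmed:
        return "confirmed"
    if provider_result is ProviderResult.DECLINED and not order_confirmed:
        return "payment_failed"
    # Unknown result, or payment and order records disagree: show pending
    # and let reconciliation decide, rather than inviting another charge.
    return "payment_pending"
```

Note that a timeout maps to `ProviderResult.UNKNOWN`, never to `DECLINED`; that is the code-level form of "the timeout does not prove failure."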
Now add load. If provider latency rises for many users, the same design needs admission control and a retry budget. Otherwise each sensible individual retry becomes an unsafe collective storm. The degraded mode may keep carts readable and persist order intent while pausing new payment attempts. This is not a separate concern from correctness: it preserves the capacity needed for reconciliation and prevents new evidence from becoming ambiguous faster than the system can process it.
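A retry budget can be sketched as a cap on retries relative to first attempts. This is a simplified illustration under stated assumptions: `RetryBudget`, `ratio`, and `min_retries` are hypothetical names, and the counters are never reset, whereas a production version would likely track them over a sliding time window.

```python
class RetryBudget:
    """Allows retries only while they stay under a fraction of first attempts."""

    def __init__(self, ratio: float = 0.1, min_retries: int = 1):
        self.ratio = ratio              # retries permitted per first attempt
        self.min_retries = min_retries  # floor so low traffic can still retry
        self.first_attempts = 0
        self.retries = 0

    def record_first_attempt(self) -> None:
        self.first_attempts += 1

    def try_retry(self) -> bool:
        """Return True and spend budget if a retry is allowed, else False."""
        allowed = max(self.min_retries, int(self.first_attempts * self.ratio))
        if self.retries >= allowed:
            return False  # budget exhausted: fail fast or show pending instead
        self.retries += 1
        return True
```

When `try_retry()` returns False, the caller falls into the degraded mode described above (keep the intent, pause the payment attempt) instead of adding to the storm.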
Finally add a deployment. If the repair worker receives a newer event shape than an old dead-letter message, it must still understand both or route the old record through a documented repair path. Recovery often reads historical data, so contract compatibility is part of incident readiness.
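A repair worker that understands both shapes might look like the sketch below. Both event shapes, the `version` field, and the function name `parse_payment_confirmed` are invented for illustration; the field names `order_id` and `payment_operation` are borrowed from the instrumentation list earlier in this lesson. Anything the worker does not recognize is rejected loudly so it can be routed to a documented repair path.

```python
def parse_payment_confirmed(event: dict) -> dict:
    """Normalize old and new event shapes into one internal record."""
    # Assumption: events without a version field are the oldest shape.
    version = event.get("version", 1)

    if version == 1:
        # Hypothetical old shape: flat fields.
        return {
            "order_id": event["order_id"],
            "operation_id": event["payment_operation"],
        }
    if version == 2:
        # Hypothetical new shape: payment details nested under "payment".
        payment = event["payment"]
        return {
            "order_id": event["order_id"],
            "operation_id": payment["operation_id"],
        }
    # Unknown version: do not guess; send it to the manual repair path.
    raise ValueError(f"unsupported event version {version!r}")
```

An old dead-letter message and a freshly produced event then yield the same internal record:

```python
old = {"order_id": "order-42", "payment_operation": "pay:order-42"}
new = {"version": 2, "order_id": "order-42",
       "payment": {"operation_id": "pay:order-42"}}
# parse_payment_confirmed(old) == parse_payment_confirmed(new)
```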
The transfer test is simple: for any scenario, name the protected promise, the evidence already held, the evidence still missing, and the next action that remains safe under that uncertainty. That sequence is more durable than memorizing individual distributed-systems slogans.
Self-Check Before The Capstone
For each mechanism you propose, ask one last pair of questions: what exact failure does it prevent, and what new cost or limit does it introduce? An idempotency key prevents duplicate processing of one named operation, but it does not decide which replica may own a final order state. A queue decouples work, but it needs deadlines and backpressure. A replica improves locality, but a read policy must say whether stale data is acceptable. A playbook makes degraded behavior deliberate, but it depends on controls and evidence that exist before the incident.
If you can name both the protection and the limit, you are ready to turn the review answers into the capstone's architecture decision.
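As one worked instance of protection and limit, consider the claim that a queue needs deadlines and backpressure. The sketch below is a hypothetical in-memory queue (`BoundedDeadlineQueue`, `offer`, and `take` are illustrative names): the size bound is the backpressure, and the deadline keeps workers from spending capacity on jobs nobody is waiting for anymore.

```python
from collections import deque


class BoundedDeadlineQueue:
    """Queue that rejects work when full and skips work that has expired."""

    def __init__(self, max_items: int):
        self.max_items = max_items
        self._items = deque()  # entries are (deadline, job) in arrival order

    def offer(self, job: str, deadline: float) -> bool:
        """Accept the job, or return False so the caller slows or degrades."""
        if len(self._items) >= self.max_items:
            return False  # backpressure: the producer learns about overload
        self._items.append((deadline, job))
        return True

    def take(self, now: float):
        """Return the oldest job still within its deadline, or None."""
        while self._items:
            deadline, job = self._items.popleft()
            if deadline > now:
                return job
            # Expired job: dropped here; a real system would record it.
        return None
```

The protection is bounded memory and fresher work; the limit is that rejected and expired jobs now need their own visible outcome, which is a cost the design has to name.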
Practice Prompt
Pick one of these scenarios and write a five-line design review:
scenario A:
  A password change says saved, but another device accepts the old password.
scenario B:
  A payment request times out, then a retry creates two provider records.
scenario C:
  A queue grows for thirty minutes while workers keep retrying a slow dependency.
scenario D:
  A new producer emits an event that an old consumer rejects during rollout.
Use this shape:
promise:
evidence:
uncertainty:
mechanism:
trade-off:
The capstone will expand the same shape into an architecture decision record.
Resources
- [ARTICLE] Notes on Distributed Systems for Young Bloods
  - Focus: Review the practical failure and uncertainty framing behind the track.
- [BOOK] Designing Data-Intensive Applications
  - Focus: Revisit replication, consistency, partitioning, and operational trade-offs.
- [BOOK] Site Reliability Engineering: Monitoring Distributed Systems
  - Focus: Connect user-visible promises to signals, evidence, and incident response.
Key Takeaways
- Strong review answers start from promises and evidence, not tool names.
- A timeout, stale replica, retry, or missing event is a clue, not the whole truth.
- Mechanisms earn their place by protecting a named promise under a named trade-off.
- The capstone uses the same reasoning in a complete architecture decision.