Network Failure Design Review

LESSON

008 45 min intermediate CAPSTONE


Core Insight

Imagine the learning platform is preparing a launch week for a large cohort. The product must serve lesson pages, record progress, issue certificates, and send notifications while traffic rises and one region is known to have intermittent packet loss. The team asks for a "resilient network design."

A convincing answer is not a diagram with every possible component. It is a set of explicit promises. Which requests may be stale? Which writes must not duplicate? Which services can fail open? Which operations must fail closed? Which retry policies are safe? Which health signals remove a replica from traffic? Which traces prove what happened after an incident?

The hard part is that these promises cross several boundaries at once. A client deadline affects a gateway retry. A schema choice affects whether a receiver can deduplicate a write. A health check affects routing, but only if it measures the dependency that the request class actually needs. A trace can explain a timeout, but only if the request identity survives the retry and the async follow-up work.

The network does not become reliable by wishing away ambiguity. It becomes manageable when each boundary says what it knows, what it cannot know, and what it is allowed to do next.

The trade-off is resilience versus complexity. Every extra control surface can make failure easier to contain, but it can also become another thing to misconfigure, monitor, and debug.

Capstone Scenario

Design the network-facing behavior for the learning platform under these constraints:

Lesson catalog reads should stay fast and may use cached or slightly stale data.
Progress writes must not be duplicated, even if the client or gateway retries.
Certificate issuance must fail closed if authoritative progress state is uncertain.
Recommendations may fail open and disappear from the page.
The platform runs in three zones, with one zone occasionally losing connectivity to the others.
Operators need enough telemetry to explain whether an incident was caused by routing, retries, dependency latency, partition, or application semantics.

Your output should be a short design review memo. It should explain behavior under failure, not only list technologies.

Architecture Synthesis

Start at the application boundary. Separate operations by meaning:

catalog read: may be stale, cacheable, safe to retry
progress write: side effect, requires idempotency key
certificate issue: authoritative, fail closed under uncertainty
recommendation read: optional, may be skipped

Then attach communication policy to each class. Catalog reads can use short timeouts, local replicas, and fallback caches. Progress writes need explicit idempotency, retry budgets, and enough deadline to avoid repeated ambiguous work. Certificate issuance should require confirmed progress state and should not be routed through a partitioned minority. Recommendations can use a strict deadline and graceful omission.
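The difference between failing open and failing closed can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: `render_page`, `issue_certificate`, `Page`, and `ProgressUncertain` are hypothetical names, and the dependency calls are passed in as plain callables so the sketch stays self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


class ProgressUncertain(Exception):
    """Raised when authoritative progress state cannot be confirmed."""


@dataclass
class Page:
    lessons: list[str]
    recommendations: list[str] = field(default_factory=list)
    degraded: bool = False  # True when an optional dependency was skipped


def render_page(lessons: list[str],
                fetch_recommendations: Callable[[], list[str]]) -> Page:
    """Fail open: a recommendation failure removes the section, not the page."""
    try:
        return Page(lessons=lessons, recommendations=fetch_recommendations())
    except (TimeoutError, ConnectionError):
        return Page(lessons=lessons, degraded=True)


def issue_certificate(learner_id: str,
                      read_progress: Callable[[str], Optional[bool]]) -> str:
    """Fail closed: issue only when completion is positively confirmed.

    read_progress returns True/False when the authoritative store answered,
    or None when the answer is unknown (an assumption of this sketch).
    """
    try:
        completed = read_progress(learner_id)
    except (TimeoutError, ConnectionError) as exc:
        raise ProgressUncertain(learner_id) from exc
    if completed is None:
        raise ProgressUncertain(learner_id)
    if not completed:
        raise ValueError("course not completed")
    return f"certificate:{learner_id}"
```

The same `TimeoutError` leads to a usable page in one function and a refusal in the other, which is the point of separating operations by meaning.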

Next, design routing and health. Readiness checks should remove replicas that cannot reach required dependencies for the traffic they accept. A catalog replica that cannot reach the origin may still serve cached reads; a progress replica that cannot reach the authoritative store should not accept writes. Load balancing should distinguish request classes when their failure consequences differ. Draining should protect deploys from interrupting in-flight writes. Discovery should separate stable service names from changing endpoints, and service identity should prevent accidental trust in the wrong workload.
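One way to express readiness per traffic class is to list the dependencies each class needs and report a replica ready only for the classes it can fully serve. The dependency names below (`local_cache`, `progress_store`, `certificate_signer`, `recommender`) are illustrative assumptions, not components named by the scenario.

```python
# Dependencies each operation class needs before a replica may accept it.
# All names here are hypothetical placeholders for this sketch.
REQUIRED_DEPENDENCIES: dict[str, set[str]] = {
    "catalog_read": {"local_cache"},
    "progress_write": {"progress_store"},
    "certificate_issue": {"progress_store", "certificate_signer"},
    "recommendation_read": {"recommender"},
}


def ready_classes(reachable: set[str]) -> set[str]:
    """Return the operation classes this replica can currently serve."""
    return {
        operation
        for operation, needed in REQUIRED_DEPENDENCIES.items()
        if needed <= reachable  # every required dependency is reachable
    }
```

A replica that can reach only its cache stays in rotation for catalog reads and drops out for progress writes, which matches the rule that readiness should reflect the dependency the request class actually needs.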

Finally, define observability. A trace should show the caller deadline, gateway decision, retry attempt, selected upstream, idempotency key, response class, and fallback if one occurred. Metrics should show aggregate latency, retry volume, health-check failures, zone reachability, and per-operation error rates. Logs should preserve the few facts that explain individual ambiguity: request key, operation class, route, deduplication outcome, and why a retry or fallback was allowed.
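As a rough sketch of what "the few facts that explain individual ambiguity" might look like, the record below carries the fields listed above and serializes to one JSON log line per attempt. The field names and the `DecisionRecord` type are assumptions for illustration; a real system would more likely attach them as span attributes through its tracing library.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class DecisionRecord:
    request_key: str              # stable across retries and async follow-ups
    operation_class: str          # e.g. "progress_write"
    attempt: int                  # 1 for the first try, 2 for the first retry
    upstream: str                 # selected endpoint
    deadline_remaining_ms: int
    response_class: str           # e.g. "ok", "timeout", "rejected"
    dedup_outcome: Optional[str] = None   # "first_seen" or "replayed"
    fallback: Optional[str] = None        # e.g. "served_stale"
    retry_reason: Optional[str] = None    # why a retry was allowed


def to_log_line(record: DecisionRecord) -> str:
    """Serialize one decision as a single JSON log line."""
    return json.dumps(asdict(record), sort_keys=True)


def attempts_for(lines: list[str], request_key: str) -> list[dict]:
    """Link all attempts that share one request key, in attempt order."""
    records = (json.loads(line) for line in lines)
    matching = [r for r in records if r["request_key"] == request_key]
    return sorted(matching, key=lambda r: r["attempt"])
```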

The result can be summarized as a policy matrix:

Operation	Freshness	Retry policy	Routing policy	Failure mode
Catalog read	Stale allowed	Short retry budget	Prefer local healthy cache	Serve stale or return read error
Progress write	Fresh authoritative state	Retry only with idempotency key	Route to a replica that can reach the authoritative store	Deduplicate or return ambiguous-write error
Certificate issue	Fresh authoritative state	No blind retry of issuance side effect	Require confirmed progress authority	Fail closed
Recommendations	Best effort	Tiny budget or no retry	Any healthy optional dependency	Omit from response

This matrix is useful because it prevents accidental uniformity. The same network timeout means different things for different operations.
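The matrix can also be kept as data that the gateway consults, so that retry decisions are looked up per operation rather than hard-coded once. The sketch below is one possible encoding; the retry counts are assumed values chosen to mirror "short budget", "tiny budget or no retry", and "no blind retry", and are not figures from the scenario.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    stale_allowed: bool
    requires_idempotency_key: bool
    max_retries: int          # assumed numbers, for illustration only
    failure_mode: str


POLICIES: dict[str, Policy] = {
    "catalog_read": Policy(True, False, 1, "serve stale or return read error"),
    "progress_write": Policy(False, True, 1, "deduplicate or return ambiguous-write error"),
    "certificate_issue": Policy(False, True, 0, "fail closed"),
    "recommendation_read": Policy(True, False, 0, "omit from response"),
}


def may_retry(operation: str, retries_so_far: int,
              idempotency_key: Optional[str]) -> bool:
    """Decide whether one more retry is allowed for this operation class."""
    policy = POLICIES[operation]
    if retries_so_far >= policy.max_retries:
        return False
    if policy.requires_idempotency_key and not idempotency_key:
        return False
    return True
```

With this shape, a timeout on a progress write without a key and a timeout on a catalog read produce different answers from the same function.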

Failure Review Table

Failure shape	Safe behavior	Required signal
Recommendation timeout	Return page without recommendations	Gateway span records skipped optional dependency
Progress write timeout	Retry only with idempotency key and remaining deadline	Logs link attempts by stable request key
One-zone minority partition	Reject authoritative writes on minority side	Quorum or reachability signal blocks commit
Replica dependency failure	Remove from readiness for affected traffic	Readiness check reflects dependency class
Deploy scale-down	Drain before termination	Router stops new traffic and tracks in-flight work
Stale discovery entry	Suppress unsafe writes to unknown endpoint	Route decision records endpoint version and identity

The table is intentionally concrete. It forces each failure response to name both behavior and evidence. A design that says "the platform retries" is not enough. A design that says "the gateway retries idempotent progress writes once, with jitter, only while at least 250 ms of deadline remains, and records both attempts under one request key" is testable.
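The testable sentence above translates almost directly into code. The following is a sketch under stated assumptions: `send_progress_write` and its parameters are hypothetical, the 50 ms jitter cap is an invented value, and the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import random
import time
from typing import Callable

MIN_REMAINING_S = 0.250  # retry only while at least 250 ms of deadline remains
MAX_RETRIES = 1          # the gateway retries once
MAX_JITTER_S = 0.050     # assumed cap on the random pause before the retry


def send_progress_write(
    send: Callable[[str, int], str],
    request_key: str,
    deadline: float,
    log: list,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Send a write, retrying at most once under the same request key.

    deadline is an absolute time on the same scale as clock().
    Every attempt is appended to log as (request_key, attempt, outcome).
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            result = send(request_key, attempt)
        except TimeoutError:
            log.append((request_key, attempt, "timeout"))
            remaining = deadline - clock()
            if attempt > MAX_RETRIES or remaining < MIN_REMAINING_S:
                raise  # surface the ambiguous write instead of retrying blindly
            # Jitter: pause a random amount without eating into the 250 ms floor.
            sleep(random.uniform(0.0, min(MAX_JITTER_S, remaining - MIN_REMAINING_S)))
        else:
            log.append((request_key, attempt, "ok"))
            return result
```

Because both attempts carry the same `request_key`, the receiver can deduplicate and the logs can link them, which is the evidence half of the promise.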

Review Memo Structure

A good memo should be short enough to review but specific enough to test. One useful structure is:

Name the operation classes and their semantics.
Define deadlines, retry budgets, and idempotency requirements for each class.
Explain routing, discovery, identity, readiness, and draining rules.
Describe behavior under zone partition, dependency failure, deploy, and stale discovery.
Name the telemetry that proves each behavior happened.

For the launch-week platform, that memo should make a few uncomfortable choices explicit. A minority zone should not issue certificates if it cannot confirm authoritative progress. A progress write that timed out should be retried only through a deduplication path. A catalog page may be served from cache, but the response should be marked internally, for example in a trace attribute or log field, as having used stale data. Recommendations can disappear without breaking the page, but that omission should count as degraded success.
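The minority-zone rule can be illustrated with a simple majority check over the three zones. This is only a sketch: the zone names are placeholders, and a real system would normally rely on the quorum mechanism of its consensus or storage layer rather than a hand-rolled reachability test.

```python
ZONES = ("zone-a", "zone-b", "zone-c")  # hypothetical zone names


def has_quorum(reachable_zones: set[str],
               all_zones: tuple[str, ...] = ZONES) -> bool:
    """True when the reachable zones form a strict majority of all zones."""
    known = reachable_zones & set(all_zones)
    return len(known) > len(all_zones) // 2


def may_issue_certificate(local_zone: str, reachable_peers: set[str]) -> bool:
    """A zone counts itself plus whichever peers it can currently reach."""
    return has_quorum({local_zone} | reachable_peers)
```

An isolated zone sees only itself (one of three) and refuses, while the two zones that can still reach each other keep issuing.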

Readiness Check

A strong design review should answer these questions:

Which operations are idempotent, and how does the receiver deduplicate them?
Which request classes are optional, stale-tolerant, or authoritative?
Which failures should remove a replica from readiness?
Which routing decisions depend on locality, version, health, or operation type?
What happens when discovery data is stale?
Which telemetry proves that retries, fallbacks, and partitions behaved as expected?
Which game-day or fault-injection test would demonstrate that the design works before launch?

If the design cannot answer these questions, the system may still run, but its failure behavior is accidental. The goal is not perfect prediction. The goal is to make the most important uncertainty visible before production forces the choice.
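For the first question, one common answer is that the receiver remembers the result of each idempotency key and replays it on a repeat. The in-memory sketch below shows the idea; `ProgressStore` and its methods are hypothetical, and a durable implementation would need to commit the key and the write in the same transaction, which this example does not attempt.

```python
class ProgressStore:
    """In-memory receiver that applies each idempotency key at most once."""

    def __init__(self) -> None:
        self._results: dict[str, int] = {}      # idempotency key -> stored result
        self.progress: dict[str, list[str]] = {}  # learner -> completed lessons

    def apply(self, idempotency_key: str, learner_id: str,
              lesson_id: str) -> tuple[int, str]:
        """Record a completed lesson; return (lessons completed, dedup outcome)."""
        if idempotency_key in self._results:
            # A retry of a write that already landed: replay, do not re-apply.
            return self._results[idempotency_key], "replayed"
        completed = self.progress.setdefault(learner_id, [])
        completed.append(lesson_id)
        self._results[idempotency_key] = len(completed)
        return len(completed), "first_seen"
```

The returned outcome (`first_seen` or `replayed`) is the kind of fact worth logging, because it shows afterwards whether a timed-out write had in fact been applied.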

Common Design Mistakes

One mistake is applying one retry and timeout policy to every dependency. Optional reads, idempotent writes, and authoritative side effects need different policies.

Another mistake is letting the routing layer make decisions without application semantics. A proxy can enforce rules, but it needs method, route, idempotency, deadline, and health information to choose safely.

A third mistake is instrumenting only errors. Under partial failure, successful responses can still hide retries, stale reads, fallback behavior, or skipped dependencies. The design should make degraded success visible.
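Making degraded success visible can be as simple as classifying every response into three outcomes instead of two. The function and outcome labels below are illustrative assumptions; the counts would normally be exported through whatever metrics library the platform already uses.

```python
from collections import Counter


def classify(status_ok: bool, retries: int,
             served_stale: bool, skipped_optional: bool) -> str:
    """Label a response so that hidden degradation is counted, not lost."""
    if not status_ok:
        return "error"
    if retries > 0 or served_stale or skipped_optional:
        return "degraded_success"
    return "success"


def summarize(responses: list[tuple[bool, int, bool, bool]]) -> Counter:
    """Count outcomes for a batch of (ok, retries, stale, skipped) tuples."""
    return Counter(classify(*response) for response in responses)
```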

Resources

[BOOK] Designing Data-Intensive Applications
- Link: https://dataintensive.net/
- Focus: Revisit replication, partitions, consistency, and operational trade-offs together.
[ARTICLE] Timeouts, Retries, and Backoff with Jitter
- Link: https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
- Focus: Use retry behavior as a design surface rather than a library default.
[DOC] OpenTelemetry Concepts
- Link: https://opentelemetry.io/docs/concepts/
- Focus: Connect request context, traces, metrics, and logs to failure diagnosis.
[DOC] Kubernetes Services and Probes
- Link: https://kubernetes.io/docs/concepts/services-networking/service/
- Focus: Use service naming, endpoints, and health probes as routing inputs.

Key Takeaways

Network resilience starts from operation semantics: optional reads, side-effecting writes, and authoritative decisions need different behavior.
Timeouts, retries, routing, health checks, and observability must be designed as one failure policy.
A good design review names both the safe behavior and the evidence that proves it happened.
The final trade-off is control versus complexity: every network control surface needs ownership, tests, and incident visibility.
