Network Failure Design Review
LESSON
Core Insight
Imagine the learning platform is preparing a launch week for a large cohort. The product must serve lesson pages, record progress, issue certificates, and send notifications while traffic rises and one zone is known to have intermittent packet loss. The team asks for a "resilient network design."
A convincing answer is not a diagram with every possible component. It is a set of explicit promises. Which requests may be stale? Which writes must not duplicate? Which services can fail open? Which operations must fail closed? Which retry policies are safe? Which health signals remove a replica from traffic? Which traces prove what happened after an incident?
The hard part is that these promises cross several boundaries at once. A client deadline affects a gateway retry. A schema choice affects whether a receiver can deduplicate a write. A health check affects routing, but only if it measures the dependency that the request class actually needs. A trace can explain a timeout, but only if the request identity survives the retry and the async follow-up work.
The network does not become reliable by wishing away ambiguity. It becomes manageable when each boundary says what it knows, what it cannot know, and what it is allowed to do next.
The trade-off is resilience versus complexity. Every extra control surface can make failure easier to contain, but it can also become another thing to misconfigure, monitor, and debug.
Capstone Scenario
Design the network-facing behavior for the learning platform under these constraints:
- Lesson catalog reads should stay fast and may use cached or slightly stale data.
- Progress writes must not be duplicated, even if the client or gateway retries.
- Certificate issuance must fail closed if authoritative progress state is uncertain.
- Recommendations may fail open and disappear from the page.
- The platform runs in three zones, with one zone occasionally losing connectivity to the others.
- Operators need enough telemetry to explain whether an incident was caused by routing, retries, dependency latency, partition, or application semantics.
Your output should be a short design review memo. It should explain behavior under failure, not only list technologies.
Architecture Synthesis
Start at the application boundary. Separate operations by meaning:
- catalog read: may be stale, cacheable, safe to retry
- progress write: side effect, requires idempotency key
- certificate issue: authoritative, fail closed under uncertainty
- recommendation read: optional, may be skipped
Then attach communication policy to each class. Catalog reads can use short timeouts, local replicas, and fallback caches. Progress writes need explicit idempotency, retry budgets, and enough deadline to avoid repeated ambiguous work. Certificate issuance should require confirmed progress state and should not be routed through a partitioned minority. Recommendations can use a strict deadline and graceful omission.
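One way to keep these per-class policies reviewable is to write them down as data. The sketch below is illustrative only: the names `CommPolicy`, `OnFailure`, `POLICIES`, and `policy_for` are hypothetical, and every timeout and retry count is an assumed placeholder rather than a value from this lesson.

```python
from dataclasses import dataclass
from enum import Enum


class OnFailure(Enum):
    SERVE_STALE = "serve_stale"
    AMBIGUOUS_WRITE_ERROR = "ambiguous_write_error"
    FAIL_CLOSED = "fail_closed"
    OMIT = "omit"


@dataclass(frozen=True)
class CommPolicy:
    timeout_ms: int              # assumed placeholder value
    max_retries: int             # assumed placeholder value
    needs_idempotency_key: bool  # retries are only allowed with a key
    on_failure: OnFailure


# One entry per operation class; the numbers are assumptions for illustration.
POLICIES = {
    "catalog_read": CommPolicy(300, 1, False, OnFailure.SERVE_STALE),
    "progress_write": CommPolicy(2000, 1, True, OnFailure.AMBIGUOUS_WRITE_ERROR),
    "certificate_issue": CommPolicy(3000, 0, False, OnFailure.FAIL_CLOSED),
    "recommendation_read": CommPolicy(150, 0, False, OnFailure.OMIT),
}


def policy_for(operation: str) -> CommPolicy:
    # No default policy on purpose: an unknown operation class raises KeyError,
    # which surfaces a gap in the review instead of hiding it.
    return POLICIES[operation]
```

Leaving out a default entry is a deliberate choice in this sketch: a new operation class has to be classified before it can be called.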
Next, design routing and health. Readiness checks should remove replicas that cannot reach required dependencies for the traffic they accept. A catalog replica that cannot reach the origin may still serve cached reads; a progress replica that cannot reach the authoritative store should not accept writes. Load balancing should distinguish request classes when their failure consequences differ. Draining should protect deploys from interrupting in-flight writes. Discovery should separate stable service names from changing endpoints, and service identity should prevent accidental trust in the wrong workload.
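The readiness rule in the paragraph above can be sketched as a check that depends on the traffic class. The dependency names (`local_cache`, `progress_store`) and the `ready_for` helper are hypothetical; a real check would probe the dependencies rather than receive a set.

```python
from typing import AbstractSet

# Assumed mapping from traffic class to the dependencies it cannot work without.
REQUIRED_DEPS = {
    "catalog_read": frozenset({"local_cache"}),
    "progress_write": frozenset({"progress_store"}),
}


def ready_for(traffic_class: str, reachable: AbstractSet[str]) -> bool:
    """Return True only if every required dependency for this class is reachable."""
    return REQUIRED_DEPS[traffic_class] <= reachable


# A replica cut off from the authoritative store keeps serving cached catalog
# reads but is removed from rotation for progress writes.
reachable_now = {"local_cache"}
catalog_ready = ready_for("catalog_read", reachable_now)
write_ready = ready_for("progress_write", reachable_now)
```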
Finally, define observability. A trace should show the caller deadline, gateway decision, retry attempt, selected upstream, idempotency key, response class, and fallback if one occurred. Metrics should show aggregate latency, retry volume, health-check failures, zone reachability, and per-operation error rates. Logs should preserve the few facts that explain individual ambiguity: request key, operation class, route, deduplication outcome, and why a retry or fallback was allowed.
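As a rough illustration of the log facts listed above, the sketch below emits one structured line per routing decision. The `DecisionRecord` type, its field names, and the example values are assumptions, not a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class DecisionRecord:
    request_key: str              # stable across retries and async follow-up work
    operation_class: str
    route: str                    # selected upstream
    attempt: int
    dedup_outcome: Optional[str]  # e.g. "first_seen" or "replayed"; None for reads
    fallback: Optional[str]       # e.g. "stale_cache"; None if no fallback was used
    reason: str                   # why the retry or fallback was allowed


def to_log_line(record: DecisionRecord) -> str:
    # Sorted keys keep lines stable, which makes them easier to diff and grep.
    return json.dumps(asdict(record), sort_keys=True)


line = to_log_line(DecisionRecord(
    request_key="req-123",
    operation_class="progress_write",
    route="progress-zone-b",
    attempt=2,
    dedup_outcome="replayed",
    fallback=None,
    reason="timeout on attempt 1, deadline budget remained",
))
```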
The result can be summarized as a policy matrix:
| Operation | Freshness | Retry policy | Routing policy | Failure mode |
|---|---|---|---|---|
| Catalog read | Stale allowed | Short retry budget | Prefer local healthy cache | Serve stale or return read error |
| Progress write | Fresh authoritative state | Retry only with idempotency key | Route to a replica that can reach the authoritative store | Deduplicate or return ambiguous-write error |
| Certificate issue | Fresh authoritative state | No blind retry of issuance side effect | Require confirmed progress authority | Fail closed |
| Recommendations | Best effort | Tiny budget or no retry | Any healthy optional dependency | Omit from response |
This matrix is useful because it prevents accidental uniformity. The same network timeout means different things for different operations.
Failure Review Table
| Failure shape | Safe behavior | Required signal |
|---|---|---|
| Recommendation timeout | Return page without recommendations | Gateway span records skipped optional dependency |
| Progress write timeout | Retry only with idempotency key and remaining deadline | Logs link attempts by stable request key |
| Single-zone minority partition | Reject authoritative writes on minority side | Quorum or reachability signal blocks commit |
| Replica dependency failure | Remove from readiness for affected traffic | Readiness check reflects dependency class |
| Deploy scale-down | Drain before termination | Router stops new traffic and tracks in-flight work |
| Stale discovery entry | Suppress unsafe writes to unknown endpoint | Route decision records endpoint version and identity |
The table is intentionally concrete. It forces each failure response to name both behavior and evidence. A design that says "the platform retries" is not enough. A design that says "the gateway retries idempotent progress writes once, with jitter, only while at least 250 ms of deadline remains, and records both attempts under one request key" is testable.
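The testable retry rule quoted above can be sketched directly. This is a minimal illustration, not a gateway implementation: `send_progress_write`, `AmbiguousWriteError`, the `send` callable, and the 50 ms jitter cap are all assumptions; only the single retry and the 250 ms threshold come from the example sentence.

```python
import random
import time

MIN_REMAINING_S = 0.250  # threshold from the example rule
MAX_JITTER_S = 0.050     # assumed jitter cap


class AmbiguousWriteError(Exception):
    """The write may or may not have been applied; the caller must not guess."""


def send_progress_write(send, request_key, deadline, log,
                        now=time.monotonic, sleep=time.sleep):
    """Try once, retry at most once, and record every attempt under one key.

    `send(request_key, remaining_seconds)` is a hypothetical transport call
    that raises TimeoutError when no response arrives in time.
    """
    for attempt in (1, 2):
        log.append((request_key, attempt))
        try:
            return send(request_key, deadline - now())
        except TimeoutError:
            jitter = random.uniform(0, MAX_JITTER_S)
            # Give up if this was the retry, or if waiting out the jitter
            # would leave less than the minimum deadline budget.
            if attempt == 2 or deadline - now() - jitter < MIN_REMAINING_S:
                break
            sleep(jitter)
    raise AmbiguousWriteError(request_key)
```

The clock and sleep functions are injected so a test can drive the rule with a fake clock instead of real delays.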
Review Memo Structure
A good memo should be short enough to review but specific enough to test. One useful structure is:
- Name the operation classes and their semantics.
- Define deadlines, retry budgets, and idempotency requirements for each class.
- Explain routing, discovery, identity, readiness, and draining rules.
- Describe behavior under zone partition, dependency failure, deploy, and stale discovery.
- Name the telemetry that proves each behavior happened.
For the launch-week platform, that memo should make a few uncomfortable choices explicit. A minority zone should not issue certificates if it cannot confirm authoritative progress. A progress write that timed out should be retried only through a deduplication path. A catalog page may be served from cache, but the response should be marked internally as having used stale data. Recommendations can disappear without breaking the page, but that omission should count as degraded success.
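The fail-closed choice for certificates can be written as a small predicate. This sketch assumes a simple majority rule across the three zones and counts the local zone as reachable; `has_quorum` and `may_issue_certificate` are hypothetical names, and a real system would rely on its datastore's own quorum mechanism.

```python
TOTAL_ZONES = 3  # from the scenario constraints


def has_quorum(reachable_zones: int, total_zones: int = TOTAL_ZONES) -> bool:
    """True if this side of a partition sees a strict majority of zones.

    `reachable_zones` includes the local zone itself.
    """
    return reachable_zones > total_zones // 2


def may_issue_certificate(reachable_zones: int, progress_confirmed: bool) -> bool:
    # Fail closed: both conditions must hold, and uncertainty counts as "no".
    return has_quorum(reachable_zones) and progress_confirmed


isolated_zone = may_issue_certificate(reachable_zones=1, progress_confirmed=True)
majority_side = may_issue_certificate(reachable_zones=2, progress_confirmed=True)
```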
Readiness Check
A strong design review should answer these questions:
- Which operations are idempotent, and how does the receiver deduplicate them?
- Which request classes are optional, stale-tolerant, or authoritative?
- Which failures should remove a replica from readiness?
- Which routing decisions depend on locality, version, health, or operation type?
- What happens when discovery data is stale?
- Which telemetry proves that retries, fallbacks, and partitions behaved as expected?
- Which game-day or fault-injection test would demonstrate that the design works before launch?
If the design cannot answer these questions, the system may still run, but its failure behavior is accidental. The goal is not perfect prediction. The goal is to make the most important uncertainty visible before production forces the choice.
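For the first question in the list, one possible receiver-side answer is to remember the result of each idempotency key and replay it on a repeat. The sketch below is a single-process, in-memory illustration under that assumption; `ProgressReceiver` and its fields are hypothetical, and a production receiver would need durable, concurrency-safe storage with an expiry policy.

```python
class ProgressReceiver:
    """Applies each idempotency key at most once and replays the stored result."""

    def __init__(self):
        self._results = {}  # idempotency key -> result of the first application
        self.applied = 0    # number of writes actually applied

    def apply(self, key, learner_id, lesson_id):
        if key in self._results:
            # Duplicate delivery: return the original result, apply nothing.
            return self._results[key], "replayed"
        self.applied += 1
        result = {"learner": learner_id, "lesson": lesson_id, "sequence": self.applied}
        self._results[key] = result
        return result, "first_seen"


receiver = ProgressReceiver()
first = receiver.apply("key-1", "learner-7", "lesson-3")
retry = receiver.apply("key-1", "learner-7", "lesson-3")
```

Returning the outcome label alongside the result gives the caller something concrete to log as the deduplication outcome.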
Common Design Mistakes
One mistake is applying one retry and timeout policy to every dependency. Optional reads, idempotent writes, and authoritative side effects need different policies.
Another mistake is letting the routing layer make decisions without application semantics. A proxy can enforce rules, but it needs method, route, idempotency, deadline, and health information to choose safely.
A third mistake is instrumenting only errors. Under partial failure, successful responses can still hide retries, stale reads, fallback behavior, or skipped dependencies. The design should make degraded success visible.
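A minimal way to make degraded success visible is to classify every response, not only failures. The outcome labels and the `classify` helper below are assumptions for illustration; in practice the counts would feed whatever metrics system the platform already uses.

```python
from collections import Counter


def classify(status_ok: bool, retried: bool = False,
             served_stale: bool = False, skipped_optional: bool = False) -> str:
    """Label a response so that hidden degradation shows up in metrics."""
    if not status_ok:
        return "error"
    if retried or served_stale or skipped_optional:
        return "degraded_success"
    return "clean_success"


outcomes = Counter()
outcomes[classify(True)] += 1                         # normal page
outcomes[classify(True, skipped_optional=True)] += 1  # recommendations omitted
outcomes[classify(True, served_stale=True)] += 1      # catalog served from cache
outcomes[classify(False, retried=True)] += 1          # retry did not help
```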
Resources
- [BOOK] Designing Data-Intensive Applications
  - Link: https://dataintensive.net/
  - Focus: Revisit replication, partitions, consistency, and operational trade-offs together.
- [ARTICLE] Timeouts, Retries, and Backoff with Jitter
  - Link: https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
  - Focus: Use retry behavior as a design surface rather than a library default.
- [DOC] OpenTelemetry Concepts
  - Link: https://opentelemetry.io/docs/concepts/
  - Focus: Connect request context, traces, metrics, and logs to failure diagnosis.
- [DOC] Kubernetes Services and Probes
  - Link: https://kubernetes.io/docs/concepts/services-networking/service/
  - Focus: Use service naming, endpoints, and health probes as routing inputs.
Key Takeaways
- Network resilience starts from operation semantics: optional reads, side-effecting writes, and authoritative decisions need different behavior.
- Timeouts, retries, routing, health checks, and observability must be designed as one failure policy.
- A good design review names both the safe behavior and the evidence that proves it happened.
- The final trade-off is resilience versus complexity: every network control surface needs ownership, tests, and incident visibility.