Network Failure Design Review

Networking and Failure Models

Lesson 008, 45 min, intermediate, capstone

Core Insight

Imagine the learning platform is preparing a launch week for a large cohort. The product must serve lesson pages, record progress, issue certificates, and send notifications while traffic rises and one region is known to have intermittent packet loss. The team asks for a "resilient network design."

A convincing answer is not a diagram with every possible component. It is a set of explicit promises. Which requests may be stale? Which writes must not duplicate? Which services can fail open? Which operations must fail closed? Which retry policies are safe? Which health signals remove a replica from traffic? Which traces prove what happened after an incident?

The hard part is that these promises cross several boundaries at once. A client deadline affects a gateway retry. A schema choice affects whether a receiver can deduplicate a write. A health check affects routing, but only if it measures the dependency that the request class actually needs. A trace can explain a timeout, but only if the request identity survives the retry and the async follow-up work.

The network does not become reliable by wishing away ambiguity. It becomes manageable when each boundary says what it knows, what it cannot know, and what it is allowed to do next.

The trade-off is resilience versus complexity. Every extra control surface can make failure easier to contain, but it can also become another thing to misconfigure, monitor, and debug.

Capstone Scenario

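Design the network-facing behavior for the learning platform under the launch-week conditions described above: traffic is rising across lesson pages, progress recording, certificate issuance, and notifications, and one region is known to have intermittent packet loss.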

Your output should be a short design review memo. It should explain behavior under failure, not only list technologies.

Architecture Synthesis

Start at the application boundary. Separate operations by meaning:

catalog read: may be stale, cacheable, safe to retry
progress write: side effect, requires idempotency key
certificate issue: authoritative, fail closed under uncertainty
recommendation read: optional, may be skipped

Then attach communication policy to each class. Catalog reads can use short timeouts, local replicas, and fallback caches. Progress writes need explicit idempotency keys, retry budgets, and a deadline long enough that an attempt can finish instead of timing out and leaving its outcome unknown. Certificate issuance should require confirmed progress state and should not be routed through a partitioned minority. Recommendations can use a strict deadline and graceful omission.
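To make the idempotency requirement for progress writes concrete, here is a minimal sketch of receiver-side deduplication. `ProgressStore`, `record_progress`, and the key format are hypothetical names chosen for illustration, and the in-memory dictionaries stand in for whatever durable store the platform actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class ProgressStore:
    """In-memory stand-in for the authoritative progress store (hypothetical)."""
    # idempotency key -> result recorded by the first successful attempt
    seen: dict = field(default_factory=dict)
    # (learner_id, lesson_id) -> number of applied writes, to make duplicates visible
    completions: dict = field(default_factory=dict)

    def record_progress(self, idempotency_key: str, learner_id: str, lesson_id: str) -> dict:
        # A replayed key returns the original result and applies no new side effect.
        if idempotency_key in self.seen:
            return {**self.seen[idempotency_key], "deduplicated": True}
        slot = (learner_id, lesson_id)
        self.completions[slot] = self.completions.get(slot, 0) + 1
        result = {"learner_id": learner_id, "lesson_id": lesson_id, "deduplicated": False}
        self.seen[idempotency_key] = result
        return result


store = ProgressStore()
first = store.record_progress("req-123", "learner-1", "lesson-8")
retry = store.record_progress("req-123", "learner-1", "lesson-8")  # same key, e.g. after a timeout
```

The retry carries the same key as the first attempt, so the store applies the write once and reports the second call as deduplicated. A real receiver would also need the key table to be durable and to expire old keys, which this sketch leaves out.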

Next, design routing and health. Readiness checks should remove replicas that cannot reach required dependencies for the traffic they accept. A catalog replica that cannot reach the origin may still serve cached reads; a progress replica that cannot reach the authoritative store should not accept writes. Load balancing should distinguish request classes when their failure consequences differ. Draining should protect deploys from interrupting in-flight writes. Discovery should separate stable service names from changing endpoints, and service identity should prevent accidental trust in the wrong workload.
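One way to picture readiness that depends on traffic class is a lookup from each class to the dependencies it needs. The sketch below assumes hypothetical dependency names (`local_cache`, `authoritative_store`, `signing_service`, `recommendation_service`); the real set would come from the platform's own architecture.

```python
# Dependencies each traffic class needs before a replica may accept it.
# These names are assumptions for illustration only.
REQUIRED_DEPENDENCIES = {
    "catalog_read": {"local_cache"},
    "progress_write": {"authoritative_store"},
    "certificate_issue": {"authoritative_store", "signing_service"},
    "recommendation_read": {"recommendation_service"},
}


def ready_classes(reachable: set) -> set:
    """Return the traffic classes this replica can safely accept right now."""
    return {
        traffic_class
        for traffic_class, needed in REQUIRED_DEPENDENCIES.items()
        if needed <= reachable  # every required dependency is reachable
    }


# The authoritative store is unreachable, but the cache is warm.
degraded = ready_classes({"local_cache"})
healthy = ready_classes(
    {"local_cache", "authoritative_store", "signing_service", "recommendation_service"}
)
```

With only the cache reachable, the replica stays ready for catalog reads and drops out of rotation for progress writes, which matches the distinction drawn above.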

Finally, define observability. A trace should show the caller deadline, gateway decision, retry attempt, selected upstream, idempotency key, response class, and fallback if one occurred. Metrics should show aggregate latency, retry volume, health-check failures, zone reachability, and per-operation error rates. Logs should preserve the few facts that explain individual ambiguity: request key, operation class, route, deduplication outcome, and why a retry or fallback was allowed.
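As a rough illustration of the log facts listed above, the following sketch emits one structured record per attempt. The field names and the `AttemptRecord` type are assumptions, not a schema from the platform; the point is only that both attempts carry the same request key.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class AttemptRecord:
    request_key: str              # stable across retries and async follow-up work
    operation_class: str
    attempt: int
    deadline_remaining_ms: int
    upstream: str
    response_class: str           # e.g. "ok", "timeout", "rejected"
    dedup_outcome: Optional[str] = None
    fallback: Optional[str] = None
    retry_reason: Optional[str] = None


def to_log_line(record: AttemptRecord) -> str:
    """Serialize one attempt as a single JSON log line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


attempts = [
    AttemptRecord("req-123", "progress_write", 1, 800, "progress-a", "timeout"),
    AttemptRecord("req-123", "progress_write", 2, 300, "progress-b", "ok",
                  dedup_outcome="first_write",
                  retry_reason="timeout_with_idempotency_key"),
]
lines = [to_log_line(a) for a in attempts]
```

Because the request key is shared, a later query can join the timed-out attempt with the successful one and see why the retry was allowed.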

The result can be summarized as a policy matrix:

| Operation | Freshness | Retry policy | Routing policy | Failure mode |
|---|---|---|---|---|
| Catalog read | Stale allowed | Short retry budget | Prefer local healthy cache | Serve stale or return read error |
| Progress write | Fresh authoritative state | Retry only with idempotency key | Route to replica with write dependency | Deduplicate or return ambiguous-write error |
| Certificate issue | Fresh authoritative state | No blind retry of issuance side effect | Require confirmed progress authority | Fail closed |
| Recommendations | Best effort | Tiny budget or no retry | Any healthy optional dependency | Omit from response |

This matrix is useful because it prevents accidental uniformity. The same network timeout means different things for different operations.
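The matrix can also be written down as data, so that a gateway or a test can look up what a timeout means for each operation. This is a sketch under assumptions: the `Policy` fields, the retry counts, and the outcome labels are illustrative values, not numbers from the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    stale_allowed: bool
    max_retries: int               # assumed values; the matrix only says "short" or "tiny"
    requires_idempotency_key: bool
    on_timeout: str                # behavior once the retry budget is spent


POLICIES = {
    "catalog_read": Policy(True, 1, False, "serve_stale"),
    "progress_write": Policy(False, 1, True, "ambiguous_write_error"),
    "certificate_issue": Policy(False, 0, False, "fail_closed"),
    "recommendation_read": Policy(True, 0, False, "omit"),
}


def timeout_outcome(operation: str) -> str:
    """Look up what a timeout means for this operation; unknown operations fail closed."""
    policy = POLICIES.get(operation)
    return policy.on_timeout if policy else "fail_closed"


catalog_result = timeout_outcome("catalog_read")
certificate_result = timeout_outcome("certificate_issue")
```

Defaulting unknown operations to `fail_closed` is a design choice made for this sketch: an operation nobody classified is treated as the most dangerous kind until someone says otherwise.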

Failure Review Table

| Failure shape | Safe behavior | Required signal |
|---|---|---|
| Recommendation timeout | Return page without recommendations | Gateway span records skipped optional dependency |
| Progress write timeout | Retry only with idempotency key and remaining deadline | Logs link attempts by stable request key |
| Two-node minority partition | Reject authoritative writes on minority side | Quorum or reachability signal blocks commit |
| Replica dependency failure | Remove from readiness for affected traffic | Readiness check reflects dependency class |
| Deploy scale-down | Drain before termination | Router stops new traffic and tracks in-flight work |
| Stale discovery entry | Suppress unsafe writes to unknown endpoint | Route decision records endpoint version and identity |

The table is intentionally concrete. It forces each failure response to name both behavior and evidence. A design that says "the platform retries" is not enough. A design that says "the gateway retries idempotent progress writes once, with jitter, only while 250 ms of deadline remains, and records both attempts under one request key" is testable.
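The testable version of the retry rule can be sketched as a small decision function. The 250 ms threshold and the single retry come from the example sentence above; the 50 ms jitter ceiling and the name `plan_retry` are assumptions added for illustration.

```python
import random

MIN_REMAINING_MS = 250  # threshold taken from the example policy above
MAX_RETRIES = 1         # "retries ... once"
MAX_JITTER_MS = 50      # assumed value for illustration


def plan_retry(retries_used, deadline_remaining_ms, idempotency_key, rng=random):
    """Return the jittered delay in ms before a retry, or None if no retry is allowed."""
    if not idempotency_key:
        return None  # never retry a write that the receiver cannot deduplicate
    if retries_used >= MAX_RETRIES:
        return None  # retry budget is spent
    delay_ms = rng.uniform(0, MAX_JITTER_MS)
    if deadline_remaining_ms - delay_ms < MIN_REMAINING_MS:
        return None  # too little deadline would remain after the jittered wait
    return delay_ms


allowed = plan_retry(retries_used=0, deadline_remaining_ms=1000, idempotency_key="req-123")
second = plan_retry(retries_used=1, deadline_remaining_ms=1000, idempotency_key="req-123")
late = plan_retry(retries_used=0, deadline_remaining_ms=200, idempotency_key="req-123")
```

Here `allowed` is a delay between 0 and 50 ms, while `second` and `late` are both `None`: one because the budget is spent, the other because too little deadline remains. Each refusal reason is a separate branch, so each can be logged and tested on its own.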

Review Memo Structure

A good memo should be short enough to review but specific enough to test. One useful structure is:

  1. Name the operation classes and their semantics.
  2. Define deadlines, retry budgets, and idempotency requirements for each class.
  3. Explain routing, discovery, identity, readiness, and draining rules.
  4. Describe behavior under zone partition, dependency failure, deploy, and stale discovery.
  5. Name the telemetry that proves each behavior happened.

For the launch-week platform, that memo should make a few uncomfortable choices explicit. A minority zone should not issue certificates if it cannot confirm authoritative progress. A progress write that timed out should be retried only through a deduplication path. A catalog page may be served from cache, but the response should be marked in internal telemetry as served from stale data. Recommendations can disappear without breaking the page, but that omission should count as degraded success.
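The fail-closed rule for certificates can be illustrated with a majority check. This is a simplified sketch that assumes a voter-count model of the progress authority; `has_quorum`, `issue_certificate`, and `CertificateRefused` are hypothetical names, and a real system would rely on its datastore's own quorum mechanism.

```python
class CertificateRefused(Exception):
    """Raised when issuance cannot be proven safe."""


def has_quorum(reachable_voters: int, total_voters: int) -> bool:
    # A strict majority is required; exactly half is not enough.
    return reachable_voters > total_voters // 2


def issue_certificate(learner_id, confirmed_complete, reachable_voters, total_voters):
    """Issue only when a majority side confirms progress; otherwise fail closed."""
    if not has_quorum(reachable_voters, total_voters):
        raise CertificateRefused("minority partition: cannot confirm authoritative progress")
    if confirmed_complete is not True:
        # None means "unknown", which is treated the same as "not complete".
        raise CertificateRefused("progress not confirmed")
    return {"learner_id": learner_id, "status": "issued"}


issued = issue_certificate("learner-1", True, reachable_voters=3, total_voters=5)
try:
    issue_certificate("learner-1", True, reachable_voters=2, total_voters=5)
    minority_result = "issued"
except CertificateRefused as err:
    minority_result = f"refused: {err}"
```

The two-node side of a five-node group is refused even though its local view says the learner finished, because that view cannot be confirmed as authoritative.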

Readiness Check

A strong design review should answer these questions:

If the design cannot answer these questions, the system may still run, but its failure behavior is accidental. The goal is not perfect prediction. The goal is to make the most important uncertainty visible before production forces the choice.

Common Design Mistakes

One mistake is applying one retry and timeout policy to every dependency. Optional reads, idempotent writes, and authoritative side effects need different policies.

Another mistake is letting the routing layer make decisions without application semantics. A proxy can enforce rules, but it needs method, route, idempotency, deadline, and health information to choose safely.

A third mistake is instrumenting only errors. Under partial failure, successful responses can still hide retries, stale reads, fallback behavior, or skipped dependencies. The design should make degraded success visible.
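One minimal way to make degraded success visible is to classify every response into three outcomes instead of two. The labels, the `record_outcome` helper, and the `lesson_page` operation name below are assumptions for illustration; a real system would emit these through its metrics library.

```python
from collections import Counter

# (operation, label) -> count; a stand-in for a real metrics counter
outcomes = Counter()


def record_outcome(operation: str, status: str, retries: int = 0,
                   stale: bool = False, skipped: tuple = ()) -> str:
    """Classify a response as ok, degraded_ok, or error and count it per operation."""
    if status != "ok":
        label = "error"
    elif retries > 0 or stale or skipped:
        label = "degraded_ok"  # succeeded, but only via retry, stale data, or omission
    else:
        label = "ok"
    outcomes[(operation, label)] += 1
    return label


plain = record_outcome("catalog_read", "ok")
stale_hit = record_outcome("catalog_read", "ok", stale=True)
no_recs = record_outcome("lesson_page", "ok", skipped=("recommendations",))
failed = record_outcome("progress_write", "timeout")
```

A dashboard built on these counters would show the stale catalog read and the page without recommendations as `degraded_ok`, which an error-only view would report as healthy.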
