Network Boundaries, Latency, and Partial Failure
Core Insight
Imagine changing your display name in an app. You type a new name, press Save, wait for a second, and the page says the update failed. A few moments later you refresh the profile page, and the new name is there.
The product seemed to give two answers: "it failed" and "it worked." That contradiction is not just a bad error message. It is a sign that a network boundary turned one user action into several local observations. The browser saw a slow response. The gateway saw its timer expire. The profile service may have written the update. The database may have committed version 52. None of those facts is exactly the same as the user's question: "Was my name saved?"
A network boundary is the place where one participant must send a message to another participant instead of using local memory and one call stack. Latency is how long the full message path takes: sending, waiting, queuing, processing, writing, and returning. A partial failure happens when some parts of the path are slow, overloaded, isolated, or broken while other parts keep working.
The non-obvious point is that latency and failure are connected. A request that is merely slow can become indistinguishable from a failed request from the caller's point of view. A timeout proves only that the caller stopped waiting. It does not prove the remote operation never happened. Good distributed design treats that uncertainty as a normal state that needs policy, evidence, and a safe next step.
What Actually Crosses The Boundary
Use the profile update as the recurring example. The visible user action is simple: save a display name. Internally, the action crosses three boundaries.
browser -> gateway: save display name "Nora"
gateway -> profile service: update user-17 to "Nora"
profile service -> database: write profile version 52
The happy path is easy to read:
browser -> gateway: save name
gateway -> profile: update user-17, op-413
profile -> database: write version 52
database -> profile: committed
profile -> gateway: success, version 52
gateway -> browser: saved
The important detail is that every arrow is a message and every participant has its own local view. The browser knows what it sent and what it received. The gateway knows its timer, outbound request, and inbound response. The profile service knows whether it attempted the database write. The database knows whether it committed version 52.
Now delay the response from the profile service:
gateway -> profile: update user-17, op-413
profile -> database: write version 52
database -> profile: committed
gateway timer expires at 800 ms
profile -> gateway: success arrives at 1100 ms
From the gateway's point of view, the operation crossed its waiting limit. From the profile service's point of view, the update completed. Both observations can be true. The boundary did not create one clean global event called "failed." It created local evidence that must be interpreted.
That is why a remote call is not just a slow function call. A local function call usually stops being relevant if the caller abandons the stack. A remote participant can keep working after the caller has timed out. The work may later succeed, fail, or send a response that nobody is still waiting to receive.
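The delayed-response timeline above can be simulated in a few lines. This is a minimal sketch, not a real gateway: `profile_update`, `gateway_save`, and the in-memory `store` are hypothetical stand-ins, and the 800 ms and 1100 ms timings are scaled down tenfold so the demo finishes quickly. `asyncio.shield` is used only to model the fact that a caller's timeout does not cancel work on the other side of a network.

```python
import asyncio

async def profile_update(store, user_id, name, delay_s):
    """Hypothetical profile service handler: slow, but it does commit."""
    await asyncio.sleep(delay_s)   # stands in for queueing plus the database write
    store[user_id] = name
    return 52                      # committed version

async def gateway_save(store, user_id, name, timeout_s, remote_delay_s):
    # This task models work running on the remote side of the boundary.
    remote = asyncio.create_task(profile_update(store, user_id, name, remote_delay_s))
    try:
        # shield() stops the caller's timeout from cancelling the remote work,
        # mirroring a real network where giving up does not stop the receiver.
        version = await asyncio.wait_for(asyncio.shield(remote), timeout_s)
        return "saved", version, remote
    except asyncio.TimeoutError:
        return "outcome_unknown", None, remote

async def demo():
    store = {}
    # 800 ms timeout and 1100 ms response, scaled down 10x.
    status, version, remote = await gateway_save(store, "user-17", "Nora", 0.08, 0.11)
    at_timeout = dict(store)       # what the owner held when the gateway gave up
    await remote                   # the late response nobody was waiting for
    return status, at_timeout, dict(store)

status, at_timeout, final = asyncio.run(demo())
print(status, at_timeout, final)
```

The gateway's honest answer at the timeout is `outcome_unknown`: the store is still empty at that moment, yet the write lands shortly afterwards.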
Latency Is A Distribution, Not One Number
Latency is often discussed as if a service has one speed: "the profile service takes 100 ms." Real systems have a spread of timings. Most requests may be fast, while a small percentage are much slower because of garbage collection, disk flushes, queueing, lock contention, packet loss, cold caches, overloaded workers, or a slow downstream dependency.
That spread matters because users and callers experience individual requests, not averages. A service with a 100 ms average can still have a 2 second tail. The tail is the slower end of the latency distribution, often described with percentiles such as p95 or p99. A p99 latency of X means 99 percent of requests complete in X or less, and 1 percent take longer.
For the profile save, a simplified latency distribution might look like this:
profile update latency
p50: 80 ms typical successful request
p95: 300 ms slow but acceptable request
p99: 1200 ms rare tail request
max: 5000 ms incident or severe overload
If the gateway timeout is 800 ms, a request at p99 (1200 ms) can look like a failure even though the profile service eventually writes the database row. If the timeout is 5000 ms, fewer successful writes will be mistaken for failures, but many gateway workers may sit blocked during overload. Those blocked workers hold memory, connections, and concurrency slots that other users need.
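To make the percentile arithmetic concrete, here is a small sketch that computes nearest-rank percentiles and the share of requests that would cross a given caller timeout. The sample list is made up to match the illustrative distribution above, and `percentile` and `timeout_rate` are hypothetical helper names, not part of any monitoring library.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def timeout_rate(samples, timeout_ms):
    """Fraction of requests the caller would abandon at this timeout."""
    return sum(1 for s in samples if s > timeout_ms) / len(samples)

# Made-up data: 100 requests shaped like the distribution in the text.
latencies_ms = [80] * 90 + [300] * 8 + [1200, 5000]

mean_ms = sum(latencies_ms) / len(latencies_ms)
print("mean:", mean_ms)
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
print("p99:", percentile(latencies_ms, 99))
print("abandoned at 800 ms:", timeout_rate(latencies_ms, 800))
print("abandoned at 5000 ms:", timeout_rate(latencies_ms, 5000))
```

With this data the mean looks comfortable while two requests in a hundred still cross the 800 ms timeout, which is the gap between averages and individual experience that the text describes.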
The design trade-off is therefore not "short timeout good" or "long timeout good." A timeout spends from a shared latency budget. The budget is the amount of time the whole user-visible operation can take before the product experience becomes unacceptable.
profile save budget: 2000 ms
browser and network: 200 ms
gateway validation: 100 ms
profile service attempt: 700 ms
read-after-timeout check: 250 ms
render final response: 150 ms
slack for jitter and queues: 600 ms
The exact numbers are product choices. The habit is the important part. Do not let every downstream dependency spend the full user-facing budget. If each hop gets the full two seconds, a chain of calls can exceed the promise before the system notices.
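The budget arithmetic can be checked mechanically. The sketch below assumes the illustrative allocations from the table above; `slack_ms` is a hypothetical helper that fails loudly when the planned stages no longer fit inside the user-facing budget.

```python
BUDGET_MS = 2000

# Planned spend per stage, taken from the example budget (slack is derived, not listed).
allocations_ms = {
    "browser and network": 200,
    "gateway validation": 100,
    "profile service attempt": 700,
    "read-after-timeout check": 250,
    "render final response": 150,
}

def slack_ms(budget_ms, allocations):
    """Return the unallocated time, or raise if the plan overspends the budget."""
    spent = sum(allocations.values())
    if spent > budget_ms:
        raise ValueError(f"plan overspends budget by {spent - budget_ms} ms")
    return budget_ms - spent

slack = slack_ms(BUDGET_MS, allocations_ms)

# Contrast: three hops that each assume they own the full two seconds.
worst_case_without_budget_ms = 3 * BUDGET_MS

print("slack for jitter and queues:", slack, "ms")
print("worst case if every hop takes the full budget:", worst_case_without_budget_ms, "ms")
```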
Timeouts, Deadlines, And What They Mean
A timeout is a local rule: "I will stop waiting after this much time." A timeout chooses the caller's next behavior. It does not decide the remote truth.
A deadline is an end time for the whole operation: "After this moment, the answer is no longer useful for this request." Deadlines are often better than independent timeouts because they carry the remaining budget through the call chain.
Without a deadline, each service may accidentally restart the waiting clock:
browser budget: 2000 ms
gateway waits up to 1800 ms for profile
profile waits up to 1800 ms for database
database stalls for 1200 ms
total user wait can exceed the original product budget
With a propagated deadline, the downstream service can see how much useful time remains:
browser deadline: 19:30:02.000
gateway receives request at 19:30:00.150
gateway sends deadline to profile
profile receives request at 19:30:00.220
profile sees about 1780 ms remain
profile avoids starting work that cannot finish in time
Deadlines do not remove uncertainty. They make the uncertainty bounded. A service can reject work early, return deadline_exceeded, or skip expensive optional work when the answer would arrive too late to help.
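A receiver-side deadline check can be sketched as follows, using the timestamps from the example above. The function names, the calendar date, and the work estimates are assumptions for illustration. A real system would also have to account for clock skew between machines, which is why some RPC frameworks transmit a remaining duration rather than an absolute time.

```python
from datetime import datetime, timedelta

def remaining_ms(deadline, now):
    """Whole milliseconds left before the deadline, never negative."""
    return max(0, (deadline - now) // timedelta(milliseconds=1))

def should_start(deadline, now, estimated_work_ms):
    """Start work only if it can plausibly finish while the answer is still useful."""
    return remaining_ms(deadline, now) >= estimated_work_ms

# Times from the example; the date itself is arbitrary.
deadline = datetime(2024, 1, 1, 19, 30, 2)
profile_received = datetime(2024, 1, 1, 19, 30, 0, 220_000)

left = remaining_ms(deadline, profile_received)
print("remaining:", left, "ms")

if should_start(deadline, profile_received, estimated_work_ms=700):
    print("start the update")
else:
    print("return deadline_exceeded without touching the database")
```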
The most important wording stays simple:
timeout means:
  the caller stopped waiting
timeout does not mean:
  the remote operation failed
  the remote operation succeeded
  the remote operation stopped running
That distinction prevents many unsafe follow-up decisions. The gateway may need to show save_pending, perform a read-after-timeout check, or schedule reconciliation. It should not claim "profile update failed" unless it has evidence from the owner of the profile state.
Partial Failure: Alive Enough To Confuse You
Total failure is often easier to reason about than partial failure. If the profile service is completely down, callers fail fast, alerts fire, and the product can enter a clear degraded mode. Partial failure is messier because the system is still alive enough to create ambiguous evidence.
The profile service might accept requests but respond slowly. The database might commit writes but return slowly. The gateway might reach one profile instance but not another. A load balancer might keep routing to an instance with a saturated connection pool. Health checks might pass while real user requests time out.
That last case is common. A service can answer /health quickly because the health endpoint does almost no work, while the real update path is stuck waiting for database connections:
health check:
  gateway -> profile /health
  profile -> gateway: 200 OK in 10 ms
real request:
  gateway -> profile update user-17
  profile waits for database connection
  gateway timeout fires at 800 ms
From an infrastructure dashboard, the profile service may look "up." From the user's point of view, saves are failing. From the database's point of view, some writes may still be committing. Partial failure means those views can disagree without any one observer lying.
This is why distributed systems need more than binary up/down thinking. They need signals that show latency, saturation, errors, and in-flight work. A service that is technically alive but has p99 latency above the caller timeout is not healthy for that workflow.
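One way to move past binary up/down thinking is to judge health per workflow, against the caller's own timeout. The sketch below is an assumption-laden illustration: `BoundarySignals`, `workflow_health`, and the returned labels are hypothetical, and a real system would likely use more signals than these.

```python
from dataclasses import dataclass

@dataclass
class BoundarySignals:
    health_check_ok: bool    # result of the cheap /health probe
    p99_latency_ms: float    # measured on the real update path
    pool_in_use: int         # database connections currently checked out
    pool_size: int

def workflow_health(signals, caller_timeout_ms):
    """Classify the boundary for one workflow, not for the process as a whole."""
    if not signals.health_check_ok:
        return "down"
    if signals.p99_latency_ms > caller_timeout_ms:
        return "degraded: p99 exceeds caller timeout"
    if signals.pool_in_use >= signals.pool_size:
        return "degraded: connection pool saturated"
    return "healthy"

# The health probe passes, but the real path is slower than the gateway will wait.
stuck = BoundarySignals(health_check_ok=True, p99_latency_ms=1200, pool_in_use=20, pool_size=20)
print(workflow_health(stuck, caller_timeout_ms=800))
```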
Designing The Unknown State
The profile update needs a state for the moment after the gateway timeout and before the system knows the profile owner's truth.
A naive design has only two UI states:
saved
failed
That is too small for the evidence. The gateway may not know enough to choose either. A better design separates user communication from internal evidence:
internal state:
  op-413 outcome_unknown
  caller timed out at 800 ms
  profile owner not yet checked after timeout
user state:
  "Still checking your change"
  or "Saved locally; confirming with the server"
  or "We could not confirm yet. Refresh in a moment."
The exact wording depends on the product. The system design needs the internal state either way.
A safe policy might be:
if profile responds before deadline:
  return saved with committed version
if gateway timeout fires:
  record outcome_unknown for op-413
  ask the profile owner for user-17 version
  if version 52 is visible: return saved
  if owner says no write happened: allow retry
  if owner cannot answer: return pending and schedule repair
This is not a retry lesson yet; the next lesson handles idempotency and retry policy in depth. The point here is narrower: the boundary creates an unknown state, and the system must not erase that state by guessing.
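The post-timeout branch of that policy can be written as a small decision function. This is a sketch under stated assumptions: `OwnerAnswer`, `resolve_after_timeout`, the returned labels, and the list used as a repair queue are all hypothetical names chosen for illustration, and how the owner is actually queried is left out.

```python
from enum import Enum

class OwnerAnswer(Enum):
    VERSION_VISIBLE = "version_visible"   # owner shows the expected version
    NO_WRITE = "no_write"                 # owner confirms nothing was written
    UNREACHABLE = "unreachable"           # owner could not answer in time

def resolve_after_timeout(op_id, answer, repair_queue):
    """Map the owner's evidence to the gateway's next step for one operation."""
    if answer is OwnerAnswer.VERSION_VISIBLE:
        return "saved"
    if answer is OwnerAnswer.NO_WRITE:
        return "retry_allowed"
    repair_queue.append(op_id)            # reconcile later instead of guessing
    return "pending"

repairs = []
print(resolve_after_timeout("op-413", OwnerAnswer.VERSION_VISIBLE, repairs))
print(resolve_after_timeout("op-413", OwnerAnswer.UNREACHABLE, repairs), repairs)
```

Note that no branch returns "failed": the function can only report what the owner's evidence supports.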
Operational Signals For The Boundary
When a boundary is healthy, the caller's expectations and the receiver's behavior line up. When the boundary becomes unhealthy, the first clue is often not a spike in error rate. It is a change in the shape of the latency distribution.
Useful signals include:
- request latency percentiles for each boundary, especially p95 and p99
- timeout counts by caller and dependency
- in-flight request counts and queue depth
- connection pool saturation
- retry volume (retry policy itself is covered in the next lesson)
- disagreement between health checks and real endpoint success
- traces that show where the request spent time
For the profile save, a useful trace might show:
gateway validate request: 25 ms
gateway -> profile wait: 800 ms timeout
profile handler queue: 600 ms
profile database write: 90 ms
profile response after timeout: 190 ms
That trace explains the incident better than an error message alone. The gateway timeout was not random. The profile handler spent too long queued before it even reached the database. A team could respond by adjusting concurrency, protecting the database pool, changing routing, or returning overload earlier.
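Finding where the receiver spent its time can be automated. The sketch below uses the two profile-side spans from the trace above; the dictionary layout and `dominant_span` are assumptions for illustration and do not reflect any particular tracing format.

```python
# Receiver-side spans from the example trace, in milliseconds.
profile_spans_ms = {
    "profile handler queue": 600,
    "profile database write": 90,
}

def dominant_span(spans):
    """Return the longest span and its share of the total recorded time."""
    name = max(spans, key=spans.get)
    return name, spans[name] / sum(spans.values())

name, share = dominant_span(profile_spans_ms)
print(f"{name}: {share:.0%} of recorded profile time")
```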
Operationally, the goal is not to make every request fast forever. The goal is to see when a boundary stops meeting the promise and to know which side has the next useful evidence.
Failure Modes And Trade-Offs
The first failure mode is false certainty. The caller times out and reports "failed" even though the owner later commits the update. This confuses users and can trigger unsafe retries.
The second failure mode is unbounded waiting. The caller waits too long because it wants a definite answer. During overload, those waiting requests consume scarce resources and can make unrelated workflows fail.
The third failure mode is health-check optimism. A service reports healthy while the real path is saturated. The fix is to measure the boundary the product actually uses, not only a cheap endpoint.
The fourth failure mode is missing ownership after timeout. If the gateway does not know which service owns the profile state, it cannot ask the right question after uncertainty appears.
Every timeout policy buys a trade-off. Shorter waiting protects caller resources and responsiveness, but creates more unknown outcomes. Longer waiting can reduce ambiguity, but increases resource consumption and tail latency. A good design makes that trade-off explicit and gives the unknown state a repair path.
Practice Prompt
Pick one user action that crosses a network boundary: saving a profile, uploading a photo, joining a call, posting a comment, or reserving a seat. Close the lesson and write:
caller:
receiver:
owner of the official state:
user-facing latency budget:
caller timeout:
what the timeout proves:
what the timeout does not prove:
state shown while outcome is unknown:
signal that would reveal partial failure:
If "what the timeout proves" says anything stronger than "the caller stopped waiting," revise it. If the only signal is "service is up," add a latency or saturation signal for the real boundary.
Resources
- [ARTICLE] Timeouts, Retries, and Backoff with Jitter
  - Focus: How timeout choices interact with overload, tail latency, retries, and caller resource exhaustion.
- [ARTICLE] Notes on Distributed Systems for Young Bloods
  - Focus: Practical framing for slow networks, failure ambiguity, and defensive boundaries.
- [BOOK] Designing Data-Intensive Applications
  - Focus: Reliability, partial failure, and the limits of pretending remote calls behave like local calls.
Key Takeaways
- A network boundary turns a local-looking action into message exchange with incomplete evidence.
- Latency is a distribution; tail requests can cross caller timeouts even when averages look fine.
- A timeout proves the caller stopped waiting, not what the receiver did.
- Partial failure is hard because the system is alive enough for different observers to hold different true facts.
- Good boundary design uses latency budgets, deadlines, owner checks, unknown states, and operational signals.