Observability Across Network Boundaries
LESSON
Core Insight
Imagine a learner reports that completing a lesson sometimes takes ten seconds. The frontend logs show a slow request. The API gateway shows one retry. The progress service shows a normal write. The database shows a brief queue spike. The service-discovery log shows that the first attempt went to a draining replica. No single component believes it failed.
This is the observability problem created by network boundaries. Once a request crosses processes, proxies, regions, and queues, each participant sees only part of the path. A timeout may appear at the caller, a successful commit may appear at the server, and a retry may hide the first symptom unless the signals are connected.
Good observability is not just more logs. It is a way to preserve request context across boundaries so engineers can reconstruct what happened, which layer made a decision, and where uncertainty entered the path. Metrics, logs, traces, events, and packet-level tools answer different questions, and the useful incident story usually combines several of them.
The useful design shift is to instrument the path, not just the components. The trade-off is detail versus cost: rich telemetry makes incidents explainable, but high-cardinality labels, excessive spans, and noisy logs can make the system expensive and harder to search. Observability has to be designed with the same discipline as retries or routing.
Follow One Request Across Layers
An investigation is only tractable if the request carried a durable identity from the start. A request ID, trace ID, or correlation ID lets separate systems attach their local observations to one path. That identity needs to survive proxies, retries, async jobs, queue messages, and service-to-service calls.
browser
-> gateway
-> progress service
-> storage
-> notification service
For each hop, the system should record a few basic facts: when the work started, how long it took, what decision was made, which status was returned, whether a retry happened, which endpoint was selected, which version answered, and how much deadline remained. Those facts are more useful than a thousand unconnected log lines.
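One way to keep a durable identity intact across hops is to forward it in a request header. The sketch below follows the W3C Trace Context `traceparent` layout (version, 32-hex trace ID, 16-hex parent span ID, flags). The helper names `new_traceparent` and `child_traceparent` are hypothetical and exist only for illustration; a production system would normally rely on a tracing library rather than hand-rolled parsing.

```python
import re
import secrets
from typing import Optional

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent() -> str:
    """Start a new trace at the edge (hypothetical helper)."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(incoming: Optional[str]) -> str:
    """Keep the caller's trace ID but mint a new span ID for this hop.

    Falls back to a fresh trace if the header is missing or malformed.
    """
    match = TRACEPARENT_RE.match(incoming or "")
    if match is None:
        return new_traceparent()
    version, trace_id, _parent_span, flags = match.groups()
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Each hop forwards a child header, so all hops share one trace ID.
at_gateway = new_traceparent()
at_progress = child_traceparent(at_gateway)
at_storage = child_traceparent(at_progress)
```

The trace ID stays constant from gateway to storage while the span ID changes per hop, which is what lets each layer's record be attached to the same path later.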
The same request can have different truths at different layers:
- the browser saw a timeout
- the gateway retried once
- discovery sent the first attempt to a draining replica
- the first progress-service attempt still committed the write
- the second attempt was deduplicated
- the notification service was skipped because no deadline remained
Without shared context, those facts look contradictory. With shared context, they describe one coherent failure path.
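As a minimal sketch of that reconciliation, the example below groups records from different layers by a shared trace ID and orders them by time. The records, field names, and timestamps are invented for illustration, and the ordering assumes the emitting hosts have roughly synchronized clocks.

```python
from collections import defaultdict

# Hypothetical records, emitted independently by each layer and stored unordered.
records = [
    {"trace_id": "t1", "ts": 10.0, "source": "browser", "event": "timeout"},
    {"trace_id": "t1", "ts": 0.1, "source": "discovery", "event": "attempt 1 sent to draining replica"},
    {"trace_id": "t2", "ts": 0.3, "source": "gateway", "event": "200 OK"},
    {"trace_id": "t1", "ts": 5.0, "source": "gateway", "event": "retry 1"},
    {"trace_id": "t1", "ts": 5.1, "source": "progress", "event": "attempt 1 committed write"},
    {"trace_id": "t1", "ts": 5.3, "source": "progress", "event": "attempt 2 deduplicated"},
    {"trace_id": "t1", "ts": 9.9, "source": "gateway", "event": "notification skipped, no deadline left"},
]


def group_by_trace(records):
    """Return {trace_id: [records sorted by timestamp]}."""
    paths = defaultdict(list)
    for record in records:
        paths[record["trace_id"]].append(record)
    for path in paths.values():
        path.sort(key=lambda r: r["ts"])
    return dict(paths)


paths = group_by_trace(records)
for record in paths["t1"]:
    print(f'{record["ts"]:>5.1f}s  {record["source"]:<10} {record["event"]}')
```

Read in order, the `t1` records show the late commit from the first attempt landing just after the gateway retried, which is why the second attempt was deduplicated while the browser still timed out.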
Metrics, Logs, Traces, And Packets Answer Different Questions
Metrics are good for aggregates: error rate, latency percentiles, retry volume, queue depth, connection resets, and saturation. They answer "is this getting worse?" and "how many users are affected?"
Logs are good for discrete facts: this request used idempotency key abc, this dependency returned 503, this route selected region B, this retry was suppressed because the deadline expired.
Traces are good for shape: which hops participated, where time was spent, which span retried, which route or endpoint was selected, and which service turned a downstream failure into a user-visible response.
Packet captures and low-level network tools are good when the question is below the application: retransmissions, TLS handshake problems, DNS resolution behavior, connection resets, or unexpected routing.
metric: p95 progress completion latency rose to 2.4s
log: request abc retried after 504 from zone C
trace: 1.9s spent waiting on storage from first attempt through endpoint P3
packet view: TCP retransmissions increased on one path
The trade-off is abstraction versus evidence. Application telemetry explains semantics. Network telemetry explains movement. Real incidents often need both, but they should not be mixed into one vague bucket called "network problem."
Tail Latency Hides In Fan-Out
Many user requests call several dependencies. Even if each dependency is usually fast, the total path is exposed to the slowest required hop. A page that needs catalog, progress, recommendations, and certificates has more chances to hit a slow tail than a page that calls only one service.
page request
-> catalog: 80 ms
-> progress: 120 ms
-> recommendations: 900 ms
-> certificates: skipped after deadline
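The fan-out effect can be made concrete with a little arithmetic. The sketch below assumes each dependency independently lands in its slow tail with the same probability, which is a simplification (real dependencies often share causes of slowness). The function name `p_any_slow` and the 1% figure are illustrative.

```python
def p_any_slow(p_slow_per_dependency: float, dependencies: int) -> float:
    """Probability that at least one of N independent calls is slow."""
    return 1.0 - (1.0 - p_slow_per_dependency) ** dependencies


# With a 1% slow tail per dependency:
for n in (1, 4, 10):
    print(n, round(p_any_slow(0.01, n), 4))
# prints:
# 1 0.01
# 4 0.0394
# 10 0.0956
```

Under these assumptions, a page with four required dependencies is slow for roughly 4% of requests even though each dependency is slow only 1% of the time.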
Observability should therefore capture dependency budgets, not only final response time. If the gateway has 800 ms for the whole page, a recommendation call that consumes 700 ms may be more important than a storage query that is merely above its local average.
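A rough sketch of budget accounting follows. It assumes the dependencies are called one after another and uses simulated durations in place of real timed calls; the class `DeadlineBudget` and its fields are hypothetical names, not part of any framework.

```python
from dataclasses import dataclass, field


@dataclass
class DeadlineBudget:
    """Tracks how much of a request-wide deadline each dependency consumed."""
    total_ms: float
    spent_ms: dict = field(default_factory=dict)
    skipped: list = field(default_factory=list)

    def remaining_ms(self) -> float:
        return self.total_ms - sum(self.spent_ms.values())

    def simulate(self, dependency: str, cost_ms: float) -> bool:
        """Record the call if it fits in the remaining budget, else mark it skipped."""
        if cost_ms > self.remaining_ms():
            self.skipped.append(dependency)
            return False
        self.spent_ms[dependency] = cost_ms
        return True

    def report(self) -> dict:
        return {
            "shares": {name: ms / self.total_ms for name, ms in self.spent_ms.items()},
            "skipped": list(self.skipped),
            "remaining_ms": self.remaining_ms(),
        }


budget = DeadlineBudget(total_ms=800)
budget.simulate("catalog", 80)
budget.simulate("recommendations", 700)
budget.simulate("certificates", 50)  # only 20 ms remain, so this is skipped
report = budget.report()
print(report)
```

The report attributes 87.5% of the page budget to recommendations and records that certificates were skipped, which is the information a final response time alone would hide.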
This is where earlier lessons connect. Timeouts and retries are policy decisions. Health checks and load balancers are routing decisions. Discovery and identity are control-plane decisions. Observability needs to show those decisions, not just their symptoms.
Degraded Success Is Still A Signal
Some of the most important network failures do not return an error to the user. The page loads, but recommendations are omitted. The progress write succeeds, but only after a retry. The catalog response is served from stale cache because the origin timed out. The certificate job is deferred because the system could not confirm authoritative progress state.
If observability records only final failures, all of those paths disappear. The system looks healthy while quietly spending retry budget, hiding optional dependencies, or serving older data than expected.
final status: 200 OK
degraded facts:
  recommendations skipped
  progress write retried once
  catalog served from stale cache
  certificate issuance deferred
The user-facing outcome may be acceptable, but operators still need to see the degradation. Otherwise the first visible outage arrives after the system has been running hot for hours.
The trade-off is signal versus noise. Not every fallback deserves a page. But degraded success should be measurable, searchable, and visible enough that teams can tell the difference between healthy success and success bought by burning resilience mechanisms.
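One possible way to make degraded success measurable is to attach degradation facts to the outcome and count them under a small, fixed set of reasons. In the sketch below, `Outcome`, `DEGRADATION_REASONS`, and the `Counter` standing in for a metrics client are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field

# Bounded set of reasons, so the value is safe to use as a metric label.
DEGRADATION_REASONS = {"dependency_skipped", "retried", "stale_cache", "deferred"}

# Stand-in for a real metrics counter keyed by reason.
degraded_total = Counter()


@dataclass
class Outcome:
    status: int
    degradations: list = field(default_factory=list)

    def degrade(self, reason: str, detail: str) -> None:
        """Record a degradation: detail goes to logs/traces, reason to the metric."""
        if reason not in DEGRADATION_REASONS:
            raise ValueError(f"unknown degradation reason: {reason}")
        self.degradations.append({"reason": reason, "detail": detail})
        degraded_total[reason] += 1

    @property
    def healthy(self) -> bool:
        """True only for success that did not spend any resilience mechanism."""
        return self.status < 400 and not self.degradations


outcome = Outcome(status=200)
outcome.degrade("dependency_skipped", "recommendations skipped")
outcome.degrade("retried", "progress write retried once")
outcome.degrade("stale_cache", "catalog served from stale cache")
outcome.degrade("deferred", "certificate issuance deferred")
print(outcome.status, outcome.healthy, dict(degraded_total))
```

The response is still a 200, but `healthy` is false and each reason has its own counter, so a dashboard or alert can track the rate of degraded success without paging on every fallback.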
Common Design Mistakes
One mistake is collecting telemetry only at service boundaries and not at policy boundaries. A proxy retry, gateway fallback, circuit-breaker open, or load-balancer choice can be the most important event in the path.
Another mistake is creating high-cardinality labels without discipline. Labeling metrics by user ID, raw URL, or unbounded request key can make metrics systems expensive or unusable. Put unique values in traces or logs; keep metric dimensions bounded.
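A minimal sketch of keeping metric dimensions bounded: raw paths are mapped to a fixed set of route templates and status codes to status classes before they become labels. The route patterns below are hypothetical; a real service would usually take the template from its router.

```python
import re

# Hypothetical route templates for illustration.
ROUTES = [
    (re.compile(r"^/courses/[^/]+/lessons/[^/]+/complete$"),
     "/courses/{course}/lessons/{lesson}/complete"),
    (re.compile(r"^/users/[^/]+/progress$"), "/users/{user}/progress"),
]


def route_label(raw_url: str) -> str:
    """Map a raw URL to a bounded route template, or 'other' if unknown."""
    path = raw_url.split("?", 1)[0]  # drop the query string
    for pattern, template in ROUTES:
        if pattern.match(path):
            return template
    return "other"


def status_class(status: int) -> str:
    """Collapse status codes into classes such as '2xx' or '5xx'."""
    return f"{status // 100}xx"


def metric_labels(raw_url: str, status: int) -> dict:
    return {"route": route_label(raw_url), "status": status_class(status)}


print(metric_labels("/users/8812/progress?since=yesterday", 504))
```

The user ID and query string never reach the metric; they belong in the log line or trace span for the same request, where unique values are expected.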
A third mistake is treating traces as proof that the network was innocent. A trace can show where application spans waited, but lower-level failures such as DNS delay, TLS negotiation, or packet loss may need separate evidence.
A fourth mistake is losing context at async boundaries. If the progress write emits a completion event and the notification worker starts a new trace with no link to the original request, the incident path breaks exactly where delayed work begins.
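A sketch of preserving context across a queue follows: the producer places its `traceparent` value in the message headers, and the worker starts its own span with a link back to the producing span. Whether the worker continues the original trace or starts a new one with a link is a design choice; this example uses a link. The message envelope, function names, and span dictionary are assumptions for illustration, and the sample header value follows the W3C Trace Context layout.

```python
import json
import secrets


def publish_completion(event: dict, traceparent: str) -> str:
    """Serialize a queue message that carries the producer's trace context."""
    return json.dumps({"headers": {"traceparent": traceparent}, "body": event})


def start_worker_span(message: str) -> dict:
    """Start the worker's span, linking to the producer's span when present."""
    envelope = json.loads(message)
    span = {
        "trace_id": secrets.token_hex(16),  # the worker starts its own trace
        "span_id": secrets.token_hex(8),
        "name": "send_notification",
        "links": [],
    }
    parent = envelope["headers"].get("traceparent")
    if parent:
        _version, trace_id, parent_span_id, _flags = parent.split("-")
        span["links"].append({"trace_id": trace_id, "span_id": parent_span_id})
    return span


# Sample header value in traceparent format.
producer_context = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
message = publish_completion({"lesson": "L1", "status": "completed"}, producer_context)
span = start_worker_span(message)
print(span["links"])
```

With the link in place, an engineer looking at the delayed notification can navigate back to the request that produced it, even if the worker ran minutes later.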
Resources
- [DOC] OpenTelemetry Concepts
  - Link: https://opentelemetry.io/docs/concepts/
  - Focus: Compare traces, metrics, logs, context propagation, and semantic conventions.
- [BOOK] Site Reliability Engineering
  - Link: https://sre.google/sre-book/monitoring-distributed-systems/
  - Focus: Use monitoring as a way to detect symptoms and preserve service promises.
- [DOC] Envoy Observability
  - Link: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/observability
  - Focus: Study what a network proxy can reveal about retries, routing, and upstream behavior.
Key Takeaways
- Networked incidents need path observability, not only per-service dashboards.
- Metrics, logs, traces, and packet-level evidence answer different questions.
- Trace context helps reconcile caller timeouts, server commits, retries, and deduplication into one story.
- Observability should show policy decisions such as routing, discovery, retry, fallback, and deadline exhaustion.
- Degraded success is still part of the failure signal when the system hides optional work, stale data, or retries from the user.