Observability for HTTP: Logs, Traces, and Wire Symptoms

Lesson 023, 25 min, intermediate

The core idea: HTTP observability is boundary evidence. A useful debug trail connects what the client saw, what the edge decided, what the origin received, and which timing or protocol symptom explains the gap.

Core Insight

Imagine checkout starts returning intermittent 502 Bad Gateway responses. The application team says the checkout service is healthy. The CDN dashboard shows a small spike in edge errors. The database dashboard looks normal. A few users report that refreshing works. If each team reads only its own logs, the incident becomes a debate about whose system is "really" failing.

The useful move is to treat the request path as a chain of witnesses. The browser saw a status code and latency. The CDN or edge proxy saw cache status, TLS and HTTP protocol, rule id, upstream selection, and a timeout or upstream response. The reverse proxy saw connection reuse, upstream timing, and response flags. The application saw a route, a trace id, handler timing, dependencies, and maybe nothing at all if the edge failed before forwarding the request.

HTTP observability is not just "collect logs." It is deciding which facts must survive each boundary so one request can be reconstructed later. A status code alone is ambiguous. A 502 might mean the origin returned an invalid response, the proxy timed out, the connection was reset, TLS to the origin failed, or a gateway rule rejected the upstream. The fix depends on which boundary produced the error.

The trade-off is diagnostic richness versus data volume and privacy. Rich logs, trace attributes, headers, and samples make incidents faster to explain. They also cost money, create noise, and can leak sensitive information if they capture full URLs, cookies, authorization headers, or user identifiers carelessly. Good HTTP observability records enough boundary evidence to answer production questions without turning every request into a privacy risk.

The Evidence Each Boundary Owns

Every HTTP hop can observe different facts. A client can observe the final URL, visible redirects, response status, browser timing, and whether a request was blocked by CORS or mixed-content rules. It usually cannot see which origin pool the edge chose or whether a CDN served a stale object.

The edge can observe facts near the public boundary:

request id
client IP or trusted client identity
host, path, method
TLS version and ALPN result
HTTP protocol used by the client
edge rule id
cache status
redirect target or rewrite target
selected origin
upstream status
edge duration and upstream duration
response bytes

The application can observe different facts:

route and handler
authenticated subject or tenant
request id and traceparent
validated method and content type
business operation
dependency spans
application status or exception

Wire symptoms sit between these layers. They are not always visible in application logs: connection reset, TLS handshake error, HTTP/2 stream reset, HTTP/3 fallback, upstream timeout, header too large, body too large, malformed response, or proxy buffer overflow. These symptoms explain why a request may fail before application code runs.

The bridge is correlation. A request needs a stable identifier that crosses hops. It may be a generated X-Request-ID, a W3C traceparent, or both:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
X-Request-ID: req_7c9d

The identifier is not useful unless every layer records it and forwards it safely. The edge should create one if the client did not provide a trusted value. Internal services should propagate it. Logs and traces should use the same value or a direct mapping.
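
The create-or-propagate rule can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper name ensure_correlation, the trust_client flag, the req_ prefix format, and the assumption that header names arrive in exactly the casing shown are all hypothetical. The traceparent layout (version, 32-hex trace id, 16-hex parent id, 2-hex flags) follows W3C Trace Context version 00.

```python
import re
import secrets

# version-traceid-parentid-flags in lowercase hex (W3C Trace Context, version 00)
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(value):
    """Return (trace_id, parent_id, flags), or None if the header is unusable."""
    m = _TRACEPARENT.fullmatch(value or "")
    if not m:
        return None
    trace_id, parent_id, flags = m.groups()
    # All-zero ids are invalid per the spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, flags


def ensure_correlation(headers, trust_client=False):
    """Hypothetical edge helper: return the headers to forward upstream."""
    out = dict(headers)
    parsed = parse_traceparent(headers.get("traceparent")) if trust_client else None
    if parsed is None:
        # No trusted value: start a new trace at the edge.
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, _, flags = parsed
    span_id = secrets.token_hex(8)  # this hop becomes the parent of the next
    out["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    if not (trust_client and headers.get("X-Request-ID")):
        out["X-Request-ID"] = "req_" + secrets.token_hex(4)
    return out
```

Whatever the edge forwards is also what it should write to its own log line, so the edge record and the application trace share one key.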

A Trace Is Not a Replacement for Logs

A distributed trace shows the causal path of one request through instrumented services. It is good at answering "where did this request spend time?" and "which dependency span failed?" But traces usually begin at the first instrumented point, which is often the edge or the application. If a TLS handshake fails, an ALPN negotiation falls back, a CDN cache serves stale content, or an edge rule redirects before the request reaches the origin, the application trace may not exist.

Access logs are better at boundary accounting. They are compact records for every request or a sampled set of requests. They show status codes, timings, hosts, paths, upstreams, cache status, and policy decisions. They are often the first place to see a pattern:

edge_status=502
upstream_status=-
edge_error=upstream_timeout
origin=checkout-primary
rule_id=checkout-origin-v2
duration_ms=30000
trace_id=4bf92f...
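
A record like the one above is easy to turn into structured data. The sketch below assumes the fields arrive as one space-separated key=value line with no spaces inside values; the function names and the three producer labels are illustrative heuristics, not a standard.

```python
def parse_access_line(line):
    """Split a space-separated key=value access log line into a dict."""
    record = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value
            record[key] = value
    return record


def status_producer(record):
    """Guess which boundary produced the client-visible status (heuristic)."""
    upstream = record.get("upstream_status", "-")
    if upstream == "-":
        return "edge"          # edge answered without any upstream HTTP response
    if upstream == record.get("edge_status"):
        return "upstream"      # edge relayed what the upstream returned
    return "edge-rewrote"      # edge replaced the upstream status


line = ("edge_status=502 upstream_status=- edge_error=upstream_timeout "
        "origin=checkout-primary rule_id=checkout-origin-v2 duration_ms=30000")
record = parse_access_line(line)
print(status_producer(record), record["edge_error"])
```

Logging the producer explicitly is still better than inferring it; the heuristic only helps when the edge does not record who generated the status.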

Metrics are better at shape. They show rate, error ratio, latency percentiles, timeout count, cache hit ratio, redirect count, connection resets, and saturation. They tell you that a class of requests is sick. They rarely tell you why one request failed.

Packet captures and wire-level tools are last-mile evidence. They can show connection resets, TLS negotiation, retransmits, HTTP/2 frames, or malformed responses. They are powerful, expensive to interpret, and often constrained by encryption and privacy. Use them when higher-level evidence cannot explain the symptom.

The practical model is:

metrics show the shape
logs name the boundary and decision
traces show instrumented causal work
wire evidence explains transport/protocol symptoms

The mistake is expecting one signal to do all four jobs.

Worked Path: The Intermittent 502

A user reports:

URL: https://www.shop.test/checkout
time: 10:15:04
visible status: 502
request id shown in error page: req_7c9d

Start at the edge log:

ts=10:15:04.120
request_id=req_7c9d
trace_id=4bf92f3577b34da6a3ce929d0e0e4736
host=www.shop.test
path=/checkout
method=GET
client_proto=h2
cache_status=BYPASS
rule_id=checkout-rewrite-v2
origin=checkout-primary
edge_status=502
upstream_status=-
edge_error=upstream_timeout
edge_ms=30014
upstream_connect_ms=4
upstream_first_byte_ms=30000

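This already tells a story. The request reached the edge over HTTP/2. Cache was bypassed. A rewrite rule selected checkout-primary. The edge did not record an upstream HTTP status. It waited about 30 seconds for the first byte and then generated the 502 itself. (RFC 9110 defines 504 Gateway Timeout for this case, but products differ in which code they emit on an upstream timeout, so the edge_error field carries more information than the number.)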

Now check the reverse proxy or origin load balancer:

request_id=req_7c9d
received_at=10:15:04.126
upstream=checkout-app-17
proxy_status=504
app_status=-
upstream_timeout_ms=30000
bytes_from_app=0

The proxy agrees: it forwarded the request to an app instance and timed out before the app returned any bytes. Now look for an application trace with the same trace id. If none exists, the request may not have reached the app handler. If a trace exists:

trace_id=4bf92f...
span=GET /checkout
duration_ms=29880
child=db.inventory.reserve duration_ms=29620 status=timeout
handler_status=error

Now the failing boundary has moved inward. The app did receive the request. It timed out waiting for the inventory reservation. The edge's 502 was a gateway symptom of an application dependency timeout. The user-visible status did not name the root cause; the correlated evidence did.
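
The three lookups in this walk-through amount to a join on request id and trace id. A minimal sketch, assuming each layer's records are already parsed into dicts with the field names used above; the function name reconstruct and the deepest_boundary_reached label are hypothetical.

```python
def reconstruct(request_id, edge_logs, proxy_logs, app_spans):
    """Join per-layer records (lists of dicts) into one request story."""
    edge = next((r for r in edge_logs if r["request_id"] == request_id), None)
    proxy = next((r for r in proxy_logs if r["request_id"] == request_id), None)
    # The edge record maps the request id to the trace id used by the app.
    trace_id = edge["trace_id"] if edge else None
    span = next((s for s in app_spans if s["trace_id"] == trace_id), None)

    if edge is None:
        deepest = "client"   # no evidence past the client
    elif proxy is None:
        deepest = "edge"
    elif span is None:
        deepest = "proxy"    # forwarded, but no sign that handler code ran
    else:
        deepest = "app"
    return {"deepest_boundary_reached": deepest,
            "edge": edge, "proxy": proxy, "span": span}


edge_logs = [{"request_id": "req_7c9d",
              "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
              "edge_error": "upstream_timeout"}]
proxy_logs = [{"request_id": "req_7c9d", "proxy_status": 504}]
app_spans = [{"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
              "child": "db.inventory.reserve", "status": "timeout"}]
print(reconstruct("req_7c9d", edge_logs, proxy_logs, app_spans)["deepest_boundary_reached"])
```

An absent record is weak evidence on its own, since sampling or log loss can also explain it, so treat "proxy" or "edge" as the deepest boundary with evidence rather than proof that nothing ran further in.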

The investigation can now ask a precise question:

Are 502s concentrated on checkout-rewrite-v2 + checkout-primary
with app spans waiting on db.inventory.reserve?

If yes, the incident is not a global CDN problem. It is a checkout dependency or timeout budget problem that surfaces at the HTTP edge.
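
Answering the concentration question is a group-by over edge records. A small sketch, assuming records are dicts with string-valued edge_status, rule_id, and origin fields; the function name concentration is illustrative.

```python
from collections import Counter


def concentration(edge_records, status="502"):
    """Count records with the given edge status by (rule_id, origin).

    Returns a list of ((rule_id, origin), count, share) sorted by count.
    """
    counts = Counter(
        (r.get("rule_id"), r.get("origin"))
        for r in edge_records
        if r.get("edge_status") == status
    )
    total = sum(counts.values())
    if not total:
        return []
    return [(key, n, n / total) for key, n in counts.most_common()]
```

If one (rule, origin) pair holds most of the share, the next step is to check whether the matching application spans are waiting on the same dependency.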

Timing Budgets Explain Many "Random" Failures

HTTP paths often have nested timeouts:

browser waits:              60s
CDN edge waits for origin:  30s
origin proxy waits for app: 28s
app waits for dependency:   25s
database statement timeout: 20s

These numbers should be intentional. If the app waits 40 seconds for a dependency but the edge times out at 30 seconds, the app may finish work after the client has already received a gateway error. If a proxy retries a request after an ambiguous timeout, the app may receive duplicate work. If the database timeout is longer than the HTTP timeout, users see an HTTP failure while backend work continues.
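
One way to keep the numbers intentional is to check the nesting mechanically. The sketch below assumes budgets are listed from the outermost hop to the innermost and flags any inner hop allowed to wait as long as, or longer than, the hop directly outside it; the function name and hop labels are illustrative.

```python
def budget_violations(budgets):
    """budgets: list of (name, timeout_seconds), outermost hop first.

    Returns (outer, inner) pairs where the inner hop may wait at least
    as long as the hop that is waiting on it.
    """
    problems = []
    for (outer, outer_s), (inner, inner_s) in zip(budgets, budgets[1:]):
        if inner_s >= outer_s:
            problems.append((outer, inner))
    return problems


budgets = [
    ("browser", 60),
    ("cdn edge", 30),
    ("origin proxy", 28),
    ("app", 25),
    ("database", 20),
]
print(budget_violations(budgets))
```

The check only compares adjacent hops, and it ignores retries: a hop that retries twice with a 25-second timeout can still outlast a 30-second parent.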

Good observability records timeout budgets and actual timing by boundary:

client_total_ms
edge_total_ms
edge_to_origin_connect_ms
origin_proxy_ms
app_handler_ms
dependency_ms
timeout_policy
retry_count

This is where logs and traces meet. A trace span may show the database call. The edge log shows when the client-facing system gave up. The operational question is not just "which span is slow?" It is "which boundary timed out first, and what work continued afterward?"

Privacy and Sampling Are Part of the Design

It is tempting to log everything during an incident. Full URLs can contain tokens, emails, cart ids, coupon codes, OAuth state, or search terms. Headers can contain cookies and authorization credentials. Request and response bodies can contain personal data. Observability that leaks secrets creates a second incident.

Design log fields deliberately:

record: method, route template, status, timing, cache status, rule id, origin, trace id
avoid: raw Authorization, Cookie, Set-Cookie, full bodies, unredacted query strings
careful: user id, IP address, email, tenant id, payment/order identifiers

Prefer route templates over raw paths when possible:

route=/orders/{order_id}
not path=/orders/A123?token=...
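
A sketch of both rules, route templating and header redaction, under stated assumptions: the ROUTES table, the "unmatched" fallback label, and the "[REDACTED]" placeholder are hypothetical, and a real service would usually take the template from its router rather than re-matching paths.

```python
import re
from urllib.parse import urlsplit

SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie"}

# Hypothetical route table: (compiled path pattern, template to log)
ROUTES = [
    (re.compile(r"/orders/[^/]+"), "/orders/{order_id}"),
]


def route_template(raw_url):
    """Map a raw URL to a route template; never return the raw path."""
    path = urlsplit(raw_url).path  # drops the query string and fragment
    for pattern, template in ROUTES:
        if pattern.fullmatch(path):
            return template
    return "unmatched"


def loggable_headers(headers):
    """Return a copy of headers with credential-bearing values masked."""
    return {
        name: ("[REDACTED]" if name.lower() in SENSITIVE_HEADERS else value)
        for name, value in headers.items()
    }


print(route_template("/orders/A123?token=abc"))
print(loggable_headers({"Authorization": "Bearer s3cret", "Accept": "text/html"}))
```

Returning a fixed label for unknown paths is deliberate: falling back to the raw path would reintroduce the leak for exactly the routes nobody anticipated.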

Sampling also needs intent. Sampling traces randomly at 1% may miss rare but important errors. A better policy often keeps all error traces, all slow traces above a threshold, and a small sample of healthy traffic. Logs can be sampled differently from metrics. Security-relevant events may need retention even when routine access logs are sampled.
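
The policy described here can be sketched as a decision made after a request completes, which is usually called tail-based sampling. The thresholds, the choice to treat only 5xx as errors, and the function name keep_trace are assumptions for illustration.

```python
import random


def keep_trace(status, duration_ms, slow_ms=2000, healthy_rate=0.01,
               rng=random.random):
    """Decide whether to retain a finished trace.

    Keeps every server error, every slow request, and a small random
    sample of healthy traffic. rng is injectable so tests are deterministic.
    """
    if status >= 500:
        return True
    if duration_ms >= slow_ms:
        return True
    return rng() < healthy_rate


print(keep_trace(502, 30014))  # errors are always kept
print(keep_trace(200, 5000))   # slow requests are always kept
```

Because the decision needs the final status and duration, it has to run where the whole trace is visible, for example in a collector, rather than at the first hop.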

The trade-off is not "more data is always better." The goal is enough evidence to reconstruct boundary behavior, with privacy, cost, and retention controlled.

Operational Failure Modes

Failure: application-only logs. If logs start inside the app, failures at TLS, CDN, redirects, cache, proxy, or origin connection setup are invisible. Keep edge and proxy evidence.

Failure: uncorrelated request ids. Edge logs, app logs, and traces with different identifiers force humans to correlate by timestamp. Generate and propagate a stable request id or trace context.

Failure: status code without producer. A 502 from an edge, a proxy, or an app wrapper means different things. Log who generated the status and what upstream status, if any, was observed.

Failure: no timeout budget visibility. Without timing at each hop, a "slow request" has no boundary. Record connect time, first-byte time, proxy time, handler time, dependency time, and retry count.

Failure: observability leaks secrets. Query strings, cookies, auth headers, and bodies need redaction or exclusion. Privacy controls belong in the logging design, not as a cleanup task later.

Useful signals include status by producer, edge rule id, cache status, selected origin, upstream status, timeout reason, trace id, route template, client protocol, ALPN result, HTTP version, redirect chain length, retry count, response bytes, request body size, header size, and latency percentiles by boundary.

Debugging Checklist

Close the lesson and reconstruct one failing request from memory:

What did the client see?
Which request id or trace id ties the evidence together?
Which edge rule fired?
Was the response from cache, edge, proxy, or origin?
What upstream status did the edge observe?
Which boundary timed out or failed first?
Did application code run?
Which sensitive fields were deliberately not logged?
Which metric shows whether this is one request or a pattern?

If the answer depends on asking three teams to grep unrelated logs, the system is not yet observable at the HTTP boundary. The next design task is to propagate identity and record the decisions each intermediary makes.

Connections

The redirect and edge-policy lesson showed why rule ids, redirect targets, and rewrite decisions matter. This lesson turns those fields into incident evidence.

The capstone that follows asks you to design a full global request path. Observability is what makes that design operable: every DNS, TLS, CDN, proxy, cache, redirect, realtime, and origin decision should leave enough evidence to debug without guessing.

Resources

[RFC] HTTP Semantics RFC 9110
- Focus: Use it for status codes, fields, methods, and the semantics behind client-visible HTTP behavior.
[SPEC] W3C Trace Context
- Focus: Use it for traceparent, tracestate, and the standard way to propagate trace identity across services.
[DOC] OpenTelemetry: Traces
- Focus: Use it for spans, trace structure, context propagation, and how traces complement logs and metrics.
[DOC] MDN: Server-Timing
- Focus: Use it for exposing selected server-side timing information to clients and browser developer tools.
[DOC] Cloudflare Logs Reference
- Focus: Use it as a concrete example of edge log fields such as cache status, origin status, colo, and timing evidence.

Key Takeaways

HTTP observability is boundary evidence: client, edge, proxy, cache, origin, and application layers each see different facts.
Metrics show the shape, logs name boundary decisions, traces explain instrumented work, and wire evidence explains transport symptoms.
A stable request id or trace context is the bridge that turns separate logs into one request story.
Observability design must include privacy, sampling, and redaction; more raw data is not automatically better evidence.
