Day 047: Observability Across Local and Distributed Systems
Observability is the ability to infer what the system is actually doing from the evidence it emits, especially when no single process or machine can tell the whole story.
Today's "Aha!" Moment
Many teams learn observability through tools: dashboards, logs, tracing backends, profiling views. Those tools matter, but they can hide the central idea. Observability is not a product category. It is a diagnostic capability. The system is observable when you can move from a symptom (“the request is slow,” “errors are spiking,” “the worker is stuck”) to a believable explanation using the signals the system emits.
That explanation gets harder as systems become more distributed. In a single process, you can often inspect memory, CPU, syscalls, or local logs directly. In a multi-service request path, no one process contains the full story. The request crosses machines, queues, retries, caches, and storage layers. At that point, observability is the skill of stitching together different kinds of evidence, not the skill of staring at one dashboard longer.
Take a learner clicking “complete lesson.” The API accepts the request, calls the progress service, writes durable state, emits an event, and maybe triggers recommendations. Users report slowness. Metrics tell you the slowdown is real and broad. Traces show the request spent most of its time in one downstream call. Logs reveal that the downstream service timed out on storage and switched to a fallback queue. If needed, local process tools then explain why that storage client was blocking. That whole chain is observability in practice.
The key shift is this: logs, metrics, traces, and local profiling are not competing answers. They are different kinds of evidence used at different stages of explanation. Once you see that, observability stops being a shopping list and starts being a workflow.
Why This Matters
The problem: Teams often collect large amounts of telemetry yet still struggle to explain failures because the signals are not correlated or are treated as separate tool silos.
Before:
- Dashboards show symptoms but not causes.
- Logs contain detail but no useful correlation.
- Traces exist but are expected to replace every other signal.
- Local process debugging is forgotten as soon as the architecture becomes distributed.
After:
- Metrics detect pattern and scale.
- Logs preserve concrete decisions and events.
- Traces reconstruct end-to-end causality.
- Local introspection remains available when the problem collapses into one host or process.
Real-world impact: Faster incident response, better instrumentation choices, clearer handoffs between platform and application teams, and fewer investigations that end with “the network was weird” or “CPU was high” without a real explanation.
Learning Objectives
By the end of this session, you will be able to:
- Define observability as explanation, not telemetry volume - Explain why the goal is inference about internal state, not raw data collection.
- Use metrics, logs, and traces together - Distinguish what each signal reveals best and how they complement one another.
- Connect distributed evidence back to local behavior - Understand when a distributed symptom still requires host- or process-level investigation.
Core Concepts Explained
Concept 1: Metrics Show Shape and Trend, but Usually Not the Full Story
Suppose the lesson-completion path starts timing out for some users. The first useful question is often broad: is this one odd request or a systemic pattern? Metrics answer that well. They tell you whether p95 latency is climbing, whether error rate is rising, whether queue depth is growing, or whether CPU saturation is spreading across a fleet.
That is their strength. Metrics compress many events into aggregate shape:
- how bad is it?
- how widespread is it?
- when did it begin?
- which service or dependency changed first?
Metrics are therefore excellent for detection and scoping. But they intentionally throw away detail. A histogram can tell you latency shifted. It cannot tell you which exact request path stalled or what decision the service made when it did.
One good way to think about metrics is:
metrics answer: "is there a pattern worth explaining?"
The trade-off is scale versus detail. You gain broad visibility across time and fleets, but you lose the request-specific context needed to explain individual failures.
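The detection question can be sketched as a tiny aggregation in Python. This is illustrative, not a real metrics backend: the latency numbers are invented, and the nearest-rank percentile is just one simple way to compute p95.

```python
import math

def p95(latencies_ms):
    """95th-percentile latency via the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# One scrape window of per-request latencies: most fast, a few stalled.
window = [40, 42, 45, 41, 44, 43, 40, 46, 900, 950]

tail = p95(window)                 # the tail has clearly shifted
mean = sum(window) / len(window)   # the mean alone understates the tail
```

Notice what the aggregate preserves and discards: `tail` tells you a pattern exists and how bad it is, but nothing here can say which request stalled or why.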
Concept 2: Logs and Traces Explain Different Parts of the Causal Story
Once metrics tell you there is a real problem, you usually need two other lenses: event detail and causal path.
Logs are best at preserving local decisions and concrete events. A service can log that a storage write exceeded its deadline, that it switched to a fallback queue, or that input validation rejected the request. This is information a metric typically smooths away.
Traces do something different. They reconstruct one request across boundaries. A trace can show that the lesson-completion request spent 40 ms in the API, 80 ms in auth, and 700 ms waiting on storage inside the progress service. That gives you path structure and timing, not just local narrative.
metrics -> something is wrong
trace -> where along the path it became wrong
logs -> what each component decided while it was going wrong
A structured log event that carries trace context might look like:
log_event = {
    "trace_id": "abc-123",
    "service": "progress",
    "event": "storage_write_timeout",
    "fallback": "queued_retry",
}
The example is simple, but the point is crucial: correlation fields turn logs and traces into joint evidence instead of separate archives. Without a shared trace or request identifier, distributed investigations become much slower and more speculative.
The trade-off is specificity versus volume. Logs and traces carry much richer detail than metrics, but they are also more expensive to collect, store, and interpret well.
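A small sketch of why the shared identifier matters: given hypothetical log events shaped like the one above, a common `trace_id` is what lets you join scattered lines from different services into one request's narrative.

```python
from collections import defaultdict

# Hypothetical log events from two services; field names are illustrative.
events = [
    {"trace_id": "abc-123", "service": "api", "event": "request_received"},
    {"trace_id": "abc-123", "service": "progress", "event": "storage_write_timeout"},
    {"trace_id": "xyz-789", "service": "api", "event": "request_received"},
    {"trace_id": "abc-123", "service": "progress", "event": "fallback_queued"},
]

def by_trace(events):
    """Join log lines into per-request stories via the shared trace_id."""
    grouped = defaultdict(list)
    for e in events:
        grouped[e["trace_id"]].append((e["service"], e["event"]))
    return dict(grouped)

story = by_trace(events)
# story["abc-123"] now reads as one request's sequence of decisions.
```

Without the `trace_id` field, the same four lines are just entries in two unrelated archives.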
Concept 3: Distributed Symptoms Often Collapse Back into Local Process Questions
A common mistake is to think observability becomes purely “distributed” once traces and dashboards exist. In reality, many distributed symptoms end with a local question. A trace shows storage latency dominates one service. Now you may need to ask: is that process starved for CPU, under memory pressure, waiting on disk I/O, pausing for garbage collection, contending on locks, or stuck in syscalls?
This is where local observability remains essential. Process-level metrics, structured application logs, profiles, and host tools still matter because distributed systems are made of local systems. A beautiful trace that ends in “service X was slow” is not a full explanation until you understand what happened inside service X.
One practical workflow looks like this:
1. metrics detect the pattern
2. traces isolate the critical path
3. logs explain local decisions/events
4. host/process tools explain local resource behavior if needed
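Step 4 can be approximated even from inside the process with the standard library: comparing wall-clock time to CPU time hints at whether a call was computing or waiting. The `time.sleep` below is a stand-in for a blocking storage call; real investigations would reach for profilers or host tools.

```python
import time

def profile_call(fn):
    """Return (wall, cpu) seconds for one call. A large gap between them
    suggests the function was waiting (I/O, locks, sleep), not computing."""
    wall_start, cpu_start = time.monotonic(), time.process_time()
    fn()
    wall = time.monotonic() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu

# Simulate a call that blocks rather than burns CPU.
wall, cpu = profile_call(lambda: time.sleep(0.2))
blocked = wall - cpu  # time spent off-CPU, i.e. waiting
```

If `cpu` were close to `wall` instead, you would suspect computation or GC rather than a stalled dependency, and a CPU profile would be the next step.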
This is why observability “across local and distributed systems” is one subject, not two. The distributed view tells you where to zoom in. The local view tells you what was actually happening when you got there.
The trade-off is complexity versus explanatory power. A richer observability stack requires better instrumentation discipline, context propagation, and data hygiene, but it is what turns vague symptoms into actionable explanations.
Troubleshooting
Issue: One telemetry source is expected to answer every debugging question.
Why it happens / is confusing: Tools are often sold as categories (dashboards, logs, tracing), which encourages siloed thinking rather than diagnostic sequencing.
Clarification / Fix: Start with a narrower question. Do you need pattern, path, or local event detail? Use the signal best suited to that question, then bring in the others to refine the explanation.
Issue: Signals exist, but they cannot be correlated across services.
Why it happens / is confusing: Teams instrument quickly but do not consistently propagate request IDs or trace context.
Clarification / Fix: Treat correlation context as foundational instrumentation. Without shared identifiers, logs, spans, and local events remain isolated clues instead of connected evidence.
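One lightweight way to make correlation context foundational, sketched with Python's `contextvars`. In practice a tracing library such as OpenTelemetry manages this propagation for you; the function and field names here are illustrative.

```python
import contextvars

# Hypothetical correlation context, set once at the service boundary.
trace_id = contextvars.ContextVar("trace_id", default="unset")

def log(event):
    """Every log line automatically carries the current trace context."""
    return {"trace_id": trace_id.get(), "event": event}

def handle_request(incoming_trace_id):
    trace_id.set(incoming_trace_id)  # extract once from incoming headers
    return process()                 # deeper calls need no extra parameter

def process():
    # No trace_id argument threaded through, yet the log line is correlated.
    return log("storage_write_timeout")

entry = handle_request("abc-123")
```

The payoff is that correlation stops depending on every function remembering to pass an ID along; the context travels with the request.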
Advanced Connections
Connection 1: Observability ↔ Operating-System Debugging
The parallel: Local debugging skills (process metrics, resource accounting, profiling, syscall inspection) do not disappear in distributed systems. They become the zoomed-in layer of the same diagnostic workflow.
Real-world case: A cross-service trace may reveal the slow service, but only local CPU, memory, I/O, or lock investigation explains why that service was slow.
Connection 2: Observability ↔ Production Engineering
The parallel: Production quality depends not just on keeping systems up, but on making failures explainable under time pressure.
Real-world case: Teams often improve incident response more by propagating trace context and structuring logs well than by adding more disconnected dashboards.
Resources
Optional Deepening Resources
- These resources are optional and are not required for the core 30-minute path.
- [DOC] OpenTelemetry Documentation
- Link: https://opentelemetry.io/docs/
- Focus: Review how traces, metrics, and logs can share context and semantics.
- [DOC] Prometheus Overview
- Link: https://prometheus.io/docs/introduction/overview/
- Focus: Revisit metrics as aggregate signals for detection and trend analysis.
- [BOOK] Site Reliability Engineering
- Link: https://sre.google/sre-book/table-of-contents/
- Focus: See observability in the context of production incidents, latency, and operational feedback loops.
- [DOC] perf_event_open(2)
- Link: https://man7.org/linux/man-pages/man2/perf_event_open.2.html
- Focus: Connect distributed symptoms back to host- and process-level investigation when a trace ends at one slow service.
Key Insights
- Observability is about explanation - The goal is to infer internal state from emitted evidence, not merely to collect telemetry.
- Metrics, logs, and traces answer different diagnostic questions - Pattern, event detail, and causal path are complementary, not interchangeable.
- Distributed diagnosis often ends in local debugging - A trace can identify the slow service, but local process behavior often explains why it was slow.
Knowledge Check (Test Questions)
1. What do metrics usually reveal best?
- A) Broad behavioral shape such as latency trends, error rate shifts, and saturation patterns.
- B) The exact body of one failed request.
- C) Full request causality by themselves.
2. Why are traces especially valuable in distributed systems?
- A) Because they reconstruct the path and timing of one request across service boundaries.
- B) Because they replace the need for logs and metrics entirely.
- C) Because they summarize fleet-wide trends better than metrics.
3. Why does local process investigation still matter in a distributed system?
- A) Because distributed symptoms often reduce to one host or process behaving badly, and that still requires local evidence to explain.
- B) Because traces cannot cross service boundaries.
- C) Because metrics are only useful on one machine.
Answers
1. A: Metrics are strongest for showing that a broad pattern exists and how large it is across time or fleets.
2. A: Traces are best for reconstructing one request’s end-to-end causal path and showing where time or failure accumulated.
3. A: Distributed observability tells you where to look; local observability often tells you what was actually happening once you get there.