Day 158: Observability - Metrics, Logs & Distributed Tracing
Observability matters because distributed systems only become operable when the team can reconstruct behavior from signals instead of guessing from symptoms.
Today's "Aha!" Moment
Teams often talk about observability as if it simply meant "we have metrics, logs, and traces." That is like saying a hospital is good because it owns thermometers, X-ray machines, and blood tests. The tools matter, but the real question is whether they let you explain what is happening inside the system.
Take the warehouse platform during a canary rollout. Latency rises for some requests, but not all. Error rate is still low overall. One service claims to be healthy, yet queue age is climbing and only a subset of pods show slow downstream calls. Without good signals, the team guesses. With observability, the team can ask a sharper sequence of questions: which path is slow, which services are involved, which pods or versions are affected, and what changed just before the degradation started?
That is the aha. Observability is not just telemetry collection. It is the system's ability to support explanation under changing conditions.
Once you see that, the three signal types stop looking like a checklist. Metrics summarize patterns, traces reconstruct request paths, and logs capture detailed local evidence. The value is in how they complement each other during diagnosis.
Why This Matters
Suppose a new image-processing release is promoted through the pipeline. Five minutes later, the support team reports slow response times for a subset of customers. CPU is not maxed out. The cluster is not obviously failing. The deploy itself completed cleanly.
This is the type of problem where observability makes the difference between reasoning and superstition. The team needs to know:
- is the slowdown global or localized?
- does it correlate with a version, region, or node pool?
- is the hot path inside one service or across several?
- is the issue throughput, queueing, downstream latency, or retry amplification?
In modern cloud systems, many incidents are not binary outages. They are partial, shifting, and cross-service. Observability matters because the system is too distributed for intuition alone. Without it, teams end up treating operations as detective work with missing evidence.
Learning Objectives
By the end of this session, you will be able to:
- Explain what observability is actually for - Distinguish signal collection from the ability to explain system behavior.
- Describe the role of metrics, logs, and traces - Understand what each signal is good at and where each one falls short alone.
- Reason about instrumentation trade-offs - Evaluate cardinality, cost, sampling, and signal quality as design decisions rather than afterthoughts.
Core Concepts Explained
Concept 1: Metrics, Logs, and Traces Are Different Projections of the Same System
The fastest way to make observability vague is to treat all telemetry as interchangeable. The three signal types are not.
- Metrics tell you how much, how often, or how bad in aggregate.
- Logs tell you detailed local facts at specific events.
- Traces tell you how one request or workflow moved through the system.
For the warehouse canary issue:
- metrics may show p95 latency rising for one route
- traces may show most of the added time coming from one downstream image step
- logs may reveal a specific timeout, config mismatch, or exception on the affected pods
The point is not to choose one winner. The point is to use each signal at the level where it is strongest.
metrics -> fleet pattern
traces -> request path
logs -> local evidence
This is why observability is a system, not three unrelated tools.
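A toy sketch can make the three projections concrete. The records, field names, and services below are hypothetical, but the pattern is the point: the same set of per-request records can be read as an aggregate metric, as one request's path, or as local log evidence.

```python
import json

# Hypothetical per-request records for the warehouse image pipeline.
# Each carries a trace id, route, per-span timings (ms), and local detail.
records = [
    {"trace_id": "t1", "route": "/resize", "spans": {"api": 12, "image-worker": 310}, "error": None},
    {"trace_id": "t2", "route": "/resize", "spans": {"api": 11, "image-worker": 15}, "error": None},
    {"trace_id": "t3", "route": "/resize", "spans": {"api": 14, "image-worker": 290}, "error": "downstream timeout"},
    {"trace_id": "t4", "route": "/upload", "spans": {"api": 9}, "error": None},
]

def route_latencies(route):
    # Metric view: aggregate latency per route (fleet pattern).
    return [sum(r["spans"].values()) for r in records if r["route"] == route]

def slowest_span(trace_id):
    # Trace view: where the time went for one request (request path).
    r = next(r for r in records if r["trace_id"] == trace_id)
    return max(r["spans"], key=r["spans"].get)

def log_line(trace_id):
    # Log view: local evidence attached to the affected request.
    r = next(r for r in records if r["trace_id"] == trace_id)
    return json.dumps({"trace_id": trace_id, "error": r["error"]})

print(max(route_latencies("/resize")))  # metric: worst /resize latency
print(slowest_span("t3"))               # trace: the slow component
print(log_line("t3"))                   # log: the local error detail
```

Note how each function answers a different diagnostic question over the same underlying events; none of the three views replaces the others.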
Concept 2: Good Observability Preserves Context Across Boundaries
In distributed systems, the hardest part of debugging is often loss of context. A request enters one service, then hits three more, a queue, a worker, and a cache. If each component emits signals without shared identifiers or useful dimensions, the story fragments.
Context preservation is therefore central:
- route or operation name
- service and instance identity
- deployment version
- tenant, region, or shard when relevant
- trace/span IDs for causal linkage
That is why distributed tracing matters so much. It preserves causal structure across service boundaries. But metrics and logs also need useful dimensions, or traces will explain a single request while the broader pattern stays invisible.
The practical rule is simple: instrument so that the same incident can be viewed at three levels:
- aggregate behavior
- request path
- local detail
If one of those levels is missing, diagnosis slows down sharply.
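A minimal sketch of the propagation idea, using hypothetical service names and a made-up header key (a real system would use the W3C `traceparent` header and OpenTelemetry context propagation rather than hand-rolled code):

```python
import json
import uuid

TRACE_HEADER = "x-trace-id"  # illustrative; real systems use "traceparent"

def log(service, trace_id, message):
    # Structured log line: service identity + causal id + local detail,
    # so every hop's evidence can join one system-wide narrative.
    print(json.dumps({"service": service, "trace_id": trace_id,
                      "version": "v2.3.1", "msg": message}))

def edge_handler(headers):
    # The edge reuses an incoming trace id or mints one if absent.
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    log("api-gateway", trace_id, "request accepted")
    return image_service({TRACE_HEADER: trace_id})

def image_service(headers):
    # Downstream services propagate the same id instead of minting
    # a new one; this is what preserves causal linkage across hops.
    trace_id = headers[TRACE_HEADER]
    log("image-worker", trace_id, "resize started")
    return trace_id

tid = edge_handler({})  # no incoming context: the edge creates the id
assert image_service({TRACE_HEADER: tid}) == tid  # id survives the hop
```

The design choice worth noticing: the id is created exactly once, at the boundary where the request enters the system, and every later component only reads it.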
Concept 3: Observability Is a Design Trade-off Between Insight, Cost, and Noise
Better observability is not free. Signals cost money, storage, CPU, network bandwidth, and attention.
Some common trade-offs are:
- high-cardinality metrics are powerful but expensive and easy to misuse
- verbose logs help during incidents but can drown teams and budgets
- full tracing is often too expensive, so sampling decisions matter
- instrumenting everything indiscriminately creates dashboards without understanding
This means observability has to be designed intentionally. The team needs to ask:
- what questions must we answer during failure?
- which paths and dimensions matter operationally?
- which signals need to be always on?
- what can be sampled or turned up only during incidents?
For the warehouse platform, request rate and error rate may always be measured, traces may be sampled by default but kept for slow/error paths, and logs may be structured around domain-relevant fields instead of raw text.
The right goal is not "maximum telemetry." The right goal is "enough high-quality telemetry to explain behavior without overwhelming the system or the humans operating it."
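The "sampled by default, kept for slow/error paths" policy can be sketched as a single decision function. The thresholds below are illustrative, and in practice this decision usually lives in a collector doing tail-based sampling after the request completes, not in application code:

```python
import random

BASE_SAMPLE_RATE = 0.01   # keep ~1% of healthy traffic (illustrative)
SLOW_THRESHOLD_MS = 500   # always keep traces slower than this (illustrative)

def keep_trace(duration_ms, had_error, rng=random.random):
    if had_error:
        return True                     # errors are always worth keeping
    if duration_ms >= SLOW_THRESHOLD_MS:
        return True                     # tail latency is always worth keeping
    return rng() < BASE_SAMPLE_RATE     # otherwise, sample cheaply

# Fast, healthy requests are mostly dropped; slow or failing ones survive.
assert keep_trace(800, had_error=False)
assert keep_trace(40, had_error=True)
assert keep_trace(40, had_error=False, rng=lambda: 0.99) is False
```

This is the cost trade-off made explicit: the always-on signals are the ones that answer operational questions, while the bulk of routine traffic pays only the sampled rate.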
Troubleshooting
Issue: The team has lots of dashboards but still cannot explain incidents.
Why it happens / is confusing: Signal quantity was mistaken for observability quality.
Clarification / Fix: Check whether metrics, logs, and traces preserve enough shared context to tell one coherent story across boundaries.
Issue: Logs are abundant, but nobody can correlate them across services.
Why it happens / is confusing: Events were emitted locally without consistent structure or correlation identifiers.
Clarification / Fix: Use structured logs and propagate request/trace context so local evidence can join a system-wide narrative.
Issue: Telemetry costs are growing fast while incident response is not improving much.
Why it happens / is confusing: Instrumentation was added broadly without clear questions or sampling strategy.
Clarification / Fix: Revisit which signals are always-on, which need sampling, and which dimensions actually help answer operational questions.
Advanced Connections
Connection 1: Observability ↔ CI/CD and Progressive Delivery
The parallel: Delivery systems need observability to decide whether a rollout is healthy, whether a canary is degrading behavior, and whether rollback is justified.
Real-world case: Version tags, rollout stages, and release metadata only become operationally useful when observability surfaces them.
Connection 2: Observability ↔ Reliability Engineering
The parallel: SLOs, alerting, incident response, and capacity work all depend on signals that explain system behavior at the right granularity.
Real-world case: Tail latency, queue age, dependency saturation, retry amplification, and error budgets are all observability-driven operational concepts.
Resources
Optional Deepening Resources
- [DOCS] OpenTelemetry Documentation
- Link: https://opentelemetry.io/docs/
- Focus: Use it as the primary reference for metrics, logs, traces, context propagation, and instrumentation concepts.
- [DOCS] OpenTelemetry Concepts: Signals
- Link: https://opentelemetry.io/docs/concepts/signals/
- Focus: See the official framing of how metrics, logs, and traces differ and fit together.
- [SITE] Google SRE Book
- Link: https://sre.google/sre-book/table-of-contents/
- Focus: Connect observability to incident handling, monitoring strategy, and real production operations.
- [SITE] OpenTelemetry Demo
- Link: https://opentelemetry.io/ecosystem/demo/
- Focus: See a realistic distributed system instrumented end to end.
Key Insights
- Observability is about explanation, not mere collection - The goal is to reconstruct behavior, not just store telemetry.
- Metrics, logs, and traces answer different questions - Each signal is strongest at a different level of diagnosis.
- Signal design is a resource trade-off - Good observability balances context richness against cost, noise, and operator attention.
Knowledge Check (Test Questions)
1. Which statement best captures observability?
- A) It means storing as many logs as possible.
- B) It means the system emits enough structured signals to explain what happened and why.
- C) It means dashboards are always green.
2. What are traces best at showing?
- A) Long-term fleet-wide storage growth trends.
- B) The causal path and timing of one request or workflow across services.
- C) The complete replacement for all logs and metrics.
3. Why can observability tooling become expensive without improving diagnosis?
- A) Because more telemetry automatically guarantees better understanding.
- B) Because signals can be high-volume, high-cardinality, and poorly structured if they are collected without clear operational questions.
- C) Because traces make metrics impossible.
Answers
1. B: Observability is fundamentally about explanation under uncertainty, not just about collecting raw telemetry.
2. B: Traces are strongest when you need to reconstruct one request path and see where time or failure accumulated.
3. B: Telemetry has real cost, and without good structure and purpose it creates more noise than understanding.