Day 190: Observability & Monitoring
Monitoring tells you when known conditions are going wrong. Observability helps you explain why the system is behaving strangely when the failure does not match a prewritten script.
Today's "Aha!" Moment
Teams often use “monitoring” and “observability” as if they were interchangeable. They are related, but they are not the same thing.
Monitoring is about predefined expectations. You decide in advance which conditions matter:
- latency too high
- error rate too high
- disk too full
- queue too backed up
Then you alert when those conditions occur.
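The list above can be sketched as code: each check is a predefined question with a fixed "bad" condition, evaluated against current values. The check names and thresholds below are illustrative, not taken from any specific tool.

```python
# Minimal sketch of threshold-based monitoring: predefined conditions,
# evaluated against observed values. Thresholds are illustrative.
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    threshold: float

    def firing(self, value: float) -> bool:
        # True when the observed value violates the predefined expectation
        return value > self.threshold


CHECKS = {
    "p99_latency_ms": Check("p99_latency_ms", 500.0),
    "error_rate": Check("error_rate", 0.01),
    "disk_used_frac": Check("disk_used_frac", 0.90),
    "queue_age_s": Check("queue_age_s", 120.0),
}


def evaluate(observed: dict) -> list:
    """Return the names of checks whose condition is currently violated."""
    return [name for name, check in CHECKS.items()
            if name in observed and check.firing(observed[name])]
```

Note that every question here was written down before the failure happened; that is exactly the strength and the limit of monitoring.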
Observability is what you need when those predefined checks are not enough. The system is slow, but only for one tenant. Error rate is fine, but one rollout path is behaving strangely. A model endpoint is returning valid responses, yet downstream outcomes suddenly worsen. At that point, the question is no longer “did threshold X fire?” The question is “what is actually happening inside this system?”
That is the aha. Monitoring is mainly about known failure conditions. Observability is about having enough rich signals to investigate the unknown or partially known ones.
Why This Matters
Suppose the warehouse company has a checkout service with an SLO on successful purchases. Monitoring is already in place:
- burn-rate alerts on failed checkouts
- latency alerts
- infra saturation dashboards
Those are necessary. But one day a new rollout causes a subtle issue:
- only one payment provider path degrades
- only one region sees it
- the overall error rate is still within normal range
- support tickets start mentioning “stuck payments” before the SLO is clearly burning
If the team only has threshold-based monitoring, they may know something feels off but still lack the path-level evidence to explain it quickly. If they also have strong observability, they can pivot through traces, structured logs, version-aware metrics, and request attributes to isolate the cause.
That is why the distinction matters. Monitoring detects and pages. Observability shortens the time between “something is wrong” and “we know enough to respond intelligently.”
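The rollout scenario above can be made concrete with a small sketch: the overall error rate looks healthy, but slicing the same requests by (provider, region) exposes the degraded path. All data here is synthetic and the dimension names are illustrative.

```python
# Synthetic checkout requests: one provider/region pair is degraded,
# but the blended error rate stays near "normal".
from collections import defaultdict

requests = (
    [{"provider": "payco", "region": "eu", "ok": False}] * 40
    + [{"provider": "payco", "region": "eu", "ok": True}] * 60
    + [{"provider": "payco", "region": "us", "ok": True}] * 900
    + [{"provider": "altpay", "region": "eu", "ok": True}] * 1000
)


def error_rate(rows):
    return sum(not r["ok"] for r in rows) / len(rows)


overall = error_rate(requests)  # 40 / 2000 = 0.02, unremarkable in aggregate

by_segment = defaultdict(list)
for r in requests:
    by_segment[(r["provider"], r["region"])].append(r)

segment_rates = {seg: error_rate(rows) for seg, rows in by_segment.items()}
# ("payco", "eu") stands out at 0.40 while every other segment is clean
```

This is the difference in practice: a global threshold never fires, but a telemetry model that preserves provider and region as dimensions makes the failure obvious in one pivot.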
Learning Objectives
By the end of this session, you will be able to:
- Explain the difference between monitoring and observability - Understand how predefined checks differ from exploratory diagnosis.
- Reason about the role of metrics, logs, and traces - Recognize how each signal supports either detection, explanation, or both.
- Design better operational visibility - Know how to combine alerts, dashboards, and investigative signals without drowning the team in noise.
Core Concepts Explained
Concept 1: Monitoring Is About Predefined Questions
Monitoring works best when the team already knows what “bad” looks like.
Examples:
- p99 latency above threshold
- SLO burn too fast
- job queue age above threshold
- node memory exhaustion risk
These are all questions you can write down ahead of time:
If this condition happens,
somebody should know now.
That is why monitoring is tied closely to alerting and service objectives. It is the discipline of turning known risk patterns into timely detection.
But monitoring has a hard limit: it can only answer the questions you thought to ask. When the system fails in a new or subtle way, monitoring alone may tell you “something is odd” without telling you what to inspect next.
Concept 2: Observability Is About Rich Enough Signals to Investigate the Unknown
Observability is not one tool. It is a property of the system and its telemetry:
- metrics show aggregate behavior
- logs preserve detailed local events
- traces preserve request path and timing across boundaries
Those three together let operators ask new questions after the system is already misbehaving.
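One way to see how the three signal types support later questions is to note what each one carries. The sketch below shows a single request emitting all three, linked by a shared trace ID; the field names and structure are illustrative, not any particular SDK's API.

```python
# Illustrative sketch: one request produces a trace span, a structured
# log line, and a metric increment, all correlated by trace_id.
import json
import time
import uuid


def handle_request(region: str) -> dict:
    trace_id = uuid.uuid4().hex
    start = time.monotonic()

    # Trace: preserves the request path and timing across boundaries
    span = {"trace_id": trace_id, "span": "checkout", "region": region}

    # Log: a detailed local event, correlated by the same trace_id
    log_line = json.dumps(
        {"trace_id": trace_id, "event": "payment_started", "region": region}
    )

    # Metric: an aggregate counter keyed by low-cardinality dimensions only
    metric = ("checkout_requests_total", {"region": region})

    span["duration_s"] = time.monotonic() - start
    return {"span": span, "log": log_line, "metric": metric}
```

The metric answers "how much, overall"; the log and span answer "what happened to this request"; the shared trace ID is what lets an operator pivot from one to the other.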
A useful operational loop looks like this:
monitoring alert or suspicion
|
v
check high-level metrics
|
v
segment by service / region / version / tenant
|
v
follow traces and correlated logs
|
v
form and test explanation
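The "segment by service / region / version / tenant" step of the loop above can be sketched as a small narrowing function: for each dimension, find the value with the highest error rate, so responders can see which pivot shrinks the search space fastest. This is a toy version under synthetic data, not a real query engine.

```python
# Toy "segment until it explains" step: rank each dimension's worst value.
from collections import defaultdict


def worst_segment(rows, dimensions):
    """For each dimension, return (value, error_rate) of its worst value."""
    ranking = {}
    for dim in dimensions:
        buckets = defaultdict(lambda: [0, 0])  # value -> [errors, total]
        for r in rows:
            bucket = buckets[r[dim]]
            bucket[0] += (not r["ok"])
            bucket[1] += 1
        ranking[dim] = max(
            ((value, errors / total) for value, (errors, total) in buckets.items()),
            key=lambda pair: pair[1],
        )
    return ranking
```

A dimension whose worst value is far above the others (here, a version at 50% errors while the region-level view is diluted) is usually the next place to follow traces and logs.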
That is why observability is so important in complex distributed systems. The failure rarely arrives with a label saying “this is the exact bug.” The team has to narrow the search space by moving between aggregated signals and detailed evidence.
Concept 3: Good Operational Visibility Balances Detection, Explanation, and Cost
A common anti-pattern is to keep adding dashboards, logs, and metrics with no clear role. That produces telemetry sprawl, high cost, and confusion during incidents.
The more useful design question is:
- which signals are for fast detection?
- which signals are for diagnosis?
- which signals are for forensics and long-tail investigation?
This maps well to:
- monitoring / alerting: detect when action is needed
- observability: provide enough context to explain the issue
- post-incident evidence: support careful review, prevention, and learning
This is where the earlier lessons connect:
- SLOs define which signals deserve to wake people up
- observability helps explain why those signals moved
- error budgets turn those observations into release and hardening decisions
The best systems do not try to alert on everything. They alert on what demands action and keep richer signals available for the investigation that follows.
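An "alert on what demands action" policy can be sketched with a multi-window burn-rate check, a pattern described in Google's SRE material. The 14.4x factor and window choices below follow that common guidance but are illustrative, not a prescription.

```python
# Hedged sketch of a multi-window burn-rate page: the long window proves
# significant budget burn, the short window proves it is still happening.
SLO_TARGET = 0.999        # e.g. 99.9% successful checkouts
BUDGET = 1 - SLO_TARGET   # allowed failure fraction


def burn_rate(error_rate: float) -> float:
    """How many times faster than 'sustainable' the budget is burning."""
    return error_rate / BUDGET


def should_page(rate_1h: float, rate_5m: float) -> bool:
    # Page only when both windows agree; 14.4x is a commonly cited
    # fast-burn factor, used here as an illustrative threshold.
    return burn_rate(rate_1h) >= 14.4 and burn_rate(rate_5m) >= 14.4
```

Everything below this bar stays out of the pager and lives in dashboards, traces, and logs, available for the investigation that follows.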
Troubleshooting
Issue: The team has lots of dashboards, but incidents still take a long time to diagnose.
Why it happens / is confusing: Dashboards may show many aggregates, but they do not necessarily preserve the path-level or contextual evidence needed for explanation.
Clarification / Fix: Add better trace correlation, structured logs, and dimensions that help slice behavior by version, region, tenant, or dependency path.
Issue: Alerting is noisy, so engineers start ignoring it.
Why it happens / is confusing: Too many alerts are tied to supporting signals rather than to conditions that truly require action.
Clarification / Fix: Anchor pages in user-visible reliability signals and SLO impact. Keep supporting telemetry available for diagnosis without turning all of it into pages.
Issue: The team can explain incidents after hours of manual digging, but not fast enough to reduce impact.
Why it happens / is confusing: The system may technically be observable, but the signals are poorly correlated or too hard to navigate under time pressure.
Clarification / Fix: Improve correlation IDs, version tags, rollout markers, and consistent signal naming so incident investigation starts with better pivots.
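The fix above often reduces to one small habit: a single logging helper that stamps every event with the same correlation fields. The field names and the version string below are hypothetical, chosen only to show the pattern.

```python
# Sketch: one helper stamps every log line with consistent correlation
# fields, so incident investigation starts with reliable pivots.
# The service name, version, and rollout marker are hypothetical values.
import json

COMMON = {"service": "checkout", "version": "2024-06-01.3", "rollout": "canary"}


def log_event(trace_id: str, event: str, **fields) -> str:
    record = {"trace_id": trace_id, "event": event, **COMMON, **fields}
    return json.dumps(record, sort_keys=True)
```

Because every line carries the same keys, queries like "all events for this trace on this rollout" work on day one of an incident instead of being reconstructed by hand.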
Advanced Connections
Connection 1: Observability & Monitoring <-> SLOs / Error Budgets
The parallel: Monitoring tied to SLOs tells the team when user-facing promises are at risk; observability explains the mechanisms that are consuming the budget.
Real-world case: Burn-rate alerts page the team, while traces and logs show whether the problem is one dependency, one rollout, or one traffic segment.
Connection 2: Observability & Monitoring <-> Incident Management
The parallel: Incident response speed depends heavily on how quickly the team can move from detection to explanation.
Real-world case: A strong monitoring system notices the outage; a strong observability system helps the incident commander and responders converge on likely causes quickly.
Resources
Optional Deepening Resources
- [SITE] Google SRE Book
- Link: https://sre.google/sre-book/table-of-contents/
- Focus: Use it for the reliability framing of monitoring, alerting, and signal quality.
- [DOCS] OpenTelemetry Documentation
- Link: https://opentelemetry.io/docs/
- Focus: Connect metrics, logs, and traces to a practical cross-service observability model.
- [DOCS] Prometheus Documentation
- Link: https://prometheus.io/docs/introduction/overview/
- Focus: Use it as the primary reference for metrics-based monitoring and alerting patterns.
- [SITE] Honeycomb: Observability Explained
- Link: https://www.honeycomb.io/observability
- Focus: Use it as a secondary practical explanation of observability as exploratory understanding rather than static dashboarding.
Key Insights
- Monitoring and observability are related but not identical - Monitoring answers known questions; observability helps investigate unknown or unusual behavior.
- Signals have different jobs - Metrics are great for detection and trend view, while logs and traces often provide the detail needed for explanation.
- The best operational systems separate paging from investigation - Alert on action-worthy conditions, then rely on rich telemetry to explain them quickly.
Knowledge Check (Test Questions)
1. What is the main difference between monitoring and observability?
- A) Monitoring is only for CPU metrics, while observability is only for traces.
- B) Monitoring focuses on predefined failure conditions, while observability helps investigate unexpected behavior.
- C) Observability replaces the need for alerting.
2. Why can a system have good monitoring but weak observability?
- A) Because it may detect threshold violations without preserving enough rich context to explain the cause quickly.
- B) Because monitoring always makes traces impossible.
- C) Because observability only matters in development.
3. What is a good operational use of logs and traces after an SLO-related alert fires?
- A) Ignore them if the alert already paged someone.
- B) Use them to narrow the failure by region, version, dependency path, or request flow and build a causal explanation.
- C) Replace the alerting system with manual log reading.
Answers
1. B: Monitoring is primarily about known bad conditions; observability is about having enough telemetry to explore and explain behavior under uncertainty.
2. A: A system may know that something went wrong but still be unable to show responders where and why it is going wrong.
3. B: Logs and traces help responders turn detection into explanation, which is what shortens incident impact.