Day 159: Monitoring & Alerting - SLOs, SLIs & On-Call Best Practices

Monitoring and alerting matter because human attention is expensive, and SLOs tell the system when spending that attention is justified.


Today's "Aha!" Moment

Observability, from the previous lesson, helps a team explain almost any behavior if enough evidence exists. Monitoring is narrower and more opinionated: it chooses a small set of questions the system keeps answering all the time. Alerting is narrower still: it decides which answers are important enough to interrupt a human right now.

That distinction matters because many teams collect lots of telemetry and still run bad operations. A dashboard can show CPU, memory, queue depth, retries, cache hit ratio, pod restarts, and trace latency all at once, but that does not tell the on-call engineer what deserves action. Without a policy, everything looks interesting and nothing feels urgent.

Think about the warehouse image-processing platform from the last lesson. A new rollout causes occasional slow image transforms in one region. If the team pages on every local symptom, on-call gets noise: one page for CPU spikes, another for worker restarts, another for queue growth, and another for a temporary retry burst. If instead the team pages when the user-facing latency SLO is burning too quickly, the alert means something concrete: user promises are now at risk.

That is the aha. Monitoring is not "collect lots of metrics." Alerting is not "notify when any metric crosses a threshold." The real job is to connect telemetry to reliability promises, then use those promises to decide when a human should care immediately.


Why This Matters

Suppose customers upload product photos and expect processed thumbnails within a few seconds. After a canary deploy, the system still returns most requests successfully, but a subset now waits much longer because one worker pool is retrying a downstream storage call. The cluster is not obviously down. Average latency still looks acceptable. Most dashboards stay green.

This is exactly where good monitoring and alerting separate mature operations from dashboard theater.

If the team pages on raw CPU or on any individual 5xx, they burn people without protecting users. If they only look at broad averages, they miss the fact that tail latency is already harming a meaningful slice of requests. If they do not define an SLO, they have no shared answer to a simple question: are we still delivering the service we promised?

Monitoring matters because systems are noisy and operators have limited time. Alerting matters because attention is a production resource. SLOs matter because they convert vague reliability talk into a concrete operational contract.


Learning Objectives

By the end of this session, you will be able to:

  1. Explain the roles of SLI, SLO, SLA, and error budget - Understand how measurements become operational promises.
  2. Distinguish monitoring from alerting - Separate dashboards, tickets, and paging alerts by the action they are meant to trigger.
  3. Design healthier on-call signals - Recognize which alerts are actionable, which are noisy, and how to align them with user impact.

Core Concepts Explained

Concept 1: Monitoring Is the Standing Set of Questions You Refuse to Forget

Observability lets you ask many questions during an investigation. Monitoring chooses the few questions that should already be answered before an incident becomes a surprise.

For the image-processing platform, those questions might be:

  • Are thumbnail requests completing within the latency users were promised?
  • Is the processing queue draining at least as fast as new uploads arrive?
  • Are downstream errors and retries staying within their normal range?
Those are not arbitrary metrics. They are operational questions tied to user experience and system pressure.

This is why "collect everything" is not a monitoring strategy. Monitoring only becomes useful when it is attached to a decision. A metric that helps an engineer debug later can still be valuable, but it is not automatically a monitoring signal. Monitoring is the curated layer of telemetry that continuously answers, "Are we healthy enough right now?" and "What is starting to drift?"

In practice, that means monitoring should include a mix of:

  • User-facing health signals, such as request success rate and tail latency.
  • Saturation and pressure signals, such as queue depth and worker utilization.
  • Change signals, such as error and latency trends on a newly deployed version.

The important part is not the category name. The important part is whether the signal corresponds to a system promise or an approaching bottleneck.
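To make the "standing questions" idea concrete, here is a minimal sketch in Python. The signal names and thresholds are illustrative assumptions, not a real monitoring schema; the point is only that each signal maps to a question with a yes/no answer.

```python
# A minimal sketch of monitoring as a curated set of standing questions.
# Signal names and thresholds below are illustrative assumptions.

def standing_questions(snapshot):
    """Continuously answer the curated questions from one telemetry snapshot."""
    return {
        "serving_within_promise": snapshot["p99_latency_ms"] <= 2000,
        "queue_draining": snapshot["queue_depth"] <= 0.8 * snapshot["queue_capacity"],
        "errors_within_budget": snapshot["error_rate"] <= 0.005,
    }

healthy = {"p99_latency_ms": 1450, "queue_depth": 120,
           "queue_capacity": 1000, "error_rate": 0.001}
answers = standing_questions(healthy)  # every answer is True for this snapshot
```

A dashboard can show dozens of raw metrics; this layer shows three answers, each tied to a promise or a bottleneck.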

Concept 2: SLI, SLO, SLA, and Error Budget Give Alerts Meaning

An SLI is the measurement. An SLO is the target built from that measurement. An SLA is an external commitment, often contractual. The error budget is the allowed amount of unreliability implied by the SLO.

For example, for the image-processing platform (numbers illustrative):

  • SLI: the fraction of thumbnail requests that complete successfully within 2 seconds.
  • SLO: 99.5% of requests meet that SLI over a rolling 30-day window.
  • SLA: an external, often contractual commitment, usually looser than the internal SLO.
  • Error budget: the remaining 0.5% of requests that are allowed to miss the target.

The path from a single request to an alerting decision looks like this:

user request
    |
    v
instrumented outcome
    |
    v
SLI calculation
    |
    v
compare against SLO
    |
    v
error budget burn
    |
    +--> page now
    +--> create ticket
    +--> keep watching

This model is powerful because it turns alerting from threshold guessing into policy. A brief burst of errors may be interesting but not page-worthy if the error budget burn is small and self-correcting. A moderate but sustained rise in tail latency may deserve a page if it is consuming reliability too quickly.

That is also why burn-rate style alerts are so useful. They do not merely ask, "Did a number cross a line?" They ask, "Are we spending reliability faster than we can afford?" That question is much closer to what the business and the on-call engineer actually care about.
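As a concrete sketch, a burn-rate check reduces to a few lines. The SLO target, window pairing, and thresholds below are illustrative assumptions, loosely modeled on common multi-window burn-rate practice, not a prescription.

```python
# Sketch: error-budget burn rate for an assumed 99.5% "good requests" SLO.
SLO_TARGET = 0.995
ERROR_BUDGET = 1.0 - SLO_TARGET   # 0.5% of requests may be bad

def burn_rate(bad_requests, total_requests):
    """1.0 means spending budget at exactly the sustainable rate;
    higher values mean the budget runs out before the window ends."""
    if total_requests == 0:
        return 0.0
    return (bad_requests / total_requests) / ERROR_BUDGET

def alert_action(short_window_rate, long_window_rate):
    """Map burn rates over a short and a long window to the three
    outcomes in the diagram above. Thresholds (14x, 3x) are illustrative."""
    if short_window_rate >= 14 and long_window_rate >= 14:
        return "page now"
    if short_window_rate >= 3 and long_window_rate >= 3:
        return "create ticket"
    return "keep watching"
```

Requiring both windows to burn hot filters out brief, self-correcting bursts: the short window proves the problem is happening now, and the long window proves it is not just a blip.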

Concept 3: Good Alerting Protects Humans as Much as Systems

A page is not just a notification. It is an instruction to interrupt sleep, context, and planned work. That makes alerting a human-factors design problem as much as a technical one.

Good alerting separates different actions:

  • Page: interrupt a human now, because user-facing harm is happening or imminent.
  • Ticket: someone should act soon, but it can safely wait for business hours.
  • Dashboard or log: context for investigation and trend-watching, with no required action.

This sounds obvious, but many alerting systems fail exactly here. Teams often page on low-level symptoms because those metrics are easy to instrument, even when those symptoms neither predict user harm nor tell the on-call engineer what to do next.

A healthier rule is that every paging alert should have four properties:

  • Real: it reflects actual or imminent user impact, not a transient artifact.
  • Urgent: the situation genuinely cannot wait until business hours.
  • Actionable: the responder can do something concrete in response.
  • Judgment-requiring: automation alone cannot safely resolve it.

For the image-processing platform, a good paging alert might be "latency SLO burn is high in region eu-west for the new canary version." A poor paging alert might be "worker CPU exceeded 80% for 2 minutes" when autoscaling already handles that condition and no user impact exists yet.
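One possible reading of that four-property bar as a checklist, sketched in Python. The field names are illustrative assumptions, not a real alerting API; real systems encode this bar in routing rules rather than code like this.

```python
# Sketch: screen a candidate page against a four-property bar.
# Field names are illustrative assumptions.

def passes_paging_bar(alert):
    return all((
        alert["user_impacting"],  # real or imminent harm to the service promise
        alert["urgent"],          # cannot wait until business hours
        alert["actionable"],      # the responder can do something concrete
        alert["needs_human"],     # automation alone cannot resolve it
    ))

slo_burn_page = {"user_impacting": True, "urgent": True,
                 "actionable": True, "needs_human": True}
cpu_spike_page = {"user_impacting": False, "urgent": False,
                  "actionable": False, "needs_human": False}
```

The CPU-spike alert fails on every property at once, which is exactly why it belongs on a dashboard, not in a pager.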

This is what on-call best practices are really about. The goal is not to page earlier or later in the abstract. The goal is to spend human attention where it changes outcomes.


Troubleshooting

Issue: The team has many alerts, but major incidents still slip through.

Why it happens / is confusing: Alerts were attached to local symptoms instead of user journeys, SLOs, or error-budget burn.

Clarification / Fix: Start from the service promise first, define a small number of SLIs, then page on meaningful degradation of those signals rather than on every component-level anomaly.

Issue: Dashboards look healthy, but customers still report slow behavior.

Why it happens / is confusing: The chosen SLI or aggregation hides the affected slice, such as one route, region, tenant, or tail-latency band.

Clarification / Fix: Revisit the SLI definition and segmentation. Averages often hide the very failures users feel first.
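A tiny numeric sketch, with synthetic latencies, of how an average hides the tail:

```python
# 95 fast requests and 5 very slow ones (synthetic latencies, in ms).
latencies_ms = [100] * 95 + [8000] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)            # 495 ms: looks tolerable
p99_ms = sorted(latencies_ms)[int(0.99 * len(latencies_ms))]  # 8000 ms: the real pain
```

The mean stays under half a second while one in twenty users waits eight seconds; an SLI built on the mean would never fire for them.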

Issue: On-call engineers ignore alerts or silence them aggressively.

Why it happens / is confusing: The alert stream mixes urgent pages, background noise, and unactionable information into one channel.

Clarification / Fix: Reclassify signals into page, ticket, and dashboard layers. Every page should have an owner, a response path, and a reason it could not wait until business hours.


Advanced Connections

Connection 1: Monitoring & Alerting <-> Observability

The parallel: Observability gives the evidence needed to explain behavior; monitoring chooses the standing questions; alerting decides when the answers require immediate intervention.

Real-world case: During a canary rollout, traces help explain the slow path, but the paging decision should still come from SLO burn and user-visible risk.

Connection 2: Monitoring & Alerting <-> Release Engineering

The parallel: Error budgets turn reliability into a control signal for release pace, rollback decisions, and change management.

Real-world case: A team that is burning budget too quickly may pause risky rollouts even if no full outage has happened yet.
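That control loop can be sketched in a few lines. The thresholds are illustrative assumptions; teams tune them to their own SLO windows and release cadence.

```python
# Sketch: remaining error budget as a gate on release pace.
def release_policy(budget_remaining_fraction):
    """Map the remaining error budget (1.0 = untouched, 0.0 = spent)
    to a release posture. Thresholds are illustrative, not prescriptive."""
    if budget_remaining_fraction <= 0.0:
        return "freeze releases"
    if budget_remaining_fraction < 0.25:
        return "pause risky rollouts"
    return "normal release pace"
```

The useful property is that the decision is mechanical: nobody has to argue about whether "things feel risky" mid-incident.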




Key Insights

  1. Monitoring is a selected layer of operational questions - Not every metric is a monitoring signal, and not every signal deserves an alert.
  2. SLOs give alerting policy its meaning - They connect telemetry to an explicit service promise and to the cost of breaking it.
  3. Good alerting protects human attention - The point of a page is not notification itself, but timely action on meaningful user risk.

Knowledge Check (Test Questions)

  1. Which statement best distinguishes monitoring from alerting?

    • A) Monitoring stores logs, while alerting stores metrics.
    • B) Monitoring continuously evaluates important service questions, while alerting decides when the answers justify immediate human action.
    • C) Monitoring is for developers, while alerting is for managers.
  2. Why is an error budget useful?

    • A) It tells the team how many servers to buy.
    • B) It converts an SLO into a usable allowance for unreliability and helps decide when reliability risk is becoming unacceptable.
    • C) It removes the need for incident response.
  3. Which alert is most likely to be page-worthy?

    • A) A brief CPU spike on one worker that autoscaling already absorbed.
    • B) A dashboard showing weekly cache hit ratio trends.
    • C) Sustained burn of the latency SLO for a user-facing request path during an active rollout.

Answers

1. B: Monitoring asks standing operational questions continuously; alerting is the escalation policy layered on top of those answers.

2. B: The error budget turns a reliability target into something operational teams can reason about during releases and incidents.

3. C: This alert is tied to user-visible risk and to a condition where timely action could change the outcome.


