Day 189: SLOs, SLIs, and Error Budgets

SLOs matter because they turn “we want reliability” into an explicit contract about how much failure is acceptable and how fast the team may safely change the system.

Today's "Aha!" Moment

Many teams say they care about reliability, but what they actually have is a collection of feelings:

“this service should almost always work”
“alerts are too noisy”
“we need to ship faster”
“the platform feels unstable lately”

Those statements are directionally useful, but they are not enough to run a service. Without a shared target, product teams and platform teams argue from different instincts. One group wants more features. Another wants more hardening. Everyone says reliability matters, but nobody has a precise way to decide when the system is healthy enough to keep changing aggressively and when it is not.

That is why SLOs, SLIs, and error budgets exist. They give the organization a contract:

SLI: what signal represents user-visible reliability?
SLO: what level of that signal are we promising over a time window?
error budget: how much unreliability can we still “spend” before we should slow down risky change?

That is the aha. These concepts are not reporting vocabulary. They are decision vocabulary. They tell the team when the system is healthy enough to keep pushing and when reliability debt has become too high to ignore.

Why This Matters

Suppose the warehouse company runs a checkout API that is changing quickly. Product wants fast experimentation. Platform wants safer rollouts. On-call engineers want fewer incidents. Everyone is reasonable, but they are optimizing for different goals.

Without clear service objectives, the arguments become vague:

“the service feels flaky”
“the incident wasn’t that bad”
“we can’t block this release for a small blip”
“we keep shipping too aggressively”

Now compare that with a service that has:

an SLI for successful checkout completion
an SLO of 99.9% success over 30 days
a visible error budget that shows how much reliability headroom remains

Now the conversation changes. A risky launch after repeated incidents is no longer “Ops being cautious.” It is a decision to spend scarce reliability budget. A hardening sprint is no longer “slowing the roadmap.” It is the justified response when the budget is nearly exhausted.

That is why SLOs are so important. They make reliability negotiable in a disciplined way instead of leaving it as a permanent argument with no shared unit of measurement.

Learning Objectives

By the end of this session, you will be able to:

Explain the role of SLI, SLO, and error budget - Understand how they fit together as one operating contract.
Choose better reliability signals - Recognize that the most useful SLIs are user-visible and decision-relevant, not just easy to measure.
Use error budgets operationally - Know how they guide rollout speed, incident response, and trade-offs between feature velocity and hardening.

Core Concepts Explained

Concept 1: An SLI Is the Reliability Signal That Represents User Experience

An SLI is not “any metric we already have.” It is a measured indicator that stands in for the aspect of service quality users actually depend on.

Good examples:

successful checkout requests
requests completed under a latency threshold
job completion within an expected deadline
read or write success for a user-visible API

Weak examples:

CPU utilization by itself
pod count by itself
internal queue depth without user impact context

Those internal metrics can still matter, but they are usually supporting signals, not the primary reliability contract.

The design question is:

If this signal degrades, will users actually feel it?

If the answer is unclear, the SLI is probably too far from the user experience.

This is why SLI design is hard. Teams often default to what is easy to instrument instead of what best represents the service promise.
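As a minimal sketch of the idea, an SLI can be computed as the ratio of good events to all valid events. The `Request` record, its field names, and the 300 ms threshold below are illustrative assumptions, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One observed request; the field names are hypothetical."""
    succeeded: bool
    latency_ms: float


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that succeeded (good events / valid events)."""
    if not requests:
        return 1.0  # assumption: no traffic is treated as no failures
    good = sum(1 for r in requests if r.succeeded)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests that succeeded and finished under threshold_ms."""
    if not requests:
        return 1.0  # same assumption as above
    good = sum(1 for r in requests if r.succeeded and r.latency_ms < threshold_ms)
    return good / len(requests)


sample = [
    Request(succeeded=True, latency_ms=120.0),
    Request(succeeded=True, latency_ms=450.0),
    Request(succeeded=False, latency_ms=80.0),
    Request(succeeded=True, latency_ms=200.0),
]
print(availability_sli(sample))    # 3 of 4 requests succeeded
print(latency_sli(sample, 300.0))  # 2 of 4 succeeded in under 300 ms
```

Both functions count only what a user would notice: a failed request, or a request that was too slow. CPU or pod counts never appear in the ratio.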

Concept 2: The SLO Turns a Signal into a Promise Over Time

An SLO adds two things to an SLI:

a target
a time window

For example:

99.9% successful checkout requests over 30 days
99% of search requests under 300 ms over 7 days

That matters because reliability is never perfect. The SLO does not promise zero failure. It defines how much imperfection is acceptable for this service and this user expectation.

A useful mental model:

SLI = what we measure
SLO = how good it must be
window = over what period we judge it
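
The mental model above can be sketched as a small data structure. The `Slo` class and its field names are hypothetical, and the event counts are made-up numbers chosen only to show one objective that is met and one that is missed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A target plus a window, attached to a named SLI (illustrative only)."""
    name: str
    target: float     # 0.999 means 99.9% of events must be good
    window_days: int  # period over which the target is judged

    def is_met(self, good_events: int, total_events: int) -> bool:
        """True when the measured ratio over the window reaches the target."""
        if total_events == 0:
            return True  # assumption: an idle window does not violate the SLO
        return good_events / total_events >= self.target


checkout = Slo("checkout success", target=0.999, window_days=30)
search = Slo("search under 300 ms", target=0.99, window_days=7)

# 99.92% measured against a 99.9% target
print(checkout.is_met(good_events=999_200, total_events=1_000_000))
# 98.5% measured against a 99% target
print(search.is_met(good_events=9_850, total_events=10_000))
```

The counts passed to `is_met` are assumed to cover exactly the SLO's window; how they are collected is outside this sketch.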

The target should be high enough to protect users and low enough to remain realistic. If it is too loose, it does not drive behavior. If it is unrealistically strict, the team will learn to ignore it or negotiate around it constantly.

This is why SLOs are product and engineering decisions together. They are not purely technical thresholds. They define how much reliability the organization is willing to promise and pay for.

Concept 3: The Error Budget Is the Bridge Between Reliability and Change Velocity

The error budget is what makes the whole model operational.

If an SLO allows 0.1% failure over a window, that tolerated unreliability is the budget. As incidents and bad deploys consume it, the team has less headroom for risky change.
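
The arithmetic is simple enough to sketch. The function names and the traffic figures below are assumptions for illustration. The minutes calculation further assumes that an outage fails every request and that traffic is evenly spread across the window.

```python
def error_budget_events(target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the window."""
    return (1.0 - target) * total_events


def budget_remaining(target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the budget still unspent; a negative value means overspent."""
    budget = error_budget_events(target, total_events)
    if budget <= 0:
        return 0.0  # assumption: no traffic or a 100% target leaves nothing to spend
    return 1.0 - bad_events / budget


def budget_minutes(target: float, window_days: int) -> float:
    """The same budget expressed as minutes of total outage in the window."""
    return (1.0 - target) * window_days * 24 * 60


# A 99.9% target over 1,000,000 requests tolerates about 1,000 failures.
print(round(error_budget_events(0.999, 1_000_000)))
# After 400 failures, about 60% of the budget is left.
print(round(budget_remaining(0.999, 1_000_000, bad_events=400), 3))
# Over 30 days, 0.1% corresponds to roughly 43 minutes of full outage.
print(round(budget_minutes(0.999, 30), 1))
```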

This is the part that turns reliability from dashboarding into control logic:

user-visible reliability signal
        |
        v
compare against SLO
        |
        v
remaining error budget
        |
        +--> budget healthy -> normal release pace
        +--> budget burning -> slow down / tighten rollout
        +--> budget exhausted -> prioritize reliability work

That is why error budgets are so powerful. They do not say “never ship when incidents happen.” They say “ship in proportion to the reliability headroom you still have.”
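
One way to make the flow above concrete is a small policy function. This is a sketch under stated assumptions: the `ReleasePolicy` names are hypothetical, and treating "budget burning" as "less than half the budget left" is an arbitrary threshold that each team would set for itself.

```python
from enum import Enum


class ReleasePolicy(Enum):
    NORMAL = "normal release pace"
    CAUTIOUS = "slow down / tighten rollout"
    RELIABILITY_FIRST = "prioritize reliability work"


def release_policy(budget_remaining: float, caution_below: float = 0.5) -> ReleasePolicy:
    """Map the unspent fraction of the error budget to a release posture.

    budget_remaining is 1.0 when nothing is spent and 0.0 or below when the
    budget is exhausted. caution_below is an assumed, team-chosen threshold.
    """
    if budget_remaining <= 0.0:
        return ReleasePolicy.RELIABILITY_FIRST
    if budget_remaining < caution_below:
        return ReleasePolicy.CAUTIOUS
    return ReleasePolicy.NORMAL


for remaining in (0.8, 0.3, 0.0):
    print(remaining, "->", release_policy(remaining).value)
```

The point of writing the rule down, even this crudely, is that the response to a shrinking budget is agreed before the incident rather than argued during it.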

This connects directly to the last lesson on golden paths and platform defaults. A mature platform does not just help teams deploy faster; it helps them deploy in ways compatible with the reliability budget they have left.

Troubleshooting

Issue: The team picked an SLI because it was easy to graph, but it does not reflect user pain well.

Why it happens / is confusing: Internal technical signals are often more accessible than user-visible outcomes, so teams substitute convenience for relevance.

Clarification / Fix: Re-evaluate the SLI from the user’s perspective. Keep internal metrics for diagnosis, but anchor the contract in a signal that represents the real service promise.

Issue: The SLO is so strict that it is violated constantly and no one takes it seriously.

Why it happens / is confusing: Teams often choose aspirational numbers rather than realistic reliability objectives tied to user expectations and system maturity.

Clarification / Fix: Reset the target to something demanding but credible, then improve it deliberately over time if the system and business case justify it.

Issue: Error budgets are calculated but never influence releases or priorities.

Why it happens / is confusing: The organization adopted the vocabulary without attaching real operating rules to it.

Clarification / Fix: Define explicit actions for different budget states: when to slow rollouts, when to freeze risky changes, and when to prioritize hardening work.

Advanced Connections

Connection 1: SLOs <-> Golden Paths & Platform Engineering

The parallel: Platform defaults become more valuable when they help teams stay inside reliability budgets without custom work each time.

Real-world case: Standard rollout, observability, and rollback patterns reduce the chance that every team spends its budget on avoidable deployment mistakes.

Connection 2: SLOs <-> Observability & Monitoring

The parallel: Observability provides the raw signals and debugging context, but SLOs decide which signals count as the service contract.

Real-world case: Many metrics can exist, but only a few user-visible ones should drive error-budget decisions directly.

Resources

Optional Deepening Resources

[SITE] Google SRE Book
- Link: https://sre.google/sre-book/table-of-contents/
- Focus: Use it as the primary conceptual reference for how SLOs and error budgets guide reliability decisions.
[SITE] Google SRE Workbook
- Link: https://sre.google/workbook/table-of-contents/
- Focus: Study practical examples of SLI choice, SLO design, and how error budgets affect operations and release policy.
[DOCS] OpenSLO
- Link: https://openslo.com/
- Focus: Connect the conceptual model to a portable specification format for defining service objectives.
[SITE] Nobl9 SLO Guide
- Link: https://www.nobl9.com/service-level-objectives
- Focus: Use it as a secondary applied reference for how teams operationalize objectives and budget tracking.

Key Insights

SLIs, SLOs, and error budgets form one operating model - The metric, the target, and the remaining failure headroom only make sense together.
Good SLIs are user-visible and decision-relevant - They should reflect service promises, not just whatever internal metric was easiest to graph.
Error budgets connect reliability to release speed - They are useful only when they change how the team rolls out, prioritizes, and responds.

Knowledge Check (Test Questions)

1. What makes an SLI useful?
- A) It is the easiest metric for the team to collect.
- B) It represents a user-visible aspect of service quality that can support decisions.
- C) It always measures CPU usage.
2. What does an SLO add to an SLI?
- A) A target and a time window that define the service promise.
- B) A guarantee of zero incidents.
- C) A replacement for monitoring.
3. Why is the error budget operationally important?
- A) It translates reliability performance into guidance about how aggressively the team should keep changing the system.
- B) It only matters to finance.
- C) It removes the need for trade-offs between velocity and stability.

Answers

1. B: A strong SLI represents something users actually depend on and gives the team a meaningful reliability signal.

2. A: The SLO defines how good the service must be and over what window that promise is judged.

3. A: The error budget is what lets the team connect current reliability health to rollout pace, hardening work, and release decisions.
