Day 096: Service-Level Objectives and Error Budgets
Reliability only starts to guide engineering decisions when the team can say, in measurable terms, what level of service users actually need and how much failure the system can tolerate before stability must take priority over change.
Today's "Aha!" Moment
Monitoring tells you what is happening. SLOs tell you whether what is happening is acceptable.
Keep one example throughout the lesson. The learning platform is preparing a major billing rollout before a popular certification launch. Everyone agrees reliability matters, but that agreement is not enough. Product wants the release to ship. Engineering wants to reduce risk. Support wants fewer incidents. Without a shared target, the discussion becomes emotional: "this feels too risky" versus "it is probably fine."
That is the aha. An SLO gives the team a measurable definition of good enough, rooted in what users actually experience. An error budget then makes the trade-off explicit: if the service has already spent too much of its allowed failure for the period, the team should slow down risky change and invest in reliability. If the budget is healthy, the team has room to keep moving.
Once you see that, SLOs stop looking like fancy dashboards and error budgets stop looking like abstract SRE jargon. Together they are a coordination tool between product, engineering, and operations. They turn reliability from opinion into policy.
Why This Matters
The problem: Teams often collect metrics and fire alerts without ever agreeing on what service quality they are actually trying to defend. That leaves reliability work hard to prioritize and release decisions hard to justify.
Before:
- "Reliable enough" means different things to different people.
- Alerts fire, but the business meaning of those alerts is fuzzy.
- Teams either overreact to minor failures or normalize real degradation because no target exists.
After:
- Important user-facing behaviors have explicit objectives.
- Reliability discussions use shared numbers instead of intuition alone.
- Error budgets connect those objectives to operational behavior, rollout pace, and prioritization.
Real-world impact: Better release discipline, clearer prioritization of reliability work, less debate driven by fear or habit, and a much stronger link between monitoring data and actual engineering decisions.
Learning Objectives
By the end of this session, you will be able to:
- Separate SLI, SLO, and SLA cleanly - Understand the difference between a measurement, a target, and an external promise.
- Choose objectives that reflect user experience - Reason about what should count as good service from the outside.
- Use error budgets as a decision tool - Connect observed reliability to release pace, incident response, and engineering priorities.
Core Concepts Explained
Concept 1: An SLI Measures, an SLO Targets, and an SLA Promises
These three terms are frequently blurred, which makes reliability discussions sloppier than they need to be.
- SLI: the indicator you measure
- SLO: the target for that indicator over a time window
- SLA: an external promise, often contractual, that may include consequences
For the learning platform:
- SLI: successful purchase requests / total purchase requests
- SLO: 99.9% successful purchases over 30 days
- SLA: perhaps a contractual uptime or support commitment for enterprise customers
SLI (measurement) -> SLO (target) -> SLA (external promise)
This distinction matters because not every metric deserves to become an objective, and not every objective should become a contractual promise. The ladder gets narrower as the cost of missing it gets higher.
The trade-off is more explicit vocabulary versus the comfort of vague language. Explicitness is better because it prevents teams from confusing internal measurement with business commitment.
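The ladder can be sketched in a few lines of Python. All the counts and targets below are hypothetical, chosen only to show how the three layers relate:

```python
# Hypothetical counts for the purchase endpoint over a 30-day window.
successful_purchases = 999_100
total_purchase_requests = 1_000_000

# SLI: the indicator you measure.
sli = successful_purchases / total_purchase_requests

# SLO: the internal target for that indicator over the window.
slo_target = 0.999  # 99.9% over 30 days

# SLA: an external promise, typically looser than the SLO so the team
# breaches its own target well before it breaches any contract.
sla_target = 0.995

print(f"SLI = {sli:.4%}; meets SLO: {sli >= slo_target}; meets SLA: {sli >= sla_target}")
```

Note the deliberate gap between the SLO and the SLA: the internal target is stricter than the external promise, which gives the team room to react before a contractual breach.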
Concept 2: A Good SLO Describes User Experience, Not Internal Tidiness
The strongest SLOs describe what the service feels like to the user or consumer of the system.
For example:
- 99.9% of purchase attempts succeed over 30 days
- 95% of course-page loads complete under 250 ms
- 99% of new purchases appear in the learner dashboard within 2 minutes
Those are useful because they talk about service quality at the promise boundary. By contrast, "CPU below 70%" may be an interesting diagnostic metric, but it is rarely the promise users care about.
purchase_slo = {
    "sli": "successful_purchases / total_purchase_requests",
    "target": "99.9%",
    "window": "30d",
}
The code is only illustrating the shape of the idea: an SLO is a user-meaningful target applied to a clearly defined indicator over a clearly defined time window.
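The latency objective above works the same way: count the fraction of requests that complete within the threshold. A minimal sketch with made-up latency samples (real systems would compute this over millions of requests, not ten):

```python
# Hypothetical latency samples for course-page loads, in milliseconds.
latencies_ms = [120, 180, 90, 300, 210, 150, 240, 260, 110, 200]

threshold_ms = 250  # "complete under 250 ms"
target = 0.95       # 95% of loads

# The SLI is the fraction of loads that were fast enough.
fast_enough = sum(1 for latency in latencies_ms if latency < threshold_ms)
sli = fast_enough / len(latencies_ms)

print(f"latency SLI: {sli:.0%}; meets target: {sli >= target}")
```

In this toy sample, 8 of 10 loads beat the threshold, so the SLI is 80% and the 95% objective is missed.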
Choosing good objectives is not trivial. If the target is too weak, it does not protect the user experience. If it is unrealistically strong, the team may spend disproportionate effort defending a reliability level the business does not actually need.
The trade-off is sharper prioritization versus the discomfort of real commitment. A good SLO forces the team to decide what level of service is genuinely worth the cost.
Concept 3: Error Budgets Turn Reliability into an Explicit Trade-Off
An error budget is the amount of failure allowed by the SLO within its measurement window.
If the purchase SLO is 99.9% over 30 days, then 0.1% of requests may fail before the objective is breached. That 0.1% is not an invitation to break things casually. It is a controlled allowance that helps the team decide how aggressively it can continue changing the system.
target: 99.9% success
allowed failure: 0.1%

if budget healthy:
    keep shipping carefully
if budget nearly exhausted:
    slow risky change
    prioritize stability work
This is where error budgets become operational instead of theoretical. Suppose the billing rollout consumes a large share of the purchase budget early in the month. The team now has evidence that risk tolerance should decrease. Rollouts may slow, incident response may intensify, and reliability work may outrank feature work temporarily.
Without this mechanism, teams often do one of two bad things:
- keep shipping at full speed while reliability is already deteriorating
- overreact to every small blip because no bounded notion of acceptable failure exists
The trade-off is less freedom to ignore service quality versus much clearer release governance. That is a good trade because the whole point is to make risk explicit before users absorb too much of it.
Troubleshooting
Issue: Choosing SLOs that describe internal machinery instead of user experience.
Why it happens / is confusing: Internal metrics are easier to collect and often feel more controllable.
Clarification / Fix: Start from important user journeys or externally visible promises, then use internal metrics to explain why those objectives are at risk.
Issue: Setting targets unrealistically high by default.
Why it happens / is confusing: Bigger reliability numbers sound safer and more impressive.
Clarification / Fix: Pick objectives that reflect actual product needs and engineering cost. An over-tight SLO can distort priorities just as badly as a weak one.
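Each extra nine shrinks the failure allowance by an order of magnitude, which is why over-tight targets get expensive fast. A quick sketch of allowed failure time per 30-day window, assuming continuous traffic:

```python
# Allowed failure time per 30-day window at several candidate targets.
window_minutes = 30 * 24 * 60  # 43,200 minutes in the window

for target in (0.99, 0.999, 0.9999):
    allowed_minutes = window_minutes * (1 - target)
    print(f"{target:.2%}: about {allowed_minutes:.1f} minutes of failure allowed")
```

Going from 99.9% to 99.99% cuts the allowance from roughly 43 minutes to roughly 4 minutes per month, and defending those last minutes usually costs far more engineering effort than the first ones.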
Issue: Defining error budgets but never changing behavior when they are consumed.
Why it happens / is confusing: Teams adopt the vocabulary but avoid the operational consequences.
Clarification / Fix: Tie budget consumption to explicit decisions about rollout pace, incident urgency, and where engineering effort goes next.
Advanced Connections
Connection 1: SLOs and Error Budgets ↔ Monitoring
The parallel: Monitoring supplies the raw indicators, but SLOs turn those indicators into a definition of acceptable service.
Real-world case: Request rate, error rate, and latency become much more useful once the team agrees which combinations of them define success for the user.
Connection 2: SLOs and Error Budgets ↔ Release Governance
The parallel: Error budgets are one of the clearest ways to connect system health to release pace without relying on instinct or politics alone.
Real-world case: A team may continue a rollout when the budget is healthy, or pause and invest in stability when budget burn says the service is already too close to breaching its target.
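A common refinement of "budget burn" is the burn rate: the observed error rate divided by the rate the budget allows on average. The observation below is hypothetical, but the arithmetic is the standard one:

```python
slo_target = 0.999
budget_error_rate = 1 - slo_target   # 0.1% failures allowed on average

# Hypothetical observation from the last hour of the billing rollout.
observed_error_rate = 0.004          # 0.4% of purchase requests failing

# Burn rate 1.0 means spending the budget exactly as fast as the
# 30-day window allows; 4.0 means exhausting it in about 7.5 days.
burn_rate = observed_error_rate / budget_error_rate

print(f"burn rate: {burn_rate:.1f}x")
```

A sustained burn rate well above 1.0 is a strong argument for pausing the rollout, because it projects budget exhaustion long before the window ends.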
Resources
Optional Deepening Resources
- These resources are optional; the core 30-minute path does not require them.
- [BOOK] Site Reliability Engineering
- Link: https://sre.google/sre-book/table-of-contents/
- Focus: Review the original framing of SLIs, SLOs, and error budgets in production operations.
- [BOOK] The SRE Workbook
- Link: https://sre.google/workbook/table-of-contents/
- Focus: Study practical guidance for choosing realistic objectives and operationalizing them.
- [DOC] Google Cloud SLO Overview
- Link: https://cloud.google.com/stackdriver/docs/solutions/slo-monitoring
- Focus: See how objectives, rolling windows, and burn-rate ideas map into concrete monitoring practice.
- [ARTICLE] Implementing Service Level Objectives
- Link: https://sre.google/workbook/implementing-slos/
- Focus: Go deeper on choosing indicators and objectives that reflect real user expectations.
Key Insights
- SLIs, SLOs, and SLAs are different layers of commitment - Measure first, target second, promise externally only where it makes sense.
- A good SLO describes service quality that users actually experience - Internal metrics support diagnosis, but they are rarely the top-level objective.
- Error budgets turn reliability into a decision framework - They tell the team when it still has room to change aggressively and when it needs to slow down and protect stability.
Knowledge Check (Test Questions)
1. What is an SLO?
- A) A target for a service indicator over a defined time window.
- B) Any metric that appears on a dashboard.
- C) A contractual promise with penalties by default.
2. Why should SLOs usually be tied to user-facing behavior?
- A) Because the objective should describe the quality of service the user or client actually experiences.
- B) Because internal metrics are never useful for operations.
- C) Because SLAs and SLOs are always identical.
3. What is the main operational use of an error budget?
- A) To make trade-offs between change velocity and reliability explicit.
- B) To eliminate the possibility of failure entirely.
- C) To replace monitoring and alerting with one number.
Answers
1. A: An SLO is the chosen target for a measured service indicator over a defined period.
2. A: The point of the objective is to defend the service quality users care about, not to optimize internal machinery for its own sake.
3. A: An error budget tells the team how much reliability risk remains before stability should take precedence over further change.