Day 176: Cost Optimization - Observability at Scale
Observability cost optimization is not about keeping less data in the abstract. It is about spending signal budget where it improves detection, explanation, and decision quality the most.
Today's "Aha!" Moment
Observability feels cheap at first. Add more metrics. Keep more logs. Trace more requests. Add more labels so you can slice incidents any way you want later. Early on, this feels responsible: more evidence should mean better operations.
Then scale arrives. Metrics explode because label cardinality grew faster than expected. Logs become expensive enough that teams shorten retention and lose forensic detail. Traces are sampled so aggressively that the rare slow path disappears just when you need it. Dashboards multiply faster than anyone can maintain them. The problem is no longer "we lack telemetry." The problem is that every extra signal now has a price.
That price is not only storage. It is also ingest bandwidth, query performance, index pressure, CPU in the collectors, noisy alerting, and human attention. A signal that nobody uses is not free just because it exists.
That is the aha. Cost optimization in observability is really a design question about evidence economics: which signals deserve high fidelity, which can be aggregated or sampled, which dimensions belong in metrics versus traces or logs, and which signals should not exist at all.
Why This Matters
Suppose the warehouse company now runs a large multi-region platform. Checkout, inventory, payment, and shipping all emit metrics, logs, and traces. The team did the sensible thing for months: instrument liberally so incidents would be easier to debug.
Eventually the bill and the operational side effects arrive at the same time:
- metric storage gets expensive because labels like tenant_id, order_id, and provider_request_id leaked into hot-path metrics
- log volume spikes after a rollout because retry loops produce repeated structured errors
- tracing is sampled so heavily that the rare slow provider path is no longer visible
- dashboards and alerts depend on signals whose retention or fidelity is quietly changing
If the team responds with blunt cuts, they save money but lose the evidence they need during incidents. If they keep everything, cost and operational friction keep rising.
So the real question is not "how do we shrink the bill?" The real question is "how do we preserve the signals that matter for detection, diagnosis, and learning while stopping wasteful telemetry from consuming budget and attention?"
Learning Objectives
By the end of this session, you will be able to:
- Explain why observability cost becomes a systems problem - Recognize that telemetry cost includes storage, compute, queryability, and human attention.
- Reason about signal placement and fidelity - Decide what belongs in metrics, logs, and traces, and where sampling or aggregation should happen.
- Design a cost-aware observability strategy - Preserve the evidence needed for SLOs and investigations without paying for low-value telemetry.
Core Concepts Explained
Concept 1: Every Observability Signal Spends a Different Kind of Budget
The most common mistake is to think of observability cost as "GB stored per month." That is only one piece. Different signal types spend different budgets:
- metrics spend cardinality budget, retention budget, and query budget
- logs spend ingest budget, indexing budget, retention budget, and analyst attention
- traces spend collector CPU, buffering, storage, and sampling policy budget
That is why the same field can be cheap in one place and dangerous in another.
For example, tenant_id may be acceptable in logs and traces when investigating a customer-specific incident. But putting it on a high-volume metric can create a cardinality explosion that slows queries and inflates cost across the whole backend.
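To make the explosion concrete, here is a minimal sketch of the series-count arithmetic. The label counts below are invented for illustration (a hypothetical checkout metric), not measurements from any real system:

```python
# Rough cardinality math for a hypothetical hot-path metric.
# The number of time series a metric produces is the product of
# each label's distinct-value count.

def series_count(label_cardinalities):
    """Multiply the distinct-value counts of every label."""
    total = 1
    for distinct_values in label_cardinalities.values():
        total *= distinct_values
    return total

# A bounded metric: route x region x status class (assumed counts).
bounded = {"route": 40, "region": 6, "status": 3}

# The same metric after tenant_id leaks into the labels.
unbounded = dict(bounded, tenant_id=10_000)

print(series_count(bounded))    # 720
print(series_count(unbounded))  # 7200000
```

One extra label turned hundreds of series into millions, which is why the same identifier can be harmless in a log line and hazardous on a metric.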
The useful mental model is not "collect more" versus "collect less." It is:
signal --> backend cost --> query behavior --> operational value
If a signal has low operational value but high backend cost, it is a good optimization target. If it is expensive but crucial for SLOs or incident explanation, the answer is usually not deletion but better placement, better aggregation, or better sampling.
Concept 2: Cost Optimization Is Mostly About Choosing the Right Fidelity at the Right Layer
Observability systems get expensive when teams keep maximum fidelity everywhere:
- full-resolution metrics with too many labels
- verbose logs for routine healthy requests
- 100% tracing on high-volume hot paths
- long retention on signals that nobody queries
The better approach is layered fidelity.
Use metrics for broad, cheap, always-on answers such as SLI behavior, saturation, queue depth, and fleet health. Use logs for local detail where exact events matter. Use traces for path reconstruction and latency explanation. Then choose fidelity based on how the signal is actually used.
- Detection layer: bounded-cardinality metrics
- Explanation layer: traces for unusual or high-value paths
- Forensics layer: logs with enough structure to answer local questions
This is where the lessons from the rest of the month connect:
- SLO alerts tell you when the user-facing system is burning reliability budget
- tail-based sampling helps keep the traces most likely to explain the burn
- careful control of metric cardinality protects the broad detection layer
- anomaly detection can surface unusual patterns, but should not replace deterministic user-facing signals
Cost optimization is therefore not a separate cleanup task. It is the discipline of matching signal fidelity to the job that signal must do.
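The tail-based sampling idea above can be sketched as a simple keep/drop decision. The threshold and keep rate here are placeholder assumptions, not recommendations, and real collectors (such as the OpenTelemetry tail sampling processor) evaluate whole traces with richer policies:

```python
import random

# Illustrative tail-based sampling decision: always keep traces that
# are slow or erroring, and keep only a small random share of healthy
# traffic. SLOW_MS and HEALTHY_KEEP are assumed values for the sketch.

SLOW_MS = 500        # latency above which a trace is always kept
HEALTHY_KEEP = 0.01  # fraction of healthy traces retained

def keep_trace(duration_ms, had_error, rng=random.random):
    if had_error or duration_ms >= SLOW_MS:
        return True              # high explanatory value: always keep
    return rng() < HEALTHY_KEEP  # cheap statistical sample of the rest

# The rare slow provider path survives sampling:
print(keep_trace(1800, had_error=False))  # True
```

Note the asymmetry: healthy traffic is sampled because any one trace is interchangeable, while slow and erroring traces are kept because each one may be the only evidence of a failure mode.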
Concept 3: The Best Optimization Question Is “What Decision Does This Signal Support?”
The strongest filter for telemetry is simple: if this signal fires, changes, or disappears, what decision becomes better or worse?
That question exposes three classes of telemetry:
- decision-critical: needed for paging, SLO review, incident debugging, rollout safety, or customer forensics
- nice-to-have: useful occasionally, but not worth high-fidelity treatment everywhere
- dead telemetry: collected by habit, rarely queried, poorly owned, and expensive relative to value
For the warehouse platform, some examples might look like this:
- checkout latency by route and region: decision-critical for SLOs
- full debug logs for every healthy request: usually not worth always-on retention
- traces for slow or erroring payment-provider paths: high-value when sampled intelligently
- per-order IDs in fleet-wide metrics: usually the wrong layer entirely
A cost-aware observability review often becomes:
What signal do we have?
|
v
What decision does it support?
|
+--> none -> remove or reduce
+--> rare investigation -> move to logs/traces or lower retention
+--> broad detection -> keep as bounded metric
+--> critical diagnosis -> preserve with targeted high fidelity
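The review flow above can be written down as a small lookup, which some teams find useful when walking a telemetry inventory. The category names mirror the diagram; the mapping itself is a placeholder you would tailor to your own signals:

```python
# Sketch of the cost-aware review as a classifier: given the decision
# a signal supports, return the action from the diagram above.

ACTIONS = {
    "none": "remove or reduce",
    "rare investigation": "move to logs/traces or lower retention",
    "broad detection": "keep as bounded metric",
    "critical diagnosis": "preserve with targeted high fidelity",
}

def review(signal, decision_supported):
    action = ACTIONS.get(decision_supported, "review manually")
    return f"{signal}: {action}"

print(review("per-order IDs in fleet metrics", "none"))
# per-order IDs in fleet metrics: remove or reduce
```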
That is the capstone point of the month. Good observability is not maximal telemetry. It is telemetry whose cost, fidelity, and placement are aligned with how the system is actually operated.
Troubleshooting
Issue: The team cut telemetry cost, but incidents became harder to debug.
Why it happens / is confusing: The reduction removed high-value explanatory signals along with low-value noise.
Clarification / Fix: Separate detection signals from diagnostic signals. Keep bounded metrics for broad health, then preserve targeted logs and traces where explanation matters most.
Issue: Metrics are expensive even though each individual metric seems reasonable.
Why it happens / is confusing: Cardinality cost is multiplicative. A few extra labels on a hot metric can create far more series than teams expect.
Clarification / Fix: Audit labels on high-volume metrics first. Move highly variable identifiers to traces or logs instead of metrics.
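A label audit can start as a simple script over exported series. The sketch below is hypothetical: the sample series and label names are invented, and a real audit would read from your metrics backend's series API instead of an in-memory list:

```python
from collections import defaultdict

# For each metric, count how many distinct values each label takes.
# Labels with runaway distinct-value counts (like order_id here) are
# the first candidates to move into logs or traces.

def label_cardinality(series):
    counts = defaultdict(lambda: defaultdict(set))
    for metric, labels in series:
        for key, value in labels.items():
            counts[metric][key].add(value)
    return {m: {k: len(v) for k, v in kv.items()} for m, kv in counts.items()}

# Invented sample data standing in for a real series export.
sample = [
    ("http_requests_total", {"route": "/checkout", "order_id": "A1"}),
    ("http_requests_total", {"route": "/checkout", "order_id": "A2"}),
    ("http_requests_total", {"route": "/cart", "order_id": "A3"}),
]

print(label_cardinality(sample))
# {'http_requests_total': {'route': 2, 'order_id': 3}}
```

Even in three sample points, order_id already has more distinct values than route; at production volume that gap is what makes the metric expensive.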
Issue: Teams say they need “full fidelity” because they might need the data later.
Why it happens / is confusing: The fear of losing evidence is real, especially after painful incidents.
Clarification / Fix: Ask which concrete decision the signal supports. If the answer is vague, reduce fidelity, shorten retention, or remove it. Preserve full fidelity only where the operational value is clear.
Advanced Connections
Connection 1: Observability Cost Optimization <-> Platform Engineering
The parallel: A platform can encode good defaults for cardinality, sampling, retention, and structured logging so every team does not rediscover the same mistakes.
Real-world case: Internal telemetry SDKs and collector policies can prevent expensive labels or over-verbose defaults from reaching production in the first place.
Connection 2: Observability Cost Optimization <-> Error Budgets
The parallel: Both are budget-allocation problems. Error budgets allocate tolerated unreliability; telemetry budgets allocate how much evidence the system can afford while staying operable.
Real-world case: During a risky rollout, the team may temporarily raise trace retention for one path because the extra diagnostic value is worth the short-term cost.
Resources
Optional Deepening Resources
- [DOCS] Prometheus Instrumentation Best Practices
- Link: https://prometheus.io/docs/practices/instrumentation/
- Focus: Use it to reason about bounded labels, hot-path metrics, and why metric design is a cost decision as much as a monitoring decision.
- [DOCS] Grafana: Metrics Cost and Cardinality Management
- Link: https://grafana.com/docs/grafana-cloud/cost-management-and-billing/analyze-costs/metrics-costs/prometheus-cardinality/
- Focus: See how observability platforms frame cardinality as one of the main cost and performance drivers.
- [DOCS] OpenTelemetry Collector: Tail Sampling Processor
- Link: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor
- Focus: Connect tracing cost control with the idea of preserving high-value traces instead of sampling blindly.
- [SITE] Google SRE Book
- Link: https://sre.google/sre-book/table-of-contents/
- Focus: Keep observability cost in the larger reliability frame: signals exist to support detection, response, and learning, not to maximize telemetry for its own sake.
Key Insights
- Telemetry is not free just because it is digital - Every signal spends storage, compute, query, and human-attention budget somewhere in the system.
- The right optimization target is signal value, not raw volume - Expensive telemetry can be justified if it supports an important operational decision.
- Good observability uses layered fidelity - Bounded metrics detect, targeted traces explain, and structured logs preserve local evidence where it matters.
Knowledge Check (Test Questions)
1. Why is observability cost optimization more than a storage problem?
- A) Because telemetry also affects query performance, collector load, indexing, and human attention.
- B) Because storage is free in cloud systems.
- C) Because only logs create cost.
2. Where do highly variable identifiers such as order_id or provider_request_id usually fit best?
- A) High-volume fleet-wide metrics.
- B) Logs or traces, where local investigation needs detail without exploding metric cardinality.
- C) SLO burn-rate alerts.
3. What is the best first question to ask when deciding whether a signal deserves high fidelity?
- A) "Can we collect even more of it?"
- B) "What concrete operational decision does this signal support?"
- C) "Did another company mention it in a postmortem?"
Answers
1. A: Observability cost also appears in ingest, indexing, query behavior, collector complexity, and alert noise, not just bytes stored.
2. B: Highly variable identifiers are usually safer in logs or traces than in hot-path metrics, where they can multiply series count dramatically.
3. B: The cleanest way to judge signal value is to ask what decision becomes better because the signal exists at that fidelity.