Day 176: Cost Optimization - Observability at Scale
Observability cost optimization is not about keeping less data in the abstract. It is about spending signal budget where it improves detection, explanation, and decision quality the most.
Today's "Aha!" Moment
Observability feels cheap at first. Add more metrics. Keep more logs. Trace more requests. Add more labels so you can slice incidents any way you want later. Early on, this feels responsible: more evidence should mean better operations.
Then scale arrives. Metrics explode because label cardinality grew faster than expected. Logs become expensive enough that teams shorten retention and lose forensic detail. Traces are sampled so aggressively that the rare slow path disappears just when you need it. Dashboards multiply faster than anyone can maintain them. The problem is no longer "we lack telemetry." The problem is that every extra signal now has a price.
That price is not only storage. It is also ingest bandwidth, query performance, index pressure, CPU in the collectors, noisy alerting, and human attention. A signal that nobody uses is not free just because it exists.
That is the aha. Cost optimization in observability is really a design question about evidence economics: which signals deserve high fidelity, which can be aggregated or sampled, which dimensions belong in metrics versus traces or logs, and which signals should not exist at all.
Why This Matters
Suppose the warehouse company now runs a large multi-region platform. Checkout, inventory, payment, and shipping all emit metrics, logs, and traces. The team did the sensible thing for months: instrument liberally so incidents would be easier to debug.
Eventually the bill and the operational side effects arrive at the same time:
- metric storage gets expensive because labels like tenant_id, order_id, and provider_request_id leaked into hot-path metrics
- log volume spikes after a rollout because retry loops produce repeated structured errors
- tracing is sampled so heavily that the rare slow provider path is no longer visible
- dashboards and alerts depend on signals whose retention or fidelity is quietly changing
If the team responds with blunt cuts, they save money but lose the evidence they need during incidents. If they keep everything, cost and operational friction keep rising.
So the real question is not "how do we shrink the bill?" The real question is "how do we preserve the signals that matter for detection, diagnosis, and learning while stopping wasteful telemetry from consuming budget and attention?"
Learning Objectives
By the end of this session, you will be able to:
- Explain why observability cost becomes a systems problem - Recognize that telemetry cost includes storage, compute, queryability, and human attention.
- Reason about signal placement and fidelity - Decide what belongs in metrics, logs, and traces, and where sampling or aggregation should happen.
- Design a cost-aware observability strategy - Preserve the evidence needed for SLOs and investigations without paying for low-value telemetry.
Core Concepts Explained
Concept 1: Every Observability Signal Spends a Different Kind of Budget
The most common mistake is to think of observability cost as "GB stored per month." That is only one piece. Different signal types spend different budgets:
- metrics spend cardinality budget, retention budget, and query budget
- logs spend ingest budget, indexing budget, retention budget, and analyst attention
- traces spend collector CPU, buffering, storage, and sampling policy budget
That is why the same field can be cheap in one place and dangerous in another.
For example, tenant_id may be acceptable in logs and traces when investigating a customer-specific incident. But putting it on a high-volume metric can create a cardinality explosion that slows queries and inflates cost across the whole backend.
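To make the explosion concrete, here is a minimal sketch of the series-count arithmetic. The label counts below are invented for illustration (a hypothetical checkout metric), not measurements from any real system:

```python
# Rough cardinality math for a hypothetical hot-path metric.
# The number of time series a metric produces is the product of
# each label's distinct-value count.

def series_count(label_cardinalities):
    """Multiply the distinct-value counts of every label."""
    total = 1
    for distinct_values in label_cardinalities.values():
        total *= distinct_values
    return total

# A bounded metric: route x region x status class (assumed counts).
bounded = {"route": 40, "region": 6, "status": 3}

# The same metric after tenant_id leaks into the labels.
unbounded = dict(bounded, tenant_id=10_000)

print(series_count(bounded))    # 720
print(series_count(unbounded))  # 7200000
```

One extra label turned hundreds of series into millions, which is why the same identifier can be harmless in a log line and hazardous on a metric.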
The useful mental model is not "collect more" versus "collect less." It is:
signal --> backend cost --> query behavior --> operational value
If a signal has low operational value but high backend cost, it is a good optimization target. If it is expensive but crucial for SLOs or incident explanation, the answer is usually not deletion but better placement, better aggregation, or better sampling.
Concept 2: Cost Optimization Is Mostly About Choosing the Right Fidelity at the Right Layer
Observability systems get expensive when teams keep maximum fidelity everywhere:
- full-resolution metrics with too many labels
- verbose logs for routine healthy requests
- 100% tracing on high-volume hot paths
- long retention on signals that nobody queries
The better approach is layered fidelity.
Use metrics for broad, cheap, always-on answers such as SLI behavior, saturation, queue depth, and fleet health. Use logs for local detail where exact events matter. Use traces for path reconstruction and latency explanation. Then choose fidelity based on how the signal is actually used.
- Detection layer: bounded-cardinality metrics
- Explanation layer: traces for unusual or high-value paths
- Forensics layer: logs with enough structure to answer local questions
This is where the lessons from the rest of the month connect:
- SLO alerts tell you when the user-facing system is burning reliability budget
- tail-based sampling helps keep the traces most likely to explain the burn
- careful control of metric cardinality protects the broad detection layer
- anomaly detection can surface unusual patterns, but should not replace deterministic user-facing signals
Cost optimization is therefore not a separate cleanup task. It is the discipline of matching signal fidelity to the job that signal must do.
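The tail-based sampling idea above can be sketched as a simple keep/drop decision. The threshold and keep rate here are placeholder assumptions, not recommendations, and real collectors (such as the OpenTelemetry tail sampling processor) evaluate whole traces with richer policies:

```python
import random

# Illustrative tail-based sampling decision: always keep traces that
# are slow or erroring, and keep only a small random share of healthy
# traffic. SLOW_MS and HEALTHY_KEEP are assumed values for the sketch.

SLOW_MS = 500        # latency above which a trace is always kept
HEALTHY_KEEP = 0.01  # fraction of healthy traces retained

def keep_trace(duration_ms, had_error, rng=random.random):
    if had_error or duration_ms >= SLOW_MS:
        return True              # high explanatory value: always keep
    return rng() < HEALTHY_KEEP  # cheap statistical sample of the rest

# The rare slow provider path survives sampling:
print(keep_trace(1800, had_error=False))  # True
```

Note the asymmetry: healthy traffic is sampled because any one trace is interchangeable, while slow and erroring traces are kept because each one may be the only evidence of a failure mode.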
Concept 3: The Best Optimization Question Is “What Decision Does This Signal Support?”
The strongest filter for telemetry is simple: if this signal fires, changes, or disappears, what decision becomes better or worse?
That question exposes three classes of telemetry:
- decision-critical: needed for paging, SLO review, incident debugging, rollout safety, or customer forensics
- nice-to-have: useful occasionally, but not worth high-fidelity treatment everywhere
- dead telemetry: collected by habit, rarely queried, poorly owned, and expensive relative to value
For the warehouse platform, some examples might look like this:
- checkout latency by route and region: decision-critical for SLOs
- full debug logs for every healthy request: usually not worth always-on retention
- traces for slow or erroring payment-provider paths: high-value when sampled intelligently
- per-order IDs in fleet-wide metrics: usually the wrong layer entirely
A cost-aware observability review often becomes:
What signal do we have?
|
v
What decision does it support?
|
+--> none -> remove or reduce
+--> rare investigation -> move to logs/traces or lower retention
+--> broad detection -> keep as bounded metric
+--> critical diagnosis -> preserve with targeted high fidelity
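The review flow above can be written down as a small lookup, which some teams find useful when walking a telemetry inventory. The category names mirror the diagram; the mapping itself is a placeholder you would tailor to your own signals:

```python
# Sketch of the cost-aware review as a classifier: given the decision
# a signal supports, return the action from the diagram above.

ACTIONS = {
    "none": "remove or reduce",
    "rare investigation": "move to logs/traces or lower retention",
    "broad detection": "keep as bounded metric",
    "critical diagnosis": "preserve with targeted high fidelity",
}

def review(signal, decision_supported):
    action = ACTIONS.get(decision_supported, "review manually")
    return f"{signal}: {action}"

print(review("per-order IDs in fleet metrics", "none"))
# per-order IDs in fleet metrics: remove or reduce
```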
That is the capstone point of the month. Good observability is not maximal telemetry. It is telemetry whose cost, fidelity, and placement are aligned with how the system is actually operated.
Troubleshooting
Issue: The team cut telemetry cost, but incidents became harder to debug.
Why it happens / is confusing: The reduction removed high-value explanatory signals along with low-value noise.
Clarification / Fix: Separate detection signals from diagnostic signals. Keep bounded metrics for broad health, then preserve targeted logs and traces where explanation matters most.
Issue: Metrics are expensive even though each individual metric seems reasonable.
Why it happens / is confusing: Cardinality cost is multiplicative. A few extra labels on a hot metric can create far more series than teams expect.
Clarification / Fix: Audit labels on high-volume metrics first. Move highly variable identifiers to traces or logs instead of metrics.
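A label audit can start as a simple script over exported series. The sketch below is hypothetical: the sample series and label names are invented, and a real audit would read from your metrics backend's series API instead of an in-memory list:

```python
from collections import defaultdict

# For each metric, count how many distinct values each label takes.
# Labels with runaway distinct-value counts (like order_id here) are
# the first candidates to move into logs or traces.

def label_cardinality(series):
    counts = defaultdict(lambda: defaultdict(set))
    for metric, labels in series:
        for key, value in labels.items():
            counts[metric][key].add(value)
    return {m: {k: len(v) for k, v in kv.items()} for m, kv in counts.items()}

# Invented sample data standing in for a real series export.
sample = [
    ("http_requests_total", {"route": "/checkout", "order_id": "A1"}),
    ("http_requests_total", {"route": "/checkout", "order_id": "A2"}),
    ("http_requests_total", {"route": "/cart", "order_id": "A3"}),
]

print(label_cardinality(sample))
# {'http_requests_total': {'route': 2, 'order_id': 3}}
```

Even in three sample points, order_id already has more distinct values than route; at production volume that gap is what makes the metric expensive.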
Issue: Teams say they need “full fidelity” because they might need the data later.
Why it happens / is confusing: The fear of losing evidence is real, especially after painful incidents.
Clarification / Fix: Ask which concrete decision the signal supports. If the answer is vague, reduce fidelity, shorten retention, or remove it. Preserve full fidelity only where the operational value is clear.
Advanced Connections
Connection 1: Observability Cost Optimization <-> Platform Engineering
The parallel: A platform can encode good defaults for cardinality, sampling, retention, and structured logging so every team does not rediscover the same mistakes.
Real-world case: Internal telemetry SDKs and collector policies can prevent expensive labels or over-verbose defaults from reaching production in the first place.
Connection 2: Observability Cost Optimization <-> Error Budgets
The parallel: Both are budget-allocation problems. Error budgets allocate tolerated unreliability; telemetry budgets allocate how much evidence the system can afford while staying operable.
Real-world case: During a risky rollout, the team may temporarily raise trace retention for one path because the extra diagnostic value is worth the short-term cost.
Resources
Optional Deepening Resources
- [DOCS] Prometheus Instrumentation Best Practices
- Link: https://prometheus.io/docs/practices/instrumentation/
- Focus: Use it to reason about bounded labels, hot-path metrics, and why metric design is a cost decision as much as a monitoring decision.
- [DOCS] Grafana: Metrics Cost and Cardinality Management
- Link: https://grafana.com/docs/grafana-cloud/cost-management-and-billing/analyze-costs/metrics-costs/prometheus-cardinality/
- Focus: See how observability platforms frame cardinality as one of the main cost and performance drivers.
- [DOCS] OpenTelemetry Collector: Tail Sampling Processor
- Link: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor
- Focus: Connect tracing cost control with the idea of preserving high-value traces instead of sampling blindly.
- [SITE] Google SRE Book
- Link: https://sre.google/sre-book/table-of-contents/
- Focus: Keep observability cost in the larger reliability frame: signals exist to support detection, response, and learning, not to maximize telemetry for its own sake.
Key Insights
- Telemetry is not free just because it is digital - Every signal spends storage, compute, query, and human-attention budget somewhere in the system.
- The right optimization target is signal value, not raw volume - Expensive telemetry can be justified if it supports an important operational decision.
- Good observability uses layered fidelity - Bounded metrics detect, targeted traces explain, and structured logs preserve local evidence where it matters.
Knowledge Check (Test Questions)
1. Why is observability cost optimization more than a storage problem?
- A) Because telemetry also affects query performance, collector load, indexing, and human attention.
- B) Because storage is free in cloud systems.
- C) Because only logs create cost.
2. Where do highly variable identifiers such as order_id or provider_request_id usually fit best?
- A) High-volume fleet-wide metrics.
- B) Logs or traces, where local investigation needs detail without exploding metric cardinality.
- C) SLO burn-rate alerts.
3. What is the best first question to ask when deciding whether a signal deserves high fidelity?
- A) "Can we collect even more of it?"
- B) "What concrete operational decision does this signal support?"
- C) "Did another company mention it in a postmortem?"
Answers
1. A: Observability cost also appears in ingest, indexing, query behavior, collector complexity, and alert noise, not just bytes stored.
2. B: Highly variable identifiers are usually safer in logs or traces than in hot-path metrics, where they can multiply series count dramatically.
3. B: The cleanest way to judge signal value is to ask what decision becomes better because the signal exists at that fidelity.