Day 173: High-Cardinality Observability
High-cardinality telemetry is useful because it preserves detail that helps explain real incidents. It is dangerous because that same detail can overwhelm the observability system that is supposed to help you.
Today's "Aha!" Moment
Teams often discover high-cardinality observability the same way: during a painful incident. Someone asks a good question such as “Is this only affecting one tenant?”, “Which checkout path is slow?”, or “Did the new rollout only hurt one shard?” Then they realize the current metrics flatten exactly the dimension they now need.
The obvious reaction is to add more labels everywhere. That helps for a while. Then the bill rises, the metrics backend slows down, dashboards get noisy, query performance degrades, and suddenly the observability system itself becomes fragile.
That tension is the core of the topic. More dimensional detail can make diagnosis dramatically better, but every extra label multiplies possible time series and storage cost. High-cardinality observability is therefore not about collecting “rich data” in the abstract. It is about deciding where rich dimensions belong, where they do not, and which signal type should carry them.
That is the aha. Cardinality is not just a backend concern for Prometheus operators. It is a design decision about what questions the system can answer affordably under stress.
Why This Matters
Suppose the warehouse company is debugging a checkout regression after a partial rollout. Users report slower purchases, but only some tenants are affected and only in one region. The team wants to know:
- is the problem tied to one tenant or many?
- one rollout version or several?
- one payment provider or all providers?
- one shard or one cache key space?
If the metrics contain none of those dimensions, the team is blind. If the metrics contain all of them at full fidelity on every request path, the telemetry bill and query load may explode.
This is why high-cardinality observability matters. The system needs enough dimensional detail to localize real failure domains, but not so much that metrics storage, query latency, or collection pipelines collapse under the weight of “helpful” labeling.
It also matters because the right answer is often not “put everything in metrics.” Some dimensions belong in traces or logs, where their cardinality is more tolerable and their context is richer. Choosing the right signal is part of observability design.
Learning Objectives
By the end of this session, you will be able to:
- Explain what cardinality means operationally - Understand how label combinations multiply time series and cost.
- Choose the right home for rich dimensions - Distinguish what should live in metrics, traces, or logs.
- Design safer labeling practices - Keep diagnostic value while avoiding backend overload and runaway spend.
Core Concepts Explained
Concept 1: Cardinality Is About the Number of Distinct Series You Create
In metrics systems, cardinality usually comes from labels or dimensions. One metric name is rarely the problem by itself. The problem is how many distinct combinations of label values it generates.
For example:
http_requests_total{
route="/checkout",
region="eu-west",
status="500",
tenant_id="acme-42"
}
That looks innocent. But if tenant_id can take thousands of values, and it combines with many routes, regions, statuses, rollout versions, and pod labels, the series count can grow very quickly.
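A back-of-the-envelope sketch makes the multiplication concrete. The value counts below are hypothetical, but the worst-case arithmetic is the point: series count is the product of distinct values per label.

```python
# Hypothetical per-label value counts for http_requests_total.
# Worst case, the series count is the product of these counts.
label_values = {
    "route": 50,             # templated routes
    "region": 6,
    "status": 8,             # status codes actually emitted
    "rollout_version": 3,
}

def worst_case_series(values: dict) -> int:
    total = 1
    for count in values.values():
        total *= count
    return total

base = worst_case_series(label_values)   # 50 * 6 * 8 * 3 = 7,200 series
with_tenant = base * 5_000               # add tenant_id with 5,000 values
print(base, with_tenant)                 # prints: 7200 36000000
```

One unbounded-ish label turned a 7,200-series metric into a 36-million-series metric. Real backends rarely see the full cross-product, but the growth direction is exactly this.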
This is why high cardinality is not just “many labels.” It is “many distinct value combinations.” Static labels with a few known values are usually manageable. Labels with unbounded values such as:
- user_id
- request_id
- session_id
- raw URL paths
- full query strings
are often disastrous in metrics systems.
The engineering lesson is simple: every label is a multiplicative design choice, not a free annotation.
Concept 2: Not Every Interesting Dimension Belongs in Metrics
The most important practical skill here is deciding where a dimension belongs.
Metrics are strongest when you need:
- aggregation
- alerting
- trend analysis
- cheap fleet-wide comparison
Traces are stronger when you need:
- request-level causality
- rich context on a smaller sample
- path-specific debugging
Logs are stronger when you need:
- detailed local evidence
- exact event records
- forensic or audit-level detail
For the warehouse regression, it may be reasonable for metrics to include:
- route
- region
- rollout version
- payment provider
But not:
- full customer ID
- request ID
- arbitrary cart contents
Those richer identifiers are usually better preserved in traces or structured logs, where they do not explode the metrics cardinality budget in the same way.
This is the deeper observability design move: use metrics for broad patterns, traces for path detail, and logs for local facts. High-cardinality observability works best when you distribute dimensions across signals intentionally instead of dumping them all into one metrics backend.
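A minimal sketch of that split for the checkout example, assuming a simple in-process counter and JSON logging (the field names and values are illustrative, not a prescribed schema):

```python
import json
from collections import Counter

# Aggregate metric: bounded, low-cardinality labels only.
checkout_requests = Counter()

def record_checkout(route, region, version, provider,
                    tenant_id, request_id, latency_ms):
    # Metric side: safe to aggregate and alert on fleet-wide.
    checkout_requests[(route, region, version, provider)] += 1
    # Log side: rich, near-unique identifiers live here instead of
    # in metric labels, so they never multiply the series space.
    print(json.dumps({
        "event": "checkout",
        "route": route,
        "region": region,
        "tenant_id": tenant_id,
        "request_id": request_id,
        "latency_ms": latency_ms,
    }))

record_checkout("/checkout", "eu-west", "v2", "stripe",
                tenant_id="acme-42", request_id="req-9f31", latency_ms=412)
```

The same event feeds both signals, but tenant_id and request_id only appear where per-event detail is cheap.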
Concept 3: Good Cardinality Practice Is a Budgeting Discipline
The practical question is not “Should we ever use high-cardinality dimensions?” Sometimes you absolutely need them. The real question is whether you are spending that cardinality budget where it buys the most diagnostic power.
For example, dimensions that are often worth keeping in metrics include:
- service or route
- region or cluster
- status class
- rollout version
- bounded customer tier or product tier
Dimensions that often need stronger caution or different handling include:
- tenant IDs with huge fan-out
- shard IDs if there are many thousands
- raw endpoint paths instead of templated routes
- ephemeral identifiers that grow without bound
A healthier model is:
diagnostic value
versus
series explosion + query latency + storage cost + ingestion pressure
This is why teams often need:
- label reviews in instrumentation PRs
- route normalization instead of raw URL labeling
- sampling or exemplars for richer linkage
- metric governance around unbounded dimensions
- explicit observability budgets, not just backend scaling
If you do not manage cardinality deliberately, your observability platform becomes a victim of the same growth dynamics it is supposed to help explain.
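Route normalization is one of the simplest governance wins on that list: collapse unbounded path segments into templates before the path ever becomes a label. A minimal sketch, where the URL patterns are assumptions about a hypothetical service:

```python
import re

# Rewrite unbounded path segments into bounded templates
# before using the result as a metric label value.
NORMALIZERS = [
    (re.compile(r"/tenants/[^/]+"), "/tenants/{tenant_id}"),
    (re.compile(r"/orders/\d+"), "/orders/{order_id}"),
]

def normalize_route(path: str) -> str:
    for pattern, template in NORMALIZERS:
        path = pattern.sub(template, path)
    return path

print(normalize_route("/tenants/acme-42/orders/918273"))
# prints: /tenants/{tenant_id}/orders/{order_id}
```

Every tenant and order now maps onto one route label value, so the route dimension stays bounded no matter how many customers exist.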
Troubleshooting
Issue: Dashboards and queries are getting slower as instrumentation improves.
Why it happens / is confusing: Extra dimensions were added for diagnosis, but the resulting series multiplication was not treated as an operational cost.
Clarification / Fix: Review which labels are truly needed for fleet-wide metrics and move overly detailed identifiers into traces or logs.
Issue: The team cannot isolate incidents to the right tenant, rollout, or shard.
Why it happens / is confusing: Metrics were over-optimized for low cost and lost the dimensions that matter operationally.
Clarification / Fix: Add a bounded set of high-value labels or route the missing detail into traces/logs where richer context is safer.
Issue: Engineers keep adding labels ad hoc during incidents.
Why it happens / is confusing: The organization has no shared rubric for whether a dimension belongs in metrics, traces, or logs.
Clarification / Fix: Introduce instrumentation review rules around cardinality, normalization, and the intended diagnostic question for each label.
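Part of that review rubric can be automated. A hedged sketch of a lint-style check, where the approved and unbounded label sets are illustrative, not a standard:

```python
# Hypothetical instrumentation-review rule: flag labels that are
# known-unbounded or not on the team's approved list.
APPROVED_LABELS = {"route", "region", "status_class", "rollout_version", "provider"}
KNOWN_UNBOUNDED = {"user_id", "request_id", "session_id", "tenant_id"}

def review_labels(labels):
    problems = []
    for label in labels:
        if label in KNOWN_UNBOUNDED:
            problems.append(f"{label}: unbounded, move to traces or logs")
        elif label not in APPROVED_LABELS:
            problems.append(f"{label}: not on the approved list, needs review")
    return problems

print(review_labels(["route", "tenant_id", "cart_hash"]))
```

A check like this in CI turns the "is this label safe?" conversation into a default-deny gate rather than an incident-time argument.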
Advanced Connections
Connection 1: High Cardinality <-> Tail-Based Sampling
The parallel: Tail-based sampling can preserve rich context for the most interesting traces without forcing every high-cardinality detail into always-on metrics.
Real-world case: Instead of labeling every metric by a near-unique dimension, the team can retain richer trace data for slow or failing requests.
Connection 2: High Cardinality <-> Cost Optimization
The parallel: Cardinality is one of the clearest places where observability design and observability cost become the same discussion.
Real-world case: A single new unbounded label can increase both bill and operational fragility far more than teams expect.
Resources
Optional Deepening Resources
- [DOCS] Prometheus Best Practices: Instrumentation
- Link: https://prometheus.io/docs/practices/instrumentation/
- Focus: Use it for practical guidance on labels, instrumentation discipline, and what makes metrics manageable.
- [DOCS] OpenTelemetry Attributes
- Link: https://opentelemetry.io/docs/specs/semconv/general/attributes/
- Focus: Use it to think about semantic dimensions and where rich contextual attributes belong.
- [DOCS] Grafana Cardinality Management
- Link: https://grafana.com/docs/grafana-cloud/cost-management-and-billing/analyze-costs/metrics-costs/prometheus-cardinality/
- Focus: See a practical operational framing of how series growth affects cost and query behavior.
- [SITE] Google SRE Book
- Link: https://sre.google/sre-book/table-of-contents/
- Focus: Connect observability design back to the operational questions engineers need to answer under real incidents.
Key Insights
- Cardinality is a multiplicative cost, not a cosmetic label choice - Every new dimension changes the possible series space of your metrics.
- Some detail belongs in traces or logs, not metrics - The best observability systems distribute context across signals intentionally.
- High-cardinality observability needs a budget mindset - Diagnostic value must be weighed against ingestion load, query latency, and spend.
Knowledge Check (Test Questions)
1. What usually makes a metric high-cardinality?
- A) The metric name is longer than average.
- B) It combines labels whose values can take many distinct combinations, especially unbounded ones.
- C) It is stored in a dashboard.
2. Why is request_id usually a poor metric label?
- A) Because request IDs are hard to read.
- B) Because they create near-unique series that explode metrics cardinality while adding little fleet-wide aggregation value.
- C) Because tracing systems cannot use them.
3. What is the healthiest way to think about cardinality?
- A) Avoid all rich dimensions everywhere.
- B) Spend dimensional detail where it gives the most diagnostic value, and choose the right signal type for it.
- C) Put every useful dimension in metrics first, then scale the backend later.
Answers
1. B: Cardinality grows with the number of distinct label combinations, especially when labels are unbounded or highly variable.
2. B: request_id is usually better in traces or logs, because near-unique labels destroy the aggregation economics of a metrics backend.
3. B: The right mindset is not absolute avoidance, but deliberate budgeting and signal placement.