LESSON
Day 412: Statistics, Histograms, and Cardinality Errors
The core idea: A planner does not count matching rows while it is choosing a plan; it predicts them from compact statistical summaries, and any missing skew or correlation can turn a reasonable cost model into a bad production decision.
Today's "Aha!" Moment
In 11.md, Harbor Point learned to read a cost-based plan as a tree of estimated work. The next morning, that lesson became painfully concrete. The market-open dashboard query for issuer = 'MUNI-77' was estimated at a few dozen qualifying rows, but EXPLAIN ANALYZE showed a few thousand. The planner picked a nested-loop shape that looked cheap on paper, then spent seconds doing repeated index probes and heap fetches while traders waited on the screen.
The cost model had not suddenly become irrational. Its inputs were wrong. The engine knew something about issuer, something about status, and something about submitted_at, but those summaries were mostly independent, one-dimensional sketches. They did not capture the fact that for this issuer, during the first minutes after market open, almost every new reservation is also status = 'open' and concentrated in the newest time range. The planner multiplied several reasonable single-column guesses and produced an unreasonable combined guess.
That is the shift for this lesson: histograms and related statistics are not side metadata. They are the planner's model of reality. If that model is too coarse, cardinality errors propagate upward into join order, memory budgeting, sort choice, and index usage. Once you see which summary failed, the fix stops being mystical. You can ask whether the issue is stale sampling, a missing hot value, an unmodeled column dependency, or a workload whose newest slice no longer looks like the table as a whole.
Why This Matters
Harbor Point's dashboard query is interactive, parameterized, and highly sensitive to market-open skew. When the planner estimates thirty rows and gets two thousand, the damage is not limited to one scan node. A join that should have been a hash join becomes a nested loop, a top-N sort that should have stayed in memory spills, and an index designed in 10.md looks ineffective even though the real failure is estimation rather than structure.
That is why statistics quality is a production concern rather than a DBA footnote. Teams that understand where row estimates come from can separate three very different problems: a missing access path, a stale or underspecified statistical model, and a query whose value distribution changes so sharply that one generic estimate is no longer safe. Without that distinction, people reach for hints, disable plan types, or add redundant indexes that hide the symptom while leaving the underlying cardinality error intact.
Learning Objectives
By the end of this session, you will be able to:
- Explain what planner statistics actually summarize - Describe the roles of row counts, most-common-values lists, histograms, and distinct-value estimates.
- Trace how selectivity estimates become cardinality errors - Follow how equality, range, and join predicates are combined and why correlation breaks naive multiplication.
- Choose a production fix that matches the failure mode - Distinguish between stale stats, insufficient resolution, and missing cross-column information.
Core Concepts Explained
Concept 1: Planner statistics are compressed models of the table, not exact counts
Harbor Point's reservations table is too large and too active for the optimizer to scan in full every time it plans a query. Instead, background ANALYZE work samples the table and stores summaries that are cheap to consult during planning. The point is not precision at any cost; the point is to build a small model that is good enough to rank plan alternatives quickly.
For the dashboard workload, the most relevant summaries look roughly like this:
table row count (reltuples): 12,000,000
issuer:
  n_distinct ~= 1,800
  MCVs: MUNI-77=0.006, ALPHA-01=0.005, ZETA-88=0.005
status:
  MCVs: settled=0.72, open=0.05, routed=0.18
submitted_at:
  histogram bounds: 09:30 | 09:41 | 09:52 | 10:04 | 10:17 | ...
Each piece answers a different planning question. reltuples gives the base size of the relation. Most-common-values lists let the planner treat hot discrete values specially instead of pretending every issuer or status is equally likely. Histograms summarize the long tail and support range predicates such as "last fifteen minutes." Distinct-value estimates help with join fan-out and the expected selectivity of equality conditions. Some engines also track correlation between a column and physical row order, which influences whether an index walk is likely to touch scattered heap pages or mostly sequential ones.
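In PostgreSQL, these summaries are directly visible in the catalogs. A minimal inspection sketch, assuming the lesson's reservations table and column names:
-- Per-column model stored by ANALYZE: distinct counts, MCV lists,
-- histogram bounds, and the physical-order correlation mentioned above.
SELECT attname, n_distinct, most_common_vals, most_common_freqs,
       histogram_bounds, correlation
FROM pg_stats
WHERE tablename = 'reservations'
  AND attname IN ('issuer', 'status', 'submitted_at');

-- The table-level row estimate (reltuples) lives in pg_class.
SELECT reltuples::bigint AS table_rows
FROM pg_class
WHERE relname = 'reservations';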
The trade-off is deliberate. A statistics target with a handful of histogram buckets and MCV entries is cheap to maintain and fast to consult, but it cannot perfectly represent a table whose shape changes by time of day or by combinations of columns. If MUNI-77 becomes unusually active only during market open, a compact sample may capture its overall frequency while missing the fact that its recent rows are overwhelmingly concentrated in the newest tail of submitted_at.
Concept 2: Cardinality estimation combines those summaries with formulas that assume more independence than production data usually has
Return to Harbor Point's dashboard query:
SELECT r.reservation_id, r.account_id, r.quantity, rl.max_notional
FROM reservations r
JOIN risk_limits rl ON rl.account_id = r.account_id
WHERE r.issuer = 'MUNI-77'
  AND r.status = 'open'
  AND r.submitted_at >= now() - interval '15 minutes'
ORDER BY r.submitted_at DESC
LIMIT 50;
At a high level, the planner estimates qualifying rows like this:
estimated_rows
= table_rows
* sel(issuer = 'MUNI-77')
* sel(status = 'open')
* sel(submitted_at >= now() - 15 min)
If the stored summaries say sel(issuer) is 0.006, sel(status) is 0.05, and the histogram says the last fifteen minutes represent 0.008 of the table, the estimate becomes:
12,000,000 * 0.006 * 0.05 * 0.008 ~= 29 rows
The arithmetic is tidy, and the result is disastrously wrong. During market open, the actual result might be closer to 2,300 rows because the predicates are not independent. For this issuer, recent rows are far more likely to be open than the global table average suggests, and recent activity is concentrated in the newest histogram bucket. The planner may know each marginal distribution reasonably well while still missing the joint distribution that matters to the query.
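One way to see the gap is to measure the joint selectivity directly. A sketch against the same table (a predicate scan like this is expensive at 12 million rows, so in production you would sample or scope it):
-- Measured joint frequency of all three predicates, for comparison with the
-- planner's independence-based product of marginal selectivities.
SELECT count(*) AS actual_rows,
       count(*)::numeric
         / (SELECT reltuples FROM pg_class WHERE relname = 'reservations')
         AS joint_selectivity
FROM reservations
WHERE issuer = 'MUNI-77'
  AND status = 'open'
  AND submitted_at >= now() - interval '15 minutes';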
Once the scan estimate is wrong, everything above it is priced on fiction. A nested loop into risk_limits looks inexpensive because the planner expects only a few dozen probes. The ORDER BY ... LIMIT 50 path looks almost free because it expects to stop after a tiny number of ordered tuples. In reality the engine churns through thousands of recent rows, performs thousands of join lookups, and may spill or thrash caches. The key lesson is that cardinality errors are multiplicative: each extra predicate and join compounds uncertainty unless the engine has statistics that model the dependency directly.
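The mismatch is visible node by node when the query runs with instrumentation. A sketch using the dashboard query itself:
-- Compare the planner's guess ("rows=...") against reality ("actual ... rows=...")
-- at each node, starting from the deepest scan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT r.reservation_id, r.account_id, r.quantity, rl.max_notional
FROM reservations r
JOIN risk_limits rl ON rl.account_id = r.account_id
WHERE r.issuer = 'MUNI-77'
  AND r.status = 'open'
  AND r.submitted_at >= now() - interval '15 minutes'
ORDER BY r.submitted_at DESC
LIMIT 50;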
Concept 3: Production fixes work when they add the missing information, not when they merely force a different plan
When Harbor Point sees estimated rows off by two orders of magnitude, the first move is not to disable nested loops or pin a plan with hints. Those tactics may suppress today's symptom while making tomorrow's workload worse. The disciplined move is to ask which fact the planner failed to know.
If the hot issuer or status value is too important to be averaged into the long tail, the fix is more resolution: raise the per-column statistics target so the MCV list and histogram have room to represent it. If issuer and status are strongly dependent, the fix is cross-column statistics such as functional-dependency or multivariate MCV summaries. If the trouble appears right after an overnight load or a morning burst, the issue may be freshness: the sample still reflects yesterday's distribution, so ANALYZE needs to run sooner or more often on the hot partition.
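In PostgreSQL terms, those three fixes sketch out as follows; the target value and statistics name are illustrative, not tuned recommendations:
-- 1. More resolution: a larger per-column statistics target gives the MCV
--    list and histogram room to represent hot values such as MUNI-77.
ALTER TABLE reservations ALTER COLUMN issuer SET STATISTICS 1000;

-- 2. Cross-column information: extended statistics model the issuer/status
--    dependency instead of multiplying marginal selectivities.
CREATE STATISTICS reservations_issuer_status (dependencies, mcv)
  ON issuer, status FROM reservations;

-- 3. Freshness: resample so the new settings take effect and the sample
--    reflects the current distribution.
ANALYZE reservations;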
Some workloads outgrow global table summaries altogether. If Harbor Point's newest fifteen minutes behave nothing like the rest of the table, partition-level statistics or a physically separate recent-data path can give the planner a model that matches the operational reality. Those fixes cost something: more background work, larger catalogs, longer planning, or more physical design complexity. But that trade-off is honest. You are paying to make the planner's model better, not simply hiding evidence that the model is wrong.
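If reservations is range-partitioned by submitted_at, the hot slice can carry its own model; a sketch with a hypothetical partition name:
-- Analyzing only the partition that holds today's trading activity is cheap
-- and gives the planner statistics that match the newest slice.
ANALYZE reservations_current_day;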
This also prepares the ground for 13.md. In MVCC engines, the same maintenance machinery that keeps statistics fresh often shares budget and scheduling pressure with old-version cleanup. Heavy write churn can therefore hurt plan quality in two ways at once: it changes the live distribution quickly and it creates background maintenance debt.
Troubleshooting
Issue: EXPLAIN ANALYZE shows an estimate of tens of rows for a query that actually returns thousands during market open.
Why it happens / is confusing: Single-column statistics can each look reasonable while their combination is catastrophically wrong. The missing piece is usually correlation between predicates such as issuer, status, and a recent time range.
Clarification / Fix: Compare estimated versus actual rows at the first scan node that goes wrong. If the error appears before joins, add or refine statistics on the filtered columns, especially multivariate statistics for the correlated predicates.
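After creating extended statistics as sketched in Concept 3, it is worth confirming that a subsequent ANALYZE populated them; a sketch for PostgreSQL 12 or later:
-- Extended statistics, once built, are visible in pg_stats_ext.
SELECT statistics_name, dependencies, most_common_vals
FROM pg_stats_ext
WHERE tablename = 'reservations';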
Issue: A manual ANALYZE fixes the plan for a few hours, then the bad plan returns the next trading day.
Why it happens / is confusing: The workload is changing faster than automatic statistics refresh thresholds expect. The planner keeps falling back to a sample that reflects yesterday's distribution rather than the current burst.
Clarification / Fix: Lower analyze thresholds or schedule targeted statistics refreshes for the hot table or partition after major ingest windows. Validate the change with representative morning parameters, not just midday traffic.
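In PostgreSQL, one option is per-table autoanalyze tuning; the thresholds below are illustrative:
-- Trigger autoanalyze after roughly 1% of rows change instead of the global
-- default scale factor (typically 10%).
ALTER TABLE reservations SET (
  autovacuum_analyze_scale_factor = 0.01,
  autovacuum_analyze_threshold = 1000
);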
Issue: The planner still misses the top-N index path even though statistics are fresh.
Why it happens / is confusing: Fresh does not always mean expressive enough. A coarse histogram may know that recent rows are rare globally while still missing that one issuer dominates that recent slice, so the ordered index path remains underpriced or overpriced.
Clarification / Fix: Increase statistics resolution, add extended statistics on the correlated columns, or isolate the recent working set with partitioning if the newest data behaves unlike the table as a whole.
Advanced Connections
Connection 1: 11.md explains the arithmetic; this lesson explains the inputs
Cost-based planning only looks intelligent when its row estimates are directionally right. Yesterday's lesson showed how the planner compares scan, join, and sort alternatives. Today's lesson shows why a perfectly consistent cost model can still choose the wrong tree: the cardinality inputs came from a compressed statistical model that failed to capture the query's real data shape.
Connection 2: 13.md turns statistics freshness into a maintenance problem
Statistics are not maintained in isolation. In engines with MVCC, analyze work and old-version cleanup often compete for the same background maintenance window. Understanding cardinality errors first makes the next lesson easier: stale statistics and dead-tuple buildup are different symptoms of the same underlying operational truth that data maintenance is part of query performance.
Resources
Optional Deepening Resources
- [DOC] PostgreSQL Documentation: Planner Statistics
- Focus: What ANALYZE stores, including distinct counts, most-common-values lists, histograms, and extended statistics.
- [DOC] PostgreSQL Documentation: Row Estimation Examples
- Focus: Worked examples of equality and range selectivity using histogram buckets and MCV data.
- [DOC] PostgreSQL Documentation: CREATE STATISTICS
- Focus: How to model cross-column dependencies when multiplying single-column selectivities is not good enough.
- [PAPER] Access Path Selection in a Relational Database Management System
- Focus: The System R paper that established the classical relationship between cardinality estimation and optimizer cost choices.
Key Insights
- Statistics are the planner's model of the data - Histograms, MCV lists, and distinct counts are compact summaries that stand in for full table knowledge during planning.
- Cardinality errors usually come from missing structure, not bad arithmetic - The common failure mode is unmodeled skew or dependency between predicates, not an inability to multiply selectivities.
- The right fix adds information to the planner - Better sampling, extended statistics, fresher maintenance, or narrower physical scope are more durable than forcing one plan shape.