Cost-Based Query Planning Internals

The core idea: A cost-based planner does not "know" the best plan in advance; it builds a manageable set of physical alternatives, estimates how many rows each one will produce and how many pages it will read, and chooses the plan with the cheapest estimate.

Today's "Aha!" Moment

In 10.md, Harbor Point designed a composite index so the market-open dashboard could seek directly to the rows matching issuer = 'MUNI-77' and status = 'open' and stream the newest reservations first. After rollout, the team expected the planner to use that index every time. Instead, one morning's EXPLAIN showed a sequential scan plus sort on reservations, and p95 latency for the dashboard jumped from tens of milliseconds into the multi-second range.

That was not the optimizer being whimsical. The planner had priced several legal execution trees and concluded that the index path was more expensive than scanning and sorting. Its decision came from estimates: how many rows the WHERE clause would match, whether the index would avoid the sort, how many random heap fetches the scan would cause, and how expensive the downstream join to risk_limits would be. The planner was not reasoning directly about SQL syntax; it was reasoning about operator trees.

The important shift is to stop asking, "Why did it ignore my index?" and start asking, "What assumptions made this other plan look cheaper?" That question turns query planning from folklore into a debuggable mechanism. It also prepares the ground for 12.md: if the row estimates are wrong, the cost model will make a logically consistent but operationally bad choice.

Why This Matters

Harbor Point's dashboard query is interactive, parameterized, and bursty. During quiet periods, issuer = 'MUNI-77' might match a few dozen open reservations. At market open, the same shape of query may hit thousands of rows, and the join to risk_limits can amplify the work if the planner expects a tiny outer relation and instead gets a flood of tuples. When that happens, the chosen plan dictates whether the engine performs a short ordered index walk, many repeated index lookups, a bitmap pass over scattered pages, or a table scan followed by an in-memory or on-disk sort.

That is why cost-based planning is operationally important even if you never touch optimizer code. A bad plan is often not a random failure and not even an indexing failure. It is usually an estimation failure that propagated upward through the plan tree. If you can read that propagation, you can decide whether the right fix is a better index, better statistics, a query rewrite, or different planner settings. If you cannot, you end up cargo-culting hints and hoping the workload never changes.

Learning Objectives

By the end of this session, you will be able to:

  1. Explain how a cost-based planner builds candidate physical plans - Describe how scans, joins, sort orders, and limits become alternative execution trees.
  2. Trace how row estimates and operator costs produce a chosen plan - Connect cardinality, I/O assumptions, and startup versus total cost to the final decision.
  3. Diagnose a bad plan in production terms - Use estimated versus actual rows to decide whether the real problem is physical design, stale statistics, or a query shape mismatch.

Core Concepts Explained

Concept 1: Planning starts by turning one SQL statement into several legal operator trees

Harbor Point's dashboard query now joins the reservation stream to a small per-account risk table:

SELECT r.reservation_id, r.account_id, r.quantity, rl.max_notional
FROM reservations r
JOIN risk_limits rl
  ON rl.account_id = r.account_id
WHERE r.issuer = 'MUNI-77'
  AND r.status = 'open'
  AND r.submitted_at >= now() - interval '15 minutes'
ORDER BY r.submitted_at DESC
LIMIT 50;

The optimizer's first job is not to attach costs. Its first job is to enumerate legal physical shapes for this query. For reservations, that may include a sequential scan, a bitmap path, or an ordered index scan using the composite key introduced in 10.md. For risk_limits, it may include a primary-key index probe or building a hash table. Once those base-table paths exist, the planner can assemble them into join trees and decide whether the final order comes "for free" from an index or requires an explicit sort.

For a small query, it can consider many combinations. For a large query, the search space grows explosively, so real optimizers prune aggressively. They discard obviously dominated paths, keep only paths whose output ordering might still be useful higher in the tree, and cap how much join-order exploration they are willing to do. That means the planner is already making trade-offs before any cost number appears: exhaustive search would be more complete, but it would make planning itself too slow for OLTP traffic.

You can picture Harbor Point's query as a search over alternative trees:

SQL
  -> scan reservations: {seq, bitmap, idx(issuer,status,submitted_at)}
  -> access risk_limits: {index probe, hash build}
  -> join: {nested loop, hash join}
  -> final order: {already ordered, explicit sort}
  -> limit: {stop early if upstream order allows it}

The key production insight is that physical design changes the search space. The composite index from the previous lesson does not merely make one scan faster. It creates a new ordered access path that the planner can price against the other options.
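
One hedged way to observe this search space from the outside, assuming a PostgreSQL-style engine (the setting and the probe below are diagnostic only, not production configuration): discourage one path class for a single transaction and compare the estimated cost of the plan the planner falls back to against the original plan's cost.

BEGIN;
-- Make the sequential-scan path unattractive for this transaction only, so the
-- planner surfaces its next-best alternative with that path's estimated cost.
SET LOCAL enable_seqscan = off;

EXPLAIN
SELECT r.reservation_id, r.account_id, r.quantity, rl.max_notional
FROM reservations r
JOIN risk_limits rl
  ON rl.account_id = r.account_id
WHERE r.issuer = 'MUNI-77'
  AND r.status = 'open'
  AND r.submitted_at >= now() - interval '15 minutes'
ORDER BY r.submitted_at DESC
LIMIT 50;

ROLLBACK;

If the index-driven plan that now appears carries a much higher estimated cost than the sequential scan did, the planner's original choice was at least consistent with its own estimates, and the investigation shifts to whether those estimates are right.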

Concept 2: The cost model prices those trees using estimated rows, pages, CPU work, and memory-sensitive operators

Once the planner has candidate trees, it assigns each one a cost. Those costs are usually not literal milliseconds. They are relative units produced by a model that combines child costs with expected work at each node. A simplified sketch looks like this:

plan_cost
  = child_costs
  + page_reads * page_cost
  + rows_processed * cpu_tuple_cost
  + operator_overheads(sort/hash/materialize)

For Harbor Point, the most important input is row estimation. If the planner believes the WHERE clause produces 120 rows, then an ordered index scan plus nested-loop probes into risk_limits can look excellent because it has low startup cost and can stop early after finding fifty rows. If it believes the same predicate produces 18,000 rows, the model may prefer a sequential scan plus hash join, because repeated random lookups and tuple-by-tuple processing now look more expensive than reading the table once and sorting the result.
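
A back-of-envelope version of that comparison, with purely illustrative numbers and PostgreSQL-style default cost parameters (seq_page_cost = 1.0, random_page_cost = 4.0, cpu_tuple_cost = 0.01); the real model accounts for more factors, but the shape of the trade-off is the same:

-- Hypothetical table: ~20,000 pages, ~2,000,000 rows. All figures illustrative.
SELECT
  -- Sequential scan: every page read once, every tuple touched.
  20000 * 1.0 + 2000000 * 0.01                 AS seq_scan_cost,        -- 40,000
  -- Index path at an estimated 120 matching rows: a few index pages plus
  -- ~120 scattered heap fetches.
  10 * 4.0 + 120 * 4.0 + 120 * 0.01            AS index_cost_120_rows,  -- ~521
  -- Same path at an estimated 18,000 matching rows: random heap fetches now
  -- dominate and the sequential scan becomes the cheaper estimate.
  200 * 4.0 + 18000 * 4.0 + 18000 * 0.01       AS index_cost_18k_rows;  -- ~72,980

The point is not the exact numbers but that the same physical path flips from obviously cheapest to clearly more expensive purely because the row estimate changed.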

That is why LIMIT 50 matters so much. Cost models usually track both startup cost and total cost. A plan that can return the first rows quickly may win for an interactive query even if its full-run total cost is worse. Harbor Point's index path benefits from that because the (issuer, status, submitted_at DESC) order can produce the newest rows immediately. A back-office export of all matching reservations might prefer a different plan because early-row latency no longer matters as much as total throughput.
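
In a PostgreSQL-style EXPLAIN, each node reports both numbers as cost=<startup>..<total>, which is where this effect becomes visible (a sketch, reusing the lesson's table):

EXPLAIN
SELECT r.reservation_id, r.submitted_at
FROM reservations r
WHERE r.issuer = 'MUNI-77'
  AND r.status = 'open'
ORDER BY r.submitted_at DESC
LIMIT 50;
-- An ordered index scan tends to show a small startup cost because it can emit
-- the newest rows immediately; a Sort node's startup cost includes reading and
-- sorting its entire input before the first row comes out. Under LIMIT 50, the
-- Limit node is priced on only the fraction of its child it expects to consume,
-- so the low-startup plan can win even if its full-run total cost is higher.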

The important consequence is that row-estimate errors compound. If the scan node is wrong, the join node inherits the error, and the sort or limit decision above it is now being priced on distorted input. The planner can therefore be internally consistent and still disastrously wrong in production. That is the bridge to the next lesson: histograms, most-common-values lists, and correlation stats determine whether the planner thinks it is processing hundreds of rows or tens of thousands.

Concept 3: A cost-based planner is a controlled approximation, so diagnosis means comparing its assumptions to reality

When Harbor Point sees a bad plan, the right move is not to treat the optimizer as a black box and start toggling knobs at random. The disciplined workflow is to compare the planner's expectations with execution reality. EXPLAIN shows the chosen plan and its estimated costs; EXPLAIN ANALYZE shows how many rows each node actually produced and how much time it spent there. The gap between those two views tells you what kind of problem you have.
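
A minimal sketch of that diagnostic step, assuming a PostgreSQL-style engine where EXPLAIN (ANALYZE, BUFFERS) executes the query and reports, per node, the estimated rows next to the actual rows produced:

EXPLAIN (ANALYZE, BUFFERS)
SELECT r.reservation_id, r.account_id, r.quantity, rl.max_notional
FROM reservations r
JOIN risk_limits rl
  ON rl.account_id = r.account_id
WHERE r.issuer = 'MUNI-77'
  AND r.status = 'open'
  AND r.submitted_at >= now() - interval '15 minutes'
ORDER BY r.submitted_at DESC
LIMIT 50;
-- A node estimated at ~120 rows that actually produced ~18,000 points at an
-- estimation problem; a node whose estimate is accurate but whose plan shape is
-- still slow points at physical design or query shape instead.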

If the chosen path is structurally impossible to make fast, the issue is physical design: perhaps there is no index that both narrows on issuer and preserves submitted_at DESC. If the path would have been fine under the planner's estimate but collapses under real row counts, the issue is estimation: stale statistics, skewed values, or predicate correlations the engine did not model well. If the plan is reasonable for one issuer and terrible for another, you may have a parameter-sensitive workload where a single generic plan is not robust enough across values.

This is the main trade-off of cost-based planning. It usually beats fixed heuristics because it can adapt to data size, operator choices, and query shape. But that flexibility depends on approximate statistics and a finite search budget. Engines accept that compromise because perfect planning is impossible at runtime. Your job in production is not to demand perfection; it is to determine which approximation failed and whether the cheapest repair is better statistics, better paths, or less ambiguous SQL.

Troubleshooting

Issue: A new composite index exists, but the planner still chooses a sequential scan plus sort.

Why it happens / is confusing: Seeing the "right" columns in the index tempts engineers to assume the ordered index path must be cheapest. The planner may disagree if it estimates that a large fraction of the table matches, or if using the index would still require many random heap fetches.

Clarification / Fix: Check EXPLAIN and compare the estimated rows for the scan node against reality. If the estimate is wildly high or low, investigate statistics and skew before forcing a plan. If the estimate is accurate, the issue may be that the index really is the wrong physical design for this query mix.
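
A hedged version of that check on a PostgreSQL-style engine (table and column names follow the lesson's example):

-- Refresh the planner's statistics for the table, then re-read the estimate.
ANALYZE reservations;

EXPLAIN
SELECT *
FROM reservations
WHERE issuer = 'MUNI-77'
  AND status = 'open';
-- Compare the scan node's estimated rows with a trusted count of the real
-- matching rows. An accurate estimate plus an unchanged plan suggests the
-- composite index genuinely is not the cheapest path for this query mix.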

Issue: The same parameterized query is fast for one issuer and slow for another.

Why it happens / is confusing: The optimal plan can change with data distribution. An issuer with a small active set favors an index-driven path; a very common issuer may make the same path too expensive compared with a broader scan.

Clarification / Fix: Compare plans and row estimates for representative parameter values. Look for places where one value yields a small outer relation and another yields a large one. That tells you whether you are dealing with skew, generic-plan reuse, or a workload that needs different indexing than the median case.
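
A minimal way to run that comparison, assuming a PostgreSQL-style engine; the second issuer code is hypothetical and stands in for a very common value:

EXPLAIN
SELECT * FROM reservations
WHERE issuer = 'MUNI-77' AND status = 'open';    -- expected: small active set

EXPLAIN
SELECT * FROM reservations
WHERE issuer = 'CORP-9000' AND status = 'open';  -- hypothetical high-volume issuer
-- If the plan shape or the estimated rows differ sharply between values, the
-- workload is skewed; if a prepared statement reuses one generic plan across
-- both, that reuse itself may be the problem.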

Issue: A plan was fine last quarter and is bad now without any code change.

Why it happens / is confusing: Cost-based planning is sensitive to data growth, changed value distributions, and new hot ranges. The SQL text may be identical while the planner's best estimated plan changes because the table is no longer shaped like it was when the query was first tuned.

Clarification / Fix: Re-run EXPLAIN ANALYZE on current production-like data, review fresh statistics, and verify whether the planner's assumptions about selectivity, sort size, and join fan-out still hold. Query tuning is not durable if the underlying data distribution has moved.
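
One hedged way to check whether the data has moved under the plan, assuming PostgreSQL-style monitoring views:

SELECT relname,
       last_analyze,
       last_autoanalyze,
       n_live_tup,
       n_mod_since_analyze   -- rows changed since statistics were last rebuilt
FROM pg_stat_user_tables
WHERE relname IN ('reservations', 'risk_limits');
-- A large n_mod_since_analyze relative to n_live_tup means the planner is still
-- pricing plans against a table shape that no longer exists.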

Advanced Connections

Connection 1: 10.md changes what the planner is allowed to consider

Composite index design and cost-based planning are inseparable. The planner can only price paths that actually exist. Creating the (issuer, status, submitted_at DESC) index did not guarantee that it would always win, but it did add an ordered access path that can avoid a sort and stop early under the LIMIT, as sketched below.
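
The access path in question, sketched as PostgreSQL-style DDL (the index name is illustrative; 10.md defines the actual design):

CREATE INDEX reservations_issuer_status_submitted_idx
    ON reservations (issuer, status, submitted_at DESC);
-- Creating this index adds an ordered path the planner can price against a
-- sequential scan plus sort; it does not force the planner to choose it.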

Connection 2: 12.md explains why cost formulas are only as good as their inputs

Today's lesson explains how the planner turns estimated row counts into scan, join, and sort costs. The next lesson focuses on where those estimates come from: statistics, histograms, most-common-values lists, and the errors that appear when column correlations or skew are under-modeled. If you can already see the cost tree, the statistics layer has an obvious place to attach.
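
As a preview, those inputs are visible in PostgreSQL-style catalogs (a sketch; exact columns vary by engine and version):

SELECT attname,
       n_distinct,
       most_common_vals,
       most_common_freqs,
       histogram_bounds
FROM pg_stats
WHERE tablename = 'reservations'
  AND attname IN ('issuer', 'status', 'submitted_at');
-- These per-column summaries are what the planner combines, usually assuming
-- independence between columns, to produce the row estimates that today's cost
-- formulas consume.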

Key Insights

  1. The planner chooses operator trees, not favorite syntax - SQL is first translated into physical alternatives such as scans, joins, and sorts, and only then compared by estimated cost.
  2. Cardinality is the lever behind most plan choices - Once row estimates drift, every upstream cost based on those rows drifts with them.
  3. Cost-based planning is powerful because it is approximate - Engines search enough of the plan space to be useful at runtime, then rely on statistics to make the approximation good enough.