Day 467: Index Strategy and Selectivity Planning

Data Architecture and Platforms · Lesson 019 · 30 min · Advanced

The core idea: An index pays for itself only when its leading predicates cut the search space enough to beat a scan, and that depends on real selectivity, data distribution, and write cost rather than on how often a column appears in WHERE.

Today's "Aha!" Moment

In 018.md, PayLedger split its storage responsibilities on purpose. PostgreSQL keeps the authoritative payout row that settlement workers mutate under strict transactional rules, while an LSM-backed event-history store absorbs the webhook and reconciliation firehose. That architectural split helps only if the team is equally disciplined about indexing. During the April 2026 cutoff window, support wants a live view of "failed or retrying payouts for tenant acme-co in run apr-2026-same-day, newest first, with processor details attached." The naive answer is to add indexes on status, processor_code, and updated_at everywhere the query might land.

That is the wrong mental model. Index strategy starts with selectivity planning: how many rows survive each predicate, in what order, and at what maintenance cost. status = 'failed' sounds specific, but if the query crosses every tenant and every payroll run during an outage, it may still touch far too much data. tenant_id = 'acme-co' and run_id = 'apr-2026-same-day' are far more decisive because they define the operating slice first. Once the search space is narrowed to one tenant and one run, a status predicate and descending timestamp order can become useful instead of decorative.

The production lesson is that "has an index" is not the same thing as "has a good access path." A B-tree index that leads with weak predicates taxes every write in PostgreSQL, and a secondary index in the event-history store creates another stream of SSTables that compaction must later merge. The common misconception is that indexing is a read-only optimization. In real systems it is a recurring operating cost, so the question is never "can we index this?" but "is this query selective enough to deserve an always-on physical structure?"

Why This Matters

PayLedger runs same-day payroll for several large employers. At 15:45, the settlement workers are still claiming ready payouts, processors are returning a mix of success and retry codes, and support needs to answer a precise question within seconds: which payouts for one tenant are still unhealthy right now, and which ones are blocked on the same processor response? That query sits directly on the boundary between system-of-record access and operational diagnosis.

If the team guesses wrong, the damage shows up in different places depending on the engine. In PostgreSQL, too many weak secondary indexes raise WAL volume, reduce HOT-update opportunities, and widen replica lag exactly when the cutoff window is busiest. In the LSM event-history store, secondary indexes multiply write amplification and compaction debt, so the system spends more time reorganizing data that should never have been queried there in the first place. Neither failure mode is visible from the dashboard that merely reports "index exists."

A deliberate selectivity plan changes the conversation. The authoritative payout table gets a small number of narrow indexes that match commit-time workflows and incident-time filters that truly need fresh transactional data. Broader investigative workflows move to the derived search surface introduced in 016.md. That discipline is also the prerequisite for 020.md: caching helps when the underlying query shape is already selective and stable, but it only hides mistakes when the base access path is fundamentally too wide.

Learning Objectives

By the end of this session, you will be able to:

  1. Estimate whether a predicate is selective in the place that matters - Distinguish global cardinality from the tenant-, run-, and time-local selectivity that actually drives page reads.
  2. Design composite and partial indexes around query shape - Choose index order, filter predicates, and covering fields that reduce read work without taxing the write path more than the workload can afford.
  3. Recognize when a query does not deserve an index on the primary store - Move exploratory or weakly selective filters into a derived system instead of forcing them onto the hot transactional or LSM ingest path.

Core Concepts Explained

Concept 1: Selectivity is about rows avoided, not columns admired

For PayLedger, the dangerous query is not hypothetical. Support really does ask for "all unhealthy payouts for acme-co in the current run, newest first." Suppose the payout table holds 80 million rows overall, while the active run for acme-co contributes 420,000 of them. Within that run, only 6,800 payouts are currently failed or retrying.

The selectivity math should be done against the slice the query must traverse, not against the table in the abstract:

global table size:                    80,000,000 rows
rows for tenant acme-co:              2,900,000 rows
rows for one payroll run:               420,000 rows
rows with status in (failed,retrying):    6,800 rows

If the index begins with status, the engine still has to visit unhealthy payouts across many tenants and runs before it can isolate the relevant 6,800 rows. If it begins with (tenant_id, run_id), the search space collapses to one payroll run immediately, and the remaining filter becomes cheap enough to matter. That is why column cardinality alone is a weak planning tool. A low-cardinality column can still be useful inside a narrow slice, and a seemingly specific column can be useless if skew turns one value into a huge hot spot.

This is also where ORDER BY and LIMIT matter. Support does not need every historical failure first; it needs the most recent 200 rows for one run. An access path that filters correctly and returns rows in descending updated_at order can stop early. An access path that filters weakly or sorts late reads far more pages before it even has a chance to apply the limit. The trade-off is straightforward: more index precision lowers read work, but every extra maintained index or indexed column expands write cost.
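
A quick way to ground these numbers is to measure them directly. The sketch below assumes the payout schema and the example values used above; the first query measures the slice the support query actually traverses, and the second shows what the planner believes about each column in isolation, which is exactly where per-column statistics miss the correlation.

-- Minimal sketch, assuming the payout schema and example values above.
-- Measure the slice the query actually traverses (run ad hoc, not on the hot path).
SELECT
  count(*) AS rows_in_run,
  count(*) FILTER (WHERE status IN ('failed', 'retrying')) AS unhealthy_rows
FROM payout
WHERE tenant_id = 'acme-co'
  AND run_id = 'apr-2026-same-day';

-- Compare with the planner's per-column view, which cannot see this correlation.
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'payout'
  AND attname IN ('tenant_id', 'run_id', 'status');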

Concept 2: Composite and partial indexes turn a query into a physical path

The hot incident query on PostgreSQL looks roughly like this:

SELECT payout_id, status, processor_code, updated_at
FROM payout
WHERE tenant_id = $1
  AND run_id = $2
  AND status IN ('failed', 'retrying')
ORDER BY updated_at DESC
LIMIT 200;

The wrong index for this query is something like (status, processor_code, updated_at). It leads with columns that are attractive in a support conversation but weak as a physical boundary. PostgreSQL would still need to touch rows from many tenants and runs, then discard most of them after heap reads or bitmap filtering. That index would also be updated on every status transition, which is precisely the kind of write tax 017.md warned about.

A better design narrows the working set first and only indexes rows the query actually cares about:

CREATE INDEX CONCURRENTLY idx_payout_unhealthy_recent
ON payout (tenant_id, run_id, updated_at DESC)
INCLUDE (payout_id, status, processor_code)
WHERE status IN ('failed', 'retrying');

This index embodies three deliberate decisions. First, tenant_id and run_id come first because they define the operational shard the support query always uses. Second, the partial predicate keeps healthy rows out of the structure entirely, which shrinks the index and reduces write churn on settled payouts that support does not need for this workflow. Third, the descending timestamp order matches the requested output, so PostgreSQL can stop after it finds the first 200 rows instead of sorting a much larger result set.
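
A hedged way to confirm those decisions hold in practice is to look at the plan itself. Exact costs will differ from system to system; what matters is the shape, using the index name and example literals from above.

EXPLAIN (ANALYZE, BUFFERS)
SELECT payout_id, status, processor_code, updated_at
FROM payout
WHERE tenant_id = 'acme-co'
  AND run_id = 'apr-2026-same-day'
  AND status IN ('failed', 'retrying')
ORDER BY updated_at DESC
LIMIT 200;

-- Expected shape, not exact numbers: an Index Scan (or Index Only Scan when the
-- visibility map is current) on idx_payout_unhealthy_recent, no separate Sort node,
-- and actual rows close to 200 rather than hundreds of thousands.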

The same reasoning applies when the data lives in the LSM-backed event-history store. Adding a secondary index for processor_code there is not "free because reads are faster." It means every new webhook and retry note must update another index structure, generate more immutable files, and pay more compaction later. If the resulting query is still weakly selective, the extra index is a bad bargain. That is the mechanism-level connection between B-tree maintenance cost and LSM compaction debt: index structures are only justified when they carve out a genuinely narrow path through the data.

Concept 3: Selectivity planning fails when statistics and workload shape drift apart

An index can be well designed and still disappoint in production if the planner's model no longer matches the data. Imagine processor R03 begins returning transient failures for one tenant. For two hours, failed and retrying rows spike inside acme-co's active payroll run while the global table distribution still looks mostly healthy. A planner that relies on stale statistics, or that assumes tenant_id and status are independent, may badly misestimate how many rows the query will touch.

That drift shows up in three concrete ways. The planner may choose a sequential scan because it thinks the partial index is too selective to matter. It may choose the index but still fetch far more heap pages than expected because the included columns are insufficient and the actual row count is much larger than estimated. Or it may continue using an old index whose column order made sense before the query became "latest unhealthy payouts for one run" instead of "all payouts by processor code."

The fix is not superstition; it is instrumentation. Compare estimated rows to actual rows with EXPLAIN (ANALYZE, BUFFERS). Refresh statistics often enough for volatile tables. Add extended statistics when predicate correlation is strong. Track index size, write latency, WAL bytes, and replica lag after every new index goes live. The trade-off here is operational discipline versus accidental cost. Without disciplined statistics and plan review, a carefully designed index strategy quietly decays into folklore.
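
One way to make that discipline concrete is sketched below, assuming the same payout table. The statistics object name and the autovacuum setting are illustrative values, not prescriptions.

-- Sketch only: teach the planner that tenant_id, run_id, and status move together.
CREATE STATISTICS payout_tenant_run_status (dependencies)
  ON tenant_id, run_id, status
  FROM payout;

-- Refresh estimates for a volatile table right away...
ANALYZE payout;

-- ...and make automatic refreshes more frequent than the default for this table
-- (re-analyze after ~2% of rows change instead of 10%; tune to the actual churn).
ALTER TABLE payout SET (autovacuum_analyze_scale_factor = 0.02);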

Troubleshooting

Issue: The right-looking index exists, but PostgreSQL still chooses a sequential scan during the cutoff window.

Why it happens / is confusing: The index definition may be fine while the planner's row estimate is wrong. Stale statistics, tenant-specific skew, or unmodeled correlation between tenant_id, run_id, and status can make the scan look cheaper on paper than it is in reality.

Clarification / Fix: Compare estimated and actual rows with EXPLAIN (ANALYZE, BUFFERS), refresh statistics, and consider extended statistics for correlated predicates. If the query shape changed, redesign the index rather than forcing the old one.

Issue: A new support index improved one incident query but settlement throughput and replica freshness got worse.

Why it happens / is confusing: The read win is real, but so is the write tax. Every status transition now updates another B-tree, writes more WAL, and may disqualify HOT updates on the primary table.

Clarification / Fix: Measure the write-side bill explicitly. If the query is not part of commit-time correctness and can tolerate derived freshness, move it to the search or analytics surface instead of paying for it on every transactional write.
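
A rough way to see that bill, sketched against PostgreSQL's standard statistics views; thresholds and interpretation are workload-specific.

-- HOT-update ratio: a drop after a new index often means the index now blocks HOT updates.
SELECT relname, n_tup_upd, n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / greatest(n_tup_upd, 1), 1) AS hot_update_pct
FROM pg_stat_user_tables
WHERE relname = 'payout';

-- Per-index usage and size: rarely scanned but constantly maintained indexes are drop candidates.
SELECT indexrelname, idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'payout'
ORDER BY idx_scan ASC;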

Issue: Secondary indexes in the event-history store keep growing, and compaction falls behind after webhook bursts.

Why it happens / is confusing: LSM engines postpone structural cleanup. Every extra secondary index creates another stream of SSTables and another compaction budget, even if the indexed query remains broad and poorly selective.

Clarification / Fix: Keep the event-history store focused on keyed replay and narrow time-slice reads. Route exploratory predicates to a system built for broader filtering, or materialize a smaller derived projection whose primary key already matches the query.

Advanced Connections

Connection 1: 017.md explains why every PostgreSQL index is part of the write path

This lesson uses selectivity to decide which indexes deserve that cost. The B-tree mechanics from 017.md are the reason weakly selective support indexes are not just clutter; they are a continuous write and WAL obligation on the authoritative system.

Connection 2: 018.md shows that LSM engines still need index discipline

An LSM tree changes when the bill is paid, not whether the bill exists. Secondary indexes in the event-history store are only sensible when they create a narrow, repeated access path that justifies the added compaction and space amplification.

Connection 3: 020.md depends on getting access paths right first

Caching an unselective query usually turns one bad database habit into a harder invalidation problem. Once the base query is selective and the index plan is stable, cache layers become a latency optimization instead of a crutch.

Key Insights

  1. Selectivity is local to the real query path - The useful question is not "how many distinct values does this column have?" but "how many rows survive after the leading predicates define the tenant, run, and time slice?"
  2. Good indexes are shaped around workflows, not around nouns - Composite order, partial predicates, and covering choices should mirror the way the system actually reads data, while every extra index must justify its write or compaction cost.
  3. Statistics drift can turn a correct design into a bad plan - Index strategy is not finished at CREATE INDEX; it stays healthy only if estimates, workload shape, and production measurements remain aligned.