LESSON
Day 465: B-Tree Path and Maintenance Costs
The core idea: A B-tree keeps transactional reads fast by maintaining a shallow, ordered page hierarchy, but it charges for that speed on every write through page splits, index maintenance, WAL generation, and cleanup work.
Today's "Aha!" Moment
In 016.md, PayLedger established PostgreSQL as the authority for same-day payroll runs and pushed search and analytics into derived systems. The next question is the one teams usually answer too casually: if PostgreSQL is the hot path, how much indexing can it afford before the write path starts paying more than the product can tolerate?
The useful realization is that a B-tree is not just a "fast lookup" feature. It is a physical promise that keys stay sorted across fixed-size pages so the engine can walk from root to leaf in a small number of hops. That promise is why a payout lookup can stay fast even when the payout table has millions of rows. It is also why inserts, updates, and deletes must keep touching index pages, splitting them when they fill, and cleaning them up after old versions become dead.
For PayLedger, that turns index design into a production budget. During the payroll cutoff window, every status change on a payout may touch the heap row, one or more B-tree indexes, and the WAL stream that replicas must replay. A support filter that looks harmless in staging can become an expensive write tax in production if it adds another hot secondary index to a table that already changes constantly.
The common mistake is to think an index costs money only when a query uses it. In a page-oriented transactional engine, the opposite is often more important: an index is paid for continuously by the write path so that future reads can remain cheap and predictable.
Why This Matters
PayLedger closes same-day payroll runs under a narrow cutoff. While that window is open, the platform performs point reads to fetch one payout, bounded range scans to find the next batch of work, and status updates as settlements move from scheduled to submitted, settled, or failed. Support also wants richer filtering during incidents: "show failed payouts for tenant acme-co in the April 2026 run with processor code R03."
If the team treats every new filter as a reason to add another B-tree index, the read side looks better for a while and the write side quietly deteriorates. Inserts dirty more pages. Updates stop qualifying for cheap heap-only updates when indexed columns change. Page splits create extra WAL and more random I/O. Replicas replay more work. Autovacuum has more dead tuples and index pages to clean. None of that is visible in a simple query benchmark, but all of it appears when payroll traffic peaks.
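The compounding cost described above can be made concrete with a back-of-envelope sketch. The numbers here are illustrative assumptions, not measured PostgreSQL costs: it simply shows how the minimum number of dirtied pages per update grows with the secondary-index count once an update is no longer HOT-eligible.

```python
# Back-of-envelope sketch (hypothetical costs, not measured PostgreSQL
# behavior): estimate the minimum pages a single payout UPDATE dirties
# as secondary indexes accumulate. Assumes each non-HOT update touches
# the heap page plus at least one leaf page per index, ignoring page
# splits, internal pages, and WAL.

def pages_dirtied_per_update(num_indexes: int, hot_eligible: bool) -> int:
    heap_pages = 1
    if hot_eligible:
        return heap_pages              # HOT update: secondary indexes untouched
    return heap_pages + num_indexes    # one leaf page per index, minimum

for n in range(5):
    print(n, "indexes ->", pages_dirtied_per_update(n, hot_eligible=False), "pages")
```

The point is not the exact numbers but the slope: every support-oriented index adds a fixed per-write cost that never shows up in a read-only benchmark.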
Understanding the path and maintenance cost of a B-tree lets the team separate queries that belong on the authoritative OLTP path from queries that belong in a derived search system. That is the operational bridge from the baseline in 016.md to the next lesson, 018.md, where the same workload will be examined under an LSM cost model instead of a page-oriented one.
Learning Objectives
By the end of this session, you will be able to:
- Explain how a B-tree lookup actually runs - Trace a point lookup or range scan from root page to leaf page and identify when the heap must still be visited.
- Analyze why writes become expensive as indexes accumulate - Connect page splits, secondary-index churn, HOT eligibility, and WAL volume to the shape of the write workload.
- Choose where B-trees belong in a production architecture - Decide which queries justify a hot-path B-tree index and which ones should move to a derived search surface or a different storage engine.
Core Concepts Explained
Concept 1: The read path is fast because the tree stays shallow and ordered
PayLedger keeps a composite B-tree on (tenant_id, run_id, payout_id) so the settlement worker can find the next payout for one payroll run without scanning unrelated tenants. Physically, that index is a hierarchy of fixed-size pages. The root page stores separators for broad key ranges, internal pages narrow the search further, and leaf pages contain the sorted index tuples that point to heap rows.
For a point lookup, PostgreSQL does not inspect millions of entries one by one. It follows a short path:
root page
-> internal page for tenant_id = acme-co
-> internal page for run_id = apr-2026-same-day
-> leaf page with payout_id = p_842119
-> heap tuple for the current payout row version
Because each internal page can point to many children, the fan-out is high and the tree height stays small. Upper pages are usually hot in shared buffers, so a lookup often pays only for the lower-level page reads plus the heap visit. Range scans are efficient for the same reason: once the engine reaches the first relevant leaf page, it can walk neighboring leaf pages in key order instead of jumping all over the table.
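The fan-out argument above can be sketched in a few lines. The fan-out value here is an illustrative assumption, not PostgreSQL's actual per-page tuple count, but it shows why even tens of millions of index entries fit in a tree only a few levels deep.

```python
import math

# Sketch: why high fan-out keeps a B-tree shallow. The fan-out of 300
# is an assumed, illustrative value; real per-page counts depend on key
# width, page size, and fillfactor.

def tree_height(num_entries: int, fanout: int) -> int:
    # Levels needed so that fanout ** height >= num_entries.
    return max(1, math.ceil(math.log(num_entries, fanout)))

print(tree_height(10_000_000, 300))   # 10M entries still only a few hops deep
```

Because height grows logarithmically with entry count, the root and upper internal pages stay hot in shared buffers and the lookup cost is dominated by the lowest-level page reads plus the heap visit.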
That is why B-trees fit the authoritative workload chosen in 016.md. They are strong at equality lookups and ordered range scans over a stable key order. They are much less magical than people assume, though. If a query filters on columns that do not match the index order, or if PostgreSQL still has to visit the heap for every row, the cost profile changes quickly. A B-tree gives structure, not unlimited query freedom.
Concept 2: The write path pays the maintenance bill immediately
Now look at the same payout table during the cutoff window. Each new payout insert must land in sorted order inside every relevant B-tree. If the target leaf page has space, PostgreSQL inserts the new tuple and updates the necessary page metadata. If the page is full, the engine splits it, moves some keys to a new sibling page, updates the parent page with a new separator key, and writes all of that to WAL so crash recovery and replicas can reproduce the same structure.
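The split mechanics described above can be sketched with a toy page model. This is a deliberate simplification: real PostgreSQL splits involve fillfactor, a rightmost-split optimization, parent updates, and WAL records, none of which are modeled here.

```python
import bisect

# Toy sketch of a leaf-page split. A "page" is a sorted list with an
# assumed fixed capacity; on overflow we split it and hand the first
# key of the new sibling up as the parent's separator. Illustrative
# only; not PostgreSQL's actual split algorithm.

PAGE_CAPACITY = 4

def insert_key(page: list, key: int):
    """Insert in sorted order; on overflow return (left, sibling, separator)."""
    bisect.insort(page, key)
    if len(page) <= PAGE_CAPACITY:
        return page, None, None          # fits: no split needed
    mid = len(page) // 2
    left, right = page[:mid], page[mid:]
    return left, right, right[0]         # separator key goes up to the parent

left, right, sep = insert_key([10, 20, 30, 40], 25)
print(left, right, sep)                  # [10, 20] [25, 30, 40] 25
```

Even in this toy form, the essential cost is visible: one logical insert can turn into two rewritten pages plus a parent update, all of which must also be logged for recovery and replication.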
Updates are often more expensive than teams expect. If PayLedger adds a secondary index on (tenant_id, status, processor_code) to help support, then every status transition from submitted to failed or settled becomes an index maintenance event. A new index entry must be inserted at the correct key position, and the old entry lingers, pointing at the now-dead row version, until vacuum reclaims it. If the update changes only non-indexed columns and there is room on the same heap page, PostgreSQL may be able to use a heap-only tuple (HOT) update, which avoids touching secondary indexes. Once indexed columns change, or once the page no longer has room, that cheaper path disappears.
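The HOT eligibility decision just described can be reduced to two conditions. This sketch models only those two; real PostgreSQL also considers expression and partial indexes, so treat it as a mental model rather than the engine's actual check.

```python
# Simplified model of HOT (heap-only tuple) eligibility: an update can
# skip secondary-index maintenance only if no indexed column changed
# AND the new row version fits on the same heap page as the old one.
# Column names follow the lesson's PayLedger example.

def hot_eligible(changed_cols: set, indexed_cols: set, heap_page_has_room: bool) -> bool:
    return heap_page_has_room and not (changed_cols & indexed_cols)

indexed = {"tenant_id", "status", "processor_code"}
print(hot_eligible({"operator_notes"}, indexed, True))   # True: no index touched
print(hot_eligible({"status"}, indexed, True))           # False: status is indexed
```

Notice how adding the support index makes status itself an indexed column, so the most frequent update on the table, the status transition, permanently loses HOT eligibility.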
This is why source-of-record indexes are a budget, not a checkbox list. A secondary B-tree does not just consume disk. It adds latch contention on hot pages, dirty-page churn in the buffer cache, more WAL for replicas to replay, and more cleanup work for vacuum when old row versions age out. Page splits are not the only cost, but they are a good visible symptom that the tree is constantly paying to preserve order under write pressure.
The production trade-off is clear: B-trees keep reads predictable because they perform the sorting and maintenance work continuously. That is perfect when the system needs fresh transactional reads. It is expensive when the workload is dominated by high-churn writes or when the index set grows to support ad hoc exploration that could have lived in a derived store instead.
Concept 3: Good index design means matching the tree to the query shape and the update shape
For PayLedger, the hot-path B-tree set should be intentionally small. The primary key and a tenant-scoped composite index that supports payout execution are easy to justify because they sit directly on the authoritative workflow. A free-text support search over employer name, bank response text, and operator notes is not a good B-tree candidate on the transactional table, because those filters are wide, combinatorial, and not part of commit-time correctness. That is exactly the sort of workload that should move into the search projection established in 016.md.
The key design questions are mechanical. Do the index's leading columns match the predicates the query filters on first? Are the leading columns selective enough to avoid scanning large swaths of the tree? Does the query need rows in sorted order, making leaf-page traversal valuable? Will the indexed columns change frequently, turning a "read optimization" into a recurring write penalty? Those questions matter more than whether the index looks elegant on paper.
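The leftmost-prefix question above can be sketched mechanically: given a composite index and the set of columns a query constrains with equality, only an unbroken run of leading index columns can bound the tree descent. This is a simplified model of planner behavior, using the lesson's example index.

```python
# Sketch of leftmost-prefix matching for a composite B-tree index:
# the descent can be bounded only by an unbroken run of leading index
# columns that the query constrains. Simplified model; the real planner
# also handles range predicates, sort orders, and filter conditions.

def usable_prefix(index_cols: list, equality_cols: set) -> list:
    prefix = []
    for col in index_cols:
        if col not in equality_cols:
            break                      # first gap ends the usable prefix
        prefix.append(col)
    return prefix

idx = ["tenant_id", "run_id", "payout_id"]
print(usable_prefix(idx, {"tenant_id", "run_id"}))   # both leading columns usable
print(usable_prefix(idx, {"run_id", "payout_id"}))   # []: leading column missing
```

This is why the support filter on processor_code alone cannot reuse the execution index: without tenant_id at the front, the tree's order gives the query no foothold.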
Operationally, teams should watch for signs that the B-tree budget is being overspent: increasing WAL bytes per payroll run, growing replica replay lag during status transitions, rising index size relative to the table, and reduced update throughput after adding support-oriented indexes. Fillfactor tuning, periodic reindexing, and careful composite-index design can help, but they do not change the basic contract. A B-tree pays maintenance incrementally to keep data ordered now. In 018.md, you will see the alternative: an LSM system delays part of that cost and pays it later through compaction.
Troubleshooting
Issue: A query that should be selective still performs many heap reads and misses its latency target.
Why it happens / is confusing: Reaching the correct leaf page is only part of the path. If the index does not cover the needed columns, or if PostgreSQL cannot use an index-only scan because the visibility map does not show the relevant heap pages as all-visible, the engine still has to fetch heap tuples one by one.
Clarification / Fix: Check whether the query's predicates match the index's leading columns, whether the plan is using the intended index, and whether a covering strategy is justified. Do not add columns casually; every covering choice increases maintenance cost on writes.
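The two conditions behind this issue can be stated as a small predicate. This is a simplified model of when the heap visit can be skipped; real index-only scans check visibility per page via the visibility map, which this sketch collapses into a single flag.

```python
# Simplified model: an index-only scan avoids the heap visit only when
# the index covers every column the query references AND the heap page
# is marked all-visible. Collapses per-page visibility into one flag.

def needs_heap_visit(query_cols: set, index_cols: set, page_all_visible: bool) -> bool:
    covered = query_cols <= index_cols
    return not (covered and page_all_visible)

idx = {"tenant_id", "run_id", "payout_id"}
print(needs_heap_visit({"tenant_id", "payout_id"}, idx, True))   # False: index-only scan
print(needs_heap_visit({"tenant_id", "amount"}, idx, True))      # True: amount not covered
```

Either failed condition forces the heap fetch, which is why a "selective" query can still pay one random heap read per matching row right after heavy churn, before vacuum refreshes visibility.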
Issue: Insert and update throughput dropped sharply after adding indexes for support workflows.
Why it happens / is confusing: The new indexes improved a few read queries, but every payout state change now touches more B-tree pages and generates more WAL. If indexed columns are updated frequently, the table also loses opportunities for cheaper heap-only updates.
Clarification / Fix: Remove or narrow indexes that do not serve the authoritative workflow, and move exploratory filters into the derived search system. Keep the source-of-record index set aligned with commit-time queries, not with every troubleshooting convenience.
Issue: The index keeps growing and vacuum cannot seem to catch up after a period of heavy churn.
Why it happens / is confusing: A high-update workload leaves dead tuples behind in both the heap and its indexes. Long transactions delay cleanup, and repeated page splits can leave the on-disk structure larger and less locality-friendly than the logical row count suggests.
Clarification / Fix: Investigate long-running transactions, confirm autovacuum is keeping pace, and decide whether reindexing or fillfactor adjustment is justified. If churn is inherent to the workload, question whether the query truly belongs on the transactional B-tree path.
Advanced Connections
Connection 1: 016.md defines the authority boundary; B-tree costs explain why that boundary must stay narrow
The previous lesson argued that PostgreSQL should keep only the data and indexes needed for authoritative payroll execution. This lesson provides the physical reason. Every extra B-tree index spends write-path budget inside the system of record, so the boundary around "authoritative" must be disciplined or the hot path becomes a dumping ground for read convenience.
Connection 2: 018.md studies the same pressure from the opposite side of the storage-engine trade-off
B-trees maintain order in place and therefore spread maintenance across every write. LSM engines accept temporarily disordered data in memory and immutable files, then recover order later through compaction. The workloads that strain a B-tree through page churn and index updates are often the same workloads that make LSM compaction debt worth considering.
Resources
- [DOC] PostgreSQL B-Tree Indexes
- Focus: Review how PostgreSQL describes page ordering, operator classes, and the kinds of predicates B-tree indexes can satisfy efficiently.
- [DOC] PostgreSQL Multicolumn Indexes
- Focus: Use it to reason about leftmost-prefix behavior and why column order matters for PayLedger's tenant-scoped composite indexes.
- [DOC] PostgreSQL Index-Only Scans and Covering Indexes
- Focus: Connect the logical B-tree path to the practical question of when the heap still has to be visited.
- [DOC] PostgreSQL Heap-Only Tuples (HOT)
- Focus: Study why some updates avoid secondary-index churn and why frequent updates to indexed columns are especially expensive.
Key Insights
- A B-tree lookup is cheap because the engine preserves sorted page structure continuously - The shallow path from root to leaf is the result of ongoing maintenance, not a free property of the data.
- Every secondary index becomes part of the write path - Inserts, updates, WAL volume, replica replay, and vacuum work all scale with the indexes that must stay transactionally current.
- Index design is really workload selection - Keep B-trees for authoritative equality and ordered-range queries, and move broad exploratory reads into systems that can afford a different cost model.