LESSON
Day 391: B-Tree Node Structure and Search Path
The core idea: A B-tree stays fast on disk because each internal page is a sorted routing table, so one root-to-leaf walk narrows a huge key space to a single small leaf page.
Today's "Aha!" Moment
In 06.md, Harbor Point's municipal-bond system ran into the main weakness of a fully sorted file: preserving global order under steady quote updates becomes expensive. The table still needs order for lookups and range scans, but it does not need the entire data file to be one giant sorted array. A B-tree solves that by storing order hierarchically. Instead of one sorted run, Harbor Point builds an index on (cusip, quote_ts DESC) where each page covers only a slice of the key space.
The important realization is that a B-tree root page does not point to rows. It points to ranges. An internal page might say, in effect, "keys up to 13040... are in this child; keys up to 13080... are in that child." Each descent step removes most of the search space at page granularity. If an 8 KB internal page can hold a few hundred separators and child pointers, the tree becomes shallow enough that tens of millions of index entries are often reachable in three or four levels.
That is why the usual explanation, "B-trees are O(log n)," is too weak for production reasoning. What matters operationally is page fanout. Upper levels stay hot in cache, leaves hold almost all of the data, and a lookup is usually one binary search per page plus one pointer chase to the next page. The structure of the node determines how many cache lines, buffer hits, and storage reads the engine spends to answer a query.
This also explains why the next lesson, in 08.md, has to exist. Search is cheap only if every parent separator still describes the correct child range after inserts and deletes. Today we focus on what the nodes mean and how a read finds its path. Next comes the maintenance work that keeps that path valid.
Why This Matters
Harbor Point serves two high-value queries from the same municipal_quotes table. Trader dashboards repeatedly ask for the newest quote for one cusip. Compliance analysts replay quote movement over a narrow time window for one issuer. A heap file alone makes both queries scan too much, while a fully sorted file makes every incoming quote pay too much maintenance.
A B-tree index is the compromise that makes the workload tractable. It gives the dashboard a short root-to-leaf path to the newest entries for one bond, and it gives compliance a way to start at the first relevant leaf and then continue in key order. Once you understand node structure, you can explain why those two access patterns can share one index and why the latency story depends more on page layout than on abstract algorithm names.
This is production-relevant for another reason: many storage problems that look like "query planning" are really "index geometry." Wide keys reduce fanout and deepen the tree. Non-covering leaves still force heap lookups after the index search finishes. Poor key ordering groups the wrong entries together. If you cannot picture the search path, you cannot diagnose those problems accurately.
Learning Objectives
By the end of this session, you will be able to:
- Explain what lives in a B-tree node - Distinguish root, internal, and leaf pages and describe what each page stores.
- Trace a root-to-leaf lookup - Follow how separator keys, child pointers, and leaf links answer point and range queries.
- Reason about production trade-offs - Connect key width, fanout, cache residency, and heap lookups to real latency behavior.
Core Concepts Explained
Concept 1: A B-tree node is a page-sized routing structure, not a single key/value pair
Harbor Point's index on (cusip, quote_ts DESC) is not a binary tree where each node holds one comparison key. It is a page-sized node designed for block storage. Most production systems calling something a "B-tree" are really using a B+ tree variant: internal pages store routing information, while leaf pages store the full searchable keys and either row pointers or the rows themselves if the index is clustered.
At a high level, the tree looks like this:
                 root
     [ 13020 | 13080 | 13140 ]
       |         |         |
       v         v         v
   internal  internal  internal
     |    |
     v    v
   leaf <-> leaf <-> leaf <-> leaf
Inside each page, entries are kept in sorted key order. Internal pages hold separator keys plus child page identifiers. Leaf pages hold the actual index entries for Harbor Point's quotes, such as (cusip, quote_ts DESC) -> row locator. Many engines also keep sibling links between neighboring leaves, so that once a range scan reaches its first leaf, it can continue in key order without climbing back to the root for each subsequent key.
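To make that layout concrete, here is a minimal sketch of the two page kinds in Python. The class and field names (InternalPage, LeafPage, separators, next_leaf) are illustrative choices for this lesson, not any engine's real on-disk format, which packs entries behind a page header and cell pointer array:

from dataclasses import dataclass
from typing import Optional

@dataclass
class LeafPage:
    # Sorted index entries: ((cusip, quote_ts), row_locator) pairs.
    entries: list
    # Sibling link so a range scan can continue without re-descending.
    next_leaf: Optional["LeafPage"] = None

@dataclass
class InternalPage:
    # separators[i] is an exclusive upper bound for children[i];
    # children holds one more element than separators.
    separators: list
    children: list

A real page also tracks free space, a fill-factor target, and visibility details that this sketch leaves out.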
The page-sized design is what makes B-trees flat. Suppose Harbor Point uses 8 KB index pages and, after headers and free space, an internal entry averages about 32 bytes. One internal page can then route to roughly 200 or more children. With that fanout, a four-level tree can cover an enormous key space while touching only one page per level. The exact numbers vary by engine and key width, but the principle does not: fewer, wider levels mean fewer storage reads.
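The arithmetic behind that claim is short enough to write down. The numbers below (8 KB pages, 32-byte entries, 25% overhead, 50 million entries) are assumptions chosen for illustration, not any engine's measured values:

import math

page_bytes = 8 * 1024      # assumed page size
entry_bytes = 32           # assumed separator + child pointer width
overhead = 0.25            # assumed headers and reserved free space

fanout = int(page_bytes * (1 - overhead) / entry_bytes)   # 192 children per page
entries = 50_000_000                                      # index entries to cover

leaf_pages = math.ceil(entries / fanout)   # simplification: same width in leaves
height = 1 + math.ceil(math.log(leaf_pages, fanout))
print(fanout, leaf_pages, height)          # 192, 260417, 4: four page reads, worst case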
That benefit is purchased with overhead. The index needs extra pages, internal separators, and free space management that a heap file would not need. But compared with the fully sorted-file idea from 06.md, the B-tree localizes order maintenance. The engine preserves sorted order inside leaves and routing order above them instead of rewriting a long contiguous run of the base file every time a new quote arrives out of order.
Concept 2: The search path is a sequence of interval decisions from root to leaf
Consider Harbor Point's dashboard query:
SELECT price
FROM municipal_quotes
WHERE cusip = '13042AB7'
AND quote_ts <= '2026-03-31 09:37:00'
ORDER BY quote_ts DESC
LIMIT 1;
With an index on (cusip, quote_ts DESC), the engine starts at the root. It binary-searches the root's separator keys to find which child covers the interval containing 13042AB7. That child page is read next. The same process repeats on the internal page: binary search among separators, choose one child pointer, descend again. Eventually the search reaches a leaf page whose key range includes the target cusip and timestamp boundary.
At the leaf, the engine is no longer routing. It is matching. It searches the sorted leaf entries, finds the first visible entry satisfying the predicate, and either returns the value immediately if the index is covering or follows the row locator into the heap table if the index is secondary. The path therefore has two phases: narrow by intervals in internal pages, then resolve an actual tuple in a leaf.
Range scans use the same descent. Harbor Point's compliance job might ask for all quotes for one cusip between 09:30 and 09:45. The B-tree first finds the starting leaf exactly as a point lookup would. After that, the engine walks leaf-to-leaf in key order until the range ends. This is the detail that makes B-trees much better than hash organization for ordered access: the search path finds the first page, and the leaf chain preserves the continuation order.
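Both phases, and the range continuation, can be traced in a toy sketch. It models pages as plain tuples and stores timestamps as ascending strings for brevity (the real index orders quote_ts DESC); none of it reflects a production page format:

from bisect import bisect_left, bisect_right

def descend(pages, root, key):
    """Walk root -> leaf, one binary search per page."""
    node = pages[root]
    while node[0] == "internal":
        _, separators, children = node
        node = pages[children[bisect_right(separators, key)]]
    return node

def point_lookup(pages, root, key):
    _, keys, locators, _ = descend(pages, root, key)
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        # A covering index stops here; a secondary index follows
        # this locator into the heap table for the row itself.
        return locators[i]
    return None

def range_scan(pages, root, lo, hi):
    """Find the first leaf as a point lookup would, then walk siblings."""
    node = descend(pages, root, lo)
    while node is not None:
        _, keys, locators, next_leaf = node
        for i in range(bisect_left(keys, lo), len(keys)):
            if keys[i] > hi:
                return
            yield keys[i], locators[i]
        node = pages[next_leaf] if next_leaf is not None else None

# A tiny two-level tree: the root routes on one separator, the leaves are chained.
pages = {
    0: ("internal", [("13042AB7", "09:40")], [1, 2]),
    1: ("leaf", [("13042AB7", "09:31"), ("13042AB7", "09:37")],
        ["row17", "row22"], 2),
    2: ("leaf", [("13042AB7", "09:44"), ("13099XK2", "09:30")],
        ["row35", "row41"], None),
}
print(point_lookup(pages, 0, ("13042AB7", "09:37")))          # row22
print(list(range_scan(pages, 0, ("13042AB7", "09:30"),
                                ("13042AB7", "09:45"))))      # three 13042AB7 quotes

Notice that once descend has found the first leaf, range_scan never touches an internal page again; the sibling links carry the rest of the scan.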
Two production details matter here. First, upper levels are tiny relative to leaves, so the buffer pool often keeps the root and maybe the next level resident almost all the time. A cold lookup may therefore pay for one leaf read and one heap read, not a disk hit at every tree level. Second, the parent/child boundary is a correctness contract. If a separator says a child covers keys below 13080, the child really must satisfy that interval. The next lesson, in 08.md, covers the splits and parent updates that preserve that contract during writes.
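That contract is concrete enough to check mechanically. Here is a toy invariant walker over the same tuple-shaped pages as the sketch above; real engines offer comparable verification tooling, for example PostgreSQL's amcheck extension:

def check_boundaries(pages, page_id, lo=None, hi=None):
    """Assert every key under page_id falls in [lo, hi) (toy invariant)."""
    kind, *rest = pages[page_id]
    if kind == "leaf":
        keys = rest[0]
        assert keys == sorted(keys)                       # in-page order holds
        assert all((lo is None or lo <= k) and
                   (hi is None or k < hi) for k in keys)  # parent interval holds
    else:
        separators, children = rest
        bounds = [lo, *separators, hi]
        for child, c_lo, c_hi in zip(children, bounds, bounds[1:]):
            check_boundaries(pages, child, c_lo, c_hi)

# check_boundaries(pages, 0) passes on the sketch tree above and would
# raise the moment a split added keys a parent separator no longer covers.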
Concept 3: Node structure turns index design choices into latency and maintenance behavior
Once you see the B-tree as a stack of page-sized routing tables, several design decisions become easier to reason about. The most obvious is key width. If Harbor Point stores a compact cusip plus timestamp in the key, internal pages keep high fanout. If it adds wide text fields or many extra columns, each page holds fewer entries, the tree gets taller, and each lookup has more work to do. "Same index, different columns" can therefore change latency even when the query predicate looks similar.
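The same height arithmetic from Concept 1 makes the effect visible. The entry widths below are assumptions picked only to show the threshold where an extra level appears:

import math

def tree_height(total_entries, entry_bytes, page_bytes=8192, overhead=0.25):
    fanout = int(page_bytes * (1 - overhead) / entry_bytes)
    leaf_pages = math.ceil(total_entries / fanout)
    return 1 + math.ceil(math.log(leaf_pages, fanout))

print(tree_height(50_000_000, entry_bytes=32))    # compact key: height 4
print(tree_height(50_000_000, entry_bytes=160))   # wide composite key: height 5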
Key order is equally structural. Harbor Point's dashboard and compliance workloads both begin with cusip, so placing cusip first groups all entries for one bond into a tight part of the leaf space. If the index instead began with trading_desk_id, the same cusip values would be scattered across many desk prefixes and the search path would no longer isolate the dashboard's hot access pattern cleanly. The node structure is not neutral about column order; it physically arranges which keys become neighbors.
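A few sorted tuples show the grouping effect directly. The desk and cusip values here are made up; the only point is where one bond's entries land in key order:

entries = [(desk, cusip) for desk in ("D1", "D2", "D3")
                         for cusip in ("13042AB7", "13099XK2")]

by_cusip = sorted((c, d) for d, c in entries)   # cusip leads the key
by_desk  = sorted(entries)                      # trading_desk_id leads the key

positions = lambda rows: [i for i, r in enumerate(rows) if "13042AB7" in r]
print(positions(by_cusip))   # [0, 1, 2]: one bond's entries are neighbors
print(positions(by_desk))    # [0, 2, 4]: the same entries are scattered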
Leaf density creates another trade-off. A covering index can answer more queries without touching the heap because the leaf already stores every needed column. That often improves read latency. The cost is fatter leaves, lower fanout at the bottom of the tree, and more page splits or write amplification as updates arrive. Conversely, a narrow secondary index keeps the tree compact but forces a second hop to the heap after the leaf search completes. That is why "add coverage" is never a free optimization.
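A rough cost model shows the shape of that trade. The entry widths of 40 and 200 bytes are assumptions standing in for "locator only" versus "locator plus included columns":

import math

def lookup_cost(total_entries, leaf_entry_bytes, covering,
                page_bytes=8192, internal_entry_bytes=32, overhead=0.25):
    usable = page_bytes * (1 - overhead)
    internal_fanout = int(usable / internal_entry_bytes)
    leaf_pages = math.ceil(total_entries / int(usable / leaf_entry_bytes))
    height = 1 + math.ceil(math.log(leaf_pages, internal_fanout))
    heap_fetch = 0 if covering else 1          # secondary index pays one more hop
    return leaf_pages, height + heap_fetch

print(lookup_cost(50_000_000, leaf_entry_bytes=40,  covering=False))  # (326798, 5)
print(lookup_cost(50_000_000, leaf_entry_bytes=200, covering=True))   # (1666667, 4)

Under these assumptions the covering variant touches one fewer page per lookup but carries roughly five times as many leaf pages to cache, split, and write back.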
Finally, node structure explains why B-tree correctness work is local but not trivial. Search is cheap because the engine only touches one root-to-leaf path, not the whole ordered dataset. But that means every node on the path must maintain clean key boundaries and enough space discipline for future inserts. Fill factor, page compaction, sibling links, and parent separators are not housekeeping details; they are the reason the tree stays shallow and searchable under live traffic.
Troubleshooting
Issue: An index lookup touches more pages than expected even though the table size has not exploded.
Why it happens / is confusing: Tree height is affected by key width and page occupancy, not just row count. A wider composite key or aggressive covering index can silently reduce fanout and add another level.
Clarification / Fix: Inspect the indexed columns and page density before blaming the query planner. Narrower keys, better prefix choice, or avoiding unnecessary included columns often recover the expected search depth.
Issue: The index scan is ordered, but the query still behaves like random I/O.
Why it happens / is confusing: The B-tree ordered the index entries, not necessarily the heap table. If the leaf stores row locators, each qualifying entry may trigger a separate heap fetch.
Clarification / Fix: Distinguish "ordered index traversal" from "ordered table access." For hot point lookups, a covering index may be worth the extra leaf size. For broader scans, clustering or a different storage layout may matter more.
Issue: Write bursts against one hot cusip create latch pressure near the top of the tree.
Why it happens / is confusing: Reads usually treat upper levels as nearly free because they stay cached, but concentrated writes can force repeated structural changes along the same root-to-leaf path.
Clarification / Fix: Look at page split frequency, fill factor, and hot-key concentration rather than only read metrics. The maintenance mechanics in 08.md are the next place to investigate.
Advanced Connections
Connection 1: B-tree search path ↔ Buffer pool economics
The reason B-tree lookups often feel cheaper than their tree height suggests is that upper levels form a tiny, extremely hot working set. In PostgreSQL and similar engines, a small number of root and internal pages can stay resident while leaves take most of the churn. That is why fanout and key width directly affect cache behavior, not just asymptotic complexity.
Connection 2: B-tree node structure ↔ Split and merge maintenance
The B-tree gives Harbor Point ordered access without keeping the whole table globally sorted, but it only works because page-local changes can update the routing structure safely. A leaf split introduces a new sibling and a new separator in the parent; a merge or deletion path has to preserve the same search boundaries in reverse. That is the exact bridge into 08.md.
Resources
Optional Deepening Resources
- [DOC] PostgreSQL Documentation: B-Tree Indexes
- Focus: Read the implementation notes on B-tree structure, leaf pages, and cascading page splits to connect the lesson's abstractions to a production engine.
- [DOC] SQLite Database File Format: B-tree Pages
- Focus: Inspect the page header, cell pointer array, and right-most child pointer to see how a real engine serializes internal and leaf nodes on disk.
- [ARTICLE] Algorithms Behind Modern Storage Systems
- Focus: Compare the B-tree fanout story with LSM-tree trade-offs so the search path stays grounded in engine-level design choices.
Key Insights
- A B-tree node is a page of routing information - The storage win comes from comparing many ordered keys per page, not from binary-tree-style single-node comparisons.
- The search path is an interval walk - Internal pages narrow the key space, and leaves resolve the actual tuple or row pointer.
- Index geometry is an operational concern - Key width, column order, and leaf payload change fanout, cache residency, and write cost in measurable ways.