LESSON
Day 275: B-Tree Index Internals and Access Paths
A B-Tree is really a strategy for reading as few pages as possible to answer an ordered lookup.
Today's "Aha!" Moment
The insight: A B-Tree is not just "a sorted structure." It is a page-oriented navigation system designed to keep the tree shallow, preserve key order, and make point lookups and range scans efficient on real storage.
Why this matters: Teams often reduce indexes to a simple rule like "indexes make queries faster." That misses the important part: which queries, why, and at what maintenance cost. A B-Tree buys fast ordered access because it carefully uses page fanout and sorted keys to avoid scanning the whole table.
The universal pattern:
- keys are kept in sorted order
- internal pages route the search
- leaf pages hold entries in key order
- each lookup walks a short root-to-leaf path
- updates may split or rebalance pages to keep the structure useful
Concrete anchor: A users table has 50 million rows. Without an index on email, checking whether ana@example.com exists may require scanning huge portions of the table. With a B-Tree, the engine navigates a few internal pages, lands on the relevant leaf page, and then finds the exact key or the first matching range in sorted order.
How to recognize when this applies:
- You need fast equality lookup by a sortable key.
- You need range queries like created_at BETWEEN ....
- You care about ordered retrieval or prefix access patterns.
- Write cost is acceptable in exchange for much better read navigation.
Common misconceptions:
- [INCORRECT] "A B-Tree is basically a binary search tree on disk."
- [INCORRECT] "An index makes every query fast."
- [CORRECT] The truth: B-Trees are wide, page-aware search structures that help with access paths matching their key order, but they add write and maintenance cost and do nothing for mismatched predicates.
Real-world examples:
- Point lookup: Find one user by primary key or unique email.
- Range scan: Read all orders from one day in key order using an index on created_at.
Why This Matters
The problem: Tables can grow large enough that "just scan it" becomes too expensive for many common access patterns. But indexes are not free either. They duplicate key information, consume pages, and make inserts, updates, and deletes more expensive because the navigation structure must stay ordered.
Before:
- Indexes are treated like generic speed boosters.
- Teams create indexes without thinking about access patterns.
- Write slowdowns or bloated maintenance cost seem mysterious.
After:
- B-Trees are understood as ordered page navigation structures.
- Teams can predict when an index helps equality and range access.
- Read benefits are weighed against split cost, space overhead, and write amplification.
Real-world impact: This understanding prevents cargo-cult indexing and makes later lessons on engine trade-offs much clearer, especially when comparing B-Trees with LSM-based designs.
Learning Objectives
By the end of this session, you will be able to:
- Explain why B-Trees are page-oriented rather than binary trees - Understand fanout and shallow height as the core performance win.
- Describe how lookups and range scans use the tree - Reason about internal nodes, leaf nodes, and sorted access paths.
- Evaluate B-Tree trade-offs - Recognize why ordered reads are fast and why writes pay maintenance cost.
Core Concepts Explained
Concept 1: B-Trees Win by Maximizing Fanout, Not by Being "Tree-Shaped"
If you stored keys in a naive binary tree on disk, each comparison might require another pointer chase and another page read. That would be terrible.
B-Trees solve this by making each node wide enough to hold:
- many keys
- many child pointers
inside one page.
That means one page read buys many routing decisions at once.
So instead of a tall skinny tree, you get:
- a wide shallow tree
This is the real performance trick.
Suppose an internal page can hold hundreds of keys and child pointers. Then each level narrows the search space enormously, and the full path from:
- root -> internal -> leaf
may only take a handful of page touches, even for large tables.
That is why B-Trees are designed around page layout:
- page size
- key size
- pointer count
- fill factor
all influence the fanout and therefore the height.
So the key mental model is:
- B-Tree performance is about minimizing page hops
not merely:
- "doing comparisons faster"
Concept 2: Leaves Store Ordered Access Paths, Which Is Why Range Queries Work
The B-Tree is valuable not only because it finds one key quickly, but because it preserves order.
That order is what makes:
- = (equality)
- <, >, BETWEEN (ranges)
- prefix/range scans
so useful with a B-Tree.
The lookup pattern is usually:
- start at the root
- compare the target key against separator keys
- follow the chosen child pointer
- repeat until the leaf
- inspect the leaf entry or scan forward through nearby leaf entries
For point lookup, the engine stops at the exact match or the correct leaf location.
For range access, the engine finds the first qualifying leaf entry and then continues in sorted order.
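The walk above can be sketched with a tiny in-memory tree. This is a toy model, not a real engine: one internal level of separator keys routing into sorted leaves, with `bisect` standing in for the in-page binary search:

```python
from bisect import bisect_left, bisect_right

# Toy two-level tree (all data hypothetical).
leaves = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]
separators = [40, 70]  # first key of each leaf after the first

def find_leaf(key):
    """One internal-level routing decision: compare against separator keys."""
    return leaves[bisect_right(separators, key)]

def point_lookup(key):
    """Stop at the exact match (or its would-be position) in the leaf."""
    leaf = find_leaf(key)
    i = bisect_left(leaf, key)
    return i < len(leaf) and leaf[i] == key

def range_scan(lo, hi):
    """Find the first qualifying entry, then continue in sorted order."""
    out = []
    start = bisect_right(separators, lo)
    for leaf in leaves[start:]:
        for k in leaf:
            if k > hi:
                return out
            if k >= lo:
                out.append(k)
    return out

print(point_lookup(50))    # True
print(range_scan(25, 65))  # [30, 40, 50, 60]
```

Note that the range scan never restarts from the root: once it lands on the first qualifying leaf entry, it just keeps reading forward in key order.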
This is why B-Trees are such a strong default for:
- primary keys
- unique lookups
- range scans on timestamps, IDs, or sortable fields
It is also why they are less helpful when the predicate does not align with key order.
For example, an index on (last_name, first_name) helps:
- last_name = 'Garcia'
- last_name = 'Garcia' AND first_name >= 'M'
but not necessarily:
first_name = 'Ana'
because the index order is not arranged around that second column alone.
So access-path thinking matters as much as tree structure.
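A sorted list of tuples is enough to see why the composite index behaves this way. The names and sentinel bound below are made up for the sketch:

```python
from bisect import bisect_left, bisect_right

# Hypothetical composite index on (last_name, first_name), kept in key order.
index = sorted([
    ("Garcia", "Ana"), ("Garcia", "Maria"), ("Garcia", "Pedro"),
    ("Ito", "Ana"), ("Smith", "Ana"), ("Smith", "Zoe"),
])

# Leading-column predicate: one contiguous slice of the index.
lo = bisect_left(index, ("Garcia", ""))
hi = bisect_right(index, ("Garcia", "\uffff"))  # crude upper bound for the sketch
print(index[lo:hi])  # all Garcia entries, adjacent in key order

# Prefix + range on the second column: still one contiguous slice.
lo2 = bisect_left(index, ("Garcia", "M"))
print(index[lo2:hi])  # ("Garcia", "Maria"), ("Garcia", "Pedro")

# first_name = 'Ana' alone: matches are scattered across the key space,
# so no contiguous range of this index answers it.
print([e for e in index if e[1] == "Ana"])
```

The first two predicates map to one contiguous slice of the sorted entries; the last one forces a walk over the whole structure, which is the "mismatched predicate" case from earlier.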
Concept 3: Fast Ordered Reads Are Bought With Write Maintenance
Every insert, delete, or key-changing update has to preserve B-Tree invariants:
- keys remain sorted
- node capacities stay valid
- navigation remains correct
That means writes may cause:
- page inserts in sorted position
- page splits when a node is full
- redistribution or merge in some delete-heavy cases
- additional WAL and recovery work
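A page split can be sketched in a few lines. The capacity here is absurdly small on purpose, and real engines promote the separator key into the parent page rather than returning it:

```python
from bisect import insort

CAPACITY = 4  # tiny hypothetical page capacity, for illustration

def insert_with_split(leaf, key):
    """Insert in sorted position; split into two half-full leaves when over capacity."""
    insort(leaf, key)
    if len(leaf) <= CAPACITY:
        return [leaf]
    mid = len(leaf) // 2
    # leaf[mid] would become the separator key promoted to the parent.
    return [leaf[:mid], leaf[mid:]]

print(insert_with_split([10, 20, 30], 25))      # no split: [[10, 20, 25, 30]]
print(insert_with_split([10, 20, 30, 40], 25))  # split:    [[10, 20], [25, 30, 40]]
```

Even this toy version shows where the cost comes from: an insert into a full page turns one page write into several, plus parent maintenance.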
This is the central trade-off:
- B-Trees make many reads cheap
- but they make writes more expensive than an unordered heap append
The cost gets more visible when:
- keys are random and cause frequent page churn
- rows are large
- there are many secondary indexes
- updates touch indexed columns
This also explains why engines sometimes separate:
- heap/table storage
- index storage
The table may hold full row payloads, while the B-Tree holds keys plus references. That keeps navigation compact enough to preserve high fanout.
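The fanout benefit of storing keys plus references, rather than full rows, is easy to quantify. The sizes below are hypothetical:

```python
PAGE_SIZE = 8192        # a common page size; engines vary

row_bytes = 400         # full row payload (hypothetical)
entry_bytes = 24        # index key + row reference (hypothetical)

rows_per_heap_page = PAGE_SIZE // row_bytes        # 20 full rows per page
entries_per_index_page = PAGE_SIZE // entry_bytes  # 341 index entries per page

print(rows_per_heap_page, entries_per_index_page)
```

Under these assumptions an index page routes over an order of magnitude more keys than a heap page holds rows, which is what keeps the tree shallow.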
And it sets up the next lesson cleanly:
- once B-Trees explain the ordered-read side of the trade space, LSM trees will explain the write-optimized side and why compaction exists
Troubleshooting
Issue: "We added an index, but the query is still slow."
Why it happens / is confusing: Teams assume every predicate benefits from every index.
Clarification / Fix: Check whether the predicate matches the index key order and selectivity. A B-Tree only helps when the access pattern aligns with its sorted path.
Issue: "Writes became slower after we added several indexes."
Why it happens / is confusing: The read improvement is visible, but index maintenance cost is less obvious.
Clarification / Fix: Remember that every indexed write may now update multiple ordered structures, potentially causing page splits, extra WAL, and more cache churn.
Issue: "Why are B-Trees usually shallow even on very large tables?"
Why it happens / is confusing: People imagine binary trees with many levels.
Clarification / Fix: Each page can hold many keys and child pointers, so fanout is large and height stays low. That is the main reason the structure works well on storage devices.
Advanced Connections
Connection 1: B-Trees <-> Pages and Record Layout
The parallel: The previous lesson showed that storage engines think in pages. B-Tree nodes are just specially organized pages optimized for navigation rather than plain row storage.
Real-world case: Node fanout, split cost, and pointer layout are direct consequences of how much a page can hold.
Connection 2: B-Trees <-> LSM Trees
The parallel: This lesson explains the classic ordered-read, in-place-maintenance design. The next lesson will contrast it with LSM trees, which accept heavier read/compaction trade-offs to make writes cheaper and more sequential.
Real-world case: Choosing B-Tree versus LSM is often choosing where you want to pay the maintenance bill: incrementally during writes or later during compaction and reads.
Resources
Optional Deepening Resources
- [BOOK] Database Internals
- Link: https://www.databass.dev/
- Focus: Use it for practical explanations of B-Tree page structure, splits, and storage-engine trade-offs.
- [DOCS] PostgreSQL Documentation: Indexes
- Link: https://www.postgresql.org/docs/current/indexes.html
- Focus: Good practical reference for how B-Tree-style indexes support access paths in a production row-store.
- [DOCS] SQLite Query Planner Overview
- Link: https://www.sqlite.org/queryplanner.html
- Focus: Useful for seeing how ordered indexes become real access paths for equality and range predicates.
- [BOOK] Designing Data-Intensive Applications
- Link: https://dataintensive.net/
- Focus: Helpful for tying index design back to storage engines and workload trade-offs.
Key Insights
- B-Trees optimize page hops, not just comparisons - Wide nodes keep the tree shallow, which is what makes large indexes practical on storage.
- Sorted leaves are what make range access powerful - Equality and ordered scans both come from preserving key order through the tree.
- Read speed is bought with write maintenance - Splits, ordered inserts, and multi-index upkeep are the price of fast navigable access paths.