LESSON
Day 275: B-Tree Index Internals and Access Paths
A B-Tree is really a strategy for reading as few pages as possible to answer an ordered lookup.
Today's "Aha!" Moment
The insight: A B-Tree is not just "a sorted structure." It is a page-oriented navigation system designed to keep the tree shallow, preserve key order, and make point lookups and range scans efficient on real storage.
Why this matters: Teams often reduce indexes to a simple rule like "indexes make queries faster." That misses the important part: which queries, why, and at what maintenance cost. A B-Tree buys fast ordered access because it carefully uses page fanout and sorted keys to avoid scanning the whole table.
The universal pattern:
- keys are kept in sorted order
- internal pages route the search
- leaf pages hold entries in key order
- each lookup walks a short root-to-leaf path
- updates may split or rebalance pages to keep the structure useful
Concrete anchor: A users table has 50 million rows. Without an index on email, checking whether ana@example.com exists may require scanning huge portions of the table. With a B-Tree, the engine navigates a few internal pages, lands on the relevant leaf page, and then finds the exact key or the first matching range in sorted order.
How to recognize when this applies:
- You need fast equality lookup by a sortable key.
- You need range queries like created_at BETWEEN ....
- You care about ordered retrieval or prefix access patterns.
- Write cost is acceptable in exchange for much better read navigation.
Common misconceptions:
- [INCORRECT] "A B-Tree is basically a binary search tree on disk."
- [INCORRECT] "An index makes every query fast."
- [CORRECT] The truth: B-Trees are wide, page-aware search structures that help with access paths matching their key order, but they add write and maintenance cost and do nothing for mismatched predicates.
Real-world examples:
- Point lookup: Find one user by primary key or unique email.
- Range scan: Read all orders from one day in key order using an index on created_at.
Why This Matters
The problem: Tables can grow large enough that "just scan it" becomes too expensive for many common access patterns. But indexes are not free either. They duplicate key information, consume pages, and make inserts, updates, and deletes more expensive because the navigation structure must stay ordered.
Before:
- Indexes are treated like generic speed boosters.
- Teams create indexes without thinking about access patterns.
- Write slowdowns or bloated maintenance cost seem mysterious.
After:
- B-Trees are understood as ordered page navigation structures.
- Teams can predict when an index helps equality and range access.
- Read benefits are weighed against split cost, space overhead, and write amplification.
Real-world impact: This understanding prevents cargo-cult indexing and makes later lessons on engine trade-offs much clearer, especially when comparing B-Trees with LSM-based designs.
Learning Objectives
By the end of this session, you will be able to:
- Explain why B-Trees are page-oriented rather than binary trees - Understand fanout and shallow height as the core performance win.
- Describe how lookups and range scans use the tree - Reason about internal nodes, leaf nodes, and sorted access paths.
- Evaluate B-Tree trade-offs - Recognize why ordered reads are fast and why writes pay maintenance cost.
Core Concepts Explained
Concept 1: B-Trees Win by Maximizing Fanout, Not by Being "Tree-Shaped"
If you stored keys in a naive binary tree on disk, each comparison might require another pointer chase and another page read. That would be terrible.
B-Trees solve this by making each node wide enough to hold:
- many keys
- many child pointers
inside one page.
That means one page read buys many routing decisions at once.
So instead of a tall skinny tree, you get:
- a wide shallow tree
This is the real performance trick.
Suppose an internal page can hold hundreds of keys and child pointers. Then each level narrows the search space enormously, and the full path from:
- root -> internal -> leaf
may only take a handful of page touches, even for large tables.
That is why B-Trees are designed around page layout:
- page size
- key size
- pointer count
- fill factor
all influence the fanout and therefore the height.
So the key mental model is:
- B-Tree performance is about minimizing page hops
not merely:
- "doing comparisons faster"
Concept 2: Leaves Store Ordered Access Paths, Which Is Why Range Queries Work
The B-Tree is valuable not only because it finds one key quickly, but because it preserves order.
That order is what makes:
- = (equality)
- <, >, BETWEEN (ranges)
- prefix/range scans
so useful with a B-Tree.
The lookup pattern is usually:
- start at the root
- compare the target key against separator keys
- follow the chosen child pointer
- repeat until the leaf
- inspect the leaf entry or scan forward through nearby leaf entries
For point lookup, the engine stops at the exact match or the correct leaf location.
For range access, the engine finds the first qualifying leaf entry and then continues in sorted order.
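The walk above can be sketched with a tiny in-memory tree. This is a toy model, not a real engine: one internal level of separator keys routing into sorted leaves, with `bisect` standing in for the in-page binary search:

```python
from bisect import bisect_left, bisect_right

# Toy two-level tree (all data hypothetical).
leaves = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]
separators = [40, 70]  # first key of each leaf after the first

def find_leaf(key):
    """One internal-level routing decision: compare against separator keys."""
    return leaves[bisect_right(separators, key)]

def point_lookup(key):
    """Stop at the exact match (or its would-be position) in the leaf."""
    leaf = find_leaf(key)
    i = bisect_left(leaf, key)
    return i < len(leaf) and leaf[i] == key

def range_scan(lo, hi):
    """Find the first qualifying entry, then continue in sorted order."""
    out = []
    start = bisect_right(separators, lo)
    for leaf in leaves[start:]:
        for k in leaf:
            if k > hi:
                return out
            if k >= lo:
                out.append(k)
    return out

print(point_lookup(50))    # True
print(range_scan(25, 65))  # [30, 40, 50, 60]
```

Note that the range scan never restarts from the root: once it lands on the first qualifying leaf entry, it just keeps reading forward in key order.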
This is why B-Trees are such a strong default for:
- primary keys
- unique lookups
- range scans on timestamps, IDs, or sortable fields
It is also why they are less helpful when the predicate does not align with key order.
For example, an index on (last_name, first_name) helps:
- last_name = 'Garcia'
- last_name = 'Garcia' AND first_name >= 'M'
but not necessarily:
first_name = 'Ana'
because the index order is not arranged around that second column alone.
So access-path thinking matters as much as tree structure.
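A sorted list of tuples is enough to see why the composite index behaves this way. The names and sentinel bound below are made up for the sketch:

```python
from bisect import bisect_left, bisect_right

# Hypothetical composite index on (last_name, first_name), kept in key order.
index = sorted([
    ("Garcia", "Ana"), ("Garcia", "Maria"), ("Garcia", "Pedro"),
    ("Ito", "Ana"), ("Smith", "Ana"), ("Smith", "Zoe"),
])

# Leading-column predicate: one contiguous slice of the index.
lo = bisect_left(index, ("Garcia", ""))
hi = bisect_right(index, ("Garcia", "\uffff"))  # crude upper bound for the sketch
print(index[lo:hi])  # all Garcia entries, adjacent in key order

# Prefix + range on the second column: still one contiguous slice.
lo2 = bisect_left(index, ("Garcia", "M"))
print(index[lo2:hi])  # ("Garcia", "Maria"), ("Garcia", "Pedro")

# first_name = 'Ana' alone: matches are scattered across the key space,
# so no contiguous range of this index answers it.
print([e for e in index if e[1] == "Ana"])
```

The first two predicates map to one contiguous slice of the sorted entries; the last one forces a walk over the whole structure, which is the "mismatched predicate" case from earlier.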
Concept 3: Fast Ordered Reads Are Bought With Write Maintenance
Every insert, delete, or key-changing update has to preserve B-Tree invariants:
- keys remain sorted
- node capacities stay valid
- navigation remains correct
That means writes may cause:
- page inserts in sorted position
- page splits when a node is full
- redistribution or merge in some delete-heavy cases
- additional WAL and recovery work
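A page split can be sketched in a few lines. The capacity here is absurdly small on purpose, and real engines promote the separator key into the parent page rather than returning it:

```python
from bisect import insort

CAPACITY = 4  # tiny hypothetical page capacity, for illustration

def insert_with_split(leaf, key):
    """Insert in sorted position; split into two half-full leaves when over capacity."""
    insort(leaf, key)
    if len(leaf) <= CAPACITY:
        return [leaf]
    mid = len(leaf) // 2
    # leaf[mid] would become the separator key promoted to the parent.
    return [leaf[:mid], leaf[mid:]]

print(insert_with_split([10, 20, 30], 25))      # no split: [[10, 20, 25, 30]]
print(insert_with_split([10, 20, 30, 40], 25))  # split:    [[10, 20], [25, 30, 40]]
```

Even this toy version shows where the cost comes from: an insert into a full page turns one page write into several, plus parent maintenance.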
This is the central trade-off:
- B-Trees make many reads cheap
- but they make writes more expensive than an unordered heap append
The cost gets more visible when:
- keys are random and cause frequent page churn
- rows are large
- there are many secondary indexes
- updates touch indexed columns
This also explains why engines sometimes separate:
- heap/table storage
- index storage
The table may hold full row payloads, while the B-Tree holds keys plus references. That keeps navigation compact enough to preserve high fanout.
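The fanout benefit of storing keys plus references, rather than full rows, is easy to quantify. The sizes below are hypothetical:

```python
PAGE_SIZE = 8192        # a common page size; engines vary

row_bytes = 400         # full row payload (hypothetical)
entry_bytes = 24        # index key + row reference (hypothetical)

rows_per_heap_page = PAGE_SIZE // row_bytes        # 20 full rows per page
entries_per_index_page = PAGE_SIZE // entry_bytes  # 341 index entries per page

print(rows_per_heap_page, entries_per_index_page)
```

Under these assumptions an index page routes over an order of magnitude more keys than a heap page holds rows, which is what keeps the tree shallow.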
And it sets up the next lesson cleanly:
- once B-Trees explain the ordered-read side of the trade space, LSM trees will explain the write-optimized side and why compaction exists
Troubleshooting
Issue: "We added an index, but the query is still slow."
Why it happens / is confusing: Teams assume every predicate benefits from every index.
Clarification / Fix: Check whether the predicate matches the index key order and selectivity. A B-Tree only helps when the access pattern aligns with its sorted path.
Issue: "Writes became slower after we added several indexes."
Why it happens / is confusing: The read improvement is visible, but index maintenance cost is less obvious.
Clarification / Fix: Remember that every indexed write may now update multiple ordered structures, potentially causing page splits, extra WAL, and more cache churn.
Issue: "Why are B-Trees usually shallow even on very large tables?"
Why it happens / is confusing: People imagine binary trees with many levels.
Clarification / Fix: Each page can hold many keys and child pointers, so fanout is large and height stays low. That is the main reason the structure works well on storage devices.
Advanced Connections
Connection 1: B-Trees <-> Pages and Record Layout
The parallel: The previous lesson showed that storage engines think in pages. B-Tree nodes are just specially organized pages optimized for navigation rather than plain row storage.
Real-world case: Node fanout, split cost, and pointer layout are direct consequences of how much a page can hold.
Connection 2: B-Trees <-> LSM Trees
The parallel: This lesson explains the classic ordered-read, in-place-maintenance design. The next lesson will contrast it with LSM trees, which accept heavier read/compaction trade-offs to make writes cheaper and more sequential.
Real-world case: Choosing B-Tree versus LSM is often choosing where you want to pay the maintenance bill: incrementally during writes or later during compaction and reads.
Resources
Optional Deepening Resources
- [BOOK] Database Internals
- Link: https://www.databass.dev/
- Focus: Use it for practical explanations of B-Tree page structure, splits, and storage-engine trade-offs.
- [DOCS] PostgreSQL Documentation: Indexes
- Link: https://www.postgresql.org/docs/current/indexes.html
- Focus: Good practical reference for how B-Tree-style indexes support access paths in a production row-store.
- [DOCS] SQLite Query Planner Overview
- Link: https://www.sqlite.org/queryplanner.html
- Focus: Useful for seeing how ordered indexes become real access paths for equality and range predicates.
- [BOOK] Designing Data-Intensive Applications
- Link: https://dataintensive.net/
- Focus: Helpful for tying index design back to storage engines and workload trade-offs.
Key Insights
- B-Trees optimize page hops, not just comparisons - Wide nodes keep the tree shallow, which is what makes large indexes practical on storage.
- Sorted leaves are what make range access powerful - Equality and ordered scans both come from preserving key order through the tree.
- Read speed is bought with write maintenance - Splits, ordered inserts, and multi-index upkeep are the price of fast navigable access paths.