LESSON
Day 279: Buffer Pools, Caching, and I/O Scheduling
The core idea: a good query plan still needs the right pages in memory at the right time, and a storage engine lives or dies by how it manages that working set and schedules background I/O around foreground reads.
Today's "Aha!" Moment
The insight: Query cost is not just about operators. It is also about whether the pages those operators need are already hot, whether dirty pages are blocking eviction, and whether background writeback is quietly competing with foreground latency.
Why this matters: Two identical queries can behave very differently depending on cache warmth, eviction pressure, and I/O scheduling.
Concrete anchor: Yesterday's query plan looked fine in staging, but production is slower. The SQL did not change. The data did not change much. What changed was the memory working set and the fact that checkpoint writeback, scans, and index probes are now competing for the same storage path.
The practical sentence to remember:
Execution plans decide what work is needed; buffer pools and I/O scheduling decide how painful that work becomes.
Why This Matters
The problem: Storage is orders of magnitude slower than CPU, so engines try hard to keep useful pages in memory. But memory is limited, some pages are dirty, some are pinned by active work, and scans can easily evict data that short point-lookups needed.
Without this model:
- Teams over-focus on SQL text and under-focus on memory residency.
- Cache hit rate is treated as a vanity metric instead of a workload signal.
- Writeback storms and flush pressure get mistaken for "random storage slowness."
With this model:
- You can reason about why a warm-cache query is fast and a cold-cache query is not.
- You can explain why eviction and dirty-page policy matter as much as index choice.
- You can see I/O scheduling as part of query performance, not just a storage-layer concern.
Operational payoff: Better sizing of memory, better tuning of shared buffers and flush policy, fewer checkpoint spikes, and a clearer understanding of why some workloads get hurt by scans while others get hurt by writeback.
Learning Objectives
By the end of this lesson, you should be able to:
- Explain the role of the buffer pool as the managed in-memory working set for data pages.
- Describe the main cache-management decisions around hits, misses, dirty pages, eviction, and scan pollution.
- Reason about I/O scheduling trade-offs between foreground latency and background maintenance work.
Core Concepts Explained
Concept 1: The Buffer Pool Is the Engine's Working Memory
Concrete example / mini-scenario: A query needs a B-Tree root page, then a few internal pages, then a leaf page, then finally the target row page. If all of them are already in memory, the query feels instant. If not, each miss may trigger storage I/O.
Intuition: A buffer pool is not just "a cache." It is the engine's actively managed set of database pages in memory.
What lives there:
- Table pages
- Index pages
- Metadata pages
- Sometimes temporary or internal work pages
How it works mechanically:
- An operator requests a page.
- If the page is already resident, the engine gets a buffer hit.
- If not, the engine triggers a buffer miss and loads the page from storage.
- The page is pinned while in use so it cannot be evicted mid-operation.
- If modified, the page becomes dirty and must be written back to storage before it can be evicted or before a checkpoint can complete.
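The mechanics above can be sketched in a few dozen lines. This is a minimal illustration, not any real engine's API; the class and method names (`BufferPool`, `get_page`, `unpin`) are invented for the example.

```python
class Page:
    def __init__(self, page_id, data):
        self.page_id = page_id
        self.data = data
        self.pin_count = 0   # >0 means an operator is actively using the page
        self.dirty = False   # modified in memory, not yet written back

class BufferPool:
    def __init__(self, capacity, storage):
        self.capacity = capacity
        self.storage = storage   # dict of page_id -> data, standing in for disk
        self.pages = {}          # resident pages, keyed by page_id
        self.hits = 0
        self.misses = 0

    def get_page(self, page_id):
        if page_id in self.pages:            # buffer hit: no storage I/O
            self.hits += 1
        else:                                # buffer miss: fetch from storage
            self.misses += 1
            if len(self.pages) >= self.capacity:
                self._evict_one()
            self.pages[page_id] = Page(page_id, self.storage[page_id])
        page = self.pages[page_id]
        page.pin_count += 1                  # pinned: cannot be evicted mid-use
        return page

    def unpin(self, page, dirty=False):
        page.pin_count -= 1
        page.dirty = page.dirty or dirty

    def _evict_one(self):
        for pid, page in self.pages.items():
            if page.pin_count == 0:          # pinned pages are not candidates
                if page.dirty:               # dirty pages must be flushed first
                    self.storage[pid] = page.data
                del self.pages[pid]
                return
        raise RuntimeError("all pages pinned; nothing to evict")
```

Note how every rule from the list above shows up as a branch: hits skip storage, misses may force an eviction, pinned pages are skipped, and dirty pages cost a write before they free memory.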
Why engines manage this directly:
- They understand page semantics better than the generic OS cache.
- They need precise control over dirty pages, checkpoints, and eviction.
- They often need to distinguish foreground critical pages from background scan traffic.
Important subtlety: In many systems, the database cache and the OS page cache can interact awkwardly. This is why some engines prefer direct I/O or careful buffer sizing: too many layers of caching can waste memory and hide pressure until it hurts.
Concept 2: Cache Policy Is About Protecting the Right Working Set
Concrete example / mini-scenario: A reporting query sequentially scans a huge table. Meanwhile, OLTP traffic keeps asking for the same small hot index pages. If the scan pollutes the cache, point lookups suddenly slow down.
Intuition: The hard question is not "should we cache?" It is "which pages deserve to stay?"
Common pressures on eviction policy:
- Hot small working sets versus large cold scans
- Read-mostly pages versus dirty pages that are expensive to evict
- Random point lookups versus sequential access
Why a naive policy fails:
- Plain LRU can be polluted by long scans.
- Dirty pages cannot be dropped as cheaply as clean ones.
- Pages pinned by active operators cannot be evicted at all.
What engines typically add:
- Clock or second-chance style replacement
- Separate treatment for sequential scans
- Background flushing to create clean eviction candidates
- Readahead or prefetch when access patterns look predictable
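The first two additions can be combined in one small sketch: clock (second-chance) replacement where sequentially scanned pages enter "cold," so one sweep of the hand reclaims them before they displace the hot working set. This is illustrative only; real engines differ in the details, and the `sequential_scan` hint is an assumption of this example.

```python
class ClockCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = []    # list of [page_id, referenced_bit]
        self.hand = 0      # clock hand: next slot to examine on eviction
        self.index = {}    # page_id -> slot position

    def access(self, page_id, sequential_scan=False):
        if page_id in self.index:
            # Hit: set the reference bit so the page gets a second chance.
            self.slots[self.index[page_id]][1] = 1
            return "hit"
        # Scan pages enter with bit 0 ("cold"), so the next sweep can
        # evict them before they push out the hot working set.
        entry = [page_id, 0 if sequential_scan else 1]
        if len(self.slots) < self.capacity:
            self.index[page_id] = len(self.slots)
            self.slots.append(entry)
            return "miss"
        # Full: sweep the hand, clearing reference bits, until a page
        # with bit 0 is found; that page is evicted and replaced.
        while True:
            victim_id, ref = self.slots[self.hand]
            if ref:
                self.slots[self.hand][1] = 0
            else:
                del self.index[victim_id]
                self.slots[self.hand] = entry
                self.index[page_id] = self.hand
                self.hand = (self.hand + 1) % self.capacity
                return "miss"
            self.hand = (self.hand + 1) % self.capacity
```

Running a long scan through this cache evicts the scan's own pages against each other, while a repeatedly re-referenced hot page keeps earning its second chance and stays resident.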
The key systems idea:
Eviction policy is really admission control for memory pressure.
If the engine protects the wrong pages, I/O explodes even though nothing about the query text changed.
Connection to yesterday's lesson: A great query plan can still run badly if the pages it expects to probe cheaply keep getting evicted by unrelated work.
Concept 3: I/O Scheduling Is a Latency Policy, Not Just Plumbing
Concrete example / mini-scenario: Foreground user queries need low-latency random reads, but a checkpoint starts flushing large numbers of dirty pages and saturates the device queue.
Intuition: Once reads, writeback, checkpointing, log flushes, and scans all share the same storage path, "I/O scheduling" becomes a policy question about whose latency matters most right now.
Main sources of I/O pressure:
- Buffer misses during reads
- Dirty page flushes
- Checkpoints
- Background compaction or vacuum-like maintenance
- Readahead for scans
Why background work is dangerous:
- It is usually necessary for correctness or long-term health.
- But if scheduled badly, it steals bandwidth from foreground latency-sensitive requests.
Typical balancing mechanisms:
- Limit checkpoint aggressiveness
- Smooth writeback instead of letting debt accumulate into a flush storm
- Separate log devices from data devices when possible
- Throttle background maintenance
- Batch I/O when it improves throughput without blowing tail latency
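The "smooth writeback" idea above reduces to a pacing calculation: instead of flushing all accumulated dirty pages in one burst at checkpoint time, spread the debt evenly across the interval. The function name and numbers below are illustrative assumptions, not a real engine's tuning knobs.

```python
def flush_schedule(dirty_pages, checkpoint_interval_s, tick_s=1.0):
    """Return how many pages to flush on each tick so the dirty-page
    debt is fully paid, smoothly, by the end of the interval."""
    ticks = max(1, int(checkpoint_interval_s / tick_s))
    base, extra = divmod(dirty_pages, ticks)
    # Spread the remainder over the first `extra` ticks so no tick
    # flushes more than one page above the average rate.
    return [base + (1 if i < extra else 0) for i in range(ticks)]

plan = flush_schedule(dirty_pages=10_000, checkpoint_interval_s=30)
assert sum(plan) == 10_000          # all debt paid within the interval
assert max(plan) - min(plan) <= 1   # no single-tick flush storm
```

The same shape appears in real systems as checkpoint completion targets: the engine aims to finish writeback just before the next checkpoint rather than all at once at the start.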
The trade-off:
- Aggressive flushing reduces future risk and can shorten recovery.
- But it can damage current latency.
- Delaying background work protects immediate queries for a while.
- But it increases future debt and raises the chance of bigger latency spikes later.
Mental model:
The buffer pool decides what deserves scarce memory.
The I/O scheduler decides which debt gets paid first when memory is no longer enough.
Troubleshooting
Issue: Queries are fast after warm-up but very slow after restart or failover.
Why it happens: The working set is no longer resident, so operators that used to hit memory now miss and fetch from storage.
Clarification / Fix: Distinguish cold-cache from warm-cache performance. Capacity planning should consider both, especially for failover scenarios.
Issue: Checkpoints cause periodic latency spikes.
Why it happens: Dirty-page flushing may be too bursty, so checkpoint work competes with foreground reads and writes in large bursts.
Clarification / Fix: Smooth writeback, revisit checkpoint cadence, and measure how much dirty-page debt accumulates between flush phases.
Issue: Cache hit rate looks acceptable, but user-facing latency is still unstable.
Why it happens: Averages can hide the fact that the pages missing are exactly the ones on latency-sensitive paths, or that background writes are saturating storage anyway.
Clarification / Fix: Look beyond overall hit rate. Segment by workload class, page type, and tail latency.
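A tiny worked example shows why segmentation matters. The workload classes and numbers here are made up for illustration: a dashboard-friendly 95% overall hit rate can coexist with a 0% hit rate on exactly the latency-sensitive path.

```python
def hit_rate(events):
    """Fraction of page accesses that were buffer hits."""
    hits = sum(1 for e in events if e["hit"])
    return hits / len(events)

events = (
    [{"class": "oltp_point_lookup", "hit": False}] * 50 +   # hot path missing
    [{"class": "batch_scan", "hit": True}] * 950            # scans hitting
)

overall = hit_rate(events)
oltp = hit_rate([e for e in events if e["class"] == "oltp_point_lookup"])
assert overall == 0.95   # looks healthy in aggregate
assert oltp == 0.0       # yet every latency-sensitive lookup misses
```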
Issue: Large scans hurt unrelated OLTP traffic.
Why it happens: Scan pages may pollute the buffer pool and evict hot pages needed by point lookups.
Clarification / Fix: Use scan-aware caching behavior, readahead policy, or separate analytical workloads when needed.
Advanced Connections
Connection 1: Buffer Pools <-> WAL and Checkpoints
The parallel: Dirty pages can stay in memory for performance, but WAL is what makes that safe. Checkpoints then decide how much of that dirty state should be pushed out to bound recovery time.
Why this matters: Memory management, durability, and recovery are one system, not three separate topics.
Connection 2: Buffer Pools <-> Concurrency Control
The parallel: Once many queries touch the same in-memory pages, coordination matters too. That is why the next lesson moves naturally into locks, latches, and deadlocks.
Why this matters: Hot pages are good for I/O, but they can also become hot contention points.
Suggested Resources
- [DOC] PostgreSQL Resource Consumption and Shared Buffers - Documentation
  Focus: practical tuning intuition for memory, buffers, and cache-related trade-offs.
- [DOC] MySQL InnoDB Buffer Pool - Documentation
  Focus: a concrete production example of how a major engine manages its in-memory working set.
- [BOOK] Database Internals - Book site
  Focus: deep mental models for page caching, buffer management, and storage-engine behavior.
Key Insights
- The buffer pool is the engine's managed working set, not just a generic cache.
- Eviction policy is a workload decision because the wrong pages in memory can ruin an otherwise good plan.
- I/O scheduling is part of latency control because background maintenance and foreground queries compete for the same storage path.