LESSON
Day 414: Hot/Cold Data Tiering and Working Set Control
The core idea: Tiering works only when the engine protects the true working set and keeps the metadata that locates colder data on a fast path; moving bytes to cheaper storage without controlling promotions, cache pollution, and metadata locality just converts storage savings into unpredictable tail latency.
Today's "Aha!" Moment
In 13.md, Harbor Point learned that dead MVCC versions can quietly occupy pages that should belong to live traffic. Vacuum helped, but the market-open slowdown did not disappear. The next bottleneck was retention. Harbor Point must keep seven years of reservation history for compliance, yet the matching and risk systems care almost entirely about today's open and recently filled reservations. When the quarterly surveillance job scanned older history, the fast tier and buffer pool were still being asked to host both the trading path's hot pages and a mountain of history that almost never mattered to that path.
That is the shift for this lesson: hot/cold tiering is not "archive old rows somewhere else." It is the discipline of deciding which physical pages, index levels, manifests, filters, and partition summaries must stay on the fastest media so the latency-sensitive workload does not keep rediscovering cold history. A system can retain all of its data and still behave as if only a small active footprint exists, but only if it treats the working set as a first-class design target.
The common misconception is that age alone defines temperature. Harbor Point still replays some previous-day partitions constantly during risk checks, while some fresh compliance exports become cold almost immediately after landing. Temperature is therefore an operational property: how often the bytes are touched, how expensive a miss is, and whether a lookup can cheaply decide that the answer lives on a slower tier.
Why This Matters
Harbor Point's retention policy is fixed, but its latency budget is not. Before tiering, one large historical scan could evict current-day index and heap pages, increase checkpoint pressure, and make every failover slower because the storage engine had to warm an enormous undifferentiated footprint. After tiering, the latest trading-day partitions, their hottest secondary indexes, and the metadata needed to route cold lookups stay on NVMe and in cache, while older immutable partitions move to cheaper SSD or object-backed storage. The result is not free performance. Cold reads become intentionally slower, and the engine needs background movement logic, promotion limits, and careful observability. But the system finally pays premium latency only for data that actually earns it.
Learning Objectives
By the end of this session, you will be able to:
- Explain what a storage working set really is - Distinguish the active physical footprint from the total logical dataset and from simple age-based retention buckets.
- Trace how hot/cold tiering works inside an engine - Follow how data is classified, moved, and found again without making every cold lookup expensive.
- Diagnose tiering failures in production - Identify cache pollution, metadata-locality problems, and promotion storms before they turn storage savings into user-facing latency spikes.
Core Concepts Explained
Concept 1: The working set is a physical footprint, not a business label
At Harbor Point, the logical dataset is enormous: years of filled and canceled reservations, audit events, and surveillance extracts. The trading path, however, repeatedly touches a much smaller slice: today's open reservations, a short band of recently filled reservations, the top levels of a few secondary indexes, and the metadata structures that describe where those records live. That slice is the working set. The storage engine pays latency in pages, blocks, and index fan-out, not in abstract rows or tables.
This distinction matters because a single cold analytical scan can touch enough pages to displace the hot footprint even if it returns data no trading service will request again today. The problem is not that the historical query is illegitimate. The problem is that without explicit working-set control, the engine treats one-time scans and high-frequency operational lookups as equally deserving of the same premium cache and device space.
The lesson from 13.md carries forward cleanly: dead versions were one way to bloat the hot footprint with bytes the foreground workload did not truly need. Hot/cold tiering tackles the next layer of the same problem by asking which live data should remain in the fastest tier and which live data can tolerate a slower path. In practice, temperature is estimated from a mix of recency, frequency, and miss cost. A partition that is six months old may still be hot if every risk recalculation touches it. A partition created this morning may already be cold if it exists only for a one-off export.
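To make that estimate concrete, here is a minimal sketch of a temperature score that blends recency, observed reuse, and miss cost. The PartitionStats fields, the weights, and the normalization constants are all illustrative assumptions, not any engine's real policy.

from dataclasses import dataclass
import time

@dataclass
class PartitionStats:
    # Hypothetical per-partition counters an engine might maintain.
    last_access: float     # timestamp of the most recent read
    reads_per_hour: float  # decayed access frequency
    miss_cost_ms: float    # estimated latency of a cold-tier fetch

def temperature(p: PartitionStats, now: float) -> float:
    # Higher score means "keep on the fast tier". Weights are illustrative.
    recency = 1.0 / (1.0 + (now - p.last_access) / 3600.0)  # decays per hour
    frequency = min(p.reads_per_hour / 100.0, 1.0)
    miss_cost = min(p.miss_cost_ms / 50.0, 1.0)
    return 0.5 * recency + 0.4 * frequency + 0.1 * miss_cost

now = time.time()
risk_partition = PartitionStats(now - 6 * 3600, reads_per_hour=120.0, miss_cost_ms=40.0)
fresh_export = PartitionStats(now - 600, reads_per_hour=0.2, miss_cost_ms=5.0)
# A six-hour-old, heavily reused partition outranks a ten-minute-old one-off export.
assert temperature(risk_partition, now) > temperature(fresh_export, now)

The exact weights matter less than the shape: age alone never decides placement.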
The trade-off is direct. A smaller hot tier lowers cost and keeps the fast path dense, but every misclassified page raises the chance that a user request has to wait for a slower fetch or an emergency promotion. Tiering is valuable only when the latency-sensitive workload has a bounded, learnable footprint.
Concept 2: Real tiering systems move coarse data units and keep routing metadata hot
Very few engines move individual rows between storage classes on every access. The bookkeeping would be worse than the problem. Real systems usually tier at a coarser unit: a partition in a heap-oriented database, an SSTable in an LSM engine, or an immutable part or stripe in an analytical store. Harbor Point uses daily partitions for recent reservations and weekly immutable partitions once trading closes, which gives the storage layer something stable to classify and move.
The internal loop looks roughly like this:
foreground read
  -> consult hot partition map / sparse index / Bloom filter
  -> if data is on fast tier, serve normally
  -> if data is cold, fetch the needed block or segment
  -> optionally promote if repeated access justifies it

background control loop
  -> collect hit-rate and recency signals
  -> demote cooled partitions or files to slower media
  -> pin top-level indexes, manifests, and filters on the fast tier
  -> throttle promotions so one burst does not displace the hot set
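Here is a minimal Python sketch of the foreground half of that loop, under heavy assumptions: the TieredStore class, its dict-backed tiers, and the PROMOTE_AFTER threshold are all hypothetical, and a real engine would operate on pages and blocks rather than whole rows.

# Sketch of the foreground read path. Classes and thresholds are hypothetical.
PROMOTE_AFTER = 3  # cold hits required before a partition is queued for promotion

class TieredStore:
    def __init__(self, partition_of, fast, cold):
        self.partition_of = partition_of  # warm routing metadata: key -> partition id
        self.fast = fast                  # partition id -> {key: value} on fast media
        self.cold = cold                  # partition id -> {key: value}, slow fetch
        self.cold_hits = {}               # partition id -> recent cold-hit count
        self.promotion_queue = []         # drained by a throttled background mover

    def read(self, key):
        # Step 1: routing metadata is pinned, so one warm lookup picks the tier.
        pid = self.partition_of.get(key)
        if pid is None:
            return None                   # summary metadata ruled the key out
        if pid in self.fast:
            return self.fast[pid][key]    # normal fast-tier read
        # Step 2: cold path fetches only what is needed; nothing enters the
        # fast tier yet, so a one-off scan cannot pollute it.
        value = self.cold[pid][key]
        # Step 3: promote only when repeated access justifies the copy,
        # and even then via the background loop, never inline.
        self.cold_hits[pid] = self.cold_hits.get(pid, 0) + 1
        if self.cold_hits[pid] == PROMOTE_AFTER:
            self.promotion_queue.append(pid)
        return value

# Usage: two partitions, one hot and one cold.
store = TieredStore(
    partition_of={"R-1": "today", "R-9": "2023_w40"},
    fast={"today": {"R-1": "open"}},
    cold={"2023_w40": {"R-9": "filled"}},
)
assert store.read("R-1") == "open"    # served from the fast tier
assert store.read("R-9") == "filled"  # cold fetch, not yet promoted
assert store.promotion_queue == []    # needs PROMOTE_AFTER repeated hits

The design choice worth noticing is that promotion is deferred to a queue for the background loop; the foreground read never copies a partition inline.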
The subtle requirement is metadata locality. A cold lookup should not need three cold lookups just to discover where the real data lives. If the partition directory, sparse index, Bloom filter, or zone-map summary for an older partition is also cold, one user request can trigger multiple remote reads before it even knows whether the row exists. Good tiering therefore keeps the routing structures compact and warm, even when the data blocks themselves are far away.
This is why systems such as RocksDB expose partitioned index and filter structures, and why relational systems often combine partitioning with carefully chosen local indexes rather than trying to tier one giant monolithic table file. The main trade-off is precision versus overhead. Fine-grained temperature tracking reacts better to mixed workloads, but it costs more CPU, metadata, and movement churn. Coarse-grained tiering is cheaper to manage, but it can strand moderately hot data on a slow tier or waste fast-tier space on partitions that are mostly cold.
Concept 3: Working-set control is mostly policy about what not to promote
Harbor Point's worst regressions came from cold scans that looked harmless in isolation. A compliance job reading eighteen months of reservations once per quarter did not need those pages to become part of the hot working set, yet a naive cache would happily admit them and evict current-day pages on the way. That is why working-set control is not only about demotion; it is also about refusing to let one-time traffic pollute the fast path.
Production systems use several defenses. Scan-resistant admission policies keep sequential one-touch reads from displacing frequently reused blocks. Pinned or prewarmed structures keep the latest trading-day partitions and their top index levels resident after restart or failover. Promotion limits prevent a thundering herd of cold misses from copying an entire historical partition back to NVMe because one incident investigation suddenly touched it from many workers at once. Some systems also separate caches or I/O budgets by workload class so that analytics and operational traffic do not compete as if they had the same latency target.
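Two of these defenses are concrete enough to sketch: a two-touch admission gate in the spirit of 2Q/TinyLFU-style doorkeepers, and a token-bucket promotion budget. The class names, thresholds, and rates below are illustrative assumptions, not any specific system's API.

import time

class ScanResistantCache:
    # Admit a block to the main cache only on its second touch, so a one-pass
    # scan leaves the protected hot set alone. Sketch only; eviction omitted.
    def __init__(self, capacity):
        self.main = {}          # block_id -> data (the protected hot set)
        self.seen_once = set()  # "doorkeeper" recording first touches
        self.capacity = capacity

    def get(self, block_id, fetch):
        if block_id in self.main:
            return self.main[block_id]
        data = fetch(block_id)  # miss: read from storage
        if block_id in self.seen_once and len(self.main) < self.capacity:
            self.main[block_id] = data    # second touch: it has earned a slot
        else:
            self.seen_once.add(block_id)  # first touch: remember, do not admit
        return data

class PromotionBudget:
    # Token bucket: at most `rate` promotions per second with a small burst,
    # so a flood of cold misses cannot fill the fast tier with old partitions.
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def try_promote(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # leave the partition cold for now; retry later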
These controls all encode the same principle: misses are acceptable when they are rare, bounded, and intentional. They become dangerous when the system promotes too aggressively and teaches itself that every cold burst is now hot. The trade-off is that stronger admission and tighter promotion budgets can make legitimate exploratory analysis feel slower. That is usually the correct bargain for an operational store, but it has to be chosen explicitly and measured against real SLOs.
This also sets up 15.md. Once data is spread across hot and cold tiers, an online schema migration can no longer assume one immediate rewrite of one local file. The migration has to preserve readable structure across tiers, often for a long overlap period.
Troubleshooting
Issue: Historical compliance queries still blow up market-open latency even after older partitions were moved to a colder tier.
Why it happens / is confusing: The cold bytes moved, but the read path still admits those fetched blocks into the main cache or promotes entire partitions back to the fast tier. The system saved storage money without protecting the working set.
Clarification / Fix: Use scan-resistant admission, non-promoting reads for one-off analytics, or workload-level cache isolation. Then verify that hot-partition hit rate stays stable while the cold query runs.
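One way to make that verification concrete, assuming hypothetical counter snapshots taken before and during the cold query:

def hit_rate(stats):
    # stats is a hypothetical snapshot of hot-partition cache counters
    total = stats["hits"] + stats["misses"]
    return stats["hits"] / total if total else 1.0

def hot_set_protected(before, during, max_drop=0.05):
    # True if the hot-partition hit rate stayed within max_drop while the
    # cold analytics query ran; the 5-point threshold is illustrative.
    return hit_rate(before) - hit_rate(during) <= max_drop

# A 2-point drop passes; a 20-point drop flags cache pollution.
assert hot_set_protected({"hits": 97, "misses": 3}, {"hits": 95, "misses": 5})
assert not hot_set_protected({"hits": 97, "misses": 3}, {"hits": 77, "misses": 23})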
Issue: Point lookups into old reservations are much slower than expected, even though each lookup returns only one row.
Why it happens / is confusing: The real latency is often in the routing path, not the data row itself. If manifests, sparse indexes, or filters for cold partitions are also remote, each lookup pays several serial reads before the row fetch begins.
Clarification / Fix: Keep partition maps and summary metadata on a fast tier, and prefer tiering units that let one metadata check rule out large amounts of irrelevant cold data.
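To picture that fix, here is a toy zone map: a few bytes of warm min/max metadata per cold partition, with hypothetical partition names and keys, that rules out most partitions before any remote read happens.

# Warm summary metadata: min/max reservation key per cold partition.
zone_map = {
    "res_2023_w14": ("R-001000", "R-014999"),
    "res_2023_w15": ("R-015000", "R-029999"),
    "res_2023_w16": ("R-030000", "R-044999"),
}

def candidate_partitions(key):
    # Serial cold reads happen only for partitions that survive this check.
    return [pid for pid, (lo, hi) in zone_map.items() if lo <= key <= hi]

# One in-memory check narrows the lookup to a single cold partition.
assert candidate_partitions("R-020500") == ["res_2023_w15"]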
Issue: An age-based policy demoted partitions that the risk service still reuses heavily every morning.
Why it happens / is confusing: Age is only a proxy. Repeated read bursts can make yesterday's partition more valuable to the fast tier than a newer export that nobody will touch again.
Clarification / Fix: Classify temperature with recency plus observed reuse, and allow explicit pinning for business-critical slices whose access pattern is predictable but not well captured by simple aging.
Advanced Connections
Connection 1: 13.md and this lesson are both about keeping irrelevant bytes out of the hot path
Vacuum removes obsolete versions that distort the physical footprint of a table. Tiering removes or isolates live but infrequently used data that would otherwise compete with the active working set. One lesson is about historical bytes that are no longer logically needed; the other is about logically valid bytes that are operationally cold. Both are working-set control problems.
Connection 2: 15.md turns tier placement into a schema-compatibility problem
Online schema migrations are harder once some data lives on a hot local tier and some lives in cold immutable storage. A new index or column layout may be cheap to build for today's partition and expensive to backfill for old partitions, so migration safety depends on dual-read compatibility and careful rollout boundaries rather than one immediate rewrite.
Resources
Optional Deepening Resources
- [DOC] MySQL 8.4 Reference Manual: The InnoDB Buffer Pool
  - Focus: How a mainstream transactional engine defines, caches, and protects hot pages in memory.
- [DOC] PostgreSQL Documentation: Declarative Partitioning
  - Focus: Practical partition boundaries for keeping recent data small and separating colder ranges cleanly.
- [DOC] RocksDB Wiki: Block Cache
  - Focus: Cache admission, eviction, and block-level working-set behavior in an LSM engine.
- [DOC] RocksDB Wiki: Partitioned Index/Filters
  - Focus: Why lookup metadata often needs to stay compact and hot even when most data files are cold.
- [DOC] PostgreSQL Documentation: pg_prewarm
  - Focus: A concrete mechanism for restoring a known hot working set after restart or failover.
Key Insights
- Hot/cold tiering is a working-set decision, not a retention decision - The key question is which physical bytes deserve premium latency, not which rows are oldest.
- Metadata locality is part of data locality - If the structures that locate cold data are themselves cold, every miss multiplies into a slow chain of lookups.
- Promotion policy matters as much as placement policy - Systems fail when they eagerly teach themselves that every rare cold burst now belongs in the hot tier.