Vacuum and Garbage Collection of Old Versions

LESSON

Database Engine Internals and Implementation

Lesson 036 · 30 min · Advanced

Day 413: Vacuum and Garbage Collection of Old Versions

The core idea: An old MVCC version is reclaimable only after the oldest snapshot that could still see it has moved past it, so vacuum is really visibility-horizon management plus background storage cleanup.

Today's "Aha!" Moment

In 12.md, Harbor Point fixed a bad market-open query by refreshing planner statistics and modeling skew more accurately. The estimate improved, but the query was still slower than it had been a week earlier. EXPLAIN ANALYZE now looked reasonable, yet the index scan on reservations touched far more heap pages than expected, and the table kept growing even though the number of live rows was nearly flat.

The missing piece was old-version cleanup. Harbor Point's matching workflow updates each reservation several times in a short burst: open, then routed, then partially_filled, then filled or canceled. In an MVCC engine, those updates do not overwrite one row in place from every reader's perspective. They create a new visible version while older versions remain part of history for any snapshot that started earlier. During a long-running risk report, one old snapshot stayed open for twenty minutes, so autovacuum could see obsolete versions piling up but could not yet reclaim many of them.

That is the shift for this lesson: vacuum is not janitorial work that runs "when the database has time." It is a correctness-constrained mechanism that asks a strict question for every old version: can any active snapshot still legally read this? If the answer is yes, reclaiming it would break snapshot semantics. If the answer is no, leaving it around inflates the heap, bloats indexes, and drags more cold pages into the working set. Cleanup policy therefore sits directly between concurrency control and storage efficiency.

Why This Matters

Harbor Point's reservations table is not unusual by OLTP standards: heavy updates, short read queries, and occasional long analytics sessions that want a stable historical view. That combination is exactly where MVCC shines for readers and exactly where old-version debt becomes operationally expensive. A planner can have accurate statistics and a well-designed index, yet the real query still slows down because the engine is scanning through dead tuple versions, following stale index pointers, and caching pages whose only current purpose is to preserve history for one lingering snapshot.

Understanding vacuum changes the diagnosis. Before, storage growth and rising latency look like vague "database slowness." After, you can distinguish between live-data growth, dead-version backlog, and file-level bloat that persists even after tuples become reusable. You can also reason about why a replica with an old read view, an idle transaction, or an underpowered background worker can quietly hold the reclaim horizon in place.

That matters for correctness as well as performance. MVCC engines must eventually retire transaction metadata safely, or they risk wraparound-style failure modes where visibility information itself becomes ambiguous. Vacuum is therefore not optional polish. It is part of the mechanism that keeps old history readable long enough, then proves when it is safe to forget it.

Learning Objectives

By the end of this session, you will be able to:

  1. Explain when an old row version becomes safe to reclaim - Connect tuple visibility to the oldest active snapshot rather than to the moment an update commits.
  2. Trace what vacuum or purge work actually does inside the engine - Follow how the system identifies dead versions, cleans indexes or undo state, and returns space to reuse.
  3. Diagnose production symptoms of cleanup debt - Separate horizon blockers, reusable free space, and true relation or index bloat when performance degrades.

Core Concepts Explained

Concept 1: A version is dead only relative to every active snapshot, not just to the newest writer

Harbor Point updates reservation R-18422 three times in quick succession during market open. From the application's point of view, there is one reservation whose status changed. Inside an MVCC engine, the history is closer to this:

time ---->

v1: status=open              xmin=8401 xmax=8408
v2: status=routed            xmin=8408 xmax=8412
v3: status=partially_filled  xmin=8412 xmax=inf

snapshot S_old starts before 8408  -> can still see v1
snapshot S_mid starts before 8412  -> can still see v2
new snapshot starts after 8412     -> sees v3

The key invariant is that each reader gets a consistent snapshot. A reader that started before transaction 8408 committed must still be able to read v1 even after v2 and v3 exist. That is why old versions cannot be reclaimed at commit time. The engine needs a global visibility horizon, usually expressed as the oldest transaction ID or oldest read view that any active session still needs. Only versions older than that horizon, and already superseded from every relevant snapshot, are candidates for cleanup.
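The horizon rule above can be sketched in a few lines of Python. This is a toy model with hypothetical `Version` and `Snapshot` structures, not PostgreSQL's actual tuple headers or snapshot representation; it only captures the "oldest active snapshot pins the reclaim boundary" invariant:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Version:
    xmin: int                  # transaction that created this version
    xmax: Optional[int]        # transaction that superseded it, if any

@dataclass
class Snapshot:
    xmin: int                  # oldest transaction this snapshot might still see

def oldest_active_xmin(snapshots):
    # The global visibility horizon: nothing at or beyond the oldest
    # active snapshot's xmin can be treated as dead yet.
    return min(s.xmin for s in snapshots)

def is_reclaimable(version, snapshots):
    # The newest version of a row is never reclaimable.
    if version.xmax is None:
        return False
    # Safe only once every active snapshot started after the superseding
    # transaction committed, i.e. the horizon has moved past xmax.
    return version.xmax < oldest_active_xmin(snapshots)

v1 = Version(xmin=8401, xmax=8408)
s_old = Snapshot(xmin=8405)    # started before txn 8408: still needs v1
s_new = Snapshot(xmin=8420)

print(is_reclaimable(v1, [s_old, s_new]))  # False: S_old pins the horizon
print(is_reclaimable(v1, [s_new]))         # True once S_old finishes
```

Note how closing `s_old` is the only thing that changes the answer for v1: no amount of additional vacuum effort makes the version reclaimable while that snapshot is open.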

PostgreSQL and InnoDB package the details differently, but the rule is the same. PostgreSQL stores version metadata with heap tuples and decides visibility from xmin and xmax. InnoDB keeps prior versions in undo records and exposes them through read views. The implementation differs; the correctness boundary does not. Garbage collection is blocked by the oldest snapshot that still makes an old version reachable.

This is the fundamental trade-off of MVCC. Readers avoid blocking writers because they can walk historical versions instead of waiting for in-place overwrites to finish. The cost is deferred cleanup. Cheap, nonblocking reads are financed by background work that must later discover which history is still needed and which history can finally disappear.

Concept 2: Vacuum is a pipeline of classification, cleanup, and reuse metadata, not a single "delete old rows" step

When Harbor Point's autovacuum worker visits reservations, it is not simply erasing rows marked old. It has to inspect pages, classify tuple versions by visibility, and decide what kind of cleanup is safe now. In a PostgreSQL-like heap, the sequence is roughly:

scan page
  -> determine which tuple versions are live, dead, or still potentially visible
  -> remove or mark removable index references for versions that are definitely dead
  -> mark heap space reusable
  -> refresh free-space / visibility metadata
  -> freeze very old transaction IDs when needed

Each stage serves a different purpose. Identifying a tuple as dead is about correctness. Cleaning an index entry is about preventing future probes from landing on a version no query can use. Updating free-space metadata is about helping later inserts and updates reuse holes instead of extending the file. Freezing very old transaction IDs is about keeping visibility checks well-defined as transaction counters age.
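The classify-clean-record pipeline can be modeled as one pass over a toy heap page. The dictionary layout and the all-visible rule here are illustrative simplifications; a real engine interleaves these steps with WAL logging, page locking, and separate index-vacuum passes:

```python
def vacuum_page(page, horizon):
    """One toy vacuum pass: classify tuples, clean, record reuse metadata."""
    dead, kept = [], []
    for tup in page["tuples"]:
        if tup["xmax"] is not None and tup["xmax"] < horizon:
            dead.append(tup)   # superseded before the horizon: reclaimable
        else:
            kept.append(tup)   # live, or still pinned by some active snapshot
    # Index cleanup: these tuple ids should be removed from every index.
    dead_pointers = [t["tid"] for t in dead]
    # Heap cleanup: dead tuples become free space for later inserts/updates.
    page["tuples"] = kept
    page["free_space"] += sum(t["size"] for t in dead)
    # Visibility metadata: the page is all-visible only if every surviving
    # tuple is visible to every possible snapshot (toy approximation).
    page["all_visible"] = all(t["xmax"] is None and t["xmin"] < horizon
                              for t in kept)
    return dead_pointers

page = {"free_space": 0, "all_visible": False, "tuples": [
    {"tid": (7, 1), "xmin": 8401, "xmax": 8408, "size": 60},
    {"tid": (7, 2), "xmin": 8408, "xmax": 8412, "size": 60},
    {"tid": (7, 3), "xmin": 8412, "xmax": None, "size": 60},
]}
print(vacuum_page(page, horizon=8420))          # [(7, 1), (7, 2)]
print(page["free_space"], page["all_visible"])  # 120 True
```

Rerunning the same pass with `horizon=8410` reclaims only the first tuple and leaves the page not all-visible: the same vacuum code does less useful work simply because the horizon has not advanced.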

Two production nuances matter here. First, vacuum often makes space reusable inside the relation long before the operating system sees a smaller file. If dead tuples are scattered across many pages, the engine can reuse those holes for future writes without being able to truncate the file tail. Second, a vacuum pass can do real work and still leave a table looking bloated if one old snapshot prevents many versions from crossing the reclaim horizon. "Vacuum ran" is therefore not the same statement as "cleanup succeeded."

InnoDB expresses the same idea with different structures. Purge threads retire undo records and remove delete-marked records only after no read view can need them. Again, the important lesson is the mechanism: old-version cleanup touches both the history structure and the live access paths. It is not just housekeeping at the edge of the system; it changes read amplification for future transactions.
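The undo-chain side of this can be modeled as a tiny version walk. This is a toy model: a real InnoDB read view tracks an active-transaction list and low/high watermarks rather than the single `read_view_limit` cutoff assumed here:

```python
# Toy model of undo-based version retrieval: a reader whose read view cannot
# see the newest writer walks undo records back to an older version.

def visible_version(record, read_view_limit):
    # record: {"value": ..., "trx_id": ..., "undo": older record or None}
    while record is not None and record["trx_id"] >= read_view_limit:
        record = record["undo"]
    return record

v1 = {"value": "open",             "trx_id": 8401, "undo": None}
v2 = {"value": "routed",           "trx_id": 8408, "undo": v1}
v3 = {"value": "partially_filled", "trx_id": 8412, "undo": v2}

# A read view that can only see transactions below 8408 walks back to v1.
print(visible_version(v3, read_view_limit=8408)["value"])  # open
print(visible_version(v3, read_view_limit=8413)["value"])  # partially_filled
```

Purge is the mirror image of this walk: an undo record can be discarded only when no read view's cutoff could ever force a reader to follow the chain down to it.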

Concept 3: Vacuum debt becomes performance debt by inflating scans, indexes, and the working set

Harbor Point first noticed the problem in query latency, not in a vacuum dashboard. The live row count stayed near twelve million, but the heap and its hot indexes expanded well past that footprint because every burst of reservation updates created more dead versions than cleanup could reclaim. As the backlog grew, ordered index scans had to traverse more stale pointers, more heap pages lost their all-visible status, and cache residency shifted from currently tradable reservations toward pages that mostly existed to preserve old history.

That is why vacuum tuning is not merely a storage-admin chore. It directly affects the size of the active working set. A dead version that remains on a frequently touched page can force extra visibility checks and deny index-only scans. A dead version on a cold page can keep the table file larger than it needs to be, increasing checkpoint and backup work. If the backlog becomes extreme, the engine may need emergency anti-wraparound cleanup that is far more disruptive than steady routine maintenance would have been.

The operational levers map cleanly to the mechanism. If the horizon is pinned by a long transaction, fix the transaction lifecycle or the replica feedback policy. If cleanup simply cannot keep pace with write churn, adjust autovacuum or purge capacity and thresholds so maintenance starts earlier. If the workload creates avoidable version churn, redesign the table or update path so hot rows generate fewer non-HOT updates, or partition old data so the newest, most frequently updated slice stays compact. Those choices all cost something: more background I/O, more planning around transaction lifetimes, or more schema complexity.
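The mapping from symptom to lever can be summarized as a small triage helper. The thresholds and return strings below are made up for illustration; real tuning would look at engine-specific counters rather than three scalar inputs:

```python
# Toy triage: map observed cleanup-debt symptoms to the three levers above.
# All thresholds are illustrative, not recommendations.

def cleanup_lever(oldest_xact_secs, dead_ratio, vacuum_keeping_up):
    if oldest_xact_secs > 600:
        return "horizon pinned: shorten transactions / fix replica feedback"
    if not vacuum_keeping_up:
        return "capacity: raise autovacuum/purge throughput, lower thresholds"
    if dead_ratio > 0.2:
        return "churn: reduce non-HOT updates or partition the hot slice"
    return "healthy: routine maintenance is keeping up"

# Harbor Point's twenty-minute risk report pins the horizon first:
print(cleanup_lever(oldest_xact_secs=1200, dead_ratio=0.05,
                    vacuum_keeping_up=True))
```

The ordering matters: checking the horizon first mirrors the troubleshooting advice below that removing a blocker usually releases more space than making vacuum scan faster.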

This also sets up 14.md. Hot and cold data tiering is much easier to reason about once you see that dead versions are a hidden kind of cold data occupying hot storage. If garbage collection lags, the engine's working set expands with pages that no current transaction actually wants.

Troubleshooting

Issue: VACUUM completed, but the table file on disk barely shrank.

Why it happens / is confusing: Standard vacuum usually returns space to the relation's own free-space pool first. If free bytes are spread across the file instead of concentrated at the end, the engine can reuse them later without being able to truncate the file immediately.

Clarification / Fix: Check whether live-row count is stable while free space inside the relation is increasing. If the real problem is file-level bloat rather than reuse debt, you may need a rewrite-style operation, partition rotation, or reindexing in addition to routine vacuum.
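The difference between reusable holes and returnable file space can be shown with a toy model in which vacuum may only truncate a completely empty tail of pages (a simplification of how heap truncation actually works):

```python
# pages: live-tuple count per page, in file order. Only the empty tail can
# be truncated back to the OS; interior holes are merely reusable.

def truncatable_pages(pages):
    n = 0
    for live in reversed(pages):
        if live == 0:
            n += 1      # empty page at the end of the file: removable
        else:
            break       # first non-empty page from the tail stops truncation
    return n

print(truncatable_pages([5, 0, 3, 0, 0]))  # 2: only the empty tail shrinks
print(truncatable_pages([0, 0, 5, 5, 5]))  # 0: holes are reusable, not returnable
```

Both example tables have the same amount of free space, but only the first can give any of it back to the operating system, which is exactly why a completed vacuum can leave the file size unchanged.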

Issue: Autovacuum keeps visiting a table, but dead tuples remain high for hours.

Why it happens / is confusing: The worker may be blocked by the visibility horizon rather than by lack of effort. A long transaction, an idle session left open in a transaction block, or a replica holding an old snapshot can make many versions ineligible for reclamation.

Clarification / Fix: Find the oldest active snapshot or read view before tuning worker cost limits. Removing the blocker usually releases more reclaimable space than simply making vacuum scan faster.

Issue: Query plans look reasonable again after 12.md, but index scans are still getting slower over time.

Why it happens / is confusing: Statistics quality and version cleanup solve different problems. The planner may now estimate row counts correctly while the physical access path still pays for dead index entries, wider heap footprint, and lost visibility-map coverage.

Clarification / Fix: Measure dead tuples, index bloat, and long-lived transactions alongside EXPLAIN ANALYZE. If estimates are accurate but page reads and heap fetches keep climbing, the bottleneck has shifted from planning error to cleanup debt.

Advanced Connections

Connection 1: 12.md and this lesson share the same maintenance budget

Yesterday's lesson showed how stale or underspecified statistics cause cardinality errors. Today's lesson shows a different maintenance failure mode: even with fresh statistics, old-version debt can make the chosen plan more expensive to execute than its logical shape suggests. In production, analyze work and garbage collection often compete for the same background I/O and scheduling attention, so plan quality and cleanup quality frequently drift together.

Connection 2: 14.md turns cleanup lag into a working-set design problem

Vacuum debt is not only about wasted bytes. It decides which pages stay in active circulation and which indexes remain slim enough for cache-friendly access. The next lesson broadens that idea from tuple history to the full storage hierarchy: once you know dead versions can pollute the hot set, hot/cold tiering becomes an extension of the same control problem.


Key Insights

  1. Old versions disappear only after the oldest snapshot releases them - The reclaim boundary is a visibility rule, not a timer that starts when an update commits.
  2. Vacuum changes future read cost, not just past storage usage - Cleaning heap history, index entries, and visibility metadata reduces the amount of dead state that later queries must step around.
  3. Cleanup debt is both performance debt and correctness debt - If background reclamation falls behind, scans bloat first and transaction-ID safety problems arrive later.
