LESSON
Day 447: Online Reindexing and Maintenance at Scale
The core idea: Online reindexing is a long-running state machine, not a background convenience. The engine has to backfill a shadow index from a stable snapshot, keep up with concurrent writes, prove the rebuilt structure is equivalent to the old one, and cut over atomically without letting maintenance work steal the serving path.
Today's "Aha!" Moment
In 045.md, Harbor Point learned that tenant identity has to survive all the way into quotas, queues, and background work. That lesson becomes painfully concrete when CedarHarbor asks to rebuild the bloated idx_reservations_by_settlement index on its reservations table. The index has grown inefficient after months of correction imports and delete churn, but Harbor Point cannot pause NorthPier's 09:30 booking traffic just to clean up one tenant's access path.
That is the real shape of online reindexing. It is not "make a copy in the background and swap it later." Rows keep changing while the rebuild runs. New inserts arrive, updates move keys between index pages, deletes create garbage, replicas lag, and compaction or vacuum work accumulates behind the scenes. A correct rebuild therefore behaves like a resumable protocol: establish a snapshot frontier, build a shadow structure, capture everything that changed after the snapshot, validate parity, and only then flip the metadata that makes planners trust the new index.
The misconception to discard is that the scan is the hard part. At scale, the scan is usually the easy part. The hard part is preserving invariants while the rest of the system is still live: every committed row that should be visible after cutover must have a matching entry in the new index, queries must never read from a half-built structure, and one tenant's maintenance job must not silently consume another tenant's latency budget. Once you see reindexing as a state machine with budgets, the operational decisions become much clearer.
Why This Matters
Indexes age. In B-tree engines they become bloated after heavy update and delete churn. In LSM-backed engines, aging shows up instead as accumulated write amplification and compaction debt as schemas and workloads evolve. Query regressions often appear long before a table itself is "too large"; the hot path slows down because the access structure no longer matches the workload or has become physically inefficient.
The naive fix is downtime: block writes, rebuild the index, then reopen the table. That works for a small internal tool. It does not work for Harbor Point's production cluster, where NorthPier is making time-sensitive reservations, BeaconStreet is replaying corrections, and CedarHarbor still needs its audit queries to recover. Taking a long exclusive lock would turn a maintenance task into a customer-visible outage.
Online reindexing changes the problem from "can we rebuild the index?" to "can we rebuild it while the database remains honest about visibility, durability, and resource ownership?" If the answer is yes, the organization gains a safer maintenance path, shorter incident windows, and fewer emergency migrations caused by avoidable index decay. If the answer is no, maintenance becomes something teams postpone until the day it is urgent and dangerous.
This lesson also prepares the ground for 047.md. Once maintenance is modeled as a resumable protocol, failover and outage handling stop being afterthoughts. Operators need to know whether a cutover can resume, roll back, or safely wait when a region disappears halfway through a rebuild.
Learning Objectives
By the end of this session, you will be able to:
- Explain the correctness invariant behind online reindexing - Describe what must be true about the old and new index before the planner is allowed to trust the rebuilt structure.
- Trace the reindexing state machine - Follow snapshot selection, backfill, concurrent change capture, validation, and atomic cutover in a live system.
- Evaluate production trade-offs - Judge when online reindexing is worth the extra write amplification, storage overhead, and operational controls required at scale.
Core Concepts Explained
Concept 1: The real invariant is index equivalence under live traffic
Harbor Point is not rebuilding idx_reservations_by_settlement for cosmetic reasons. It is rebuilding because the existing structure has become expensive: page splits and dead entries have increased read cost for CedarHarbor's audit lookups, and the planner is starting to prefer broader scans during end-of-day checks. The database cannot simply delete the old index and recreate it, because NorthPier and BeaconStreet are still writing new reservations, correcting settlement dates, and canceling stale holds while the rebuild is happening.
That leads to the central invariant: after cutover, the new index must represent the same logical row set the old index represented for every committed transaction that should be visible. "Same logical row set" is stronger than "the build job finished." It means inserts that happened after the initial snapshot cannot be missed, deletes cannot leave ghost entries that would produce wrong lookups, and updates that move a row from one key range to another cannot appear in both places or neither place. If the engine cannot guarantee that equivalence, then the rebuilt index is not a maintenance improvement. It is silent data corruption at the access-path layer.
This is why mature engines treat online reindexing as a descriptor or catalog state transition rather than as one monolithic command. The old index remains authoritative while the new one sits in a state such as write_only, backfilling, or not_valid, depending on the engine. Queries are either forbidden from using the shadow index or allowed to use it only under narrowly defined rules. The important point is not the exact label. The important point is that planner visibility and write visibility move in controlled phases instead of changing all at once.
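To make the staging concrete, here is a minimal sketch of that visibility rule in Go, assuming a simplified catalog model. The state names and helper functions are illustrative, not any particular engine's API; the invariant they encode is that planner visibility lags write visibility by design.

```go
// A sketch of staged index visibility, assuming a simplified catalog
// model. The state names and helpers here are illustrative, not any
// particular engine's labels.
package main

import "fmt"

type IndexState int

const (
	Absent      IndexState = iota // shadow index does not exist yet
	WriteOnly                     // writes maintain it; planner may not read it
	Backfilling                   // snapshot scan is populating historical rows
	Validated                     // parity checks passed; cutover not yet done
	Public                        // catalog flip done; planner may use it
)

// plannerVisible is the rule queries care about: everything before
// Public is invisible to the optimizer, no matter how "done" the
// background job looks.
func plannerVisible(s IndexState) bool { return s == Public }

// writeVisible says whether the write path must maintain the index.
// Writes flow into the shadow structure long before reads do.
func writeVisible(s IndexState) bool { return s >= WriteOnly }

func main() {
	for s := Absent; s <= Public; s++ {
		fmt.Printf("state=%d writes=%v planner=%v\n", s, writeVisible(s), plannerVisible(s))
	}
}
```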
The trade-off starts here. Offline rebuilds are simpler because there is only one truth at a time. Online rebuilds preserve availability, but they introduce dual structures, extra metadata states, and more failure paths to test. That complexity is justified only when the application cannot tolerate the lock profile or downtime of an offline rebuild.
Concept 2: The mechanism is snapshot backfill plus catch-up, not a single pass
Once Harbor Point commits to an online rebuild, the engine usually follows a sequence close to this:
```
old index remains serving queries
        |
choose snapshot frontier T0
        |
scan base table as of T0 -> build shadow index
        |
concurrent writes after T0
        -> old index
        -> shadow index or change log
        |
catch up shadow index to present frontier
        |
validate parity
        |
atomic metadata cutover
        |
retire old index and reclaim space
```
The first critical move is choosing a consistent snapshot frontier, T0. In an MVCC system, the backfill reads the table as it existed at T0 so the job sees a stable version of each row instead of a moving target. Without that snapshot guarantee, the scan could read half of one transaction and half of another, building an index that never matched any real database state.
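A small sketch of that snapshot read follows, under the assumption of a toy MVCC version chain; the Version type and visibleAt helper are hypothetical. The point is that the scan emits, for each row, the newest version committed at or before T0 and nothing newer.

```go
// A sketch of the backfill read path under MVCC, with hypothetical
// Row/Version types. Every row the scan emits is the version visible
// at the frontier T0, even if newer versions already exist.
package main

import "fmt"

type Version struct {
	Commit int64  // commit timestamp of this version
	Key    string // index key this version carries
	Dead   bool   // true if this version is a delete
}

// visibleAt returns the newest version committed at or before t0,
// which is what a snapshot scan at frontier t0 must see.
func visibleAt(versions []Version, t0 int64) (Version, bool) {
	var best Version
	found := false
	for _, v := range versions {
		if v.Commit <= t0 && (!found || v.Commit > best.Commit) {
			best, found = v, true
		}
	}
	if found && best.Dead {
		return Version{}, false // deleted as of t0: emit nothing
	}
	return best, found
}

func main() {
	history := []Version{
		{Commit: 90, Key: "settlement/2024-01-03"},
		{Commit: 120, Key: "settlement/2024-01-09"}, // after T0: ignored
	}
	if v, ok := visibleAt(history, 100); ok {
		fmt.Println("backfill emits:", v.Key) // the version as of T0=100
	}
}
```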
The second critical move is handling writes that arrive after T0. Engines do this in different ways. Some dual-write directly into both old and new indexes once the shadow structure exists. Others record post-snapshot changes in a side log or schema-change job queue and replay them into the new index before cutover. PostgreSQL's concurrent builds, distributed SQL schema changers, and online DDL systems in MySQL-derived fleets differ in details, but they all solve the same problem: a long backfill alone cannot make the new index current, because the table keeps changing while the scan runs.
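The two capture strategies can be sketched side by side. The Mutation and Index types below are hypothetical stand-ins; what matters is where the extra write cost lands, at commit time for dual-write versus at replay time for log capture.

```go
// A sketch of the two common capture strategies, with hypothetical
// types. Dual-write applies each post-T0 mutation to both structures
// inline; log capture appends deltas for later replay.
package main

import "fmt"

type Mutation struct {
	Key    string
	Delete bool
}

type Index struct {
	name    string
	entries map[string]bool
}

func (ix *Index) Apply(m Mutation) {
	if m.Delete {
		delete(ix.entries, m.Key)
	} else {
		ix.entries[m.Key] = true
	}
}

// dualWrite: the write path pays for both indexes at commit time.
func dualWrite(old, shadow *Index, m Mutation) {
	old.Apply(m)
	shadow.Apply(m)
}

// logCapture: the write path pays for the old index plus an append;
// the rebuild job replays the log into the shadow index before cutover.
func logCapture(old *Index, log *[]Mutation, m Mutation) {
	old.Apply(m)
	*log = append(*log, m)
}

func main() {
	old := &Index{"old", map[string]bool{}}
	shadow := &Index{"shadow", map[string]bool{}}
	var deltaLog []Mutation

	dualWrite(old, shadow, Mutation{Key: "res/771"})
	logCapture(old, &deltaLog, Mutation{Key: "res/772"})

	// Catch-up phase: replay captured deltas into the shadow index.
	for _, m := range deltaLog {
		shadow.Apply(m)
	}
	fmt.Println(len(shadow.entries), "entries in shadow") // 2
}
```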
At scale, the catch-up phase is often where reality bites. If CedarHarbor writes rows faster than the rebuild can absorb post-snapshot changes, the lag between the shadow index and the serving frontier keeps growing. The database then has three bad options: let the rebuild run forever, increase maintenance throughput and risk hurting foreground traffic, or pause and resume under tighter budgets. That is why engines persist progress durably. The rebuild must be resumable by range, partition, or key span, with a recorded frontier, so a node crash or lease transfer does not force a full restart.
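Here is a sketch of span-level resumability, assuming the progress slice is persisted durably (in a jobs table or descriptor) after each completed span; the Span type and nextSpan helper are illustrative.

```go
// A sketch of durable, span-level progress tracking, assuming a
// hypothetical checkpoint store. After a crash, the job resumes from
// the first incomplete span instead of rescanning the whole table.
package main

import "fmt"

type Span struct {
	Start, End string
	Done       bool
}

// nextSpan returns the first span that still needs backfilling.
func nextSpan(progress []Span) (int, bool) {
	for i, s := range progress {
		if !s.Done {
			return i, true
		}
	}
	return 0, false
}

func main() {
	// In a real engine this slice would live in a jobs table or
	// descriptor, persisted after every completed span.
	progress := []Span{
		{Start: "a", End: "g", Done: true},
		{Start: "g", End: "n", Done: true},
		{Start: "n", End: "t", Done: false}, // crash happened here
		{Start: "t", End: "z", Done: false},
	}
	for {
		i, ok := nextSpan(progress)
		if !ok {
			break
		}
		fmt.Printf("backfilling [%s, %s)\n", progress[i].Start, progress[i].End)
		progress[i].Done = true // persist before moving on
	}
}
```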
Validation and cutover are the last correctness gate. The engine checks that the shadow index is complete enough to trust, sometimes with row-count parity, duplicate detection for unique indexes, checksum-style consistency checks, or a combination of metadata proofs and targeted scans. Only then does it flip the catalog or descriptor pointer that makes the planner prefer the rebuilt index. The cutover needs to be atomic from the reader's point of view: there may be long background work before and after the switch, but there cannot be a period where queries are free to use a half-valid index because a maintenance job "looked mostly done."
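The gate can be sketched as a pure check followed by one atomic pointer swap. The specific checks below (row-count parity, a duplicate count) are examples of the class of proofs engines use, not an exhaustive or engine-specific list.

```go
// A sketch of the validation gate and atomic flip, with hypothetical
// checks. The invariant: the catalog pointer changes only after every
// check passes, and the flip is a single atomic metadata write.
package main

import (
	"fmt"
	"sync/atomic"
)

type catalog struct {
	// servingIndex is the pointer the planner reads. Swapping it is
	// the cutover readers observe; 0 = old index, 1 = shadow index.
	servingIndex atomic.Int32
}

func validate(oldCount, shadowCount int64, duplicates int) error {
	if shadowCount != oldCount {
		return fmt.Errorf("row-count parity failed: old=%d shadow=%d", oldCount, shadowCount)
	}
	if duplicates > 0 {
		return fmt.Errorf("%d duplicate keys in unique shadow index", duplicates)
	}
	return nil
}

func cutover(c *catalog, oldCount, shadowCount int64, duplicates int) error {
	if err := validate(oldCount, shadowCount, duplicates); err != nil {
		return err // old index stays authoritative; shadow marked invalid
	}
	c.servingIndex.Store(1) // atomic from the reader's point of view
	return nil
}

func main() {
	var c catalog
	if err := cutover(&c, 1_000_000, 1_000_000, 0); err != nil {
		fmt.Println("cutover refused:", err)
		return
	}
	fmt.Println("planner now serves index", c.servingIndex.Load())
}
```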
Concept 3: At scale, maintenance is a resource-scheduling problem as much as a correctness problem
Rebuilding one index on an idle database is mostly a storage-engine concern. Rebuilding indexes in Harbor Point's shared production cluster is also an admission-control problem. The backfill reads table pages, writes new index entries, emits WAL or replication traffic, warms and evicts cache lines, and may trigger page splits or LSM compaction. Every one of those steps competes with foreground work. If the system treats maintenance as free background progress, then CedarHarbor's repair job can consume the same IO queue depth and WAL bandwidth that NorthPier needs for booking commits.
This is where the previous lesson matters. Online reindexing must spend the owning tenant's budget, not invisible cluster magic. Harbor Point therefore needs tenant-aware maintenance tokens: limits for scan bandwidth, write rate, compaction debt, and concurrent schema jobs. Those controls should interact with the same admission system that governs foreground workloads, so the cluster can say "pause the backfill for 30 seconds because market-open writes are burning the latency budget" instead of discovering the conflict afterward in an incident review.
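A minimal sketch of that charging model, assuming a hypothetical per-tenant token bucket. Real admission control is far richer, but the key property survives even in this toy: the rebuild and the tenant's own foreground traffic draw from the same budget, and other tenants' budgets are untouched.

```go
// A sketch of charging backfill IO to the owning tenant's budget,
// assuming a hypothetical per-tenant token bucket. Maintenance asks
// the same admission gate that foreground traffic uses.
package main

import "fmt"

type TenantBudget struct {
	Tenant string
	Tokens int // remaining IO tokens in the current interval
}

// admit grants n tokens or refuses the request; a refusal means the
// backfill pauses instead of stealing another tenant's latency budget.
func (b *TenantBudget) admit(n int) bool {
	if b.Tokens < n {
		return false
	}
	b.Tokens -= n
	return true
}

func main() {
	cedar := &TenantBudget{Tenant: "CedarHarbor", Tokens: 100}

	// Foreground traffic and the rebuild draw from the same bucket.
	cedar.admit(80) // CedarHarbor's own audit queries
	if !cedar.admit(50) {
		fmt.Println("backfill paused: CedarHarbor's budget exhausted")
	}
	// NorthPier's bucket is untouched; its 09:30 bookings keep their tokens.
	north := &TenantBudget{Tenant: "NorthPier", Tokens: 100}
	fmt.Println(north.Tenant, "tokens remaining:", north.Tokens)
}
```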
The observability surface also changes. Median query latency is not enough. Operators need to see shadow-index size growth, backfill scan throughput, post-snapshot catch-up lag, WAL bytes generated by the job, replica apply lag, lock wait time around cutover, and any user-visible query-plan changes once the new index becomes available. A rebuild that "completed successfully" but pushed replica lag high enough to break read scaling was not actually successful in production terms.
Unique indexes make the trade-offs sharper. If Harbor Point rebuilds a unique index on reservation reference numbers, the job must detect duplicates that were introduced or revealed while the shadow index was catching up. That can require stronger validation phases, duplicate resolution rules, or failure semantics that leave the old index in place and mark the new one as invalid. The lesson is general: the more semantic meaning an index carries, the more the cutover behaves like a controlled migration rather than a mechanical copy.
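Below is a sketch of the duplicate-detection pass a unique-index validation phase needs, with hypothetical Entry and findDuplicates names; any hit must fail validation and leave the old index authoritative.

```go
// A sketch of duplicate detection for a unique shadow index, with
// hypothetical entry types. Duplicates found here must fail validation,
// not "win" silently at cutover.
package main

import "fmt"

type Entry struct {
	Key   string // unique reservation reference number
	RowID int64
}

// findDuplicates reports keys that map to more than one row, whether
// they were introduced by post-snapshot writes or merely revealed by
// rebuilding against the full key space.
func findDuplicates(entries []Entry) map[string][]int64 {
	byKey := make(map[string][]int64)
	for _, e := range entries {
		byKey[e.Key] = append(byKey[e.Key], e.RowID)
	}
	dups := make(map[string][]int64)
	for k, rows := range byKey {
		if len(rows) > 1 {
			dups[k] = rows
		}
	}
	return dups
}

func main() {
	shadow := []Entry{
		{Key: "REF-1001", RowID: 1},
		{Key: "REF-1002", RowID: 2},
		{Key: "REF-1001", RowID: 9}, // arrived during catch-up
	}
	if dups := findDuplicates(shadow); len(dups) > 0 {
		fmt.Println("validation failed, duplicates:", dups)
	}
}
```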
At some point, the cheapest answer is not "make the online rebuild smarter." It is "change the operating plan." Teams often rate-limit by tenant, split the rebuild into partitions, run follower-side preparation work before final cutover, or schedule the last validation phase away from known traffic peaks. Those decisions are not signs of weakness. They are signs that the team understands maintenance as part of the serving system, not as work that happens outside it.
Troubleshooting
Issue: The rebuild never reaches cutover because catch-up lag keeps increasing.
Why it happens / is confusing: The initial backfill can look healthy while post-snapshot writes outpace the job's ability to apply deltas. Dashboards may show strong scan throughput even though the shadow index is falling further behind the live frontier.
Clarification / Fix: Track a real freshness metric, such as "shadow index caught up through timestamp X" or "delta queue depth," not just rows scanned. If lag grows under normal traffic, lower maintenance concurrency, shard the work more finely, or move the final catch-up phase to a quieter window rather than pretending the scan rate is the whole story.
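A sketch of that freshness metric, with hypothetical field names: catch-up lag is the distance between what the shadow index has applied and the table's serving frontier, and it must trend toward zero for cutover to be reachable.

```go
// A sketch of the freshness metric the fix describes, with hypothetical
// fields. Scan throughput can look great while catchUpLag keeps growing;
// this number, not rows scanned, decides whether cutover is reachable.
package main

import (
	"fmt"
	"time"
)

type RebuildProgress struct {
	AppliedThrough  time.Time // shadow index is caught up through here
	ServingFrontier time.Time // latest committed write on the table
	DeltaQueueDepth int       // captured-but-unapplied mutations
}

// catchUpLag is the honest "are we converging?" signal: under normal
// traffic it must shrink, or the budget does not match the workload.
func (p RebuildProgress) catchUpLag() time.Duration {
	return p.ServingFrontier.Sub(p.AppliedThrough)
}

func main() {
	now := time.Now()
	p := RebuildProgress{
		AppliedThrough:  now.Add(-42 * time.Second),
		ServingFrontier: now,
		DeltaQueueDepth: 18_000,
	}
	fmt.Printf("catch-up lag: %s, delta queue: %d\n", p.catchUpLag(), p.DeltaQueueDepth)
}
```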
Issue: Foreground write latency spikes whenever the rebuild is active.
Why it happens / is confusing: Online reindexing preserves availability, but it does not eliminate extra work. Dual writes, additional WAL, cache churn, and compaction can all show up as write-path regression even when no table lock is held.
Clarification / Fix: Charge the rebuild to explicit maintenance budgets and graph its write amplification separately from application traffic. If the cluster cannot throttle scan and index-write tokens independently, the job is operating outside the same control system that protects user-facing SLOs.
Issue: A node or region fails during validation or cutover, leaving operators unsure whether the new index is safe to use.
Why it happens / is confusing: "Job running" is not precise enough state for failure recovery. Without durable phase markers, teams do not know whether to resume validation, retry cutover, or discard the shadow index.
Clarification / Fix: Persist rebuild state explicitly: snapshot frontier, completed spans, validation status, and cutover status. Recovery logic should be able to answer "resume," "abort," or "already public" from metadata alone. If humans must infer the state from logs during an outage, the maintenance protocol is under-specified.
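A sketch of recovery as a pure function of persisted metadata, with hypothetical phase markers: given only the durable state, the job system can answer resume, abort, or already public without reading logs.

```go
// A sketch of recovery answering "resume, abort, or already public"
// from persisted metadata alone, with hypothetical phase markers.
// No log archaeology: the decision depends only on durable state.
package main

import "fmt"

type Phase int

const (
	Backfill Phase = iota
	CatchUp
	Validating
	CutoverCommitted
)

type RebuildState struct {
	Phase            Phase
	ValidationFailed bool
	SpansRemaining   int
}

// recoveryAction is what the job system runs after a crash or failover.
func recoveryAction(s RebuildState) string {
	switch {
	case s.Phase == CutoverCommitted:
		return "already public: finish old-index retirement only"
	case s.ValidationFailed:
		return "abort: discard shadow index, old index stays authoritative"
	case s.Phase == Backfill && s.SpansRemaining > 0:
		return "resume: continue backfill from first incomplete span"
	default:
		return "resume: replay remaining deltas, then re-run validation"
	}
}

func main() {
	fmt.Println(recoveryAction(RebuildState{Phase: Backfill, SpansRemaining: 3}))
	fmt.Println(recoveryAction(RebuildState{Phase: CutoverCommitted}))
}
```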
Advanced Connections
Connection 1: 045.md explains who pays for the maintenance work
Multi-tenant isolation defined the policy boundary: one tenant's background work should not quietly consume another tenant's latency budget. Online reindexing is the exact case that tests whether that policy is real. A shadow index build produces read amplification, write amplification, and compaction or vacuum debt, so it forces the engine to prove that maintenance tokens and tenant ownership reach below the SQL surface.
Connection 2: 047.md turns online maintenance into an outage-management problem
A resumable rebuild already looks like a runbook: it has phase markers, cutover conditions, and abort criteria. The capstone lesson on primary-region outages extends that operational mindset. If the primary region fails during a schema job, teams need deterministic answers about whether the secondary can continue serving, whether the rebuild should resume after failover, and which metadata state is authoritative.
Resources
Optional Deepening Resources
- [DOC] PostgreSQL CREATE INDEX
- Link: https://www.postgresql.org/docs/current/sql-createindex.html
- Focus: Read the CONCURRENTLY behavior and locking notes to see how a widely used MVCC engine exposes online index builds while keeping planner visibility staged.
- [DOC] PostgreSQL REINDEX
- Link: https://www.postgresql.org/docs/current/sql-reindex.html
- Focus: Compare ordinary REINDEX with REINDEX CONCURRENTLY and note why some repairs still require extra cleanup and multiple passes.
- [DOC] CockroachDB Docs: Online Schema Changes
- Link: https://www.cockroachlabs.com/docs/stable/online-schema-changes.html
- Focus: Study how a distributed SQL engine persists schema-change jobs, backfill progress, and cutover state so maintenance can survive node movement and failures.
- [DOC] MySQL 8.4 Reference Manual: InnoDB and Online DDL
- Link: https://dev.mysql.com/doc/refman/8.4/en/innodb-online-ddl-operations.html
- Focus: Separate true online behavior from INPLACE and COPY implementation details, and map those choices to the availability and write-amplification trade-offs in this lesson.
- [DOC] Vitess Online DDL Overview
- Link: https://vitess.io/docs/24.0/user-guides/schema-changes/overview/
- Focus: See how large fleets wrap schema and index changes in queued, resumable workflows instead of treating DDL as a one-shot administrative command.
Key Insights
- Online reindexing is a correctness protocol before it is a performance tool - The rebuilt index is safe only if snapshot backfill, concurrent-write capture, validation, and cutover preserve logical equivalence with the old structure.
- The catch-up phase is where scale shows up - A rebuild that cannot keep pace with live writes is not "almost done"; it is a maintenance job whose operating budget no longer matches the workload.
- Maintenance belongs inside the same control plane as serving traffic - If reindexing is not throttled, attributed, and observable like any other tenant workload, it becomes a hidden path for noisy-neighbor incidents.