Day 415: Online Schema Migrations and DDL Safety

Database Engine Internals and Implementation

The core idea: A schema migration is safe only when old code, new code, replicas, and recovery can all operate correctly throughout the overlap period; "online" DDL is therefore a compatibility protocol, not just a faster ALTER TABLE.


Today's "Aha!" Moment

In 14.md, Harbor Point moved older reservation partitions to colder storage so the trading path could keep its real working set on the fast tier. That helped market-open latency, but it created a new constraint. Compliance now wants a risk_bucket column and a new index on active reservations, yet the reservations table spans hot NVMe-backed partitions and colder immutable partitions kept for years. Harbor Point cannot freeze order intake, rewrite every partition in one shot, or assume a failover will happen only after the migration completes.

The tempting mental model is "run one ALTER TABLE, wait, and the schema is updated." Production databases rarely behave that cleanly. Some changes are metadata-only. Some build new indexes in the background but still need metadata locks at the beginning and end. Some rewrite the whole table or create a shadow copy, which turns one DDL statement into a long-running data movement job. Even when the SQL syntax looks simple, the engine may be copying rows, generating WAL or binlog traffic, invalidating caches, and pushing replicas behind.

The useful shift is to stop asking whether the DDL statement is "online" and start asking whether the whole migration window is compatible. During that window, old application binaries may still write the old shape, new binaries may expect the new column, replicas may replay migration traffic later than the primary, and cold historical partitions may lag behind hot partitions in backfill progress. If every actor can still read and write safely, the migration is online in the only sense that matters. If not, the label on the DDL algorithm does not save you.


Why This Matters

Harbor Point's migration is not just a developer convenience change. The new risk_bucket field feeds surveillance queries and spares incident analysis the expensive ad hoc computation it would otherwise require. But the table being changed sits directly on the critical path for order routing and post-trade reconciliation. A naive rewrite could queue behind one long transaction, take a metadata lock that blocks fresh writes, and then create a wall of replica lag just when the team needs the database to stay failover-ready.

This is why safe online DDL belongs in storage-engine reasoning, not in release-checklist folklore. Before the team understands migration mechanics, schema evolution feels like a binary choice between "ship the change" and "schedule downtime." Afterward, the team can separate the change into compatibility-safe stages: introduce a structure the old code can ignore, backfill at a controlled rate, validate that both representations agree, switch reads only when all paths are ready, and retire the old form last.

That matters even more because of the tiering work from 14.md. Historical partitions may not deserve an eager rewrite, but they still have to remain readable during the transition. Online migration safety therefore means preserving service availability, replica health, and data compatibility across multiple storage tiers at once. It is the bridge from last lesson's working-set control to 16.md, where you will combine engine, index, and isolation choices into one coherent architecture decision.


Learning Objectives

By the end of this session, you will be able to:

  1. Classify a schema change by its physical consequences - Distinguish metadata-only DDL, background rebuilds, and full rewrites before choosing a rollout strategy.
  2. Trace a safe online migration sequence - Explain how expand, backfill, validation, cutover, and cleanup preserve compatibility under live traffic.
  3. Evaluate DDL safety in production terms - Reason about metadata locks, replica lag, backfill pressure, and rollback options instead of treating schema changes as isolated SQL statements.

Core Concepts Explained

Concept 1: The first safety question is "what physical work does this DDL really trigger?"

Harbor Point's request sounds simple: add risk_bucket, populate it for all reservations, and support a new index for analysts filtering by trading day and risk class. The dangerous assumption is that one piece of SQL always implies one kind of work. In practice, the same ALTER TABLE family spans several very different engine behaviors. One change may touch only metadata. Another may scan the table in the background to build an index. A third may rewrite every row into a new physical layout or maintain a shadow table until cutover.

That classification is the first production decision because it tells you where the risk actually lives. If the change is metadata-only, the main danger may be a brief metadata lock and application-version skew. If the engine needs a background rebuild, the risk shifts toward I/O amplification, replica lag, and cancellation semantics. If the change forces a table rewrite, you are no longer talking about a small control-plane action; you are running a data-movement job under the name of DDL.

For Harbor Point, the decision tree looks like this:

requested change
  -> add nullable column?          often metadata-only or very light
  -> build secondary index?        background scan + validation + final lock
  -> change type / drop old form?  may force rewrite or shadow-copy flow

The exact rules depend on the engine. PostgreSQL distinguishes ordinary index builds from CREATE INDEX CONCURRENTLY, and some ALTER TABLE forms still rewrite the heap while others do not. MySQL/InnoDB distinguishes ALGORITHM=INSTANT, INPLACE, and COPY, which sound reassuring until you remember that "in place" can still consume significant background resources and acquire metadata locks. The mechanism differs by engine, but the operational reading is the same: always determine whether the change rewrites data, what lock it takes, and how replication replays it before running it on a production primary.
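
On the MySQL side, one way to make the classification explicit is to state the algorithm and lock level in the DDL itself, so the statement fails immediately instead of silently degrading to a copying rebuild. A minimal sketch, assuming MySQL 8.0 syntax and reusing the reservations table and index names from the PostgreSQL example below:

-- Fail fast if the column add cannot be done as an instant metadata change.
ALTER TABLE reservations
  ADD COLUMN risk_bucket SMALLINT,
  ALGORITHM=INSTANT;

-- Fail fast if the index build would block concurrent writes.
ALTER TABLE reservations
  ADD INDEX idx_reservations_day_bucket (trading_day, risk_bucket),
  ALGORITHM=INPLACE, LOCK=NONE;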

The trade-off is straightforward. Treating DDL as a physical operation adds planning overhead, but skipping that classification is how teams discover too late that a "quick" schema change is really a table copy competing with user traffic.
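
One low-cost way to apply that classification before the production run is to rehearse the DDL on a staging copy and watch whether PostgreSQL assigns the table a new storage file, which only happens when the heap is rewritten. A sketch, relying on PostgreSQL's transactional DDL and intended purely as a rehearsal:

BEGIN;
-- Record the table's current storage file.
SELECT relfilenode FROM pg_class WHERE relname = 'reservations';

ALTER TABLE reservations ADD COLUMN risk_bucket smallint;

-- Same relfilenode afterwards: metadata-only change.  A different value
-- would mean the ALTER rewrote the heap into a fresh file.
SELECT relfilenode FROM pg_class WHERE relname = 'reservations';
ROLLBACK;  -- rehearsal only; nothing is kept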

Concept 2: Safe migrations preserve compatibility first, then move data

Harbor Point should not introduce risk_bucket by immediately making it mandatory everywhere. The safe pattern is expand, backfill, cut over, then contract. That sequence looks conservative because it is designed around overlapping versions of the system, not around the convenience of one final schema state.

The expansion step introduces structures that old code can tolerate. Harbor Point adds the nullable column and the new index path without making either one required yet:

-- Metadata-only in PostgreSQL: a nullable column with no default does not
-- rewrite existing rows.
ALTER TABLE reservations
  ADD COLUMN risk_bucket smallint;

-- CONCURRENTLY builds the index without blocking writes; it is slower than a
-- plain build and must run outside a transaction block.
CREATE INDEX CONCURRENTLY idx_reservations_day_bucket
  ON reservations (trading_day, risk_bucket);

Application code then enters a compatibility window. New writers compute risk_bucket for fresh rows and updates, but readers still work if the column is null because older rows have not been backfilled yet. If Harbor Point were replacing an old representation rather than adding a new one, this is where dual-write or dual-read logic would live. The invariant is simple and strict: every row must remain readable by both the old and the new application version until the team deliberately ends that overlap.
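
Inside that compatibility window, readers that want the new field can tolerate unbackfilled rows by deriving it on the fly. A sketch of such a fallback read, where reservation_id, notional_usd, and the CASE rule are illustrative placeholders rather than Harbor Point's real schema or derivation:

-- Old rows may still have NULL risk_bucket until the backfill reaches them.
-- COALESCE keeps the read path correct during the overlap; the CASE branch
-- stands in for the real derivation rule.
SELECT reservation_id,
       trading_day,
       COALESCE(risk_bucket,
                CASE WHEN notional_usd >= 1000000 THEN 3 ELSE 1 END) AS risk_bucket
FROM reservations
WHERE trading_day = CURRENT_DATE;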

Backfill comes next, and it is not a clerical step. It is a write workload with real consequences. Updating billions of historical rows generates WAL or binlog traffic, touches indexes, and can pull cold partitions back into active I/O if the team is careless. Because of the tiering decisions from 14.md, Harbor Point may choose to backfill recent hot partitions immediately while handling very old immutable partitions through a slower offline batch or through compatibility logic that computes the field on demand until they are rewritten. That is still one migration; it just respects the physical reality of the storage layout.
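
The backfill itself is easiest to control as small keyed batches driven by a loop that can sleep, throttle, or stop between them. A PostgreSQL-flavored sketch of one batch, reusing the placeholder names from above; the date literal stands in for whatever boundary separates hot partitions from cold ones:

-- One batch: pick a bounded slice of rows that still need the value, then
-- stamp them.  Repeat until no rows remain, sleeping between batches and
-- pausing whenever replica lag exceeds the agreed budget.
WITH batch AS (
  SELECT reservation_id
  FROM reservations
  WHERE risk_bucket IS NULL
    AND trading_day >= DATE '2025-01-01'      -- hot partitions first
  ORDER BY reservation_id
  LIMIT 5000
  FOR UPDATE SKIP LOCKED                      -- do not contend with live order traffic
)
UPDATE reservations r
SET risk_bucket = CASE WHEN r.notional_usd >= 1000000 THEN 3 ELSE 1 END
FROM batch b
WHERE r.reservation_id = b.reservation_id;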

Only after parity checks pass does cutover become safe. Harbor Point validates that newly written rows always have risk_bucket, that backfilled rows match the expected derivation, that the new index serves the intended query plan, and that replicas have applied the same history. Then reads switch to depend on the new column. Constraints such as NOT NULL, old-column removal, or trigger retirement happen last because they intentionally end compatibility. Until that final step, rollback remains possible: the application can fall back to the old read path while the expanded schema still accepts both versions.
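
The validation gate can be written down as a short list of checks that must all come back clean before any read path switches. The sketch below reuses the placeholder derivation; the date literals are stand-ins for the range declared backfilled:

-- 1. No unbackfilled rows remain in the range declared complete.
SELECT count(*) AS missing
FROM reservations
WHERE trading_day >= DATE '2025-01-01'
  AND risk_bucket IS NULL;

-- 2. Backfilled values agree with the derivation rule.
SELECT count(*) AS mismatched
FROM reservations
WHERE trading_day >= DATE '2025-01-01'
  AND risk_bucket IS DISTINCT FROM
      CASE WHEN notional_usd >= 1000000 THEN 3 ELSE 1 END;

-- 3. The new index is actually chosen for the surveillance query shape.
EXPLAIN
SELECT * FROM reservations
WHERE trading_day = DATE '2025-06-02' AND risk_bucket = 3;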

The trade-off is more code and a longer migration window. In exchange, the team can survive partial rollout, failover, paused backfills, and rollback without turning schema evolution into downtime.

Concept 3: "Online" still fails when lock budgets, lag budgets, or rollback rules are missing

Many incidents blamed on "bad migrations" are really failures of operational control around an otherwise reasonable plan. Harbor Point could choose a safe expand-and-backfill design and still cause an outage if the initial ALTER TABLE waits behind a long transaction, because metadata lock acquisition often queues and then blocks new work behind it. A DDL statement that would have held the lock for milliseconds can create a traffic jam lasting minutes if it enters the lock queue at the wrong time.
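
In PostgreSQL terms, that lock budget comes down to two habits: look for transactions that would sit ahead of the DDL in the lock queue, and cap how long the DDL itself may wait. A sketch:

-- Anything old enough here will hold the DDL's metadata lock request hostage
-- and stack every later query behind it.
SELECT pid, now() - xact_start AS xact_age, state, left(query, 60) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;

-- Give up quickly instead of queueing; retry later rather than jam traffic.
SET lock_timeout = '2s';
ALTER TABLE reservations
  ADD COLUMN risk_bucket smallint;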

Replica behavior is the next budget to protect. Backfills and index builds create a second workload that replicas must replay, so a migration can silently reduce failover safety long before user queries fail. If Harbor Point promotes a lagging replica during the migration window, the new application may suddenly talk to a node that has the old schema or an incomplete backfill. That is why robust migrations gate every stage on replica state and schema version, not just on primary success.
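
The primary can report per-replica apply progress, which is the signal a backfill driver should poll before releasing the next batch. A PostgreSQL sketch, with the one-minute budget as an assumed value:

-- Pause the backfill while any failover candidate is too far behind.
SELECT application_name,
       replay_lag,
       replay_lag > interval '60 seconds' AS over_budget
FROM pg_stat_replication;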

Operationally, safe online DDL needs concrete guardrails:

  1. Lock budget - Run DDL with aggressive lock and statement timeouts, check for long-running transactions first, and retry later rather than let the request queue and block traffic.
  2. Lag budget - Throttle backfill and index builds against replica apply lag, pausing whenever a failover candidate falls behind.
  3. Validation gates - Advance to cutover only after parity checks, index-usage checks, and replica schema checks all pass.
  4. Rollback and abort rules - Keep the old read path available, define what pauses or cancels the backfill, and end compatibility (NOT NULL, column drops) only once rollback is no longer needed.

These controls make migrations slower and more procedural. That is the right price. Online DDL is valuable because it turns schema change into a managed background activity rather than a stop-the-world event, but that only works when the migration is treated like any other production workload with SLOs, budgets, and abort conditions.


Troubleshooting

Issue: The migration was advertised as online, but application writes still froze when the DDL started.

Why it happens / is confusing: The DDL itself may need only a brief metadata lock, but if it waits behind a long transaction, it can block later queries in the same lock queue. The outage comes from waiting for the lock, not from holding it for a long time once acquired.

Clarification / Fix: Run with aggressive lock timeouts, inspect long-lived transactions before the migration, and schedule the start when lock queues are short. "Online" never means "ignore metadata locks."

Issue: Backfill looks healthy on the primary, but replicas fall behind and failover confidence drops.

Why it happens / is confusing: Every row rewrite or index build generates replication work. A migration can meet primary-side latency goals while quietly turning replicas into stale recovery targets.

Clarification / Fix: Throttle by replica apply lag, not just by primary CPU. Pause backfill when lag exceeds budget, and do not cut over reads until failover targets have fully replayed the migration.

Issue: The new application version starts reading risk_bucket, but some rows still contain nulls.

Why it happens / is confusing: The schema change succeeded, so it is easy to assume the data is ready too. In reality, DDL completion and backfill completion are different milestones.

Clarification / Fix: Keep reads backward-compatible until parity checks pass. Enforce NOT NULL or remove fallback logic only after validation proves that the overlap window is finished.
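
When the team does decide to end the overlap, PostgreSQL lets the constraint arrive in two steps so the validating scan does not run under a long exclusive lock. A sketch of that final step:

-- Add the rule without validating existing rows (brief lock only) ...
ALTER TABLE reservations
  ADD CONSTRAINT risk_bucket_present CHECK (risk_bucket IS NOT NULL) NOT VALID;

-- ... then validate separately; this scans the table but under a much weaker
-- lock than a plain SET NOT NULL would take.
ALTER TABLE reservations VALIDATE CONSTRAINT risk_bucket_present;

-- On newer PostgreSQL versions, SET NOT NULL can then reuse the validated
-- constraint instead of rescanning the table.
ALTER TABLE reservations ALTER COLUMN risk_bucket SET NOT NULL;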


Advanced Connections

Connection 1: 14.md and this lesson are both about controlling background work so it does not poison the fast path

Tiering taught Harbor Point to protect the hot working set from cold reads. Online schema migrations apply the same discipline to change traffic. A backfill that rewrites old partitions too aggressively can pollute caches, consume bandwidth, and undo the latency gains from tiering. Safe migrations therefore need placement policy and throttling policy, not just correct SQL.

Connection 2: 16.md turns migration mechanics into an architecture-choice question

The monthly capstone asks you to pick an engine, index strategy, and isolation model for a workload. This lesson provides one of the selection criteria: how safely that stack supports schema evolution under live traffic. An engine that looks fast on day one but makes every future DDL change a blocking rewrite may be the wrong fit for Harbor Point's operational reality.


Key Insights

  1. Online DDL safety is a compatibility property - The migration is safe only if mixed schema versions can coexist across applications, replicas, and recovery during the overlap window.
  2. DDL syntax hides very different physical behaviors - Metadata-only changes, background rebuilds, and full rewrites need different rollout plans because they stress the system in different ways.
  3. Backfill and cutover are production workloads - They need lock budgets, lag budgets, validation, and rollback rules just like any other high-risk operational change.
