LESSON
Day 460: Schema Evolution and Compatibility Modes
The core idea: Schema evolution is safe only when the change rules match deployment reality; compatibility modes are rollout contracts for old and new readers across time, not abstract labels for a schema registry.
Today's "Aha!" Moment
In Serialization Formats and Binary Contracts, PayLedger decided how to represent a SettlementInstruction on the wire. That solved the "what do these bytes mean today?" problem. Production systems immediately run into the harder follow-up question: what happens next week when the schema changes but the fleet does not upgrade at once?
PayLedger has exactly that problem. The settlement coordinator publishes instructions to Kafka, the bank adapter consumes them almost immediately, the reconciliation service may lag for hours during month-end traffic, and the audit replay job can reread fourteen days of old events at any time. Treasury now wants a new settlement_purpose_code field for cross-border payouts and wants fee breakdowns separated from the old amount_minor total. The system has to support new data, old binaries, and historical events simultaneously.
That is what compatibility modes actually describe. They tell you which pairings must keep working: new readers on old data, old readers on new data, or both. Once you see that, "backward" and "forward" stop sounding like vocabulary from a registry UI and start sounding like deployment rules. The common mistake is to treat schema evolution as an additive field exercise. In production, the dangerous breaks are usually semantic: reusing a field number, changing a unit, or making an old replay impossible even though the new payload still parses.
Why This Matters
For PayLedger, a schema mistake does not stay inside the data team. If the coordinator starts emitting a new event shape that the lagging reconciliation service cannot read, settlement appears successful in one service and missing in another. If a replay job cannot decode yesterday's events after a deploy, the team loses the fastest path to rebuilding derived state during an incident. If a field keeps the same name but changes from "gross payout amount" to "net payout amount after fees," everything may deserialize cleanly while downstream ledgers become wrong.
Schema evolution is therefore a time-travel problem as much as a data-format problem. Durable logs, rolling deploys, canaries, batch backfills, and multiple language runtimes guarantee that old and new versions will coexist. The question is not whether change will happen. The question is whether the platform defines what kinds of change are safe, how that safety is enforced, and in what deployment order producers and consumers move.
When teams get this right, schema changes become boring in the best way. The registry or CI checks reject unsafe edits before deployment. Engineers know whether consumers must roll before producers or vice versa. Historical reprocessing stays viable. Most importantly, the system can evolve without turning every new field into a cross-team coordination incident.
Learning Objectives
By the end of this session, you will be able to:
- Explain why schema evolution is a deployment problem - Trace how rolling upgrades, retained events, and replay jobs force multiple schema versions to coexist.
- Distinguish common compatibility modes precisely - Map backward, forward, full, and transitive compatibility to concrete reader-writer pairings and rollout order.
- Evaluate whether a schema change is truly safe - Separate parse-level compatibility from semantic compatibility and choose safer migration patterns.
Core Concepts Explained
Concept 1: Schema evolution starts when old data and old binaries still matter
PayLedger's Kafka topic for settlement instructions retains events for fourteen days. That one operational detail changes everything. Even if the coordinator is upgraded in minutes, other components do not all move at the same speed. The bank adapter is close to real time, the reconciliation worker might be paused behind backlog, and the replay tooling might intentionally reread last week's events after a bug fix. The system is living with multiple versions whether the team plans for it or not.
That means every schema change has to survive two different directions of time. A new reader may need to understand old events that were already written before the deploy. An old reader may need to survive events written by a newly deployed producer that got ahead of the rest of the fleet. Those are different guarantees. Treating them as the same is how teams end up with a schema that is "compatible" in one document and broken in one rollout.
The simplest mental model is a four-cell matrix:
                     Data written by
                 old producer      new producer
old reader       works             works?
new reader       works?            works
The diagonal cells are trivial: code reading data written by its own version is the easy case. The off-diagonal cells are the real contract. New reader plus old data is the replay and retained-history problem. Old reader plus new data is the rolling-upgrade and lagging-consumer problem. Once the team writes those cells down explicitly, compatibility stops being a vague promise and becomes a set of obligations that can be tested.
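The matrix is directly testable. Below is a minimal Python sketch of that idea, with plain dicts standing in for decoded events; the event shapes and reader functions are hypothetical stand-ins for PayLedger's real deserializers.

# A toy model of the four-cell matrix: two writer versions, two reader
# versions, and a check that every pairing still works.

OLD_EVENT = {"run_id": "r-1", "amount_minor": 125_000}           # written before the deploy
NEW_EVENT = {"run_id": "r-2", "amount_minor": 125_000,
             "settlement_purpose_code": "XB"}                    # written after the deploy

def old_reader(event):
    # The old binary knows nothing about settlement_purpose_code;
    # it must tolerate (ignore) fields it has never seen.
    return event["run_id"], event["amount_minor"]

def new_reader(event):
    # The new binary must tolerate old events that lack the new field,
    # so it reads it with a default instead of assuming it exists.
    return event["run_id"], event["amount_minor"], event.get("settlement_purpose_code", "")

# Every cell of the matrix becomes an explicit, checkable obligation.
for reader in (old_reader, new_reader):
    for event in (OLD_EVENT, NEW_EVENT):
        print(reader.__name__, "on", sorted(event), "->", reader(event))

Each printed line is one cell of the matrix; in a real pipeline these pairings would be assertions in CI rather than print statements.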
This also explains why schema evolution is built on the previous lesson's wire-contract details. A Protobuf message can often tolerate added fields because unknown tags can be skipped. An Avro workflow can resolve differences between writer schema and reader schema if defaults and field names line up. A positional binary record with no self-description has almost no room to evolve safely because old readers cannot recover when the layout shifts. Evolution depends on how field identity, defaults, and unknown data are handled at the format level.
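To see why defaults carry so much weight, here is a deliberately simplified Python model of Avro-style writer/reader schema resolution. Real Avro resolution also handles type promotion and aliases, so treat this as a sketch of the defaults rule only; the field names are carried over from the PayLedger example.

# Simplified model of reader-schema resolution: the reader keeps the fields
# it knows, fills missing ones from defaults, and drops unknown writer fields.

READER_SCHEMA = {
    "run_id": None,                        # required: no default, must be present
    "amount_minor": None,                  # required
    "settlement_purpose_code": "UNKNOWN",  # optional: default fills in old data
}

def resolve(reader_schema, written_record):
    resolved = {}
    for field, default in reader_schema.items():
        if field in written_record:
            resolved[field] = written_record[field]
        elif default is not None:
            resolved[field] = default  # old data, new reader: the default saves us
        else:
            raise ValueError(f"cannot resolve required field {field!r}")
    return resolved  # fields only the writer knows are silently dropped

old_record = {"run_id": "r-1", "amount_minor": 125_000}
print(resolve(READER_SCHEMA, old_record))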
Concept 2: Compatibility modes encode rollout order and replay guarantees
The exact labels vary a bit by tool, but the standard vocabulary is consistent enough to be useful. In registry-based workflows, backward compatibility means the new schema can read data produced with the previous schema. Operationally, that supports "readers first, writers second" rollouts because updated consumers can still process the backlog that exists before producers switch. Forward compatibility means the old schema can read data produced with the new schema, which supports "writers first, readers second" rollouts because lagging consumers can survive data from the upgraded producer.
Full compatibility means both directions hold for the adjacent version pair. That gives deployment more flexibility because either side can move first without breaking the other, at least for the current version transition. Transitive variants widen the promise from "compatible with the immediately previous schema" to "compatible with all earlier schemas in scope." That distinction matters a lot in systems like PayLedger where replay jobs may reach back many versions, not just one deployment behind.
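The difference is easiest to see by enumerating which reader-writer version pairs each promise covers. A small illustrative sketch in Python, with hypothetical version numbers:

# Which (reader_version, writer_version) pairs must keep working?
# Adjacent backward compatibility covers one step back; the transitive
# variant covers every retained version.

versions = [1, 2, 3, 4]  # 4 is the schema being proposed

def backward_pairs(new, transitive):
    older = versions[:versions.index(new)]
    covered = older if transitive else older[-1:]
    return [(new, old) for old in covered]  # new reader, old writer

print("BACKWARD:           ", backward_pairs(4, transitive=False))  # [(4, 3)]
print("BACKWARD_TRANSITIVE:", backward_pairs(4, transitive=True))   # [(4, 1), (4, 2), (4, 3)]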
You can summarize the operational meaning like this:
backward          new readers can read old data                  deploy consumers before producers
forward           old readers can read new data                  deploy producers before consumers
full              both of the above for adjacent versions        either side can move first
full transitive   both directions across all retained versions   either side can move first, and deep replay stays safe
For PayLedger, backward compatibility alone is not enough if the reconciliation service might lag behind while the coordinator starts writing newer events. Forward compatibility alone is not enough if the audit replay job must rebuild state from older events after a deploy. Because the system has both lagging consumers and historical replay, the safest platform rule is usually a transitive mode for durable event schemas, even if short-lived RPC contracts can tolerate a narrower adjacent-version guarantee.
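In a Confluent-style registry, that platform rule becomes per-subject configuration. Here is a sketch against the registry's REST interface; the URL, subject name, and schema file are placeholders, and the exact endpoints may vary by registry version.

# Sketch: pin a durable event subject to FULL_TRANSITIVE and pre-check a
# candidate schema before anything deploys. Assumes a Confluent-style
# Schema Registry; adjust URL and subject to your environment.
import json
import urllib.request

REGISTRY = "http://localhost:8081"             # placeholder
SUBJECT = "settlement.instructions-value"      # placeholder

def registry_call(method, path, body=None):
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        REGISTRY + path, data=data, method=method,
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Enforce the strict mode for this subject.
registry_call("PUT", f"/config/{SUBJECT}", {"compatibility": "FULL_TRANSITIVE"})

# Ask the registry whether a candidate schema passes the configured rule.
candidate = {"schema": open("settlement_instruction.avsc").read()}  # placeholder file
result = registry_call("POST", f"/compatibility/subjects/{SUBJECT}/versions/latest", candidate)
print(result)  # e.g. {"is_compatible": true}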
One important nuance is that a registry can validate only what the schema language expresses. It can often tell you that field 7 still exists or that a default value lets readers fill in missing data. It cannot reliably tell you whether amount_minor still means the same business quantity as before. Compatibility modes are necessary, but they are not the whole review. You still need human judgment for semantic changes.
Concept 3: Safe evolution depends on representation details and disciplined migration patterns
Suppose PayLedger starts with this Protobuf event:
message SettlementInstruction {
  string run_id = 1;
  int64 employer_id = 2;
  int64 amount_minor = 3;                  // gross settlement amount in minor currency units
  string currency = 4;                     // currency code for amount_minor
  int64 settlement_deadline_unix_ms = 5;   // Unix epoch milliseconds
}
Treasury now wants fee transparency, so the team is tempted to replace amount_minor with net_amount_minor and fee_amount_minor. Doing that as an in-place type or meaning change is the classic dangerous move. If field 3 used to mean gross settlement amount and now means net amount after fees, an old consumer may still parse the message perfectly and quietly compute the wrong ledger entries. Parse-level compatibility can coexist with semantic breakage.
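A hypothetical sketch of what that failure looks like in code: both events parse identically, but the reused field quietly changes what the old consumer records.

# Field 3 ("amount_minor") parses the same before and after the change;
# only its meaning moves from gross to net. The old consumer cannot tell.

FEES = 250

# Event written by the old producer: field 3 carries the gross amount.
old_style = {"amount_minor": 125_000}

# Event written by the new producer after the in-place redefinition:
# field 3 now carries the net amount, with fees implied elsewhere.
new_style = {"amount_minor": 125_000 - FEES}

def old_consumer_ledger_entry(event):
    # Old code still believes field 3 is the gross settlement amount.
    return {"gross_minor": event["amount_minor"]}

print(old_consumer_ledger_entry(old_style))  # {'gross_minor': 125000}  correct
print(old_consumer_ledger_entry(new_style))  # {'gross_minor': 124750}  silently wrong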
The safer migration is staged. First add new fields with new identifiers, keeping the old one:
message SettlementInstruction {
  string run_id = 1;
  int64 employer_id = 2;
  int64 amount_minor = 3;                  // deprecated, still populated during migration
  string currency = 4;
  int64 settlement_deadline_unix_ms = 5;
  string settlement_purpose_code = 6;      // new: cross-border payout classification
  int64 net_amount_minor = 7;              // new: payout amount after fees
  int64 fee_amount_minor = 8;              // new: fee portion split out of the old total
}
Then update consumers to understand the new fields while still accepting the old one. Only after all critical readers have switched do producers stop depending on amount_minor. Eventually the team can deprecate the field and reserve its identifier so nobody reuses tag 3 for a different meaning later. The same general discipline applies outside Protobuf: in Avro, additive changes with sensible defaults are usually safer than removals; in JSON contracts, introducing a new field is often easier than silently changing the type or unit of an old field.
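Here is a sketch of the dual-read step on the consumer side, again with plain dicts standing in for the decoded message; the helper is hypothetical, but the field names come from the schema above. For the eventual retirement step, Protobuf spells the reservation as reserved 3; and reserved "amount_minor"; so the tag and name can never be recycled.

# Dual-read during the migration window: prefer the new fields when the
# producer populated them, fall back to the deprecated total otherwise.

def gross_amount_minor(event):
    if "net_amount_minor" in event and "fee_amount_minor" in event:
        # New-style event: reconstruct the gross total from its parts.
        return event["net_amount_minor"] + event["fee_amount_minor"]
    # Old-style event, or a producer that has not switched yet.
    return event["amount_minor"]

print(gross_amount_minor({"amount_minor": 125_000}))       # old producer
print(gross_amount_minor({"amount_minor": 125_000,
                          "net_amount_minor": 124_750,
                          "fee_amount_minor": 250}))       # dual-writing producer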
The trade-off is obvious: disciplined evolution creates temporary redundancy. Dual-writing fields, carrying deprecated names, and preserving reserved identifiers all feel slower than a clean break. But that temporary mess buys production stability. The alternative is to save a week in schema cleanup and spend a month untangling corrupted derived data, failed replays, or consumers that only break under backlog conditions. In real platforms, schema hygiene is not about elegance. It is about keeping change survivable.
Troubleshooting
Issue: A replay job fails after a consumer deploy, even though live traffic still looks healthy.
Why it happens / is confusing: The new reader can handle events being produced now, but it cannot decode older retained events because a supposedly harmless change removed a field, changed a type, or relied on a default that old data never had. Live traffic hides the problem because it exercises only the newest writer-reader pairing.
Clarification / Fix: Test consumer changes against retained historical samples, not just newly produced fixtures. If replay is an operational requirement, adopt a transitive compatibility rule and reject schema changes that only work against the latest version.
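One way to make that rule concrete is a pytest-style gate that decodes a retained sample from every schema version still in the log. The fixture layout and decoder below are placeholders for the consumer's real deserializer.

# Replay safety check: the candidate consumer must decode a sample from
# every schema version still present in the 14-day log, not just the newest.
import json
import pathlib

FIXTURE_DIR = pathlib.Path("fixtures/settlement_instruction")  # one file per retained version

def decode(raw):
    # Stand-in for the consumer's real deserializer.
    return json.loads(raw)

def test_all_retained_versions_decode():
    failures = []
    for sample in sorted(FIXTURE_DIR.glob("v*.json")):
        try:
            decode(sample.read_text())
        except Exception as exc:  # any decode error fails the gate
            failures.append((sample.name, exc))
    assert not failures, f"replay-unsafe change: {failures}"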
Issue: The registry reports a compatible change, but downstream numbers become inconsistent.
Why it happens / is confusing: Compatibility checks reason about structure, not business meaning. A field can keep the same type and still change semantics from gross amount to net amount, from local time to UTC, or from one enum interpretation to another.
Clarification / Fix: Treat semantic changes as new fields or new schema versions with explicit migration notes. Keep the old field until consumers have switched, and review unit, timezone, and enum meaning changes as API breaks even when the registry stays green.
Issue: An old consumer starts misreading data after a field was removed months ago.
Why it happens / is confusing: Someone reused the old field number or field name for a different purpose. The payload still parses, but the old reader maps the familiar identifier to the wrong meaning.
Clarification / Fix: Reserve removed identifiers permanently where the schema system supports it, and document deprecations in the schema review process. "Unused" is not the same as "available for reuse."
Advanced Connections
Connection 1: Wire-format design determines how much evolution room you have
This lesson is a direct continuation of Serialization Formats and Binary Contracts. Tagged formats, explicit defaults, and clear field identity rules are what make backward and forward compatibility possible in the first place. Teams often blame schema governance for a problem that was actually created earlier by choosing a representation with weak evolution properties.
Connection 2: Durable event logs turn schema compatibility into platform policy
Compatibility choices become stricter when data outlives the process that wrote it. Kafka topics, CDC streams, and rebuildable projections all depend on the ability to decode history long after the original producer has changed. That is why event-platform teams often enforce transitive modes more aggressively than request-response APIs: retained data is an operational asset only if later code can still read it.
Resources
- [DOC] Protocol Buffers: Updating A Message Type
- Focus: Review which field edits are safe, why tag reuse is dangerous, and how unknown fields affect rolling upgrades.
- [SPEC] Apache Avro Specification
- Focus: Read the schema resolution rules to see how writer and reader schemas cooperate during evolution.
- [DOC] Confluent Schema Evolution and Compatibility
- Focus: Map backward, forward, full, and transitive modes to real deployment sequences and registry enforcement.
Key Insights
- Compatibility modes are deployment contracts - They tell you which reader-writer pairings must survive while producers, consumers, and replay tools run different versions.
- Structural compatibility is not semantic safety - A message can deserialize correctly and still carry the wrong business meaning after a field reinterpretation.
- Safe evolution prefers additive, staged change - New fields, defaults, dual-read or dual-write windows, and reserved identifiers are slower upfront but far cheaper than repairing broken history.