Schemas, Contracts, and Versioned Messages

Distributed Systems Foundations, Lesson 013 (20 min, beginner)

Core Insight

Imagine an order_created event. The order service deploys a new version at 10:00. The invoice service will not deploy until 10:20. For twenty minutes, the system contains old consumers, new producers, old messages in queues, and new messages being written.

If the new producer sends a shape that the old invoice service cannot read, the rollout breaks even though every service may look healthy on its own. The failure is not a CPU problem or a network problem. It is a contract problem.

A message contract is the agreement that lets independently deployed participants keep understanding one another while they evolve. A schema describes the shape of a message: field names, types, optional fields, required fields, and sometimes defaults. A compatibility rule says which old and new participants can safely coexist.

The trade-off is that safe evolution often turns one desired change into several releases. That feels slower than changing everything at once. It is also what lets a distributed system stay alive while different parts move at different times.

A Mixed-Version Rollout

Use an order_created event as the worked example.

Version 1 contains two fields:

order_created v1:
  order_id
  total

The team wants version 2 to include currency:

order_created v2:
  order_id
  total
  currency

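Adding currency can be safe if three things are true:

1. Old consumers ignore the unknown field instead of rejecting the message.
2. New consumers can handle messages where currency is missing.
3. The meaning of total does not change.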

The same change becomes unsafe if the old consumer rejects unknown fields, if the new consumer assumes currency always exists, or if the producer silently changes the meaning of total.
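
The difference between a safe and an unsafe reader can be sketched in Python. This is a minimal illustration, not the services' real decoders: the function names decode_v1_tolerant, decode_v1_strict, and decode_v2 and the OrderV2 type are hypothetical, and JSON is assumed as the wire format.

```python
import json
from typing import Optional, TypedDict


class OrderV2(TypedDict):
    order_id: str
    total: float
    currency: Optional[str]  # None means "not stated", never a guessed default


def decode_v1_tolerant(raw: str) -> dict:
    """Old consumer that keeps the fields it knows and ignores the rest."""
    data = json.loads(raw)
    return {"order_id": data["order_id"], "total": data["total"]}


def decode_v1_strict(raw: str) -> dict:
    """Old consumer that rejects any field it does not know."""
    data = json.loads(raw)
    unknown = set(data) - {"order_id", "total"}
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return data


def decode_v2(raw: str) -> OrderV2:
    """New consumer that accepts messages with or without currency."""
    data = json.loads(raw)
    return OrderV2(
        order_id=data["order_id"],
        total=data["total"],
        currency=data.get("currency"),
    )


v1_message = '{"order_id": "order-42", "total": 25.00}'
v2_message = '{"order_id": "order-43", "total": 25.00, "currency": "EUR"}'
```

With these definitions, decode_v1_tolerant(v2_message) succeeds, decode_v1_strict(v2_message) raises ValueError, and decode_v2(v1_message) returns a currency of None rather than an invented value.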

The important deployment question is:

Which combinations of old and new producers,
old and new consumers,
and old and new messages
must coexist safely?

That question matters because messages are not always read immediately. They can sit in queues, be retried after a deploy, move to a dead-letter queue, or be replayed months later during recovery.

Shape Compatibility And Meaning Compatibility

Schema tools are good at checking shape. They can often tell whether a field was removed, whether a type changed, whether a required field has no default, or whether a reader can ignore a new optional field.

Shape is not the whole contract.

Consider this change:

before:
  total = price before tax

after:
  total = price after tax

The field name and type stayed the same. A schema checker may be satisfied. The invoice service may still produce the wrong invoice because the meaning changed under the same name.

That is a failure of semantic compatibility, which asks whether old and new participants attach the same meaning to the message. Renaming a field breaks shape, and a parser will notice. Reusing a field with a changed meaning breaks trust more quietly, because nothing fails at parse time.

A useful contract review separates the two:

shape:
  Can the reader parse the message?

meaning:
  Will the reader interpret the message correctly?

Both need to be true for safe evolution.
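
The gap between the two checks can be shown with a small Python sketch. The functions shape_ok and invoice_amount, the 20% tax rate, and the amounts are all made up for illustration; the point is only that both messages pass the shape check while one produces a wrong invoice.

```python
def shape_ok(message: dict) -> bool:
    """Mechanical check: required fields exist with the expected types."""
    return isinstance(message.get("order_id"), str) and isinstance(
        message.get("total"), (int, float)
    )


def invoice_amount(message: dict, tax_rate: float) -> float:
    """Hypothetical invoice reader that still assumes total is the price before tax."""
    return round(message["total"] * (1 + tax_rate), 2)


before = {"order_id": "order-42", "total": 100.0}  # total = price before tax
after = {"order_id": "order-42", "total": 120.0}   # same order, total = price after tax

# Both messages satisfy the shape check.
assert shape_ok(before) and shape_ok(after)

print(invoice_amount(before, 0.20))  # 120.0, the intended invoice
print(invoice_amount(after, 0.20))   # 144.0, tax applied twice
```

No parser or type check separates the two messages. Only a documented meaning for total, reviewed alongside the schema, would catch the second case.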

Rollout Order

A safe rollout usually prepares readers before writers depend on a new shape or meaning.

For the currency change, the sequence might be:

1. Deploy consumers that can read v1 and v2.
2. Deploy producers that start sending currency.
3. Wait until old messages have drained or remain readable.
4. Make new behavior depend on currency only after the fleet is ready.
5. Remove old compatibility code after the retention and replay window closes.

This sequence protects mixed-version operation. At each step, the system can still process messages produced by the previous step.

The same idea applies to request/response APIs, event streams, database records, and command payloads. The slower and more asynchronous the workflow, the more important compatibility becomes. A synchronous HTTP caller and its callee usually run at the same moment, so version skew is limited to the deploy window. An event may be written today and interpreted much later by code that does not exist yet.

Version numbers help only when they are attached to real behavior. A field called schema_version = 2 is useful if readers know what version 2 means and what old versions may still appear. It is not magic.
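
One way to attach a version number to real behavior is an explicit table of readers, sketched here in Python. The names read_v1, read_v2, READERS, and read_order_created are hypothetical, and the rule that a message without schema_version counts as version 1 is an assumption a real contract would need to state.

```python
from typing import Callable, Dict


def read_v1(message: dict) -> dict:
    # Version 1 never carried currency; record that explicitly as None.
    return {"order_id": message["order_id"], "total": message["total"], "currency": None}


def read_v2(message: dict) -> dict:
    # Version 2 requires currency; a missing key raises KeyError.
    return {
        "order_id": message["order_id"],
        "total": message["total"],
        "currency": message["currency"],
    }


# Every version that may still appear has a named reader.
READERS: Dict[int, Callable[[dict], dict]] = {1: read_v1, 2: read_v2}


def read_order_created(message: dict) -> dict:
    # Assumption: messages without schema_version predate the field and are v1.
    version = message.get("schema_version", 1)
    reader = READERS.get(version)
    if reader is None:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return reader(message)
```

The table makes the supported versions visible in one place, and an unknown version fails loudly instead of being parsed with the wrong rules.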

Worked Trace: Add currency Without Breaking A Queue

Suppose the order service publishes to a queue and invoicing, analytics, and a repair worker consume the event. A message produced yesterday can still be delivered today.

M1, produced by v1
{ "order_id": "order-42", "total": 25.00 }

M2, produced by v2
{ "order_id": "order-43", "total": 25.00, "currency": "EUR" }

The safe rollout begins with readers. The new invoice consumer can read M1 with a documented legacy interpretation or an explicit unknown state, and it can read M2 when currency exists. It must not invent a default that could produce an incorrect invoice.

new reader + old message -> works while M1 exists
new reader + new message -> target behavior

Only then does the producer send M2. Old consumers must ignore the unknown optional field or use a compatible decoder; otherwise the producer cannot safely emit M2 until those consumers are upgraded. This is why a version number is not the compatibility plan. The real plan is the coexistence matrix:

old reader + old message -> works
new reader + old message -> works
old reader + new message -> works until old readers are gone
new reader + new message -> works
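
The matrix above can be turned into an executable check. The following Python sketch is one possible shape for such a test, using stand-in readers; old_reader, new_reader, and coexistence_matrix are hypothetical names, and "works" here means only that the reader parsed the message without an error.

```python
import itertools

M1 = {"order_id": "order-42", "total": 25.00}
M2 = {"order_id": "order-43", "total": 25.00, "currency": "EUR"}


def old_reader(message: dict) -> tuple:
    # Reads only the v1 fields and ignores anything else.
    return (message["order_id"], message["total"])


def new_reader(message: dict) -> tuple:
    # Reads currency when present; None marks it as unknown.
    return (message["order_id"], message["total"], message.get("currency"))


def coexistence_matrix(readers: dict, messages: dict) -> dict:
    """Return {(reader name, message name): True if the reader parsed the message}."""
    results = {}
    pairs = itertools.product(readers.items(), messages.items())
    for (reader_name, reader), (message_name, message) in pairs:
        try:
            reader(message)
            results[(reader_name, message_name)] = True
        except (KeyError, TypeError, ValueError):
            results[(reader_name, message_name)] = False
    return results


matrix = coexistence_matrix(
    {"old reader": old_reader, "new reader": new_reader},
    {"old message": M1, "new message": M2},
)
for (reader_name, message_name), ok in matrix.items():
    print(f"{reader_name} + {message_name} -> {'works' if ok else 'BREAKS'}")
```

Running a check like this against the real decoders of every consumer, before the producer changes, shows which cell of the matrix would break the rollout.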

Shape compatibility is necessary but insufficient. The contract for total: 25.00 must state whether the amount includes tax, shipping, and discounts, and which currency applies. If the meaning changes, add a new field such as total_including_tax rather than silently redefining total. A parser cannot catch a field whose name and type stayed the same while its business meaning changed.

After all producers write currency, compatibility is still needed for queued messages, retries, dead-letter stores, backups, and replay tools. Remove the legacy reader path only after the retention and recovery window has passed and telemetry shows no v1 messages remain. A schema registry can enforce mechanical checks, but ownership, meaning, rollout order, and replay readiness are contract decisions that teams must make explicit.

Useful evidence includes schema-validation rejections, unknown-field decode failures, missing-field fallback use, oldest queued message by version, and replay failures. The trade-off is more releases and temporary compatibility code in exchange for independent deployment and a safe operational recovery path.

Compatibility Is A Directional Promise

Teams often say that a schema is “compatible” without naming the direction. That leaves a gap precisely when an old queue item or an older consumer appears during a rollout.

Backward compatibility asks whether a new reader can interpret old data. In this example, it means the new invoice consumer can process M1, even though currency is missing. A default is safe only if it is a true, documented default. If the historical order may be in several currencies, the reader must obtain the currency from an authoritative record or stop and request repair.
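
A reader that refuses to guess might look like the following Python sketch. The NeedsRepair exception, the resolve_currency function, and the order_records mapping (standing in for whatever authoritative store the team has) are all hypothetical.

```python
from typing import Mapping, Optional


class NeedsRepair(Exception):
    """Raised when a message cannot be interpreted without guessing."""


def resolve_currency(message: Mapping, order_records: Mapping[str, str]) -> str:
    """Return the currency for an order_created message without inventing one."""
    currency: Optional[str] = message.get("currency")
    if currency is not None:
        return currency
    # Legacy path: look the order up in an authoritative record (hypothetical store).
    recorded = order_records.get(message["order_id"])
    if recorded is None:
        raise NeedsRepair(f"no currency on record for {message['order_id']}")
    return recorded
```

The legacy path either finds a recorded value or stops with a visible error. It never falls back to a constant such as "USD".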

Forward compatibility asks whether an old reader can tolerate data written by a new producer. It means the old analytics consumer ignores the new currency field rather than rejecting the entire event. Some formats support this naturally; some generated decoders need an explicit unknown-field rule. The contract needs a tested answer, not an assumption.

Full compatibility is the mixed-fleet promise that both directions hold while old and new writers, readers, and messages coexist. It costs more discipline, but it is usually the appropriate target for independently deployed event systems.

new reader reads M1 -> backward-compatible path
old reader reads M2 -> forward-compatible path
both paths work      -> mixed rollout can proceed
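
The two directions can be checked mechanically with a deliberately simplified model. The Python sketch below is not the rule set of any particular schema registry or serialization format: the Field description, the can_read function, and the assumption that readers ignore unknown fields are all illustrative.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Field:
    type: str
    required: bool = True
    has_default: bool = False


Schema = Dict[str, Field]


def can_read(reader: Schema, writer: Schema) -> bool:
    """True if a reader using `reader` can interpret data written with `writer`.

    Assumes readers ignore fields they do not know.
    """
    for name, spec in reader.items():
        written = writer.get(name)
        if written is not None and written.type != spec.type:
            return False  # same name, different type
        always_written = written is not None and written.required
        if spec.required and not always_written and not spec.has_default:
            return False  # reader needs a value the writer may omit
    return True


def compatibility(old: Schema, new: Schema) -> str:
    backward = can_read(reader=new, writer=old)  # new reader, old data
    forward = can_read(reader=old, writer=new)   # old reader, new data
    if backward and forward:
        return "full"
    if backward:
        return "backward"
    if forward:
        return "forward"
    return "none"


v1 = {"order_id": Field("string"), "total": Field("decimal")}
v2_optional = {**v1, "currency": Field("string", required=False)}
v2_required = {**v1, "currency": Field("string", required=True)}

print(compatibility(v1, v2_optional))  # full
print(compatibility(v1, v2_required))  # forward
```

Making currency required without a default keeps the forward direction but loses the backward one: the new reader can no longer process M1.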

The field type matters as well. Changing total from a decimal value to a formatted string such as "EUR 25.00" may look convenient to a user interface, but it breaks readers that calculate with a number. Adding a separate display field is safer. Replacing an enum value can similarly break old readers that assume the set is closed. When a reader cannot make a safe interpretation, it should reject the message into a visible repair path rather than silently inventing business data.
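
A reader that rejects into a visible repair path, rather than guessing, could be sketched in Python as follows. The names parse_order, consume, and KNOWN_CURRENCIES are hypothetical, the currency set is invented for the example, and the repair queue is a plain list standing in for a real dead-letter store.

```python
from decimal import Decimal
from typing import List, Tuple

KNOWN_CURRENCIES = {"EUR", "USD", "GBP"}  # hypothetical closed set for this reader


def parse_order(message: dict) -> dict:
    """Validate the fields this reader calculates with; raise instead of guessing."""
    total = message.get("total")
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        raise ValueError(f"total must be a number, got {total!r}")
    currency = message.get("currency")
    if currency is not None and currency not in KNOWN_CURRENCIES:
        raise ValueError(f"unrecognized currency: {currency!r}")
    return {
        "order_id": message["order_id"],
        "total": Decimal(str(total)),
        "currency": currency,
    }


def consume(messages: List[dict]) -> Tuple[List[dict], List[Tuple[dict, str]]]:
    """Split messages into parsed orders and a repair queue that records the reason."""
    parsed, repair_queue = [], []
    for message in messages:
        try:
            parsed.append(parse_order(message))
        except (KeyError, ValueError) as error:
            repair_queue.append((message, str(error)))
    return parsed, repair_queue
```

A message whose total arrives as "EUR 25.00" lands in the repair queue with its reason attached, so an operator can see it, instead of being coerced into a number.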

Replay Is A Contract Test, Not An Afterthought

The hardest consumer is often not the one currently deployed. It is tomorrow's incident tool reading last month's events. A replay can cross a schema migration, code deployment, and changed business rules all at once.

Before retiring M1 support, rehearse a small replay: select a stored M1 event, run it through the current consumers, and verify the resulting invoice or order state. Do the same for M2. The test should include an old message from a retry or dead-letter queue, because those are exactly the records that return when normal assumptions have failed.

Keep a written retirement date and an owner for that decision. Otherwise compatibility code tends either to disappear too early during cleanup or to stay indefinitely because nobody can prove the replay boundary has passed. The owner should review retained data, recovery tools, consumer metrics, and the documented meaning before approving removal.

replay check:
  M1 -> current consumer -> documented legacy outcome
  M2 -> current consumer -> current outcome
  duplicate delivery -> idempotent outcome
  malformed message -> visible dead-letter and repair evidence

This exposes an important boundary: schema compatibility does not make processing idempotent. A reader may parse the same order_created event twice and still create two invoices unless the consumer uses order_id or an event id to recognize prior work. Contracts need both an interpretation rule and a processing rule.
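
A processing rule of that kind can be sketched in a few lines of Python. InvoiceConsumer is a hypothetical class, and the in-memory dictionary stands in for the durable storage a real consumer would need.

```python
from typing import Dict


class InvoiceConsumer:
    """Hypothetical consumer that creates at most one invoice per order_id."""

    def __init__(self) -> None:
        # In a real service this would be durable storage, not process memory.
        self.invoices: Dict[str, dict] = {}

    def handle(self, event: dict) -> bool:
        """Process an order_created event; return True only if new work was done."""
        order_id = event["order_id"]
        if order_id in self.invoices:
            return False  # duplicate delivery or replay: already invoiced
        self.invoices[order_id] = {"order_id": order_id, "amount": event["total"]}
        return True


consumer = InvoiceConsumer()
event = {"order_id": "order-42", "total": 25.00}
print(consumer.handle(event))  # True, invoice created
print(consumer.handle(event))  # False, duplicate ignored
print(len(consumer.invoices))  # 1
```

The parse step is identical for both deliveries. Only the lookup on order_id prevents the second invoice.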

The operational signals become release gates. If legacy fallback is still used, if dead-letter volume rises after the producer deploys, or if a replay cannot explain an old record, the rollout is incomplete. The right response is to pause the writer change or extend compatibility, not to declare the old messages irrelevant because the current service looks healthy.
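
The three signals named above can be expressed as a simple gate. This Python sketch is one possible encoding, not a prescribed tool: RolloutSignals, its field names, and blocking_reasons are hypothetical, and the thresholds (any fallback use, any rise in dead-letter volume) are the strictest reading of the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RolloutSignals:
    legacy_fallback_uses: int  # times the reader took the v1 path in the window
    dead_letter_before: int    # dead-letter count before the producer deploy
    dead_letter_after: int     # dead-letter count after the producer deploy
    unexplained_replays: int   # replayed records with no documented outcome


def blocking_reasons(signals: RolloutSignals) -> List[str]:
    """Return the reasons the rollout is incomplete; an empty list means proceed."""
    reasons = []
    if signals.legacy_fallback_uses > 0:
        reasons.append("legacy fallback still in use")
    if signals.dead_letter_after > signals.dead_letter_before:
        reasons.append("dead-letter volume rose after producer deploy")
    if signals.unexplained_replays > 0:
        reasons.append("replay could not explain an old record")
    return reasons
```

A non-empty result is a reason to pause the writer change or extend compatibility, and the returned strings give the rollout owner something concrete to investigate.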

This evidence also gives support and finance teams a concrete answer when a historical order must be reconstructed months after the original deployment.

Failure Modes And Trade-offs

The representative failure is an old consumer rejecting a new message and stopping a pipeline. The message is valid to the new producer, but not survivable for the mixed fleet.

A quieter failure is a new consumer assuming a field is always present while v1 messages can still arrive. The rollout may work in staging, then fail in production when an older producer, a retry, or a delayed queue delivers a v1 message.

Another failure is semantic drift. Two teams use the same field name but no longer mean the same thing. These bugs can pass parsers, tests, and dashboards while corrupting business behavior.

The trade-off is discipline versus speed. Compatibility slows some refactors, keeps deprecated fields alive longer, and forces rollout planning. The payoff is that services can deploy independently without turning every contract change into an all-at-once event.

Practice Prompt

Pick one message or API payload: order_created, user_updated, payment_authorized, inventory_reserved, or password_changed. Fill in these lines:

message:
current required fields:
current optional fields:
proposed change:
old consumer reading new message:
new consumer reading old message:
queued or replayed message risk:
semantic meaning that must not change silently:
safe rollout order:

If you cannot say what old consumers will do with the new shape, the rollout is not ready.
