Serialization, Schemas, and Protocol Choices

Lesson 002 | 30 min | intermediate

Core Insight

Imagine the same learning platform from the previous lesson sending a progress update to two places. One copy goes to a mobile client so the learner can see that a lesson is complete. Another copy goes into an internal event stream so analytics, recommendations, and certificates can react later. Both paths move bytes, but the real risk is not the movement. The real risk is that different systems disagree about what those bytes mean.

Serialization is the moment a local object becomes a cross-boundary contract. Inside one process, percent_complete may be an integer in memory. Across a network, it becomes an encoded field whose name, type, optionality, default behavior, and future meaning must survive independent deployments. A mobile app might be weeks behind the backend. An event consumer might replay messages written months ago.

That is why protocol choice is not just a performance decision. JSON, Protocol Buffers, and Avro each make a different trade-off between human readability, compact machine exchange, schema discipline, and long-term evolution. The useful question is not "which format is best?" It is "which boundary needs which kind of contract?"

Serialization As A Boundary Contract

Across a process boundary, the receiver cannot see the sender's in-memory object. It sees bytes. Serialization gives those bytes enough structure for the receiver to rebuild meaning.

For a lesson progress update, the message might need:

learner identifier
lesson identifier
completion percentage
update timestamp
source device or locale

The network can deliver the payload perfectly and the system can still fail if the receiver interprets the payload differently from the sender. Is percent_complete allowed to be null? Does every producer really write updated_at_ms as milliseconds since the Unix epoch, or do some older ones write seconds? If source_device is missing, does that mean "unknown", "web", or "old client"? These are contract questions, not transport questions.

local object
  -> encoded message
  -> network or log boundary
  -> decoded structure
  -> local meaning on the receiver
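
The flow above can be sketched in Python. This is a minimal illustration that assumes JSON as the encoding. The LessonProgress class, the encode and decode helpers, and the rule that a missing source_device decodes as "unknown" are hypothetical choices made for the example, not part of any real platform API.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class LessonProgress:
    learner_id: str
    lesson_id: str
    percent_complete: int                 # assumed contract: integer 0..100, never null
    updated_at_ms: int                    # assumed contract: milliseconds since the Unix epoch
    source_device: Optional[str] = None   # assumed contract: absent means "unknown"


def encode(progress: LessonProgress) -> bytes:
    """Local object -> encoded message (compact UTF-8 JSON); unset fields are omitted."""
    fields = {k: v for k, v in asdict(progress).items() if v is not None}
    return json.dumps(fields, separators=(",", ":")).encode("utf-8")


def _require_int(raw: dict, name: str, low: int, high: Optional[int] = None) -> int:
    value = raw.get(name)
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(value, int) or isinstance(value, bool):
        raise ValueError(f"{name} must be an integer, got {value!r}")
    if value < low or (high is not None and value > high):
        raise ValueError(f"{name} is out of range: {value}")
    return value


def decode(payload: bytes) -> LessonProgress:
    """Encoded message -> decoded structure -> local meaning, with the contract checked."""
    raw = json.loads(payload.decode("utf-8"))
    return LessonProgress(
        learner_id=str(raw["learner_id"]),
        lesson_id=str(raw["lesson_id"]),
        percent_complete=_require_int(raw, "percent_complete", 0, 100),
        updated_at_ms=_require_int(raw, "updated_at_ms", 0),
        source_device=raw.get("source_device", "unknown"),
    )


wire = encode(LessonProgress("learner-1", "lesson-9", 100, 1_700_000_000_000))
print(decode(wire))
```

The decoder is where the contract questions get an answer: a null percentage is rejected rather than guessed at, and the meaning of a missing source_device is written down in one place instead of being decided separately by each consumer.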

The trade-off is clarity versus flexibility. Loose message shapes feel easy early because they require little ceremony. Explicit schemas make change slower at first, but they reduce ambiguity once producers and consumers evolve independently.

Choosing The Format For The Boundary

JSON is often a good fit at public edges, admin APIs, debugging tools, and integration-heavy boundaries. Humans can inspect it, logs are readable, and nearly every language can produce and consume it. That operational convenience is real. The cost is that JSON by itself enforces nothing about required fields, unknown fields, or compatibility rules, and it leaves numeric precision to each parser: RFC 8259 warns that integers outside the range a double-precision float represents exactly may not survive the trip between implementations.
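
The numeric precision point is easy to reproduce. The sketch below assumes a hypothetical event_id field and simulates a receiver that stores every JSON number as a double (as JavaScript's JSON.parse does) by passing parse_int=float to Python's json.loads.

```python
import json

# 2**53 + 1 is the smallest positive integer a double cannot represent exactly.
big_id = 2**53 + 1
payload = json.dumps({"event_id": big_id})  # event_id is a made-up field name

# Python's own json module keeps integers exact.
exact = json.loads(payload)["event_id"]
print(exact == big_id)       # True

# A receiver that turns every number into a double silently changes the value.
lossy = json.loads(payload, parse_int=float)["event_id"]
print(int(lossy) == big_id)  # False

# One common mitigation: carry large integers as strings and parse them explicitly.
safe_payload = json.dumps({"event_id": str(big_id)})
safe = json.loads(safe_payload, parse_int=float)["event_id"]
print(int(safe) == big_id)   # True
```

Both sides parsed valid JSON; they simply disagreed about what the number meant, which is exactly the kind of gap a schema or a documented convention has to close.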

Schema-driven binary formats such as Protocol Buffers optimize a different shape of work. They are usually more compact, faster to parse, and friendlier to generated client/server code. In return, they ask the team to manage field numbers, defaults, unknown fields, and compatibility rules with more discipline, and the payload can no longer be read without the schema and tooling.
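
To make "more compact" concrete, here is a hand-rolled sketch of how a Protocol Buffers style encoding lays out a progress update. It is not the protobuf library. It covers only non-negative varints and strings, always writes every field (real proto3 encoders omit default values), and the helper names are invented for the example.

```python
import json


def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, high bit set on every byte except the last."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def varint_field(field_number: int, value: int) -> bytes:
    # Wire type 0 = varint. The tag packs the field number and the wire type together.
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def string_field(field_number: int, text: str) -> bytes:
    # Wire type 2 = length-delimited: tag, byte length, then the UTF-8 bytes.
    data = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(data)) + data


def encode_lesson_progress(learner_id: str, lesson_id: str,
                           percent_complete: int, updated_at_ms: int) -> bytes:
    return (
        string_field(1, learner_id)
        + string_field(2, lesson_id)
        + varint_field(3, percent_complete)
        + varint_field(4, updated_at_ms)
    )


binary = encode_lesson_progress("learner-1", "lesson-9", 100, 1_700_000_000_000)
as_json = json.dumps(
    {"learner_id": "learner-1", "lesson_id": "lesson-9",
     "percent_complete": 100, "updated_at_ms": 1_700_000_000_000},
    separators=(",", ":"),
).encode("utf-8")
print(len(binary), len(as_json))
```

The binary form is smaller mainly because field names never travel on the wire; only field numbers do. That is also why the numbers, not the names, become the part of the contract that can never change meaning.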

Avro and similar schema-oriented formats become attractive for event streams and data pipelines because messages may be stored, replayed, and read by consumers that did not exist when the producer wrote the event. In that setting, schema resolution and version history are part of the system's memory.
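
The idea behind schema resolution can be imitated in a few lines. The sketch below is a simplified stand-in for what Avro does when the writer's and reader's record schemas differ, not a use of the Avro library. The schemas are reduced to field names and defaults, and legacy_flag is a hypothetical field invented for the example.

```python
from typing import Any, Dict, List

NO_DEFAULT = object()  # sentinel: the reader schema declares no default for this field

# Fields the producer wrote months ago (writer schema).
WRITER_V1: List[str] = [
    "learner_id", "lesson_id", "percent_complete", "updated_at_ms", "legacy_flag",
]

# Fields today's consumer expects (reader schema), mapped to their defaults.
READER_V2: Dict[str, Any] = {
    "learner_id": NO_DEFAULT,
    "lesson_id": NO_DEFAULT,
    "percent_complete": NO_DEFAULT,
    "updated_at_ms": NO_DEFAULT,
    "completion_source": "unknown",  # added after the event was written
}


def resolve(stored: Dict[str, Any], writer_fields: List[str],
            reader_fields: Dict[str, Any]) -> Dict[str, Any]:
    """Project a stored record onto the reader's schema."""
    resolved: Dict[str, Any] = {}
    for name, default in reader_fields.items():
        if name in writer_fields:
            resolved[name] = stored[name]   # both sides know the field
        elif default is not NO_DEFAULT:
            resolved[name] = default        # reader-only field: use the reader's default
        else:
            raise ValueError(f"reader requires {name!r} but the writer never wrote it")
    return resolved                          # writer-only fields are dropped


old_event = {
    "learner_id": "learner-1", "lesson_id": "lesson-9",
    "percent_complete": 100, "updated_at_ms": 1_700_000_000_000,
    "legacy_flag": True,
}
print(resolve(old_event, WRITER_V1, READER_V2))
```

The reader needs the writer's schema to do this, which is why event pipelines usually keep every schema version that ever produced data rather than only the latest one.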

public API: readability and broad interoperability may dominate
internal RPC: compact messages and generated contracts may dominate
event stream: schema evolution and replay safety may dominate

The trade-off is not text versus binary in the abstract. It is human operability versus machine efficiency versus evolution safety. A mature system may use different formats at different boundaries for good reasons.

Schema Evolution Is Where The Contract Is Tested

The hardest serialization failures often arrive after the first release. The backend adds completion_source. The mobile app has not upgraded. A batch job replays last month's events. A new consumer reads an old message and assumes a missing field means something the producer never intended.

A schema is useful because it makes the rules for safe change explicit:

which fields may be added
which fields are required or optional
which defaults old readers should assume
how unknown fields are handled
which identifiers or field numbers must never be reused
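
Rules like these can be checked mechanically before a schema change ships. The following is a rough sketch of such a check, under simplifying assumptions: a schema is just a mapping from field number to name and type, any change to an existing number is treated as unsafe, and the source_device and completion_source fields at numbers 5 and 6 are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Schema = Dict[int, Tuple[str, str]]  # field number -> (name, type)


def check_compatibility(old: Schema, new: Schema, reserved: Set[int]) -> List[str]:
    """Return a list of problems; an empty list means the change passes these rules."""
    problems: List[str] = []
    for number, old_field in old.items():
        if number in new:
            if new[number] != old_field:
                problems.append(
                    f"field {number} changed from {old_field} to {new[number]}"
                )
        elif number not in reserved:
            problems.append(
                f"field {number} {old_field} was removed without reserving its number"
            )
    for number in new:
        if number in reserved:
            problems.append(f"field {number} reuses a reserved number")
    return problems


V1: Schema = {
    1: ("learner_id", "string"),
    2: ("lesson_id", "string"),
    3: ("percent_complete", "int32"),
    4: ("updated_at_ms", "int64"),
    5: ("source_device", "string"),
}

# Safe: retire field 5, reserve its number, add the new field under a fresh number.
v2_safe: Schema = {n: f for n, f in V1.items() if n != 5}
v2_safe[6] = ("completion_source", "string")
print(check_compatibility(V1, v2_safe, reserved={5}))

# Unsafe: give number 5 a new meaning.
v2_reuse: Schema = dict(V1)
v2_reuse[5] = ("completion_source", "string")
print(check_compatibility(V1, v2_reuse, reserved=set()))
```

A check like this is deliberately stricter than any one format requires. Its value is that it runs in review or CI, before old readers meet new bytes.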

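syntax = "proto3";

message LessonProgress {
  string learner_id = 1;        // stable learner identifier
  string lesson_id = 2;         // stable lesson identifier
  int32 percent_complete = 3;   // completion percentage
  int64 updated_at_ms = 4;      // milliseconds since the Unix epoch

  // Field number 5 is retired; reserving it stops a later field from reusing it.
  reserved 5;
}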

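The important part is not that this example uses Protocol Buffers. The important part is that a field number becomes part of history once messages escape into production. Reusing it later for a different meaning makes older readers silently decode the new value under the old field's name, and readers that are not deployed in lockstep have no way to notice. That is why the example reserves 5 instead of leaving it free.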
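
The runtime effect of reusing a number can be shown without any serialization library. In this sketch a message is modeled as a plain mapping from field number to value, the old reader is assumed to have known field 5 as source_device, and unknown numbers are simply ignored (real protobuf runtimes typically preserve them). All of these are illustrative assumptions.

```python
from typing import Dict

# What a reader built against the old schema believes each number means.
OLD_READER_FIELDS: Dict[int, str] = {
    1: "learner_id",
    2: "lesson_id",
    3: "percent_complete",
    4: "updated_at_ms",
    5: "source_device",
}


def decode_by_number(message: Dict[int, object], known: Dict[int, str]) -> Dict[str, object]:
    """Translate field numbers to names; numbers the reader does not know are ignored."""
    return {known[number]: value for number, value in message.items() if number in known}


base = {1: "learner-1", 2: "lesson-9", 3: 100, 4: 1_700_000_000_000}

# New producer reuses number 5 to mean completion_source.
reused = {**base, 5: "quiz_passed"}
# New producer retires 5 and puts completion_source at a fresh number instead.
fresh = {**base, 6: "quiz_passed"}

print(decode_by_number(reused, OLD_READER_FIELDS))  # source_device is now "quiz_passed"
print(decode_by_number(fresh, OLD_READER_FIELDS))   # no source_device; field 6 is ignored
```

In the first case the old reader raises no error at all; it records a device called "quiz_passed". In the second it loses nothing it ever understood.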

Event streams make this sharper because they separate systems by time as well as by network. A message can outlive the process, release, and team that produced it. In that world, schema evolution is not documentation polish. It is how the system keeps old facts understandable.

Common Design Mistakes

One mistake is choosing a format only from benchmark numbers. Payload size and parse speed matter, but so do debugging, integration, tooling, compatibility checks, and incident response. A slightly larger readable payload can be the right choice at an operational boundary. A compact schema-driven payload can be the right choice in high-volume internal traffic.

Another mistake is treating "the JSON shape is obvious" as a contract. It may be obvious when one service writes and one service reads. It stops being obvious once there are mobile clients, old deployments, new consumers, test fixtures, replays, and partial migrations.

A healthier review question is: who will read this message, how independently will they change, how long can the message live, and what kind of mistake would be expensive? Those answers should drive the protocol choice.

Resources

[DOC] Protocol Buffers Language Guide (proto3)
- Link: https://protobuf.dev/programming-guides/proto3/
- Focus: Study field numbering, unknown fields, defaults, and compatibility discipline.
[RFC] JSON Data Interchange Syntax (RFC 8259)
- Link: https://www.rfc-editor.org/rfc/rfc8259
- Focus: Revisit why JSON is broadly interoperable and human-friendly despite weaker built-in schema rules.
[DOC] Apache Avro Specification
- Link: https://avro.apache.org/docs/1.12.0/specification/
- Focus: Compare a schema-oriented format designed for long-lived data exchange and schema resolution.

Key Takeaways

Serialization preserves shared meaning across boundaries, not just bytes on the wire.
Format choice should follow the boundary: public API, internal RPC, event stream, or long-lived data pipeline.
Schema evolution is where the contract proves whether independently deployed systems can keep understanding each other.
The main trade-off is among readability, efficiency, tooling discipline, and compatibility safety; no single format maximizes all four at once.
