What Makes a System Distributed

LESSON

Distributed Systems Foundations

002 25 min beginner

What Makes a System Distributed

By the end of this lesson, you will be able to: - identify the participants, messages, and promise in a small system; - explain why a web application and a separate database already create distributed-systems concerns; - classify a user action by its owner, uncertainty, and failure mode.

Idea in one sentence: A system becomes distributed when a useful promise depends on independent parts that coordinate through messages.

Core Insight

Imagine a concert ticket site during a popular sale. A customer chooses two seats, presses Reserve, and expects the page to make one clear promise: either those seats are held for them for the next few minutes, or they are not. Behind that simple promise, several independent pieces may be involved. The browser sends a request to the web application. The web application asks a seat inventory service to hold the seats. A payment service may later authorize money. A notification worker may send a receipt. A database or cache may store the current hold.

The system is not distributed merely because it has many files, classes, or developers. It is distributed because useful behavior depends on independent participants communicating through messages. The customer-visible promise crosses a boundary where no single participant can inspect or change all relevant state directly.

That definition matters because it tells you when local-program intuition stops being enough. In local code, one function can often rely on shared memory, a single call stack, and one process failure boundary. In a distributed system, the pieces can run, pause, fail, retry, deploy, overload, and recover separately. They learn about each other through messages that can be delayed, duplicated, reordered, rejected, or never answered.

The practical question is not "How many machines are there?" The better question is: does this promise require independent parts to coordinate through evidence rather than shared local truth? Here, evidence means a signal such as a response, a durable record, a timeout, or a trace. It is not direct access to another participant's state. If the answer is yes, you are in distributed-systems territory, even if the system is small.

The Three-Part Test

A useful way to recognize a distributed system is to ask three questions.

First, are there independent participants? A participant is any part that can make progress, fail, deploy, queue work, or store state separately from another part. It might be a service, database, cache, message broker, browser, mobile app, worker, region, or third-party API. Independence is the important word. Two functions in the same process are usually not independent participants. A web service and its database are.

Second, does useful behavior cross a message boundary? A boundary appears when one participant cannot directly read or write another participant's memory and must send a request, event, command, replication record, or file. The message is the evidence carrier. It says, "Please reserve these seats," "I reserved them," "This payment was authorized," or "Here is version 81 of the record."

Third, is there a shared promise that depends on those messages? The promise might be "do not sell the same seat twice," "show the latest profile name," "deliver this notification at least once," "accept writes in one region while another region is down," or "keep the checkout page responsive under load." If there is no shared promise, several independent machines may exist near each other without creating much of a distributed design problem.

For the ticket site, the three answers are clear:

participants:
  browser, web app, seat inventory, payment, database, notification worker

message boundaries:
  reserve request, hold response, payment authorization, receipt event

shared promise:
  one customer should not lose or duplicate a seat hold because messages are slow

This test is stricter than counting servers and more useful than arguing about labels. A single laptop syncing notes with a cloud service is distributed. A monolith calling a separate database is distributed at the database boundary. A batch job that writes a file for another job to read later is distributed if the downstream behavior depends on that handoff. A large codebase running entirely inside one process may be complicated, but that complication is not automatically distributed-system complexity.

Pause and predict: Is a web application plus a separate PostgreSQL database already distributed?

Check: Yes. The web process and database can progress, slow down, commit, and fail independently. The database boundary creates message delay and uncertainty even if the rest of the application is a monolith.

So far: count independent boundaries and the promise that crosses them, not files, services, or machines. The next question is what those boundaries change about what a caller can safely know.

What Changes At The Boundary

The boundary changes what each participant can know and what it can safely promise.

Inside one process, a caller can often treat a function call as a direct action. That does not mean local code is always simple, but the call has a tight relationship to one memory space and one failure boundary. If the function returns, the caller has a result. If it throws, the caller has a local failure. If the process dies, both caller and callee usually die together.

Across a boundary, the caller sends a message and waits for evidence. The message may arrive after the caller's timeout. The receiver may act but lose the response. The response may arrive after the user has already retried. The receiver may be running a newer version that interprets one field differently. A queue may accept the message while workers fall behind.

The ticket site makes this concrete:

web -> inventory: hold seats A12,A13 for cart-91
inventory -> database: write hold expires_at=19:35:00
inventory -> web: hold confirmed
web -> browser: seats reserved

That is the happy path. Now put a slow network link between web and inventory:

time ------------------------------------------------------------------>
web:        send hold -------- waits -------- timeout -> shows checking
inventory:       receives -> writes hold -> sends confirmation
network:                                             confirmation delayed

web knows:       it did not receive a reply before its timeout
inventory knows: the hold for A12,A13 exists until 19:35:00

The inventory service may have done the correct thing. The web service may also have done the correct thing by not waiting forever. The user experience depends on whether the system has a state for the gap between them.

That gap is why distribution changes design decisions. The web service should not claim the seats are free merely because it timed out. It should not create a second unrelated hold if the user presses the button again. It should not tell the customer the seats are guaranteed unless it has enough evidence from the owner of the seat state.

The boundary has turned a simple-looking action into a question about evidence, ownership, and repair.

Pause and predict: The customer presses Reserve again after the timeout. Can the web service safely say that the first hold failed?

Check: No. It only knows that its wait ended. It should reuse the reservation identity, check with the inventory owner, or show a pending state instead of creating a second unrelated hold.

Worked Classification: Is This Distributed?

The label is less important than the reasoning, so classify systems by the pressure they create.

Consider four small designs.

Design A:
  one process
  in-memory seat map
  one thread handles each reservation

Design B:
  web process
  separate PostgreSQL database stores seat holds

Design C:
  web process
  inventory service
  payment provider
  notification queue

Design D:
  two regional inventory services
  asynchronous replication
  each region can keep selling during a network partition

Design A may have concurrency bugs, but if all relevant state lives inside one process, many distributed questions are absent. A function can inspect the in-memory map directly. If the process dies, the whole reservation system may die together. That is not automatically good, but it is a different failure shape.

Design B is already distributed at the web/database boundary. The database can be slow while the web process is healthy. A transaction can commit while the response is lost. A connection pool can exhaust. The web service has to decide what to show after a database timeout. This is a small distributed system, but small does not mean trivial.

Design C is more obviously distributed because the product promise crosses multiple owners. Inventory owns seats, payment owns authorization, the queue owns delivery progress, and the web process coordinates the user interaction. A failed notification should not undo a seat hold. A payment timeout should not silently sell the seat to someone else. Different facts have different owners.

Design D adds regional disagreement. Both regions may be healthy from their own point of view while unable to communicate with each other. If both can sell the same seat during a partition, the system must decide whether availability is worth the risk of conflict, or whether one region should refuse some work to protect the stronger promise.

The key point is that "distributed" is not a prestige badge. It is a warning label for a family of questions:

Those questions scale from a web app plus database to a multi-region platform.

The Central Mechanism: Evidence Moves, Not Truth

The central mechanism in a distributed system is message-mediated evidence. One participant does not receive another participant's truth directly. It receives a message, reads a record, observes a timeout, sees a queue offset, or reconstructs a trace. Those are pieces of evidence.

For the ticket example, the seat inventory service might store this local fact:

seat_hold:
  cart_id: cart-91
  seats: A12,A13
  status: held
  expires_at: 19:35:00
  version: 42

The web service might store this different local fact:

reservation_attempt:
  cart_id: cart-91
  request_sent_at: 19:30:00
  inventory_response: timed_out
  user_visible_state: checking

Neither record is the whole system. Together they tell a better story. The inventory record says the seats are held. The web record says the user interaction has not yet received confirmation. A reconciliation step, read-after-timeout check, or later response can connect those facts.

This is why state ownership matters. If inventory owns seat holds, the web service should treat the inventory record as authoritative for whether A12 and A13 are held. The web service may own the user interaction state, but it should not invent seat availability from its own timeout.

This also explains why observability is not a late add-on. If the system cannot connect cart-91, hold-request-17, the inventory version, the timeout, and the user-visible state, operators will struggle to answer the basic incident question: did the customer actually reserve the seats?

What Distribution Is Not

Distribution is easy to overgeneralize, so it helps to name a few boundaries.

It is not the same thing as architectural complexity. A single-process compiler, game engine, or spreadsheet can be enormously complex without being distributed in the sense this track studies. Its problems may involve algorithms, memory, concurrency, or performance, but not necessarily message-mediated coordination across independent participants.

It is not the same thing as microservices. Microservices are one style of building distributed systems, but a mobile app syncing with a server, a browser talking to an API, a database primary replicating to followers, and a queue feeding background workers are also distributed. The lesson applies whenever the promise crosses an independent boundary.

It is not always a reason to split things apart. Distribution can buy scale, resilience, locality, and team independence, but the trade-off is extra failure modes and operational evidence work. If a product can keep its promise with a simpler local design, that may be the better engineering choice.

It is also not automatically solved by using a database transaction, a queue, a load balancer, or a consensus system. Those mechanisms help with specific boundaries, but they do not remove the need to define the promise, ownership, timeout behavior, and repair path.

Failure Modes From Misclassifying The System

The first failure mode is treating a remote dependency as a local helper. A team may set no timeout, no retry policy, and no pending state because the call "should be fast." When the dependency slows down, request threads pile up and the user sees random failures.

The second failure mode is counting machines without finding the promise. A diagram may show six services, but nobody can answer which service owns "seat held" or what happens when payment succeeds after the web request times out. The diagram is distributed, but the design is not yet coherent.

The third failure mode is ignoring small distributed systems. A web app plus database can still have committed transactions with lost responses, stale reads from replicas, connection pool exhaustion, migration timing issues, and backup/restore boundaries. Waiting until "real scale" to learn distributed reasoning means learning it during incidents.

The fourth failure mode is splitting too early. A team may create five services for a workflow that still has one tightly coupled promise. Now every change requires cross-service coordination, every failure needs a trace, and every local decision has to account for remote state. Distribution added cost before it bought a clear benefit.

Design Check

Start with a guided answer. For Design B above, the participants are the web process and PostgreSQL; the message is the database request and response; PostgreSQL owns the official seat-hold record; and the promise is that a confirmed hold is not silently lost or duplicated. Treating the database call as local would make a lost response look like a failed hold even if the transaction committed.

Take a system you know and classify one user action. Use this checklist:

action:
participants:
messages:
state owner for the official fact:
local decisions each participant can make:
promise that crosses the boundary:
failure if the boundary is treated like local code:

Use the Design B answer as a rubric: name the independent participants, the evidence they exchange, the owner of the official fact, the user promise, and the unsafe conclusion a timeout or missing message could cause. If the action has participants, messages, and a shared promise, treat it as distributed enough to need explicit design. If it has many components but no shared promise across them, the main problem may be complexity, not distribution.

The next lesson narrows this definition to network boundaries, latency, and partial failure. Once you can spot the boundary, the next skill is understanding what a caller can and cannot conclude when the boundary becomes slow.

Resources

Key Takeaways

PREVIOUS The Distributed Systems Mindset NEXT Network Boundaries, Latency, and Partial Failure