Network Partitions and Failure Models

LESSON

Networking and Failure Models

004 30 min intermediate

Network Partitions and Failure Models

Core Insight

Imagine the learning platform running across three zones. The progress service in zone A can still talk to zone B, but not to zone C. The certificate service in C still answers some local reads. The load balancer sees intermittent health checks. No single machine has necessarily crashed, but different parts of the system now see different communication maps.

That is what makes partitions hard. The danger is not simply that "the network is down." The danger is that the system no longer shares one reliable view of who can talk to whom, which writes can be coordinated, and which answers are still authoritative. A timeout from A to C is only a local symptom. It might mean crash, delay, overload, packet loss, asymmetric routing, or a partition.

A failure model names the uncertainty the system is designed to survive. A protocol that assumes only crash failures can make different decisions from a protocol that must handle delayed messages, lost messages, or split reachability. Once the failure shape is clear, design choices that look mysterious become legible: quorum systems reject some writes, eventually consistent systems accept divergence, and user-facing services choose which answers are allowed to be stale.

The useful shift is to stop asking whether the system "handles failure" in the abstract. Ask what the system can still know under a specific failure model, and what it is allowed to do when that knowledge is incomplete.

Local Observation Is Not Global Truth

Suppose an API server in zone A calls the certificate service in zone C and times out. The API server knows one thing: it did not receive a useful answer before its deadline. It does not know whether C is down, slow, overloaded, unreachable only from A, or still reachable from B.

what A observes: no response from C before deadline

possible realities:
  C crashed
  C is slow
  A -> C traffic is dropped
  C -> A replies are dropped
  C is reachable from B but not A

This is why failure models are models of uncertainty, not just lists of broken components. They define which observations can be misleading. In an asynchronous distributed system, a slow message and a lost message can look the same to the receiver until more evidence arrives. A missing heartbeat is not proof of death. It is proof that this node has not heard from that peer within the expected window.

The practical consequence is language discipline. "Network issue" is too vague for design work. A useful incident review asks whether the system saw delay, loss, congestion, asymmetric reachability, a minority partition, a majority partition, or total isolation. Each shape makes different behavior safe.

The trade-off is simplicity versus realism. A simpler failure model is easier to reason about, but it can make the system overconfident. A richer model is harder to implement against, but it prevents local symptoms from being mistaken for global facts.

Quorum Preserves Authority By Refusing Some Work

Now make the progress service replicated. It has five nodes, and a write must be accepted by at least three before it is considered committed. During a partition, two nodes can still talk to each other but cannot reach the other three. Those two nodes may be healthy in the ordinary sense: their disks work, their processes run, and their local network is alive. They still should not accept globally authoritative writes.

5-node group
quorum = 3

partition A: 2 nodes
partition B: 3 nodes

2-node side: alive, but not authoritative for new commits
3-node side: enough overlap to commit

The rule is not about punishing the isolated side. It is about avoiding split authority. If both sides accepted writes as the single truth, the system could heal into two incompatible histories. Quorum rules make sure a committed decision overlaps with future committed decisions, so the system has a defensible path back to one history.

def can_commit(reachable_nodes, quorum_size):
    return len(reachable_nodes) >= quorum_size

The function is simple because it hides the important design point: local health is not the same as coordinated authority. A node can be alive and still not know enough to make a safe global decision.

The trade-off is availability versus coordination safety. Quorum systems keep stronger guarantees about committed state, but they intentionally reject some work during communication uncertainty. That rejection is a feature when the alternative is conflicting truth.

Staying Writable Means Owning Reconciliation

Not every subsystem should choose the same response. Suppose the platform tracks non-critical recommendation hints: which lessons a learner opened, which topics looked interesting, and which prompts were dismissed. During a partition, it may be acceptable for both sides to keep accepting these hints locally because losing responsiveness is worse than temporary divergence.

That choice moves the cost. Instead of paying coordination cost before every write, the system pays merge cost after connectivity returns.

partition starts
  -> side A accepts local updates
  -> side B accepts local updates
  -> histories diverge
partition heals
  -> system merges or resolves the histories

The merge rule must match the domain. For a set of completed lessons, union may be reasonable. For a profile field, the system may need a version, timestamp, or explicit user confirmation. For money, certificates, quotas, or authorization, "just merge later" may be unacceptable because the intermediate state is itself a promise.

This is the real meaning behind partition trade-offs. Remaining available is not free. It usually means accepting bounded staleness, divergent writes, conflict resolution, weaker immediate guarantees, or a narrower definition of what remains available.

The trade-off is continuity versus semantic complexity. You keep more user or workflow paths alive during the partition, but you must explain what those paths guarantee and how the system repairs disagreement later.

Common Design Mistakes

One mistake is treating a timeout as proof that a peer is dead. The previous lesson showed why a timeout only tells the caller that waiting ended. In partitioned systems, that distinction becomes structural: one node's failed call may say more about the path than about the peer.

Another mistake is treating partition tolerance as if it meant normal behavior continues. A system can preserve coordination by refusing writes, or preserve availability by accepting divergence and reconciliation work. It cannot pretend the partition imposed no cost.

A third mistake is applying one failure response to every subsystem. A certificate issuer, a recommendation hint stream, and a public lesson catalog do not need identical guarantees. The better design names which state is authoritative, which state may be stale, which state may diverge, and which operations should fail closed.

Connections

Timeouts and retries are local tools for ambiguous observations. Failure models widen the lens: they ask what many nodes can and cannot know at the same time. A timeout policy that is sensible for a single call can still be unsafe if the larger system interprets many timeouts as proof that leadership changed.

This lesson also opens the path to consensus and replication. Leader election, terms, quorum writes, read leases, and conflict-free data types are concrete mechanisms for surviving different communication failures while preserving different guarantees.

Resources

Key Takeaways

PREVIOUS Timeouts, Retries, and Backoff NEXT Health Checks, Load Balancing, and Traffic Steering