Distributed Testing, Simulation, and Deterministic Replay: Network Simulation, Latency, and Backpressure

008 30 min intermediate

Core Insight

A deterministic network simulator should model more than "message delivered" or "message dropped." Many distributed bugs appear in the middle: a message is delayed long enough to trigger a retry, a queue fills and applies backpressure, a response arrives after a timeout, or one traffic class is blocked while another keeps moving.

In CheckoutService, a duplicate payment capture might not require a full partition. It may require payment responses to be slow, idempotency replication to sit behind a busy queue, and client retry timers to fire while the system is still making progress. If the harness only supports instant delivery and total message loss, it will miss the slower failure shape that production actually sees.

Network simulation gives the harness a controlled transport with explicit latency, ordering, capacity, and delivery rules. The trade-off is fidelity versus tractability: richer network models reveal more realistic bugs, but they create more state, more enabled events, and more ways for a test to become difficult to explain.

What the Network Model Owns

The network model is the part of the harness that decides what happens to messages after a simulated component sends them.

It should usually own these decisions:

whether a message is accepted for transport
which queue the message enters
when the message becomes deliverable
whether the message is dropped, duplicated, reordered, or delayed
whether a sender receives backpressure, rejection, or no immediate signal
how link capacity, queue size, and partition rules affect delivery

That means the service node does not directly call another node. It hands a message to the simulated transport:

order-1 sends m42: replicate in-flight key k1 -> order-2
network accepts m42 into link(order-1, order-2)
scheduler later chooses: deliver m42, delay m42, drop m42, or hold m42

This preserves the replay boundary. The scheduler can reproduce not only whether the message arrived, but also when it arrived relative to timers, crashes, retries, and other messages.
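
The hand-off described above can be sketched in a few lines of Python. This is a minimal illustration, not the harness's actual API: `SimNetwork`, `Message`, `send`, `enabled`, and `apply` are hypothetical names, FIFO order per link is assumed, and "delay" and "hold" are collapsed into a single `hold` action for brevity.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    msg_id: str
    src: str
    dst: str
    payload: str


class SimNetwork:
    """Hypothetical transport: nodes hand messages here instead of calling peers."""

    def __init__(self):
        self.links = defaultdict(deque)  # (src, dst) -> queued messages
        self.log = []                    # replayable history of network events

    def send(self, msg):
        # The node's job ends here; the network owns what happens next.
        self.links[(msg.src, msg.dst)].append(msg)
        self.log.append(("accept", msg.msg_id))

    def enabled(self):
        # Head-of-line messages the scheduler may act on (FIFO per link assumed).
        return [queue[0] for queue in self.links.values() if queue]

    def apply(self, action, msg):
        # Record the scheduler's choice so a replay can repeat it exactly.
        queue = self.links[(msg.src, msg.dst)]
        if action not in ("deliver", "drop", "hold"):
            raise ValueError(f"unknown action: {action}")
        if not queue or queue[0] != msg:
            raise ValueError(f"{msg.msg_id} is not at the head of its link")
        if action != "hold":
            queue.popleft()
        self.log.append((action, msg.msg_id))
        return msg if action == "deliver" else None


net = SimNetwork()
m42 = Message("m42", "order-1", "order-2", "replicate in-flight key k1")
net.send(m42)
net.apply("hold", m42)                 # scheduler lets other events run first
delivered = net.apply("deliver", m42)  # later, the scheduler delivers it
```

The log, not the host runtime, is what a replay consumes: feeding the same sequence of actions back into `apply` reproduces the same history.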

Latency and Jitter

Latency is the delay between send and delivery. Jitter is variation in that delay. Both matter because distributed systems often turn time into control flow.

For a deterministic harness, latency should be represented as scheduled availability:

send m42 at step 100
network latency model marks m42 deliverable at step 117
scheduler may deliver m42 at or after step 117 if no rule blocks it

This is more useful than sleeping for seventeen milliseconds. A sleep depends on the host runtime. A deliverable step belongs to the simulation history.

Latency can be modeled in several ways:

fixed delay for simple causal tests
seeded random delay for broader exploration
per-link delay to represent regions or dependency boundaries
traffic-class delay for replication, client traffic, heartbeats, or payment calls
state-dependent delay when queues fill or links are saturated

The harness should record the chosen delay and why it was chosen. If a payment response was delayed because the payment link had a seeded 500-step tail latency, the replay should say that. If it was delayed because the queue was full behind replication traffic, the replay should say that instead.
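
A seeded per-link latency model that records both the chosen delay and its reason might look like the sketch below. The names `LatencyModel`, `DelayDecision`, and `schedule` are hypothetical, and the 20-80 step range is borrowed from the worked example later in this lesson. It covers only the seeded per-link case, not state-dependent delay.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DelayDecision:
    msg_id: str
    sent_at: int
    deliverable_at: int
    reason: str


class LatencyModel:
    """Hypothetical per-link latency model driven by a seeded generator."""

    def __init__(self, seed, ranges, default=(1, 1)):
        self.rng = random.Random(seed)  # same seed -> same delays on replay
        self.ranges = ranges            # (src, dst) -> (min_steps, max_steps)
        self.default = default
        self.decisions = []             # recorded for the replay log

    def schedule(self, msg_id, src, dst, now):
        low, high = self.ranges.get((src, dst), self.default)
        delay = self.rng.randint(low, high)  # inclusive on both ends
        reason = f"seeded delay {delay} on link {src}->{dst} (range {low}-{high})"
        decision = DelayDecision(msg_id, now, now + delay, reason)
        self.decisions.append(decision)
        return decision


def run(seed):
    model = LatencyModel(seed, {("order-1", "order-2"): (20, 80)})
    return model.schedule("m42", "order-1", "order-2", now=100)


first = run(seed=7)
replayed = run(seed=7)  # identical decision, including the recorded reason
```

Because the delay is expressed in simulation steps and drawn from a seeded generator, the decision belongs to the simulation history rather than to the host clock.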

Queues and Backpressure

Backpressure is the system's signal that downstream capacity is constrained. It may appear as a full queue, a refused send, a slow write, a retry-after response, increased latency, or a caller blocked waiting for capacity.

Backpressure bugs are common because they are not pure failures. The system is still alive, but progress becomes uneven.

In a simulated network, a link can have capacity:

link order-1 -> order-2:
  max_queue: 3 messages
  delivery_rate: 1 message per scheduler turn

queued:
  m40 heartbeat
  m41 replicate order-100
  m42 replicate in-flight key k1

If order-1 tries to send m43, the model must choose a behavior:

accept and grow the queue
block the sender
reject the send
drop an old message
drop the new message
apply extra latency

Those are different systems. A real service built on TCP backpressure behaves differently from a UDP-like unreliable channel, and a message broker with bounded queues behaves differently from both. The network model should match the failure model and the system contract.
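
To make the difference concrete, here is a hedged sketch of a bounded link with a pluggable overflow policy. `BoundedLink` and `Overflow` are hypothetical names, and the sixth option (applying extra latency) is left out to keep the example short. The point is that the sender observes a different outcome, and the queue ends in a different state, under each policy.

```python
from collections import deque
from enum import Enum


class Overflow(Enum):
    GROW = "accept and grow the queue"
    BLOCK = "block the sender"
    REJECT = "reject the send"
    DROP_OLDEST = "drop an old message"
    DROP_NEWEST = "drop the new message"


class BoundedLink:
    def __init__(self, max_queue, policy):
        self.max_queue = max_queue
        self.policy = policy
        self.queue = deque()
        self.log = []  # replay log distinguishes every outcome

    def send(self, msg_id):
        """Return what the sender observes: 'accepted', 'blocked', or 'rejected'."""
        if len(self.queue) < self.max_queue or self.policy is Overflow.GROW:
            self.queue.append(msg_id)
            outcome = "accepted"
        elif self.policy is Overflow.BLOCK:
            outcome = "blocked"    # sender must wait until capacity frees up
        elif self.policy is Overflow.REJECT:
            outcome = "rejected"   # sender gets an immediate error
        elif self.policy is Overflow.DROP_OLDEST:
            self.log.append(("drop", self.queue.popleft()))
            self.queue.append(msg_id)
            outcome = "accepted"
        else:                      # DROP_NEWEST: discarded, sender sees success
            self.log.append(("drop", msg_id))
            outcome = "accepted"
        self.log.append((outcome, msg_id))
        return outcome


def try_m43(policy):
    link = BoundedLink(max_queue=3, policy=policy)
    for queued in ("m40", "m41", "m42"):
        link.send(queued)
    return link.send("m43"), list(link.queue)


results = {policy.name: try_m43(policy) for policy in Overflow}
```

Note that `DROP_OLDEST` and `DROP_NEWEST` both report "accepted" to the sender; only the log reveals which message was lost, which is why the log has to record drops separately from send outcomes.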

Worked Example

Suppose CheckoutService has two order replicas and a payment adapter. The team wants to test whether client retries can outrun replication when the network is congested.

The simulated network starts with:

links:
- order-1 -> order-2 replication: max_queue 3, delivery_rate slow
- order-1 -> payment-adapter: max_queue 5, delivery_rate normal
- client-A -> order replicas: max_queue 5, delivery_rate normal

latency:
- replication messages: seeded delay 20-80 steps
- payment responses: seeded delay 5-200 steps
- client retry timer: fires after 50 simulated steps

A failing replay might look like this:

step 1  client-A -> order-1: confirm(order-101, k1)
step 2  order-1 -> payment-adapter: capture(order-101, k1)
step 3  order-1 -> order-2: replicate in-flight k1
step 4  network queues replication behind m38 and m39
step 5  payment response latency chosen: 120 steps
step 50 client-A retry timer becomes ready
step 51 client-A -> order-2: retry confirm(order-101, k1)
step 52 order-2 has not received in-flight k1
step 53 order-2 -> payment-adapter: capture(order-101, k1)

No message was permanently lost. No replica crashed. The bug appears because latency and queueing let the retry outrun the replication signal. That is a different failure from lesson 7, and it is closer to many production incidents where systems are slow rather than cleanly partitioned.

The oracle still checks the same invariant: one external capture per idempotency key. The network model supplies the causal explanation.
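
The replay and its oracle can be reduced to a small executable sketch. Everything here is a simplification under stated assumptions: `run_checkout` and `oracle` are hypothetical names, the replication arrival step is passed in directly rather than derived from queue state, and a replica suppresses a capture only if it already knows the key is in flight.

```python
from collections import Counter


def run_checkout(replication_arrives_at, retry_at=51):
    """Replay a simplified confirm/retry history; return captures per key.

    Hypothetical model: order-2 skips the capture only if the in-flight
    marker for k1 was delivered before the retry reaches it.
    """
    events = [
        (1, "confirm", "order-1"),                         # client-A -> order-1
        (replication_arrives_at, "replicate", "order-2"),  # order-1 -> order-2
        (retry_at, "confirm", "order-2"),                  # client-A retry
    ]
    in_flight = {"order-1": set(), "order-2": set()}
    captures = Counter()
    trace = []
    for step, kind, node in sorted(events):
        if kind == "replicate":
            in_flight[node].add("k1")
            trace.append((step, f"{node} learns k1 is in flight"))
        elif "k1" in in_flight[node]:
            trace.append((step, f"{node} suppresses duplicate capture for k1"))
        else:
            in_flight[node].add("k1")
            captures["k1"] += 1
            trace.append((step, f"{node} -> payment-adapter: capture(order-101, k1)"))
    return captures, trace


def oracle(captures):
    """Invariant from the lesson: at most one external capture per key."""
    return all(count <= 1 for count in captures.values())


# Assumed arrival steps: 90 models a congested replication link, 10 a healthy one.
congested, congested_trace = run_checkout(replication_arrives_at=90)
healthy, healthy_trace = run_checkout(replication_arrives_at=10)
```

The two runs differ only in when the replication message becomes visible to order-2, yet only the congested run violates the invariant. No drop or crash is involved.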

Delivery Semantics

The network simulator should make delivery semantics explicit. A few examples:

Semantics	Meaning
FIFO per link	messages from A to B arrive in send order
Non-FIFO	messages from A to B may reorder
At-most-once	a message is delivered zero or one time
At-least-once	a message may be duplicated but should eventually arrive unless blocked
Bounded queue	sends may block, fail, or drop when capacity is exhausted
Traffic classes	replication, heartbeat, client, and dependency traffic can behave differently

These semantics are part of the contract between the harness and the test. If a protocol assumes FIFO links but the simulator reorders messages, the failures it reports may be invalid: they describe histories the real transport could never produce. If production uses a broker that can redeliver messages, an at-most-once simulator may hide duplicate-handling bugs.
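
One way to make the FIFO versus non-FIFO distinction explicit is to have the link declare which queued messages the scheduler is allowed to pick next. The sketch below assumes a hypothetical `deliverable` function and a seeded scheduler; it is an illustration of the idea rather than a prescribed design.

```python
import random


def deliverable(queue, fifo):
    """Messages the scheduler may deliver next from one link's queue."""
    if not queue:
        return []
    # FIFO exposes only the head; non-FIFO exposes every queued message.
    return [queue[0]] if fifo else list(queue)


def drain(queue, fifo, seed):
    """Deliver everything, letting a seeded scheduler pick among candidates."""
    rng = random.Random(seed)
    pending, order = list(queue), []
    while pending:
        choice = rng.choice(deliverable(pending, fifo))
        pending.remove(choice)
        order.append(choice)
    return order


sent = ["m40", "m41", "m42"]
fifo_order = drain(sent, fifo=True, seed=3)    # always the send order
loose_order = drain(sent, fifo=False, seed=3)  # some permutation, fixed by the seed
```

Keeping the ordering rule in one function means a test can state which semantics it assumes, and a reordering failure can be traced to an explicit, replayable choice.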

Backpressure semantics are especially important. A blocked send is observable to the sender. A delayed message may not be. A dropped message might trigger a timeout later. A rejected send may return an error immediately. The service can make different decisions in each case, so the replay log must distinguish them.

Common Network Simulation Mistakes

One mistake is modeling the network as either perfect or fully broken. Many important histories live between those extremes: slow delivery, asymmetric reachability, queue saturation, and delayed responses.

Another mistake is using wall-clock sleeps as a proxy for network latency. Sleeps make tests slow and flaky, and they do not record the simulated reason for a delay.

A third mistake is forgetting backpressure. If every send succeeds instantly, the harness cannot test overload paths, retry storms, queue growth, load shedding, or degraded-but-alive behavior.

The last mistake is treating all messages the same. Heartbeats, replication updates, client requests, and external dependency calls often have different routing, priority, reliability, and timeout behavior. A useful simulator can represent those differences when they matter.

Practice

Take one workflow from the track so far and define a network model for it:

Which links exist between clients, replicas, and dependencies?
Which links are FIFO and which can reorder?
Which queues are bounded?
What happens when a queue is full?
Which delays should be fixed, seeded random, or state-dependent?

Then write one replay where no message is lost but latency and backpressure still violate an invariant. If you cannot produce one, your failure model may be too binary.

Connections

Builds on Fault Injection for Crashes, Partitions, and Message Loss by treating network behavior as a richer controlled environment, not only a source of drops and partitions.
Prepares for State Model Checking and Randomized Exploration, where network queues and delays become part of the explored state space.
Connects to production reliability because many distributed failures begin as slow queues, retry storms, tail latency, and overloaded dependencies rather than hard outages.

Resources

[PAPER] The Tail at Scale
[BOOK] Designing Data-Intensive Applications
[DOC] Jepsen
[PAPER] FoundationDB: A Distributed Unbundled Transactional Key Value Store

Key Takeaways

A deterministic network simulator controls latency, delivery, ordering, capacity, and backpressure as replayable events.
Slow and congested networks can expose bugs even when no message is permanently lost.
Backpressure behavior must be explicit because blocking, dropping, delaying, and rejecting sends create different histories.
Richer network models improve realism but increase the state space the harness must explore and explain.
