Distributed Testing, Simulation, and Deterministic Replay: State Model Checking and Randomized Exploration

LESSON

Distributed Testing, Simulation, and Deterministic Replay

009 30 min intermediate

Distributed Testing, Simulation, and Deterministic Replay: State Model Checking and Randomized Exploration

Core Insight

A deterministic simulator gives you a controlled world, but control by itself does not decide which histories to run. In CheckoutService, there may be ten messages in network queues, three retry timers ready, one replica that can crash, and a client that can send the next request. The bug may appear only if the scheduler picks a specific chain: delay replication, fire the retry timer, deliver the retry to the stale replica, then deliver the original response later.

State model checking is the discipline of treating those choices as a search problem. The harness represents a state, lists the actions that are currently enabled, applies one action, checks invariants, and repeats. Randomized exploration uses the same machinery but samples paths instead of trying to enumerate all of them.

The non-obvious part is that randomness is useful only when it is reproducible. A random scheduler that cannot replay a failure is just another source of flakiness. A seeded randomized explorer, by contrast, can cover surprising interleavings quickly and still hand you the exact path that violated the invariant.

The State Space

A state is the complete simulated situation at one point in the run. For a distributed test harness, it is usually not just application data.

It may include:

local state for each node
durable state and volatile state, if the system distinguishes them
network queues and deliverable messages
timers and scheduled callbacks
crash, restart, and partition status
client requests that have not finished
external dependency responses that are pending
the random seed or schedule decisions that led here

An action is one step the scheduler can choose from that state:

deliver message m42 to order-2
fire retry timer t17
crash order-1
restart order-3
client-A sends confirm(order-101, key k1)
advance simulated clock to the next timer

An enabled action is legal right now. If a message is still blocked by latency, delivering it is not enabled. If a node is crashed, its application callback is not enabled. If a timer has not reached its simulated time, firing it is not enabled.

This matters because the search engine should not invent impossible histories. It should explore surprising histories that the modeled system could actually experience.

Model Checking as Search

A simple model checker repeatedly does four things:

Record the current state.
Generate the currently enabled actions.
Explore one or more actions from that state.
Check invariants after each transition.

In pseudocode:

search(state):
  assert invariants_hold(state)

  if depth_limit_reached(state):
    return

  for action in enabled_actions(state):
    next_state = apply(state, action)
    search(next_state)

That outline is small, but the engineering problem is large because distributed systems create huge state spaces. Five pending messages can be delivered in many orders. Add timers, crashes, restarts, partitions, and client retries, and exhaustive exploration can become impossible very quickly.

The goal is not always to prove the system correct for every conceivable execution. Often the goal is to put systematic pressure on the histories that ordinary tests rarely hit: stale reads, duplicate retries, leadership churn, delayed responses, saturated queues, and recovery races.

Exhaustive, Bounded, and Randomized Exploration

Exhaustive model checking tries every enabled action at each state. It is powerful when the model is small enough.

state S0 has enabled actions:
- deliver replication m1
- fire client retry t1
- crash order-1

exhaustive search explores all three branches

The benefit is completeness inside the chosen bounds. If the harness explores all schedules up to depth 20 with two replicas and two clients, and the invariant never fails, you know something useful about that bounded world.

The cost is state explosion. A realistic simulator may have too many nodes, messages, timers, and data values for exhaustive search to finish.

Bounded model checking accepts that limit explicitly. It may explore only:

a fixed number of clients
a fixed number of replicas
a maximum search depth
a limited number of crashes
a small key space
a short workload prefix

Those bounds are not a weakness if they are honest. Many severe bugs appear in small cases. A duplicate external charge may require only two replicas, one idempotency key, one retry, and one delayed replication message.

Randomized exploration samples action choices instead of enumerating every branch. It can run thousands of short schedules and discover failures that a human would not write by hand.

seed 91827 chooses:
1. accept client request on order-1
2. delay replication m1
3. fire retry timer
4. route retry to order-2
5. deliver payment response for first capture
6. deliver payment response for second capture

The seed and decision log make the run reproducible. Without them, the failure is not a useful test artifact.

Worked Example

CheckoutService has two replicas and an invariant: for each idempotency key, the payment adapter may observe at most one successful capture.

The simulator state includes:

nodes:
- order-1: alive, local keys = {}
- order-2: alive, local keys = {}

network:
- order-1 -> order-2: empty replication queue
- order-2 -> order-1: empty replication queue
- orders -> payment-adapter: deliverable responses controlled by simulator

clients:
- client-A: ready to confirm(order-101, key k1)

timers:
- none yet

The first action is straightforward:

client-A sends confirm(order-101, k1) to order-1

After that, the enabled actions change. The system may send a payment request, enqueue an in-flight idempotency marker, or start a timeout. A deterministic harness records all of those effects as state transitions.

A model checker might discover this path:

step 1  client-A -> order-1: confirm(order-101, k1)
step 2  order-1 -> payment: capture(order-101, k1)
step 3  order-1 -> order-2: replicate in-flight k1
step 4  scheduler delays replication m1
step 5  scheduler fires client retry timer
step 6  client-A -> order-2: confirm(order-101, k1)
step 7  order-2 has not seen k1
step 8  order-2 -> payment: capture(order-101, k1)
step 9  invariant fails: two captures for k1

The failure is easier to reason about than a flaky integration test because the path is explicit. The replay does not merely say "duplicate capture happened." It says which action choices made the stale replica possible.

The same scenario can be explored in different modes:

Mode	What it does	Useful when
Exhaustive bounded search	tries all enabled actions up to a limit	the state space is small and confidence matters
Seeded random search	samples many schedules from a seed	the state space is large and unknown bugs matter
Guided search	prioritizes actions likely to expose races	the team has a hypothesis, such as delayed replication before retry
Replay search	reruns one recorded action sequence	a discovered failure needs debugging or regression coverage

These modes can share the same simulator. The difference is the policy that chooses actions.

Pruning and State Identity

A model checker needs a way to recognize states it has already explored. Otherwise, it may loop forever or repeat equivalent histories.

State identity is not always obvious. Suppose two messages have different internal IDs but the same payload, source, destination, and delivery status. Are those states the same? Sometimes yes. Suppose two client IDs are swapped but the outstanding requests are structurally identical. Are those states the same? Maybe, if the clients are symmetric.

Canonicalization turns equivalent states into a stable representation:

raw:
  m817: order-1 -> order-2 replicate k1
  m923: order-1 -> order-2 replicate k2

canonical:
  message[0]: order-A -> order-B replicate key[0]
  message[1]: order-A -> order-B replicate key[1]

Pruning skips work that is unlikely to add information:

do not explore two branches that differ only by irrelevant message IDs
do not keep exploring after an invariant has already failed
avoid timer advances that do not enable new behavior
collapse repeated no-op retries when the system state is unchanged
limit failures to combinations the failure model says are in scope

Pruning is a trade-off. Too little pruning makes the search unusable. Too much pruning can hide the bug. The right question is always: does this pruning preserve the behaviors the invariant cares about?

Randomness With Discipline

Random exploration should still be engineered like a test.

A useful random run records:

the seed
the initial model configuration
every action selected by the scheduler
every random value drawn by the model
the invariant checked after each transition
the final state or failing state

This is the difference between reproducible exploration and accidental chaos. If the run fails, the seed should be enough to replay the same schedule. If a later code change fixes the bug, the same seed becomes a regression test.

Many teams also keep a small corpus of interesting seeds:

seed 1044: delayed replication before retry
seed 2318: leader crash during commit acknowledgement
seed 8290: partition heals after stale read
seed 91827: payment response races retry

The corpus should not replace fresh exploration. It gives the project a memory of past failures while new random runs keep searching for new ones.

Common Failure Modes

One mistake is putting too much detail into the state. If the model includes irrelevant metrics, log counters, generated UUIDs, and timestamps that do not affect behavior, the search will treat equivalent states as different. The result is state explosion without better bug finding.

Another mistake is hiding important behavior outside the model. If the real system has retry timers but the simulator does not model timers, the checker cannot explore retry races. If the real system has bounded queues but the simulator makes every send instant, the checker cannot explore backpressure.

A third mistake is trusting random tests that do not replay. A failure that cannot be reproduced may still be a real bug, but it is expensive evidence. Deterministic replay turns random discovery into a debuggable artifact.

A fourth mistake is writing invariants that are too weak. "No node panics" is useful, but it will not catch duplicate external captures, stale successful reads, lost acknowledgements, or broken monotonicity. The checker is only as valuable as the properties it checks.

Practice

Take one replicated workflow and define a small search model for it:

List the state fields that affect correctness.
List the enabled actions the scheduler can choose.
Write one invariant that must hold after every transition.
Choose a small bound, such as two replicas, one key, one retry, and depth 20.
Decide which parts should be exhaustive and which should be randomized.

Then remove one state field from the model and ask what bug the checker can no longer find. That exercise is often more revealing than adding more features.

Connections

Builds on Network Simulation, Latency, and Backpressure, because queued messages and delayed timers become part of the searched state space.
Prepares for Property-Based Testing for Replicated Protocols, where generated operations and invariants add richer workloads to the same exploration machinery.
Connects to formal methods through the shared idea of checking states and transitions, even when the harness is executable test code rather than a standalone specification.

Resources

[BOOK] TLA+ Language and Tools for Hardware and Software Engineers
[PAPER] How Amazon Web Services Uses Formal Methods
[PAPER] FoundationDB: A Distributed Unbundled Transactional Key Value Store
[DOC] Stateright

Key Takeaways

State model checking treats scheduler choices as a search over states, enabled actions, transitions, and invariants.
Exhaustive exploration gives strong bounded confidence, while seeded randomized exploration covers more large spaces with reproducible failures.
State identity, pruning, and honest bounds determine whether the search is useful or overwhelmed.
Randomness becomes engineering evidence only when the seed and action log can replay the failing path.

← Back to Distributed Testing, Simulation, and Deterministic Replay

← Back to Distributed Systems

← Back to Learning Hub