Distributed Testing, Simulation, and Deterministic Replay: Property-Based Testing for Replicated Protocols

LESSON

Distributed Testing, Simulation, and Deterministic Replay

010 30 min intermediate

Distributed Testing, Simulation, and Deterministic Replay: Property-Based Testing for Replicated Protocols

Core Insight

In CartReplica, two data centers accept cart updates while the network is unreliable. A customer adds a laptop in region A, removes an old charger in region B, and refreshes the page while replication is still catching up. A hand-written test might check one tidy sequence, but the dangerous cases are usually strange combinations: concurrent adds, repeated removes, duplicate messages, delayed anti-entropy, and replicas that merge in different orders.

Property-based testing changes the question from "does this example pass?" to "does this property hold across many generated histories?" For a replicated protocol, a property might say that all replicas converge after every delivered update, that acknowledged writes are not lost, that counters never go negative, or that idempotency keys produce one external effect no matter how many retries occur.

The trade-off is precision versus search power. A generator can create far more histories than a person will write, but only if the properties are meaningful, the generated operations are realistic, and the failing case can be shrunk and replayed. Otherwise, property-based testing becomes a fast way to produce confusing failures.

From Examples to Properties

An example test fixes the inputs and the expected result:

1. replica A: add item laptop
2. deliver A -> B update
3. assert replica B contains laptop

That is still useful. It documents a behavior and guards a known path. The problem is that replicated protocols often fail between the obvious paths.

A property test defines a space of histories and a rule that must hold for every generated history inside that space:

for all generated histories:
  run the protocol in the simulator
  eventually deliver all non-dropped messages
  assert all live replicas expose the same cart state

The property does not predict each intermediate state. During a partition, replicas may disagree. The property states what must be true after the allowed uncertainty is resolved.

Good replicated-protocol properties usually separate three ideas:

Those properties are stronger than "no crash" and less brittle than expecting every replica to be identical at every step.

The Pieces of a Stateful Property Test

A stateful property test has four moving pieces.

The first is a generator. It creates operations such as client commands, message deliveries, crashes, restarts, partitions, and timer firings. In a replicated protocol test, the generator should produce concurrency and delay, not only neat serial sequences.

generated operations:
- client-1 add(cart-7, laptop) at replica A
- client-2 remove(cart-7, charger) at replica B
- delay delta A -> B
- duplicate delta B -> A
- restart replica A
- deliver all remaining messages

The second is the system under test. This may be the real protocol code wrapped in a simulator, a smaller executable model, or a protocol implementation with deterministic dependencies injected.

The third is a reference model or oracle. For some protocols, the oracle can be a simple mathematical model. For others, it is a set of invariants that must hold without computing one exact expected final state.

The fourth is a shrinker. When a generated history fails, the shrinker tries to remove irrelevant operations while preserving the failure. A 400-step failure is evidence. A 9-step failure is a debugging tool.

Worked Example

Suppose CartReplica uses operation-based replication. Each replica accepts cart operations locally and sends deltas to peers.

The protocol supports:

add(cart_id, item_id, operation_id)
remove(cart_id, item_id, operation_id)
merge(delta)

Each operation has a unique operation_id so duplicate messages can be ignored. The team wants remove-wins semantics: if an add and a remove for the same item are both known, the remove should win unless there is a later add.

A useful property is convergence after delivery:

property convergence_after_drain(history):
  cluster = new_simulated_cluster(replicas = 3)
  run_generated_history(cluster, history)

  drain_all_network_messages(cluster)

  visible = [replica.cart_state("cart-7") for replica in cluster.live_replicas()]
  assert all_equal(visible)

That property alone is not enough. Three replicas can converge to the wrong answer. Add a preservation property:

property acknowledged_adds_are_visible_unless_removed(history):
  model = reference_cart_model()
  cluster = new_simulated_cluster(replicas = 3)

  for step in history:
    result = cluster.apply(step)
    model.observe(step, result)

  drain_all_network_messages(cluster)

  for item in model.acknowledged_adds_without_later_remove():
    assert item in every_live_replica_cart(cluster, "cart-7")

Now the test checks both agreement and meaning. If all replicas agree on an empty cart after a successful add that was never removed, convergence passes but preservation fails.

The generator might create this history:

step 1  client-1 -> replica A: add(cart-7, laptop, op-1)
step 2  replica A acknowledges op-1
step 3  A -> B delta(op-1) is delayed
step 4  client-2 -> replica B: remove(cart-7, laptop, op-2)
step 5  replica B acknowledges op-2
step 6  B -> C delta(op-2) is delivered
step 7  C -> A anti-entropy sends op-2
step 8  A merges op-2 before B ever sees op-1
step 9  drain remaining messages

If the protocol records only "item removed" instead of the causal relationship between add and remove, it may delete a later or concurrent add incorrectly. The failure is not that the code cannot handle add or remove. The failure is that the merge rule lacks enough information to make the desired semantics deterministic.

Generator Design

The generator is part of the test design, not a neutral random input machine.

A weak generator creates many histories that do not stress the protocol:

add item
deliver immediately
remove item
deliver immediately
read item

A stronger generator creates histories around the protocol's risk:

The generator should also respect real preconditions. If the production API never allows an unauthenticated write, the generator should not spend most of its time generating impossible unauthenticated writes unless that is the behavior being tested. Impossible inputs can test validation logic, but they are usually noise for replicated-protocol correctness.

Bias matters. A purely uniform generator may rarely create the interesting case. For example, if item IDs are chosen from a huge space, two clients almost never touch the same item. The generator may need a small hot set:

item_id = choose(
  80% from ["laptop", "charger", "dock"],
  20% from random_new_item()
)

That bias is not cheating. It expresses the risk the test is meant to explore: conflicts, retries, and duplicate effects on shared state.

Shrinking and Replay

When a property fails, the first generated history is often too large:

seed 664102 produced 3 replicas, 5 clients, 117 operations, 46 messages, 4 restarts

Shrinking tries to preserve the failure while simplifying the history:

1. remove unrelated cart IDs
2. remove reads that do not affect state
3. reduce replicas from 3 to 2 if the bug remains
4. remove extra duplicate messages
5. reduce to one add, one remove, one delayed delivery

The final counterexample should still replay through the deterministic harness:

seed 664102
shrunk history:
1. A accepts add(laptop, op-1)
2. delay A -> B delta(op-1)
3. B accepts remove(laptop, op-2)
4. deliver B -> A delta(op-2)
5. drain network
6. invariant fails

This is where property-based testing and deterministic replay reinforce each other. The generator discovers the shape. The shrinker reduces it. The replay log makes it stable enough to debug and keep.

Common Failure Modes

One mistake is checking only final convergence. Agreement is important, but replicas can agree on a state that violates the product contract. Pair convergence with properties about acknowledged writes, idempotency, monotonicity, or causal visibility.

Another mistake is generating unrealistic histories. A generator that ignores real API preconditions, durability boundaries, or failure assumptions can produce failures that do not teach the team anything useful.

A third mistake is using randomness without replay and shrinking. A property test that finds a failure once but cannot reproduce it creates expensive uncertainty.

A fourth mistake is making the reference model as complex and bug-prone as the protocol. The oracle should be simpler than the system under test. If it cannot be simpler, use multiple smaller invariants instead of one giant expected-state function.

Practice

Design a property test for one replicated protocol:

  1. Name the operations clients can perform.
  2. Name the network and failure actions the simulator may generate.
  3. Write one convergence property.
  4. Write one safety or preservation property.
  5. Decide what the shrinker should remove first.

Then inspect the generator. If it almost never creates conflicting operations on the same key, delayed delivery before acknowledgement, or retries with the same idempotency token, it is probably under-testing the protocol's real risk.

Connections

Resources

Key Takeaways

PREVIOUS Distributed Testing, Simulation, and Deterministic Replay: State Model Checking and Randomized Exploration NEXT Distributed Testing, Simulation, and Deterministic Replay: Workload Modeling and Jepsen-Style Histories