Distributed Testing, Simulation, and Deterministic Replay: Property-Based Testing for Replicated Protocols
LESSON
Distributed Testing, Simulation, and Deterministic Replay: Property-Based Testing for Replicated Protocols
Core Insight
In CartReplica, two data centers accept cart updates while the network is unreliable. A customer adds a laptop in region A, removes an old charger in region B, and refreshes the page while replication is still catching up. A hand-written test might check one tidy sequence, but the dangerous cases are usually strange combinations: concurrent adds, repeated removes, duplicate messages, delayed anti-entropy, and replicas that merge in different orders.
Property-based testing changes the question from "does this example pass?" to "does this property hold across many generated histories?" For a replicated protocol, a property might say that all replicas converge after every delivered update, that acknowledged writes are not lost, that counters never go negative, or that idempotency keys produce one external effect no matter how many retries occur.
The trade-off is precision versus search power. A generator can create far more histories than a person will write, but only if the properties are meaningful, the generated operations are realistic, and the failing case can be shrunk and replayed. Otherwise, property-based testing becomes a fast way to produce confusing failures.
From Examples to Properties
An example test fixes the inputs and the expected result:
1. replica A: add item laptop
2. deliver A -> B update
3. assert replica B contains laptop
That is still useful. It documents a behavior and guards a known path. The problem is that replicated protocols often fail between the obvious paths.
A property test defines a space of histories and a rule that must hold for every generated history inside that space:
for all generated histories:
run the protocol in the simulator
eventually deliver all non-dropped messages
assert all live replicas expose the same cart state
The property does not predict each intermediate state. During a partition, replicas may disagree. The property states what must be true after the allowed uncertainty is resolved.
Good replicated-protocol properties usually separate three ideas:
- Safety: something bad never happens, such as two successful captures for one idempotency key.
- Convergence: replicas that receive the same durable updates eventually agree.
- Preservation: acknowledged facts are not silently lost during merge, retry, replay, or restart.
Those properties are stronger than "no crash" and less brittle than expecting every replica to be identical at every step.
The Pieces of a Stateful Property Test
A stateful property test has four moving pieces.
The first is a generator. It creates operations such as client commands, message deliveries, crashes, restarts, partitions, and timer firings. In a replicated protocol test, the generator should produce concurrency and delay, not only neat serial sequences.
generated operations:
- client-1 add(cart-7, laptop) at replica A
- client-2 remove(cart-7, charger) at replica B
- delay delta A -> B
- duplicate delta B -> A
- restart replica A
- deliver all remaining messages
The second is the system under test. This may be the real protocol code wrapped in a simulator, a smaller executable model, or a protocol implementation with deterministic dependencies injected.
The third is a reference model or oracle. For some protocols, the oracle can be a simple mathematical model. For others, it is a set of invariants that must hold without computing one exact expected final state.
The fourth is a shrinker. When a generated history fails, the shrinker tries to remove irrelevant operations while preserving the failure. A 400-step failure is evidence. A 9-step failure is a debugging tool.
Worked Example
Suppose CartReplica uses operation-based replication. Each replica accepts cart operations locally and sends deltas to peers.
The protocol supports:
add(cart_id, item_id, operation_id)
remove(cart_id, item_id, operation_id)
merge(delta)
Each operation has a unique operation_id so duplicate messages can be ignored. The team wants remove-wins semantics: if an add and a remove for the same item are both known, the remove should win unless there is a later add.
A useful property is convergence after delivery:
property convergence_after_drain(history):
cluster = new_simulated_cluster(replicas = 3)
run_generated_history(cluster, history)
drain_all_network_messages(cluster)
visible = [replica.cart_state("cart-7") for replica in cluster.live_replicas()]
assert all_equal(visible)
That property alone is not enough. Three replicas can converge to the wrong answer. Add a preservation property:
property acknowledged_adds_are_visible_unless_removed(history):
model = reference_cart_model()
cluster = new_simulated_cluster(replicas = 3)
for step in history:
result = cluster.apply(step)
model.observe(step, result)
drain_all_network_messages(cluster)
for item in model.acknowledged_adds_without_later_remove():
assert item in every_live_replica_cart(cluster, "cart-7")
Now the test checks both agreement and meaning. If all replicas agree on an empty cart after a successful add that was never removed, convergence passes but preservation fails.
The generator might create this history:
step 1 client-1 -> replica A: add(cart-7, laptop, op-1)
step 2 replica A acknowledges op-1
step 3 A -> B delta(op-1) is delayed
step 4 client-2 -> replica B: remove(cart-7, laptop, op-2)
step 5 replica B acknowledges op-2
step 6 B -> C delta(op-2) is delivered
step 7 C -> A anti-entropy sends op-2
step 8 A merges op-2 before B ever sees op-1
step 9 drain remaining messages
If the protocol records only "item removed" instead of the causal relationship between add and remove, it may delete a later or concurrent add incorrectly. The failure is not that the code cannot handle add or remove. The failure is that the merge rule lacks enough information to make the desired semantics deterministic.
Generator Design
The generator is part of the test design, not a neutral random input machine.
A weak generator creates many histories that do not stress the protocol:
add item
deliver immediately
remove item
deliver immediately
read item
A stronger generator creates histories around the protocol's risk:
- concurrent operations on the same key
- repeated operations with the same idempotency token
- duplicate and reordered messages
- partitions that isolate one replica while others continue
- restarts after local acknowledgement but before replication
- anti-entropy rounds that deliver old and new deltas together
- reads during lag and reads after full drain
The generator should also respect real preconditions. If the production API never allows an unauthenticated write, the generator should not spend most of its time generating impossible unauthenticated writes unless that is the behavior being tested. Impossible inputs can test validation logic, but they are usually noise for replicated-protocol correctness.
Bias matters. A purely uniform generator may rarely create the interesting case. For example, if item IDs are chosen from a huge space, two clients almost never touch the same item. The generator may need a small hot set:
item_id = choose(
80% from ["laptop", "charger", "dock"],
20% from random_new_item()
)
That bias is not cheating. It expresses the risk the test is meant to explore: conflicts, retries, and duplicate effects on shared state.
Shrinking and Replay
When a property fails, the first generated history is often too large:
seed 664102 produced 3 replicas, 5 clients, 117 operations, 46 messages, 4 restarts
Shrinking tries to preserve the failure while simplifying the history:
1. remove unrelated cart IDs
2. remove reads that do not affect state
3. reduce replicas from 3 to 2 if the bug remains
4. remove extra duplicate messages
5. reduce to one add, one remove, one delayed delivery
The final counterexample should still replay through the deterministic harness:
seed 664102
shrunk history:
1. A accepts add(laptop, op-1)
2. delay A -> B delta(op-1)
3. B accepts remove(laptop, op-2)
4. deliver B -> A delta(op-2)
5. drain network
6. invariant fails
This is where property-based testing and deterministic replay reinforce each other. The generator discovers the shape. The shrinker reduces it. The replay log makes it stable enough to debug and keep.
Common Failure Modes
One mistake is checking only final convergence. Agreement is important, but replicas can agree on a state that violates the product contract. Pair convergence with properties about acknowledged writes, idempotency, monotonicity, or causal visibility.
Another mistake is generating unrealistic histories. A generator that ignores real API preconditions, durability boundaries, or failure assumptions can produce failures that do not teach the team anything useful.
A third mistake is using randomness without replay and shrinking. A property test that finds a failure once but cannot reproduce it creates expensive uncertainty.
A fourth mistake is making the reference model as complex and bug-prone as the protocol. The oracle should be simpler than the system under test. If it cannot be simpler, use multiple smaller invariants instead of one giant expected-state function.
Practice
Design a property test for one replicated protocol:
- Name the operations clients can perform.
- Name the network and failure actions the simulator may generate.
- Write one convergence property.
- Write one safety or preservation property.
- Decide what the shrinker should remove first.
Then inspect the generator. If it almost never creates conflicting operations on the same key, delayed delivery before acknowledgement, or retries with the same idempotency token, it is probably under-testing the protocol's real risk.
Connections
- Builds on State Model Checking and Randomized Exploration, because property-based tests need the same replayable histories and scheduler decisions.
- Prepares for Workload Modeling and Jepsen-Style Histories, where generated operations become histories that can be analyzed against consistency claims.
- Connects to replication design because merge semantics, idempotency, and causal metadata determine which properties can actually hold.
Resources
- [PAPER] QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs
- [DOC] Hypothesis: Property-Based Testing for Python
- [DOC] PropEr: Property-Based Testing for Erlang
- [PAPER] A comprehensive study of Convergent and Commutative Replicated Data Types
Key Takeaways
- Property-based testing checks replicated protocols by generating many histories and asserting durable properties over them.
- Useful properties distinguish convergence, safety, and preservation instead of relying on one brittle expected output.
- Generator bias is part of the design: it should create the conflicts, delays, retries, and duplicate deliveries the protocol must survive.
- Shrinking and deterministic replay turn random discovery into a small, stable counterexample.
← Back to Distributed Testing, Simulation, and Deterministic Replay