Distributed Testing, Simulation, and Deterministic Replay: Shrinking, Delta Debugging, and Minimal Counterexamples

LESSON

Distributed Testing, Simulation, and Deterministic Replay

015 30 min intermediate

Distributed Testing, Simulation, and Deterministic Replay: Shrinking, Delta Debugging, and Minimal Counterexamples

Core Insight

In CheckoutService, a randomized simulation finds a duplicate payment capture after 600 scheduler steps, 12 clients, 4 replicas, 200 messages, 6 retries, a regional slowdown, and several unrelated cart updates. The failure is real, but the first replay is too large for a human to reason about. Debugging starts only when the harness can reduce that mess to the few events that still make the invariant fail.

Shrinking and delta debugging are reduction techniques. They repeatedly remove, simplify, or replace pieces of a failing input while checking whether the same failure still happens. In distributed testing, the "input" is often a replay record: client operations, network events, timer firings, crashes, dependency responses, and scheduler choices.

The trade-off is simplicity versus semantic preservation. A smaller counterexample is easier to understand, but a reduction that changes the bug is misleading. A good shrinker makes the failure smaller while preserving the relevant claim, invariant, and causal shape.

What a Counterexample Is

A counterexample is a concrete execution that violates a property.

For a distributed harness, it may include:

The first counterexample is rarely minimal:

seed: 91827
replicas: 4
clients: 12
steps: 600
messages: 200
faults: 9
failure: duplicate external capture for idempotency key k1

The useful counterexample might be:

replicas: 2
clients: 1
steps: 11
messages: 2
faults: 0 hard failures, 1 delayed replication message
failure: duplicate external capture for idempotency key k1

That smaller execution teaches the mechanism: retry outran replication. The large one only proves that something went wrong.

Shrinking Versus Delta Debugging

Shrinking usually comes from property-based testing. The generator knows how to simplify its own generated values. If it generated 12 clients, it can try 6. If it generated 20 operations, it can try a shorter sequence. If it generated an amount of 1837, it can try 1.

Delta debugging is a more general reduction strategy. It treats the failing artifact as chunks and repeatedly asks which chunks are necessary.

does the failure still happen without the first half of events?
does it still happen without these three clients?
does it still happen without this partition?
does it still happen if this latency is 50 steps instead of 500?

Both strategies depend on a replay predicate:

fails(candidate):
  replay candidate deterministically
  return invariant "one capture per idempotency key" fails

If fails(candidate) is flaky, shrinking becomes unreliable. The reducer may discard a necessary event because the bug happened to reproduce once without it, or keep a noisy event because the bug happened not to reproduce when it was removed. Deterministic replay is the foundation that makes reduction trustworthy.

Worked Example

Start with a replayed incident:

1   C1 confirm(order-1, k1) -> A
2   A records in-flight k1
3   A sends replication m1 -> B
4   network delays m1 by 500 steps
5   A sends capture(k1) -> payment
6   C2 updates unrelated cart
7   B restarts
8   C3 reads unrelated order
9   C1 retry timer fires
10  C1 retry confirm(order-1, k1) -> B
11  B has not seen m1
12  B sends capture(k1) -> payment
13  payment returns approved for both captures
14  invariant fails

The reducer can try removing chunks.

Remove unrelated operations:

remove step 6: unrelated cart update
remove step 8: unrelated order read
replay still fails

Remove the restart:

remove step 7: B restart
replay still fails

Reduce the latency:

delay m1 by 500 steps -> still fails
delay m1 by 100 steps -> still fails
delay m1 by 40 steps -> failure disappears

The minimal delay is not necessarily important as an exact number. The important relationship is:

replication delivery time > client retry timer

The final counterexample can express that relationship directly:

1  C confirm(order-1, k1) -> A
2  A records in-flight k1
3  A sends replication m1 -> B
4  scheduler holds m1
5  A sends capture(k1) -> payment
6  retry timer fires before m1 is delivered
7  C retry confirm(order-1, k1) -> B
8  B has not seen k1
9  B sends capture(k1) -> payment
10 payment records two captures for k1
11 invariant fails

That counterexample is small enough to become a regression test and a design review artifact.

Reduction Operations

A distributed shrinker needs operations that understand distributed structure.

Useful operations include:

These operations must preserve validity. If a message delivery remains after its send event was removed, the replay is invalid. If a retry remains after the original operation was removed, the scenario may no longer mean the same thing. If a node receives a response from a dependency call it never made, the reducer has created nonsense.

A reducer should check candidate validity before checking failure:

candidate = remove(events, chunk)

if not valid_history(candidate):
  reject candidate
else if fails(candidate):
  keep candidate
else:
  restore chunk

Validity checks make the reducer faster and protect the reader from fake counterexamples.

Preserving the Bug Identity

Not every smaller failing case is the same bug.

Suppose the original failure is duplicate capture because retry outran replication. A reducer might remove the replication message entirely and still get duplicate capture. That smaller case may describe a different bug: no replication was attempted. It is useful, but it is not the same counterexample.

The replay predicate should therefore check more than "some invariant failed." It can include a bug signature:

bug_signature:
- invariant: one capture per idempotency key fails
- first capture from replica A
- second capture from replica B
- replication message m1 was sent before retry
- retry reached B before m1 was delivered

The signature does not need to overfit every detail. It should preserve the causal claim the team is debugging.

This is especially important when multiple bugs exist. A reducer can accidentally switch from one failure to another. The final counterexample should still answer the same engineering question.

When Minimal Is Too Minimal

The smallest possible counterexample is not always the best teaching artifact.

A one-line sequence such as this may technically fail:

B sends capture(k1) without checking idempotency

But it may hide the distributed pressure. The team needs to see that the second capture was caused by stale state during retry, not by an arbitrary malformed call.

A good counterexample is minimal enough to understand and complete enough to preserve the mechanism. It should keep:

Everything else is negotiable.

Common Failure Modes

One mistake is reducing without deterministic replay. If the failure is flaky, reduction results are guesses.

Another mistake is shrinking syntax instead of semantics. Removing lines from a replay record is not enough; the remaining execution must still be a valid distributed history.

A third mistake is keeping too much because the reducer is afraid to change topology, keys, or timing. Many distributed bugs have a tiny core once unrelated clients and messages are removed.

A fourth mistake is accepting a smaller failure that is a different bug. The reducer should preserve the invariant and the causal signature that matter.

A fifth mistake is optimizing only for machine minimality. A counterexample that is one event shorter but much harder to read may be worse for debugging and design review.

Practice

Take a failing replay from lessons 13 or 14 and design a shrink plan:

  1. Name the invariant and bug signature.
  2. List the events that are probably essential.
  3. List the events that are likely noise.
  4. Choose three reduction operations, such as removing clients, reducing replicas, or shortening delay.
  5. Define validity checks for the candidate replay.
  6. Decide when the counterexample is small enough for a regression test.

Then compare two results: the smallest counterexample the reducer can find and the clearest counterexample a human would want in a design review. They are often close, but they are not always identical.

Connections

Resources

Key Takeaways

PREVIOUS Distributed Testing, Simulation, and Deterministic Replay: Replaying Production Incidents Without Recreating Production NEXT Distributed Testing, Simulation, and Deterministic Replay: Flakiness, Nondeterminism, and Test Stabilization