Distributed Testing, Simulation, and Deterministic Replay: Flakiness, Nondeterminism, and Test Stabilization

016 30 min intermediate

Core Insight

In CheckoutService, a regression test for duplicate payment capture fails on CI twice a week. The same test passes on a laptop, passes when rerun, and fails more often when the build machine is under load. Calling it "flaky" is accurate, but it is not a diagnosis. The test is exposing uncontrolled nondeterminism somewhere in the system, the harness, or the environment.

Distributed tests become flaky when the test result depends on choices the test does not control: wall-clock time, scheduler order, message delivery order, random seeds, background threads, live dependencies, retries, resource pressure, or cleanup from earlier runs. Stabilization means bringing the relevant choices under explicit control so that a failure can be reproduced, minimized, and trusted.

The trade-off is exploration versus reproducibility. A harness should explore many interleavings, latencies, and failure timings, but each failing run should collapse into a deterministic replay. If stabilization only makes tests pass by removing pressure, it has destroyed the signal. If exploration produces failures that cannot be replayed, it has created noise.

What Flakiness Means

A flaky distributed test is a test whose result changes without an intentional code or test change. The test may pass, fail, hang, time out, or produce different failure messages across runs.

The important question is not whether the failure is "real." The important question is which uncontrolled input changed:

same code
same test name
different outcome

therefore:
some unrecorded choice changed between runs

That choice may be in the product code. A race condition can depend on thread scheduling. A timeout bug can depend on host load. A replica bug can depend on message ordering.

The choice may be in the test harness. A random generator may not record its seed. A fake clock may coexist with a real clock. A simulated network may still use real timers. Cleanup may leak state between tests.

The choice may be in the environment. CI machines can run slower than laptops. DNS, object storage, queues, and feature flag services can behave differently between runs. Resource limits can change latency relationships.

Treat flakiness as an observability problem first. Before changing the assertion, capture enough evidence to know which source of nondeterminism moved.

Common Sources of Nondeterminism

Distributed systems naturally contain many sources of nondeterminism. The harness does not need to remove all of them, but it must decide which ones are controlled, which ones are sampled, and which ones are recorded.

Scheduler nondeterminism is variation in when tasks, fibers, actors, goroutines, or threads run.

task A checks idempotency table
task B applies replication message
task A sends external capture

If a different scheduler order changes the result, the test must either control scheduling or record enough scheduling decisions to replay the failure.
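One way to record scheduling decisions is sketched below. It assumes each task is written as a Python generator whose `yield` marks a scheduling point; `RecordingScheduler` and `make_tasks` are hypothetical names invented for this illustration, not part of any real harness.

```python
import random


class RecordingScheduler:
    """Hypothetical harness piece: picks the next task and records each pick."""

    def __init__(self, seed=None, replay=None):
        self.rng = random.Random(seed)
        self.replay = list(replay) if replay is not None else None
        self.choices = []  # the schedule actually taken, in order

    def run(self, tasks):
        # tasks maps a name to a generator; every `yield` is a scheduling point.
        runnable = dict(tasks)
        while runnable:
            if self.replay is not None:
                name = self.replay[len(self.choices)]  # follow the recording
            else:
                name = self.rng.choice(sorted(runnable))  # explore
            self.choices.append(name)
            try:
                next(runnable[name])
            except StopIteration:
                del runnable[name]


def make_tasks(log):
    table = set()  # idempotency keys visible to task A

    def task_a():
        seen = "k1" in table
        log.append(f"A checks idempotency table (seen={seen})")
        yield
        if not seen:
            log.append("A sends external capture")

    def task_b():
        table.add("k1")
        log.append("B applies replication message")
        yield

    return {"A": task_a(), "B": task_b()}


explored_log = []
explorer = RecordingScheduler(seed=7)
explorer.run(make_tasks(explored_log))

replayed_log = []
RecordingScheduler(replay=explorer.choices).run(make_tasks(replayed_log))
assert replayed_log == explored_log  # same choices, same history
```

The recorded `choices` list, not the seed, is what makes the second run identical. A replay list of `["A", "B", "A", "B"]` forces the order shown in the trace above.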

Time nondeterminism is variation in when timers fire, deadlines expire, leases renew, and retry loops run.

retry timer fires at 50 ms
replication arrives at 80 ms

If the test uses host time, load on the build machine can change those relationships. Simulated time makes the relationship explicit.
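A minimal sketch of a simulated clock, assuming the code under test can be handed a clock object instead of reading host time. `SimClock` and its methods are hypothetical names for this illustration.

```python
import heapq


class SimClock:
    """Logical clock: time moves only when the test calls advance()."""

    def __init__(self):
        self.now_ms = 0
        self._timers = []  # heap of (fire_at_ms, seq, label, callback)
        self._seq = 0      # tie-breaker so equal deadlines fire in insertion order

    def call_at(self, fire_at_ms, label, callback):
        heapq.heappush(self._timers, (fire_at_ms, self._seq, label, callback))
        self._seq += 1

    def advance(self, delta_ms):
        deadline = self.now_ms + delta_ms
        while self._timers and self._timers[0][0] <= deadline:
            fire_at, _, label, callback = heapq.heappop(self._timers)
            self.now_ms = fire_at
            callback(label)
        self.now_ms = deadline


events = []
clock = SimClock()
clock.call_at(80, "replication arrives", lambda label: events.append((clock.now_ms, label)))
clock.call_at(50, "retry timer fires", lambda label: events.append((clock.now_ms, label)))
clock.advance(100)
print(events)  # [(50, 'retry timer fires'), (80, 'replication arrives')]
```

Host load cannot reorder the two events here, because neither depends on how long the test process actually took to run.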

Network nondeterminism is variation in message delay, delivery, duplication, loss, and reordering. A test that uses real sockets may accidentally test the host network stack more than the distributed algorithm.
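As a rough sketch, an in-memory network can make each of those behaviors an explicit, logged decision. `SimNetwork` is a hypothetical class for illustration; a real harness would also route delivered messages into node handlers.

```python
class SimNetwork:
    """In-memory network: nothing moves until the test says so."""

    def __init__(self):
        self.in_flight = {}  # msg_id -> (dst, payload), held until acted on
        self.inboxes = {}    # dst -> payloads in delivery order
        self.log = []        # every network decision, in order
        self._next_id = 1

    def send(self, src, dst, payload):
        msg_id = f"m{self._next_id}"
        self._next_id += 1
        self.in_flight[msg_id] = (dst, payload)
        self.log.append(("send", msg_id, src, dst))
        return msg_id

    def deliver(self, msg_id):
        dst, payload = self.in_flight.pop(msg_id)
        self.inboxes.setdefault(dst, []).append(payload)
        self.log.append(("deliver", msg_id))

    def duplicate(self, msg_id):
        # Deliver a copy but keep the original in flight.
        dst, payload = self.in_flight[msg_id]
        self.inboxes.setdefault(dst, []).append(payload)
        self.log.append(("duplicate", msg_id))

    def drop(self, msg_id):
        del self.in_flight[msg_id]
        self.log.append(("drop", msg_id))


net = SimNetwork()
first = net.send("A", "B", "first")
second = net.send("A", "B", "second")
net.deliver(second)  # reorder on purpose
net.deliver(first)
print(net.inboxes["B"])  # ['second', 'first']
```

Because reordering is a call the test makes, `net.log` doubles as the message schedule that a replay needs.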

Randomness drives generated operations, fault choices, payloads, topology, and schedule exploration. It is useful, but only if the seed and the generated choices are recorded.
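A small sketch of seeded generation that returns the seed together with the generated choices. The operation shape (`client`, `op`, `key`) and the function name are assumptions made for this example.

```python
import random


def generate_scenario(seed, n_ops=5):
    """Generate client operations from a private, seeded generator."""
    rng = random.Random(seed)  # never the shared module-level random state
    ops = [
        {
            "client": rng.choice(["c1", "c2"]),
            "op": rng.choice(["confirm", "retry"]),
            "key": rng.choice(["k1", "k2"]),
        }
        for _ in range(n_ops)
    ]
    # Return the seed AND the generated operations so both can be stored.
    return {"seed": seed, "ops": ops}


scenario = generate_scenario(91827)
assert scenario == generate_scenario(91827)  # same seed, same scenario
```

Using a private `random.Random` instance matters: any other code that touches the module-level generator would otherwise shift the sequence without appearing in the record.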

Dependency nondeterminism comes from services outside the harness: databases, queues, payment adapters, object stores, authentication services, and feature flag systems. Live dependencies create hidden inputs and side effects.

State nondeterminism comes from leaked data, shared namespaces, reused ports, persistent queues, global caches, and cleanup races. These failures often look like product bugs until the same test behaves differently after a clean checkout.
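One possible guard against leaked state is a per-test disposable directory with an explicit leftover check. This is a sketch; `isolated_state` and the `.lock` naming convention are assumptions for illustration.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def isolated_state(prefix="checkout-test-"):
    """Yield a fresh directory, then fail if the test left a lock file behind."""
    with tempfile.TemporaryDirectory(prefix=prefix) as tmp:
        root = Path(tmp)
        yield root
        # Explicit cleanup check: runs only when the test body finished normally.
        leaked = sorted(p.name for p in root.rglob("*.lock"))
        if leaked:
            raise AssertionError(f"test leaked lock files: {leaked}")


with isolated_state() as root:
    (root / "queue.db").write_text("pending: []")
    first_root = root

with isolated_state() as root:
    # A second run starts empty and never sees the first run's files.
    assert list(root.iterdir()) == []

assert not first_root.exists()  # the first directory was removed on exit
```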

The Stabilization Workflow

Stabilization should make the failing behavior more explainable, not merely less frequent.

First, classify the failure.

does it fail with the same invariant?
does it fail at the same logical step?
does it hang, time out, or assert?
does rerun with the same seed reproduce it?
does local replay reproduce it?
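The same-seed question can be checked mechanically. The sketch below assumes a test can be called as a function of its seed and returns an outcome label; both example tests are toy stand-ins, not real harness code.

```python
import itertools
import random


def classify_rerun(run_test, seed, attempts=5):
    """Rerun one seed several times and report whether the outcome is stable."""
    outcomes = {run_test(seed) for _ in range(attempts)}
    label = "reproducible" if len(outcomes) == 1 else "uncontrolled-input"
    return label, outcomes


def controlled_test(seed):
    # Outcome depends only on the seed.
    return "fail" if random.Random(seed).random() < 0.5 else "pass"


_hidden_load = itertools.count()


def leaky_test(seed):
    # Stand-in for a hidden input such as host load: it changes on every call.
    return "fail" if next(_hidden_load) % 3 == 0 else "pass"


print(classify_rerun(controlled_test, 91827)[0])  # reproducible
print(classify_rerun(leaky_test, 91827)[0])       # uncontrolled-input
```

An "uncontrolled-input" result says the seed is not the whole story, which is the cue to look for inputs missing from the replay record.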

Second, preserve the evidence. Keep the seed, generated operations, fault plan, simulated time, message schedule, node logs, dependency responses, and invariant failure. Do this before adding sleeps or broad retries, because those changes can erase the original timing relationship.

Third, identify uncontrolled inputs. Ask what the test reads that is not included in the replay record.

host clock?
wall-clock sleep?
thread scheduler?
unseeded random source?
live dependency?
shared filesystem state?
network socket timing?

Fourth, move one input at a time under control. Replace host time with simulated time. Record the random seed. Stub a dependency. Pin a scheduler choice. Isolate state. Then rerun the same replay.

Fifth, decide whether the test is stable because it is controlled or stable because it is weaker. A test that stopped failing after increasing every timeout by 10x may still hide the bug. A test that replays the same failing schedule on demand has become stronger.

Worked Example

The flaky CI test says:

test: confirm_is_idempotent_during_replication_lag
failure: expected 1 payment capture for key k1, observed 2
frequency: about 1 in 80 runs
local rerun: usually passes

The first replay record is incomplete:

seed: 91827
clients: 2
replicas: A, B
fault: delay replication A -> B
assertion: one capture per idempotency key

That is not enough. The seed generated the operations, but the failure also depends on when the retry timer fires relative to replication delivery. The harness uses simulated network delay, but the client retry loop still uses the host clock.

On a slow CI machine, the relationship sometimes becomes:

1  C confirm(order-1, k1) -> A
2  A records in-flight k1
3  A sends replication m1 -> B
4  network holds m1
5  A sends capture(k1)
6  host-clock retry fires
7  C retry confirm(order-1, k1) -> B
8  B has not seen m1
9  B sends capture(k1)
10 invariant fails

On a faster local machine, the retry may not fire before the test advances the simulated network:

1  C confirm(order-1, k1) -> A
2  A sends replication m1 -> B
3  network delivers m1
4  retry fires later
5  B recognizes k1
6  invariant holds

The product behavior and the test harness are tangled together. The test intended to explore replication lag, but it accidentally let the host clock decide whether the retry happened inside the unsafe window.
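The two traces can be reduced to a toy model in which the event order is the only input. This is a deliberately simplified sketch of the scenario, not CheckoutService code; the event names are invented for the example.

```python
def run_schedule(schedule):
    """Run the confirm/retry scenario under an explicit event order."""
    seen = {"A": set(), "B": set()}  # idempotency keys known to each replica
    held = []                        # replication messages held by the network
    captures = []                    # calls that reached the payment adapter

    def confirm(replica, key):
        if key in seen[replica]:
            return  # duplicate recognized, no second capture
        seen[replica].add(key)
        peer = "B" if replica == "A" else "A"
        held.append((peer, key))     # replicate; the network holds the message
        captures.append((replica, key))

    for event in schedule:
        if event == "confirm@A":
            confirm("A", "k1")
        elif event == "retry@B":
            confirm("B", "k1")
        elif event == "deliver":
            while held:
                dst, key = held.pop(0)
                seen[dst].add(key)
        else:
            raise ValueError(f"unknown event: {event}")
    return captures


slow_ci = ["confirm@A", "retry@B", "deliver"]      # retry inside the lag window
fast_laptop = ["confirm@A", "deliver", "retry@B"]  # replication wins the race

print(len(run_schedule(slow_ci)))      # 2
print(len(run_schedule(fast_laptop)))  # 1
```

Once the order is an argument rather than a side effect of machine speed, the failing case runs identically on CI and on a laptop.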

The stabilization is not:

sleep(500 ms)
retry assertion three times
increase timeout to 30 seconds

Those changes might reduce failure frequency, but they do not make the causal relationship reproducible.

The stabilization is:

use simulated time for the retry timer
record timer firings in the replay log
advance time only through the deterministic scheduler
record message hold and delivery choices
make the payment adapter a deterministic stub
fail on the invariant, not on elapsed wall-clock duration

Now the failing replay can say:

seed: 91827
time: simulated
timer event: retry(k1) at logical step 6
network event: deliver m1 after logical step 10
dependency response: payment adapter accepts both captures
failure: duplicate capture for k1

That record can be replayed locally, fed to the shrinker from the previous lesson, and kept as a regression test.

Stabilization Tactics

Use seeded randomness, and record the seed with every failure. Record the generated scenario as well, because a seed reproduces the scenario only while the generator code stays the same. After the generator changes, the same seed can produce different operations.

Use simulated time for timers, leases, retries, deadlines, heartbeats, and backoff loops. Tests should advance logical time through the harness instead of sleeping on the host clock.

Use deterministic schedulers where possible. Actor systems, async runtimes, and simulation harnesses can often expose scheduling points. The harness can explore schedule choices during generation, record the choices it made, and reapply them during replay.

Stub or simulate external dependencies. A payment adapter, queue, cache, DNS lookup, or object store should return scripted responses when the property under test is not the dependency itself.
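A scripted stub might look like the sketch below. `ScriptedPaymentAdapter` and its `capture` signature are assumptions for illustration and do not describe any real payment SDK.

```python
class ScriptedPaymentAdapter:
    """Test double that returns scripted responses and logs every request."""

    def __init__(self, responses):
        self._responses = list(responses)
        self.requests = []  # becomes part of the replay record

    def capture(self, key, amount_cents):
        self.requests.append(("capture", key, amount_cents))
        if not self._responses:
            # An unscripted call is a test error, not something to guess at.
            raise AssertionError(f"unscripted capture call for {key}")
        return self._responses.pop(0)


adapter = ScriptedPaymentAdapter(["accepted", "accepted"])
adapter.capture("k1", 4200)
adapter.capture("k1", 4200)

captures_for_k1 = sum(1 for _, key, _ in adapter.requests if key == "k1")
print(captures_for_k1)  # 2, which violates "one capture per idempotency key"
```

The stub accepts both captures on purpose. Rejecting the second one inside the stub would hide the duplicate instead of exposing it to the invariant check.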

Isolate state. Use unique namespaces, disposable stores, fake credentials, temporary directories, and explicit cleanup checks. A test should not inherit a queue message, lock file, background process, or database row from a previous run.

Prefer condition-based waiting over fixed sleeps. A fixed sleep guesses at timing. A condition waits for a logical state, such as "replica B has applied message m1" or "all scheduled tasks are idle." In deterministic tests, even that wait should be driven by controlled time or scheduler progress.
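A condition-based wait driven by harness progress could be sketched as follows. `run_until` is a hypothetical helper; `step` stands for whatever advances the deterministic scheduler or simulated clock by one unit.

```python
from collections import deque


def run_until(condition, step, max_steps=1000):
    """Advance the harness one logical step at a time until condition() holds."""
    steps_taken = 0
    while not condition():
        if steps_taken >= max_steps:
            raise TimeoutError(f"condition not met after {max_steps} logical steps")
        step()
        steps_taken += 1
    return steps_taken


pending = deque(["m0", "m1", "m2"])  # messages waiting to be applied on replica B
applied = []


def apply_next():
    applied.append(pending.popleft())


steps = run_until(lambda: "m1" in applied, apply_next)
print(steps)  # 2
```

The bound is counted in logical steps, so a slow machine changes how long the wait takes in real time but not whether it succeeds.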

Record enough to replay. Useful replay records include:

random seed and generated scenario
client operations
simulated time advances
timer firings
message sends, holds, drops, and deliveries
node crashes and restarts
dependency requests and responses
scheduler choices
observed history
failing invariant
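One possible shape for such a record is a plain dataclass that round-trips through JSON. The field names below are assumptions chosen to mirror the list above, and the sample values come from the worked example.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ReplayRecord:
    seed: int
    operations: list = field(default_factory=list)
    # Time advances, timer firings, message and scheduler choices, in order.
    events: list = field(default_factory=list)
    dependency_responses: list = field(default_factory=list)
    failing_invariant: str = ""

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True, indent=2)

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))


record = ReplayRecord(
    seed=91827,
    operations=[{"client": "C", "op": "confirm", "order": "order-1", "key": "k1"}],
    events=[
        {"kind": "timer", "name": "retry(k1)", "at_step": 6},
        {"kind": "deliver", "message": "m1", "after_step": 10},
    ],
    dependency_responses=[{"capture": "k1", "response": "accepted"}] * 2,
    failing_invariant="one capture per idempotency key",
)

# The record survives a round trip, so it can be attached to a CI failure.
assert ReplayRecord.from_json(record.to_json()) == record
```

Keeping values to lists, dicts, strings, and numbers is what makes the round trip exact; tuples, for instance, would come back from JSON as lists.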

What Not To Stabilize Away

Some nondeterminism is useful. A randomized simulator should explore many histories. A fault injector should vary partitions, delays, crashes, and restarts. A scheduler should search for rare interleavings.

The goal is not to make every run identical. The goal is to make every interesting run explainable.

Bad stabilization removes the dangerous condition:

disable retries during tests
turn off replication lag
mock the storage layer so writes are instantly visible everywhere
replace concurrent workers with a single synchronous path

Those changes may make the test stable only because it no longer tests the distributed behavior.

Good stabilization keeps the pressure but controls it:

generate retry timing
hold replication messages deliberately
record the chosen schedule
stub the external payment effect
replay the same failing history exactly
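The controlled version can be sketched as an exploration loop that generates timings from a seed and returns the full schedule of the first failure. The `run` function is a toy stand-in for the system under test, and the millisecond ranges are arbitrary assumptions.

```python
import random


def run(retry_at_ms, deliver_at_ms):
    """Toy system: return the capture count for k1 given two simulated times."""
    # Replica B captures again only if the retry reaches it before replication.
    return 2 if retry_at_ms < deliver_at_ms else 1


def explore(seeds):
    """Try many generated schedules; return the first one that fails."""
    for seed in seeds:
        rng = random.Random(seed)
        schedule = {
            "seed": seed,
            "retry_at_ms": rng.randrange(10, 200),
            "deliver_at_ms": rng.randrange(10, 200),
        }
        if run(schedule["retry_at_ms"], schedule["deliver_at_ms"]) > 1:
            return schedule  # recorded choices, not just the seed
    return None


failing = explore(range(100))
if failing is not None:
    # Replay uses the recorded timings directly and fails the same way.
    assert run(failing["retry_at_ms"], failing["deliver_at_ms"]) == 2
```

Exploration stays random across seeds, while the returned schedule pins down one specific failing history.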

The distinction matters. Stable tests that cannot fail under realistic pressure create false confidence. Noisy tests that cannot replay failures waste engineering time. Strong distributed tests explore broadly and replay precisely.

Common Failure Modes

One mistake is adding sleeps until CI looks green. Sleeps make the test slower and may only shift the race window.

Another mistake is retrying failed assertions. Assertion retries can hide transient invariant violations that are exactly the bug the test was supposed to catch.
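The difference shows up in a toy history where a duplicate capture is later reconciled. Both helpers below are hypothetical; the history values are made up for illustration.

```python
def first_violation(history, invariant):
    """Check the invariant at every recorded step; return the first bad step."""
    for step, state in enumerate(history):
        if not invariant(state):
            return step
    return None


def retried_assertion(read_state, invariant, attempts=3):
    """Anti-pattern: pass if ANY of several reads satisfies the invariant."""
    return any(invariant(read_state()) for _ in range(attempts))


def at_most_one_capture(state):
    return state["k1"] <= 1


# Capture count for k1 over time: a duplicate at step 2, refunded by step 3.
history = [{"k1": 0}, {"k1": 1}, {"k1": 2}, {"k1": 1}]

late_reads = iter(history[2:])  # the assertion starts polling at step 2
print(retried_assertion(lambda: next(late_reads), at_most_one_capture))  # True
print(first_violation(history, at_most_one_capture))                     # 2
```

The retried assertion reports success even though the customer was charged twice at step 2. Checking the recorded history catches it.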

A third mistake is treating the random seed as a complete replay. If the runtime scheduler, host time, or dependency responses are not controlled, the same seed may generate the same operations but not the same execution.

A fourth mistake is stabilizing the harness by weakening the property. Changing "never duplicate a capture" into "eventually reconcile duplicate captures" may be a valid product decision, but it is not the same test.

A fifth mistake is keeping live dependencies in a deterministic test. A live dependency creates changing latency, state, rate limits, credentials, and side effects that the replay cannot own.

Practice

Take one flaky distributed test and write a stabilization plan.

What invariant failed, hung, or timed out?
Which inputs are already recorded?
Which inputs are still uncontrolled?
Which clock does the test use?
Which scheduler choices affect the result?
Which dependencies are live rather than scripted?
Which state can leak across runs?
What replay record would let another engineer reproduce the same failure?

Then make one change that increases control without reducing pressure. For example, move a retry timer onto simulated time, record message delivery choices, or replace a live dependency with a deterministic stub. Rerun the same failing scenario and check whether the failure is now reproducible.

Connections

Builds on Shrinking, Delta Debugging, and Minimal Counterexamples, because a shrinker needs a stable replay predicate before it can reduce a failure honestly.
Prepares for Simulation Fidelity, Model Drift, and False Confidence, where the next question is whether a stable simulation still represents the production behavior that matters.
Connects to reliability practice because teams must distinguish CI noise from uncontrolled system behavior before deciding whether to fix the product, the harness, or the environment.

Resources

[BOOK] Designing Data-Intensive Applications
[DOC] Jepsen Analyses
[ARTICLE] Google Testing Blog: Flaky Tests at Google and How We Mitigate Them
[PAPER] FoundationDB: A Distributed Unbundled Transactional Key Value Store

Key Takeaways

Flakiness in a distributed test means some relevant choice is uncontrolled or unrecorded.
Stabilization should preserve failure pressure while making seeds, clocks, schedules, dependencies, and state reproducible.
Adding sleeps, assertion retries, or weaker properties can hide the bug instead of making the test more trustworthy.
A strong harness explores nondeterministic executions broadly, then records enough detail to replay each important failure exactly.
