Distributed Testing, Simulation, and Deterministic Replay: Deterministic Simulation Harness Architecture

006 30 min intermediate

Core Insight

A deterministic simulation harness is not just a test runner with random failures. It is a small controlled world where time, network delivery, dependency responses, faults, and client workloads all pass through explicit interfaces the scheduler can choose and replay.

In CheckoutService, a duplicate payment bug might require a client timeout, a delayed idempotency replication message, a retry routed to another replica, and a payment dependency that accepts both calls. If the harness controls only the client workload but leaves timers, network, and payment calls to the host machine, the failing run may never replay. The architecture must put every relevant source of nondeterminism behind a deterministic boundary.

The non-obvious design pressure is that realism and control compete. A harness that simulates everything can be reproducible but may drift away from production behavior. A harness that uses real clocks, sockets, threads, and dependencies may be realistic but hard to replay. The trade-off is choosing which boundaries must be simulated, which can be adapted, and which are safe to leave outside the replay contract.

Harness Components

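A useful deterministic harness has a few distinct parts. They may live in one process or several processes, but the responsibilities should stay separate:

- workload driver: issues client operations against the simulated nodes
- scheduler: chooses which enabled event runs next
- simulated clock and timers: decide when timeouts and retries become ready
- simulated network: routes, delays, and delivers messages between nodes
- fault controller: offers crashes and other faults as schedulable events
- dependency adapters: deterministic stand-ins for external storage and APIs
- history recorder: records each decision and its visible effects
- oracle runner: checks the recorded history against the contract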

The shape is easier to see as a data path:

workload driver
      |
      v
  scheduler -----> simulated clock/timers
      |            simulated network
      |            fault controller
      v            dependency adapters
 simulated service nodes
      |
      v
history recorder -----> oracle runner
      |
      v
replay seed + decision trace

The important feature is not the diagram itself. It is the direction of control. The service under test should ask the harness for time, I/O, dependency responses, and delivery progress. If the service can secretly use host time, unmanaged threads, real sockets, or uncontrolled randomness, the harness has a hole in its replay boundary.
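A minimal sketch of that direction of control for time is shown here. The names `SimulatedClock`, `call_later`, `advance`, and `RetryingClient` are illustrative assumptions, not part of any particular harness; the point is only that the service receives a clock object instead of reading host time.

```python
import heapq
from typing import Callable, List, Tuple


class SimulatedClock:
    """Harness-owned clock: time moves only when the harness advances it."""

    def __init__(self) -> None:
        self._now = 0.0
        self._seq = 0  # tie-breaker: equal deadlines fire in creation order
        self._timers: List[Tuple[float, int, Callable[[], None]]] = []

    def now(self) -> float:
        return self._now

    def call_later(self, delay: float, callback: Callable[[], None]) -> None:
        heapq.heappush(self._timers, (self._now + delay, self._seq, callback))
        self._seq += 1

    def advance(self, seconds: float) -> None:
        """Move time forward and fire every timer that becomes due, in deadline order."""
        deadline = self._now + seconds
        while self._timers and self._timers[0][0] <= deadline:
            due, _, callback = heapq.heappop(self._timers)
            self._now = due
            callback()
        self._now = deadline


class RetryingClient:
    """Hypothetical service code: it is handed a clock and never calls time.time()."""

    def __init__(self, clock: SimulatedClock) -> None:
        self.clock = clock
        self.events: List[str] = []

    def confirm(self, order_id: str) -> None:
        self.events.append(f"{self.clock.now():.1f} send {order_id}")
        self.clock.call_later(
            2.0, lambda: self.events.append(f"{self.clock.now():.1f} retry {order_id}")
        )


clock = SimulatedClock()
client = RetryingClient(clock)
client.confirm("order-101")
clock.advance(1.0)  # retry timer is not due yet
clock.advance(1.5)  # retry timer fires at simulated t=2.0
print(client.events)  # ['0.0 send order-101', '2.0 retry order-101']
```

Because the retry fires only when the harness advances the clock, the run takes no wall-clock time and produces the same event list on every machine.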

Controlling Nondeterminism

Every deterministic harness starts by finding the places where the real runtime would make a choice without telling the test.

Common sources include:

Source | Harness boundary
Host clock | Simulated clock API
Timers and sleeps | Deterministic timer queue
Network delivery | Simulated transport
Random IDs and random choices | Seeded randomness service
External storage | Simulated or instrumented storage adapter
External APIs | Dependency fake or scripted adapter
Thread scheduling | Cooperative tasks or controlled executor
Process crash and restart | Snapshot, durable state, and restart hooks

The harness does not need to make the production service beautiful. It needs to expose enough seams for testing. For example, a payment client can keep its production interface while the test binds it to a deterministic adapter. The order service can still think it is sending a request; the harness decides when the response becomes deliverable and records that decision.
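The payment seam described above might look like the following sketch. `PaymentClient`, `DeterministicPaymentAdapter`, `deliver`, and `confirm_order` are hypothetical names chosen for illustration; a real service would have its own client interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Protocol, Tuple


class PaymentClient(Protocol):
    """Interface the order service codes against (assumed shape)."""

    def capture(self, order_id: str, key: str,
                on_result: Callable[[str], None]) -> None: ...


@dataclass
class DeterministicPaymentAdapter:
    """Test binding: holds each response until the harness releases it."""

    pending: List[Tuple[str, Callable[[str], None]]] = field(default_factory=list)
    decisions: List[str] = field(default_factory=list)

    def capture(self, order_id: str, key: str,
                on_result: Callable[[str], None]) -> None:
        # The service believes the request was sent; nothing is delivered yet.
        self.pending.append((f"capture({order_id}, {key})", on_result))

    def deliver(self, index: int, result: str) -> None:
        """Harness-side call: choose which pending response arrives, and record it."""
        label, on_result = self.pending.pop(index)
        self.decisions.append(f"deliver {label} -> {result}")
        on_result(result)


def confirm_order(payments: PaymentClient, order_id: str, key: str,
                  log: List[str]) -> None:
    """Stand-in for service code; it only knows the PaymentClient interface."""
    payments.capture(order_id, key, lambda r: log.append(f"{order_id}: {r}"))


adapter = DeterministicPaymentAdapter()
service_log: List[str] = []
confirm_order(adapter, "order-101", "k1", service_log)
assert service_log == []  # the response stays pending until the harness decides
adapter.deliver(0, "accepted")
print(service_log)        # ['order-101: accepted']
print(adapter.decisions)  # ['deliver capture(order-101, k1) -> accepted']
```

The service code is unchanged between production and test; only the object bound to the interface differs.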

This is why deterministic simulation often pressures architecture. Services that directly call time.now(), create unmanaged background threads, hide retries inside opaque clients, or talk to real dependencies during simulation are difficult to replay. The harness cannot schedule what it cannot see.
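The same idea applies to randomness. As a small sketch (the `SeededIds` class and `handle_request` function are assumptions made for this example), request IDs can come from a harness-owned generator rather than from the host:

```python
import random
from typing import List


class SeededIds:
    """Hypothetical ID service: the only source of 'random' IDs inside the simulation."""

    def __init__(self, seed: int) -> None:
        self._rng = random.Random(seed)  # isolated from the global random module

    def next_id(self) -> str:
        return f"req-{self._rng.getrandbits(32):08x}"


def handle_request(ids: SeededIds) -> str:
    # Production code might call uuid.uuid4() here; under simulation it asks the harness.
    return ids.next_id()


ids_a = SeededIds(42)
run_a: List[str] = [handle_request(ids_a) for _ in range(3)]
ids_b = SeededIds(42)
run_b: List[str] = [handle_request(ids_b) for _ in range(3)]
print(run_a == run_b)  # True: the same seed yields the same ID sequence
```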

Execution Loop

At runtime, the harness repeatedly asks what events are enabled, chooses one, applies it, records it, and checks whether the run should continue.

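seed = 481516                      # replay seed: the same seed must yield the same run
state = initial_cluster()

while not run_finished(state):
    # Gather every event that could legally happen next:
    # ready messages, due timers, and available faults.
    enabled = collect_enabled_events(state)
    # The choice depends only on the seed, the enabled set, and prior history.
    decision = scheduler.choose(seed, enabled, history)
    apply(decision, state)
    # Record the decision and its visible effects so the run can be replayed.
    history.record(decision, visible_effects(state))
    oracle.maybe_check(history)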

The enabled set is the bridge between architecture and scheduling. If the simulated network has three messages waiting, the timer queue has one retry ready, and the fault controller has a crash event available, all of those should appear as scheduler choices. The scheduler can then explore different interleavings as described in lesson 5.
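A runnable toy version of this loop is sketched below, under the assumption that events are plain strings and the scheduler is a seeded `random.Random`. The event names and the `run` function are made up for illustration.

```python
import random
from typing import List


def run(seed: int) -> List[str]:
    """Run one simulated schedule; every choice comes from the seeded RNG."""
    rng = random.Random(seed)  # private RNG, never the global one
    # Hypothetical pending work owned by three harness components.
    network = ["deliver msg-a", "deliver msg-b", "deliver msg-c"]
    timers = ["fire retry timer"]
    faults = ["crash order-1"]
    trace: List[str] = []

    while network or timers or faults:
        # The enabled set merges every component's ready events. Sorting keeps
        # the choice a function of the seed and state, not of insertion order.
        enabled = sorted(network + timers + faults)
        choice = rng.choice(enabled)
        for queue in (network, timers, faults):
            if choice in queue:
                queue.remove(choice)
                break
        trace.append(f"{choice} (from {len(enabled)} enabled)")
    return trace


first = run(481516)
second = run(481516)
print(first == second)  # True: same seed, same interleaving
for line in first:
    print(line)
```

A different seed may pick a different interleaving of the same five events, which is how the scheduler explores the space while each individual run stays replayable.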

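The history should record more than the chosen event name. A useful decision record includes the replay seed, the events that were enabled at that step, the event the scheduler chose, and the resulting dependency effects and client-visible outcomes.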

That record lets the team replay the run, explain the failure, and later shrink the schedule.
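One possible shape for such a record is sketched here. The field names (`step`, `seed`, `enabled`, `chosen`, `effects`) and the JSON-lines encoding are assumptions for illustration, not a prescribed format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class DecisionRecord:
    step: int
    seed: int
    enabled: List[str]   # every event the scheduler could have picked
    chosen: str          # the event it actually picked
    effects: List[str]   # dependency effects and client-visible outcomes


def to_jsonl(records: List[DecisionRecord]) -> str:
    """Serialize one record per line so a trace can be stored and diffed."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def from_jsonl(text: str) -> List[DecisionRecord]:
    return [DecisionRecord(**json.loads(line)) for line in text.splitlines()]


records = [
    DecisionRecord(
        step=4,
        seed=481516,
        enabled=["deliver payment response", "fire retry timer"],
        chosen="fire retry timer",
        effects=["client-A retries confirm(order-101, k1)"],
    )
]
restored = from_jsonl(to_jsonl(records))
print(restored == records)  # True: the trace survives a round trip
```

Keeping the enabled set next to the chosen event is what later makes shrinking possible: a shrinker can see which alternatives existed at each step.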

Worked Example

For CheckoutService, the harness might define these simulated components:

nodes:
- order-1, order-2
- inventory-1
- payment-adapter

controlled boundaries:
- client workload issues confirm(order_id, idempotency_key)
- simulated network routes replication and dependency messages
- simulated clock fires retry timers
- durable store adapter records committed order state
- payment adapter records accepted captures
- oracle checks no duplicate capture per idempotency key

A failing run can then be represented as architecture-level decisions:

step 1  workload: client-A confirm(order-101, k1) -> order-1
step 2  order-1: send capture(order-101, k1) -> payment-adapter
step 3  timer: client-A retry timer becomes ready
step 4  scheduler: fire retry timer before payment response delivery
step 5  workload: retry confirm(order-101, k1) -> order-2
step 6  network: delay idempotency replication from order-1 to order-2
step 7  order-2: send capture(order-101, k1) -> payment-adapter
step 8  payment-adapter: accept both captures
step 9  oracle: fail duplicate capture invariant

The architecture makes the failure reproducible because the timeout, retry, network delay, dependency response, and oracle evidence all pass through harness-owned components. The run is not a collection of lucky sleeps. It is a replayable schedule inside a controlled model.
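To make the replay idea concrete, the sketch below runs a decision trace against a deliberately tiny model. It is not CheckoutService; the `replay` function, the decision strings, and the per-node `seen` sets are assumptions that capture only the idempotency race.

```python
from collections import Counter
from typing import Dict, List, Set


def replay(schedule: List[str]) -> List[str]:
    """Apply a decision trace to a tiny model; return keys captured more than once."""
    seen: Dict[str, Set[str]] = {"order-1": set(), "order-2": set()}
    in_flight: List[str] = []  # idempotency keys still replicating to order-2
    captures: List[str] = []   # what the payment adapter accepted

    def confirm(node: str, key: str) -> None:
        if key in seen[node]:
            return              # node already knows the key: no second capture
        seen[node].add(key)
        captures.append(key)    # the adapter accepts every capture it receives
        if node == "order-1":
            in_flight.append(key)

    for decision in schedule:
        if decision == "confirm k1 -> order-1":
            confirm("order-1", "k1")
        elif decision == "retry k1 -> order-2":
            confirm("order-2", "k1")
        elif decision == "deliver replication":
            seen["order-2"].update(in_flight)
            in_flight.clear()
    # Oracle: no idempotency key may be captured more than once.
    return [key for key, count in Counter(captures).items() if count > 1]


failing = ["confirm k1 -> order-1", "retry k1 -> order-2", "deliver replication"]
passing = ["confirm k1 -> order-1", "deliver replication", "retry k1 -> order-2"]
print(replay(failing))  # ['k1']: retry ran before replication arrived
print(replay(passing))  # []: replication arrived first, retry was deduplicated
```

The two schedules contain the same events; only their order differs, and that order is exactly what the decision trace preserves.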

Fidelity and Control

The hardest architecture decision is how much to simulate. Pure simulation gives strong replay control, but it may omit production details such as kernel buffering, real storage latency, connection pooling, or thread contention. Black-box integration testing includes those details, but it can leave too much nondeterminism outside the harness.

Many teams use a layered approach:

No layer replaces the others. The deterministic harness is strongest when the question is "can this legal interleaving violate our contract?" It is weaker when the question is "does this TLS setting work with the real load balancer?" Naming that boundary keeps the test evidence honest.

Architecture Mistakes

One mistake is hiding nondeterminism inside helpers. If a helper uses real time, random numbers, or a real thread pool, the harness may pass most of the time and fail only when the host runtime happens to create the right race.

Another mistake is recording only service logs. Logs are useful for explanation, but replay needs scheduler decisions, enabled events, dependency effects, and client-visible outcomes.

A third mistake is building fakes that are too polite. A simulated dependency that always responds immediately cannot test timeout, retry, duplicate response, or partial failure behavior. A good fake is deterministic, but it should still support the failures in the model.
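A fake that stays deterministic while still misbehaving could be scripted per call, as in this sketch. The class name and the behavior labels (`ok`, `timeout_after_accept`, `duplicate_response`, `reject`) are invented for the example.

```python
from collections import deque
from typing import Deque, List


class ScriptedPaymentFake:
    """Deterministic fake whose behavior for each call is chosen by a script."""

    def __init__(self, script: List[str]) -> None:
        self._script: Deque[str] = deque(script)
        self.accepted: List[str] = []

    def capture(self, key: str) -> List[str]:
        """Return the responses the caller will observe for this call."""
        behavior = self._script.popleft() if self._script else "ok"
        if behavior == "ok":
            self.accepted.append(key)
            return ["accepted"]
        if behavior == "timeout_after_accept":
            self.accepted.append(key)  # side effect happened, caller never hears back
            return []
        if behavior == "duplicate_response":
            self.accepted.append(key)
            return ["accepted", "accepted"]
        if behavior == "reject":
            return ["rejected"]
        raise ValueError(f"unknown scripted behavior: {behavior}")


fake = ScriptedPaymentFake(["timeout_after_accept", "ok"])
first_try = fake.capture("k1")   # [] : the caller sees a timeout
second_try = fake.capture("k1")  # ['accepted'] : the retry succeeds
print(first_try, second_try, fake.accepted)  # [] ['accepted'] ['k1', 'k1']
```

The script makes the impolite behavior part of the test input, so a failing run can name exactly which call timed out.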

The last mistake is letting the harness become a second implementation of the system. The harness should model the environment and check contracts; it should not encode the same business logic as the service in a way that masks bugs or duplicates assumptions.

Practice

Sketch a deterministic harness for a queue worker, replicated cache, or checkout workflow. Name one component for each of these responsibilities:

  1. workload driver
  2. scheduler
  3. simulated network or dependency
  4. clock and timer control
  5. history recorder
  6. oracle

Then mark which production dependencies must be replaced by deterministic adapters and which can remain real in a separate integration test. The result should tell you where the replay boundary actually sits.
