Distributed Testing, Simulation, and Deterministic Replay: Deterministic Simulation Harness Architecture
Core Insight
A deterministic simulation harness is not just a test runner with random failures. It is a small controlled world where time, network delivery, dependency responses, faults, and client workloads all pass through explicit interfaces the scheduler can choose and replay.
In CheckoutService, a duplicate payment bug might require a client timeout, a delayed idempotency replication message, a retry routed to another replica, and a payment dependency that accepts both calls. If the harness controls only the client workload but leaves timers, network delivery, and payment calls to the host machine, a rerun of the failing test may never hit the same interleaving again. The architecture must put every relevant source of nondeterminism behind a deterministic boundary.
The non-obvious design pressure is that realism and control compete. A harness that simulates everything can be reproducible but may drift away from production behavior. A harness that uses real clocks, sockets, threads, and dependencies may be realistic but hard to replay. The trade-off is choosing which boundaries must be simulated, which can be adapted, and which are safe to leave outside the replay contract.
Harness Components
A useful deterministic harness has a few distinct parts. They may live in one process or several processes, but the responsibilities should stay separate:
- Workload driver: issues client operations, retries, cancellations, reads, and background load.
- Scheduler: decides which enabled event happens next.
- Simulated runtime: hosts the service nodes and exposes controlled time, network, and task execution.
- Network model: delivers, drops, duplicates, delays, reorders, or partitions messages.
- Clock and timer model: advances logical or simulated time and fires timers deterministically.
- Dependency adapters: model storage, queues, payment services, lock services, or external APIs.
- Fault controller: injects crashes, restarts, partitions, pauses, and recovery events.
- History recorder: records scheduler decisions, causal events, client-visible outcomes, and dependency effects.
- Oracle runner: checks the recorded history against invariants and allowed outcomes.
- Replay runner: reruns a seed and decision trace to reproduce a failure.
The shape is easier to see as a data path:
workload driver
    |
    v
scheduler -----> simulated clock/timers
    |            simulated network
    |            fault controller
    v            dependency adapters
simulated service nodes
    |
    v
history recorder -----> oracle runner
    |
    v
replay seed + decision trace
The important feature is not the diagram itself. It is the direction of control. The service under test should ask the harness for time, I/O, dependency responses, and delivery progress. If the service can secretly use host time, unmanaged threads, real sockets, or uncontrolled randomness, the harness has a hole in its replay boundary.
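One way to picture this direction of control is constructor injection: the node receives its clock and transport from the harness and never reaches for host facilities on its own. The sketch below is illustrative only. `Clock`, `Transport`, `SimClock`, `SimTransport`, and `OrderNode` are hypothetical names invented for this example, not part of any real framework.

```python
from dataclasses import dataclass, field
from typing import List, Protocol, Tuple


class Clock(Protocol):
    def now(self) -> int: ...


class Transport(Protocol):
    def send(self, dest: str, payload: str) -> None: ...


@dataclass
class SimClock:
    """Logical clock (hypothetical); only the harness changes `ticks`."""
    ticks: int = 0

    def now(self) -> int:
        return self.ticks


@dataclass
class SimTransport:
    """Queues outgoing messages; the scheduler decides when to deliver them."""
    pending: List[Tuple[str, str]] = field(default_factory=list)

    def send(self, dest: str, payload: str) -> None:
        self.pending.append((dest, payload))


class OrderNode:
    """Hypothetical service node that only sees harness-provided interfaces."""

    def __init__(self, name: str, clock: Clock, transport: Transport) -> None:
        self.name = name
        self.clock = clock
        self.transport = transport

    def confirm(self, order_id: str, key: str) -> None:
        stamp = self.clock.now()  # asks the harness, not the host clock
        self.transport.send("payment-adapter", f"capture {order_id} {key} at t={stamp}")


clock = SimClock()
net = SimTransport()
node = OrderNode("order-1", clock, net)
clock.ticks = 7                  # the harness, not the node, moves time
node.confirm("order-101", "k1")
print(net.pending)               # the message waits until the scheduler delivers it
```

In production the same `OrderNode` would be constructed with a real clock and a real transport; the node's code does not change between the two bindings.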
Controlling Nondeterminism
Every deterministic harness starts by finding the places where the real runtime would make a choice without telling the test.
Common sources include:
| Source | Harness Boundary |
|---|---|
| Host clock | Simulated clock API |
| Timers and sleeps | Deterministic timer queue |
| Network delivery | Simulated transport |
| Random IDs and random choices | Seeded randomness service |
| External storage | Simulated or instrumented storage adapter |
| External APIs | Dependency fake or scripted adapter |
| Thread scheduling | Cooperative tasks or controlled executor |
| Process crash and restart | Snapshot, durable state, and restart hooks |
The harness does not need to make the production service beautiful. It needs to expose enough seams for testing. For example, a payment client can keep its production interface while the test binds it to a deterministic adapter. The order service can still think it is sending a request; the harness decides when the response becomes deliverable and records that decision.
This is why deterministic simulation often pressures architecture. Services that directly call time.now(), create unmanaged background threads, hide retries inside opaque clients, or talk to real dependencies during simulation are difficult to replay. The harness cannot schedule what it cannot see.
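To make two of the boundaries from the table concrete, here is a minimal sketch of a deterministic timer queue driven by a seeded random source. `TimerQueue`, `run`, and the timer labels are hypothetical names chosen for illustration; a real harness would hold richer timer state.

```python
import heapq
import random
from typing import List, Optional, Tuple


class TimerQueue:
    """Deterministic timer queue: time only moves when the harness pops a timer."""

    def __init__(self) -> None:
        self.now = 0
        self._heap: List[Tuple[int, int, str]] = []
        self._seq = 0  # tie-breaker: equal deadlines fire in registration order

    def schedule(self, delay: int, label: str) -> None:
        heapq.heappush(self._heap, (self.now + delay, self._seq, label))
        self._seq += 1

    def advance_to_next(self) -> Optional[str]:
        """Jump simulated time to the earliest deadline and return its label."""
        if not self._heap:
            return None
        deadline, _, label = heapq.heappop(self._heap)
        self.now = deadline
        return label


def run(seed: int) -> List[Tuple[int, str]]:
    rng = random.Random(seed)  # seeded randomness service; never the global RNG
    timers = TimerQueue()
    for name in ("retry-A", "heartbeat", "lease-expiry"):
        timers.schedule(rng.randint(1, 100), name)
    fired = []
    while (label := timers.advance_to_next()) is not None:
        fired.append((timers.now, label))
    return fired


# The same seed yields the same firing order and timestamps on every run.
assert run(42) == run(42)
```

Because nothing here sleeps or reads the host clock, a run that takes simulated minutes finishes immediately and repeats exactly.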
Execution Loop
At runtime, the harness repeatedly asks what events are enabled, chooses one, applies it, records it, and checks whether the run should continue.
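seed = 481516
state = initial_cluster()
while not run_finished(state):
    # Gather every deliverable message, ready timer, and available fault.
    enabled = collect_enabled_events(state)
    # The choice depends only on the seed, the enabled set, and prior history.
    decision = scheduler.choose(seed, enabled, history)
    apply(decision, state)
    history.record(decision, visible_effects(state))
    oracle.maybe_check(history)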
The enabled set is the bridge between architecture and scheduling. If the simulated network has three messages waiting, the timer queue has one retry ready, and the fault controller has a crash event available, all of those should appear as scheduler choices. The scheduler can then explore different interleavings as described in lesson 5.
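A runnable toy version of that loop can make the mechanics concrete. This is a simplified sketch under stated assumptions: events are plain strings, applying an event only removes it from the pending set (a real harness would also enable new events), and the scheduler is a seeded random choice. The function name `run` and the event labels are hypothetical.

```python
import random
from typing import Dict, List


def run(seed: int, initial_events: List[str]) -> List[Dict]:
    rng = random.Random(seed)
    pending = list(initial_events)  # stands in for messages, timers, and faults
    history: List[Dict] = []
    step = 0
    while pending:
        # Sort so the enabled set has a canonical order; otherwise the same
        # random index could select different events on different runs.
        enabled = sorted(pending)
        chosen = enabled[rng.randrange(len(enabled))]
        pending.remove(chosen)      # "apply" the event in this toy model
        history.append({"step": step, "enabled": enabled, "chosen": chosen})
        step += 1
    return history


events = ["deliver:m1", "deliver:m2", "deliver:m3", "timer:retry", "fault:crash-order-2"]
first = run(481516, events)
second = run(481516, events)
assert first == second              # same seed, same interleaving
for record in first:
    print(record["step"], record["chosen"])
```

The canonical ordering of the enabled set is the detail that is easiest to miss: a seed reproduces a schedule only if the list it indexes into is built the same way every time.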
The history should record more than the chosen event name. A useful decision record includes:
- the seed and scheduler strategy
- the enabled events at the decision point
- the selected event
- logical time or scheduler step
- affected node, client, message, timer, or dependency call
- visible output produced by applying the event
- durable state boundary when recovery behavior matters
That record lets the team replay the run, explain the failure, and later shrink the schedule.
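Recording the enabled set alongside the selected event also lets a replay runner detect divergence early. The sketch below assumes a hypothetical `Decision` record and a toy `toy_enabled` function standing in for the real cluster state; it only illustrates the check, not a full replay engine.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Decision:
    step: int
    enabled: Tuple[str, ...]  # every choice that was available, in sorted order
    chosen: str


class ReplayDivergence(Exception):
    """Raised when the replayed run offers different choices than the recording."""


def replay(trace: List[Decision],
           enabled_after: Callable[[List[str]], List[str]]) -> List[str]:
    applied: List[str] = []
    for d in trace:
        current = tuple(sorted(enabled_after(applied)))
        if current != d.enabled:
            raise ReplayDivergence(f"step {d.step}: recorded {d.enabled}, got {current}")
        applied.append(d.chosen)
    return applied


ALL_EVENTS = ["deliver:m1", "fault:crash-order-2", "timer:retry"]


def toy_enabled(applied: List[str]) -> List[str]:
    # Toy model: an event stays enabled until it has been applied once.
    return [e for e in ALL_EVENTS if e not in applied]


trace = [
    Decision(0, ("deliver:m1", "fault:crash-order-2", "timer:retry"), "timer:retry"),
    Decision(1, ("deliver:m1", "fault:crash-order-2"), "deliver:m1"),
    Decision(2, ("fault:crash-order-2",), "fault:crash-order-2"),
]
print(replay(trace, toy_enabled))
```

If a code change or a leaked source of nondeterminism alters what is enabled at some step, the replay stops at that step instead of silently producing a different run.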
Worked Example
For CheckoutService, the harness might define these simulated components:
nodes:
- order-1, order-2
- inventory-1
- payment-adapter
controlled boundaries:
- client workload issues confirm(order_id, idempotency_key)
- simulated network routes replication and dependency messages
- simulated clock fires retry timers
- durable store adapter records committed order state
- payment adapter records accepted captures
- oracle checks no duplicate capture per idempotency key
A failing run can then be represented as architecture-level decisions:
step 1 workload: client-A confirm(order-101, k1) -> order-1
step 2 order-1: send capture(order-101, k1) -> payment-adapter
step 3 timer: client-A retry timer becomes ready
step 4 scheduler: fire retry timer before payment response delivery
step 5 workload: retry confirm(order-101, k1) -> order-2
step 6 network: delay idempotency replication from order-1 to order-2
step 7 order-2: send capture(order-101, k1) -> payment-adapter
step 8 payment-adapter: accept both captures
step 9 oracle: fail duplicate capture invariant
The architecture makes the failure reproducible because the timeout, retry, network delay, dependency response, and oracle evidence all pass through harness-owned components. The run is not a collection of lucky sleeps. It is a replayable schedule inside a controlled model.
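The failing schedule can be reduced to a small model to show how the ordering alone decides the outcome. Everything here is a hypothetical simplification: `OrderNode` tracks idempotency keys in memory, `PaymentAdapter` accepts every capture (as in step 8), and replication is an explicit event the schedule can place before or after the retry.

```python
from collections import Counter
from typing import List, Tuple


class PaymentAdapter:
    """Deliberately naive fake: accepts every capture it receives."""

    def __init__(self) -> None:
        self.captures: List[Tuple[str, str]] = []

    def capture(self, order_id: str, key: str) -> None:
        self.captures.append((order_id, key))


class OrderNode:
    def __init__(self, name: str, payment: PaymentAdapter) -> None:
        self.name = name
        self.payment = payment
        self.seen_keys = set()

    def confirm(self, order_id: str, key: str) -> None:
        if key in self.seen_keys:
            return  # idempotency key already known: suppress the capture
        self.seen_keys.add(key)
        self.payment.capture(order_id, key)

    def apply_replication(self, key: str) -> None:
        self.seen_keys.add(key)


def run(schedule: List[tuple]) -> List[Tuple[str, str]]:
    payment = PaymentAdapter()
    nodes = {n: OrderNode(n, payment) for n in ("order-1", "order-2")}
    for event in schedule:
        if event[0] == "confirm":
            _, node, order_id, key = event
            nodes[node].confirm(order_id, key)
        elif event[0] == "replicate":
            _, node, key = event
            nodes[node].apply_replication(key)
    return payment.captures


def duplicate_captures(captures: List[Tuple[str, str]]) -> List[str]:
    """Oracle: return idempotency keys that were captured more than once."""
    counts = Counter(key for _, key in captures)
    return [k for k, n in counts.items() if n > 1]


failing = [
    ("confirm", "order-1", "order-101", "k1"),
    ("confirm", "order-2", "order-101", "k1"),  # retry lands before replication
    ("replicate", "order-2", "k1"),             # delayed idempotency replication
]
passing = [failing[0], failing[2], failing[1]]  # replication delivered first

assert duplicate_captures(run(failing)) == ["k1"]
assert duplicate_captures(run(passing)) == []
```

The two schedules contain the same events; only the order chosen by the scheduler differs, which is exactly the information the decision trace preserves.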
Fidelity and Control
The hardest architecture decision is how much to simulate. Pure simulation gives strong replay control, but it may omit production details such as kernel buffering, real storage latency, connection pooling, or thread contention. Black-box integration testing includes those details, but it can leave too much nondeterminism outside the harness.
Many teams use a layered approach:
- deterministic simulation for protocol logic, interleavings, fault models, and invariant checking
- integration tests for real clients, storage drivers, configuration, and deployment packaging
- production trace replay for workload shape and operational evidence
No layer replaces the others. The deterministic harness is strongest when the question is "can this legal interleaving violate our contract?" It is weaker when the question is "does this TLS setting work with the real load balancer?" Naming that boundary keeps the test evidence honest.
Architecture Mistakes
One mistake is hiding nondeterminism inside helpers. If a helper uses real time, unseeded random numbers, or a real thread pool, a test may pass most of the time and fail only when the host runtime happens to create the right race, and that failure may not reproduce from the recorded seed.
Another mistake is recording only service logs. Logs are useful for explanation, but replay needs scheduler decisions, enabled events, dependency effects, and client-visible outcomes.
A third mistake is building fakes that are too polite. A simulated dependency that always responds immediately cannot test timeout, retry, duplicate response, or partial failure behavior. A good fake is deterministic, but it should still support the failures in the model.
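As an illustration of a fake that stays deterministic without being polite, the sketch below scripts the outcome of each call in advance. `ScriptedPaymentFake` and its outcome labels (`"ok"`, `"reject"`, `"lost-response"`) are hypothetical names for this example; in a full harness the script would typically come from the scheduler or fault controller.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple

VALID_OUTCOMES = {"ok", "reject", "lost-response"}


class ScriptedPaymentFake:
    """Deterministic payment fake whose per-call behavior follows a script."""

    def __init__(self, script: Iterable[str]) -> None:
        self._script: Deque[str] = deque(script)
        self.accepted: List[Tuple[str, str]] = []  # captures that took effect

    def capture(self, order_id: str, key: str) -> str:
        # Once the script is exhausted, fall back to a plain success.
        outcome = self._script.popleft() if self._script else "ok"
        if outcome not in VALID_OUTCOMES:
            raise ValueError(f"unknown scripted outcome: {outcome}")
        if outcome == "reject":
            return "rejected"
        self.accepted.append((order_id, key))
        if outcome == "lost-response":
            # Partial failure: the capture happened, but the caller never hears back.
            raise TimeoutError(f"no response for capture {key}")
        return "ok"


fake = ScriptedPaymentFake(["lost-response", "ok"])
try:
    fake.capture("order-101", "k1")
except TimeoutError:
    pass                                   # the client sees a timeout and retries
result = fake.capture("order-101", "k1")
print(result, fake.accepted)               # both captures took effect
```

The lost-response case is the one an always-successful fake can never produce: the dependency's state changes while the caller observes a failure, which is what drives the retry path under test.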
The last mistake is letting the harness become a second implementation of the system. The harness should model the environment and check contracts; it should not encode the same business logic as the service in a way that masks bugs or duplicates assumptions.
Practice
Sketch a deterministic harness for a queue worker, replicated cache, or checkout workflow. Name one component for each of these responsibilities:
- workload driver
- scheduler
- simulated network or dependency
- clock and timer control
- history recorder
- oracle
Then mark which production dependencies must be replaced by deterministic adapters and which can remain real in a separate integration test. The result should tell you where the replay boundary actually sits.
Connections
- Builds on Schedule Control and Interleaving Exploration by showing the harness components that make scheduler decisions possible.
- Prepares for Fault Injection for Crashes, Partitions, and Message Loss, where the fault controller becomes a first-class source of scheduled events.
- Connects to distributed systems implementation because controllable time, network, storage, and dependency boundaries shape how testable the system becomes.
Resources
- [PAPER] FoundationDB: A Distributed Unbundled Transactional Key Value Store
- [PAPER] Lineage-driven Fault Injection
- [PAPER] Finding and Reproducing Heisenbugs in Concurrent Programs
- [DOC] Jepsen
Key Takeaways
- A deterministic harness is an architecture for controlling time, network, dependencies, faults, workload, history, and oracles.
- The replay boundary must include every source of nondeterminism that can change the observable history.
- Pure simulation improves replay control but must be balanced against fidelity to production behavior.
- Good harness architecture records both the selected schedule and the enabled choices it did not take.