Distributed Testing, Simulation, and Deterministic Replay: Fault Injection for Crashes, Partitions, and Message Loss

007 30 min intermediate

Core Insight

Fault injection is useful only when the injected fault has precise semantics. "Kill a node" or "drop some packets" sounds concrete, but a deterministic harness needs to know when the fault happens, what state survives it, which messages are affected, what clients observe, and how recovery proceeds.

In CheckoutService, crashing order-1 is not one fault. It could crash before writing the order, after writing but before replying, after sending a payment capture, or while an idempotency replication message is queued. A partition can isolate clients from one replica, split replicas from each other, or block only replication traffic while payment traffic continues. Message loss can drop a request, a response, a replication update, or a retry acknowledgement. Each version creates a different observable history.

The harness from lesson 6 should treat faults as scheduled events with explicit boundaries. The trade-off is precision versus search space: precise faults make failures explainable and replayable, but every extra fault placement creates more schedules to explore.

Faults as Scheduled Events

In a deterministic simulation, a fault is not an external accident. It is an event the scheduler can choose when the failure model allows it.

Examples:

step 18: crash order-1 after durable commit(order-101)
step 23: partition order-2 from inventory-1
step 31: drop replication(in-flight-key-k1) from order-1 to order-2
step 44: restart order-1 from disk snapshot S7

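This event form gives the harness three properties: each fault has a known position in the schedule, the same schedule can be replayed with the same faults, and a failing run can be explained by the fault events that preceded the failure.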

Random fault injection can still be useful, but randomness should choose from explicit fault events. A seed should select "drop message m42 from order-1 to order-2" rather than "make the network flaky for a while" with no record of which messages were affected.
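A minimal sketch of seeded selection from explicit fault events, assuming a hypothetical FaultEvent record and choose_faults helper (neither name comes from the lesson 6 harness). The point it illustrates is that the seed picks named events, so the same seed reproduces the same faults:

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultEvent:
    kind: str    # "crash", "partition", "drop", or "restart"
    target: str  # node, path, or message the fault applies to
    detail: str  # boundary, direction, or recovery source


def choose_faults(candidates, seed, count):
    """Pick `count` explicit fault events; the same seed picks the same events."""
    rng = random.Random(seed)  # private generator, so no global state leaks in
    return rng.sample(candidates, count)


# Hypothetical candidate list, modeled on the example events above.
CANDIDATES = [
    FaultEvent("crash", "order-1", "after durable commit(order-101)"),
    FaultEvent("partition", "order-2 -> inventory-1", "block"),
    FaultEvent("drop", "m42", "order-1 -> order-2"),
    FaultEvent("restart", "order-1", "from disk snapshot S7"),
]

first = choose_faults(CANDIDATES, seed=7, count=2)
second = choose_faults(CANDIDATES, seed=7, count=2)
for event in first:
    print(event)
```

Because every chosen fault is a record rather than a probability, the run log can list exactly which events were injected.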

Crash Injection

A crash fault should specify what stops and what survives. That depends on the failure model from lesson 3.

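Important crash boundaries include a crash before the durable write, after the durable write but before the reply to the client, after an external side effect such as a payment capture has been sent, and while a replication message is still queued.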

For CheckoutService, these two crashes are very different:

crash A:
order-1 receives confirm(order-101, k1)
order-1 crashes before durable write

crash B:
order-1 writes committed order-101
order-1 sends capture(order-101, k1)
order-1 crashes before replying to client-A

In crash A, a retry may legitimately create the first committed order. In crash B, the retry must not create a second external payment capture if the service promises idempotency across crash-recovery. The harness must record which boundary occurred, because the oracle cannot infer it from "node crashed" alone.

Crash injection also needs restart semantics. Does the node restart with its durable log, an old snapshot, a partially applied transaction, or empty volatile state plus durable records? Recovery is part of the fault, not a cleanup step outside the test.
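The difference between crash A and crash B can be sketched with a toy node that separates durable from volatile state. OrderNode, CrashBoundary, and the "empty volatile state plus durable records" restart rule are illustrative assumptions, not CheckoutService's actual design, and crash B's payment capture is left out to keep the sketch small:

```python
from dataclasses import dataclass, field
from enum import Enum


class CrashBoundary(Enum):
    BEFORE_DURABLE_WRITE = "before durable write"
    AFTER_WRITE_BEFORE_REPLY = "after durable write, before reply"


@dataclass
class OrderNode:
    durable_orders: set = field(default_factory=set)     # survives a crash
    pending_replies: list = field(default_factory=list)  # volatile, lost on crash

    def confirm(self, order_id, crash_at=None):
        """Handle a confirm; return the reply, or None if the node crashed first."""
        if crash_at is CrashBoundary.BEFORE_DURABLE_WRITE:
            return self.crash()
        self.durable_orders.add(order_id)
        self.pending_replies.append(order_id)
        if crash_at is CrashBoundary.AFTER_WRITE_BEFORE_REPLY:
            return self.crash()
        self.pending_replies.clear()
        return f"ok {order_id}"

    def crash(self):
        self.pending_replies.clear()  # volatile state does not survive
        return None                   # the client sees no reply

    def restart(self):
        # Assumed recovery rule: empty volatile state plus durable records.
        return OrderNode(durable_orders=set(self.durable_orders))


node_a = OrderNode()
reply_a = node_a.confirm("order-101", CrashBoundary.BEFORE_DURABLE_WRITE)
node_b = OrderNode()
reply_b = node_b.confirm("order-101", CrashBoundary.AFTER_WRITE_BEFORE_REPLY)
print(reply_a, node_a.restart().durable_orders)
print(reply_b, node_b.restart().durable_orders)
```

In both runs the client observes the same thing, a missing reply, yet the restarted nodes hold different durable state. That is why the boundary has to be recorded by the harness rather than inferred from the client's view.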

Partition Injection

A partition is a rule about which communication paths are blocked. It is not just a slow network.

A useful partition event names the endpoints and direction:

partition:
- block order-1 -> order-2 replication
- allow order-2 -> order-1 heartbeat
- allow clients -> both order replicas
- allow order replicas -> payment-adapter

That detail matters because partitions are often asymmetric and partial. A service can continue accepting client writes while replication is blocked. A leader can receive client traffic but fail to reach a quorum. A payment adapter can remain reachable even while order replicas disagree.

The harness should model partitions as topology changes in the simulated network. Messages that match a blocked path should become delayed, dropped, or held according to the network model. The decision should be explicit: the harness records whether each blocked message was dropped, delayed, or held until the partition heals.

Those choices produce different client-visible behavior and different replay histories.
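One way to make the blocked-path decision explicit is a per-path policy in the simulated network. The sketch below assumes a hypothetical SimNetwork with two policies, drop and hold-until-heal; a delay policy would need a simulated clock and is omitted:

```python
from dataclasses import dataclass, field
from enum import Enum


class BlockPolicy(Enum):
    DROP = "drop"
    HOLD = "hold"  # keep the message until the partition heals


@dataclass
class SimNetwork:
    blocked: dict = field(default_factory=dict)  # (src, dst, channel) -> policy
    held: list = field(default_factory=list)
    log: list = field(default_factory=list)      # every decision, for replay

    def block(self, src, dst, channel, policy):
        """Block one direction of one channel; the reverse path stays open."""
        self.blocked[(src, dst, channel)] = policy

    def send(self, src, dst, channel, payload):
        policy = self.blocked.get((src, dst, channel))
        if policy is None:
            outcome = "delivered"
        elif policy is BlockPolicy.DROP:
            outcome = "dropped"
        else:
            outcome = "held"
            self.held.append((src, dst, channel, payload))
        self.log.append((src, dst, channel, payload, outcome))
        return outcome

    def heal(self):
        """Remove all blocks and release held messages in their original order."""
        self.blocked.clear()
        released, self.held = self.held, []
        for src, dst, channel, payload in released:
            self.log.append((src, dst, channel, payload, "delivered after heal"))
        return released


net = SimNetwork()
net.block("order-1", "order-2", "replication", BlockPolicy.HOLD)
outcomes = [
    net.send("order-1", "order-2", "replication", "in-flight k1"),
    net.send("order-2", "order-1", "heartbeat", "ping"),
    net.send("client-A", "order-2", "client", "confirm(order-101, k1)"),
]
released = net.heal()
print(outcomes, released)
```

Keying the block on (sender, receiver, channel) is what allows the asymmetric, partial partition from the example: replication from order-1 to order-2 is held while heartbeats and client traffic still flow.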

Message Loss

Message loss is most useful when it names the message, not just a probability.

Dropping one of these messages can expose different bugs: a client request, a response to the client, a replication update, or a retry acknowledgement.

The harness should record message identity, sender, receiver, message type, causal parent, and whether the message was dropped, delayed, duplicated, or reordered. That connects message loss to the event ordering from lesson 4 and the schedule decisions from lesson 5.

For example:

m17 order-1 -> order-2: replicate in-flight capture(k1)
fault: drop m17
m18 client-A -> order-2: retry confirm(order-101, k1)

If order-2 double-charges, the failure is no longer mysterious. The key replication message was lost before the retry arrived. The question becomes whether the service contract required correctness under that loss and what mechanism should have prevented the duplicate side effect.
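The recorded fields can be sketched as a small message record plus a query over the trace. MessageRecord and lost_before are hypothetical names, and m16 and the causal parent links are assumptions added so the trace has a starting request:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MessageRecord:
    msg_id: str
    sender: str
    receiver: str
    msg_type: str
    causal_parent: Optional[str]  # id of the message that caused this one
    fate: str  # "delivered", "dropped", "delayed", "duplicated", or "reordered"


def lost_before(trace, msg_id):
    """Return ids of messages dropped earlier in the trace than `msg_id`."""
    dropped = []
    for record in trace:
        if record.msg_id == msg_id:
            return dropped
        if record.fate == "dropped":
            dropped.append(record.msg_id)
    raise KeyError(msg_id)


# m16 and the causal_parent values are assumed for illustration.
TRACE = [
    MessageRecord("m16", "client-A", "order-1",
                  "confirm(order-101, k1)", None, "delivered"),
    MessageRecord("m17", "order-1", "order-2",
                  "replicate in-flight capture(k1)", "m16", "dropped"),
    MessageRecord("m18", "client-A", "order-2",
                  "retry confirm(order-101, k1)", "m16", "delivered"),
]

print(lost_before(TRACE, "m18"))
```

With the trace in this form, "which losses preceded the retry" is a lookup rather than a guess.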

Worked Example

Suppose the team wants to test that CheckoutService does not capture payment twice when a replica crashes and a replication message is lost.

The harness can build this fault script:

failure model:
- crash-recovery for one order replica
- message loss for replication traffic
- clients may time out and retry with the same idempotency key
- payment adapter is reachable and records captures

script:
1. client-A sends confirm(order-101, k1) to order-1
2. order-1 sends capture(order-101, k1) to payment-adapter
3. payment-adapter accepts capture but response is delayed
4. drop replicate(in-flight k1) from order-1 to order-2
5. crash order-1 before replying to client-A
6. client-A retry timer fires
7. client-A sends confirm(order-101, k1) to order-2
8. order-2 sends capture(order-101, k1) to payment-adapter
9. payment-adapter accepts second capture

The oracle should fail this run if the service promises one external capture per idempotency key. The replay history should include the delayed response, dropped replication message, crash boundary, retry timer, and two payment captures.

This is better than a generic "chaos run" because the failing history names the causal chain. The team can now ask a focused design question: should the idempotency key be coordinated through durable shared state, a quorum write, a payment-side idempotency mechanism, or a compensation path?
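The script and its oracle can be sketched as a toy replay. The per-replica key sets below are an assumed stand-in for whatever idempotency state CheckoutService really keeps, and run_script and oracle_one_capture_per_key are hypothetical names:

```python
from collections import Counter


def run_script(replicate_dropped):
    """Replay the nine-step script; return the history as a list of event tuples."""
    history = []
    seen_keys = {"order-1": set(), "order-2": set()}  # assumed per-replica state

    def confirm(replica, order_id, key):
        history.append(("confirm", replica, order_id, key))
        if key in seen_keys[replica]:
            history.append(("dedupe", replica, key))
            return
        seen_keys[replica].add(key)
        history.append(("capture", replica, order_id, key))

    confirm("order-1", "order-101", "k1")                              # steps 1-2
    history.append(("delay", "payment-adapter", "capture response"))   # step 3
    if replicate_dropped:                                              # step 4
        history.append(("drop", "replicate(in-flight k1)", "order-1 -> order-2"))
    else:
        seen_keys["order-2"].add("k1")
        history.append(("deliver", "replicate(in-flight k1)", "order-1 -> order-2"))
    history.append(("crash", "order-1", "before replying to client-A"))  # step 5
    history.append(("timer", "client-A", "retry"))                     # step 6
    confirm("order-2", "order-101", "k1")                              # steps 7-9
    return history


def oracle_one_capture_per_key(history):
    """Return keys captured more than once; an empty list means the run passes."""
    counts = Counter(event[3] for event in history if event[0] == "capture")
    return sorted(key for key, n in counts.items() if n > 1)


failing = run_script(replicate_dropped=True)
passing = run_script(replicate_dropped=False)
print(oracle_one_capture_per_key(failing))
print(oracle_one_capture_per_key(passing))
```

Flipping the single drop decision changes the oracle's verdict in this toy model, which is the kind of causal evidence the replay history is meant to preserve.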

Fault Campaign Design

A single fault is rarely enough. A useful campaign chooses a small set of fault placements that match the system's promises.

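For a replicated order service, a campaign might cover a crash before the durable write, a crash after the external payment capture but before the client reply, a partition that blocks replication while clients can still reach both replicas, a dropped replication message followed by a client retry, and the recovery that follows each of these.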

Each placement should have a reason. It should stress a promise such as durability, idempotency, quorum behavior, read-your-writes, or external side-effect safety. If a fault does not connect to a promise or oracle, it is noise.

The campaign should also keep successful runs. A run that passes with a crash before the durable write, next to a run that fails with a crash after the external side effect, is strong evidence of where the bug's real boundary lies.
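A small sketch of both ideas, assuming hypothetical Placement records and illustrative pass/fail results rather than real run data. Placements without a promise are separated out as noise, and ordered results are scanned for the point where passing turns into failing:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    fault: str
    promise: str  # the promise this fault stresses; empty if none


def split_campaign(placements):
    """Separate placements that stress a named promise from noise."""
    useful = [p for p in placements if p.promise]
    noise = [p for p in placements if not p.promise]
    return useful, noise


def bug_boundary(results):
    """Return (last passing, first failing) faults from ordered results, or None."""
    for (prev, prev_ok), (cur, cur_ok) in zip(results, results[1:]):
        if prev_ok and not cur_ok:
            return prev, cur
    return None


CAMPAIGN = [
    Placement("crash before durable write", "durability"),
    Placement("crash after external side effect", "external side-effect safety"),
    Placement("forge a message from order-2", ""),  # outside the failure model
]
useful, noise = split_campaign(CAMPAIGN)

# Illustrative outcomes, ordered along the request path; not measured data.
RESULTS = [
    ("crash before durable write", True),
    ("crash after external side effect", False),
]
print(bug_boundary(RESULTS))
```

Keeping the passing entry in RESULTS is what lets bug_boundary report a pair instead of a lone failure.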

Common Fault Injection Mistakes

One mistake is making faults too vague. "The network was bad" is not a replayable event. The harness needs to know which path, message, direction, and delivery rule changed.

Another mistake is injecting faults the system does not claim to tolerate and then treating the result as a product bug. If the failure model excludes Byzantine behavior, forged messages belong in a different test layer.

A third mistake is forgetting client observation. A crash before a success response and a crash after a success response create different user-facing promises. The oracle must know what the client saw.

The last mistake is running fault injection without recovery. Many bugs appear after the partition heals or the crashed node restarts. Recovery is where stale state, duplicate effects, and inconsistent leadership often become visible.

Practice

Choose one invariant from the track so far, such as "one payment capture per idempotency key." Write three fault events that could threaten it:

  1. one crash placement
  2. one partition topology
  3. one message loss decision

For each fault, write what the client observes and what the harness must record for replay. If you cannot name those observations, the fault is not yet precise enough for deterministic testing.
