Distributed Testing, Simulation, and Deterministic Replay: Fault Injection for Crashes, Partitions, and Message Loss

007 30 min intermediate

Core Insight

Fault injection is useful only when the injected fault has precise semantics. "Kill a node" or "drop some packets" sounds concrete, but a deterministic harness needs to know when the fault happens, what state survives it, which messages are affected, what clients observe, and how recovery proceeds.

In CheckoutService, crashing order-1 is not one fault. It could crash before writing the order, after writing but before replying, after sending a payment capture, or while an idempotency replication message is queued. A partition can isolate clients from one replica, split replicas from each other, or block only replication traffic while payment traffic continues. Message loss can drop a request, a response, a replication update, or a retry acknowledgement. Each version creates a different observable history.

The harness from lesson 6 should treat faults as scheduled events with explicit boundaries. The trade-off is precision versus search space: precise faults make failures explainable and replayable, but every extra fault placement creates more schedules to explore.

Faults as Scheduled Events

In a deterministic simulation, a fault is not an external accident. It is an event the scheduler can choose when the failure model allows it.

Examples:

step 18: crash order-1 after durable commit(order-101)
step 23: partition order-2 from inventory-1
step 31: drop replication(in-flight-key-k1) from order-1 to order-2
step 44: restart order-1 from disk snapshot S7

This event form gives the harness three properties:

Replayability: the fault can happen at the same scheduler step again.
Interpretability: the history says what boundary the fault crossed.
Oracle alignment: the checker knows whether the resulting history is allowed under the failure model.

Random fault injection can still be useful, but randomness should choose from explicit fault events. A seed should select "drop message m42 from order-1 to order-2" rather than "make the network flaky for a while" with no record of which messages were affected.
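As a minimal sketch of that idea, the snippet below represents the example faults as explicit records and lets a seeded random generator pick among them. The FaultEvent type, its field names, and the choose_faults helper are hypothetical names introduced for illustration, not part of the harness from lesson 6.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultEvent:
    step: int    # scheduler step at which the fault fires
    kind: str    # "crash", "partition", "drop", or "restart"
    target: str  # node, path, or message the fault applies to
    detail: str  # boundary or payload that makes the fault interpretable


def choose_faults(candidates, seed, count):
    """Pick `count` explicit fault events with a seeded RNG, ordered by step."""
    rng = random.Random(seed)  # same seed -> same selection -> replayable run
    chosen = rng.sample(candidates, count)
    return sorted(chosen, key=lambda fault: fault.step)


# The four example events from the text, written as explicit records.
CANDIDATES = [
    FaultEvent(18, "crash", "order-1", "after durable commit(order-101)"),
    FaultEvent(23, "partition", "order-2 / inventory-1", "block path"),
    FaultEvent(31, "drop", "order-1 -> order-2", "replication(in-flight-key-k1)"),
    FaultEvent(44, "restart", "order-1", "from disk snapshot S7"),
]

if __name__ == "__main__":
    for fault in choose_faults(CANDIDATES, seed=7, count=2):
        print(f"step {fault.step}: {fault.kind} {fault.target} ({fault.detail})")
```

Because the seed only selects among named events, the run log can list exactly which faults fired, and re-running with the same seed reproduces the same list.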

Crash Injection

A crash fault should specify what stops and what survives. That depends on the failure model from lesson 3.

Important crash boundaries include:

Before a request is accepted.
After a request is accepted but before durable state is written.
After durable state is written but before a response is returned.
After an outbound message is sent but before local state is committed.
During recovery, while replaying a log or rebuilding in-memory state.

For CheckoutService, these two crashes are very different:

crash A:
order-1 receives confirm(order-101, k1)
order-1 crashes before durable write

crash B:
order-1 writes committed order-101
order-1 sends capture(order-101, k1)
order-1 crashes before replying to client-A

In crash A, a retry may legitimately create the first committed order. In crash B, the retry must not create a second external payment capture if the service promises idempotency across crash-recovery. The harness must record which boundary occurred, because the oracle cannot infer it from "node crashed" alone.

Crash injection also needs restart semantics. Does the node restart with its durable log, an old snapshot, a partially applied transaction, or empty volatile state plus durable records? Recovery is part of the fault, not a cleanup step outside the test.
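The difference between crash A and crash B can be sketched with a node that separates durable from volatile state. This is an illustrative model only: the CrashBoundary names, the Node class, and the assumed restart rule (volatile state rebuilt from durable records alone) are assumptions, and a real service may recover differently.

```python
from dataclasses import dataclass, field
from enum import Enum


class CrashBoundary(Enum):
    BEFORE_ACCEPT = "before request accepted"
    BEFORE_DURABLE_WRITE = "accepted, before durable write"
    AFTER_DURABLE_WRITE = "durable write done, before response"
    AFTER_SEND_BEFORE_COMMIT = "outbound message sent, before local commit"
    DURING_RECOVERY = "during recovery"


@dataclass
class Node:
    name: str
    durable: dict = field(default_factory=dict)    # survives a crash
    volatile: dict = field(default_factory=dict)   # lost on crash
    crash_log: list = field(default_factory=list)  # boundaries recorded for replay

    def crash(self, boundary):
        # Record the boundary so the oracle does not have to guess it.
        self.crash_log.append(boundary)
        self.volatile.clear()

    def restart(self):
        # Assumed restart semantics: rebuild volatile state from durable records.
        self.volatile = dict(self.durable)


def retry_may_create_order(node, order_id):
    """After restart, a retry may create the order only if no durable record exists."""
    return order_id not in node.durable
```

Under crash A the node holds only volatile state for order-101, so after restart retry_may_create_order returns True. Under crash B the durable record survives, so the same call returns False and the retry has to be treated as a duplicate.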

Partition Injection

A partition is a rule about which communication paths are blocked. It is not just a slow network.

A useful partition event names the endpoints and direction:

partition:
- block order-1 -> order-2 replication
- allow order-2 -> order-1 heartbeat
- allow clients -> both order replicas
- allow order replicas -> payment-adapter

That detail matters because partitions are often asymmetric and partial. A service can continue accepting client writes while replication is blocked. A leader can receive client traffic but fail to reach a quorum. A payment adapter can remain reachable even while order replicas disagree.

The harness should model partitions as topology changes in the simulated network. A message that matches a blocked path should be dropped, delayed, held, or rejected according to the network model, and the decision should be explicit:

Drop means the message is gone.
Delay means the message can be delivered later.
Hold means the message waits until the partition heals.
Reject means the sender receives an error or connection failure.

Those choices produce different client-visible behavior and different replay histories.
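One way to make the delivery rule explicit is to key blocked paths by source, destination, and channel, and attach a rule to each. The sketch below is a hypothetical simulated network, not an existing library: SimNetwork, Rule, and the channel names are assumptions. It only records a "delay" decision and leaves the later delivery time to the scheduler.

```python
from dataclasses import dataclass, field
from enum import Enum


class Rule(Enum):
    DROP = "drop"      # message is gone
    DELAY = "delay"    # message may be delivered later by the scheduler
    HOLD = "hold"      # message waits until the partition heals
    REJECT = "reject"  # sender sees an error or connection failure


@dataclass
class SimNetwork:
    blocked: dict = field(default_factory=dict)  # (src, dst, channel) -> Rule
    held: list = field(default_factory=list)     # messages waiting for a heal
    history: list = field(default_factory=list)  # replayable record of decisions

    def partition(self, src, dst, channel, rule):
        # Directional: blocking src -> dst says nothing about dst -> src.
        self.blocked[(src, dst, channel)] = rule
        self.history.append(("partition", src, dst, channel, rule.value))

    def heal(self, src, dst, channel):
        self.blocked.pop((src, dst, channel), None)
        self.history.append(("heal", src, dst, channel))
        path = (src, dst, channel)
        released = [m for m in self.held if m[:3] == path]
        self.held = [m for m in self.held if m[:3] != path]
        for message in released:
            self.history.append(("deliver", *message))
        return released

    def send(self, src, dst, channel, payload):
        message = (src, dst, channel, payload)
        rule = self.blocked.get((src, dst, channel))
        if rule is None:
            outcome = "deliver"
        else:
            outcome = rule.value
            if rule is Rule.HOLD:
                self.held.append(message)
        self.history.append((outcome, *message))
        return outcome
```

Blocking ("order-1", "order-2", "replication") with Rule.HOLD leaves heartbeats from order-2 to order-1 and client traffic untouched, which matches the asymmetric partition listed above.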

Message Loss

Message loss is most useful when it names the message, not just a probability.

Dropping one of these messages can expose different bugs:

client request to order-1
order-1 response to the client
idempotency replication from order-1 to order-2
payment capture request to the payment adapter
payment capture response back to order-1
heartbeat or lease renewal between replicas

The harness should record message identity, sender, receiver, message type, causal parent, and whether the message was dropped, delayed, duplicated, or reordered. That connects message loss to the event ordering from lesson 4 and the schedule decisions from lesson 5.

For example:

m17 order-1 -> order-2: replicate in-flight capture(k1)
fault: drop m17
m18 client-A -> order-2: retry confirm(order-101, k1)

If order-2 double-charges, the failure is no longer mysterious. The key replication message was lost before the retry arrived. The question becomes whether the service contract required correctness under that loss and what mechanism should have prevented the duplicate side effect.
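A small record type is enough to capture those fields. In the sketch below, MessageRecord and dropped_before are hypothetical names, and the causal parent "m16" stands in for the original confirm request, which the example above does not number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MessageRecord:
    msg_id: str
    sender: str
    receiver: str
    msg_type: str
    causal_parent: Optional[str]  # id of the message that caused this one, if any
    fate: str  # "delivered", "dropped", "delayed", "duplicated", or "reordered"


def dropped_before(history, msg_id):
    """Return the dropped messages recorded earlier in the history than msg_id."""
    dropped = []
    for record in history:
        if record.msg_id == msg_id:
            return dropped
        if record.fate == "dropped":
            dropped.append(record)
    raise KeyError(msg_id)


HISTORY = [
    # "m16" is an assumed id for the original confirm request.
    MessageRecord("m17", "order-1", "order-2", "replicate in-flight capture(k1)", "m16", "dropped"),
    MessageRecord("m18", "client-A", "order-2", "retry confirm(order-101, k1)", None, "delivered"),
]

if __name__ == "__main__":
    for record in dropped_before(HISTORY, "m18"):
        print(f"{record.msg_id} {record.sender} -> {record.receiver}: {record.msg_type}")
```

Asking which messages were dropped before m18 returns m17, which is the explanation the failing history needs to carry.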

Worked Example

Suppose the team wants to test that CheckoutService does not capture payment twice when a replica crashes and replication traffic is lost.

The harness can build this fault script:

failure model:
- crash-recovery for one order replica
- message loss for replication traffic
- clients may time out and retry with the same idempotency key
- payment adapter is reachable and records captures

script:
1. client-A sends confirm(order-101, k1) to order-1
2. order-1 sends capture(order-101, k1) to payment-adapter
3. payment-adapter accepts capture but response is delayed
4. drop replicate(in-flight k1) from order-1 to order-2
5. crash order-1 before replying to client-A
6. client-A retry timer fires
7. client-A sends confirm(order-101, k1) to order-2
8. order-2 sends capture(order-101, k1) to payment-adapter
9. payment-adapter accepts second capture

The oracle should fail this run if the service promises one external capture per idempotency key. The replay history should include the delayed response, dropped replication message, crash boundary, retry timer, and two payment captures.
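The script and its oracle can be sketched as a tiny replay. This is a toy model under stated assumptions: each replica tracks only the idempotency keys it has seen itself, and the shared_key_store flag stands in for a hypothetical durable shared store, one of several possible fixes. None of the function names come from the real service.

```python
from collections import Counter


def run_script(shared_key_store=False):
    """Replay the nine-step fault script; return (history, captures)."""
    history, captures = [], []
    seen = {"order-1": set(), "order-2": set()}  # per-replica in-flight keys
    shared = set()  # hypothetical durable shared idempotency store

    def confirm(replica, order_id, key):
        history.append(f"client-A -> {replica}: confirm({order_id}, {key})")
        if key in seen[replica] or (shared_key_store and key in shared):
            history.append(f"{replica}: duplicate {key} suppressed")
            return
        seen[replica].add(key)
        shared.add(key)
        captures.append(key)  # what the payment adapter records
        history.append(f"{replica} -> payment-adapter: capture({order_id}, {key})")

    confirm("order-1", "order-101", "k1")                                      # steps 1-2
    history.append("payment-adapter: capture accepted, response delayed")      # step 3
    history.append("fault: drop replicate(in-flight k1) order-1 -> order-2")   # step 4
    history.append("fault: crash order-1 before replying to client-A")         # step 5
    history.append("client-A: retry timer fires")                              # step 6
    confirm("order-2", "order-101", "k1")                                      # steps 7-9
    return history, captures


def oracle_one_capture_per_key(captures):
    """Return the idempotency keys that were captured more than once."""
    return [key for key, count in Counter(captures).items() if count > 1]


if __name__ == "__main__":
    history, captures = run_script()
    print("\n".join(history))
    print("violations:", oracle_one_capture_per_key(captures))
```

With the default arguments the oracle reports k1 as a violation. With shared_key_store=True the retry is suppressed and the oracle returns an empty list, which shows how the same schedule can be replayed against a candidate fix.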

This is better than a generic "chaos run" because the failing history names the causal chain. The team can now ask a focused design question: should the idempotency key be coordinated through durable shared state, a quorum write, a payment-side idempotency mechanism, or a compensation path?

Fault Campaign Design

A single fault is rarely enough. A useful campaign chooses a small set of fault placements that match the system's promises.

For a replicated order service, a campaign might cover:

crash before durable write
crash after durable write before response
crash after sending external side effect before response
partition between replicas while clients can still write
drop replication update for idempotency key
delay payment response past client timeout
heal partition and restart crashed replica

Each placement should have a reason. It should stress a promise such as durability, idempotency, quorum behavior, read-your-writes, or external side-effect safety. If a fault does not connect to a promise or oracle, it is noise.

The campaign should also keep successful runs. If the service passes under "crash before durable write" but fails under "crash after sending external side effect before response", the pair of results is strong evidence about where the bug's real boundary lies.
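A campaign can be kept honest by storing each placement next to the promise it stresses. The pairing below is one illustrative assignment, not a prescribed mapping, and the Placement type and helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Placement:
    fault: str
    promise: Optional[str]  # the promise this placement is meant to stress


# Illustrative pairing of the placements above with promises (an assumption).
CAMPAIGN = [
    Placement("crash before durable write", "durability"),
    Placement("crash after durable write before response", "idempotency"),
    Placement("crash after sending external side effect before response",
              "external side-effect safety"),
    Placement("partition between replicas while clients can still write",
              "quorum behavior"),
    Placement("drop replication update for idempotency key", "idempotency"),
    Placement("delay payment response past client timeout",
              "external side-effect safety"),
    Placement("heal partition and restart crashed replica", "read-your-writes"),
]


def noise(campaign):
    """Placements with no promise attached: candidates to drop or justify."""
    return [p.fault for p in campaign if not p.promise]


def summarize(results):
    """Split {fault: passed} results into (passed, failed), keeping both lists."""
    passed = [fault for fault, ok in results.items() if ok]
    failed = [fault for fault, ok in results.items() if not ok]
    return passed, failed
```

Calling noise(CAMPAIGN) returns an empty list here. Adding a placement with promise=None makes it show up immediately, which is the cue to either name the promise or remove the fault.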

Common Fault Injection Mistakes

One mistake is making faults too vague. "The network was bad" is not a replayable event. The harness needs to know which path, message, direction, and delivery rule changed.

Another mistake is injecting faults the system does not claim to tolerate and then treating the result as a product bug. If the failure model excludes Byzantine behavior, forged messages belong in a different test layer.

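A third mistake is forgetting client observation. A crash before a success response and a crash after one leave the service with different obligations: a client that saw success expects the order to survive recovery, while a client that saw only a timeout does not know the outcome and may retry. The oracle must know what the client saw.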

The last mistake is running fault injection without recovery. Many bugs appear after the partition heals or the crashed node restarts. Recovery is where stale state, duplicate effects, and inconsistent leadership often become visible.

Practice

Choose one invariant from the track so far, such as "one payment capture per idempotency key." Write three fault events that could threaten it:

one crash placement
one partition topology
one message loss decision

For each fault, write what the client observes and what the harness must record for replay. If you cannot name those observations, the fault is not yet precise enough for deterministic testing.

Connections

Builds on Deterministic Simulation Harness Architecture by using the fault controller as a scheduled source of replayable events.
Prepares for Network Simulation, Latency, and Backpressure, where partitions, delay, queueing, and delivery rules become richer.
Connects to reliability testing because meaningful fault injection depends on matching failures to explicit system promises and recovery behavior.

Resources

[DOC] Jepsen
[PAPER] Lineage-driven Fault Injection
[PAPER] FoundationDB: A Distributed Unbundled Transactional Key Value Store
[BOOK] Designing Data-Intensive Applications

Key Takeaways

Fault injection should create precise scheduled events, not vague chaos.
Crashes, partitions, and message loss need explicit boundaries so replay and oracles can interpret the run.
Client-visible outcomes decide whether a fault crosses a promised correctness boundary.
Recovery behavior is part of the fault story because many distributed bugs surface after healing or restart.
