Distributed Testing, Simulation, and Deterministic Replay: Workload Modeling and Jepsen-Style Histories

LESSON

Distributed Testing, Simulation, and Deterministic Replay

011 30 min intermediate

Distributed Testing, Simulation, and Deterministic Replay: Workload Modeling and Jepsen-Style Histories

Core Insight

In LedgerService, clients transfer money between accounts while replicas fail, recover, and exchange replication logs. A unit test can check one transfer. A property test can generate many transfers. But a distributed correctness test also needs a workload model: which clients act concurrently, which accounts they touch, which failures happen during the run, and what evidence is recorded when each operation starts and finishes.

A Jepsen-style history is a record of externally visible operations: invocations, completions, values, errors, timeouts, and process identities. The checker does not need to know every internal thread switch to ask whether the observed behavior could match the system's claimed consistency model. It asks: given these calls and these results, was there a legal sequential explanation, a legal transactional explanation, or a violation?

The trade-off is realism versus diagnosability. A workload that looks like production may be too broad to explain. A workload that is too tiny may never create the harmful overlap. Good workload modeling chooses the few dimensions that matter for the claim being tested, then records a history precise enough to check and replay.

Workloads Are Test Design

A workload is not just "send traffic." It is the shape of pressure the test applies to the system.

For a replicated service, workload design usually chooses:

A weak workload can make a broken system look correct. If every transfer touches a different account pair, the test will rarely expose lost updates. If every operation waits for the previous one to finish, the test will not examine concurrency. If every partition happens before or after the workload, the test may never observe the dangerous middle.

A strong workload is narrow on purpose. It puts pressure where the system makes a promise.

claim: transfers are linearizable
risk: concurrent debits and credits during leader failover
workload:
- 5 clients transfer among 3 hot accounts
- reads check total balance
- nemesis partitions the leader from one follower
- retries use the same request id

That workload does not test everything. It tests one claim under one kind of pressure with enough repetition to find races.

What a Jepsen-Style History Records

A history is the evidence produced by the workload. The common shape is a sequence of events, where each client operation has an invocation and a completion.

{:type :invoke, :process 1, :f :transfer, :value {:from A, :to B, :amount 5}}
{:type :invoke, :process 2, :f :transfer, :value {:from A, :to C, :amount 7}}
{:type :ok,     :process 1, :f :transfer, :value {:from A, :to B, :amount 5}}
{:type :fail,   :process 2, :f :transfer, :value :timeout}

The process identifies the logical client. The function :f names the operation. The value carries the arguments or observed result. The type says whether the operation began, completed successfully, failed, returned informational output, or remained unknown.

Unknown results are important. If a client times out after sending a write, the system may have applied the write or may not have applied it. Treating timeout as a clean failure can make the checker lie. Treating timeout as a possible success keeps the ambiguity visible.

A history may also include failure actions:

{:type :info, :process :nemesis, :f :partition, :value {:majority [n1 n2], :minority [n3]}}
{:type :info, :process :nemesis, :f :heal,      :value :all}

Those events are not client operations, but they explain why certain overlaps occurred.

Checking the History

The checker compares the history against a model of the system's promised behavior.

For a linearizable register, the model is simple:

state: current value

write(x): state = x, returns ok
read(): returns state

The checker asks whether concurrent operations can be ordered into a single sequential history that respects real-time ordering. If operation A completes before operation B starts, A must appear before B in the legal explanation. If A and B overlap, either order may be possible.

For a transfer system, the model might be:

state: balances per account

transfer(from, to, amount):
  if balance[from] >= amount:
    balance[from] -= amount
    balance[to] += amount
    returns ok
  else:
    returns fail

read_total():
  returns sum(balances)

If the total balance changes without an allowed external deposit or withdrawal, the history violates the model. If a transfer returns ok but the corresponding debit and credit cannot both be placed in any legal order, the checker has evidence of a consistency bug.

The checker is only as honest as the model. If the system claims eventual convergence, a linearizability checker may reject valid behavior. If the system claims serializable transactions, a simple final-total invariant may miss anomalies. The workload, history format, and checker must match the claim.

Worked Example

Suppose LedgerService claims linearizable transfers between three accounts. The test starts with:

A = 100
B = 100
C = 100
total = 300

The workload uses four clients:

client 1: transfer A -> B amount 10
client 2: transfer A -> C amount 10
client 3: read_total()
client 4: transfer B -> C amount 5

The nemesis can isolate the current leader from one follower while clients continue to send requests.

A failing history might reduce to:

t1  c1 invoke transfer A -> B 10
t2  nemesis partitions n1 from n2,n3
t3  c2 invoke transfer A -> C 10
t4  c1 ok
t5  c2 ok
t6  nemesis heals
t7  c3 invoke read_total
t8  c3 ok 290

The final read says the system lost 10 units. That does not necessarily prove which internal message was dropped, but it proves that the observed behavior cannot match the transfer model. The history is the externally checkable artifact.

A deterministic simulator can improve this further by linking the history to replay:

seed: 72109
workload event: c2 invoke transfer A -> C 10
schedule event: deliver request to stale leader n1
network event: delay replication log entry e44
history event: c2 ok

The history gives the consistency evidence. The replay gives the internal causal path that helps debug it.

Modeling Workload Dimensions

Good workload models choose dimensions deliberately.

Key distribution matters. One hot key stresses coordination; many independent keys stress throughput and routing. A small hot set often exposes conflict handling better than a huge random key space.

Operation mix matters. A read-heavy workload tests visibility and staleness. A write-heavy workload tests conflict and durability. A compare-and-set workload tests atomicity. A queue workload tests ordering and lost or duplicated delivery.

Timing matters. Sequential operations are easy to explain but weak for concurrency. Fully saturated traffic may hide useful structure. Bursts, pauses, and targeted overlap often produce smaller counterexamples.

Failure timing matters most of all. A partition before the workload tests availability under isolation. A partition during an acknowledged write tests uncertainty. A restart during recovery tests durability boundaries. The same failure at a different moment is a different test.

Common Failure Modes

One mistake is modeling the workload after average production traffic. Correctness bugs often need worst-case overlap, not average behavior. Production realism is useful, but it should not dilute the specific consistency claim.

Another mistake is recording only final states. Final state checks can find conservation bugs, but they miss read anomalies, stale successes, failed compare-and-set behavior, and operations whose timeout status matters.

A third mistake is treating timeouts as failures. A timeout means the client does not know the outcome. The checker should preserve that uncertainty unless the protocol provides a stronger guarantee.

A fourth mistake is using a checker that does not match the contract. Linearizability, serializability, eventual convergence, read-your-writes, monotonic reads, and at-least-once delivery are different claims. A history is useful only when the model names the right one.

Practice

Choose one system claim and design a Jepsen-style workload for it:

  1. Name the claim, such as linearizable register, serializable transfer, convergent set, or exactly-once external effect.
  2. Choose three to five operation types.
  3. Choose the key distribution: one key, a hot set, or many keys.
  4. Choose one failure action and when it occurs relative to the workload.
  5. Write the event fields needed in the history.
  6. Name the checker or invariant that will evaluate the history.

Then remove one event field and ask what the checker can no longer know. If removing completion status, process identity, or timeout ambiguity changes the result, that field is not incidental evidence.

Connections

Resources

Key Takeaways

PREVIOUS Distributed Testing, Simulation, and Deterministic Replay: Property-Based Testing for Replicated Protocols NEXT Distributed Testing, Simulation, and Deterministic Replay: Trace Capture, Causality, and Event Logs