Failure Testing Replication Claims

LESSON

Consistency and Replication

021 30 min advanced

The core idea: A Jepsen-style test is not "chaos plus load." It is a controlled adversarial experiment that combines a concurrent workload, deliberate faults, and a history checker so a team can check whether the database's consistency claims still hold when the system is under stress.

Core Insight

Harbor Point can now watch the bond_reservations cluster as a sequence of engine states: WAL sync, archive continuity, replay progress, and restore readiness. That observability is necessary, but it still leaves one uncomfortable production question unanswered. If the API returns 201 Created for reservation R-88421 during a failover, can the team prove that the reservation exists exactly once afterward, or are they merely hoping the replication layer behaved?

A Jepsen-style test turns that question into an experiment. Harbor Point runs many concurrent reservation, cancellation, and read operations with unique request tokens, injects faults such as partitions, process kills, and clock jumps, and then reconstructs the full client-visible history. Instead of asking only "did the cluster stay up?", the team asks "could these observed results have happened in a system that really satisfies linearizable reads, at-most-once token application, and durable quorum commits?"

That is the mental shift. Failure testing at this level is about histories, not dashboards. A clean failover demo can still hide lost acknowledged writes, stale reads that violate a "strong read" promise, or duplicate application of the same client token. The database may look healthy at the node level and still be wrong at the contract level. Jepsen-style validation matters because it forces Harbor Point to define the contract precisely enough that a checker can reject impossible histories.

Why Ordinary Drills Are Not Enough

Harbor Point is preparing the reservation system for a wider rollout in which traders in Madrid and New York will hit the same logical cluster, replicas will serve some reads, and automated failover will move authority when a node or link fails. The business risk is not only downtime. A more dangerous outcome is silent divergence: one trader sees a reservation confirmed, risk accounting misses it for thirty seconds, or the same request token gets applied twice after a timeout and retry. Those are correctness failures, not performance regressions.

Traditional staging tests rarely catch that class of bug. A load test shows throughput. A failover drill shows whether the service comes back. A chaos experiment might show that alerts fire. None of those shows, on its own, that the observed sequence of successes, failures, timeouts, and reads is compatible with the guarantees the database advertises. If Harbor Point claims linearizable confirmations for reservations, then the test harness has to check linearizability. If the team claims idempotent retries, the checker has to confirm that no client token became two committed reservations in the recorded history.

Once the team starts working this way, architectural discussions get sharper. "We think leader leases are safe" becomes "under a 400 ms clock jump and a network partition, does the old leader still acknowledge writes that later vanish?" "Retries are fine" becomes "can we detect whether a timeout hid a committed write before the client replays the token?" The value of Jepsen-style validation is not that it breaks systems for entertainment. It converts vague confidence into falsifiable evidence before the next release or topology change reaches production.

Start From the Claim, Not the Fault

Harbor Point does not need a generic "database torture test." It needs evidence about a few concrete promises. For the reservation API, the high-value claims are specific: if POST /reservations succeeds, that reservation should survive leader failover; a request token retried after a timeout should apply at most once; and a strong read issued after success should never return a state older than the acknowledged write. Those are not infrastructure metrics. They are observable contracts between the storage system and the business.

That means the first design step in a Jepsen-style test is choosing operations and invariants that match the contract. Harbor Point's workload might include reserve, release, get_reservation, and issuer_exposure. The checker then asks questions such as: did any acknowledged reservation disappear without a matching release, did exposure ever exceed the configured issuer limit, and did one client token produce two distinct reservation rows? If the system claims linearizable semantics for the strong-read path, the history must be explainable as if each operation took effect atomically at a single instant between its invocation and its completion.
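Two of those checker questions can be sketched in a few lines. The following is a minimal illustration, not Harbor Point's actual checker: the `Op` record, the `check_invariants` function, and the shape of `final_rows` (pairs of reservation id and token read back after the cluster has healed) are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Op:
    """One completed client operation (hypothetical record shape)."""
    kind: str                             # "reserve" or "release"
    token: str                            # client request token
    outcome: str                          # "ok", "fail", or "info" (unknown)
    reservation_id: Optional[str] = None  # id returned or targeted, if known


def check_invariants(history, final_rows):
    """Return a list of violations found in the history.

    final_rows is assumed to be a list of (reservation_id, token) pairs
    read from the cluster after faults are healed and clients have stopped.
    """
    violations = []

    # Invariant 1: one client token never maps to two reservation rows.
    rows_by_token = defaultdict(set)
    for rid, token in final_rows:
        rows_by_token[token].add(rid)
    for token, rids in rows_by_token.items():
        if len(rids) > 1:
            violations.append(("duplicate-token", token, sorted(rids)))

    # Invariant 2: an acknowledged reservation survives unless released.
    # A release with an unknown outcome may have committed, so it also
    # counts as a possible explanation for a missing row.
    present = {rid for rid, _ in final_rows}
    maybe_released = {
        op.reservation_id
        for op in history
        if op.kind == "release" and op.outcome in ("ok", "info")
    }
    for op in history:
        if op.kind == "reserve" and op.outcome == "ok":
            rid = op.reservation_id
            if rid not in present and rid not in maybe_released:
                violations.append(("lost-ack", op.token, rid))

    return violations
```

The checker is deliberately conservative about uncertainty: it flags a missing reservation only when no release, acknowledged or indeterminate, could account for it.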

This is where many teams go wrong. They jump directly to partitions and node kills because those are easy to imagine, but they have not said what failure means. A partition is only interesting relative to a claim. If Harbor Point promises only eventual convergence for an analytics view, then stale reads during a partition may be acceptable. If the trading confirmation path promises linearizable success, the same stale read is a release blocker. The workload and checker therefore come before the nemesis.

The trade-off is scoping. A narrow checker is easier to trust and maintain, but it may miss classes of anomalies outside the modeled operations. A checker that tries to prove every property of the entire SQL surface is usually too ambitious to finish. Good Jepsen-style work chooses the smallest set of operations that still covers the business-critical invariants.

The Harness Is a Pipeline

Once Harbor Point knows what it wants to prove, it can build the experiment. Many clients issue reservation operations concurrently, each with a unique token and enough metadata to reconstruct what the client thought happened. A separate nemesis component perturbs the cluster: it severs links between nodes, pauses a leader process, kills and restarts a replica, adds clock skew to nodes that rely on leases, or injects packet delay during elections. The goal is not random destruction. The goal is to expose the exact timing windows where a bad implementation lies about commit status or serves reads from the wrong authority.

The experiment only works if it preserves the client-visible history. Every operation should be recorded as an invoke, then later as ok, fail, or info if the client timed out and could not tell what happened. That distinction matters. A timeout is not proof of failure; it is proof of uncertainty. Harbor Point needs request tokens precisely so the checker can look back later and determine whether a timed-out reserve actually committed before the connection broke.
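A minimal sketch of that recording discipline follows. The `History` class, the `record_call` helper, and the `Rejected` exception are hypothetical names invented for this example; the assumption is that the client library can tell a definite rejection apart from every other error, and that anything else is treated as uncertain.

```python
import itertools
import threading
import time


class Rejected(Exception):
    """Hypothetical error: the server definitely did not apply the operation."""


class History:
    """Thread-safe, append-only log of client-visible events."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ids = itertools.count()
        self.events = []

    def record(self, type_, process, f, token, value=None, op_id=None):
        # The timestamp is taken under the lock so list order matches time order.
        with self._lock:
            if op_id is None:
                op_id = next(self._ids)
            self.events.append({
                "type": type_, "op": op_id, "process": process,
                "f": f, "token": token, "value": value,
                "time": time.monotonic(),
            })
            return op_id


def record_call(history, process, f, token, call):
    """Run call() and log invoke plus exactly one of ok, fail, or info."""
    op_id = history.record("invoke", process, f, token)
    try:
        value = call()
    except Rejected as exc:
        # Definite failure: the operation is known not to have happened.
        history.record("fail", process, f, token, str(exc), op_id)
    except Exception as exc:
        # Timeouts, resets, and unknown errors are uncertainty, not failure.
        history.record("info", process, f, token, repr(exc), op_id)
    else:
        history.record("ok", process, f, token, value, op_id)
```

Defaulting unknown errors to info rather than fail is the conservative choice: wrongly labelling a committed write as failed would make the checker report anomalies that are really harness bugs, or hide real ones.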

For this lesson's recurring scenario, the flow looks like this:

clients issue reserve/release/read operations
                |
                v
        Harbor Point cluster
                ^
                |
nemesis injects partitions, pauses, kills, clock skew
                |
                v
     event history + follow-up observations
                |
                v
 checker asks whether the history fits the claimed model

That pipeline explains why Jepsen-style validation is more demanding than ordinary chaos engineering. Chaos tooling often stops at "the service degraded and then recovered." A correctness harness keeps going until it can classify the outcome. Did the old leader acknowledge an uncommitted reservation during a lease violation? Did a read observe a state impossible under a single global order? Did the retry path create a duplicate because the system could not reconcile a timeout with the durable record? The trade-off is complexity: the harness needs good client instrumentation, realistic faults, and a checker the team understands well enough to trust.
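To make the last stage of the pipeline concrete, here is a brute-force linearizability check for a single register, offered as a teaching sketch under stated assumptions rather than a production checker. The register models the state of one reservation ("absent" or "reserved"); `RegisterOp` and `linearizable` are hypothetical names, and an operation whose outcome is unknown is represented by an infinite completion time. The search is exponential, so it only suits tiny histories.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterOp:
    f: str                        # "write" or "read"
    value: str                    # value written, or value the read returned
    invoke: float                 # time the client sent the request
    complete: float = math.inf    # time the reply arrived; inf means "info"


def linearizable(ops, initial):
    """True if some legal total order explains the history.

    An order is legal when it respects real time (an op that completed
    before another was invoked comes first) and every read returns the
    most recently written value.
    """
    def search(remaining, state):
        # Ops with unknown outcome may never have taken effect, so a
        # history is explained once only such ops are left.
        if all(op.complete == math.inf for op in remaining):
            return True
        earliest_done = min(op.complete for op in remaining)
        for i, op in enumerate(remaining):
            if op.invoke > earliest_done:
                continue  # another op finished before this one began
            if op.f == "read" and op.value != state:
                continue  # this read cannot be placed here
            next_state = op.value if op.f == "write" else state
            if search(remaining[:i] + remaining[i + 1:], next_state):
                return True
        return False

    return search(tuple(ops), initial)


# An acknowledged reserve followed by a later read that misses it.
lost_write = [
    RegisterOp("write", "reserved", invoke=0, complete=1),
    RegisterOp("read", "absent", invoke=2, complete=3),
]
# The same history, except the reserve timed out (outcome unknown).
timed_out = [
    RegisterOp("write", "reserved", invoke=0),
    RegisterOp("read", "absent", invoke=2, complete=3),
]
```

Under this model `linearizable(lost_write, "absent")` is false, because no single order can place an acknowledged write before a read that does not see it, while `linearizable(timed_out, "absent")` is true, because the timed-out write may never have taken effect.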

Turn Anomalies Into Design Decisions

Suppose Harbor Point runs a test where three nodes host the reservation range. The nemesis partitions the current leader away from the other two nodes, then adds a clock jump on the isolated leader. During the split, trader clients keep sending reserve requests. Some receive success from the isolated leader. Others retry through a different node once failover completes. When the partition heals, the checker finds an impossible history: reservation R-88421 was acknowledged to a client, but no later linearizable read can find it, while a second request with the same token produced a different row on the new leader.

That result is valuable because it points to a concrete engineering decision. Harbor Point might need stronger fencing on old leaders, a safer lease implementation, or a retry protocol that first resolves token status before reissuing work. The bug is no longer "failover seems flaky." It is "under partition plus clock skew, the cluster can acknowledge a reservation that did not survive quorum and can later apply the same token twice." That is specific enough to drive a fix, a regression test, and a go/no-go call for deployment.
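The retry option can be illustrated with a small sketch. Everything here is hypothetical: `FakeReservationStore` is an in-memory stand-in for the cluster, `lookup_token` stands for whatever strong-read endpoint exposes token status, and `reserve_with_retry` is one possible client policy, not Harbor Point's implementation.

```python
class FakeReservationStore:
    """In-memory stand-in for the cluster (illustration only)."""

    def __init__(self):
        self.by_token = {}            # token -> reservation id
        self.drop_next_reply = False  # simulate "committed but reply lost"

    def reserve(self, token):
        # Server-side deduplication: a token is applied at most once.
        if token not in self.by_token:
            self.by_token[token] = f"R-{len(self.by_token) + 1}"
        rid = self.by_token[token]
        if self.drop_next_reply:
            self.drop_next_reply = False
            raise TimeoutError("reply lost after commit")
        return rid

    def lookup_token(self, token):
        # Assumed to be a strong read; a stale read here could miss a
        # committed token and defeat the protocol.
        return self.by_token.get(token)


def reserve_with_retry(store, token, attempts=3):
    """Reserve once per token, resolving uncertainty before retrying."""
    for _ in range(attempts):
        try:
            return store.reserve(token)
        except TimeoutError:
            # A timeout is uncertainty: ask what happened to this token
            # before sending the request again.
            existing = store.lookup_token(token)
            if existing is not None:
                return existing
    raise TimeoutError(f"outcome of token {token!r} still unknown")
```

The sketch relies on two layers: the store refuses to apply a token twice, and the client checks token status before replaying. A Jepsen-style run is what would show whether either layer actually holds during failover.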

The same logic prevents overreaction. Not every anomaly is a database bug. Sometimes the checker reveals that the product contract was vague. If the API returns a timeout without a durable request identifier, then neither the database nor the client can distinguish "did not commit" from "committed but reply lost." Jepsen-style testing often forces improvements above the engine layer: clearer idempotency keys, explicit strong-read endpoints, or better observability around uncertain outcomes.

This is also the bridge into the design-review lessons that follow. A replication plan is not ready because the diagram looks plausible. It is ready when the team can name its core invariants, build adversarial tests for them, and use the outcomes to accept or reject the architecture.

Operational Failure Modes

The test run shows lots of timeouts, but the checker does not report any anomalies.

A timeout-heavy run can still hide correctness problems if the harness cannot resolve what happened after uncertainty. Without request tokens, read-after-timeout probes, or enough state captured in the history, the checker may only learn that clients were confused.

Make uncertain outcomes first-class data. Persist client tokens, record invocation and completion times, and add follow-up reads that can determine whether timed-out operations committed.
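A follow-up probe of that kind might look like the sketch below. The event dictionaries and the `lookup_token` callable are assumptions for illustration; the probe is only meaningful if it runs after faults are healed and clients have stopped, and if the lookup is a strong read.

```python
def resolve_uncertain(history, lookup_token):
    """Annotate each info event with what a follow-up read found.

    history: list of dicts with at least "type" and "token" keys.
    lookup_token: callable returning a reservation id, or None if the
    token left no durable record.
    """
    resolved = []
    for event in history:
        if event["type"] != "info":
            resolved.append(event)  # definite outcomes pass through
            continue
        rid = lookup_token(event["token"])
        resolved.append({
            **event,
            "resolved": "committed" if rid is not None else "not-found",
            "reservation_id": rid,
        })
    return resolved
```

A "not-found" result is weaker than "committed": it rules out a commit only when no delayed request can still arrive, which is why the probe belongs at the end of the run.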

The harness finds anomalies that the database team dismisses as "not production-realistic."

Some fault combinations are unrealistic, but many impossible traces reveal hidden assumptions about clocks, leases, or retry semantics. Teams often reject the test before mapping it back to the claim they made.

Compare the nemesis to real operational risks. If Harbor Point depends on time-based leader leases, clock skew belongs in scope. If the system advertises automatic failover, partitions and delayed packets during elections are in scope. Reject only faults that are truly outside the design envelope.

A test passes once, and the team treats the result as permanent proof.

Concurrency bugs are statistical and configuration-sensitive. A passing run may simply have missed the dangerous interleaving, and a minor version upgrade or timeout change can reopen the same class of bug.

Run the suite repeatedly, vary the schedule and load, and keep it in the release process for topology or configuration changes. Jepsen-style validation is a regression asset, not a one-time ceremony.
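One way to vary the schedule while keeping failures reproducible is to derive each run's fault plan from a seed. The sketch below is an assumption-laden illustration: the fault names, the timing bounds, and the `nemesis_schedule` function are invented for the example and do not correspond to any particular tool.

```python
import random

# Hypothetical fault names, mirroring the faults discussed in this lesson.
FAULTS = (
    "partition-leader",
    "pause-leader",
    "kill-replica",
    "clock-jump",
    "packet-delay",
)


def nemesis_schedule(seed, duration_s=60.0, max_gap_s=8.0, max_fault_s=5.0):
    """Return a list of (start_s, fault, hold_s) tuples for one run.

    The same seed always yields the same plan, so a failing run can be
    replayed; a different seed explores a different interleaving.
    """
    rng = random.Random(seed)  # private generator, isolated from global state
    t, plan = 0.0, []
    while True:
        t += rng.uniform(1.0, max_gap_s)       # quiet period before the fault
        hold = rng.uniform(0.5, max_fault_s)   # how long the fault is held
        if t + hold > duration_s:
            break                              # fault would outlive the run
        plan.append((t, rng.choice(FAULTS), hold))
        t += hold                              # faults do not overlap
    return plan
```

Logging the seed next to each run's history means a suspicious result can be re-run with the same fault timing, while a nightly job that iterates over many seeds covers interleavings a single run would miss.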

Key Takeaways

  1. A Jepsen-style test starts with a claim, not with random breakage; the workload and checker must match the promise the service makes.
  2. Client-visible history is the ground truth for correctness because partitions, crashes, and timeouts matter only when they produce histories the claimed model should forbid.
  3. Useful anomalies change engineering decisions: they lead to tighter lease rules, safer retry contracts, better fencing, or a rejected release.