CAP, PACELC, and Partition-Time Behavior
LESSON
Core Insight
At 10:00, a ticketing service has one seat left for a concert. It operates checkout in Madrid and Frankfurt so buyers in both places get a fast response. Both regions know that seat-19 is available. Then the private network link between them fails. The computers in each region are healthy, but each has lost the evidence needed to know what the other region is doing.
At 10:01, Ana buys the seat through Madrid and Leo buys it through Frankfurt. A tempting response is to keep both checkouts working and reconcile later. That works for some kinds of state. It does not work for one scarce seat: after both buyers receive a confirmation, no repair can make both confirmations true. A refund may compensate one buyer, but it does not preserve the promise the service made.
This is the useful pressure behind CAP. When replicas cannot communicate reliably, an operation that needs one official answer has two options: withhold confirmation (by waiting or by rejecting uncertain work), or accept local work and permit disagreement. CAP is not a personality label for an entire company, and it is not an invitation to choose any two letters on a slide. It is a way to make a specific promise explicit while a partition is happening.
PACELC adds a second scene. Even when the link is healthy, waiting for a remote replica takes time. A service can acknowledge a purchase after one local machine records it, or after enough remote machines agree. The first answer is usually faster; the second gives a stronger basis for saying the seat is truly reserved. Good designs explain both scenes: what the operation does when communication breaks, and what it pays for its normal-day guarantee.
Start With The Promise, Not The Acronym
Before applying CAP, name the state and the promise that users rely on. In the ticketing service, the relevant state is not the whole database. It is the availability of one seat for one performance.
state: concert-9 / seat-19
actors: Madrid checkout, Frankfurt checkout, inventory replicas
user promise: a confirmed purchase reserves this seat for one buyer
failure: the regional link cannot carry messages
success: no two buyers receive a valid confirmation for the same seat
The visible pieces have different jobs. A checkout receives a request and explains the result to a buyer. An inventory replica holds a copy of seat availability. A coordination path carries evidence that an accepted purchase is now official. The network link is not merely a performance detail: it is how one region learns that the other region has changed the same fact.
While the link works, the system can arrange for one decision to become authoritative before it confirms the purchase. A leader, a quorum of replicas, or an external reservation owner can play that role. The implementation differs, but the question is the same: what evidence lets this checkout safely say “the seat is yours”?
During a partition, a region cannot obtain new evidence from the other side. It may still have a local copy that says available, but that copy cannot prove that the remote region did not just accept the same seat. The design must decide whether a local copy is sufficient for this operation.
Worked Trace: One Seat, Two Isolated Checkouts
Suppose the healthy system normally requires both regional inventory replicas to record a reservation before a checkout reports success. The exact quorum could be larger in a real deployment; two replicas make the trade-off easy to inspect.
before the partition
Madrid replica: seat-19 = available, version 41
Frankfurt replica: seat-19 = available, version 41
rule: confirm only after Madrid and Frankfurt record version 42
The link fails just before either buyer acts.
  Madrid  ---- X ----  Frankfurt
   Ana                    Leo
buy seat-19           buy seat-19
Path A: Protect One Official Reservation
Madrid receives Ana's request and creates a proposed reservation. It can write that proposal locally, but the normal rule needs an acknowledgement from Frankfurt before the proposal becomes official.
Madrid:
start state: available at version 41
local action: propose reservation for Ana at version 42
required: Frankfurt acknowledgement
result: no acknowledgement arrives before the deadline
response: purchase pending or unavailable; do not confirm
Frankfurt follows the same rule for Leo. Both checkouts remain alive, but neither can safely issue the final confirmation. The buyer sees an error, a retry instruction, or a short-lived pending state. That is painful, yet it preserves the stronger promise: the system never creates two official holders for the same seat.
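Path A can be sketched in a few lines of Python. This is a minimal illustration, not a production protocol: the `Replica` class, the `strict_reserve` function, and the `link_up` flag are hypothetical names, the local proposal write is omitted, and the healthy-link branch assumes the check and both writes happen as one step with no concurrent request.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Replica:
    region: str
    version: int = 41
    holder: Optional[str] = None  # None means seat-19 is still available


def strict_reserve(local: Replica, remote: Replica, link_up: bool, buyer: str) -> str:
    """Confirm only when both replicas can record the next version."""
    if local.holder is not None:
        return "unavailable"
    if not link_up:
        # The remote acknowledgement cannot arrive, so nothing becomes official.
        return "pending"
    if remote.holder is not None:
        return "unavailable"
    next_version = local.version + 1
    for replica in (local, remote):
        replica.version = next_version
        replica.holder = buyer
    return "confirmed"


madrid, frankfurt = Replica("Madrid"), Replica("Frankfurt")

# During the partition neither checkout confirms, and the seat stays unheld.
ana_result = strict_reserve(madrid, frankfurt, link_up=False, buyer="Ana")
leo_result = strict_reserve(frankfurt, madrid, link_up=False, buyer="Leo")
print(ana_result, leo_result, madrid.holder, frankfurt.holder)

# Once the link returns, the first retry is confirmed and the second is refused.
print(strict_reserve(madrid, frankfurt, link_up=True, buyer="Ana"))
print(strict_reserve(frankfurt, madrid, link_up=True, buyer="Leo"))
```

The partition-time branch never writes a holder, which is the whole point of the path: the cost shows up as a "pending" answer rather than as a second confirmation.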
This path gives up availability in the CAP sense for that operation. CAP availability is precise: every request to a non-failing node eventually receives a non-error response. “Please try later” and “we cannot reserve that seat while inventory is uncertain” are deliberate non-success outcomes. They may be the right product behavior, but they are not available responses under the theorem's definition.
Path B: Keep Each Checkout Moving
Now change the rule: each region may confirm a reservation from its local copy and replicate it later.
Madrid:
available at version 41 -> confirmed for Ana locally
Frankfurt:
available at version 41 -> confirmed for Leo locally
after the link returns:
Madrid reports Ana's reservation
Frankfurt reports Leo's reservation
inventory capacity = 1, accepted reservations = 2
Both buyers got a fast, successful answer. The system was available to each local request, but it is now divergent: two replicas have made incompatible official claims. A repair job can choose a winner, issue a refund, offer a replacement, and alert support. It cannot undo the fact that the system told two people the same scarce seat was theirs.
The intermediate states matter. Neither region behaved irrationally: each acted on a copy that was valid when the partition began. The missing ingredient was communication, not a bad comparison function. No timestamp or later merge can recreate the knowledge that would have prevented both confirmations.
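The same trace can be replayed as a small Python sketch. The function names `local_reserve` and `reconcile` and the dictionary layout are assumptions made for illustration; the source does not prescribe a repair procedure.

```python
def local_reserve(replica: dict, buyer: str) -> str:
    """Confirm from the local copy alone, without contacting the other region."""
    if replica["holder"] is not None:
        return "unavailable"
    replica["holder"] = buyer
    replica["version"] += 1
    return "confirmed"


def reconcile(replicas: list, capacity: int = 1) -> dict:
    """After the link returns, count how many confirmations exceed capacity."""
    holders = sorted({r["holder"] for r in replicas if r["holder"] is not None})
    return {"accepted": holders, "oversold_by": max(0, len(holders) - capacity)}


madrid = {"region": "Madrid", "version": 41, "holder": None}
frankfurt = {"region": "Frankfurt", "version": 41, "holder": None}

# Each isolated region sees "available at version 41" and confirms locally.
results = [local_reserve(madrid, "Ana"), local_reserve(frankfurt, "Leo")]
report = reconcile([madrid, frankfurt])
print(results, report)

# Both replicas now claim version 42, but for different holders.
print(madrid["version"], frankfurt["version"])
```

In this sketch both replicas end at version 42 with different holders, so comparing version numbers alone would not even reveal the conflict. The repair step can only count the oversell; it cannot retract either confirmation.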
What CAP Actually Names
The formal vocabulary is easiest to use after seeing the trace.
Consistency in the CAP theorem means that an operation observes a single, current, externally coherent answer, commonly formalized as linearizability of reads and writes. For seat-19, a confirmed reservation should not coexist with a conflicting confirmation from another region. This is narrower than the everyday use of “consistency”; in particular, it is not the C in ACID, which refers to preserving application invariants across a transaction.
Availability means that a request arriving at a non-failing node receives a successful response without waiting forever for a failed or unreachable peer. It does not mean that the product is generally usable, that pages load, or that a response contains fresh data. In Path A, the checkout can respond promptly with a rejection; technically, that is still a loss of availability for the protected purchase operation.
Partition tolerance means the system has a defined behavior when messages between groups can be lost, delayed without a known bound, or delivered in only one direction. A deployment spread across machines and networks cannot safely assume that this never happens. The practical choice is therefore not “tolerate partitions or not.” It is whether, while a partition exists, this operation prefers a single official result or a successful local response.
The short decision rule is:
If two isolated sides both accept this operation,
can their results be merged without breaking the user promise?
yes -> local progress and later reconciliation may be acceptable
no -> require coordination, an owner, a quorum, or a refusal path
This is why a system can make different choices for different state. A cart can usually merge two independent “add item” operations. An analytics counter can often add counts later. A concert seat, a bank debit, a password change, or a configuration activation may need one authority because two accepted outcomes cannot both be honored.
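The decision rule can be made concrete by writing the merge function for each kind of state and noticing where one cannot be written. The following Python sketch uses hypothetical names (`merge_cart`, `merge_counter`, `merge_seat`, `MergeConflict`) and assumes the cart only supports adding items and the counter only increases.

```python
class MergeConflict(Exception):
    """Raised when two accepted outcomes cannot both be honored."""


def merge_cart(side_a: set, side_b: set) -> set:
    # Independent "add item" operations merge by set union.
    return side_a | side_b


def merge_counter(base: int, side_a: int, side_b: int) -> int:
    # Each side's increments since the shared base are added together.
    return base + (side_a - base) + (side_b - base)


def merge_seat(holder_a, holder_b):
    # One seat has one holder; two different holders have no valid merge.
    if holder_a is not None and holder_b is not None and holder_a != holder_b:
        raise MergeConflict(f"{holder_a} and {holder_b} both hold the seat")
    return holder_a if holder_a is not None else holder_b


print(merge_cart({"poster"}, {"t-shirt"}))
print(merge_counter(base=10, side_a=13, side_b=12))
print(merge_seat("Ana", None))
```

The cart and counter merges always return a value that honors both sides. The seat merge can only succeed when at most one side accepted a buyer, which is why that operation needs an authority before confirmation rather than a merge afterward.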
PACELC: The Price Before The Outage
CAP describes the exceptional moment when communication is unavailable. PACELC asks a useful follow-up: what does the service choose otherwise (the “else”), on the ordinary day when replicas can communicate?
For the ticketing service, compare two healthy-link paths.
stronger confirmation
  buyer -> Madrid -> Frankfurt acknowledgement -> Madrid confirms
    \-> inventory has one shared decision
lower-latency confirmation
  buyer -> Madrid confirms locally -> Frankfurt receives replication later
    \-> buyer waits less, but a remote read may not yet show the reservation
The first path waits for distance, network variation, and remote storage before returning success. It spends latency to make a stronger claim. The second path can feel much faster, but the meaning of “confirmed” must be narrower until replication catches up. If the service uses the second path for a scarce seat, it also needs a safe rule for the partition moment; otherwise normal-case speed becomes an unpriced risk.
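A rough latency model makes the healthy-link cost visible. The function name and every number below are hypothetical, chosen only to show the shape of the trade-off, and the model assumes the local write, the network round trip, and the remote write happen one after another.

```python
def confirm_latency_ms(
    local_write_ms: float,
    round_trip_ms: float,
    remote_write_ms: float,
    wait_for_remote: bool,
) -> float:
    """Time until the buyer sees "confirmed" under a simple sequential model."""
    if wait_for_remote:
        # Stronger confirmation: pay for distance and remote storage up front.
        return local_write_ms + round_trip_ms + remote_write_ms
    # Lower-latency confirmation: replication happens after the response.
    return local_write_ms


# Illustrative values only; real figures depend on the deployment.
strong = confirm_latency_ms(2.0, 30.0, 2.0, wait_for_remote=True)
fast = confirm_latency_ms(2.0, 30.0, 2.0, wait_for_remote=False)
print(strong, fast)
```

With these made-up inputs the stronger path takes 34.0 ms and the local path 2.0 ms. The gap is the price of the stronger claim, paid on every request even when no partition occurs.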
PACELC is shorthand for this paired review:
P: during a partition, choose between availability and consistency
E: else, when communication is healthy,
choose between lower latency and stronger consistency
It is not a law that every operation must sit permanently at one corner. A product can make checkout strict, product descriptions slightly stale, search results asynchronous, and carts mergeable. What matters is that the response semantics match the state. “Saved,” “reserved,” “published,” and “visible to everyone” are different promises and may deserve different paths.
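One way to keep per-operation choices explicit is a small policy table. The sketch below is an assumption about how such a table might look: the enum names, the `POLICY` entries, and the `pacelc_label` helper are illustrative, and the assignments for product descriptions and search results are one plausible reading of "slightly stale" and "asynchronous".

```python
from enum import Enum


class DuringPartition(Enum):
    CONSISTENCY = "refuse or wait"
    AVAILABILITY = "accept locally"


class Otherwise(Enum):
    CONSISTENCY = "wait for remote acknowledgement"
    LATENCY = "acknowledge locally"


# Hypothetical assignments; each product would decide its own.
POLICY = {
    "checkout": (DuringPartition.CONSISTENCY, Otherwise.CONSISTENCY),
    "cart": (DuringPartition.AVAILABILITY, Otherwise.LATENCY),
    "product_description": (DuringPartition.AVAILABILITY, Otherwise.LATENCY),
    "search_results": (DuringPartition.AVAILABILITY, Otherwise.LATENCY),
}


def pacelc_label(operation: str) -> str:
    """Render the two choices in the usual PACELC shorthand, e.g. PC/EC."""
    partition_choice, else_choice = POLICY[operation]
    return f"P{partition_choice.name[0]}/E{else_choice.name[0]}"


for name in POLICY:
    print(name, pacelc_label(name))
```

Keeping the table next to the code that serves each operation makes it harder for a default write path to silently decide the promise.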
Failure Modes That Hide The Trade-off
One failure is treating an acknowledgement as stronger than it is. A local replica can say “stored” truthfully while a remote replica has not seen the write. Calling that result “globally confirmed” turns a replication delay into a product lie. The API, user interface, and operational runbook should distinguish local acceptance, quorum confirmation, and asynchronous completion.
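The distinction between acknowledgement strengths can be carried in the response type itself. In this sketch the `AckLevel` enum and the `ack_level` function are hypothetical, a quorum is assumed to mean a strict majority of replicas, and "asynchronous completion" is interpreted as every replica having recorded the write.

```python
from enum import Enum


class AckLevel(Enum):
    LOCAL_ACCEPTED = "stored in this region"
    QUORUM_CONFIRMED = "confirmed by a majority of replicas"
    FULLY_REPLICATED = "recorded by every replica"


def ack_level(acks: int, replica_count: int) -> AckLevel:
    """Map the number of replicas that recorded a write to an honest label."""
    if not 1 <= acks <= replica_count:
        raise ValueError("acks must be between 1 and replica_count")
    if acks == replica_count:
        return AckLevel.FULLY_REPLICATED
    if acks > replica_count // 2:
        return AckLevel.QUORUM_CONFIRMED
    return AckLevel.LOCAL_ACCEPTED


# With three replicas, one acknowledgement is only local acceptance.
for acks in (1, 2, 3):
    level = ack_level(acks, replica_count=3)
    print(acks, level.name, "-", level.value)
```

A user interface that renders `LOCAL_ACCEPTED` as "saved in your region" rather than "confirmed" keeps the replication delay from turning into a false promise.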
Another failure is giving every record the same policy. A team might use a highly available replicated store for a cart and then reuse its default write path for inventory. The storage technology did not decide the product promise. The application still has to decide which operations may diverge, how conflicts are represented, and who repairs them.
Retries create a nearby but different problem. If Ana times out while Madrid is deciding, she may send the purchase again. An idempotency key lets the checkout recognize that the two requests represent one intended purchase. Idempotency prevents a single client from creating duplicates; it does not solve the cross-region question of whether Madrid and Frankfurt may both reserve the last seat. Both protections are needed on a real checkout path.
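A short sketch shows what an idempotency key does and does not protect. The `RegionalCheckout` class and the key strings are hypothetical, and each region is modeled as confirming from its local copy so that the two problems can be seen side by side.

```python
class RegionalCheckout:
    """A single region's checkout that remembers responses by idempotency key."""

    def __init__(self, region: str):
        self.region = region
        self.holder = None        # local view of who holds seat-19
        self._responses = {}      # idempotency key -> first response sent

    def purchase(self, idempotency_key: str, buyer: str) -> str:
        # A retry with a known key replays the original response.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        if self.holder is None:
            self.holder = buyer
            response = f"confirmed for {buyer}"
        else:
            response = "unavailable"
        self._responses[idempotency_key] = response
        return response


madrid = RegionalCheckout("Madrid")
frankfurt = RegionalCheckout("Frankfurt")

# Ana times out and retries with the same key: one purchase, same answer.
first = madrid.purchase("ana-request-1", "Ana")
retry = madrid.purchase("ana-request-1", "Ana")

# Leo uses the isolated Frankfurt checkout; the key store there knows nothing of Ana.
leo = frankfurt.purchase("leo-request-1", "Leo")
print(first, "|", retry, "|", leo)
```

The retry is handled correctly, yet both regions still confirm the seat. Deduplicating one client's requests and deciding which region owns the seat are separate mechanisms.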
Useful operational signals make the boundary visible before it becomes a customer incident:
- the rate and duration of inter-region link failures;
- the number of reservations awaiting remote acknowledgement;
- replica lag and the age of the oldest unreplicated reservation;
- conflicting reservation or compensation counts after repair; and
- the fraction of checkouts served in a degraded, no-confirmation mode.
An increasing compensation count is evidence that the system accepted work that could not be reconciled safely. A rising queue of unacknowledged reservations is evidence that the strict path is protecting the promise but may need a better degraded experience.
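Several of these signals can be derived from the reservation records themselves. The sketch below assumes a hypothetical `Reservation` record with a creation time, an optional remote acknowledgement time, and a flag for degraded handling; the field names and sample values are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Reservation:
    created_at: float            # seconds on any consistent clock
    acked_at: Optional[float]    # None until the remote replica acknowledges
    degraded: bool = False       # True if served in no-confirmation mode


def checkout_signals(reservations: List[Reservation], now: float) -> dict:
    """Summarize pending acknowledgements and degraded checkouts."""
    pending = [r for r in reservations if r.acked_at is None]
    oldest_age = max((now - r.created_at for r in pending), default=0.0)
    degraded = sum(1 for r in reservations if r.degraded)
    fraction = degraded / len(reservations) if reservations else 0.0
    return {
        "awaiting_ack": len(pending),
        "oldest_unacked_age_s": oldest_age,
        "degraded_fraction": fraction,
    }


sample = [
    Reservation(created_at=100.0, acked_at=100.2),
    Reservation(created_at=110.0, acked_at=None, degraded=True),
    Reservation(created_at=118.0, acked_at=None, degraded=True),
    Reservation(created_at=119.0, acked_at=119.1),
]
signals = checkout_signals(sample, now=120.0)
print(signals)
```

Alert thresholds on values like these are a product decision; the sketch only shows that the inputs are cheap to collect.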
Design Check
Close this lesson and choose one operation: a seat purchase, a cart addition, a profile edit, an inventory decrement, a password reset, or a notification counter. Without looking back, write:
state being changed:
user-visible promise:
two isolated requests that could conflict:
can both accepted outcomes be merged safely? why?
partition-time response:
normal-day acknowledgement rule:
what a response such as "saved" means:
signal that tells operators the rule is under pressure:
If the two outcomes cannot both be true, identify the authority that decides: one owner, a quorum, a reservation service, or a deliberate refusal. If they can be merged, name the merge rule rather than merely calling the system “eventually consistent.” That answer is the design, not the acronym.
Resources
- [ARTICLE] CAP Twelve Years Later: How the “Rules” Have Changed
  - Focus: Eric Brewer's operational clarification of what CAP says during a partition and why application-level choices matter.
- [PAPER] Consistency Tradeoffs in Modern Distributed Database System Design
  - Focus: The PACELC model and the normal-case latency versus consistency trade-off.
- [BOOK] Designing Data-Intensive Applications
  - Focus: Replication, linearizability, fault tolerance, and how guarantees affect user-visible behavior.
Key Takeaways
- CAP is an operation-level question: during a partition, choose whether a protected promise or a successful local response takes priority.
- A local replica can be healthy and still lack the evidence needed to safely confirm a scarce resource.
- Reconciliation is only safe when the domain has a real merge or compensation story; it cannot make two incompatible confirmations both true.
- PACELC keeps the normal day visible: stronger confirmation often costs latency even before a partition occurs.