Partition-Time Guarantees: CAP and PACELC
LESSON
The core idea: CAP names the choice a replicated service must make during a network partition, while PACELC reminds us that even healthy networks still force a latency-versus-consistency trade-off.
Core Insight
Imagine a ticketing service with one remaining seat for a popular show. The service has replicas in Madrid and Virginia so buyers near each region get fast responses. Under normal conditions, the replicas coordinate before promising the seat to anyone. Then the transatlantic link fails while both regions are still receiving checkout traffic.
At that moment, "the database is distributed" stops being an implementation detail and becomes a product decision. If both regions keep accepting purchases independently, the service may sell the same seat twice. If one region refuses or delays checkout until coordination is restored, some real buyers get an error or wait even though a local server is still alive.
CAP is the name for that partition-time pressure. It is not a permanent personality label for a database, and it is not really "pick any two" in the casual sense. In a real distributed system, partitions can happen whether the team likes them or not. The practical question is what the service promises when replicas cannot reliably communicate.
PACELC adds the part CAP leaves out. If there is a partition, the service chooses between availability and consistency. Else, when the network is healthy, it still chooses between lower latency and stronger consistency. That second choice is where many everyday architecture decisions actually live.
Partition as a Product Decision
The ticketing service has an invariant:
one physical seat must map to at most one successful purchase
That invariant is stronger than "replicas should converge eventually." If Madrid and Virginia both accept the final-seat purchase during a partition, reconciliation can notice the conflict later, but it cannot make both customers happy without compensation.
So the team has to decide which behavior is acceptable while the link is broken:
Madrid replica      partition      Virginia replica
buy final seat  X---------------X  buy final seat
One design chooses consistency over availability for checkout. It may route final-seat writes through a quorum, a single leader, or a lease holder. If the service cannot prove that the write is still authorized, it rejects or delays the operation. Some requests fail, but the seat does not get double-sold.
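A minimal sketch of the consistency-first choice, assuming a simple majority rule. The `SeatLedger` class and the `AuthorityUnclear` exception are hypothetical names used only for illustration, and the sketch leaves out everything a real quorum protocol needs beyond the accept-or-refuse decision. It also assumes a third voting replica: with only Madrid and Virginia, neither side can reach a majority during the partition, so both would have to refuse.

```python
from dataclasses import dataclass, field


class AuthorityUnclear(Exception):
    """Raised when a write cannot be proven safe to accept (hypothetical)."""


@dataclass
class SeatLedger:
    """Hypothetical per-replica view of a replicated seat ledger."""
    total_replicas: int
    sold: dict = field(default_factory=dict)  # seat_id -> buyer

    def purchase(self, seat_id: str, buyer: str, reachable_replicas: int) -> str:
        # reachable_replicas counts this replica plus the peers that answered.
        majority = self.total_replicas // 2 + 1
        if reachable_replicas < majority:
            # Consistency over availability: refuse rather than risk a double sale.
            raise AuthorityUnclear(
                f"only {reachable_replicas}/{self.total_replicas} replicas reachable"
            )
        if seat_id in self.sold:
            return f"rejected: {seat_id} already sold"
        self.sold[seat_id] = buyer
        return f"confirmed: {seat_id} -> {buyer}"


ledger = SeatLedger(total_replicas=3)
print(ledger.purchase("A1", "ana", reachable_replicas=2))  # majority side proceeds
try:
    ledger.purchase("A2", "bob", reachable_replicas=1)     # isolated side refuses
except AuthorityUnclear as exc:
    print("checkout unavailable:", exc)
```

The isolated replica is still running and could answer, which is exactly the availability it gives up to keep the seat from being sold twice.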
Another design chooses availability over immediate consistency. Both regions can accept local writes, then reconcile conflicts later. That can be reasonable for a shopping cart, a "like" counter, or a wishlist. It is dangerous for the final purchase unless the business has a clear compensation policy.
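The difference between data that reconciles cleanly and data that does not can be sketched as follows. Both function names are hypothetical, the wishlist is assumed to be add-only (removals would need extra bookkeeping), and keeping Madrid's buyer on a conflict is an arbitrary choice made only so the example has a result.

```python
def merge_wishlists(madrid: set, virginia: set) -> set:
    # Add-only wishlists merge by union; the merge order does not matter.
    return madrid | virginia


def reconcile_seat_sales(madrid: dict, virginia: dict) -> tuple:
    """Return (merged sales, conflicts that need a compensation policy)."""
    merged = dict(madrid)
    conflicts = []
    for seat, buyer in virginia.items():
        if seat in merged and merged[seat] != buyer:
            # Both regions sold the same seat: detection is possible, repair is not.
            conflicts.append((seat, merged[seat], buyer))
        else:
            merged[seat] = buyer
    return merged, conflicts


print(merge_wishlists({"opera"}, {"ballet"}))
merged, conflicts = reconcile_seat_sales({"A1": "ana"}, {"A1": "bob", "A2": "cy"})
print(merged)
print(conflicts)
```

The wishlist merge loses nothing, while the seat merge can only report that one of the two buyers has to be compensated.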
The important habit is to attach CAP to a specific operation. A single product can make different choices for different APIs: checkout may stop under partition, while browsing, recommendations, and saved searches keep serving stale or local data.
CAP in Precise Terms
In the CAP framing, the three words are narrower than their everyday meanings:
- Consistency means a strong single-copy story (linearizability): reads and writes behave as if there were one correct copy of the data.
- Availability means every request to a non-failed node eventually receives a non-error response, with no promise that the response reflects the latest write.
- Partition tolerance means the system has a defined behavior even when messages between nodes are lost or delayed.
The common mistake is treating partition tolerance as a feature toggle. In practice, if the system spans more than one failure domain, the network can delay, drop, or reorder messages. The architecture can ignore that possibility, but the production system still has to live through it.
During a partition, a replica cannot know whether a missing message means:
- the other side accepted a conflicting write
- the other side is slow
- the other side is isolated
- the local side is the isolated one
That uncertainty is why CAP bites. A replica that keeps answering every request may be available, but it cannot also guarantee the same strong single-copy story for every operation. A replica that protects the single-copy story must sometimes refuse to answer.
PACELC: The Normal-Case Trade-Off
CAP is essential, but most days the network is not fully partitioned. Requests still cross regions, leaders still wait for followers, and quorums still add latency. PACELC captures that broader shape:
if Partition: choose Availability or Consistency
Else: choose Latency or Consistency
For the ticketing service, the healthy-network questions might be:
- Should checkout wait for cross-region confirmation before returning success?
- Should reads of remaining inventory go to a local replica even if it may be slightly stale?
- Should the app show "almost sold out" from a cached projection while the purchase path uses stricter coordination?
Those choices are not just performance tuning. They define what users are allowed to observe. A low-latency local read may show a seat as available after another region has already sold it. A strongly consistent read may avoid that surprise, but it can cost extra round trips and become more sensitive to slow replicas.
PACELC keeps the design discussion honest because it prevents a team from saying "we are CP" and stopping there. A system can reject writes during partitions and still make many different latency-versus-consistency choices when the network is healthy.
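The healthy-network cost of coordination can be made concrete with a toy model: a write that waits for k acknowledgements finishes when the k-th fastest one arrives. The function name and every latency figure below are assumptions chosen for illustration, not measurements.

```python
def quorum_latency_ms(ack_times_ms: list, quorum: int) -> float:
    """Latency of a write that waits for `quorum` acknowledgements."""
    if not 1 <= quorum <= len(ack_times_ms):
        raise ValueError("quorum must be between 1 and the number of replicas")
    # The write completes when the quorum-th fastest acknowledgement arrives.
    return sorted(ack_times_ms)[quorum - 1]


# Hypothetical ack times: the local replica, then two remote replicas.
healthy = [1.0, 85.0, 90.0]
one_slow = [1.0, 85.0, 400.0]

print(quorum_latency_ms(healthy, 1))   # local-only acknowledgement
print(quorum_latency_ms(healthy, 2))   # majority pays a cross-region trip
print(quorum_latency_ms(one_slow, 2))  # majority hides the slow replica
print(quorum_latency_ms(one_slow, 3))  # waiting for all replicas does not
```

Under these assumed numbers, moving from one acknowledgement to a majority adds a cross-region round trip to every write, and requiring all replicas ties every write to the slowest one.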
Designing the Guarantee
A useful design review starts with a small guarantee table instead of a slogan:
Operation            Partition behavior              Healthy-network behavior
-------------------  ------------------------------  -----------------------------
final purchase       reject if authority is unclear  wait for authorized write
inventory display    serve stale with clear budget   prefer local low-latency read
wishlist update      accept locally and reconcile    local write, async replicate
refund initiation    require durable coordination    wait for workflow record
This table does three things.
First, it separates operations by business invariant. Selling the final seat has a different correctness budget from updating a wishlist. Second, it names the user-visible behavior during failure. Third, it records where the team is spending latency to buy stronger consistency.
The weakest guarantee that still protects the invariant is usually the best one, because every stronger promise is paid for in latency or availability. Strong consistency is valuable when it protects an invariant the business cannot repair cheaply. Availability and low latency are valuable when users benefit more from progress than from an immediate global truth. The skill is not choosing one word for the whole system; the skill is assigning the right promise to each boundary.
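One way to keep such a guarantee table from drifting away from the code is to encode the partition column as data. This is a sketch under assumptions: the enum, the operation keys, and the response strings are all hypothetical, and mapping refund initiation to a rejection is one reading of "require durable coordination".

```python
from enum import Enum


class OnPartition(Enum):
    REJECT = "reject"
    SERVE_STALE = "serve_stale"
    ACCEPT_AND_RECONCILE = "accept_and_reconcile"


# Partition-time policy per operation, mirroring the guarantee table.
GUARANTEES = {
    "final_purchase": OnPartition.REJECT,
    "inventory_display": OnPartition.SERVE_STALE,
    "wishlist_update": OnPartition.ACCEPT_AND_RECONCILE,
    "refund_initiation": OnPartition.REJECT,  # assumed reading of the table
}


def respond(operation: str, partitioned: bool) -> str:
    """Return the user-visible outcome for an operation."""
    if not partitioned:
        return "serve normally"
    policy = GUARANTEES[operation]
    if policy is OnPartition.REJECT:
        return "error: try again later"
    if policy is OnPartition.SERVE_STALE:
        return "serve local copy, marked possibly stale"
    return "accept locally, queue for reconciliation"


for op in GUARANTEES:
    print(f"{op}: {respond(op, partitioned=True)}")
```

Looking up an operation that has no entry raises a `KeyError`, which forces a new API to be given an explicit partition policy before it can ship.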
Failure Modes
Mistaking CAP for a database label. A product page saying "CP" or "AP" hides the operation-level decision. Ask what a specific API does when replicas cannot communicate.
Calling every stale read a CAP problem. CAP is about partitions. Stale reads during healthy operation often belong to the PACELC side of the discussion: the team chose latency, locality, caching, or asynchronous replication over a stronger read guarantee.
Assuming reconciliation fixes every conflict. Reconciliation can merge counters, refresh projections, or repair caches. It cannot magically undo a double-booked seat, an irreversible payment, or a violated legal workflow without a compensating process.
Ignoring the client contract. Users do not experience "CAP"; they experience success, failure, waiting, stale state, duplicate confirmations, and later corrections. The architecture should name those outcomes explicitly.
Resources
- [PAPER] Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services
  - Focus: Use this as the formal anchor for the CAP result and its assumptions.
- [ARTICLE] CAP Twelve Years Later: How the "Rules" Have Changed
  - Focus: Notice how Brewer reframes CAP as a nuanced design discussion, not a slogan.
- [ARTICLE] Problems with CAP, and Yahoo's Little Known NoSQL System
  - Focus: Read for the PACELC framing and the normal-case latency/consistency trade-off.
- [BOOK] Designing Data-Intensive Applications
  - Focus: Review the chapters on replication and consistency models for practical examples.
Key Takeaways
- CAP is about partition-time behavior: when replicas cannot communicate, a strongly consistent operation cannot also remain fully available everywhere.
- Partition tolerance is not the meaningful knob in real distributed systems; the meaningful choice is what each operation does when communication fails.
- PACELC extends the question to normal operation, where stronger consistency usually costs latency even without a partition.
- Good designs assign guarantees per API boundary, not as one vague label for the whole database or product.