Partition-Time Guarantees: CAP and PACELC

The core idea: CAP names the choice a replicated service must make during a network partition, while PACELC reminds us that even healthy networks still force a latency-versus-consistency trade-off.

Core Insight

Imagine a ticketing service with one remaining seat for a popular show. The service has replicas in Madrid and Virginia so buyers near each region get fast responses. Under normal conditions, the replicas coordinate before promising the seat to anyone. Then the transatlantic link fails while both regions are still receiving checkout traffic.

At that moment, "the database is distributed" stops being an implementation detail and becomes a product decision. If both regions keep accepting purchases independently, the service may sell the same seat twice. If one region refuses or delays checkout until coordination is restored, some real buyers get an error or wait even though a local server is still alive.

CAP is the name for that partition-time pressure. It is not a permanent personality label for a database, and it is not really "pick any two" in the casual sense. In a real distributed system, partitions can happen whether the team likes them or not. The practical question is what the service promises when replicas cannot reliably communicate.

PACELC adds the part CAP leaves out. If there is a partition, the service chooses between availability and consistency. Else, when the network is healthy, it still chooses between lower latency and stronger consistency. That second choice is where many everyday architecture decisions actually live.

Partition as a Product Decision

The ticketing service has an invariant:

one physical seat must map to at most one successful purchase

That invariant is stronger than "replicas should converge eventually." If Madrid and Virginia both accept the final-seat purchase during a partition, reconciliation can notice the conflict later, but it cannot make both customers happy without compensation.

So the team has to decide which behavior is acceptable during the broken link:

Madrid replica        partition        Virginia replica
buy final seat   X---------------X     buy final seat

One design chooses consistency over availability for checkout. It may route final-seat writes through a quorum, a single leader, or a lease holder. If the service cannot prove that the write is still authorized, it rejects or delays the operation. Some requests fail, but the seat does not get double-sold.
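A minimal sketch of the consistency-first option, assuming a majority quorum guards checkout. The names `SeatCheckout`, `AuthorityUnclear`, and `reachable_replicas` are hypothetical, and a real service would derive the reachable count from acknowledgements rather than take it as an argument.

```python
from dataclasses import dataclass
from typing import Optional


class AuthorityUnclear(Exception):
    """Raised when this side cannot prove it is allowed to commit the write."""


@dataclass
class SeatCheckout:
    total_replicas: int             # size of the replica set (assumed fixed)
    sold_to: Optional[str] = None   # buyer holding the final seat, if any

    def quorum(self) -> int:
        # A strict majority: two disjoint sides of a partition cannot both reach it.
        return self.total_replicas // 2 + 1

    def purchase(self, buyer: str, reachable_replicas: int) -> str:
        # reachable_replicas counts this replica plus every peer that acknowledged.
        if reachable_replicas < self.quorum():
            raise AuthorityUnclear("no majority reachable; rejecting checkout")
        if self.sold_to is not None:
            return "sold out"
        self.sold_to = buyer
        return f"confirmed for {buyer}"
```

With three replicas, the side that still reaches two of them can confirm the sale, while the side that reaches only itself raises `AuthorityUnclear`. With exactly two replicas the majority is two, so under this rule neither side can sell during a partition.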

Another design chooses availability over immediate consistency. Both regions can accept local writes, then reconcile conflicts later. That can be reasonable for a shopping cart, a "like" counter, or a wishlist. It is dangerous for the final purchase unless the business has a clear compensation policy.
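The availability-first option can be sketched as a reconciliation pass that runs once the link heals. This is an illustration under stated assumptions: the `Sale` record, the `reconcile` function, and the "earliest acceptance wins" tie-break are all hypothetical, and comparing timestamps across regions assumes clocks that are good enough for that purpose.

```python
from typing import NamedTuple


class Sale(NamedTuple):
    seat: str
    buyer: str
    accepted_at: float   # local acceptance time (cross-region ordering is an assumption)
    region: str


def reconcile(*regional_logs):
    """Merge per-region sale logs after the partition heals.

    Returns (winners, to_compensate): one winning sale per seat, plus every
    accepted sale that lost and now needs a refund or another remedy.
    """
    winners: dict[str, Sale] = {}
    to_compensate: list[Sale] = []
    all_sales = [sale for log in regional_logs for sale in log]
    # Assumed tie-break: earliest acceptance wins, region name breaks exact ties.
    for sale in sorted(all_sales, key=lambda s: (s.accepted_at, s.region)):
        if sale.seat in winners:
            to_compensate.append(sale)
        else:
            winners[sale.seat] = sale
    return winners, to_compensate
```

Every entry in `to_compensate` is a customer who was already told "success". The code can detect the conflict, but only a business policy can resolve it.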

The important habit is to attach CAP to a specific operation. A single product can make different choices for different APIs: checkout may stop under partition, while browsing, recommendations, and saved searches keep serving stale or local data.

CAP in Precise Terms

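In the CAP framing, the three words are narrower than their everyday meanings:

Consistency. Every read observes the most recent completed write, as if the data lived on a single up-to-date copy (linearizability). It is not the "C" in ACID.

Availability. Every request received by a non-failing replica eventually gets a non-error response. "The service is mostly up" does not qualify.

Partition tolerance. The system keeps its stated guarantees even when the network drops or delays arbitrarily many messages between replicas.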

The common mistake is treating partition tolerance as a feature toggle. In practice, if the system spans more than one failure domain, the network can delay, drop, or reorder messages. The architecture can ignore that possibility, but the production system still has to live through it.

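During a partition, a replica cannot know whether a missing message means:

  1. the other replica has crashed,
  2. the other replica is alive but slow, or
  3. the other replica is healthy and still serving its own clients, and only the link between them has failed.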

That uncertainty is why CAP bites. A replica that keeps answering every request may be available, but it cannot also guarantee the same strong single-copy story for every operation. A replica that protects the single-copy story must sometimes refuse to answer.

PACELC: The Normal-Case Trade-Off

CAP is essential, but most days the network is not fully partitioned. Requests still cross regions, leaders still wait for followers, and quorums still add latency. PACELC captures that broader shape:

if Partition: choose Availability or Consistency
Else:         choose Latency or Consistency

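For the ticketing service, the healthy-network question might be:

  1. Should the seat map be read from the local replica, or from a leader or quorum that may sit across the Atlantic?
  2. Should a write be acknowledged as soon as it is stored locally, or only after the remote region confirms it?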

Those choices are not just performance tuning. They define what users are allowed to observe. A low-latency local read may show a seat as available after another region has already sold it. A strongly consistent read may avoid that surprise, but it can cost extra round trips and become more sensitive to slow replicas.
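The latency side of that trade can be put in rough numbers. The sketch below assumes requests fan out in parallel and that a coordinated read completes once a given number of replicas have answered; the function names and the round-trip times are hypothetical, chosen only to show the shape of the cost.

```python
def local_read_ms(rtt_ms: dict[str, float], local: str) -> float:
    # A local read waits only for the nearby replica and may return stale data.
    return rtt_ms[local]


def quorum_read_ms(rtt_ms: dict[str, float], quorum: int) -> float:
    # Requests go out in parallel; the read completes when the quorum-th
    # fastest replica has answered.
    if not 1 <= quorum <= len(rtt_ms):
        raise ValueError("quorum must be between 1 and the number of replicas")
    return sorted(rtt_ms.values())[quorum - 1]


# Hypothetical round-trip times as seen by a client in Madrid.
rtt_from_madrid = {"madrid": 2.0, "virginia": 90.0}

local = local_read_ms(rtt_from_madrid, "madrid")          # 2.0 ms
coordinated = quorum_read_ms(rtt_from_madrid, quorum=2)   # 90.0 ms
```

With only two replicas, any read that must hear from both pays the full transatlantic round trip, and it also inherits the tail latency of the slower replica.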

PACELC keeps the design discussion honest because it prevents a team from saying "we are CP" and stopping there. A system can reject writes during partitions and still make many different latency-versus-consistency choices when the network is healthy.

Designing the Guarantee

A useful design review starts with a small guarantee table instead of a slogan:

Operation              Partition behavior              Healthy-network behavior
--------------------   ------------------------------  -----------------------------
final purchase         reject if authority is unclear  wait for authorized write
inventory display      serve stale with clear budget   prefer local low-latency read
wishlist update        accept locally and reconcile    local write, async replicate
refund initiation      require durable coordination    wait for workflow record

This table does three things.

First, it separates operations by business invariant. Selling the final seat has a different correctness budget from updating a wishlist. Second, it names the user-visible behavior during failure. Third, it records where the team is spending latency to buy stronger consistency.
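One way to keep such a table from drifting away from the code is to store it as data. The sketch below is hypothetical (`Guarantee`, `GUARANTEES`, and `behavior` are illustrative names, not part of any framework) and simply encodes the four rows above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guarantee:
    partition: str   # behavior when replicas cannot communicate
    healthy: str     # behavior when the network is healthy


GUARANTEES = {
    "final purchase": Guarantee("reject if authority is unclear", "wait for authorized write"),
    "inventory display": Guarantee("serve stale with clear budget", "prefer local low-latency read"),
    "wishlist update": Guarantee("accept locally and reconcile", "local write, async replicate"),
    "refund initiation": Guarantee("require durable coordination", "wait for workflow record"),
}


def behavior(operation: str, partitioned: bool) -> str:
    # An unlisted operation raises KeyError on purpose: no guarantee was
    # decided for it, so the code refuses to guess one.
    guarantee = GUARANTEES[operation]
    return guarantee.partition if partitioned else guarantee.healthy
```

For example, `behavior("final purchase", partitioned=True)` returns the reject policy, while a new API that nobody has reviewed fails loudly instead of silently inheriting a default.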

The weakest guarantee that still protects the invariant is usually the best one, because every stronger promise is paid for in availability during partitions or in latency the rest of the time. Strong consistency is valuable when it protects an invariant the business cannot repair cheaply. Availability and low latency are valuable when users benefit more from progress than from an immediate global truth. The skill is not choosing one word for the whole system; the skill is assigning the right promise to each boundary.

Failure Modes

Mistaking CAP for a database label. A product page saying "CP" or "AP" hides the operation-level decision. Ask what a specific API does when replicas cannot communicate.

Calling every stale read a CAP problem. CAP is about partitions. Stale reads during healthy operation often belong to the PACELC side of the discussion: the team chose latency, locality, caching, or asynchronous replication over a stronger read guarantee.

Assuming reconciliation fixes every conflict. Reconciliation can merge counters, refresh projections, or repair caches. It cannot magically undo a double-booked seat, an irreversible payment, or a violated legal workflow without a compensating process.
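To see why counters reconcile cleanly, here is a minimal sketch of a grow-only counter (often called a G-counter). It is an assumed technique chosen for illustration, not something the ticketing example prescribes: each region increments only its own slot, and a merge takes the per-region maximum.

```python
def merge_counts(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    # Each region only ever increments its own slot, so the larger value per
    # region is the newer one and nothing is lost by taking the maximum.
    return {region: max(a.get(region, 0), b.get(region, 0)) for region in a.keys() | b.keys()}


def total(counts: dict[str, int]) -> int:
    return sum(counts.values())


# Both replicas agreed on {"madrid": 3, "virginia": 1} before the partition.
madrid_view = {"madrid": 5, "virginia": 1}     # two more likes arrived in Madrid
virginia_view = {"madrid": 3, "virginia": 5}   # four more likes arrived in Virginia

merged = merge_counts(madrid_view, virginia_view)   # madrid: 5, virginia: 5
```

The merge is safe to apply in any order and any number of times, which is what makes the "like" counter a good fit for local writes. The final seat has no equivalent: there is no merge of two buyers that leaves both holding the seat.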

Ignoring the client contract. Users do not experience "CAP"; they experience success, failure, waiting, stale state, duplicate confirmations, and later corrections. The architecture should name those outcomes explicitly.

Key Takeaways

  1. CAP is about partition-time behavior: when replicas cannot communicate, a strongly consistent operation cannot also remain fully available everywhere.
  2. Partition tolerance is not the meaningful knob in real distributed systems; the meaningful choice is what each operation does when communication fails.
  3. PACELC extends the question to normal operation, where stronger consistency usually costs latency even without a partition.
  4. Good designs assign guarantees per API boundary, not as one vague label for the whole database or product.