LESSON
Day 251: Cache Purging Strategies - CDN Cache Invalidation
Purging is not "deleting cache." It is coordinating when thousands of distributed copies must stop being safe to serve.
Today's "Aha!" Moment
The insight: In a CDN, a purge is not a local erase operation. It is a control-plane action that must reach many PoPs, match the right cache keys, and avoid turning a freshness fix into an origin traffic disaster.
Why this matters: Teams often treat invalidation as an afterthought. They assume a deploy changes origin content, the edge will somehow "catch up," and a purge simply forces that to happen faster. In practice, purge strategy determines whether stale content disappears cleanly, whether variants are actually invalidated, and whether the origin survives the refill.
The universal pattern: authoritative content changes -> distributed cached copies may still look valid -> control plane spreads an invalidation signal -> edge nodes stop serving or revalidate those copies.
Concrete anchor: A homepage is cached in 180 edge locations, with variants by language and device. Product changes the hero banner and wants it live immediately. The real problem is no longer "What is the new HTML?" but "How do we make every unsafe copy stop being served without causing a global miss storm back to origin?"
How to recognize when this applies:
- The same response exists in many edge caches at once.
- Content changes before TTL expiry.
- Cache keys include variants such as device, language, cookie buckets, or edge-added headers.
Common misconceptions:
- [INCORRECT] "Purging means the content is physically removed everywhere instantly."
- [INCORRECT] "Purge by URL is enough even when the cache key includes headers or other variant dimensions."
- [CORRECT] The truth: Purging is a distributed invalidation problem whose difficulty depends on variant space, propagation time, and how aggressively the system refills from origin afterward.
Real-world examples:
- Asset deploys: Teams often prefer versioned URLs so new assets avoid broad purge traffic entirely.
- Editorial updates: HTML, landing pages, product pages, and API responses often need selective purge, soft purge, or tag-based purge because the name stays stable while the content changes.
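The versioned-URL approach for assets can be sketched as follows. This is a minimal illustration, assuming a content-hash naming scheme; the function name `versioned_name` and the 8-character digest length are hypothetical choices, not a standard.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_name(path: str, content: bytes, digest_len: int = 8) -> str:
    """Return a path whose file name embeds a hash of the content.

    A deploy with different bytes yields a different URL, so old cached
    copies are simply never requested again and need no purge.
    """
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    p = PurePosixPath(path)
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))


old = versioned_name("/static/app.js", b"console.log('v1');")
new = versioned_name("/static/app.js", b"console.log('v2');")
print(old)
print(new)
```

The two printed paths differ only in the digest segment. The HTML that references the asset still has a stable name, which is why editorial content needs a real purge strategy while assets often do not.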
Why This Matters
The problem: CDN caching makes repeated reads cheap, but freshness corrections are expensive because the copies are geographically distributed and often shared across many users.
Before:
- Stale content persists until TTL expiry.
- Emergency fixes rely on blunt purge-all operations.
- Origin gets overwhelmed when too many objects refill at once.
After:
- Purge scope matches the real content boundary.
- The edge stops serving invalid content with controlled blast radius.
- Origin is protected through revalidation, soft purge, shielding, and targeted refill patterns.
Real-world impact: A good purge strategy shortens incident duration, reduces rollout risk, and keeps the CDN helping even during fast content change. A bad one creates stale leaks, surprising misses, and global traffic spikes at exactly the wrong moment.
Learning Objectives
By the end of this session, you will be able to:
- Explain why purging is a distributed systems problem - Connect CDN invalidation to control-plane propagation, cache keys, and refill behavior.
- Describe the main purge strategies - Compare full purge, URL purge, tag/key purge, and soft purge vs hard purge.
- Evaluate operational trade-offs - Decide how to remove stale content while minimizing origin shock, missed variants, and purge blast radius.
Core Concepts Explained
Concept 1: Purging Is About Scope, Not Just Speed
At a local cache level, invalidation can feel simple:
- mark entry invalid
- next read misses
- refill from authority
At CDN scale, the question becomes much harder:
- which copies should stop being served?
That depends on the effective cache key, not just the URL.
If edge caching varies by:
- path
- query string
- country
- device type
- language
- selected headers
then "purge /home" may not cover every cached object that users can receive. Surrogate tags work differently: they do not change the cache key, but they label objects so that related ones can be purged as a group.
This makes purge design mostly a scope problem:
- too broad -> you invalidate far more than necessary
- too narrow -> stale variants remain live
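The "too narrow" case can be made concrete with a small simulation. This is a sketch under stated assumptions: the `CacheKey` shape, the three languages, and the two device classes are hypothetical, and a real CDN builds its keys from its own configuration.

```python
from itertools import product
from typing import NamedTuple


class CacheKey(NamedTuple):
    path: str
    language: str
    device: str


# Hypothetical variant dimensions for this example.
LANGUAGES = ["en", "es", "de"]
DEVICES = ["desktop", "mobile"]

# Simulated edge cache holding every variant of two pages (12 objects).
cache = {
    CacheKey(path, lang, dev): f"old body for {path} ({lang}, {dev})"
    for path, lang, dev in product(["/home", "/about"], LANGUAGES, DEVICES)
}


def purge_exact(cache: dict, key: CacheKey) -> int:
    """Too narrow: removes only the one object matching the full key."""
    return 1 if cache.pop(key, None) is not None else 0


def purge_path(cache: dict, path: str) -> int:
    """Matches the content boundary: removes every variant of one path."""
    doomed = [k for k in cache if k.path == path]
    for k in doomed:
        del cache[k]
    return len(doomed)


removed_narrow = purge_exact(cache, CacheKey("/home", "en", "desktop"))
stale_left = [k for k in cache if k.path == "/home"]
print("narrow purge removed", removed_narrow, "and left", len(stale_left), "stale variants")

removed_scoped = purge_path(cache, "/home")
remaining = len(cache)
print("scoped purge removed", removed_scoped, "and kept", remaining, "unrelated objects")
```

The narrow purge leaves five stale `/home` variants behind, while the scoped purge removes them without touching `/about`. A purge-all would have emptied all twelve entries, which is the "too broad" end of the same trade-off.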
That is why mature CDN setups tend to rely on a small set of explicit invalidation handles:
- versioned asset names when possible
- URL purge for narrow object updates
- tag/surrogate-key purge when many related objects must move together
- purge-all only for exceptional cases
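Tag-based purge can be pictured as a reverse index from tag to cached objects. The sketch below is a toy model, not any vendor's implementation; the class `TaggedCache`, the tag names, and the paths are all hypothetical.

```python
from collections import defaultdict


class TaggedCache:
    """Toy cache with a reverse index from tag to cache keys."""

    def __init__(self) -> None:
        self.objects = {}                      # cache key -> body
        self.keys_by_tag = defaultdict(set)    # tag -> set of cache keys

    def put(self, key: str, body: str, tags: list) -> None:
        self.objects[key] = body
        for tag in tags:
            self.keys_by_tag[tag].add(key)

    def purge_tag(self, tag: str) -> int:
        """Remove every object labelled with `tag`; return how many were removed."""
        removed = 0
        for key in self.keys_by_tag.pop(tag, set()):
            # A key may already be gone if another of its tags was purged earlier.
            if self.objects.pop(key, None) is not None:
                removed += 1
        return removed


tagged = TaggedCache()
tagged.put("/home", "<html>home</html>", ["homepage", "product-42"])
tagged.put("/products/42", "<html>product</html>", ["product-42"])
tagged.put("/api/products/42", '{"id": 42}', ["product-42"])
tagged.put("/about", "<html>about</html>", ["static-pages"])

removed_by_tag = tagged.purge_tag("product-42")
print(removed_by_tag, "objects purged;", sorted(tagged.objects), "still cached")
```

One purge call covers three differently named objects that share a content dependency, which is the case URL purge handles poorly.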
The important mindset shift is:
- purge does not fix correctness by force
- purge fixes correctness only if its target model matches the real cache key space
Concept 2: Hard Purge and Soft Purge Optimize for Different Failure Modes
There are two broad behaviors a CDN can take after invalidation.
Hard purge means the object is no longer treated as servable:
- next request becomes a miss
- edge must go back to origin
This is simple and sometimes necessary, but dangerous at scale because many PoPs may refill simultaneously.
Soft purge means the object is marked stale rather than immediately thrown away:
- the edge knows the object should no longer be considered fresh
- but it can often revalidate the object, or keep serving the stale copy briefly while a refresh happens
Soft purge is valuable because it trades "immediate cold miss everywhere" for "controlled transition toward freshness."
That helps with:
- origin protection
- smoother refills
- avoiding thundering herds after global invalidation
So the real trade-off is:
- hard purge optimizes for freshness certainty
- soft purge optimizes for refill stability
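The difference between the two behaviors can be sketched with a toy edge cache. This is a simplified model, assuming a single boolean `stale` flag per object; the names `EdgeCache`, `hard_purge`, and `soft_purge` are illustrative and do not correspond to a specific CDN API.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    body: str
    stale: bool = False


class EdgeCache:
    def __init__(self) -> None:
        self.entries = {}  # cache key -> Entry

    def hard_purge(self, key: str) -> None:
        """Drop the object: the next request is a full miss."""
        self.entries.pop(key, None)

    def soft_purge(self, key: str) -> None:
        """Keep the object but mark it as no longer fresh."""
        if key in self.entries:
            self.entries[key].stale = True

    def lookup(self, key: str) -> str:
        """Classify what the edge would do for a request."""
        entry = self.entries.get(key)
        if entry is None:
            return "MISS"    # must fetch the full response from origin
        if entry.stale:
            return "STALE"   # may revalidate, or serve briefly while refreshing
        return "HIT"


edge = EdgeCache()
edge.entries["/home"] = Entry("old banner")
edge.entries["/pricing"] = Entry("old prices")

edge.soft_purge("/home")
edge.hard_purge("/pricing")
print(edge.lookup("/home"), edge.lookup("/pricing"))
```

After the soft purge the edge still holds a body it can fall back on, whereas after the hard purge the only option is a round trip to origin.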
This is also where CDN behavior connects back to HTTP validators and cache semantics:
- ETag
- Last-Modified
- conditional requests
- stale-while-revalidate style patterns
The CDN works best when purge and revalidation cooperate. If purge exists alone, the system often replaces stale-risk with origin-risk.
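How revalidation keeps a soft purge cheap can be shown with a simulated origin. This is a sketch only: `origin_get` stands in for a real HTTP exchange using `If-None-Match` and `304 Not Modified`, and the content-hash ETag scheme in `etag_for` is an assumption for illustration.

```python
import hashlib
from typing import Optional

# Simulated origin content; a deploy would change this value.
ORIGIN_CONTENT = {"/home": b"<h1>old banner</h1>"}


def etag_for(body: bytes) -> str:
    # Hypothetical validator scheme: a quoted content hash.
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def origin_get(path: str, if_none_match: Optional[str] = None):
    """Simulated origin: answer 304 with no body when the validator matches."""
    body = ORIGIN_CONTENT[path]
    etag = etag_for(body)
    if if_none_match == etag:
        return 304, etag, b""
    return 200, etag, body


def revalidate(path: str, cached_etag: str, cached_body: bytes):
    """Edge-side revalidation of a soft-purged object."""
    status, etag, body = origin_get(path, if_none_match=cached_etag)
    if status == 304:
        return "revalidated", cached_etag, cached_body  # no body transferred
    return "replaced", etag, body                       # content really changed


# The edge caches the page, then the object is soft purged.
_, cached_etag, cached_body = origin_get("/home")

# Case 1: origin content is unchanged, so revalidation is cheap.
unchanged = revalidate("/home", cached_etag, cached_body)

# Case 2: the banner changed, so the edge receives a full new body.
ORIGIN_CONTENT["/home"] = b"<h1>new banner</h1>"
changed = revalidate("/home", cached_etag, cached_body)

print(unchanged[0], changed[0])
```

Only objects whose content actually changed cost a full transfer, so a broad soft purge does not have to become a broad refill.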
Concept 3: The Hard Part Is What Happens After the Purge
Most teams think about the command:
- purge URL
- purge tag
- purge everything
But the bigger design question is what the system looks like one second later.
After invalidation, several things can go wrong:
- many users request the same object at once
- many PoPs re-fetch in parallel
- origin or shield sees a sudden wave of misses
- cache fragmentation reduces refill efficiency
- some variants refill correctly while others stay stale because the purge key was incomplete
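A rough estimate shows how large the refill wave can get. The 180 edge locations come from the homepage example earlier in the lesson; the 4 languages and 3 device classes are assumed numbers, and the model assumes each cache layer collapses concurrent misses for the same key into one upstream fetch.

```python
def origin_fetches(pops: int, variants: int, shielded: bool) -> int:
    """Estimate origin fetches for one purged page.

    Assumes every variant is requested at every PoP and that each cache
    layer sends at most one upstream fetch per key.
    """
    if shielded:
        return variants        # the shield fetches each variant once
    return pops * variants     # every PoP fetches each variant itself


POPS = 180            # from the homepage example
VARIANTS = 4 * 3      # assumed: 4 languages x 3 device classes

unshielded = origin_fetches(POPS, VARIANTS, shielded=False)
shielded = origin_fetches(POPS, VARIANTS, shielded=True)
print(unshielded, "origin fetches without a shield,", shielded, "with one")
```

Under these assumptions a single purged page costs 2160 origin fetches without a shield and 12 with one. Without request collapsing at the edge, the unshielded number grows further with the number of concurrent users per PoP.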
This is why purge is tightly linked to:
- origin shielding
- request collapsing / coalescing
- revalidation instead of unconditional re-fetch
- careful cache-key design
- edge logic that avoids accidental variant explosion
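Request coalescing from the list above can be sketched with threads. This is a minimal single-process illustration, not how any particular CDN implements it; `CoalescingFetcher` and `slow_origin` are hypothetical names, and the 0.3 second sleep stands in for origin latency.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor


class CoalescingFetcher:
    """Collapse concurrent misses for the same key into one origin fetch."""

    def __init__(self, fetch):
        self._fetch = fetch              # callable: key -> body
        self._lock = threading.Lock()
        self._in_flight = {}             # key -> Future owned by the leader

    def get(self, key):
        with self._lock:
            future = self._in_flight.get(key)
            is_leader = future is None
            if is_leader:
                future = Future()
                self._in_flight[key] = future
        if is_leader:
            try:
                future.set_result(self._fetch(key))
            except Exception as exc:     # followers see the same failure
                future.set_exception(exc)
            finally:
                with self._lock:
                    del self._in_flight[key]
        return future.result()           # followers block here until the leader is done


origin_calls = []


def slow_origin(key):
    origin_calls.append(key)             # record each real origin fetch
    time.sleep(0.3)                      # simulated origin latency
    return f"fresh body for {key}"


fetcher = CoalescingFetcher(slow_origin)
with ThreadPoolExecutor(max_workers=20) as pool:
    bodies = list(pool.map(fetcher.get, ["/home"] * 20))

print(len(bodies), "responses,", len(origin_calls), "origin fetch(es)")
```

Twenty concurrent requests for a just-purged object are served by one origin fetch, as long as they arrive while that fetch is still in flight. Requests that arrive after it completes would normally be served from the refilled cache, which this sketch leaves out.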
The previous lesson matters directly here. If an edge function rewrites headers or expands the cache key space, purge becomes more complex because the number of distinct copies grows.
The next lesson also follows naturally: CDN optimization is partly about making these refill and propagation behaviors cheaper and more predictable.
So the mature mental model is:
- purge is not the end of the cache lifecycle
- purge is the start of a transition period whose stability matters just as much as the invalidation itself
Troubleshooting
Issue: "We purged the page, but some users still saw stale content."
Why it happens / is confusing: Teams assume one URL maps to one cached object everywhere.
Clarification / Fix: Check the real cache key and variant space. If headers, device class, geo, or language affect the key, some copies may not have matched the purge request.
Issue: "Purge-all fixed freshness, but the origin fell over right after."
Why it happens / is confusing: The purge looks successful from the CDN control plane perspective.
Clarification / Fix: The problem is the refill wave after invalidation. Prefer versioned assets, narrower purge scope, soft purge, shielding, and request coalescing where available.
Issue: "Edge logic and purge strategy seem unrelated."
Why it happens / is confusing: One looks like request computation and the other like cache administration.
Clarification / Fix: Edge logic often changes cache keys and response variants. That directly changes purge scope, blast radius, and the likelihood of stale leftovers.
Advanced Connections
Connection 1: Cache Purging <-> Edge Functions
The parallel: Edge functions can rewrite requests, add variant dimensions, or alter cacheability. That means they indirectly define how difficult it will be to invalidate content later.
Real-world case: A personalization rule at the edge that varies the cache key by language and device can silently multiply the number of objects a later purge must cover.
Connection 2: Cache Purging <-> CDN Optimization Techniques
The parallel: Purging is one of the clearest places where control-plane correctness and data-plane efficiency meet. Optimization is not only about hit rate, but about how safely the system transitions after invalidation.
Real-world case: Two CDNs may both support purge-by-tag, but the operational difference appears in propagation behavior, shielding, refill amplification, and stale handling under load.
Resources
Optional Deepening Resources
- [DOCS] Amazon CloudFront Developer Guide: Invalidate files to remove content
- Link: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Invalidation.html
- Focus: Use it to understand invalidation mechanics, trade-offs with versioned file names, and operational limits in a major CDN.
- [DOCS] Cloudflare API: Cache purge
- Link: https://developers.cloudflare.com/api/node/resources/cache/methods/purge/
- Focus: Read it for concrete purge modes such as purge everything and granular purge by URL or custom cache-key context.
- [DOCS] Fastly Documentation: Purging
- Link: https://www.fastly.com/documentation/reference/api/purging/
- Focus: Use it to study soft vs hard purge and surrogate-key-based invalidation in a CDN built around edge cache control.
- [DOCS] MDN Web Docs: HTTP Caching
- Link: https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching
- Focus: Revisit validators and freshness semantics so purge, revalidation, and stale handling fit into one mental model.
Key Insights
- Purge scope matters more than the button you press - The real problem is matching the invalidation to the effective cache-key space.
- Hard purge and soft purge protect different things - One prioritizes immediate freshness, the other prioritizes refill stability and origin safety.
- Invalidation is followed by a transition period - The operational risk often comes after the purge, when the edge and origin renegotiate freshness under real traffic.