LESSON
Day 468: Caching Layers and Invalidation Boundaries
The core idea: A cache only improves a production system when every cached copy has an explicit authority, an explicit invalidation trigger, and an explicit freshness budget.
Today's "Aha!" Moment
In 019.md, PayLedger fixed the hard part first: the support query for "unhealthy payouts in the current payroll run" became selective enough to execute directly on PostgreSQL without dragging the whole table through memory. That success created a new pressure during the 15:45 cutoff window. The support dashboard for tenant acme-co polls GET /runs/acme-co/apr-2026-same-day/health every five seconds, the payroll coordinator page asks for the same counts, and the incident bot posts the same summary into Slack. The database query is now correct and reasonably cheap, but the system is still doing the same work over and over.
The tempting move is to say "put Redis in front of it" and declare victory. PayLedger tried exactly that, plus a small in-process cache inside the API. Median latency improved immediately, yet operators started seeing stale "retrying" counts after a payout had already succeeded. One coordinator manually retried a batch because the dashboard still showed 14 unhealthy payouts for almost half a minute. The cache did not break because Redis was slow; it broke because the team had copied data without defining which writes made each copy stale.
That is the important mental shift. Caching is not a generic performance layer. It is controlled duplication of state. The authoritative payout row in PostgreSQL, the shared Redis summary for one payroll run, and the request-local memoized object inside the API are three different copies with three different visibility rules. If you cannot name the invalidation boundary for each copy, you are not tuning latency yet. You are inventing a second consistency model by accident.
Why This Matters
The production consequence is not abstract. PayLedger is used during payroll cutoff, so stale data changes human decisions. A stale support summary can trigger an unnecessary manual replay. A stale payout detail page can convince an operator that a processor callback never arrived. A stale incident bot message can send the on-call engineer toward the wrong root cause. These are not just UX blemishes; they are operational mistakes induced by incorrect visibility.
This is also where the previous lesson matters. Selective indexing from 019.md made the base query narrow enough that caching became optional rather than a rescue mission. That is the right order. Caching an unselective query usually amplifies confusion because the team now has both a slow database path and a stale copy of the same bad result. Once the base access path is sound, the real trade-off becomes explicit: lower repeated-read latency versus the extra coordination required to keep duplicated state honest.
The boundary question is therefore more important than the cache technology. Ask three things before adding any layer: what store is authoritative, what event makes this copy stale, and how stale may the caller safely tolerate? If those answers differ by consumer, then different consumers need different caches or different bypass rules. One policy for every caller is usually a sign that the freshness contract has not been thought through.
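One lightweight way to force those three answers into the open is to write the freshness contract down as data instead of leaving it implied by code paths. The sketch below is illustrative only; the consumer names and budgets are assumptions drawn from the scenario above, not PayLedger's actual configuration:
from dataclasses import dataclass
@dataclass(frozen=True)
class FreshnessContract:
    authority: str                 # which store is right when copies disagree
    stale_trigger: str             # which committed mutation makes this copy stale
    max_staleness_seconds: float   # how stale this caller can safely tolerate
CONTRACTS = {
    "support-dashboard": FreshnessContract(
        authority="postgresql.payout",
        stale_trigger="committed change to run-health fields for the (tenant, run) pair",
        max_staleness_seconds=5.0,
    ),
    "operator-retry-screen": FreshnessContract(
        authority="postgresql.payout",
        stale_trigger="any committed change to the payout row",
        max_staleness_seconds=0.0,  # must bypass shared caches entirely
    ),
}
If two entries in a table like this disagree on the budget, that is the signal that they should not share one cached copy.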
Core Walkthrough
Where the copies live
For PayLedger, the authoritative data stays in PostgreSQL because payout status transitions participate in transactional rules with retries, processor response codes, and ledger-side audit rows. On top of that source of truth, the team keeps two cache layers. The first is a request-local cache inside support-api, used only so one HTTP request that renders several widgets does not rerun the same summary query three times. The second is a shared Redis cache for the cross-request summary of a payroll run:
PostgreSQL payout rows -> authority
support-api request cache -> single-request reuse only
Redis run summary cache -> repeated reads across users and bots
Those layers are not interchangeable. The request-local cache never outlives the request, so its invalidation boundary is trivial: it disappears when the request ends. The Redis cache is different because it spans users, workers, and time. That means it needs a durable rule for when a cached summary for one payroll run is no longer safe to serve.
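A minimal sketch of that request-local layer, with illustrative names rather than PayLedger's actual code, looks like this; the shared path it falls back to is the load_run_health function shown later in the walkthrough:
class RequestCache:
    """Memoizes lookups for the lifetime of a single HTTP request."""
    def __init__(self):
        self._entries = {}
    def get_or_load(self, key, loader):
        # The first widget on the page pays for the query; later widgets reuse it.
        if key not in self._entries:
            self._entries[key] = loader()
        return self._entries[key]
def render_run_widgets(request_cache, tenant_id, run_id):
    # One instance is created per request and discarded with it, so the
    # invalidation boundary is simply the end of the request.
    return request_cache.get_or_load(
        ("run-health", tenant_id, run_id),
        lambda: load_run_health(tenant_id, run_id),  # shared Redis/PostgreSQL path
    )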
What actually makes a cache entry stale
The summary cache is stale whenever a committed write changes any field that contributes to the "run health" view: status, processor_code, retry counters, or the set of payouts included in the run. The invalidation boundary is therefore not "any write to Redis" or "every 30 seconds." It is "any committed transaction that changes the run-health result for tenant X and run Y."
PayLedger encodes that rule in the write path instead of in ad hoc cache deletes:
BEGIN;
UPDATE payout
   SET status = 'succeeded',
       updated_at = now()
 WHERE payout_id = $1;
UPDATE payroll_run
   SET cache_version = cache_version + 1
 WHERE tenant_id = $2
   AND run_id = $3;
INSERT INTO outbox (topic, aggregate_key, payload)
VALUES (
    'payout-status-changed',
    $1,
    json_build_object(
        'tenant_id', $2,
        'run_id', $3,
        'payout_id', $1
    )
);
COMMIT;
Two details matter here. First, the cache boundary is tied to the same transaction that changes authoritative state, so the system never invalidates a cache entry for a write that later rolls back. Second, the outbox row is written inside the transaction but only relayed after commit, which means other services learn about the change only after PostgreSQL has made the new state durable and visible. That preserves a simple contract: invalidate because truth changed, not because a handler hopes truth is about to change.
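The relay that turns committed outbox rows into published events can stay small. This is a hedged sketch assuming a psycopg-style connection and an outbox table that also carries id and published_at columns, which the transaction above does not show:
def relay_payout_events(conn, publish_event):
    # Rows are only visible here after their transaction committed, so every
    # published event corresponds to a durable change in PostgreSQL.
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, payload
              FROM outbox
             WHERE topic = 'payout-status-changed'
               AND published_at IS NULL
             ORDER BY id
             LIMIT 100
            """
        )
        for outbox_id, payload in cur.fetchall():
            publish_event('payout-status-changed', payload)
            cur.execute(
                "UPDATE outbox SET published_at = now() WHERE id = %s",
                (outbox_id,),
            )
    conn.commit()
Because the relay only ever sees committed rows, a crash between publishing and marking the row can at worst replay an event; it can never announce a change that later rolled back.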
Why versioned keys beat blind deletes
Many teams implement invalidation by deleting run-health:{tenant}:{run} when a payout changes. That works until readers and writers race. Suppose the key is deleted, a slow request recomputes the summary from an old replica snapshot, and then repopulates the cache after the delete. The cache is now stale again even though invalidation technically happened.
PayLedger avoids that race by versioning the cache key with the transactional cache_version held on payroll_run:
def load_run_health(tenant_id: str, run_id: str) -> RunHealth:
    version = load_cache_version(tenant_id, run_id)  # commit-visible metadata
    key = f"run-health:{tenant_id}:{run_id}:v{version}"
    cached = redis.get(key)
    if cached is not None:
        return decode(cached)
    summary = query_authoritative_summary(tenant_id, run_id)
    redis.set(key, encode(summary), ex=30)
    return summary
With versioned keys, an older reader can only repopulate an older versioned entry. New readers ask for the newer key as soon as cache_version advances. Old entries may linger until TTL cleanup, but they no longer compete with the fresh entry for the same logical view. This is the real job of TTL in the design: it is a garbage-collection and damage-limiting mechanism, not the primary correctness mechanism.
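The load_cache_version helper only has to read the commit-visible counter. A minimal sketch, assuming a psycopg-style connection held by the service:
def load_cache_version(tenant_id: str, run_id: str) -> int:
    # Reads the counter bumped inside the payout-update transaction, so a new
    # version is only observed once the change that produced it has committed.
    with pg_conn.cursor() as cur:  # pg_conn: assumed long-lived connection
        cur.execute(
            "SELECT cache_version FROM payroll_run"
            " WHERE tenant_id = %s AND run_id = %s",
            (tenant_id, run_id),
        )
        (version,) = cur.fetchone()
    return version
That read is one small indexed lookup per request; what the cache saves is the heavier run-health aggregation, not every round trip to PostgreSQL.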
Choosing the right cache boundary for each caller
Once the boundary is explicit, different consumers can make deliberate choices. The support dashboard can read through Redis because a few seconds of staleness is acceptable as long as invalidation is prompt. The "retry this payout now" screen should bypass Redis and fetch fresh row-level state immediately after an operator action because a stale answer could trigger a duplicate manual intervention. The incident bot may read from Redis but annotate its message with the cache timestamp so humans know whether they are looking at a just-in-time view or a lagging summary.
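One way to express those per-caller choices in code, sketched here with an assumed freshness parameter rather than anything PayLedger actually ships:
from enum import Enum
class Freshness(Enum):
    CACHED = "cached"  # the shared Redis copy is acceptable
    STRONG = "strong"  # must reflect the latest committed row state
def get_run_health(tenant_id: str, run_id: str, freshness: Freshness) -> RunHealth:
    if freshness is Freshness.STRONG:
        # The retry screen pays the full query cost so an operator never acts
        # on a summary that a concurrent status change has already outdated.
        return query_authoritative_summary(tenant_id, run_id)
    # Dashboards and the incident bot read through the versioned Redis key.
    return load_run_health(tenant_id, run_id)
The incident bot would take the cached path but surface the entry's age next to the counts, which in practice means storing a fetched-at timestamp alongside the encoded summary.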
This is where layering becomes a design tool instead of an implementation accident. A cache layer should exist only when its freshness envelope matches a real caller. If two callers cannot tolerate the same staleness, they should not be forced through the same cached copy. The trade-off is operational complexity: more tailored layers reduce unnecessary load, but they also increase the number of boundaries you must instrument, invalidate, and explain during incidents.
Failure Modes and Misconceptions
"TTL is the invalidation strategy." This is attractive because it is easy to ship and easy to diagram. It is also incomplete. TTL answers "how long can a stale entry survive if everything else fails?" It does not answer "which write made this value stale?" Use TTL as a backstop, not as the authority for correctness.
"Delete the cache before the database commit to keep code simple." That sequence creates a window where readers observe a miss, re-query PostgreSQL, and repopulate the cache with data that the write transaction has not committed yet. The fix is commit-triggered invalidation, usually through an outbox or change-data-capture flow that only emits after durable success.
"Any query result can be cached if the key includes all query parameters." Parameterized keys solve naming, not dependency tracking. If the cached object depends on many rows or on a join that changes shape over time, you still need a rule for which committed mutations invalidate the result. When the dependency map is too broad or too unstable, cache a smaller object or move the read model into an explicit derived projection.
"One cache should serve every caller of the same data." That assumption hides business semantics. The support dashboard, incident bot, and operator action screen all touch payout health, but they do not have the same freshness budget. Separate consumers by tolerated staleness, not by whether they happen to call the same endpoint today.
Connections
- 019.md is the prerequisite: selective indexes made the underlying query cheap enough that caching could be an optimization instead of a bandage for a bad access path.
- 018.md is the storage-side reminder that every extra copy of data has a maintenance bill. Cache layers avoid some hot reads, but they introduce invalidation work in the same way LSM secondary indexes introduce compaction work.
- 021.md is the next step: once data is copied across cache and storage layers, size and integrity matter, so compression choices and corruption detection become part of the same production story.
Resources
- [BOOK] Designing Data-Intensive Applications
- Focus: Read the sections on derived data and caches as duplicated state, not as magical acceleration.
- [DOC] PostgreSQL Transaction Isolation
- Focus: Review why cache invalidation should be tied to commit-visible state changes rather than optimistic pre-commit deletes.
- [DOC] RFC 9111: HTTP Caching
- Focus: The freshness, validation, and cache-key concepts transfer directly to service-layer caches even when the transport is not HTTP.
- [DOC] Redis EXPIRE
- Focus: Use it to separate expiration mechanics from correctness; expiration limits stale lifetime, but it does not tell you when a value became stale.
Key Takeaways
- A cache layer is a second copy of state, not a free latency toggle. Treat every cached value as data with its own authority and visibility contract.
- Invalidation boundaries should be expressed in terms of committed mutations. "Any committed change that alters this result" is the right mental model; "wait for TTL" is only a fallback.
- Versioned keys solve races that blind delete-based invalidation cannot. They let stale writers repopulate only stale versions instead of poisoning the current key.
- Different callers deserve different freshness budgets. Shared caching works when the consumers agree on staleness; when they do not, separate the layers or provide a fresh-read path.