Day 034: Memory Hierarchies and Distributed State
The same design pressure appears from CPU caches to CDNs: the data you need most wants to be close, but the total data you own cannot all live in the fastest place.
Today's "Aha!" Moment
One of the most useful ideas in systems is that "memory" is rarely just one thing. A CPU has registers, multiple cache levels, RAM, and slower backing storage. A distributed service has in-process memory, local caches, Redis, databases, object storage, and sometimes a CDN in front of all of that. These stacks look different, but they are solving the same problem: the hottest data needs to be near the computation, while the full dataset needs to be larger, cheaper, and more durable somewhere farther away.
Imagine an API that serves user profiles. The user's name and feature flags might already be in process memory because they were just read. A slightly colder copy may live in Redis. The authoritative row lives in a relational database. Profile images live in object storage and may be cached at the edge. That is not accidental layering. It is a hierarchy built around latency, capacity, cost, and expected access patterns.
Once you see the hierarchy, performance behavior gets easier to predict. Hits are cheap because the requested state is already near the place doing the work. Misses are expensive because the request has to fall through to a slower and often more contended layer. That is why cache design is never just "an optimization." It changes the path a request takes through the system and changes how load propagates when locality breaks down.
The main mental shift is this: treat caches and memory layers as part of the architecture of state, not as optional decorations. Then questions about hot keys, eviction, staleness, refill storms, and downstream saturation stop looking like isolated incidents and start looking like properties of the hierarchy itself.
Why This Matters
The problem: Engineers often learn local memory hierarchies and distributed cache layers as separate topics, which hides the common logic behind locality, miss penalties, and layered state design.
Before:
- Caches are treated as one-off tricks rather than as deliberate tiers in a hierarchy.
- Hit rate is watched without understanding what misses cost downstream.
- Eviction and refill behavior are seen as implementation details instead of system behavior.
After:
- Fast and slow storage layers are read as one coordinated hierarchy.
- Locality becomes the key lens for understanding performance.
- Cache misses, refill paths, and eviction policies are recognized as first-class architectural concerns.
Real-world impact: This helps when designing APIs, deciding what to cache, debugging latency spikes, protecting databases under bursts, and explaining why a tiny hit-rate drop can create a large production incident.
Learning Objectives
By the end of this session, you will be able to:
- See one pattern across many layers - Relate CPU caches, memory pages, application caches, Redis, databases, and CDNs through the same hierarchy idea.
- Reason from locality and miss cost - Explain why hot data placement matters more than raw storage size in many performance problems.
- Evaluate hierarchy behavior under pressure - Understand how eviction, refill, and miss cascades shape latency and system stability.
Core Concepts Explained
Concept 1: Hierarchies Exist Because No Single Layer Wins on Latency, Capacity, and Cost
Return to the profile-serving API. If every request went straight to the durable database and object store, correctness would be simple but latency and load would be painful. If every possible user record lived only in process memory, reads would be fast but capacity, durability, and coordination would be a disaster. The hierarchy exists because neither extreme is viable.
That is the common pattern across local and distributed systems. The fastest layer is small and expensive. The largest and most durable layer is slower and farther away. So the system builds a ladder:
closest / fastest / smallest
-> CPU cache
-> RAM
-> local process cache
-> shared cache
-> database
-> object storage / remote backing layer
farthest / slowest / largest / cheapest
Each step down usually buys one or more of these properties: more capacity, lower cost per byte, stronger durability, or easier sharing across machines. Each step up buys lower access latency and less contention. That trade-off is why hierarchies keep reappearing in different forms.
The important consequence is that performance is not just about "how fast is the database?" It is about how often requests can stay in upper layers and avoid paying the miss penalty to lower ones.
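That "stay in upper layers" idea can be made concrete with back-of-envelope arithmetic. The sketch below computes the expected read latency of a small hierarchy; the per-layer latencies and hit rates are made-up illustrative assumptions, not measurements.

```python
# Sketch: effective read latency across a hierarchy. Each tuple is
# (name, latency_ms, hit rate among requests that reach this layer).
# All numbers are illustrative assumptions.

layers = [
    ("process cache", 0.001, 0.60),
    ("shared cache",  1.0,   0.80),
    ("database",      10.0,  1.00),   # authority: always answers
]

def effective_latency(layers):
    total, reach = 0.0, 1.0   # reach = fraction of requests falling through this far
    for name, latency_ms, hit_rate in layers:
        total += reach * latency_ms   # every request reaching this layer pays its latency
        reach *= (1.0 - hit_rate)     # only misses continue downward
    return total

print(round(effective_latency(layers), 3))  # roughly 1.2 ms
```

Even with these friendly numbers, the small fraction of requests that fall all the way to the database contributes most of the average latency, which is the sense in which misses, not hits, set the system's real cost.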
Concept 2: Locality Is What Makes Small Fast Layers Surprisingly Powerful
A tiny fast layer only helps if the workload repeatedly touches a small working set. This is the idea of locality. Recently accessed data is often needed again soon, and nearby or related data is often needed together. CPU caches exploit it. Page caches exploit it. Application caches and Redis layers exploit it too.
In our profile API, suppose most traffic comes from a subset of active users. Their names, account settings, and session-adjacent metadata are touched over and over. Even a small cache can therefore absorb a large percentage of reads because the access distribution is not uniform.
That is why "how big is the cache?" is often a weaker question than "what is the access pattern?" A smaller cache with strong locality can outperform a larger cache fed by random reads. Once the working set stops fitting, though, misses rise quickly and the deeper layers start paying the price.
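The claim that access-pattern shape beats cache size can be checked with a tiny simulation. The workload parameters below (key space, cache size, Pareto skew) are arbitrary assumptions chosen only to contrast uniform and skewed access.

```python
import random
from collections import OrderedDict

# Sketch: why a small cache absorbs many reads when access is skewed.
# Cache size and workload shape are illustrative assumptions.

def lru_hit_rate(accesses, capacity):
    cache, hits = OrderedDict(), 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # refresh recency on a hit
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / len(accesses)

random.seed(0)
keys = range(10_000)
uniform = [random.choice(keys) for _ in range(50_000)]
# Pareto-skewed ids: a small set of "active users" is touched far more often
skewed = [min(int(random.paretovariate(1.2)), 9_999) for _ in range(50_000)]

print(f"uniform: {lru_hit_rate(uniform, 500):.2%}")
print(f"skewed:  {lru_hit_rate(skewed, 500):.2%}")
```

A cache holding 5% of the key space serves only a small slice of uniform traffic but a large majority of the skewed traffic, which is exactly the locality effect described above.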
One useful picture is:
request
-> check hot layer
-> hit: finish quickly
-> miss: fall through, fetch, maybe promote upward
The trade-off is that upper layers become more effective as locality improves, but also more fragile when workload shape changes. A traffic spike, feature launch, or hot-key shift can suddenly turn a healthy hierarchy into a miss-heavy one.
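The fall-through-and-promote picture can be sketched in a few lines. The layer objects and their get/put interface here are hypothetical stand-ins, not a real cache API.

```python
# Sketch of "miss: fall through, fetch, maybe promote upward".
# DictLayer is a hypothetical stand-in for any layer in the ladder.

class DictLayer:
    def __init__(self, data=None):
        self.data = dict(data or {})

    def get(self, key):
        return self.data.get(key)

    def put(self, key, value):
        self.data[key] = value

def read_through(key, layers):
    """Check layers fast-to-slow; on a hit, promote the value upward."""
    for depth, layer in enumerate(layers):
        value = layer.get(key)
        if value is not None:
            for faster in layers[:depth]:  # refill every faster layer we passed
                faster.put(key, value)
            return value
    return None  # missed everywhere, even at the authority

hot = DictLayer()
authority = DictLayer({"u1": {"name": "Ada"}})
profile = read_through("u1", [hot, authority])
print(profile, hot.get("u1"))  # the hot layer now holds a promoted copy
```

The first read pays the deep path; a second read of the same key would stop at the hot layer, which is the promotion behavior the diagram describes.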
Concept 3: Misses, Refills, and Eviction Policies Determine the System's Real Behavior
Most production pain shows up not during hits but during misses. A miss is not just a local inconvenience. It means a request has to ask a slower, deeper, often more contended layer for help. If many requests miss together, they do not just become slower; they can overload the authority underneath.
Imagine Redis starts evicting profile entries aggressively during a traffic burst. The immediate symptom may be a lower hit rate. The real effect is wider:
more evictions
-> more cache misses
-> more database reads
-> higher database latency
-> slower refill
-> even more concurrent misses
That is a miss cascade, and it is why eviction policy matters so much. Policies like LRU or LFU are not cosmetic. They are guesses about which objects are most valuable to keep close. If the policy matches the workload, the hierarchy stays efficient. If it matches poorly, the system repeatedly throws away exactly the data it was about to need again.
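The "guess" framing can be made tangible with a minimal eviction policy. The sketch below is an unoptimized LFU-style cache (keep what has been touched most often); LRU would instead keep what was touched most recently. It is an illustration of the policy idea, not a production implementation.

```python
from collections import Counter

# Sketch: LFU-style eviction as a "which data is valuable" guess.
# Minimal and unoptimized; real caches approximate this cheaply.

class LFUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}
        self.counts = Counter()

    def get(self, key):
        if key in self.data:
            self.counts[key] += 1   # each hit strengthens the "keep this" guess
            return self.data[key]
        return None

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            victim = min(self.data, key=lambda k: self.counts[k])
            del self.data[victim]   # evict the least frequently used entry
            del self.counts[victim]
        self.data[key] = value
        self.counts[key] += 1

cache = LFUCache(2)
cache.put("hot", 1); cache.get("hot"); cache.get("hot")
cache.put("warm", 2)
cache.put("cold", 3)        # "warm" has the lowest count, so it is evicted
print(sorted(cache.data))   # the frequently touched key survives
```

If the workload really does re-touch frequent keys, this guess keeps the hierarchy efficient; if the workload shifts to new keys, the same policy stubbornly protects stale favorites, which is the poor-match failure the paragraph above describes.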
def read_profile(user_id, cache, database):
    # Hit path: the hot layer already holds the profile
    profile = cache.get(user_id)
    if profile is not None:
        return profile
    # Miss path: fall through to the authority, then refill the cache
    profile = database.fetch(user_id)
    cache.put(user_id, profile)
    return profile
This code looks simple, but it hides the real system questions: how many callers miss together, how expensive is database.fetch, what happens when the refill path slows down, how stale can the cache be, and what gets evicted to make room for the new object? Those are hierarchy questions, not just code questions.
The trade-off is straightforward. Fast upper layers reduce latency and downstream load, but they introduce coherence, freshness, and eviction decisions that can destabilize the system if left implicit.
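One common answer to the "how many callers miss together" question is to collapse concurrent misses for the same key into a single fetch, sometimes called single-flight. The sketch below shows the idea with a per-key event; it is deliberately simplified and omits timeouts, error handling, and negative caching that a real implementation would need.

```python
import threading

# Sketch: single-flight refill. Only one caller per key pays the miss;
# concurrent callers wait for that refill instead of piling onto the
# database. Simplified illustration, not a production cache.

class SingleFlightCache:
    def __init__(self, fetch):
        self.fetch = fetch        # slow authoritative lookup, e.g. a database read
        self.data = {}
        self.lock = threading.Lock()
        self.inflight = {}        # key -> Event for the fetch in progress

    def get(self, key):
        with self.lock:
            if key in self.data:
                return self.data[key]
            event = self.inflight.get(key)
            if event is None:     # first miss for this key becomes the leader
                event = self.inflight[key] = threading.Event()
                leader = True
            else:
                leader = False
        if leader:
            value = self.fetch(key)        # only the leader hits the authority
            with self.lock:
                self.data[key] = value
                del self.inflight[key]
            event.set()
        else:
            event.wait()                   # followers wait for the leader's refill
        return self.data[key]
```

Without this guard, a burst of simultaneous misses on one hot key becomes a burst of identical database reads, which is precisely the refill-storm shape of the miss cascade above.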
Troubleshooting
Issue: A cache is treated as a harmless performance add-on.
Why it happens / is confusing: Caches are often introduced incrementally, so teams underestimate how much they change request paths and failure behavior.
Clarification / Fix: Model both the hit path and the miss path. If the miss path is expensive or can amplify load on an authority, the cache is part of the architecture, not a small add-on.
Issue: Hit rate looks acceptable, so the hierarchy is assumed to be healthy.
Why it happens / is confusing: Hit rate is easy to graph, but it says little about which requests miss, how expensive those misses are, or whether misses cluster during bursts.
Clarification / Fix: Pair hit rate with miss penalty, refill latency, eviction churn, and downstream saturation risk. A modest hit-rate change on an expensive path can matter more than a large hit-rate change on a cheap path.
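The "modest change on an expensive path" point is easy to verify with back-of-envelope numbers. The latencies below are illustrative assumptions, not measurements.

```python
# Sketch: hit rate alone hides miss cost. Latencies are assumed values.

def mean_latency(hit_rate, hit_ms, miss_ms):
    # Expected per-request latency: hits pay hit_ms, misses pay miss_ms
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms

# Cheap miss path: a 20-point hit-rate drop adds under 1 ms on average
print(mean_latency(0.99, 1, 5), mean_latency(0.79, 1, 5))      # ~1.04 -> ~1.84

# Expensive miss path: a 4-point drop more than triples mean latency
print(mean_latency(0.99, 1, 200), mean_latency(0.95, 1, 200))  # ~2.99 -> ~10.95
```

The hit-rate graph for the second case looks nearly flat while the latency graph explodes, which is why hit rate needs to be read alongside miss penalty.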
Advanced Connections
Connection 1: Virtual Memory ↔ Distributed Cache Layers
The parallel: Both systems use a smaller fast layer to front a larger slower one, relying on locality and indirection to make access feel smoother than the underlying storage reality.
Real-world case: A Redis-backed application often behaves like an explicit user-space memory hierarchy, complete with promotion, eviction, and expensive miss paths.
Connection 2: Cache Eviction ↔ Production Stability
The parallel: What leaves the fast layer determines not only latency but also how much load gets pushed into deeper authoritative layers.
Real-world case: A badly tuned cache can turn a traffic burst into a database incident by collapsing locality at exactly the worst moment.
Resources
Optional Deepening Resources
- These resources are optional and are not required for the core 30-minute path.
- [BOOK] Operating Systems: Three Easy Pieces
- Link: https://pages.cs.wisc.edu/~remzi/OSTEP/
- Focus: Review memory virtualization, caching, and paging from the local-systems side.
- [DOC] Redis Key Eviction
- Link: https://redis.io/docs/latest/develop/reference/eviction/
- Focus: See how real cache policies approximate different goals under memory pressure.
- [DOC] Amazon CloudFront Developer Guide
- Link: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Introduction.html
- Focus: Extend the same hierarchy intuition outward to global content delivery.
Key Insights
- Hierarchies appear because storage trade-offs are unavoidable - The fastest place cannot hold everything, and the biggest place cannot answer everything quickly.
- Locality is the real source of cache power - Small fast layers help only because access patterns are uneven and repetitive.
- Miss behavior is system behavior - Eviction, refill, and fall-through costs determine whether the hierarchy absorbs load or propagates it downward.
Knowledge Check (Test Questions)
1. Why do systems repeatedly build storage or memory hierarchies?
- A) Because one layer rarely provides the best latency, capacity, durability, and cost at the same time.
- B) Because duplication always removes the need for design trade-offs.
- C) Because operating systems require every application to use Redis-like caches.
2. What makes a cache miss important?
- A) It is usually just a small local slowdown with no effect elsewhere.
- B) It forces the request onto a slower path and can increase load on deeper, more expensive layers.
- C) It proves the cache should be removed.
3. Why does eviction policy matter so much?
- A) It determines which data stays close and which requests are pushed toward slower authorities during pressure.
- B) It is only relevant for CPU hardware, not for distributed systems.
- C) It mainly affects storage accounting, not performance.
Answers
1. A: Hierarchies exist because different storage layers optimize different things, so systems combine them instead of pretending one layer can do everything well.
2. B: A miss means the request pays a slower path and may transfer work to a deeper layer that is often more contended and more costly to access.
3. A: Eviction is the policy that decides what remains hot and what must be refetched, which directly shapes latency and downstream pressure.