LESSON
Day 244: Cache Coherence at Scale - NUMA & Directory Protocols
A coherence protocol that works for a few caches can become the bottleneck once the machine grows large enough.
Today's "Aha!" Moment
The insight: MESI explains how copies stay coherent, but it does not by itself explain how that coordination scales when the machine has many cores, sockets, and memory regions. At that point, coherence stops being only a correctness protocol and becomes a topology problem.
Why this matters: Teams often learn coherence from a neat small-system diagram and then miss the practical cost explosion that appears in large NUMA systems. The machine is still coherent, but the price of keeping everyone coherent is no longer uniform.
The universal pattern: more caches and more distance -> more metadata and coordination needed -> topology starts to shape latency and throughput.
Concrete anchor: A shared counter on four nearby cores is one thing. The same counter bouncing between threads on different sockets is another. The value remains coherent in both cases, but the cost of ownership transfer and remote memory access can be radically different.
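To make that anchor concrete, here is a minimal sketch in C (assuming a Linux machine with pthreads; the core IDs are placeholders you would adjust to your own topology, for example after checking lscpu). Pinning both threads to nearby cores versus cores on different sockets changes nothing about correctness, only about how far ownership of the counter's cache line has to travel:

/* Minimal sketch: two threads hammering one shared counter.
 * Pinning the threads to cores on the same socket vs. different
 * sockets changes how far the line's ownership has to travel.
 * Core IDs below are placeholders; check your topology with lscpu. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

#define ITERATIONS 10000000L

static atomic_long counter = 0;

static void pin_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *worker(void *arg) {
    pin_to_core(*(int *)arg);
    for (long i = 0; i < ITERATIONS; i++) {
        /* Each increment needs exclusive ownership of the line,
         * so ownership ping-pongs between the two cores' caches. */
        atomic_fetch_add(&counter, 1);
    }
    return NULL;
}

int main(void) {
    /* Try {0, 1} (same socket) vs. e.g. {0, 32} (different sockets)
     * and compare wall-clock time; the value stays correct either way. */
    int cores[2] = {0, 1};
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &cores[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", atomic_load(&counter));
    return 0;
}

Compiling with cc -O2 -pthread and timing the two placements is usually enough to see the gap, even though the final counter value is identical in both runs.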
How to recognize when this applies:
- The machine has multiple sockets or NUMA nodes.
- Memory access latency is not uniform across all cores.
- Shared writable data shows much worse performance once threads spread across sockets.
Common misconceptions:
- [INCORRECT] "If the hardware is coherent, all cores see memory with roughly the same cost."
- [INCORRECT] "NUMA is only about memory capacity, not about coherence or access distance."
- [CORRECT] The truth: Large coherent systems stay correct by paying more coordination cost, and that cost depends heavily on topology and placement.
Real-world examples:
- Threaded servers on big machines: Throughput can drop sharply when hot shared lines bounce between sockets.
- Databases and analytics engines: Local vs remote memory placement changes effective latency even when the logical program is unchanged.
Why This Matters
The problem: Small coherence mechanisms often assume that all participants can cheaply observe or react to the same traffic. That assumption weakens as machines add more cores, more cache slices, more interconnect distance, and multiple memory controllers.
Before:
- Coherence is imagined as a flat system-wide property.
- Memory access is treated as if "RAM is RAM" regardless of socket or node.
- Performance debugging stops at "the machine is coherent, so hardware should handle it."
After:
- You can distinguish correctness from scalability cost.
- NUMA placement becomes part of performance reasoning, not an afterthought.
- Directory protocols and topology-aware design become easier to understand as scaling tools rather than hardware trivia.
Real-world impact: This is why thread pinning, sharding by NUMA node, local allocation, and reduced cross-socket sharing matter in serious systems work. The hardware does keep the illusion of shared memory alive, but not at a constant price.
Learning Objectives
By the end of this session, you will be able to:
- Explain why coherence gets harder at scale - Connect machine growth to coherence traffic, metadata, and interconnect cost.
- Describe the role of NUMA and directory protocols - Understand how large systems avoid treating every coherence event as a global broadcast.
- Evaluate practical consequences - Recognize when locality, placement, and shared-write patterns will dominate performance.
Core Concepts Explained
Concept 1: Coherence Stops Being Cheap When Topology Stops Being Flat
The previous lesson showed why coherence exists at all: multiple private caches can otherwise disagree about shared writable data.
That story is manageable in a small system where a coherence event can be observed quickly by everyone who matters. But when the machine grows, the cost structure changes.
A larger system may have:
- many cores
- several last-level cache slices
- multiple sockets
- multiple memory controllers
- a non-uniform fabric connecting all of the above
The key point is that coherence traffic has to travel through this topology.
That means the cost of keeping a line coherent is no longer just "some protocol message happened." It now depends on:
- where the line currently lives
- which cores have copies
- whether ownership transfer is local or cross-socket
- whether the protocol needs to notify many participants
This is why scale changes the problem. The protocol still answers "Who may write?" and "Whose copies are stale?" But the answer now has a spatial cost.
The trade-off is:
- we want the programming model of shared memory
- but as the machine grows, preserving that illusion requires more expensive coordination over a larger fabric
That is the first bridge from coherence theory to NUMA systems.
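One way to internalize that spatial cost is a toy cost model. The cycle counts below are invented for illustration only, not measurements of any real machine; the point is the shape of the formula, where distance to the owner and the location of each sharer show up explicitly:

/* Toy cost model (illustrative numbers, not real hardware latencies).
 * Cost of gaining write ownership now depends on where the owner and
 * the sharers sit, not just on "a protocol message happened". */
#include <stdio.h>

#define LOCAL_INVALIDATE   40   /* same-socket invalidate, cycles   */
#define REMOTE_INVALIDATE 200   /* cross-socket invalidate, cycles  */
#define LOCAL_TRANSFER     60   /* ownership moves within a socket  */
#define REMOTE_TRANSFER   300   /* ownership crosses the fabric     */

static int write_upgrade_cost(int writer_socket, int owner_socket,
                              int nsharers, const int *sharer_sockets) {
    int cost = (writer_socket == owner_socket) ? LOCAL_TRANSFER
                                               : REMOTE_TRANSFER;
    for (int i = 0; i < nsharers; i++)
        cost += (sharer_sockets[i] == writer_socket) ? LOCAL_INVALIDATE
                                                     : REMOTE_INVALIDATE;
    return cost;
}

int main(void) {
    int near[] = {0, 0, 0};  /* three sharers on the writer's socket */
    int far[]  = {1, 1, 1};  /* three sharers on the other socket    */
    printf("all-local  : %d cycles\n", write_upgrade_cost(0, 0, 3, near));
    printf("cross-node : %d cycles\n", write_upgrade_cost(0, 1, 3, far));
    return 0;
}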
Concept 2: NUMA Means Memory Access Cost Depends on Where You Stand
NUMA stands for Non-Uniform Memory Access.
The name says exactly what matters: memory is still shared in the logical programming model, but physically it is closer to some cores than to others.
A simplified picture looks like this:
   Socket A                              Socket B
cores + caches                        cores + caches
      |                                     |
local memory A ---- interconnect ---- local memory B
A core on socket A can often access memory attached to socket A faster than memory attached to socket B. The remote access still works, but it pays more latency and consumes more fabric bandwidth.
This changes how we should think about "shared memory":
- logically shared does not mean physically equal-cost
- remote lines may require both coherence traffic and remote memory access
- a badly placed workload can spend large amounts of time crossing node boundaries
This is why locality becomes a systems design issue:
- keep threads near the memory they touch most
- avoid unnecessary cross-node writes
- shard work so each NUMA node mostly works on its own data
A useful mental model is:
single-machine NUMA
behaves a bit like
a tiny distributed system with very fast links
Not because it is literally distributed in the same software sense, but because distance and placement now matter materially.
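To connect the locality advice above to actual code, here is a minimal placement sketch using libnuma on Linux (assumptions: libnuma is installed and you link with -lnuma; the node number and buffer size are placeholders):

/* Minimal sketch of node-local placement with libnuma (link with -lnuma).
 * Idea: run on a node and allocate from that node, so the hot data stays
 * close to the threads that touch it. Node and size are placeholders. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }

    int node = 0;                      /* placeholder: pick a real node   */
    size_t size = 64UL * 1024 * 1024;  /* 64 MiB working set              */

    /* Keep this thread's execution on the chosen node... */
    numa_run_on_node(node);

    /* ...and allocate its working set from that node's local memory. */
    char *buf = numa_alloc_onnode(size, node);
    if (!buf) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* Touch the memory: with this placement, accesses stay node-local. */
    memset(buf, 0, size);

    numa_free(buf, size);
    return 0;
}

The same placement effect can often be achieved without code changes by launching the process under numactl, which the resources section below points to.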
Concept 3: Directory Protocols Replace "Tell Everyone" with "Tell Whoever Actually Has the Line"
In small systems, coherence can often rely on snooping:
- coherence requests are broadcast on a shared medium
- every cache can observe them
- caches decide locally whether to react
That works only while the system is small enough that broadcasting is still acceptable.
As systems grow, broadcasting every relevant event to everyone becomes too expensive. Most caches do not even care about most lines.
That is where directory protocols enter.
A directory protocol adds metadata saying, in effect:
- which node owns the authoritative clean or dirty version
- which caches or nodes currently hold a copy
Then coherence actions can be targeted more precisely:
- invalidate only the sharers
- request data from the owner
- avoid flooding unrelated participants
Conceptually:
Without directory:
"Whoever has this line, react now."
With directory:
"Directory says nodes A and C have it, so only they need to react."
This improves scalability because it trades some metadata and management overhead for less pointless broadcast traffic.
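As a toy model (far simpler than any real hardware protocol), a directory entry can be pictured as an owner plus a sharer bitmask, so a write request only touches the nodes that actually hold the line:

/* Toy directory entry: one owner plus a bitmask of sharers per line.
 * Conceptual sketch only, not a description of any real implementation. */
#include <stdint.h>
#include <stdio.h>

#define NO_OWNER -1

typedef struct {
    int      owner;    /* node holding the authoritative copy, or NO_OWNER */
    uint64_t sharers;  /* bit i set => node i holds a copy                 */
} dir_entry;

/* A node asks to write the line: invalidate only the current sharers,
 * fetch the data from the owner if needed, then record the new owner. */
static void handle_write_request(dir_entry *e, int requester) {
    for (int node = 0; node < 64; node++) {
        if ((e->sharers & (1ULL << node)) && node != requester)
            printf("  invalidate copy at node %d\n", node);  /* targeted, not broadcast */
    }
    if (e->owner != NO_OWNER && e->owner != requester)
        printf("  fetch line from owner node %d\n", e->owner);

    e->owner   = requester;
    e->sharers = 1ULL << requester;  /* only the new owner holds it now */
}

int main(void) {
    dir_entry line = { .owner = 2, .sharers = (1ULL << 0) | (1ULL << 2) | (1ULL << 5) };
    printf("node 7 wants to write:\n");
    handle_write_request(&line, 7);  /* nodes 0, 2, 5 react; everyone else is untouched */
    return 0;
}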
That is the central systems design move:
- snooping spends bandwidth broadly to keep logic simple
- directories spend metadata and lookup complexity to reduce unnecessary global traffic
The moment you see that, the bigger pattern becomes clear:
- small systems often prefer simpler coordination
- large systems often need more explicit tracking so coordination stays selective
This same pattern will show up again later in distributed caches and large-scale invalidation systems.
Troubleshooting
Issue: "The hardware is coherent, so placement should not matter much."
Why it happens / is confusing: Coherence is mistaken for uniform cost.
Clarification / Fix: Coherence preserves correctness, not equal latency. NUMA systems can keep data coherent while still making remote access and cross-socket ownership transfer much more expensive.
Issue: "NUMA only matters when memory is full."
Why it happens / is confusing: People associate NUMA mainly with capacity planning or large machines.
Clarification / Fix: NUMA matters whenever placement changes access latency or coherence traffic, even when plenty of capacity remains free.
Issue: "Directory protocols are just more complicated snooping."
Why it happens / is confusing: Both solve the same high-level coherence problem.
Clarification / Fix: The real distinction is scaling strategy. Snooping assumes broad observability is affordable. Directories add explicit sharer metadata so coherence messages can stay targeted as the system grows.
Advanced Connections
Connection 1: NUMA & Directory Protocols <-> Distributed Systems Thinking
The parallel: As the machine grows, locality and targeted coordination matter more. This is the same systems lesson we see in distributed architectures: distance and fanout shape cost.
Real-world case: A NUMA box often rewards locality-aware sharding for the same reason a distributed cluster does: broad shared mutation is expensive.
Connection 2: NUMA & Directory Protocols <-> Lock Contention and Performance Profiling
The parallel: Many "lock problems" on big machines are also coherence-placement problems. A hot lock line bouncing between sockets can be expensive even before thread blocking becomes the visible symptom.
Real-world case: Profiling later in the month becomes more accurate when we ask not only "how contended is the lock?" but also "where is that line bouncing?"
Resources
Optional Deepening Resources
- [DOCS] What Every Programmer Should Know About Memory
- Link: https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
- Focus: Use the NUMA and cache sections to understand why access cost becomes topology-dependent on larger systems.
- [PAPER] A Primer on Memory Consistency and Cache Coherence
- Link: https://www.morganclaypool.com/doi/pdf/10.2200/S00346ED1V01Y201104CAC016
- Focus: Read the scaling-related coherence chapters to connect small-system MESI intuition with larger-system protocol design.
- [DOCS] Linux kernel NUMA memory policy documentation
- Link: https://docs.kernel.org/admin-guide/mm/numa_memory_policy.html
- Focus: Treat it as the most practical bridge from hardware NUMA theory to software placement decisions.
- [DOCS] numactl manual page
- Link: https://man7.org/linux/man-pages/man8/numactl.8.html
- Focus: Use it to connect node locality concepts with concrete user-space placement tools.
Key Insights
- Coherence scaling is a topology problem - Once machines stop being flat, the cost of keeping caches coherent depends heavily on where data and cores sit.
- NUMA keeps shared memory but removes uniformity - Memory is still logically shared, but local and remote access no longer cost the same.
- Directory protocols scale by being selective - They trade extra metadata for less global broadcast traffic, which is exactly the kind of trade-off large systems need.