Day 244: Cache Coherence at Scale - NUMA & Directory Protocols

A coherence protocol that works for a few caches can become the bottleneck once the machine grows large enough.


Today's "Aha!" Moment

The insight: MESI explains how copies stay coherent, but it does not by itself explain how that coordination scales when the machine has many cores, sockets, and memory regions. At that point, coherence stops being only a correctness protocol and becomes a topology problem.

Why this matters: Teams often learn coherence from a neat small-system diagram and then miss the practical cost explosion that appears in large NUMA systems. The machine is still coherent, but the price of keeping everyone coherent is no longer uniform.

The universal pattern: more caches and more distance -> more metadata and coordination needed -> topology starts to shape latency and throughput.

Concrete anchor: A shared counter on four nearby cores is one thing. The same counter bouncing between threads on different sockets is another. The value remains coherent in both cases, but the cost of ownership transfer and remote memory access can be radically different.
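
To make that concrete, here is a minimal C sketch of the mechanism on a single machine (the 64-byte line size and the timing behavior are assumptions about typical hardware; compile with something like cc -O2 -pthread). Two threads increment counters that share one cache line, then counters padded onto separate lines. The shared-line version pays for constant ownership transfer between caches; across sockets, the same mechanism simply gets more expensive.

  /* Sketch: cache-line ping-pong. Two threads bump counters that
     share one 64-byte line, then counters padded onto their own
     lines. The padded run is typically much faster because the
     line stops bouncing between the two cores' caches. */
  #include <pthread.h>
  #include <stdio.h>
  #include <time.h>

  #define ITERS 20000000UL

  /* Both counters live in the same cache line: each write by one
     thread invalidates the other thread's copy of the line. */
  struct { volatile unsigned long a, b; } shared_line;

  /* Padded so each counter occupies its own 64-byte line. */
  struct { volatile unsigned long v; char pad[56]; } padded[2];

  static void *bump_a(void *arg) {
      (void)arg;
      for (unsigned long i = 0; i < ITERS; i++) shared_line.a++;
      return NULL;
  }

  static void *bump_b(void *arg) {
      (void)arg;
      for (unsigned long i = 0; i < ITERS; i++) shared_line.b++;
      return NULL;
  }

  static void *bump_own(void *arg) {
      volatile unsigned long *c = arg;
      for (unsigned long i = 0; i < ITERS; i++) (*c)++;
      return NULL;
  }

  static double timed_pair(void *(*fa)(void *), void *aa,
                           void *(*fb)(void *), void *ab) {
      struct timespec t0, t1;
      pthread_t x, y;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      pthread_create(&x, NULL, fa, aa);
      pthread_create(&y, NULL, fb, ab);
      pthread_join(x, NULL);
      pthread_join(y, NULL);
      clock_gettime(CLOCK_MONOTONIC, &t1);
      return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  }

  int main(void) {
      printf("counters sharing a line: %.2fs\n",
             timed_pair(bump_a, NULL, bump_b, NULL));
      printf("counters on own lines:   %.2fs\n",
             timed_pair(bump_own, (void *)&padded[0].v,
                        bump_own, (void *)&padded[1].v));
      return 0;
  }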

How to recognize when this applies:

  1. The machine has multiple sockets or NUMA nodes, and hot data is written from more than one of them
  2. Throughput drops as threads spread across sockets, even though the code and data are unchanged

Common misconceptions:

  1. "The hardware is coherent, so placement should not matter much."
  2. "NUMA only matters when memory is full."
  3. "Directory protocols are just more complicated snooping."

Real-world examples:

  1. Threaded servers on big machines: Throughput can drop sharply when hot shared lines bounce between sockets.
  2. Databases and analytics engines: Local vs remote memory placement changes effective latency even when the logical program is unchanged.

Why This Matters

The problem: Small coherence mechanisms often assume that all participants can cheaply observe or react to the same traffic. That assumption weakens as machines add more cores, more cache slices, more interconnect distance, and multiple memory controllers.

Before: Coherence looks like a flat protocol problem: a fixed set of states and messages whose cost is roughly uniform, as in small MESI diagrams.

After: Coherence looks like a topology problem: the same protocol events cost very different amounts depending on where the cores, caches, and memory involved actually sit.

Real-world impact: This is why thread pinning, sharding by NUMA node, local allocation, and reduced cross-socket sharing matter in serious systems work. The hardware does keep the illusion of shared memory alive, but not at a constant price.
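
As a minimal sketch of the pinning half of that toolbox (Linux-specific: pthread_attr_setaffinity_np and sched_getcpu are GNU extensions, and the CPU number here is an illustrative assumption), the idea is to fix a shard's threads to cores on one node so their hot lines stop migrating:

  /* Pin a worker thread to a chosen CPU before it starts.
     A real server would pick the CPU set of one NUMA node
     rather than hard-coding CPU 0. Compile with -pthread. */
  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>
  #include <stdio.h>

  static void *worker(void *arg) {
      (void)arg;
      /* Shard-local work runs here, always on the same core,
         so its hot cache lines stay in one place. */
      printf("worker running on CPU %d\n", sched_getcpu());
      return NULL;
  }

  int main(void) {
      cpu_set_t cpus;
      pthread_attr_t attr;
      pthread_t t;

      CPU_ZERO(&cpus);
      CPU_SET(0, &cpus);                 /* illustrative: CPU 0 */

      pthread_attr_init(&attr);
      pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);

      pthread_create(&t, &attr, worker, NULL);
      pthread_join(t, NULL);
      pthread_attr_destroy(&attr);
      return 0;
  }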


Learning Objectives

By the end of this session, you will be able to:

  1. Explain why coherence gets harder at scale - Connect machine growth to coherence traffic, metadata, and interconnect cost.
  2. Describe the role of NUMA and directory protocols - Understand how large systems avoid treating every coherence event as a global broadcast.
  3. Evaluate practical consequences - Recognize when locality, placement, and shared-write patterns will dominate performance.

Core Concepts Explained

Concept 1: Coherence Stops Being Cheap When Topology Stops Being Flat

The previous lesson showed why coherence exists at all: multiple private caches can otherwise disagree about shared writable data.

That story is manageable in a small system where a coherence event can be observed quickly by everyone who matters. But when the machine grows, the cost structure changes.

A larger system may have:

  1. Many cores, each with private caches
  2. Multiple sockets, each with its own cache hierarchy and memory controller
  3. Interconnect links of varying length and bandwidth connecting them

The key point is that coherence traffic has to travel through this topology.

That means the cost of keeping a line coherent is no longer just "some protocol message happened." It becomes:

  1. A message that must cross real links and hops
  2. Latency that depends on how far apart the participants sit
  3. Bandwidth consumed on a shared interconnect fabric

This is why scale changes the problem. The protocol still answers "Who may write?" and "Whose copies are stale?" But the answer now has a spatial cost.
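
To make "spatial cost" tangible, here is a toy cost model. The numbers are made up purely to show the shape of the problem, not to describe any real machine: the same invalidation gets more expensive as the sharers it must reach sit farther away.

  /* Toy model: the cost of invalidating n sharers depends on
     where they sit relative to the writer. Units are arbitrary
     and the constants are invented for illustration only. */
  #include <stdio.h>

  enum { SAME_SOCKET = 10, CROSS_SOCKET = 40 };

  static int invalidate_cost(const int sharer_socket[], int n,
                             int writer_socket) {
      int total = 0;
      for (int i = 0; i < n; i++)
          total += (sharer_socket[i] == writer_socket)
                       ? SAME_SOCKET : CROSS_SOCKET;
      return total;
  }

  int main(void) {
      int local[]  = { 0, 0, 0 };   /* all sharers on the writer's socket */
      int spread[] = { 0, 1, 1 };   /* same count, spread across sockets */

      printf("3 local sharers:  %d units\n", invalidate_cost(local, 3, 0));
      printf("3 spread sharers: %d units\n", invalidate_cost(spread, 3, 0));
      return 0;
  }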

The trade-off is:

That is the first bridge from coherence theory to NUMA systems.

Concept 2: NUMA Means Memory Access Cost Depends on Where You Stand

NUMA stands for Non-Uniform Memory Access.

The name says exactly what matters: memory is still shared in the logical programming model, but physically it is closer to some cores than to others.

A simplified picture looks like this:

Socket A                 Socket B
 cores + caches          cores + caches
    |                        |
 local memory A ---- interconnect ---- local memory B

A core on socket A can often access memory attached to socket A faster than memory attached to socket B. The remote access still works, but it pays more latency and consumes more fabric bandwidth.
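
One practical lever is to allocate memory on the node where its consumers run. A minimal sketch, assuming Linux with libnuma installed (link with -lnuma; the buffer size and node number are illustrative):

  /* Allocate a buffer whose pages are bound to a specific NUMA
     node, so threads pinned to that node's cores access it locally. */
  #include <numa.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(void) {
      if (numa_available() < 0) {
          fprintf(stderr, "NUMA is not supported here\n");
          return 1;
      }

      size_t bytes = 1 << 20;   /* 1 MiB, illustrative */
      int node = 0;             /* pick the node the consuming threads run on */

      void *buf = numa_alloc_onnode(bytes, node);
      if (buf == NULL) return 1;

      /* ... touch and use buf from threads running on node 0 ... */

      numa_free(buf, bytes);
      return 0;
  }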

This changes how we should think about "shared memory":

  1. Logically, there is still one address space that every core can read and write
  2. Physically, that address space is split across nodes, and distance determines cost

This is why locality becomes a systems design issue:

  1. Data should be allocated near the threads that use it most
  2. Threads that share heavily should be placed on the same node when possible

A useful mental model is:

  single-machine NUMA
  behaves a bit like
  a tiny distributed system with very fast links

Not because it is literally distributed in the same software sense, but because distance and placement now matter materially.

Concept 3: Directory Protocols Replace "Tell Everyone" with "Tell Whoever Actually Has the Line"

In small systems, coherence can often rely on snooping:

That works only while the system is small enough that broadcasting is still acceptable.

As systems grow, broadcasting every relevant event to everyone becomes too expensive. Most caches do not even care about most lines.

That is where directory protocols enter.

A directory protocol adds metadata saying, in effect:

  "For this memory block, here is which caches currently hold a copy, and in what state."

Then coherence actions can be targeted more precisely:

  1. An invalidation goes only to the caches the directory lists as sharers
  2. A read miss can be forwarded to the current owner instead of being broadcast

Conceptually:

Without directory:
  "Whoever has this line, react now."

With directory:
  "Directory says nodes A and C have it, so only they need to react."

This improves scalability because it trades some metadata and management overhead for less pointless broadcast traffic.
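
A toy sketch makes the bookkeeping concrete. The bitmask layout and the 16-cache system are invented here for illustration (real directories use a variety of encodings), but the payoff is the same: invalidations go only to recorded sharers, where snooping would have disturbed everyone.

  /* Directory entry for one cache line: who shares it, who owns it.
     On a write, only the listed sharers receive invalidations. */
  #include <stdint.h>
  #include <stdio.h>

  #define NUM_CACHES 16

  typedef struct {
      uint32_t sharers;   /* bit i set => cache i holds a copy */
      int      owner;     /* cache with write permission, or -1 */
  } dir_entry;

  /* Cache `reader` loads the line: record it as a sharer. */
  static void handle_read(dir_entry *e, int reader) {
      e->sharers |= 1u << reader;
      e->owner = -1;               /* line is now shared */
  }

  /* Cache `writer` wants exclusive access: invalidate only the
     caches the directory actually lists, then record ownership. */
  static int handle_write(dir_entry *e, int writer) {
      int invalidations = 0;
      for (int i = 0; i < NUM_CACHES; i++)
          if (i != writer && (e->sharers & (1u << i)))
              invalidations++;     /* one targeted message to cache i */
      e->sharers = 1u << writer;
      e->owner = writer;
      return invalidations;
  }

  int main(void) {
      dir_entry line = { .sharers = 0, .owner = -1 };

      handle_read(&line, 0);
      handle_read(&line, 2);       /* caches 0 and 2 share the line */

      printf("directory: %d targeted invalidations\n",
             handle_write(&line, 5));
      printf("snooping:  %d broadcast snoops\n", NUM_CACHES - 1);
      return 0;
  }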

That is the central systems design move:

  Replace "tell everyone, and let them filter" with "track who cares, and tell only them."

The moment you see that, the bigger pattern becomes clear:

  Coordination scales by keeping metadata about who is interested, not by shouting louder.

This same pattern will show up again later in distributed caches and large-scale invalidation systems.


Troubleshooting

Issue: "The hardware is coherent, so placement should not matter much."

Why it happens / is confusing: Coherence is mistaken for uniform cost.

Clarification / Fix: Coherence preserves correctness, not equal latency. NUMA systems can keep data coherent while still making remote access and cross-socket ownership transfer much more expensive.

Issue: "NUMA only matters when memory is full."

Why it happens / is confusing: People associate NUMA mainly with capacity planning or large machines.

Clarification / Fix: NUMA matters whenever placement changes access latency or coherence traffic, even when plenty of capacity remains free.

Issue: "Directory protocols are just more complicated snooping."

Why it happens / is confusing: Both solve the same high-level coherence problem.

Clarification / Fix: The real distinction is scaling strategy. Snooping assumes broad observability is affordable. Directories add explicit sharer metadata so coherence messages can stay targeted as the system grows.


Advanced Connections

Connection 1: NUMA & Directory Protocols <-> Distributed Systems Thinking

The parallel: As the machine grows, locality and targeted coordination matter more. This is the same systems lesson we see in distributed architectures: distance and fanout shape cost.

Real-world case: A NUMA box often rewards locality-aware sharding for the same reason a distributed cluster does: broad shared mutation is expensive.

Connection 2: NUMA & Directory Protocols <-> Lock Contention and Performance Profiling

The parallel: Many "lock problems" on big machines are also coherence-placement problems. A hot lock line bouncing between sockets can be expensive even before thread blocking becomes the visible symptom.

Real-world case: Profiling later in the month becomes more truthful when we ask not only "how contended is the lock?" but also "where is that line bouncing?"



Key Insights

  1. Coherence scaling is a topology problem - Once machines stop being flat, the cost of keeping caches coherent depends heavily on where data and cores sit.
  2. NUMA keeps shared memory but removes uniformity - Memory is still logically shared, but local and remote access no longer cost the same.
  3. Directory protocols scale by being selective - They trade extra metadata for less global broadcast traffic, which is exactly the kind of trade-off large systems need.
