Day 246: Memory Allocators in Production - jemalloc, Arenas & Fragmentation

Caching, Workers, and Performance · Lesson 018 · 30 min · intermediate

A memory allocator is not a hidden runtime detail; it is a policy engine for latency, memory footprint, and contention.


Today's "Aha!" Moment

The insight: Once a system keeps large working sets in RAM, the allocator starts shaping production behavior almost as much as the application data structures do. It decides how requests become size classes, where freed memory waits, how much lock contention appears, and when pages go back to the OS.

Why this matters: Teams often profile serialization, indexing, or network paths while ignoring the allocator underneath them. Then they get surprised by RSS growth, fragmentation, tail latency spikes, or throughput collapse under multi-threaded load.

The universal pattern: allocation policy -> reuse behavior -> fragmentation/contention -> observable production cost.

Concrete anchor: A Redis-like service may store mostly small objects, experience bursty writes, and later show RSS far above the logical dataset size. That gap is often about allocator behavior, not only about "too much data."

How to recognize when this applies: watch for RSS creeping well above the logical dataset, tail latency spikes under multi-threaded load, or throughput collapse that profiles cannot pin on business logic.

Common misconceptions: that high RSS always means a leak, that switching allocators fixes any allocation pattern, and that free() should immediately shrink process memory (each unpacked in Troubleshooting below).

Real-world examples:

  1. Cache servers: Redis and Memcached-style workloads care deeply about allocator fragmentation because RAM efficiency directly affects capacity and cost.
  2. Application servers: Highly concurrent services can shift from useful work to allocator contention if fast paths and locality are poor.

Why This Matters

The problem: In memory-heavy systems, the allocator becomes part of the architecture. If it rounds badly, shards poorly, caches too aggressively, or returns memory too slowly, the system can look slower, fatter, or more unstable even when the application logic is correct.

Before: memory is treated as one undifferentiated blob, and allocator behavior stays invisible until RSS or latency misbehaves.

After: allocation policy is understood as part of the architecture, and memory symptoms map to specific allocator mechanisms.

Real-world impact: Better allocator choices can reduce infra cost, lower latency variance, and make performance investigations much more honest because "memory" is no longer treated as one undifferentiated blob.


Learning Objectives

By the end of this session, you will be able to:

  1. Explain why allocator choice matters in production - Connect allocator policy to RSS, fragmentation, latency, and contention.
  2. Describe how modern allocators work internally - Trace size classes, arenas, thread caches, and purging back to their operational effects.
  3. Compare allocator trade-offs - Decide when jemalloc, glibc malloc, or tcmalloc is a better fit for a workload.

Core Concepts Explained

Concept 1: Allocators Matter Because "Bytes in Use" and "Memory Cost" Are Not the Same Thing

Imagine a cache service that stores 40 GB of useful data but the process RSS sits near 60 GB after a bursty write period. The dataset did not suddenly become 50% larger. The missing explanation often lives in fragmentation, cached free objects, page reuse policy, and when memory is returned to the kernel.

The allocator sits between application objects and OS pages. Every allocation request is rounded, routed, reused, and eventually released according to allocator policy. That means the allocator helps answer questions like: Why is RSS so much larger than the logical dataset? Where does freed memory wait before it is reused? When do pages actually go back to the OS?

This is why the allocator becomes visible in production systems that are long-lived, memory-dense, and highly concurrent.

Examples include Redis, Memcached-style caches, proxies, stream processors, and services that frequently build and discard many small objects.

The central trade-off is straightforward: speed comes from keeping memory close in caches, arenas, and size-class bins, while a small footprint comes from giving memory back promptly.

So allocator behavior is never "free." Low latency, low contention, and low fragmentation pull in related but not identical directions.

Concept 2: Modern Allocators Buy Speed with Size Classes, Arenas, and Caches

Modern general-purpose allocators do not manage every allocation independently. They group work so the common case is cheap.

An intentionally simplified mental model looks like this:

allocation request
    ->
size class chosen
    ->
thread or CPU-local fast cache
    ->
arena / central allocator state
    ->
OS pages

Three ideas matter most:

1. Size classes

Small allocations are rounded into buckets such as 16 B, 32 B, 64 B, and so on. That makes reuse fast and bookkeeping simpler, but creates internal fragmentation because an object rarely fits the bucket perfectly.

2. Arenas

Instead of one global heap lock, allocators often maintain multiple arenas so different threads can allocate concurrently with less lock contention. This helps scaling, but it can also spread memory across semi-independent heaps, which makes reuse less globally efficient.

3. Thread caches or CPU-local caches

Fast paths avoid central locks by keeping recently freed objects close to the allocating thread or CPU. This is excellent for throughput and tail latency, but it also means freed memory may remain cached rather than immediately visible as reclaimable RSS.

This is where allocator personalities start to differ: glibc malloc leans on per-thread arenas, tcmalloc is built around thread- and CPU-local caches, and jemalloc combines multiple arenas with thread caches and explicit decay control over when dirty pages return to the OS.

One more production detail matters: purging / decay.

Returning memory to the OS immediately can be expensive and can hurt locality if the workload soon reallocates. Returning it too slowly can make RSS stay stubbornly high. So allocators usually choose a middle ground: keep some memory hot, purge some memory lazily, and expose tuning knobs for the rest.

Concept 3: jemalloc Is a Good Anchor Because It Makes the Trade-offs Visible

jemalloc is worth teaching here not because it is universally best, but because it makes allocator design legible.

In practice, teams choose it when they care about some combination of fragmentation control, predictable multi-threaded scaling, tunable purging behavior, and detailed runtime statistics for memory introspection.

That makes it a natural fit for memory-dense infrastructure like caches and other always-on services.

A useful comparison is: glibc malloc is the safe general-purpose default; tcmalloc leans toward raw allocation throughput in heavily threaded services; jemalloc leans toward fragmentation control and observability in long-lived, memory-dense processes.

But the real rule is this: allocator choice is a measurement problem, not a branding problem.

The metrics that matter are usually RSS versus logical dataset size, fragmentation ratio, time spent in allocation paths on CPU profiles, lock contention under concurrent load, and how quickly memory is reclaimed after load drops.

This also explains why allocator choice connects naturally to the next performance block of the month. If you do not profile memory and contention correctly, allocator problems get misdiagnosed as generic "slowness." Flame graphs, lock analysis, and RSS/heap comparisons are how you tell whether the allocator is actually your bottleneck.

The trade-off here is the real production one: memory held close for speed versus memory returned for footprint, and per-thread sharding for throughput versus globally efficient reuse.


Troubleshooting

Issue: "RSS is much larger than the logical dataset, so the service must be leaking."

Why it happens / is confusing: Teams often treat RSS as if it meant "currently useful objects only."

Clarification / Fix: Check fragmentation, allocator caches, arena spread, and purging behavior before concluding there is a leak. A leak means memory is unreachable but still retained. High RSS can also mean memory is still held by allocator policy even after the application has freed its objects.

Issue: "Switching allocators will fix any bad memory pattern."

Why it happens / is confusing: Allocators are sometimes discussed as magic performance buttons.

Clarification / Fix: Allocators can improve the cost of an allocation pattern, but they do not remove the pattern itself. If the program churns objects unnecessarily or keeps poor lifetime locality, the allocator can only soften the consequences.

Issue: "If frees do not instantly shrink memory, the allocator is broken."

Why it happens / is confusing: People expect free() to map directly to immediate RSS reduction.

Clarification / Fix: Most allocators delay or batch memory return because immediate release can be expensive and can hurt short-term reuse. The right question is not "Did RSS drop instantly?" but "Is memory reuse and eventual reclamation acceptable for this workload?"


Advanced Connections

Connection 1: Memory Allocators <-> Redis, Caches, and In-Memory Systems

The parallel: Cache servers make allocator behavior visible because RAM efficiency directly becomes capacity, hardware cost, and eviction pressure.

Real-world case: A cache service with poor allocator fit may evict sooner or require more nodes even when the application data model itself has not changed.

Connection 2: Memory Allocators <-> Profiling, Flame Graphs, and Contention Analysis

The parallel: Allocator issues often masquerade as generic slowness until profiles reveal time in allocation paths, lock contention, or page-management activity.

Real-world case: The performance lessons later in this month become much more powerful once you know that "memory" problems can come from allocator policy, not only from business code.




Key Insights

  1. Allocators are part of system behavior - In long-lived, memory-heavy services, allocator policy directly shapes RSS, fragmentation, latency, and contention.
  2. Fast allocation paths always hide a trade-off - Size classes, arenas, and caches make allocation cheaper, but they can also retain memory and complicate reuse.
  3. Choose allocators with benchmarks, not folklore - jemalloc, glibc malloc, and tcmalloc optimize different pain points, so the right answer depends on workload shape and measured bottlenecks.
