ZeRO & DeepSpeed - Training at Trillion-Parameter Scale


The core idea: once you know what data and tokens you want to train on, the next hard question is whether the model state can even fit in memory. ZeRO and DeepSpeed matter because large-scale LLM training is often limited less by raw FLOPs than by how badly parameters, gradients, and optimizer states are replicated.


Today's "Aha!" Moment

The insight: A huge model does not only store weights. During training, you also need:

  • gradients for every parameter
  • optimizer state (for Adam-style optimizers, momentum and variance terms per parameter)
  • activations saved for the backward pass

In naive data parallelism, too much of that state is replicated on every GPU. That means training fails not because the model is mathematically impossible, but because memory is being used wastefully.

Why this matters: ZeRO is best understood as a memory-partitioning strategy for distributed training. DeepSpeed is the system that operationalizes that strategy and combines it with offload, checkpointing, and execution optimizations.

Concrete anchor: With Adam-style optimization, optimizer state alone can cost multiple times the parameter memory. At large scale, "the model size" is not the real memory footprint. The real footprint is model state plus training state.
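A minimal back-of-the-envelope sketch (in Python) of that accounting, assuming the common mixed-precision recipe of fp16 weights and gradients plus fp32 master weights, momentum, and variance for Adam; the 7B parameter count is only an illustrative figure:

```python
# Rough model-state cost per parameter under mixed-precision Adam
# (assumed recipe: fp16 weights + fp16 grads + fp32 master weights,
# momentum, and variance). Activations are not counted here.
PARAMS = 7e9                                  # illustrative model size (7B)

weights_fp16 = 2 * PARAMS                     # roughly what inference needs
grads_fp16 = 2 * PARAMS
adam_state_fp32 = (4 + 4 + 4) * PARAMS        # master weights + momentum + variance

inference_gib = weights_fp16 / 2**30
training_gib = (weights_fp16 + grads_fp16 + adam_state_fp32) / 2**30

print(f"inference weights:    ~{inference_gib:.0f} GiB")   # ~13 GiB
print(f"training model state: ~{training_gib:.0f} GiB")    # ~104 GiB, before activations
```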

Keep this mental hook in view: Large-model training is not only a compute problem. It is a state-placement problem.


Why This Matters

The previous lesson (20/01.md, Dataset Curation & Tokenization) established that pretraining quality depends on the token stream you feed the model.

This lesson asks the next systems question: once the data is ready, can the full training state for the model you want actually fit on the hardware you have?

That is where the training stack changes from "replicate all model and optimizer state on every worker" to "deliberately decide where each piece of training state lives and materialize it only when needed."

Without that shift, scaling hits memory walls long before it hits theoretical model ambition.


Learning Objectives

By the end of this session, you should be able to:

  1. Explain why naive data parallelism becomes memory-inefficient for large LLM training.
  2. Describe what ZeRO Stage 1, 2, and 3 partition and why each stage saves more memory.
  3. Evaluate when DeepSpeed and ZeRO-style strategies are worth the communication and complexity cost compared with simpler setups.

Core Concepts Explained

Concept 1: ZeRO Exists Because Training State Is Much Larger Than "Just the Weights"

For example, a team can load a model for inference on a cluster of GPUs, but training crashes immediately with out-of-memory errors once gradients and Adam states are enabled.

At a high level, inference memory and training memory are different worlds. Training must remember far more state than inference.

Mechanically: A training job typically carries at least these categories of state:

  • the parameters themselves (often fp16/bf16 working copies plus fp32 master copies under mixed precision)
  • gradients for every parameter
  • optimizer state (for Adam-style optimizers, momentum and variance per parameter)
  • activations and temporary buffers needed for the backward pass

Traditional data parallel training replicates most model and optimizer state on every worker. That replication is simple, but very expensive.

For large models, the waste is obvious: every additional data-parallel worker stores another full copy of parameters, gradients, and optimizer state, even though those copies are kept numerically identical by synchronization.

So memory pressure grows faster than people expect from just reading the parameter count.
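To see how that replication scales with the data-parallel group, a small sketch using the same assumed 16 bytes of model state per parameter; the GPU count is arbitrary:

```python
# Cluster-wide memory spent on replicated model state under naive data
# parallelism, assuming ~16 bytes of model state per parameter.
PARAMS = 7e9
BYTES_PER_PARAM = 16      # fp16 weights + fp16 grads + fp32 Adam state
NUM_GPUS = 64             # arbitrary data-parallel group size

per_copy_gib = PARAMS * BYTES_PER_PARAM / 2**30
cluster_gib = per_copy_gib * NUM_GPUS          # every worker holds a full copy
redundant_gib = cluster_gib - per_copy_gib     # only one logical copy is needed

print(f"per-GPU model state:  ~{per_copy_gib:.0f} GiB")
print(f"cluster-wide storage: ~{cluster_gib / 1024:.1f} TiB "
      f"(~{redundant_gib / 1024:.1f} TiB of it is replicated state)")
```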

In practice: teams regularly find that a model which loads fine for inference cannot even begin training on the same hardware without running out of memory.

The trade-off is clear: Full replication buys simpler programming and less fine-grained orchestration, but it wastes precious accelerator memory.

A useful mental model is: Think of distributed training as warehouse management. If every warehouse stores the full inventory, operations are simple but storage is wasteful. ZeRO starts splitting inventory across warehouses and reassembling only what is needed when it is needed.

Use this lens when a training job runs out of memory even though the parameter count alone suggests it should fit.

Concept 2: ZeRO Saves Memory by Partitioning Different Kinds of State Across Workers

For example, a cluster has enough aggregate memory to hold the training state, but no single GPU can hold the full replicated copy. ZeRO makes the cluster's total memory more usable by sharding state across workers.

At a high level, ZeRO is not "magic compression." It is a structured decision about what not to replicate everywhere.

Mechanically: ZeRO is usually taught in stages:

  1. Stage 1 - shard optimizer states across data-parallel workers
  2. Stage 2 - shard optimizer states and gradients
  3. Stage 3 - shard optimizer states, gradients, and parameters

The pattern is simple: each stage stops replicating one more category of state, so per-GPU memory keeps dropping as the data-parallel group grows.

At Stage 3, parameters themselves are no longer permanently replicated in full on every worker. Instead, relevant parameter shards are gathered when needed for computation and released afterward.
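A sketch of the per-GPU arithmetic behind those stages, following the byte breakdown popularized by the ZeRO paper (2 bytes of fp16 parameters, 2 bytes of fp16 gradients, 12 bytes of fp32 Adam state per parameter); activations and buffers are ignored and the numbers are illustrative:

```python
def model_state_per_gpu_gib(params: float, dp_degree: int, stage: int) -> float:
    """Rough per-GPU model-state memory (GiB) for naive DP vs. ZeRO stages.

    Assumes mixed-precision Adam: 2 bytes (fp16 params) + 2 bytes (fp16 grads)
    + 12 bytes (fp32 master weights, momentum, variance) per parameter.
    Ignores activations, fragmentation, and temporary buffers.
    """
    p, g, o = 2 * params, 2 * params, 12 * params
    if stage == 0:                       # naive data parallelism: replicate everything
        total = p + g + o
    elif stage == 1:                     # shard optimizer states
        total = p + g + o / dp_degree
    elif stage == 2:                     # shard optimizer states + gradients
        total = p + (g + o) / dp_degree
    elif stage == 3:                     # shard optimizer states + gradients + params
        total = (p + g + o) / dp_degree
    else:
        raise ValueError("stage must be 0-3")
    return total / 2**30

for stage in range(4):
    print(f"stage {stage}: ~{model_state_per_gpu_gib(7e9, 64, stage):.1f} GiB per GPU")
# stage 0: ~104.3, stage 1: ~27.3, stage 2: ~14.5, stage 3: ~1.6 GiB (7B params, 64 GPUs)
```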

DeepSpeed is the system layer that makes this practical with features like offload of optimizer state and parameters to CPU or NVMe, activation checkpointing, mixed-precision execution, and communication scheduling that overlaps collectives with compute.

In practice: each step up in stage buys more memory headroom but leans harder on the interconnect, so the right stage depends on the model size and the hardware topology.

The trade-off is clear: As you reduce memory replication, you usually increase orchestration complexity and communication overhead.

A useful mental model is: ZeRO turns memory from "replicate everything locally" into "materialize just enough globally."

Use this lens when the cluster's aggregate memory is clearly sufficient but no single GPU can hold a fully replicated copy of the training state.

Concept 3: DeepSpeed Is Valuable Because It Turns Scaling Tricks Into an Operational Training Stack

For example, a research team knows the theory of sharding optimizer state, but production training still fails because throughput collapses, checkpoints become unwieldy, and restarts are fragile.

At a high level, large-scale training is not solved by one algorithm. It is solved by an execution stack that coordinates memory layout, optimizer behavior, checkpoint format, and cluster realities.

Mechanically: DeepSpeed is useful not only because it implements ZeRO, but because it packages several operational concerns together: memory layout (ZeRO stages, offload targets), optimizer behavior (mixed precision, gradient accumulation and clipping), checkpoint saving and loading, and launch/configuration for real clusters.
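As a rough illustration of how those concerns surface in one place, this is the general shape of a DeepSpeed configuration, written here as a Python dict; every value is a placeholder to be tuned against the real model and cluster:

```python
# Sketch of a DeepSpeed-style configuration that expresses several of the
# operational concerns above declaratively. Values are illustrative only.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "gradient_clipping": 1.0,                        # optimizer behavior
    "fp16": {"enabled": True},                       # mixed-precision execution
    "zero_optimization": {                           # memory layout
        "stage": 2,                                  # shard optimizer state + grads
        "offload_optimizer": {"device": "cpu"},      # push optimizer state off-GPU
        "overlap_comm": True,                        # hide communication behind compute
    },
    "activation_checkpointing": {"partition_activations": True},
}
```

In typical usage, a config like this is handed to deepspeed.initialize together with the model, and the returned engine then owns the backward pass, optimizer step, and checkpoint save/load.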

This matters because real training jobs are constrained by more than raw math: interconnect bandwidth, checkpoint size and restart time, node failures, and the effort of debugging a fragile stack all shape what is actually trainable.

In practice: the memory savings only pay off if the overall stack stays fast enough, checkpointable, and recoverable across the lifetime of a long run.

The trade-off is clear: DeepSpeed expands the feasible training envelope, but it also introduces more system surface area to tune, monitor, and debug.

A useful mental model is: ZeRO is the memory strategy; DeepSpeed is the broader training platform that helps you live with that strategy at scale.

Use this lens when deciding whether the extra system surface area of DeepSpeed is worth it compared with a simpler, smaller-scale setup.


Troubleshooting

Issue: "The model fits for inference, so why does training still OOM?"

Why it happens / is confusing: Parameter size is easy to estimate, but people forget gradients, optimizer state, and activations.

Clarification / Fix: Separate memory accounting into parameters, gradients, optimizer state, and activations. Training memory is often several times larger than inference memory.
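One practical way to do that accounting empirically is to snapshot CUDA allocator statistics between phases of a single step; a minimal PyTorch sketch, where the small Linear model and random batch are stand-ins for the real workload:

```python
import torch

def report(tag: str) -> None:
    # Currently allocated vs. peak memory on the current CUDA device, in MiB.
    alloc = torch.cuda.memory_allocated() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"{tag:>16}: allocated {alloc:8.1f} MiB, peak {peak:8.1f} MiB")

model = torch.nn.Linear(4096, 4096).cuda()       # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters())
batch = torch.randn(64, 4096, device="cuda")

torch.cuda.reset_peak_memory_stats()
report("after model load")                       # roughly the inference footprint
loss = model(batch).sum()
report("after forward")                          # + activations
loss.backward()
report("after backward")                         # + gradients
optimizer.step()
report("after optimizer")                        # + Adam state (created on first step)
```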

Issue: "Stage 3 saved memory, but throughput got much worse."

Why it happens / is confusing: Less memory use does not mean less total cost. More sharding can require more all-gathers, more synchronization, and more sensitivity to interconnect speed.

Clarification / Fix: Treat ZeRO stage choice as a latency/bandwidth trade-off, not a universal upgrade path. Re-evaluate with actual hardware topology and profiler data.
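If a stage change hurts throughput, one way to gather that profiler data is to wrap a few steps in the PyTorch profiler; a minimal sketch, where train_step is a placeholder for the actual forward/backward/step under your configuration:

```python
import torch
from torch.profiler import ProfilerActivity, profile

def train_step() -> None:
    # Placeholder for a real forward/backward/optimizer step.
    pass

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        train_step()

# Communication kernels (all-gather, reduce-scatter, all-reduce) dominating
# this table is the usual signature of a bandwidth-bound stage choice.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```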

Issue: "Offload made the run possible, but the job is now painfully slow."

Why it happens / is confusing: Offload shifts pressure from GPU memory to slower devices and buses.

Clarification / Fix: Use offload when the goal is feasibility first. If throughput matters, compare that solution against smaller model changes, gradient checkpointing, or different parallelism strategies.
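For comparison, gradient (activation) checkpointing trades recomputation for activation memory without touching the PCIe or NVMe path; a minimal PyTorch sketch, with the small Sequential block standing in for an expensive sub-module:

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(          # stand-in for an expensive transformer block
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
)
x = torch.randn(8, 4096, requires_grad=True)

# Activations inside `block` are not kept for backward; they are recomputed
# during the backward pass, trading extra compute for lower activation memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```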


Advanced Connections

Connection 1: ZeRO <-> FSDP and State Sharding

The deeper pattern is broader than one library: modern large-model training keeps asking the same question. Which state must be resident locally, and which state can be sharded and rematerialized on demand?
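PyTorch's FullyShardedDataParallel (FSDP) is the framework-native expression of the same ZeRO Stage 3-style idea; a minimal wrapping sketch, assuming a torchrun launch so the usual environment variables are set:

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch via torchrun, which provides RANK/WORLD_SIZE/LOCAL_RANK.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Sequential(           # stand-in for a real transformer
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda()

# Parameters, gradients, and optimizer state are sharded across ranks; full
# parameters are materialized per-module only around compute - the same
# "gather when needed, release afterward" pattern as ZeRO Stage 3.
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```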

Connection 2: ZeRO <-> Model, Tensor, and Pipeline Parallelism

ZeRO mainly attacks replication inside data parallelism. It does not replace every other scaling strategy. Real large-model training often combines data parallelism (with ZeRO-style sharding), tensor parallelism within layers, and pipeline parallelism across groups of layers.

That is why distributed training design becomes a composition problem rather than a single trick.



Key Insights

  1. Large-model training is dominated by training state, not just parameter count - weights alone understate the real memory problem.
  2. ZeRO works by removing unnecessary replication - each stage shards another category of state across workers.
  3. DeepSpeed matters because scale is operational, not just theoretical - memory savings only matter if the training stack is still fast and recoverable enough to use.
