
Day 308: Training Optimizations - Making LLMs Train Faster & Better

The core idea: training optimizations are not a bag of random speed hacks. They are choices about where the real bottleneck lives in large-scale training: memory, communication, numerical stability, input pipeline throughput, or wasted work per useful token.


Today's "Aha!" Moment

The insight: A pretraining run can fail in several different ways while still looking "busy":

  - GPUs show high utilization while the dataloader starves them between steps.
  - Memory pressure forces tiny batches or constant recomputation, so each step does little useful work.
  - Communication overhead grows faster than compute as more GPUs are added.
  - Numerical instability forces rollbacks and restarts that throw away real progress.

That is why training optimization is really bottleneck optimization. The right trick depends on what is actually limiting useful progress.

Why this matters: After the previous lesson on pretraining from scratch (20/03.md), we now have the full pretraining pipeline in view. The next step is making that pipeline efficient enough to be economically viable.

Concrete anchor: Mixed precision, gradient checkpointing, fused kernels, better data loaders, flash attention, and schedule tuning do not all solve the same problem. They attack different pieces of wasted time or wasted memory.

Keep this mental hook in view: Training optimization is the art of paying less compute for the same useful gradient signal.


Why This Matters

Once a team decides to pretrain from scratch, the challenge is no longer only:

"Can we build a working training pipeline at all?"

It becomes:

"Can we run that pipeline fast enough, and cheaply enough, for the run to be worth finishing?"

That is why this lesson sits here before the adaptation block:

If pretraining is expensive by nature, optimization determines whether that expense is merely high or completely impractical.


Learning Objectives

By the end of this session, you should be able to:

  1. Explain the main categories of training bottlenecks in LLM pretraining.
  2. Describe how common optimization techniques improve memory use, throughput, or stability and what they cost in return.
  3. Evaluate training optimizations as trade-offs rather than as universally good defaults.

Core Concepts Explained

Concept 1: Training Speed Is Usually Limited by a Small Number of Dominant Bottlenecks

For example, a team adds more GPUs to a run, but tokens-per-second barely improves. Another team reduces memory usage successfully, but the job becomes slower because communication and recomputation rise.

At a high level, large-scale training is rarely "globally inefficient" in a vague way. It is usually constrained by one or two dominant bottlenecks:

  - Memory (parameters, optimizer state, activations)
  - Communication between devices
  - Raw compute per step
  - Input pipeline throughput
  - Numerical stability, which shows up as restarts and wasted work

Mechanically: At each step, useful training work competes with overhead like:

  - Waiting on the dataloader for the next batch
  - Synchronizing gradients and sharded state across devices
  - Recomputing activations that were dropped to save memory
  - Checkpointing and logging

Optimization starts by identifying which of these is actually constraining tokens-per-second or stable batch size.

In practice: profile before changing anything. Measure end-to-end tokens-per-second, then split step time into data wait, compute, and communication so the dominant bottleneck is visible, as in the sketch below.
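To make that concrete, here is a minimal profiling sketch in PyTorch. The names `model`, `loader`, and `optimizer` are hypothetical placeholders, and the `.loss` access assumes a Hugging Face-style model output; the pattern of splitting data wait from compute is the point.

```python
import time
import torch

def profile_steps(model, loader, optimizer, num_steps=50):
    """Split per-step wall-clock time into data wait vs. compute."""
    data_time, compute_time = 0.0, 0.0
    batches = iter(loader)
    for _ in range(num_steps):
        t0 = time.perf_counter()
        batch = next(batches)                 # time spent waiting on the input pipeline
        t1 = time.perf_counter()

        loss = model(**batch).loss            # forward (assumes HF-style output)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        if torch.cuda.is_available():
            torch.cuda.synchronize()          # make async GPU work visible to the host clock
        t2 = time.perf_counter()

        data_time += t1 - t0
        compute_time += t2 - t1

    total = data_time + compute_time
    print(f"data wait: {100 * data_time / total:.1f}%, "
          f"compute: {100 * compute_time / total:.1f}%")
```

A high data-wait fraction points at the input pipeline, not the model code; a high compute fraction means kernel-level or parallelism changes are worth considering.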

The trade-off is clear: You gain efficiency only when the optimization matches the current bottleneck. Otherwise, you often add complexity without moving the real limit.

A useful mental model is: Think of training like a factory line. Making one station faster helps only if that station was actually the one creating the queue.

Use this lens when: adding hardware barely improves throughput, a memory fix slows the run down, or a team is debating which optimization to try first.

Concept 2: Most Training Optimizations Trade Memory, Compute, and Communication Against Each Other

For example, a team enables activation checkpointing and finally fits the target context length, but step time rises. Another team adopts mixed precision and gets a large speedup, but now needs careful loss scaling and stability monitoring.

At a high level, optimization in training is rarely free. Many techniques work by shifting cost from one subsystem to another.

Mechanically: Common examples include:

  - Mixed precision: faster math and smaller activations, paid for with loss scaling and stability monitoring (sketched below)
  - Gradient/activation checkpointing: less activation memory, paid for with recomputation
  - Fused kernels and flash attention: fewer memory round-trips, paid for with kernel constraints and engineering complexity
  - ZeRO-style state sharding: less memory per device, paid for with communication

These are not one-dimensional upgrades. Each one changes the cost surface differently.
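As one concrete instance, here is a minimal mixed-precision sketch using PyTorch's `autocast` and `GradScaler`, assuming fp16 on CUDA; `model`, `loader`, and `optimizer` are placeholders. The scaler calls mark exactly where the stability cost enters the loop.

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for batch in loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(**batch).loss       # forward runs in fp16 where it is safe
    scaler.scale(loss).backward()        # scale the loss so fp16 gradients do not underflow
    scaler.step(optimizer)               # unscales gradients; skips the step on inf/nan
    scaler.update()                      # adapts the scale factor over time
```

With bf16 instead of fp16, loss scaling is usually unnecessary, which is one reason bf16 has become a common default on hardware that supports it.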

In practice: enable one technique at a time and measure both sides of the exchange: the resource saved and the resource spent. A memory win that doubles step time can still be the right call, but it should be a deliberate one.

The trade-off is clear: Most optimizations buy one scarce resource by spending another.
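Activation checkpointing is the textbook case of that exchange: it spends compute to buy memory. A minimal sketch with `torch.utils.checkpoint`, where the block list stands in for your own transformer layers:

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    """Wraps a stack of layers so their activations are recomputed in backward."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are discarded after forward and
            # recomputed during backward: less memory, more compute.
            x = checkpoint(block, x, use_reentrant=False)
        return x
```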

A useful mental model is: You are rebalancing a budget across memory, compute, bandwidth, and engineering complexity.
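Fused attention is a clear example of that rebalancing: PyTorch's `scaled_dot_product_attention` can dispatch to a FlashAttention-style fused kernel when one is available, avoiding the explicit sequence-by-sequence attention matrix. A minimal sketch with illustrative shapes:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim); shapes are illustrative only
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

# One fused call instead of softmax(q @ k^T / sqrt(d)) @ v with an explicit
# (seq_len x seq_len) intermediate, cutting attention memory traffic sharply.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```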

Use this lens when: a technique looks like a pure win on a single metric, or a proposal lists its benefit without naming what it costs.

Concept 3: The Best Training Optimization Is Often the One That Improves Useful Convergence Per Dollar, Not Raw Step Speed

For example, one configuration runs fewer tokens per second but converges more stably with a larger global batch and fewer restarts. Another is faster per step but wastes time on instability, bad packing, or noisy gradients.

At a high level, "Faster training" is ambiguous. The real question is not only step speed, but how efficiently the run turns money and time into model quality.

Mechanically: Useful optimization should be judged on a broader set of outcomes:

  - Time and cost to reach a target loss or capability, not just tokens-per-second
  - Stability: how often the run diverges, rolls back, or restarts
  - Data efficiency: how much of each batch carries useful gradient signal (packing, deduplication)
  - Operational load: how much engineering attention the configuration demands

That means optimization decisions belong partly to systems engineering and partly to learning dynamics.
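A back-of-envelope sketch of that judgment, with purely illustrative numbers: a configuration that is slower per step can still be cheaper end to end once instability and restarts are counted.

```python
def cost_to_target(tokens_per_sec, tokens_to_target,
                   restart_overhead_hours, gpu_hour_cost, num_gpus):
    """Total dollars to reach a target loss, including time lost to restarts."""
    train_hours = tokens_to_target / tokens_per_sec / 3600
    return (train_hours + restart_overhead_hours) * gpu_hour_cost * num_gpus

# Config A: faster per step, but unstable enough to lose ~30 hours to rollbacks.
cost_a = cost_to_target(2.0e6, 1.0e12, restart_overhead_hours=30,
                        gpu_hour_cost=2.5, num_gpus=256)
# Config B: ~15% slower per step, but converges without restarts.
cost_b = cost_to_target(1.7e6, 1.0e12, restart_overhead_hours=0,
                        gpu_hour_cost=2.5, num_gpus=256)

print(f"A: ${cost_a:,.0f}  B: ${cost_b:,.0f}")  # B wins despite lower throughput
```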

In practice: track time-to-target-loss and cost-per-useful-token alongside throughput dashboards, and compare candidate configurations on those end-to-end numbers rather than on step speed alone.

The trade-off is clear: Maximizing local throughput is not the same as maximizing end-to-end training efficiency.

A useful mental model is: Training is a business process as much as a technical one. The question is not only "how fast is the engine?" but "how much useful distance do we cover for the fuel we spend?"

Use this lens when: two configurations disagree on step speed versus stability, or a benchmark win fails to translate into a cheaper or faster path to the target quality.


Troubleshooting

Issue: "GPU utilization is high, so the training job must already be well optimized."

Why it happens / is confusing: Busy hardware looks healthy, but utilization alone says little about whether the right work is happening efficiently.

Clarification / Fix: Check end-to-end throughput, dataloader wait time, communication overhead, and convergence quality. A busy cluster can still be wasting a lot of time.

Issue: "This optimization reduced memory, so it must be an overall win."

Why it happens / is confusing: Memory relief is visible and urgent, so it is easy to ignore what was spent to get it.

Clarification / Fix: Measure the new cost too: step time, recomputation, communication, or numerical instability. A memory win is only an overall win if end-to-end training efficiency improves with it.

Issue: "The fastest configuration in a benchmark should be our production default."

Why it happens / is confusing: Isolated benchmarks often hide dataloader effects, checkpoint cost, or convergence behavior.

Clarification / Fix: Prefer end-to-end comparisons on realistic workloads. Optimize for time-to-quality, not only for kernel-level or step-level speed.


Advanced Connections

Connection 1: Training Optimizations <-> ZeRO and Distributed State Management

Many optimizations interact directly with ZeRO-style state partitioning. A memory-saving change can increase communication, and a communication-saving change can demand more local memory. These decisions are coupled.
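As a sketch of that coupling, here is a DeepSpeed-style ZeRO configuration fragment in Python. The key names follow DeepSpeed's config schema, but the specific values are placeholder assumptions, not recommendations:

```python
# Higher ZeRO stages shard more state (saving per-GPU memory) but add
# gather/scatter traffic; overlap_comm tries to hide that cost behind compute.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {
        "stage": 2,                    # shard optimizer state and gradients
        "overlap_comm": True,          # overlap communication with backward
        "contiguous_gradients": True,  # reduce fragmentation during reduction
    },
    "bf16": {"enabled": True},
}
```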

Connection 2: Training Optimizations <-> PEFT

The reason PEFT methods become so compelling later in the month is that full-model training is expensive in exactly the dimensions studied here: memory, throughput, optimizer state, and system complexity.




Key Insights

  1. Optimization starts with bottlenecks, not with tricks - the useful change depends on what the training system is actually waiting on.
  2. Most improvements are exchanges, not freebies - memory, compute, communication, and stability are constantly being traded against each other.
  3. The real objective is time-to-quality, not just faster steps - end-to-end convergence efficiency matters more than local benchmark wins.
