LESSON
Day 308: Training Optimizations - Making LLMs Train Faster & Better
The core idea: training optimizations are not a bag of random speed hacks. They are choices about where the real bottleneck lives in large-scale training: memory, communication, numerical stability, input pipeline throughput, or wasted work per useful token.
Today's "Aha!" Moment
The insight: A pretraining run can fail in several different ways while still looking "busy":
- GPUs are active, but waiting on input
- memory fits, but communication dominates
- throughput is high, but instability forces tiny learning rates
- loss decreases, but too much compute is being wasted on avoidable overhead
That is why training optimization is really bottleneck optimization. The right trick depends on what is actually limiting useful progress.
Why this matters: After 20/03.md, we now have the full from-scratch pretraining pipeline in view. The next step is making that pipeline efficient enough to be economically viable.
Concrete anchor: Mixed precision, gradient checkpointing, fused kernels, better data loaders, flash attention, and schedule tuning do not all solve the same problem. They attack different pieces of wasted time or wasted memory.
Keep this mental hook in view: Training optimization is the art of paying less compute for the same useful gradient signal.
Why This Matters
Once a team decides to pretrain from scratch, the challenge is no longer only:
- can we run this job?
It becomes:
- can we run it fast enough, stably enough, and cheaply enough to finish before the budget or patience runs out?
That is why this lesson sits here before the adaptation block:
- first you learn how to make base-model training efficient
- then you can understand why techniques like LoRA and PEFT are so attractive downstream
If pretraining is expensive by nature, optimization determines whether that expense is merely high or completely impractical.
Learning Objectives
By the end of this session, you should be able to:
- Explain the main categories of training bottlenecks in LLM pretraining.
- Describe how common optimization techniques improve memory use, throughput, or stability and what they cost in return.
- Evaluate training optimizations as trade-offs rather than as universally good defaults.
Core Concepts Explained
Concept 1: Training Speed Is Usually Limited by a Small Number of Dominant Bottlenecks
For example, a team adds more GPUs to a run, but tokens-per-second barely improves. Another team reduces memory usage successfully, but the job becomes slower because communication and recomputation rise.
At a high level, large-scale training is rarely "globally inefficient" in some vague, diffuse way. It is usually constrained by one or two dominant bottlenecks:
- memory capacity
- memory bandwidth
- interconnect bandwidth
- compute kernel efficiency
- input pipeline throughput
- optimizer instability
Mechanically: At each step, useful training work competes with overhead like:
- sharded state movement
- activation storage
- dataloader stalls
- recomputation
- synchronization barriers
- suboptimal kernels
Optimization starts by identifying which of these is actually constraining tokens-per-second or stable batch size.
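To see where a step's time actually goes, a minimal sketch like the following can help. It assumes a standard single-GPU PyTorch loop; `model`, `optimizer`, and `loader` are placeholders, and the batch is assumed to be a single tensor:

```python
import time
import torch

def profile_step_breakdown(model, optimizer, loader, num_steps=50):
    """Rough per-step split: time waiting on the dataloader vs. GPU work.

    A large data-wait share means the job is input-bound, and adding
    GPUs will not move tokens-per-second until the pipeline is fixed.
    """
    data_wait = compute = 0.0
    batches = iter(loader)
    for _ in range(num_steps):
        t0 = time.perf_counter()
        batch = next(batches)                 # stalls here if input-bound
        t1 = time.perf_counter()

        loss = model(batch.cuda(non_blocking=True)).mean()  # placeholder loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        torch.cuda.synchronize()              # flush async GPU work so the host clock sees it
        t2 = time.perf_counter()

        data_wait += t1 - t0
        compute += t2 - t1

    total = data_wait + compute
    print(f"data wait {100 * data_wait / total:.1f}% | compute {100 * compute / total:.1f}%")
```

Communication overhead in multi-GPU runs needs a real profiler such as `torch.profiler` to see, but even this crude split catches the common "busy but input-bound" failure mode.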
In practice:
- adding hardware can be a poor fix if the real bottleneck is elsewhere
- the same model can be input-bound on one cluster and communication-bound on another
- optimization work is most effective when it starts from profiling, not folklore
The trade-off is clear: You gain efficiency only when the optimization matches the current bottleneck. Otherwise, you often add complexity without moving the real limit.
A useful mental model is: Think of training like a factory line. Making one station faster helps only if that station was actually the one creating the queue.
Use this lens when:
- Best fit: diagnosing low throughput, poor scaling efficiency, or unstable large runs.
- Misuse pattern: applying optimization recipes without measuring what the system is waiting on.
Concept 2: Most Training Optimizations Trade Memory, Compute, and Communication Against Each Other
For example, a team enables activation checkpointing and finally fits the target context length, but step time rises. Another team adopts mixed precision and gets a large speedup, but now needs careful loss scaling and stability monitoring.
At a high level, optimization in training is rarely free. Many techniques work by shifting cost from one subsystem to another.
Mechanically: Common examples include:
- mixed precision: saves memory and often increases throughput, but can introduce numerical sensitivity
- activation checkpointing: reduces activation memory, but adds recomputation during the backward pass
- fused kernels: reduce overhead between many small operations, but depend on implementation quality and hardware fit
- flash-attention-style kernels: reduce memory traffic and improve attention efficiency, especially at longer sequences (see the sketch at the end of this concept)
- better dataloading and sequence packing: reduce idle accelerator time by feeding the model more efficiently
These are not one-dimensional upgrades. Each one changes the cost surface differently.
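As a concrete instance of the first trade, here is a minimal mixed-precision training step in PyTorch. This is a sketch, assuming an fp16-capable GPU; `model`, `optimizer`, `loss_fn`, and the batch format are placeholders:

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # rescales the loss so fp16 gradients do not underflow

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad(set_to_none=True)
    # Selected ops run in fp16 for speed; numerically sensitive ops stay in fp32.
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()   # backward pass on the scaled loss
    scaler.step(optimizer)          # unscales gradients; skips the step on inf/nan
    scaler.update()                 # adapts the loss scale to recent overflow history
    return loss.detach()
```

The scaler is the "numerical sensitivity" cost in concrete form: the speedup is real, but so is the extra machinery that has to be watched.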
In practice:
- activation checkpointing often buys model size or sequence length at the price of longer step time (see the sketch after this list)
- mixed precision is often worth it, but needs careful numerical monitoring
- better input packing can increase effective throughput without changing the model at all
- a kernel optimization that looks impressive in isolation may not matter if the job is bottlenecked elsewhere
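A minimal sketch of the checkpointing trade, using `torch.utils.checkpoint` (the generic block stack here is a placeholder; real stacks typically checkpoint per transformer layer):

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    """Runs each block under activation checkpointing: activations are
    dropped during the forward pass and recomputed during backward,
    trading extra compute for a much smaller activation footprint."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, x):
        for block in self.blocks:
            # use_reentrant=False selects the recommended non-reentrant variant
            x = checkpoint(block, x, use_reentrant=False)
        return x
```

Each block's forward now effectively runs twice (once normally, once recomputed during backward), which is exactly the longer step time described above.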
The trade-off is clear: Most optimizations buy one scarce resource by spending another.
A useful mental model is: You are rebalancing a budget across memory, compute, bandwidth, and engineering complexity.
Use this lens when:
- Best fit: choosing optimizations for a specific cluster, model size, and context target.
- Misuse pattern: assuming a faster microbenchmark automatically means a faster end-to-end training run.
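For the flash-attention-style kernels listed earlier, PyTorch's built-in `scaled_dot_product_attention` is a convenient illustration: it computes exact attention but dispatches to a fused, IO-aware kernel when shapes, dtypes, and hardware allow it. A sketch, with arbitrary tensor shapes:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim); half precision makes fused kernels eligible
q = torch.randn(4, 16, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn(4, 16, 4096, 64, device="cuda", dtype=torch.float16)
v = torch.randn(4, 16, 4096, 64, device="cuda", dtype=torch.float16)

# Exact causal attention; a fused backend avoids materializing the
# full 4096 x 4096 score matrix in slow GPU memory.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

The win is memory traffic, not FLOPs: the math is unchanged, but the quadratic score matrix never round-trips through high-bandwidth memory, which matters more as sequences get longer.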
Concept 3: The Best Training Optimization Is Often the One That Improves Useful Convergence Per Dollar, Not Raw Step Speed
For example, one configuration runs fewer tokens per second but converges more stably with a larger global batch and fewer restarts. Another is faster per step but wastes time on instability, bad packing, or noisy gradients.
At a high level, "Faster training" is ambiguous. The real question is not only step speed, but how efficiently the run turns money and time into model quality.
Mechanically: Useful optimization should be judged on a broader set of outcomes:
- tokens per second
- stable batch size
- memory headroom
- convergence quality
- wall-clock time to target quality
- failure rate and restart cost
That means optimization decisions belong partly to systems engineering and partly to learning dynamics.
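A back-of-the-envelope comparison makes the point; every number below is invented purely for illustration:

```python
def hours_to_target(tokens_per_sec, tokens_to_target, restarts, hours_lost_per_restart):
    """Wall-clock hours to a target loss, including restart overhead
    (lost progress since the last checkpoint plus debugging time)."""
    return tokens_to_target / tokens_per_sec / 3600 + restarts * hours_lost_per_restart

# Config A: higher peak throughput, but unstable (hypothetical numbers)
a = hours_to_target(tokens_per_sec=450_000, tokens_to_target=2e12,
                    restarts=12, hours_lost_per_restart=20)
# Config B: ~10% slower per step, but stable
b = hours_to_target(tokens_per_sec=400_000, tokens_to_target=2e12,
                    restarts=1, hours_lost_per_restart=20)

print(f"A: {a:,.0f} h   B: {b:,.0f} h")  # B finishes first despite slower steps
```

With these made-up numbers, the "slower" configuration reaches the target roughly 65 hours sooner, which is the sense in which time-to-quality, not step speed, is the real objective.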
In practice:
- a slightly slower but much more stable run may be better overall
- sequence packing and data-pipeline improvements can buy surprisingly large real-world wins (see the sketch after this list)
- reducing restart frequency can matter as much as improving average step time
- the right optimizer schedule can sometimes outperform hardware-heavy scaling tricks
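As an example of the packing point, here is a minimal greedy packer. This is a sketch; real pipelines usually also emit attention masks or position resets so packed documents do not attend across boundaries:

```python
def pack_sequences(token_seqs, block_size, sep_id):
    """Greedily concatenate tokenized documents into fixed-length blocks,
    so short documents do not waste the rest of a padded row."""
    blocks, current = [], []
    for seq in token_seqs:
        for tok in seq + [sep_id]:            # separator marks document boundaries
            current.append(tok)
            if len(current) == block_size:
                blocks.append(current)
                current = []
    return blocks                             # trailing partial block is dropped here

# e.g. pack_sequences(tokenized_docs, block_size=2048, sep_id=eos_token_id)
```

Compared with padding every document out to block_size, this raises the fraction of real tokens per batch: throughput gained without touching the model.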
The trade-off is clear: Maximizing local throughput is not the same as maximizing end-to-end training efficiency.
A useful mental model is: Training is a business process as much as a technical one. The question is not only "how fast is the engine?" but "how much useful distance do we cover for the fuel we spend?"
Use this lens when:
- Best fit: deciding between optimization options with different effects on speed, stability, and cost.
- Misuse pattern: choosing solely on peak throughput without considering convergence or operational fragility.
Troubleshooting
Issue: "GPU utilization is high, so the training job must already be well optimized."
Why it happens / is confusing: Busy hardware looks healthy, but utilization alone says little about whether the right work is happening efficiently.
Clarification / Fix: Check end-to-end throughput, dataloader wait time, communication overhead, and convergence quality. A busy cluster can still be wasting a lot of time.
Issue: "This optimization reduced memory, so it must be an overall win."
Why it happens / is confusing: Memory relief is visible and urgent, so it is easy to ignore what was spent to get it.
Clarification / Fix: Measure the new cost too: step time, recomputation, communication, or numerical instability. Memory wins are only good if the overall training objective improves.
Issue: "The fastest configuration in a benchmark should be our production default."
Why it happens / is confusing: Isolated benchmarks often hide dataloader effects, checkpoint cost, or convergence behavior.
Clarification / Fix: Prefer end-to-end comparisons on realistic workloads. Optimize for time-to-quality, not only for kernel-level or step-level speed.
Advanced Connections
Connection 1: Training Optimizations <-> ZeRO and Distributed State Management
Many optimizations interact directly with ZeRO-style state partitioning. A memory-saving change can increase communication, and a communication-saving change can demand more local memory. These decisions are coupled.
Connection 2: Training Optimizations <-> PEFT
The reason PEFT methods become so compelling later in the month is that full-model training is expensive in exactly the dimensions studied here: memory, throughput, optimizer state, and system complexity.
Resources
Optional Deepening Resources
- [PAPER] Mixed Precision Training
  - Focus: Why lower precision can speed up training while preserving model quality when handled carefully.
- [PAPER] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
  - Focus: A concrete example of optimizing attention by targeting memory movement rather than only FLOPs.
- [DOC] PyTorch Performance Tuning Guide
  - Focus: Practical optimization levers around kernels, dataloading, and execution behavior.
- [DOC] NVIDIA Megatron-LM
  - Focus: A real training stack where many large-scale optimization ideas are combined in practice.
Key Insights
- Optimization starts with bottlenecks, not with tricks - the useful change depends on what the training system is actually waiting on.
- Most improvements are exchanges, not freebies - memory, compute, communication, and stability are constantly being traded against each other.
- The real objective is time-to-quality, not just faster steps - end-to-end convergence efficiency matters more than local benchmark wins.