Day 122: Learning Rate Schedules
Learning rate schedules matter because the step size that helps early exploration is often too large for later fine adjustment.
Today's "Aha!" Moment
The previous lesson focused on how to turn gradients into updates. This lesson focuses on how large those updates should be over time.
At the beginning of training, you often want bold movement. The model is far from a useful region, and large steps can speed up progress. Later, that same step size can become a liability. Once the model is near a good basin, large updates can bounce around, overshoot, or keep the loss oscillating instead of settling.
That is why a fixed learning rate is often an awkward compromise. If it is small enough for late training, it may be too slow early on. If it is large enough for early training, it may be too unstable later.
That is the aha. A learning-rate schedule is a plan for changing the aggressiveness of training as the training phase changes.
Why This Matters
The problem: One learning rate has to serve multiple phases of training, even though those phases often need different behavior.
Before:
- Learning rate feels like one static knob you set once.
- Slow convergence and late-stage oscillation look like unrelated problems.
- Schedules can seem like tuning folklore.
After:
- Training is seen as a process with phases: explore, stabilize, refine.
- Learning-rate schedules become a way to match step size to phase.
- Optimization behavior becomes easier to reason about and debug.
Real-world impact: A good learning-rate schedule can significantly improve training speed, stability, and final model quality without changing the architecture at all.
Learning Objectives
By the end of this session, you will be able to:
- Explain why fixed learning rates are often suboptimal - Understand why the best step size changes across training.
- Compare common schedule families - Distinguish step decay, exponential decay, cosine decay, and warmup-oriented schedules.
- Reason about schedule trade-offs - Know when you want faster exploration, gentler convergence, or protection from early instability.
Core Concepts Explained
Concept 1: Training Usually Wants Large Steps Early and Smaller Steps Later
Think of training as search over a rough landscape. Early on, you are trying to make broad progress toward a useful region. Later, you are trying to fit more precisely within that region.
With a large fixed learning rate:
- early progress may be fast
- late training may oscillate or fail to settle
With a small fixed learning rate:
- late training may be stable
- early training may waste many steps moving too cautiously
early phase: bigger moves often helpful
late phase: smaller moves often safer
This is the core reason schedules exist. They let training behave differently at different times without changing the loss, model, or optimizer family.
The trade-off is between speed and precision. Bigger steps make it easier to cover ground; smaller steps make it easier to refine.
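The trade-off above can be made concrete with a toy example (not from the lesson): plain gradient descent on f(x) = x^2, whose gradient is 2x. The specific learning rates here are illustrative choices, not recommendations.

```python
# Gradient descent on f(x) = x^2 with a fixed learning rate.
# With lr = 0.9 the update x <- x - 0.9 * 2x = -0.8x flips sign
# every step (oscillation around the minimum); with lr = 0.01 the
# iterate shrinks by only 2% per step (steady but slow).

def descend(lr, x=5.0, steps=10):
    path = [x]
    for _ in range(steps):
        x = x - lr * 2 * x  # gradient of x^2 is 2x
        path.append(x)
    return path

big = descend(lr=0.9)    # oscillates: 5.0, -4.0, 3.2, -2.56, ...
small = descend(lr=0.01) # crawls: 5.0, 4.9, 4.802, ...
```

Neither fixed rate is ideal here, which is exactly the compromise a schedule tries to escape.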
Concept 2: Different Schedule Families Encode Different Beliefs About How Training Should Progress
The simplest family is step decay: reduce the learning rate by a fixed factor every so many epochs.
lr = 0.1 -> 0.01 -> 0.001
This reflects a simple belief: train in stages, with each stage becoming more conservative.
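A minimal step-decay sketch (the function and parameter names are this example's own, not a library API): multiply the base learning rate by a fixed factor once per stage.

```python
# Step decay: drop the learning rate by `factor` every `step_size` epochs,
# producing the 0.1 -> 0.01 -> 0.001 staircase shown above.

def step_decay(base_lr, epoch, step_size=30, factor=0.1):
    return base_lr * (factor ** (epoch // step_size))

step_decay(0.1, epoch=0)   # 0.1   (first stage)
step_decay(0.1, epoch=30)  # ~0.01  (second stage)
step_decay(0.1, epoch=60)  # ~0.001 (third stage)
```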
Exponential decay makes the decrease smoother instead of dropping at sharp boundaries.
Cosine decay reduces the learning rate gradually following a cosine curve, often giving a gentler late-training phase.
Warmup does something different: it starts with a small learning rate and ramps it up before decay begins. This helps when large updates early in training are especially unstable, which is common in larger or more sensitive models.
warmup:
small lr -> gradually increase -> normal schedule
decay:
larger lr -> gradually or suddenly reduce
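The two-phase shape above can be combined in one function. This is a sketch under assumptions (linear warmup followed by cosine decay; real recipes vary in warmup length and decay choice):

```python
import math

def warmup_cosine(step, base_lr, warmup_steps, total_steps):
    if step < warmup_steps:
        # Warmup phase: ramp linearly from near zero up to base_lr,
        # protecting the earliest, most fragile updates.
        return base_lr * (step + 1) / warmup_steps
    # Decay phase: cosine-decay over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

warmup_cosine(0, base_lr=0.1, warmup_steps=10, total_steps=100)  # small start
warmup_cosine(9, base_lr=0.1, warmup_steps=10, total_steps=100)  # 0.1 (peak)
```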
The point is not to memorize every schedule family. The point is to see that each one encodes a story about training dynamics.
The trade-off is simplicity versus fit to the real optimization behavior. Simpler schedules are easier to tune and explain. Richer schedules may fit the training dynamics better, but add more moving parts.
Concept 3: The Best Schedule Depends on the Failure Mode You Are Seeing
Learning-rate schedules are most useful when tied to observed behavior.
If training makes progress early and then oscillates, a decay schedule may help.
If training is unstable right from the beginning, warmup may help avoid destructive early updates.
If progress becomes painfully slow after an initially good phase, the issue may be that the learning rate decayed too aggressively, or that the base learning rate was too small from the start.
early instability? -> consider warmup or lower initial lr
late oscillation? -> consider decay
stalls too early? -> maybe decay too soon or base lr too low
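Behavior-driven decay can also be made explicit. The sketch below is a hypothetical helper in the spirit of the diagnostics above (PyTorch's ReduceLROnPlateau works on a similar idea): cut the learning rate only after the validation loss has failed to improve for `patience` epochs.

```python
# Cut the lr by `factor` after `patience` consecutive non-improving epochs.

class PlateauDecay:
    def __init__(self, lr, factor=0.5, patience=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

sched = PlateauDecay(lr=0.1)
for loss in [1.0, 0.8, 0.8, 0.8, 0.8]:  # early progress, then a stall
    lr = sched.step(loss)
# lr is now 0.05: three non-improving epochs triggered one cut
```

The design choice here is that the schedule reacts to an observed failure mode (a stall) instead of following a fixed clock.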
This is why schedules should not be treated as decoration on top of optimization. They are responses to concrete training behavior.
The trade-off is that schedules can improve training a lot, but they can also create extra tuning burden if applied mechanically without reading the loss curves and validation behavior.
Troubleshooting
Issue: Assuming the optimizer alone should solve training instability.
Why it happens / is confusing: Adam or Momentum can make optimization feel more forgiving.
Clarification / Fix: The optimizer defines how to use gradients; the schedule defines how aggressively to step over time. Both matter.
Issue: Reducing the learning rate too early.
Why it happens / is confusing: Smaller steps sound safer, so decaying sooner can feel conservative.
Clarification / Fix: If you decay too early, training may become unnecessarily slow or settle into a mediocre region before enough exploration has happened.
Issue: Treating schedules as mandatory complexity.
Why it happens / is confusing: Many strong training recipes mention elaborate schedules.
Clarification / Fix: Not every model needs a sophisticated schedule. Use the simplest schedule that matches the training behavior you actually see.
Advanced Connections
Connection 1: Learning Rate Schedules ↔ Optimization Phases
The parallel: Schedules reflect the idea that the role of an update changes across training, from coarse movement to fine adjustment.
Real-world case: Many successful training recipes work not because of one magic number, but because they respect that different phases need different step sizes.
Connection 2: Learning Rate Schedules ↔ Control Systems
The parallel: Changing the learning rate over time is a bit like changing the gain of a controller as the system moves closer to a target.
Real-world case: Overshoot, oscillation, and slow convergence are all familiar control-style behaviors that schedules help manage.
Resources
Optional Deepening Resources
- These resources are optional and are not required for the core 30-minute path.
- [DOCS] PyTorch Learning Rate Scheduler Reference
- Link: https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
- Focus: See the major scheduler families and how they are used in practice.
- [PAPER] SGDR: Stochastic Gradient Descent with Warm Restarts
- Link: https://arxiv.org/abs/1608.03983
- Focus: Read the motivation behind cosine-style decay and restart-based scheduling.
- [DOCS] PyTorch OneCycleLR
- Link: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.OneCycleLR.html
- Focus: See an example of a schedule built around distinct training phases.
- [BOOK] Deep Learning
- Link: https://www.deeplearningbook.org/
- Focus: Use the optimization chapter as a more formal follow-up on step size and convergence behavior.
Key Insights
- A fixed learning rate is often a compromise across incompatible phases - Early exploration and late refinement usually want different step sizes.
- Schedules encode beliefs about training dynamics - Step decay, cosine decay, warmup, and related patterns are different responses to different optimization behaviors.
- A schedule should answer a real training problem - Use schedules to address observed instability, oscillation, or slowdown, not as ritual complexity.
Knowledge Check (Test Questions)
1. Why is a fixed learning rate often suboptimal?
- A) Because the step size that helps early progress can be too large for late-stage refinement.
- B) Because gradients disappear if the rate is constant.
- C) Because schedules remove the need for optimizers.
2. What problem does warmup mainly help with?
- A) Early training instability when large updates at the start are risky.
- B) Converting regression into classification.
- C) Eliminating the need for decay later.
3. When is a decay schedule especially plausible?
- A) When training makes progress at first but later oscillates or fails to settle cleanly.
- B) When gradients are mathematically incorrect.
- C) When the model has no hidden layers.
Answers
1. A: Training often wants different levels of aggressiveness at different stages.
2. A: Warmup protects the earliest phase of training, where full-strength updates can be too unstable.
3. A: Decay is often useful when later training needs smaller, more careful steps than early training did.