Day 121: SGD Variants - Momentum, RMSprop, Adam
Optimizer variants matter because a correct gradient still does not tell you how to move well through a noisy, uneven loss landscape.
Today's "Aha!" Moment
The last lessons got the network to the point where it can compute gradients correctly. But one more question remains: once you have a gradient, how should you actually use it to update the parameters?
Plain stochastic gradient descent gives the most direct answer: move a little in the negative-gradient direction. That works, but it can also zig-zag, stall, or move too slowly when the landscape is noisy or when different parameters need very different step sizes.
That is why optimizer variants exist. Momentum tries to keep useful motion going instead of reacting too hard to every noisy minibatch. RMSprop scales steps based on recent gradient magnitudes. Adam combines both instincts: keep a moving direction estimate and normalize by recent gradient scale.
That is the aha. These optimizers are not random improvements layered on top of SGD. Each one is a response to a concrete weakness in plain gradient descent.
Why This Matters
The problem: Even with correct gradients, optimization can be slow, unstable, or overly sensitive to learning-rate choice.
Before:
- “Use the gradient” sounds like the whole story.
- Training noise and curvature differences look like mysterious bad luck.
- Optimizer names feel like cookbook defaults rather than design choices.
After:
- Optimization is seen as the problem of turning gradients into useful motion.
- Each optimizer variant can be understood as addressing a specific difficulty.
- Choosing an optimizer becomes more principled and less superstitious.
Real-world impact: Optimizer choice affects training speed, stability, and how much tuning effort a model requires. It often changes whether a network trains smoothly at all.
Learning Objectives
By the end of this session, you will be able to:
- Explain what plain SGD struggles with - Understand noise, ravines, and uneven gradient scales.
- Describe the main idea behind Momentum, RMSprop, and Adam - See what each method adds beyond basic SGD.
- Reason about practical trade-offs - Understand why no optimizer is “best” in all cases.
Core Concepts Explained
Concept 1: Plain SGD Is Simple, but the Gradient Alone Can Be a Messy Guide
Basic SGD updates parameters like this:
W = W - lr * dW
That is clean and direct, but real neural-network training is rarely so well behaved. With minibatches, gradients are noisy. In narrow valleys of the loss surface, the gradient can point sharply side to side while only weakly pointing downhill overall. Some parameters may also live on much larger scales than others.
true descent direction:   ----------->
noisy minibatch steps:    /\/\/\/\/\->
This produces the classic SGD frustrations:
- zig-zagging instead of smooth progress
- sensitivity to learning rate
- slow movement along important but shallow directions
- instability when gradient magnitudes vary across parameters
The trade-off is that SGD is simple, memory-light, and often surprisingly strong, but it leaves all these geometry and noise issues mostly untreated.
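The uneven-geometry complaint above can be made concrete with a tiny sketch. This is an illustrative toy, not anything from a framework: a quadratic loss whose two coordinates have curvatures 10 and 0.1, updated with the plain `W = W - lr * dW` rule.

```python
# Toy sketch (assumption: a made-up 2D quadratic, not a real model).
# Loss L(w) = 0.5 * (10 * w[0]**2 + 0.1 * w[1]**2), so the gradient is
# grad = [10 * w[0], 0.1 * w[1]]: one steep coordinate, one shallow one.

def sgd_step(w, grad, lr):
    # The basic rule from the text, applied per coordinate: W = W - lr * dW
    return [wi - lr * gi for wi, gi in zip(w, grad)]

w = [1.0, 1.0]
lr = 0.05
for _ in range(50):
    grad = [10 * w[0], 0.1 * w[1]]
    w = sgd_step(w, grad, lr)

# The steep coordinate collapses quickly, while the shallow one barely
# moves: a single shared learning rate cannot serve both directions well.
print(w)
```

Raising the learning rate to speed up the shallow coordinate would make the steep one overshoot, which is exactly the sensitivity the list above describes.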
Concept 2: Momentum and RMSprop Solve Different Optimization Problems
Momentum addresses noisy direction changes. Instead of following only the current gradient, it keeps a running velocity:
velocity = mu * old velocity + current gradient signal   (mu < 1, so stale motion fades)
update = follow velocity
The intuition is physical. If many consecutive gradients point roughly the same way, momentum lets motion build. If the gradient oscillates back and forth, momentum damps some of that zig-zagging.
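The damping effect can be sketched in a few lines. This is an illustrative toy (the alternating-sign gradient stream is an assumption, chosen to mimic a noisy minibatch signal with a weak persistent pull):

```python
# Toy sketch: momentum on a gradient stream that flips sign every step
# (the zig-zag) plus a small constant component (the true downhill pull).

def momentum_step(w, v, grad, lr=0.1, mu=0.9):
    v = mu * v + grad   # running velocity; mu < 1 lets old motion fade
    w = w - lr * v      # follow the velocity, not the raw gradient
    return w, v

w, v = 0.0, 0.0
for t in range(100):
    oscillation = 1.0 if t % 2 == 0 else -1.0  # noisy side-to-side part
    grad = oscillation + 0.1                   # plus a weak persistent pull
    w, v = momentum_step(w, v, grad)

# The alternating part largely cancels inside the velocity, while the
# persistent +0.1 component accumulates, so w drifts steadily negative.
print(w)
```

With plain SGD the same stream would mostly bounce in place; momentum lets the small consistent signal dominate the update.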
RMSprop addresses a different issue: some coordinates repeatedly get large gradients while others get small ones. It keeps a running estimate of recent squared gradients and scales the update inversely to that estimate.
That means:
- parameters with consistently large gradients get smaller effective steps
- parameters with smaller gradients can still move meaningfully
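That per-parameter rescaling can also be sketched directly. Again a toy, assuming two parameters whose raw gradients differ by 100x in magnitude:

```python
import math

# Toy sketch of RMSprop-style scaling: each parameter keeps a running
# average of its squared gradients and divides its step by the square
# root of that average.

def rmsprop_step(w, s, grad, lr=0.01, rho=0.9, eps=1e-8):
    new_w, new_s = [], []
    for wi, si, gi in zip(w, s, grad):
        si = rho * si + (1 - rho) * gi ** 2        # recent squared-gradient scale
        wi = wi - lr * gi / (math.sqrt(si) + eps)  # step normalized by that scale
        new_w.append(wi)
        new_s.append(si)
    return new_w, new_s

w = [1.0, 1.0]
s = [0.0, 0.0]
for _ in range(300):
    grad = [100 * w[0], 1.0 * w[1]]  # gradients differ by 100x in scale
    w, s = rmsprop_step(w, s, grad)

# Despite the 100x gap in raw gradient size, both coordinates make
# comparable progress, because each step is divided by that coordinate's
# own recent gradient magnitude.
print(w)
```

Compare this with the plain-SGD sketch earlier: there, one learning rate had to serve both scales; here, the running estimate does the per-coordinate adjustment.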
Momentum:
smooth the direction
RMSprop:
normalize by recent gradient scale
These are not competing stories about the same thing. They fix different weaknesses of plain SGD.
The trade-off is that both add state and hyperparameters. You gain better motion, but you also make the update rule less transparent than raw SGD.
Concept 3: Adam Combines Momentum-Like Direction Tracking with RMSprop-Like Scaling
Adam became popular because it combines both intuitions.
It keeps:
- a moving average of gradients, like momentum
- a moving average of squared gradients, like RMSprop
Then it uses both to produce a scaled update.
m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * (grad ** 2)
param = param - lr * m / (sqrt(v) + eps)
You do not need to memorize the exact symbols yet. (The full algorithm also bias-corrects m and v during the first steps, while the moving averages are still warming up.) The conceptual point is enough: Adam tries to move in a smoothed useful direction while adapting the step size to local gradient scale.
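The pseudocode above can be turned into a small runnable sketch. This is a toy single-parameter example (the target function is an assumption), and it includes the bias-correction terms that the simplified pseudocode omits:

```python
import math

# Toy sketch: Adam minimizing f(w) = (w - 3)**2, whose gradient is
# 2 * (w - 3). Not a framework implementation, just the update rule.

def adam_step(param, m, v, grad, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # momentum-like moving average
    v = beta2 * v + (1 - beta2) * grad ** 2  # RMSprop-like moving average
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

w, m, v = 0.0, 0.0, 0.0
for t in range(1, 301):          # t starts at 1 so the corrections are defined
    grad = 2 * (w - 3)           # gradient of (w - 3)**2
    w, m, v = adam_step(w, m, v, grad, t)

print(w)  # w ends close to the minimum at 3
```

Note how the update reads as the two earlier sketches glued together: the m average smooths the direction, and dividing by sqrt(v_hat) normalizes the step size.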
That makes it a strong default in many deep-learning workflows. But “default” is not the same as “always best.” Sometimes SGD with momentum generalizes better or behaves more predictably once tuned well. Adam can also hide poor learning-rate intuition by making training seem easier than it really is.
SGD:
one raw step rule
Momentum:
keep moving in useful directions
RMSprop:
adapt to gradient scale
Adam:
do both
The trade-off is convenience versus control. Adam often works quickly with less tuning, but simpler optimizers can still be preferable when you want more predictable optimization behavior or better final generalization.
Troubleshooting
Issue: Assuming a more sophisticated optimizer will automatically fix a weak model.
Why it happens / is confusing: If training is unstable, it is tempting to blame the optimizer first.
Clarification / Fix: Optimizers help you use gradients better. They do not rescue a broken architecture, bad loss design, or incorrect gradients.
Issue: Treating optimizer choice as completely separate from learning-rate choice.
Why it happens / is confusing: Adam and related methods can feel “self-tuning.”
Clarification / Fix: Adaptive methods reduce some sensitivity, but learning rate still matters a lot.
Issue: Thinking Adam always dominates SGD with momentum.
Why it happens / is confusing: Adam is often the easiest optimizer to get running.
Clarification / Fix: Adam is an excellent default, but SGD with momentum can still be competitive or better depending on the model, task, and desired generalization behavior.
Advanced Connections
Connection 1: Optimizers ↔ Loss-Surface Geometry
The parallel: Different optimizers respond differently to curvature, noisy gradients, and uneven scaling across directions.
Real-world case: The same model and loss can train very differently depending on whether the update rule damps oscillation, rescales coordinates, or both.
Connection 2: Optimizers ↔ Control Systems
The parallel: Momentum acts a bit like inertia or smoothing, while adaptive scaling acts a bit like automatic gain adjustment.
Real-world case: Many training instabilities are easier to understand when seen as poor control of noisy feedback rather than as mysterious neural-network behavior.
Resources
Optional Deepening Resources
- These resources are optional and are not required for the core 30-minute path.
- [TUTORIAL] CS231n Notes - Neural Networks Part 3: Learning and Evaluation
- Link: https://cs231n.github.io/neural-networks-3/
- Focus: Review the optimizer discussion and the practical advice around learning dynamics.
- [DOCS] PyTorch Optimizer Reference - SGD
- Link: https://pytorch.org/docs/stable/generated/torch.optim.SGD.html
- Focus: See how plain SGD and momentum are parameterized in practice.
- [DOCS] PyTorch Optimizer Reference - Adam
- Link: https://pytorch.org/docs/stable/generated/torch.optim.Adam.html
- Focus: Map the conceptual moving averages to a real framework implementation.
- [BOOK] Deep Learning
- Link: https://www.deeplearningbook.org/
- Focus: Use the optimization chapter as a formal follow-up on gradient-based methods.
Key Insights
- A gradient still needs an update rule - Plain SGD is the simplest one, but it can struggle with noise and uneven geometry.
- Momentum and RMSprop solve different problems - One smooths direction; the other adapts to gradient scale.
- Adam combines both ideas, but it is not magic - It is a strong default, not a universal substitute for good modeling and tuning.
Knowledge Check (Test Questions)
1. What is one main weakness of plain SGD?
- A) It can zig-zag or move inefficiently when gradients are noisy or differently scaled across directions.
- B) It cannot use gradients at all.
- C) It only works for linear models.
2. What problem does momentum mainly address?
- A) It reduces noisy directional oscillation by accumulating a running velocity.
- B) It converts classification into regression.
- C) It removes the need for a learning rate.
3. What is the main idea behind Adam?
- A) Combine smoothed directional updates with adaptive scaling based on recent gradient magnitudes.
- B) Replace the loss function with a better metric.
- C) Use only the last gradient and ignore history.
Answers
1. A: SGD is simple, but raw minibatch gradients can produce unstable or inefficient motion.
2. A: Momentum helps keep movement aligned with persistent descent directions instead of reacting too hard to every noisy gradient.
3. A: Adam mixes momentum-like running averages with RMSprop-like adaptive step sizing.