Day 120: Gradient Checking & Debugging

Gradient checking matters because when you write backprop by hand, the first question is not “is the optimizer good?” but “are these gradients even correct?”


Today's "Aha!" Moment

Once you implement backpropagation from scratch, a new kind of uncertainty appears. The code may run, the loss may even move, and yet the gradients can still be wrong. A missing transpose, the wrong cached tensor, or a tiny activation-derivative mistake can quietly poison training.

Gradient checking is the sanity test for that situation. Instead of trusting the analytical gradient from backprop immediately, you approximate the gradient numerically by slightly nudging one parameter up and down and seeing how the loss changes.

If the analytical and numerical gradients agree closely, that is strong evidence your backward pass is implemented correctly. If they disagree badly, you know the problem is in the gradient code before you waste time tuning learning rates or architectures.

That is the aha. Gradient checking is not part of normal training. It is a debugging instrument for verifying the machinery that makes training possible.


Why This Matters

The problem: Backpropagation implementations often fail through small local mistakes that do not crash the code but do corrupt the gradients.

Before: You can only suspect gradient bugs indirectly, by watching the loss stall, oscillate, or explode, and you debug by trial and error on hyperparameters.

After: You compare each backprop gradient against an independent numerical estimate, so you can localize a gradient bug directly before touching the optimizer or the architecture.

Real-world impact: Gradient checking is one of the most valuable habits when implementing custom losses, custom layers, or framework-free educational models. It shortens debugging loops dramatically.


Learning Objectives

By the end of this session, you will be able to:

  1. Explain what gradient checking verifies - Understand how numerical finite differences test analytical gradients.
  2. Run the basic gradient-checking procedure - Perturb a parameter, recompute the loss, and compare the result with backprop.
  3. Use gradient checking correctly as a debugging tool - Know when it is useful, when it is too expensive, and what common mismatches usually mean.

Core Concepts Explained

Concept 1: Numerical Gradients Approximate “How Much Would the Loss Change If This Parameter Moved?”

Suppose you want to check one parameter w. You can estimate its gradient numerically with a symmetric finite-difference formula:

grad_approx(w) = [L(w + eps) - L(w - eps)] / (2 * eps)

This works because the derivative is, at heart, a local rate of change. If eps is small, the loss difference around w gives you a good approximation of how sensitive the loss is to that parameter.

def numerical_grad(loss_fn, w, eps=1e-5):
    # Symmetric (central) finite difference around w.
    return (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)

That numerical gradient is not what you use for real training. It is too slow. But it is extremely useful as a reference signal for debugging.

The trade-off is speed versus trust. Finite differences are computationally expensive, but they give you a valuable independent check on whether backprop's analytic gradient is plausible.
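As a quick sanity example, you can run the finite-difference estimator on a loss whose derivative is known in closed form. With L(w) = w², the exact derivative at w = 3 is 6:

```python
def numerical_grad(loss_fn, w, eps=1e-5):
    # Symmetric (central) finite difference around w.
    return (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)

# L(w) = w**2 has exact derivative 2*w, so we expect ~6.0 at w = 3.
approx = numerical_grad(lambda w: w ** 2, 3.0)
print(approx)
```

For a quadratic loss the central difference is exact up to floating-point rounding, which makes it a convenient first test of the checker itself.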

Concept 2: Gradient Checking Compares Two Ways of Answering the Same Question

Backpropagation gives you an analytical gradient from the chain rule. Gradient checking gives you a numerical approximation from repeated forward evaluations. If both methods are correct, they should closely agree.

A typical workflow looks like this:

1. run forward pass and backprop
2. choose one parameter
3. perturb it by +eps and -eps
4. recompute the loss both times
5. estimate numerical gradient
6. compare numerical gradient vs backprop gradient
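The steps above can be sketched end to end on a tiny one-parameter model. The model and loss here are hypothetical stand-ins (a single weight `w` with a squared-error loss), chosen so the analytical gradient is easy to write down:

```python
import numpy as np

# Tiny deterministic case: loss L(w) = mean((x*w - y)**2) for one weight w.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

def loss(w):
    return np.mean((x * w - y) ** 2)

def backprop_grad(w):
    # Analytical gradient from the chain rule: dL/dw = mean(2 * (x*w - y) * x)
    return np.mean(2.0 * (x * w - y) * x)

w = 1.5
eps = 1e-5

# Steps 3-5: perturb, recompute the loss, estimate the numerical gradient.
grad_num = (loss(w + eps) - loss(w - eps)) / (2 * eps)
grad_bp = backprop_grad(w)

# Step 6: compare with a relative error.
rel_err = abs(grad_bp - grad_num) / max(1e-8, abs(grad_bp) + abs(grad_num))
print(grad_bp, grad_num, rel_err)  # rel_err should be tiny
```

In a real network the same loop runs once per checked parameter, with everything else held fixed.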

You usually compare them with a relative error rather than raw subtraction:

relative_error = abs(grad_backprop - grad_num) / max(1e-8, abs(grad_backprop) + abs(grad_num))

Why relative error? Because an absolute difference of 1e-5 might be tiny if the gradients are around 1, but huge if the gradients are around 1e-7.
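A minimal illustration of that scale issue: the same absolute difference of 1e-5 yields opposite verdicts depending on the magnitude of the gradients being compared.

```python
def relative_error(a, b):
    return abs(a - b) / max(1e-8, abs(a) + abs(b))

# Same absolute difference (1e-5), very different verdicts:
print(relative_error(1.0, 1.0 + 1e-5))    # tiny: effectively equal
print(relative_error(1e-7, 1e-7 + 1e-5))  # near 1: badly mismatched
```

This is why gradient-checking recipes report relative error rather than a raw subtraction.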

This is the core mindset: gradient checking is not asking whether the gradient is exactly identical. It is asking whether the independently computed answers are close enough to trust the implementation.

The trade-off is that the comparison is powerful but delicate. Very small eps can amplify floating-point noise, while very large eps can make the approximation too crude.
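One way to see this eps trade-off concretely is to check a function with a known derivative, L(w) = sin(w), whose exact derivative is cos(w), at several eps values (the exact error magnitudes depend on the platform's floating-point behavior):

```python
import math

def numerical_grad(loss_fn, w, eps):
    # Symmetric (central) finite difference around w.
    return (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)

w = 1.0
exact = math.cos(w)
errors = {}
for eps in (1e-1, 1e-5, 1e-12):
    errors[eps] = abs(numerical_grad(math.sin, w, eps) - exact)
    print(f"eps={eps:g}  error={errors[eps]:.2e}")
# A large eps gives a crude Taylor approximation; a tiny eps lets
# floating-point cancellation dominate; something near 1e-5 is a
# common sweet spot for double precision.
```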

Concept 3: Gradient Checking Is a Debugging Step, Not a Training Strategy

This point matters a lot in practice.

Gradient checking is expensive because every checked parameter requires extra forward evaluations of the whole network. That is totally fine for a tiny model or for a few randomly selected parameters, but absurd for a full training run.

It is also important to run the check in a controlled setting:

Use gradient checking when you are:

  - implementing a custom layer
  - implementing a custom loss
  - building a from-scratch implementation
  - investigating suspicious training behavior

Do not use it as:

  - part of the normal training loop
  - a substitute for optimization
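Because checking every parameter requires two extra forward passes each, a common pattern is to spot-check only a few randomly chosen entries of a weight array. The sketch below uses a hypothetical stand-in loss, L(W) = ½·ΣW², whose exact gradient is simply W, so the check should always pass:

```python
import numpy as np

rng = np.random.default_rng(0)  # deterministic randomness for the check

def loss(W):
    # Hypothetical stand-in loss over a weight matrix.
    return np.sum(W ** 2) / 2.0

def analytic_grad(W):
    # Exact gradient of the stand-in loss above.
    return W.copy()

def spot_check(W, n_checks=5, eps=1e-5):
    grad = analytic_grad(W)
    worst = 0.0
    for flat in rng.choice(W.size, size=n_checks, replace=False):
        i = np.unravel_index(flat, W.shape)
        orig = W[i]
        W[i] = orig + eps
        lp = loss(W)
        W[i] = orig - eps
        lm = loss(W)
        W[i] = orig  # always restore the parameter afterwards
        num = (lp - lm) / (2 * eps)
        worst = max(worst, abs(grad[i] - num) / max(1e-8, abs(grad[i]) + abs(num)))
    return worst

W = rng.standard_normal((4, 3))
worst = spot_check(W)
print(worst)  # worst relative error over the sampled entries
```

Spot-checking a handful of entries per layer is usually enough to catch the kinds of systematic bugs (wrong transpose, wrong cached tensor) that corrupt every gradient at once.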

If the check fails, common causes include:

  - shape mismatches or missing transposes
  - using the wrong cached tensor in the backward pass
  - incorrect activation derivatives
  - inconsistent averaging or summing over the batch
  - missing terms such as regularization

The trade-off is that gradient checking can save huge amounts of debugging time, but only if you treat it as a targeted verification tool rather than a permanent part of the pipeline.

Troubleshooting

Issue: Numerical and analytical gradients disagree badly.

Why it happens / is confusing: The model may still run, so the disagreement feels mysterious.

Clarification / Fix: Check one layer at a time. Look first for shape mismatches, incorrect activation derivatives, and missing terms such as regularization.

Issue: Gradient checking passes on a tiny example but training still behaves badly.

Why it happens / is confusing: A correct gradient implementation does not guarantee good optimization behavior.

Clarification / Fix: Once gradient correctness is verified, move on to optimizer choice, learning rate, initialization, scaling, and data issues.

Issue: Gradient checking fails intermittently.

Why it happens / is confusing: Stochastic layers or nondeterministic settings can make repeated forward evaluations inconsistent.

Clarification / Fix: Disable randomness during the check and use a deterministic mini-case.
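A minimal sketch of that fix, using a hypothetical loss with a dropout-like random mask: instead of sharing one random stream across the two loss evaluations, give each evaluation a freshly seeded generator so both see identical randomness.

```python
import numpy as np

def stochastic_loss(w, rng):
    # Hypothetical loss with a random (dropout-like) mask.
    mask = rng.random(3) > 0.5
    return float(np.sum(mask * w))

w = np.array([1.0, 2.0, 3.0])

# Identically seeded generators make the two evaluations required by
# the finite difference see exactly the same randomness.
l1 = stochastic_loss(w, np.random.default_rng(42))
l2 = stochastic_loss(w, np.random.default_rng(42))
print(l1 == l2)  # the check is now repeatable
```

In real frameworks the equivalent move is switching the model to evaluation mode (or setting dropout to zero) for the duration of the check.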


Advanced Connections

Connection 1: Gradient Checking ↔ Numerical Analysis

The parallel: Finite differences are a numerical approximation technique, not a symbolic derivative method.

Real-world case: The choice of eps and the use of symmetric differences are classic numerical-stability considerations.

Connection 2: Gradient Checking ↔ Software Testing

The parallel: This is essentially a unit test for derivatives.

Real-world case: Just as normal tests compare expected and actual outputs, gradient checking compares expected local sensitivity with implemented backprop sensitivity.



Key Insights

  1. Gradient checking compares analytical gradients with numerical approximations - It is a direct sanity test for a backprop implementation.
  2. Finite differences are a debugging tool, not a training method - They are too expensive for normal learning but extremely useful for verification.
  3. A failed gradient check usually points to a local implementation bug - Shapes, cached values, derivatives, averaging, or missing terms are the usual culprits.

Knowledge Check (Test Questions)

  1. What is the main purpose of gradient checking?

    • A) To verify that the gradient computed by backpropagation is consistent with a numerical approximation.
    • B) To replace backpropagation during training.
    • C) To choose the best optimizer automatically.
  2. Why is symmetric finite difference usually preferred over a one-sided estimate?

    • A) Because it usually gives a more accurate numerical approximation of the local derivative.
    • B) Because it removes the need to compute the loss.
    • C) Because it guarantees zero floating-point error.
  3. Why should gradient checking usually be done on a tiny deterministic setup?

    • A) Because it is expensive and stochastic behavior can make the comparison unreliable.
    • B) Because larger models cannot have gradients.
    • C) Because backpropagation only works on tiny networks.

Answers

1. A: Gradient checking is a verification step that compares implemented backprop against an independent numerical reference.

2. A: The symmetric formula generally gives a better local approximation than one-sided finite differences.

3. A: You want the cleanest possible signal when debugging gradients, and large or stochastic setups add unnecessary noise and cost.


