Day 118: Backpropagation Algorithm - Step by Step

Backpropagation matters because it turns the chain rule into an efficient procedure: one forward pass to compute values, one backward pass to compute how each parameter should change.


Today's "Aha!" Moment

Yesterday's lesson explained the principle: the chain rule lets a network trace how the final loss depends on earlier computations. Backpropagation is the practical algorithm that uses that principle efficiently across the whole network.

The key shift is this: backpropagation does not compute each gradient from scratch with a separate symbolic derivation. That would be hopelessly repetitive. Instead, it reuses partial results. Once you know how the loss changes with respect to a layer's output, you can use that to compute how it changes with respect to that layer's weights, biases, and inputs, then pass the needed part backward to the previous layer.

So the algorithm is really a carefully organized reuse of local derivatives. First run the network forward and save the intermediate values. Then move backward layer by layer, converting downstream error into parameter gradients and upstream error signals.

That is the aha: backpropagation is dynamic programming for gradients. It is not a mysterious neural trick. It is the efficient implementation of repeated chain-rule application.


Why This Matters

The problem: A deep network may contain millions of parameters. Computing "how much does the loss change if I tweak this one?" independently for each parameter would be absurdly expensive.

Before: Each parameter's sensitivity would have to be probed separately, costing roughly one extra forward pass per parameter.

After: One forward pass plus one backward pass produces the gradient for every parameter at once.

Real-world impact: Efficient gradient computation is one of the central reasons modern neural networks are trainable at scale. Understanding backpropagation also helps debug exploding gradients, vanishing gradients, and custom-layer implementations.


Learning Objectives

By the end of this session, you will be able to:

  1. Explain the high-level structure of backpropagation - Describe the forward cache plus backward gradient flow pattern.
  2. Trace gradients through a small network - Understand how one layer receives a downstream gradient and turns it into parameter gradients plus an upstream gradient.
  3. Connect the algorithm to efficient learning - Explain why reused intermediate derivatives make training practical.

Core Concepts Explained

Concept 1: Backpropagation Starts with a Forward Pass That Saves What the Backward Pass Will Need

Backpropagation cannot begin from nowhere. It needs the intermediate values produced during the forward pass.

For a simple two-layer network:

Z1 = W1 @ X + b1
A1 = relu(Z1)
Z2 = W2 @ A1 + b2
Y_hat = sigmoid(Z2)
L = loss(Y_hat, Y)

During the forward pass, you are not only making a prediction. You are also generating the exact quantities the backward pass will differentiate through: X, Z1, A1, Z2, and Y_hat.

That is why implementations often "cache" these values.

forward pass:
  compute values
  save intermediates
  compute loss

The trade-off is memory for speed. Saving intermediate activations costs memory, but without them the backward pass would need to recompute or reconstruct too much of the graph.
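The caching pattern above can be sketched in numpy. The shapes, the loss choice (binary cross-entropy), and the column-per-example convention are illustrative assumptions, not prescriptions:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2, Y):
    # Forward pass that saves every intermediate the backward pass will need.
    Z1 = W1 @ X + b1      # pre-activation of layer 1
    A1 = relu(Z1)         # activation of layer 1
    Z2 = W2 @ A1 + b2     # pre-activation of layer 2
    Y_hat = sigmoid(Z2)   # network output
    # Binary cross-entropy (an illustrative loss choice), averaged over the batch.
    eps = 1e-12  # guards log(0)
    L = -np.mean(Y * np.log(Y_hat + eps) + (1 - Y) * np.log(1 - Y_hat + eps))
    cache = {"X": X, "Z1": Z1, "A1": A1, "Z2": Z2, "Y_hat": Y_hat}
    return L, cache
```

The returned cache dict is exactly the "save intermediates" step: everything the backward pass will differentiate through travels forward with the prediction.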

Concept 2: Each Layer Turns a Downstream Gradient Into Parameter Gradients and an Upstream Gradient

This is the heart of the algorithm.

Suppose you already know how the loss changes with respect to Z2. Call that quantity dZ2. From there, you can compute:

  1. dW2, the gradient with respect to this layer's weights
  2. db2, the gradient with respect to this layer's bias
  3. dA1, the gradient with respect to this layer's input

That last quantity, dA1, is what the previous layer needs.

downstream gradient arrives
        |
        +--> compute dW, db for this layer
        |
        +--> compute upstream gradient for previous layer

For an affine layer, the pattern is mechanically consistent:

dW2 = dZ2 @ A1.T
db2 = sum_over_batch(dZ2)
dA1 = W2.T @ dZ2

Then the previous layer uses dA1 together with the derivative of its activation to get dZ1, and the process repeats.
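Putting the affine rules together with the ReLU derivative gives a full backward pass for the two-layer sketch. This assumes a sigmoid output paired with cross-entropy loss, for which dL/dZ2 conveniently simplifies to (Y_hat - Y) divided by the batch size:

```python
import numpy as np

def backward(cache, W2, Y):
    # Backward pass mirroring the formulas above. Assumes a sigmoid output with
    # cross-entropy loss, so dL/dZ2 = (Y_hat - Y) / batch_size.
    X, Z1, A1 = cache["X"], cache["Z1"], cache["A1"]
    m = X.shape[1]                         # batch size (one example per column)

    dZ2 = (cache["Y_hat"] - Y) / m         # gradient arriving at layer 2
    dW2 = dZ2 @ A1.T                       # parameter gradient for W2
    db2 = dZ2.sum(axis=1, keepdims=True)   # parameter gradient for b2
    dA1 = W2.T @ dZ2                       # upstream gradient handed to layer 1

    dZ1 = dA1 * (Z1 > 0)                   # ReLU derivative: 1 where Z1 > 0, else 0
    dW1 = dZ1 @ X.T
    db1 = dZ1.sum(axis=1, keepdims=True)
    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}
```

Note that layer 1 repeats exactly the same three moves as layer 2: receive a gradient, produce parameter gradients, pass a gradient on.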

This is why backpropagation scales. Every layer solves the same local problem:

  1. receive the gradient coming from later computations
  2. use local derivatives to turn it into parameter gradients
  3. pass the remaining gradient backward

The trade-off is that the algorithm is elegant and reusable, but only if each operation has a clear forward definition and a clear local derivative.

Concept 3: Backpropagation Is Efficient Because It Reuses Shared Computation

Imagine trying to compute the gradient of the loss with respect to every weight independently by tracing every path from scratch. In a deep network, the same downstream computations would be repeated again and again.

Backpropagation avoids that waste. Once the gradient at a node is known, every operation upstream can reuse it. Shared subcomputations are not re-derived from zero.

An ASCII view:

forward:
X -> layer1 -> layer2 -> loss

backward:
loss grad
   -> layer2 grads + signal to layer1
   -> layer1 grads + signal to inputs

This is why the earlier "dynamic programming for gradients" phrase is useful. Backpropagation stores and reuses intermediate results instead of recalculating the same derivative information repeatedly.

It also explains why the algorithm is local. No layer needs to understand the entire network symbolically. It only needs:

  1. the gradient arriving from downstream
  2. the values it cached during the forward pass
  3. its own local derivative rules

The trade-off is conceptual simplicity at the local level versus the need for careful bookkeeping across many layers. Most implementation bugs in backpropagation are bookkeeping mistakes: wrong shapes, wrong cached value, wrong transpose, or wrong activation derivative.
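One standard way to catch those bookkeeping mistakes is a finite-difference gradient check: nudge each parameter slightly and compare the numerical slope against the backprop result. A minimal sketch for a single affine-plus-sigmoid model (the model and shapes are illustrative assumptions):

```python
import numpy as np

def loss_fn(W, x, y):
    # Tiny model: sigmoid of an affine map, with binary cross-entropy loss.
    y_hat = 1.0 / (1.0 + np.exp(-(W @ x)))
    eps = 1e-12
    return float(-np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps)))

def numeric_grad(W, x, y, h=1e-6):
    # Central finite differences: perturb one entry of W at a time.
    # This costs two loss evaluations per parameter, which is exactly
    # the expense backpropagation avoids.
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += h
        Wm[idx] -= h
        g[idx] = (loss_fn(Wp, x, y) - loss_fn(Wm, x, y)) / (2 * h)
    return g

def analytic_grad(W, x, y):
    # Backprop result for sigmoid + cross-entropy: dL/dZ = (y_hat - y) / m.
    y_hat = 1.0 / (1.0 + np.exp(-(W @ x)))
    m = x.shape[1]
    return ((y_hat - y) / m) @ x.T
```

If the two gradients disagree beyond numerical noise, the bug is almost always in the analytic side: a missing transpose, a stale cached value, or a wrong activation derivative.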

Troubleshooting

Issue: Thinking backpropagation directly updates weights.

Why it happens / is confusing: In practice, "backprop" and "training update" are often spoken about together.

Clarification / Fix: Backpropagation computes gradients. The optimizer uses those gradients to update parameters afterward.

Issue: Forgetting that the backward pass depends on saved forward values.

Why it happens / is confusing: Forward and backward can look like separate phases.

Clarification / Fix: The backward pass differentiates through the actual forward computation, so cached activations and pre-activations are essential.

Issue: Treating each layer's backward rule as completely unrelated.

Why it happens / is confusing: Different activations and layer types have different formulas.

Clarification / Fix: The formulas differ, but the pattern stays the same: receive downstream gradient, compute local parameter gradients, pass upstream gradient.


Advanced Connections

Connection 1: Backpropagation ↔ Reverse-Mode Automatic Differentiation

The parallel: Backpropagation is a special case of reverse-mode autodiff applied to neural-network computation graphs.

Real-world case: Modern frameworks generalize this idea far beyond simple feedforward layers, but the underlying mechanism is the same.

Connection 2: Backpropagation ↔ Optimizer Design

The parallel: Optimizers such as SGD or Adam do not replace backpropagation; they consume the gradients backpropagation produces.

Real-world case: When training behaves oddly, it helps to separate "are the gradients right?" from "is the optimizer using them well?"
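That separation is visible in code: backpropagation fills a container of gradients, and the optimizer is just a rule for consuming it. A plain-SGD sketch (the dict layout is an illustrative assumption):

```python
def sgd_step(params, grads, lr=0.1):
    # Backpropagation already produced grads; this separate step applies them.
    # params and grads are dicts keyed the same way, e.g. {"W1": ..., "b1": ...}.
    return {name: value - lr * grads[name] for name, value in params.items()}
```

Swapping SGD for Adam or momentum changes only this function; the gradient computation upstream stays untouched, which is why the two stages can be debugged independently.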



Key Insights

  1. Backpropagation is the efficient algorithmic form of the chain rule - It computes full-network gradients by reusing local derivative information.
  2. Each layer follows the same backward pattern - Use cached forward values and downstream gradient to produce parameter gradients and an upstream gradient.
  3. Backpropagation computes gradients, not updates - The optimizer still has to decide how to turn those gradients into parameter changes.

Knowledge Check (Test Questions)

  1. Why is backpropagation efficient compared with deriving each parameter gradient separately?

    • A) Because it reuses intermediate gradient information instead of recomputing shared downstream effects from scratch.
    • B) Because it removes the need for a forward pass.
    • C) Because it updates weights without using derivatives.
  2. What does a layer need during the backward pass?

    • A) The downstream gradient plus the cached values and local derivative rules for that layer.
    • B) Only the final accuracy metric.
    • C) A brand-new symbolic derivation of the whole network.
  3. What is the relationship between backpropagation and the optimizer?

    • A) Backpropagation computes gradients; the optimizer uses them to update parameters.
    • B) Backpropagation is itself the optimizer.
    • C) The optimizer replaces the need for backpropagation.

Answers

1. A: The whole point of backpropagation is efficient gradient reuse across shared computational paths.

2. A: Backward computation is local: each layer needs the incoming gradient, its cached forward values, and its own local derivative formulas.

3. A: They are different stages of training: first compute gradients, then apply an update rule.


