Day 119: Implementing Backpropagation from Scratch
Implementing backpropagation from scratch matters because writing it yourself is where the abstract story finally hardens into something testable: tensors, caches, gradients, and update steps that either line up or break.
Today's "Aha!" Moment
Yesterday's lesson explained the algorithmic pattern of backpropagation. But there is still a difference between "I understand the idea" and "I can implement a small network that actually computes correct gradients."
That gap matters. When you implement backpropagation yourself, you stop treating gradients as magical outputs from a framework. You see exactly what has to be stored in the forward pass, what each layer must return in the backward pass, and why shapes, transposes, and activation derivatives matter so much.
The simplest useful setup is a tiny network like Linear -> ReLU -> Linear -> Sigmoid, trained on a small batch. Once you can implement that end to end, most later deep-learning machinery becomes easier to trust and debug.
That is the aha. "From scratch" is not about rejecting libraries. It is about building the mechanistic understanding that lets you use libraries without cargo culting them.
Why This Matters
The problem: Backpropagation can seem clear in theory but still feel opaque in practice until you have to compute and store the actual quantities yourself.
Before:
- Gradients look like mysterious byproducts of a framework.
- It is easy to miss what needs to be cached during the forward pass.
- Shape bugs and derivative bugs are hard to reason about.
After:
- The training step becomes a sequence of concrete operations.
- Each layer has a clear interface: forward values in one direction, gradients in the other.
- Debugging gradients becomes much more systematic.
Real-world impact: Even if you later rely on PyTorch, TensorFlow, or JAX, understanding a hand-built backprop loop makes you much better at diagnosing custom layers, shape issues, exploding or vanishing gradients, and incorrect training behavior.
Learning Objectives
By the end of this session, you will be able to:
- Describe the structure of a manual training step - Forward pass, loss, backward pass, parameter update.
- Implement the core backward pieces for a small network - Understand what each layer must compute and cache.
- Recognize the most common implementation bugs - Especially shape mistakes, missing caches, and wrong local derivatives.
Core Concepts Explained
Concept 1: A Manual Neural-Network Training Step Has a Very Clear Skeleton
For a small network, the whole training loop can be understood as four stages:
1. forward pass
2. compute loss
3. backward pass
4. update parameters
That structure is simple enough to write explicitly.
# forward (cache z1, a1, and z2 for the backward pass)
z1 = X @ W1 + b1
a1 = relu(z1)
z2 = a1 @ W2 + b2
y_hat = sigmoid(z2)
# loss (mean binary cross-entropy over the batch)
loss = binary_cross_entropy(y_hat, y)
# backward
N = X.shape[0]
dz2 = (y_hat - y) / N  # combined sigmoid + BCE gradient, averaged over the batch
dW2 = a1.T @ dz2
db2 = dz2.sum(axis=0, keepdims=True)
da1 = dz2 @ W2.T
dz1 = da1 * relu_grad(z1)  # ReLU derivative uses the cached pre-activation z1
dW1 = X.T @ dz1
db1 = dz1.sum(axis=0, keepdims=True)
# update (plain gradient descent)
W1 -= lr * dW1
b1 -= lr * db1
W2 -= lr * dW2
b2 -= lr * db2
This code is small enough to read line by line, and that is the point. The training step is not mysterious when decomposed into its real parts.
The trade-off is that a manual implementation is more verbose and less convenient than a framework call, but it exposes the exact mechanics that frameworks usually hide.
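The most direct way to confirm that a manual backward pass is correct is a finite-difference gradient check: perturb each parameter entry slightly and compare the resulting loss change against the analytic gradient. The sketch below is illustrative; the helper name `numerical_grad` and the toy loss are assumptions, not part of the lesson.

```python
import numpy as np

def numerical_grad(loss_fn, W, eps=1e-5):
    """Central-difference estimate of d(loss)/dW, one entry at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        orig = W[idx]
        W[idx] = orig + eps
        loss_plus = loss_fn()
        W[idx] = orig - eps
        loss_minus = loss_fn()
        W[idx] = orig  # restore the original entry before moving on
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

# Toy check: loss = sum((X @ W)**2) has analytic gradient 2 * X.T @ (X @ W)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
analytic = 2 * X.T @ (X @ W)
numeric = numerical_grad(lambda: float(np.sum((X @ W) ** 2)), W)
print(np.allclose(analytic, numeric, atol=1e-6))  # True: the two agree
```

The same check applies unchanged to W1, b1, W2, and b2 in the training step above, and it is the fastest way to localize which gradient formula is wrong.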
Concept 2: Every Layer Needs a Forward Interface and a Backward Interface
One of the cleanest ways to reason about a manual implementation is to think layer by layer.
For a linear layer:
- forward: compute output from input, weights, and bias
- backward: receive gradient with respect to output and return
- gradient with respect to weights
- gradient with respect to bias
- gradient with respect to input
For an activation:
- forward: transform the incoming value
- backward: use the cached pre-activation or activation to compute the local derivative and propagate the gradient
Linear layer:
forward(X) -> Z
backward(dZ) -> dX, dW, db
Activation layer:
forward(Z) -> A
backward(dA) -> dZ
This interface-based view makes the whole network modular. It also explains why the forward pass must cache values. The backward pass needs those exact intermediates to compute local derivatives correctly.
The trade-off is between modular clarity and implementation overhead. A clean layer API is easier to debug, but it forces you to be explicit about what each stage stores and returns.
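The interface described above can be made concrete with two tiny classes. This is a minimal sketch under the conventions of this lesson; the class names and the caching-on-`self` style are one reasonable choice, not the only one.

```python
import numpy as np

class Linear:
    def __init__(self, in_dim, out_dim, rng):
        self.W = rng.normal(scale=0.1, size=(in_dim, out_dim))
        self.b = np.zeros((1, out_dim))

    def forward(self, X):
        self.X = X  # cache the input for the backward pass
        return X @ self.W + self.b

    def backward(self, dZ):
        self.dW = self.X.T @ dZ                   # gradient w.r.t. weights
        self.db = dZ.sum(axis=0, keepdims=True)   # gradient w.r.t. bias
        return dZ @ self.W.T                      # gradient w.r.t. input

class ReLU:
    def forward(self, Z):
        self.Z = Z  # cache the pre-activation
        return np.maximum(0, Z)

    def backward(self, dA):
        return dA * (self.Z > 0)  # local derivative uses the cached Z

# Wire them together: forward in order, backward in reverse order
rng = np.random.default_rng(0)
lin, act = Linear(3, 2, rng), ReLU()
X = rng.normal(size=(4, 3))
A = act.forward(lin.forward(X))
dX = lin.backward(act.backward(np.ones_like(A)))
print(dX.shape)  # (4, 3): gradient w.r.t. input matches the input's shape
```

Notice that each `backward` consumes exactly what the matching `forward` cached, which is the interface contract the concept describes.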
Concept 3: Most Backpropagation Bugs Are Bookkeeping Bugs, Not Deep Math Bugs
When a manual implementation fails, the cause is usually mundane.
Common failures include:
- wrong shape or missing transpose
- using the activation a when the derivative actually needs the pre-activation z
- forgetting batch averaging or summation conventions
- updating parameters before all required gradients are computed
- mismatching activation derivative with the cached value
A useful debugging checklist looks like this:
- forward values sane?
- loss decreasing at all?
- gradient shapes match parameter shapes?
- activation derivative applied to the right cached tensor?
- updates happening after gradients are computed?
This is why implementing backprop from scratch is so educational. You discover that much of practical ML reliability comes from disciplined bookkeeping, not from mystical mathematical cleverness.
The trade-off is that the algorithm itself is elegant, but the implementation is unforgiving. Small shape mistakes can silently produce wrong gradients even when the code still runs.
Troubleshooting
Issue: The code runs, but the loss does not decrease.
Why it happens / is confusing: Silent shape-compatible errors can still produce incorrect gradients.
Clarification / Fix: Check each gradient formula, confirm cached tensors are the right ones, and inspect whether gradient shapes match parameter shapes exactly.
Issue: Confusing z and a in activation derivatives.
Why it happens / is confusing: The names are close and both appear in forward and backward code.
Clarification / Fix: Decide explicitly which cached value each derivative uses and keep that convention consistent across the implementation.
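A small numerical check makes the convention concrete. For ReLU the mask computed from z and from a happen to agree, while for sigmoid the convenient derivative formula is written in terms of the activation a; the snippet below (an illustrative sketch, with variable names chosen here) verifies both facts.

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.5, 2.0])

# ReLU: the mask from z and the mask from a agree, so either cache works here
a_relu = np.maximum(0, z)
print(np.array_equal((z > 0), (a_relu > 0)))  # True

# Sigmoid: the convenient derivative formula uses the cached activation a ...
a_sig = 1 / (1 + np.exp(-z))
deriv_from_a = a_sig * (1 - a_sig)
# ... and it matches the direct formula written in terms of z
deriv_from_z = np.exp(-z) / (1 + np.exp(-z)) ** 2
print(np.allclose(deriv_from_a, deriv_from_z))  # True
```

The point is not that the caches are interchangeable in general; it is that you should verify, per activation, which cached tensor the derivative expects, and then rely on that convention everywhere.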
Issue: Thinking a framework-free implementation is only academic.
Why it happens / is confusing: Real projects usually use autodiff libraries.
Clarification / Fix: Manual implementation is the best way to understand what the library is doing for you and how to debug it when something goes wrong.
Advanced Connections
Connection 1: Manual Backpropagation ↔ Automatic Differentiation
The parallel: A from-scratch implementation mirrors what autodiff frameworks do automatically: cache forward values, apply local derivative rules, and accumulate gradients backward.
Real-world case: Understanding the manual version makes framework gradient traces and hooks much easier to interpret.
Connection 2: Manual Backpropagation ↔ Software Interface Design
The parallel: Layers with explicit forward and backward contracts behave like well-designed software modules.
Real-world case: Clean separation of responsibilities is what makes larger neural-network codebases maintainable and debuggable.
Resources
Optional Deepening Resources
- These resources are optional and are not required for the core 30-minute path.
- [TUTORIAL] CS231n Notes - Backpropagation, Intuitions
- Link: https://cs231n.github.io/optimization-2/
- Focus: Review the local-gradient view before implementing each layer manually.
- [BOOK/TUTORIAL] Neural Networks and Deep Learning - Chapter 2
- Link: http://neuralnetworksanddeeplearning.com/chap2.html
- Focus: Compare the derivation with a small implementation you can reason through line by line.
- [ARTICLE] The Matrix Calculus You Need For Deep Learning
- Link: https://arxiv.org/abs/1802.01528
- Focus: Use later if you want a stronger bridge from scalar intuition to vectorized formulas.
- [BOOK] Deep Learning
- Link: https://www.deeplearningbook.org/
- Focus: Use as a formal reference once the hand-built implementation feels intuitive.
Key Insights
- A manual training step has a stable structure - Forward, loss, backward, update.
- Each layer needs explicit forward and backward behavior - That is what makes the whole network differentiable in practice.
- Most implementation mistakes are local bookkeeping mistakes - Shapes, caches, and derivatives matter as much as the big idea.
Knowledge Check (Test Questions)
1. What is the most useful reason to implement backpropagation from scratch at least once?
- A) To understand exactly what values, gradients, and caches a training step depends on.
- B) To avoid ever using autodiff frameworks again.
- C) To make neural networks train without a loss function.
2. What should a linear layer's backward method usually return?
- A) Gradients with respect to input, weights, and bias.
- B) Only the validation accuracy.
- C) A new set of random weights.
3. What kind of bug is most common in manual backpropagation code?
- A) Local bookkeeping mistakes such as wrong shapes, wrong transpose, or wrong cached tensor.
- B) The chain rule becoming mathematically invalid.
- C) The optimizer refusing to use gradients.
Answers
1. A: Writing the loop yourself makes the dependence on cached values and local derivatives explicit.
2. A: A layer's backward step must tell the rest of the network how the loss changes with respect to both its parameters and its input.
3. A: Most failures come from implementation details, not from the underlying calculus principle.