Day 117: Chain Rule & Computational Graphs
The chain rule matters because it is the reason a neural network can assign blame for one wrong prediction all the way back through every intermediate computation that produced it.
Today's "Aha!" Moment
The last lesson answered one question: how does the network know whether a prediction was good or bad? The loss gives that answer. But a harder question remains: once the loss says "this was wrong," how does the model know which weight, in which earlier layer, should change?
That is what the chain rule solves. A neural network is a composition of many small operations: weighted sums, bias additions, activations, and finally a loss. The output error does not touch an early weight directly. It reaches that weight through the sequence of computations that connect them.
A computational graph makes that structure visible. Each node is one operation, and each edge carries a value forward. The chain rule then says: if you want to know how the final loss changes with respect to some earlier variable, multiply the local sensitivities along the path.
That is the aha. Backpropagation is not magical global intelligence. It is systematic bookkeeping over a graph of composed functions.
Why This Matters
The problem: Deep networks have many parameters spread across many layers. Without a disciplined way to trace influence backward, training would be intractable.
Before:
- Gradients can feel like mysterious numbers produced by autodiff libraries.
- It is unclear how a late error reaches an early layer.
- Backpropagation looks like a special trick instead of a consequence of composition.
After:
- A network can be seen as a graph of simple operations.
- The chain rule becomes the rule for distributing responsibility through that graph.
- Backpropagation becomes easier to understand as repeated local gradient multiplication.
Real-world impact: This is one of the core ideas that made modern deep learning practical. Once a model is written as a differentiable computation graph, gradients can be computed systematically rather than guessed or derived separately for each architecture.
Learning Objectives
By the end of this session, you will be able to:
- Explain why the chain rule is essential for neural-network training - Describe how it links the final loss to earlier parameters.
- Read a simple computational graph - Identify nodes, intermediate values, and how local derivatives compose.
- Connect local gradients to global influence - Understand why backpropagation can be seen as the chain rule applied repeatedly.
Core Concepts Explained
Concept 1: A Neural Network Is a Composition of Small Functions
The cleanest way to think about a network is not as one giant formula, but as a chain of smaller steps.
For a tiny binary classifier, one path might look like this:
x
-> z = w*x + b
-> a = sigmoid(z)
-> L = loss(a, y)
Each line is a small function. The final loss depends on a, which depends on z, which depends on w and b.
That is exactly the setup where the chain rule applies. If you want to know how much the loss changes when w changes, you do not jump directly from L to w. You walk through the chain of dependencies.
dL/dw = dL/da * da/dz * dz/dw
The trade-off is that breaking the model into many explicit steps creates more intermediate quantities to track, but it turns one opaque expression into a system whose dependencies can be understood and differentiated systematically.
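The chain above can be computed by hand in a few lines. This is a minimal sketch: the concrete values of x, y, w, and b are made up for illustration, and squared error stands in for the generic loss(a, y).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values for illustration only.
x, y = 1.5, 1.0
w, b = 0.8, -0.2

# Forward pass: each line is one small function in the chain.
z = w * x + b               # z = w*x + b
a = sigmoid(z)              # a = sigmoid(z)
L = 0.5 * (a - y) ** 2      # L = loss(a, y), squared error assumed

# Local derivatives along the path from L back to w.
dL_da = a - y               # derivative of squared error w.r.t. a
da_dz = a * (1.0 - a)       # sigmoid'(z)
dz_dw = x                   # derivative of w*x + b w.r.t. w

# Chain rule: multiply the local sensitivities along the path.
dL_dw = dL_da * da_dz * dz_dw
print(dL_dw)
```

Each factor in the final product corresponds to one edge in the dependency chain, which is exactly the dL/dw = dL/da * da/dz * dz/dw expression above.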
Concept 2: Computational Graphs Make the Dependency Structure Visible
A computational graph is just a picture of how values are produced.
For the same example:
w ----\
       (*) ----\
x ----/         \
                 (+) ---> z ---> sigmoid ---> a ---> loss ---> L
b ---------------/
The value flows forward through the graph during inference. During training, gradient information flows backward.
The graph is not merely decorative. It tells you two important things:
- what values must be computed in the forward pass
- what local derivatives are needed in the backward pass
Every node only needs to know how its own output changes with respect to its own inputs. It does not need to understand the whole network at once.
That is what makes the system scalable. Instead of deriving a brand-new global formula for every architecture, you compose many local derivative rules according to the graph structure.
The trade-off is that graph-based thinking introduces more notation and intermediate states, but it dramatically simplifies both conceptual understanding and implementation.
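The "each node only knows its own local derivative" idea can be sketched directly. The functions below are illustrative, not a real autodiff API: each operation returns its output together with its local derivatives, and the backward pass is just a product over the path in the graph.

```python
import math

def mul(u, v):
    """Node: out = u * v. Local derivatives are (v, u)."""
    return u * v, (v, u)

def add(u, v):
    """Node: out = u + v. Local derivatives are (1, 1)."""
    return u + v, (1.0, 1.0)

def sigmoid_node(z):
    """Node: out = sigmoid(z). Local derivative is out * (1 - out)."""
    out = 1.0 / (1.0 + math.exp(-z))
    return out, (out * (1.0 - out),)

# Forward pass over the graph: w,x -> (*) -> (+) with b -> sigmoid.
w, x, b = 0.8, 1.5, -0.2          # hypothetical values
wx, (dwx_dw, _) = mul(w, x)
z, (dz_dwx, _) = add(wx, b)
a, (da_dz,) = sigmoid_node(z)

# Backward pass: chain the local derivatives along the path to w.
# No node needed to know anything beyond its own inputs and output.
da_dw = da_dz * dz_dwx * dwx_dw
print(da_dw)
```

Swapping in a different architecture only changes which local rules get composed, not the rules themselves; that is the scalability argument in miniature.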
Concept 3: The Chain Rule Turns Local Sensitivities into Global Credit Assignment
Here is the core intuition that matters most.
Suppose the loss changes a lot when the output activation changes. And suppose the output activation changes a lot when the hidden pre-activation changes. And suppose that hidden pre-activation changes a lot when a certain weight changes. Then that weight has a strong influence on the loss.
The chain rule captures that by multiplying local effects:
global effect
=
local effect at the end
* local effect in the middle
* local effect at the earlier step
This is why backpropagation feels like "sending blame backward." Each parameter receives a gradient that reflects how much changing it would have changed the final loss, given all the intermediate steps downstream.
An ASCII view of the idea:
weight -> hidden value -> activation -> loss
| | | |
+--------- chain of local derivatives ----------> total sensitivity
This also explains why the next lesson is called backpropagation. It is not a different mathematical principle. It is the practical algorithmic procedure for applying the chain rule efficiently across the whole graph.
The trade-off is that chained derivatives make deep models trainable, but they also make gradient quality sensitive to what happens at every intermediate step. That is one reason activation choice, initialization, and architecture design matter so much.
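That sensitivity to intermediate steps can be made concrete with a toy calculation. The derivative of the sigmoid never exceeds 0.25 (its value at z = 0), so even in the best case, chaining it across many layers shrinks the gradient geometrically. This is an illustration of the vanishing-gradient effect, not a simulation of a real network.

```python
# Best-case local derivative for sigmoid: sigmoid'(0) = 0.25.
# Chain it across 10 layers and watch the product collapse.
grad = 1.0
for _ in range(10):
    grad *= 0.25

print(grad)   # 0.25**10, on the order of 1e-6
```

A gradient a million times smaller than the loss signal is why activation choice and initialization get so much attention in deep architectures.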
Troubleshooting
Issue: Thinking the chain rule is just calculus trivia unrelated to real training.
Why it happens / is confusing: The symbolic notation can feel abstract and disconnected from neural-network code.
Clarification / Fix: The chain rule is exactly what lets training connect the final loss to early weights. Without it, there is no systematic way to compute those gradients.
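One way to see that the chain rule is real machinery rather than trivia is a gradient check: perturb a weight numerically and confirm that the loss responds exactly as the chained derivatives predict. As before, the values and the squared-error loss are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_of_w(w, x=1.5, b=-0.2, y=1.0):
    """Full forward pass as a function of w alone (squared error assumed)."""
    a = sigmoid(w * x + b)
    return 0.5 * (a - y) ** 2

# Analytic gradient via the chain rule.
w, x, y = 0.8, 1.5, 1.0
a = sigmoid(w * x - 0.2)
analytic = (a - y) * a * (1.0 - a) * x

# Numeric gradient: nudge w and measure how the loss actually moves.
eps = 1e-6
numeric = (loss_of_w(w + eps) - loss_of_w(w - eps)) / (2 * eps)

print(analytic, numeric)
```

The two numbers agree to many decimal places: the chain-rule product is the real sensitivity of the loss to that weight, which is precisely what training needs.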
Issue: Treating a computational graph as a visualization convenience only.
Why it happens / is confusing: Diagrams can look like teaching aids rather than real structure.
Clarification / Fix: The graph encodes the actual dependency structure of the computation. Modern autodiff systems use that structure directly.
Issue: Assuming each layer needs a separate magical gradient formula.
Why it happens / is confusing: Deep networks look complicated enough that every part seems special.
Clarification / Fix: Each operation only needs its local derivative. The global gradient appears by chaining those local pieces together.
Advanced Connections
Connection 1: Computational Graphs ↔ Automatic Differentiation
The parallel: Modern ML frameworks represent computations as graphs and then apply automatic differentiation to compute gradients.
Real-world case: PyTorch, TensorFlow, and JAX all rely on this basic idea even though they package it differently.
Connection 2: Chain Rule ↔ Credit Assignment
The parallel: Training requires deciding which earlier choices contributed to a later outcome.
Real-world case: In neural networks, that credit assignment is mathematical: gradients quantify how much each parameter influenced the final loss.
Resources
Optional Deepening Resources
- These resources are optional and are not required for the core 30-minute path.
- [TUTORIAL] CS231n Notes - Backpropagation, Intuitions
- Link: https://cs231n.github.io/optimization-2/
- Focus: Read the sections on computational graphs and local gradients.
- [BOOK/TUTORIAL] Neural Networks and Deep Learning - Chapter 2
- Link: http://neuralnetworksanddeeplearning.com/chap2.html
- Focus: Reinforce how the chain rule becomes the engine of training.
- [ARTICLE] The Matrix Calculus You Need For Deep Learning
- Link: https://arxiv.org/abs/1802.01528
- Focus: Use later if you want a more formal bridge from scalar chain rule to vectorized neural-network derivatives.
- [BOOK] Deep Learning
- Link: https://www.deeplearningbook.org/
- Focus: Use the chapters on numerical computation and deep feedforward networks for a more formal treatment.
Key Insights
- A neural network is a composition of smaller functions - That structure is what makes the chain rule applicable.
- Computational graphs make dependencies explicit - They show how values flow forward and where gradients must flow backward.
- Global blame is built from local derivatives - Backpropagation works by chaining simple local sensitivities into a full gradient.
Knowledge Check (Test Questions)
1. Why is the chain rule necessary in neural-network training?
- A) Because the loss depends on early parameters through many intermediate computations.
- B) Because it removes the need for a loss function.
- C) Because it makes all activations linear.
2. What is the main purpose of a computational graph?
- A) To represent how values are computed from simpler operations so gradients can be traced systematically.
- B) To replace the need for matrix multiplication.
- C) To guarantee that optimization will converge.
3. What does a local derivative represent in this setting?
- A) How one node's output changes with respect to one of its direct inputs.
- B) The final model accuracy on the validation set.
- C) A global summary of the whole network by itself.
Answers
1. A: The chain rule connects the final loss to parameters that only influence it indirectly through intermediate layers.
2. A: The graph exposes the dependency structure of the computation, which is exactly what both forward evaluation and gradient propagation need.
3. A: Backpropagation builds the full gradient from these local pieces, one operation at a time.