Day 123: Weight Initialization
Weight initialization matters because training has to begin somewhere: the very first forward and backward passes already depend on how signal and gradient were seeded.
Today's "Aha!" Moment
Once a network has an optimizer and a learning-rate schedule, it still needs somewhere to start. That starting point is not neutral. If the initial weights are too large, activations and gradients can explode or saturate. If they are too small, signals can shrink toward zero and learning becomes weak or painfully slow.
There is also a second problem: symmetry. If every neuron in a layer starts with exactly the same weights, then they receive the same gradients and keep evolving identically. The layer wastes its capacity because all units learn the same feature.
That is why initialization is more than “pick some random numbers.” Good initialization tries to do two things at once: break symmetry so different neurons can learn different features, and keep activations and gradients at a sensible scale as they travel through the network.
That is the aha. Initialization is the network's starting geometry for signal flow.
Why This Matters
The problem: A network can fail before optimization has any real chance to help if the starting weights make activations or gradients pathological from the first pass.
Before:
- Initialization feels like a minor implementation detail.
- Early training failures are blamed only on the optimizer or learning rate.
- Randomness sounds sufficient without regard to scale.
After:
- Initialization is seen as control over early signal propagation.
- Symmetry breaking and variance preservation become the main goals.
- Schemes like Xavier and He initialization become easier to understand, not just memorize.
Real-world impact: Good initialization improves training stability, speed, and depth scalability. Bad initialization can make even a reasonable architecture look broken.
Learning Objectives
By the end of this session, you will be able to:
- Explain why random initialization is needed - Understand symmetry breaking and why identical weights are a problem.
- Explain why scale matters - Understand how poor weight scale can shrink or blow up activations and gradients.
- Reason about common initialization schemes - See why Xavier and He initialization are matched to network behavior and activation choice.
Core Concepts Explained
Concept 1: Initialization Must Break Symmetry
Imagine a hidden layer with several neurons, all starting with exactly the same weights and bias. Each neuron receives the same input, produces the same output, and therefore receives the same gradient. After one update, they are still identical.
That means the layer behaves like many copies of one neuron instead of many distinct feature detectors.
identical start
-> identical activations
-> identical gradients
-> identical updates
-> wasted capacity
This is why “all zeros” initialization is fine for some biases but disastrous for the weights of a hidden layer. Randomness is not there to make the code feel stochastic. It is there so neurons can start differently enough to specialize.
The trade-off is simple: you want enough randomness to break symmetry, but not so much that the network begins in a numerically unstable regime.
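This failure mode is easy to see in a few lines. Here is a minimal NumPy sketch of a tiny 3 → 2 → 1 network; the sizes, the constant 0.1 starting value, and the squared-error loss are all illustrative choices:

```python
import numpy as np

# A tiny 3 -> 2 -> 1 network where every weight starts at the same value.
x = np.array([0.5, -1.0, 2.0])     # one input example (illustrative values)
W1 = np.full((3, 2), 0.1)          # both hidden neurons start identically
w2 = np.full((2,), 0.1)            # output weights also identical
t = 1.0                            # target

# Forward pass: the two hidden activations come out identical.
h = np.tanh(x @ W1)
y = h @ w2

# Backward pass for the squared error L = (y - t)**2.
dy = 2 * (y - t)                   # dL/dy
dh = dy * w2                       # gradient reaching each hidden unit
dW1 = np.outer(x, dh * (1 - h**2)) # gradient w.r.t. the hidden weights

# Both columns of dW1 are identical, so after any gradient step the
# two neurons remain clones of each other.
print(np.allclose(dW1[:, 0], dW1[:, 1]))  # True
```

Because both columns of dW1 are equal, gradient descent updates both neurons identically at every step; any nonzero random draw for W1 breaks the tie.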
Concept 2: Initialization Also Controls the Scale of Activations and Gradients
Breaking symmetry is necessary, but it is not enough. The magnitude of the initial weights matters because every layer multiplies and transforms the signal coming from the previous one.
If weights are too large:
- activations may become huge
- saturating activations like sigmoid or tanh can flatten out
- gradients may explode or vanish
If weights are too small:
- activations may collapse toward zero
- downstream layers receive weak signals
- gradients can become tiny and learning slows dramatically
too large -> unstable or saturated
too small -> weak signal and weak gradient
just right -> signal stays in a usable range
This is the main reason initialization is tied to network depth. A small scaling problem repeated layer after layer becomes a large training problem.
The trade-off is between expressiveness and stability. Larger weights make activations more dramatic, but they can also destroy gradient flow. Smaller weights are safer, but can make the network too passive at the start.
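The layer-by-layer compounding described above can be demonstrated directly. A sketch, assuming plain tanh layers; the width, depth, and the three scale values are illustrative:

```python
import numpy as np

# Push a batch of signals through a deep stack of tanh layers and watch
# how the activation scale evolves for three initial weight scales.
rng = np.random.default_rng(0)
width, depth = 256, 30
x0 = rng.standard_normal((512, width))

final_std = {}
for name, std in [("too small", 0.01),
                  ("too large", 1.0),
                  ("1/sqrt(fan_in)", 1.0 / np.sqrt(width))]:
    x = x0
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * std
        x = np.tanh(x @ W)
    final_std[name] = x.std()
    print(f"{name:>15}: final activation std = {x.std():.4f}, "
          f"saturated fraction = {(np.abs(x) > 0.99).mean():.2f}")
```

The too-small scale shrinks the signal by a constant factor per layer, so after thirty layers it has collapsed to essentially zero; the too-large scale pins most activations into tanh's saturated region; the 1/sqrt(fan_in) scale keeps the signal in a usable range all the way down.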
Concept 3: Xavier and He Initialization Are Attempts to Preserve Useful Signal Flow
Modern initialization schemes are built around one core idea: choose the variance of the initial weights so activations and gradients stay roughly well-scaled across layers.
Xavier/Glorot initialization is a common choice for activations such as tanh, where you want the signal scale to stay balanced between layers.
He initialization is often used with ReLU-style activations, because ReLU zeroes out part of the signal and therefore typically benefits from a larger variance than Xavier would use.
You do not need the exact derivations yet to get the intuition:
Xavier -> preserve scale for more symmetric activations
He -> preserve scale better for ReLU-style activations
This is not arbitrary recipe memorization. It is matching the initialization to how the activation function behaves.
A tiny code-level sketch:
import math
import numpy as np

def he_init(fan_in, fan_out):
    # He initialization for ReLU-style layers: std = sqrt(2 / fan_in)
    std = math.sqrt(2.0 / fan_in)
    return np.random.randn(fan_in, fan_out) * std
The trade-off is convenience versus fit. Framework defaults are often good, but the better you understand the interaction between initialization and activation, the better equipped you are to debug custom architectures.
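For comparison with the he_init sketch above, a Glorot/Xavier counterpart can be written the same way (xavier_init is an illustrative helper name; the normal-distribution variant is shown):

```python
import math
import numpy as np

def xavier_init(fan_in, fan_out):
    # Glorot/Xavier (normal variant): std = sqrt(2 / (fan_in + fan_out)),
    # balancing forward and backward signal scale for tanh-like layers.
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return np.random.randn(fan_in, fan_out) * std

# For a square 512 -> 512 layer, He's sqrt(2/fan_in) is larger than
# Xavier's sqrt(2/(fan_in + fan_out)) by exactly sqrt(2): the extra
# scale compensates for ReLU zeroing roughly half the activations.
print(math.sqrt(2.0 / 512) / math.sqrt(2.0 / 1024))  # ≈ 1.4142
```

That factor is the whole practical difference between the two schemes for a square layer: same idea, a larger variance to offset what ReLU throws away.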
Troubleshooting
Issue: The network barely learns from the first iterations.
Why it happens / is confusing: The optimizer may look correct, so the lack of movement feels mysterious.
Clarification / Fix: Check whether activations or gradients are tiny from the start. The learning rate may not be the first problem; initialization scale may be.
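One way to run that check is to log per-layer activation statistics on the very first forward pass. A sketch assuming a plain NumPy stack; the deliberately small 0.01 scale reproduces the failure:

```python
import numpy as np

# First-pass diagnostic: log the scale of activations at every layer
# before blaming the optimizer. Sizes and the bad init are illustrative.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 128))

for i in range(5):
    W = rng.standard_normal((128, 128)) * 0.01   # deliberately too small
    x = np.tanh(x @ W)
    # If these stds collapse toward zero layer by layer, suspect the
    # initialization scale before touching the learning rate.
    print(f"layer {i}: activation std = {x.std():.6f}")
```

In a real framework you would attach this kind of logging to the actual layers (for example via forward hooks) rather than rebuilding the network, but the signature is the same: a std that shrinks by a roughly constant factor per layer points at initialization, not at the optimizer.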
Issue: Training becomes unstable immediately.
Why it happens / is confusing: It is tempting to blame the optimizer or batch noise.
Clarification / Fix: Inspect whether the initial weights are pushing activations into extreme regimes before training has even had a chance to settle.
Issue: A hidden layer behaves as if many neurons are redundant copies.
Why it happens / is confusing: The model still has the right number of units, so capacity looks fine on paper.
Clarification / Fix: Check for symmetry-breaking failures. If neurons start identically, they often keep learning identically.
Advanced Connections
Connection 1: Initialization ↔ Signal Propagation
The parallel: Good initialization tries to preserve healthy signal scale as information moves forward and backward through the network.
Real-world case: This is one reason deeper networks became much more trainable once initialization schemes improved.
Connection 2: Initialization ↔ Activation Choice
The parallel: The “right” starting variance depends partly on how the activation function reshapes its inputs.
Real-world case: Xavier and He initialization are best understood as activation-aware scaling rules, not as random historical conventions.
Resources
Optional Deepening Resources
- These resources are optional and are not required for the core 30-minute path.
- [TUTORIAL] CS231n Notes - Weight Initialization
- Link: https://cs231n.github.io/neural-networks-2/#init
- Focus: Review symmetry breaking, variance concerns, and practical initialization advice.
- [PAPER] Understanding the difficulty of training deep feedforward neural networks
- Link: https://proceedings.mlr.press/v9/glorot10a.html
- Focus: Read the intuition behind Xavier/Glorot initialization.
- [PAPER] Delving Deep into Rectifiers
- Link: https://arxiv.org/abs/1502.01852
- Focus: See the reasoning behind He initialization for ReLU-like networks.
- [BOOK] Deep Learning
- Link: https://www.deeplearningbook.org/
- Focus: Use the chapters on optimization and feedforward networks for a more formal treatment of initialization.
Key Insights
- Initialization must break symmetry - Hidden neurons need different starting weights so they can learn different features.
- Initialization scale shapes early signal and gradient flow - Too large or too small can destabilize training before optimization has a chance to help.
- Common schemes like Xavier and He are activation-aware scaling rules - They are attempts to preserve useful variance through the network.
Knowledge Check (Test Questions)
1. Why is all-zero weight initialization bad for a hidden layer?
- A) Because neurons start identically, receive identical gradients, and keep learning the same feature.
- B) Because zero weights cannot be updated by gradient descent.
- C) Because biases stop working.
2. What is one major risk of weights that are initialized too large?
- A) Activations and gradients may become unstable or saturate.
- B) The network becomes linear.
- C) The loss function disappears.
3. Why is He initialization often paired with ReLU-like activations?
- A) Because it uses a scale designed to better preserve signal when many activations are partially zeroed by ReLU behavior.
- B) Because ReLU requires all weights to be positive.
- C) Because He initialization removes the need for learning-rate tuning.
Answers
1. A: Without symmetry breaking, the hidden units remain clones of one another.
2. A: Very large initial weights can push the network into unstable or saturated regimes immediately.
3. A: He initialization uses a variance choice that better matches the way ReLU-like activations pass or zero signal.