Day 123: Weight Initialization

Weight initialization matters because training does not begin from a neutral state: the very first forward and backward passes already depend on how the initial weights seed signal and gradient flow.


Today's "Aha!" Moment

Once a network has an optimizer and a learning-rate schedule, it still needs somewhere to start. That starting point is not neutral. If the initial weights are too large, activations and gradients can explode or saturate. If they are too small, signals can shrink toward zero and learning becomes weak or painfully slow.

There is also a second problem: symmetry. If every neuron in a layer starts with exactly the same weights, then they receive the same gradients and keep evolving identically. The layer wastes its capacity because all units learn the same feature.

That is why initialization is more than “pick some random numbers.” Good initialization tries to do two things at once: break symmetry so different neurons can learn different features, and keep activations and gradients at a sensible scale as they travel through the network.

That is the aha. Initialization is the network's starting geometry for signal flow.


Why This Matters

The problem: A network can fail before optimization has any real chance to help if the starting weights make activations or gradients pathological from the first pass.

Before: Weights seeded carelessly (all identical, or at the wrong scale) leave activations saturated or vanishing from the very first forward pass, and the loss barely moves.

After: Weights drawn randomly at an activation-appropriate scale break symmetry and keep signal and gradient magnitudes usable, so optimization can actually make progress.

Real-world impact: Good initialization improves training stability, speed, and depth scalability. Bad initialization can make even a reasonable architecture look broken.


Learning Objectives

By the end of this session, you will be able to:

  1. Explain why random initialization is needed - Understand symmetry breaking and why identical weights are a problem.
  2. Explain why scale matters - Understand how poor weight scale can shrink or blow up activations and gradients.
  3. Reason about common initialization schemes - See why Xavier and He initialization are matched to network behavior and activation choice.

Core Concepts Explained

Concept 1: Initialization Must Break Symmetry

Imagine a hidden layer with several neurons, all starting with exactly the same weights and bias. Each neuron receives the same input, produces the same output, and therefore receives the same gradient. After one update, they are still identical.

That means the layer behaves like many copies of one neuron instead of many distinct feature detectors.

identical start
   -> identical activations
   -> identical gradients
   -> identical updates
   -> wasted capacity

This is why “all zeros” initialization is fine for some biases but disastrous for the weights of a hidden layer. Randomness is not there to make the code feel stochastic. It is there so neurons can start differently enough to specialize.

The trade-off is simple: you want enough randomness to break symmetry, but not so much that the network begins in a numerically unstable regime.
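The clone effect can be checked directly with a few lines of numpy. This is a minimal sketch, not code from this lesson: the two-unit tanh layer, the regression targets, and the hand-written gradients are all illustrative assumptions.

```python
import numpy as np

# Hypothetical 2-unit hidden layer where both units start with identical
# weights. We compute one backward pass by hand and show that both units
# receive identical gradients, so they stay clones after any update.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))          # batch of 8 inputs, 3 features
y = rng.normal(size=(8, 1))          # illustrative regression targets

W1 = np.full((3, 2), 0.5)            # both hidden units share the same weights
w2 = np.full((2, 1), 0.5)            # identical output weights as well

h = np.tanh(x @ W1)                  # columns identical: h[:, 0] == h[:, 1]
pred = h @ w2
grad_h = (pred - y) @ w2.T           # gradient w.r.t. hidden activations
grad_W1 = x.T @ (grad_h * (1 - h**2))  # tanh' = 1 - tanh^2

# Both columns of the weight gradient are identical: the two units will
# keep evolving in lockstep, wasting capacity.
print(np.allclose(grad_W1[:, 0], grad_W1[:, 1]))  # True
```

Replacing `np.full` with any random draw makes the two gradient columns differ, which is exactly what symmetry breaking buys you.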

Concept 2: Initialization Also Controls the Scale of Activations and Gradients

Breaking symmetry is necessary, but it is not enough. The magnitude of the initial weights matters because every layer multiplies and transforms the signal coming from the previous one.

If weights are too large: pre-activations grow with each layer, saturating units like tanh get pushed into their flat regions, and gradients can explode or die.

If weights are too small: activations shrink layer by layer, the signal reaching deep layers fades toward zero, and the gradients flowing back are just as weak.

too large  -> unstable or saturated
too small  -> weak signal and weak gradient
just right -> signal stays in a usable range

This is the main reason initialization is tied to network depth. A small scaling problem repeated layer after layer becomes a large training problem.

The trade-off is between expressiveness and stability. Larger weights make activations more dramatic, but they can also destroy gradient flow. Smaller weights are safer, but can make the network too passive at the start.
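The compounding effect is easy to see numerically. In the sketch below, the depth, width, and the two "bad" scales are arbitrary illustrative choices; the matched scale 1/sqrt(width) is the one that keeps the variance of a plain linear layer roughly constant.

```python
import numpy as np

# Push one signal through a stack of purely linear layers at three weight
# scales and watch the activation magnitude compound layer after layer.
rng = np.random.default_rng(0)
depth, width = 30, 256
x0 = rng.normal(size=(width,))

for scale, label in [(0.01, "too small"),
                     (0.2, "too large"),
                     (1 / np.sqrt(width), "matched")]:
    x = x0.copy()
    for _ in range(depth):
        W = rng.normal(scale=scale, size=(width, width))
        x = W @ x                    # one linear layer; nonlinearity omitted
    print(f"{label:>10}: final std ~ {x.std():.2e}")
```

A per-layer scaling error of even 0.16x or 3.2x, repeated 30 times, leaves the final activations many orders of magnitude away from a usable range, while the matched scale keeps them near their starting size.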

Concept 3: Xavier and He Initialization Are Attempts to Preserve Useful Signal Flow

Modern initialization schemes are built around one core idea: choose the variance of the initial weights so activations and gradients stay roughly well-scaled across layers.

Xavier/Glorot initialization is a common choice for activations such as tanh, where you want the signal scale to stay balanced between layers.

He initialization is often used with ReLU-style activations, because ReLU zeroes out part of the signal and therefore typically benefits from a larger variance than Xavier would use.

You do not need the exact derivations yet to get the intuition:

Xavier -> preserve scale for more symmetric activations
He     -> preserve scale better for ReLU-style activations

This is not arbitrary recipe memorization. It is matching the initialization to how the activation function behaves.

A tiny code-level sketch:

import math
import numpy as np

def he_init(fan_in, fan_out):
    # He initialization: variance 2 / fan_in compensates for ReLU
    # zeroing roughly half of the incoming signal on average.
    std = math.sqrt(2.0 / fan_in)
    return np.random.randn(fan_in, fan_out) * std

The trade-off is convenience versus fit. Framework defaults are often good, but the better you understand the interaction between initialization and activation, the better equipped you are to debug custom architectures.
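To see why the He variance is matched to ReLU, the sketch below compares activation scale through a deep ReLU stack under He initialization versus an under-scaled alternative. The depth, width, and the naive 0.01 scale are illustrative assumptions, not recommendations.

```python
import numpy as np

# Compare how activation scale survives a deep ReLU stack under two
# initializations: He (std = sqrt(2 / fan_in)) and a naive fixed std.
rng = np.random.default_rng(0)
depth, width = 20, 512

def forward(std_fn):
    x = rng.normal(size=(width,))
    for _ in range(depth):
        W = rng.normal(scale=std_fn(width), size=(width, width))
        x = np.maximum(0.0, W @ x)   # ReLU zeroes roughly half the units
    return x.std()

print("He   :", forward(lambda fan_in: np.sqrt(2.0 / fan_in)))
print("naive:", forward(lambda fan_in: 0.01))
```

The He run keeps the activation standard deviation near its starting order of magnitude; the naive run collapses toward zero within a couple of dozen layers, which is the "weak signal, weak gradient" failure mode from Concept 2.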

Troubleshooting

Issue: The network barely learns from the first iterations.

Why it happens / is confusing: The optimizer may look correct, so the lack of movement feels mysterious.

Clarification / Fix: Check whether activations or gradients are tiny from the start. The learning rate may not be the first problem; initialization scale may be.

Issue: Training becomes unstable immediately.

Why it happens / is confusing: It is tempting to blame the optimizer or batch noise.

Clarification / Fix: Inspect whether the initial weights are pushing activations into extreme regimes before training has even had a chance to settle.

Issue: A hidden layer behaves as if many neurons are redundant copies.

Why it happens / is confusing: The model still has the right number of units, so capacity looks fine on paper.

Clarification / Fix: Check for symmetry-breaking failures. If neurons start identically, they often keep learning identically.
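All three fixes above amount to the same move: instrument the first forward pass before blaming the optimizer. A minimal diagnostic sketch, assuming a toy model stored as a list of weight matrices with ReLU layers; the under-scaled 0.01 weights are chosen deliberately to make the vanishing pattern visible.

```python
import numpy as np

def activation_report(weights, x):
    # Run one forward pass and record each layer's activation std.
    # Values collapsing toward zero (or blowing up) from layer to layer
    # are an initialization-scale red flag, not an optimizer problem.
    stats = []
    for i, W in enumerate(weights):
        x = np.maximum(0.0, x @ W)   # assumed ReLU layers
        stats.append(x.std())
        print(f"layer {i}: activation std = {x.std():.3e}")
    return stats

rng = np.random.default_rng(0)
sizes = [64, 64, 64, 64]
# Deliberately under-scaled weights to demonstrate the failure mode.
weights = [rng.normal(scale=0.01, size=(m, n))
           for m, n in zip(sizes, sizes[1:])]
stats = activation_report(weights, rng.normal(size=(32, sizes[0])))
```

If the per-layer numbers shrink by a roughly constant factor each layer, suspect initialization scale before touching the learning rate.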


Advanced Connections

Connection 1: Initialization ↔ Signal Propagation

The parallel: Good initialization tries to preserve healthy signal scale as information moves forward and backward through the network.

Real-world case: This is one reason deeper networks became much more trainable once initialization schemes improved.

Connection 2: Initialization ↔ Activation Choice

The parallel: The “right” starting variance depends partly on how the activation function reshapes its inputs.

Real-world case: Xavier and He initialization are best understood as activation-aware scaling rules, not as random historical conventions.



Key Insights

  1. Initialization must break symmetry - Hidden neurons need different starting weights so they can learn different features.
  2. Initialization scale shapes early signal and gradient flow - Too large or too small can destabilize training before optimization has a chance to help.
  3. Common schemes like Xavier and He are activation-aware scaling rules - They are attempts to preserve useful variance through the network.

Knowledge Check (Test Questions)

  1. Why is all-zero weight initialization bad for a hidden layer?

    • A) Because neurons start identically, receive identical gradients, and keep learning the same feature.
    • B) Because zero weights cannot be updated by gradient descent.
    • C) Because biases stop working.
  2. What is one major risk of weights that are initialized too large?

    • A) Activations and gradients may become unstable or saturate.
    • B) The network becomes linear.
    • C) The loss function disappears.
  3. Why is He initialization often paired with ReLU-like activations?

    • A) Because it uses a scale designed to better preserve signal when many activations are partially zeroed by ReLU behavior.
    • B) Because ReLU requires all weights to be positive.
    • C) Because He initialization removes the need for learning-rate tuning.

Answers

1. A: Without symmetry breaking, the hidden units remain clones of one another.

2. A: Very large initial weights can push the network into unstable or saturated regimes immediately.

3. A: He initialization uses a variance choice that better matches the way ReLU-like activations pass or zero signal.


