Day 294: Feed-Forward Networks

The core idea: attention tells tokens what other tokens matter; the feed-forward network then transforms each token's internal features with a small nonlinear network applied independently at every position.


Today's "Aha!" Moment

The insight: A Transformer block is not "just attention." Attention handles communication across positions, but the block still needs a way to do richer computation inside each token representation after that communication happens.

Why this matters: Without the feed-forward part, the layer would mostly be a weighted averaging of token vectors through attention plus a few linear projections, which is close to a purely linear operation.

That is useful, but not enough. The model also needs nonlinear feature transformation per token.
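
A minimal NumPy sketch of this point (the matrices and sizes here are illustrative, not from the lesson): two linear maps composed without an activation collapse into a single linear map, so purely linear stacking adds no expressive power.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 8
    x = rng.normal(size=d)
    W1 = rng.normal(size=(d, d))
    W2 = rng.normal(size=(d, d))

    # Two purely linear "layers" in sequence...
    two_layers = W2 @ (W1 @ x)
    # ...equal one layer with a pre-multiplied matrix.
    one_layer = (W2 @ W1) @ x
    print(np.allclose(two_layers, one_layer))         # True

    # With a nonlinearity in between, the collapse no longer holds.
    relu = lambda v: np.maximum(v, 0.0)
    print(np.allclose(W2 @ relu(W1 @ x), one_layer))  # False in general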

Concrete anchor: Imagine one token has already gathered context from the rest of the sentence. It now needs to convert that contextual information into a better internal representation: maybe amplify some features, suppress others, and combine them nonlinearly. That is the job of the feed-forward network.

The practical sentence to remember:
Attention mixes tokens; the feed-forward network mixes features inside each token.


Why This Matters

By now the Transformer stack has acquired token embeddings, positional information, and attention for moving information between positions.

What is still missing is a strong per-token computation step.

This matters because attention alone is not the full story of representation learning. After tokens exchange information, the model needs a local computation that can amplify useful features, suppress irrelevant ones, and combine them nonlinearly.

That is what the position-wise feed-forward network does.

Operational payoff: once you know what the FFN does and how large it is, you can reason about where a model's parameters and compute actually live, which matters when sizing, training, or deploying a model.


Learning Objectives

By the end of this session, you should be able to:

  1. Explain why Transformer blocks need a feed-forward stage in addition to attention.
  2. Describe the standard FFN computation as expansion, activation, and projection back to model dimension.
  3. Evaluate what the FFN buys and what it costs, including nonlinear capacity, per-token independence, and a large share of model parameters and compute.

Core Concepts Explained

Concept 1: The FFN Adds Nonlinear Computation After Attention

Concrete example / mini-scenario: After attention, the representation of the token bank may already encode whether nearby words suggest a riverbank or a financial institution. But that contextualized representation still needs to be transformed into more useful internal features for the next layer.

Intuition: Attention decides what information arrives at each token. The feed-forward network decides how that token internally processes what it now knows.

Technical structure (how it works): In a Transformer block, attention gives each position a contextualized vector of size d_model. The FFN then applies the same small MLP to every position independently.

The usual pattern is:

FFN(x) = W2 * activation(W1 * x + b1) + b2

where x is one token's vector of size d_model, W1 and b1 expand it to a hidden width d_ff, the activation applies an elementwise nonlinearity, and W2 and b2 project the result back down to d_model.
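
A minimal NumPy sketch of this formula (the function name ffn and the sizes d_model = 4, d_ff = 16 are illustrative choices, not values from the lesson):

    import numpy as np

    def ffn(x, W1, b1, W2, b2):
        """FFN(x) = W2 @ relu(W1 @ x + b1) + b2 for one token vector x."""
        relu = lambda v: np.maximum(v, 0.0)
        return W2 @ relu(W1 @ x + b1) + b2

    rng = np.random.default_rng(0)
    d_model, d_ff = 4, 16                      # expand, then project back
    W1 = rng.normal(size=(d_ff, d_model)); b1 = np.zeros(d_ff)
    W2 = rng.normal(size=(d_model, d_ff)); b2 = np.zeros(d_model)

    x = rng.normal(size=d_model)               # one contextualized token vector
    print(ffn(x, W1, b1, W2, b2).shape)        # (4,): back to d_model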

Practical implications: without the FFN's nonlinearity, stacking more layers would add little expressive power, because compositions of linear operations collapse into a single linear operation.

Fundamental trade-off: You gain expressive power, but the FFN often contributes a large fraction of parameters and floating-point work.

Mental model: Attention is the conversation; the FFN is what each participant does privately after hearing the room.

Connection to other fields: Similar to message-passing systems where nodes first exchange information and then run local computation on the aggregated state.

When to use it: in every standard Transformer block. The FFN is a fixed part of the template, paired with attention in each layer, not an optional add-on.

Concept 2: The Standard Transformer FFN Is Position-Wise but Shared Across Positions

Concrete example / mini-scenario: If a sequence has 128 tokens, the same FFN weights are applied to token 1, token 2, token 3, and so on, independently.

Intuition: The FFN does not mix across positions. It applies the same learned transformation to each token vector separately.

That is why people often call it a position-wise feed-forward network, or simply a per-token MLP.

Technical structure (how it works):

If the input matrix is:

X in R^(n x d_model)

then the FFN is applied row by row:

Y_i = W2 * activation(W1 * X_i + b1) + b2

for each token position i.

This means the weights are identical at every position, no information flows between tokens inside the FFN, and the parameter count is independent of sequence length.

In many implementations, this is equivalent to a 1x1 convolution or batched MLP applied over the last dimension.
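
A small NumPy check of this equivalence (shapes are illustrative): applying the same weights row by row gives exactly the same result as one batched expression over the last dimension, confirming that nothing crosses positions.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d_model, d_ff = 5, 4, 16
    X = rng.normal(size=(n, d_model))          # n token positions
    W1 = rng.normal(size=(d_ff, d_model)); b1 = np.zeros(d_ff)
    W2 = rng.normal(size=(d_model, d_ff)); b2 = np.zeros(d_model)
    relu = lambda v: np.maximum(v, 0.0)

    # Row by row: identical weights, each position processed alone.
    per_row = np.stack([W2 @ relu(W1 @ X[i] + b1) + b2 for i in range(n)])

    # Batched: one matrix expression over the last dimension.
    batched = relu(X @ W1.T + b1) @ W2.T + b2

    print(np.allclose(per_row, batched))       # True: no mixing across positions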

Practical implications: because each position is processed independently, the FFN parallelizes trivially across tokens, and the same weights serve sequences of any length.

Fundamental trade-off: Sharing weights across positions is efficient and elegant, but it means the FFN itself does not explicitly model cross-token structure. That remains attention's job.

Mental model: Every token goes through the same local workshop, but the workshop does not let tokens talk to each other while inside.

Connection to other fields: Similar to applying the same filter or MLP independently over every item in a batch.

When to use it: whenever you trace where cross-token effects come from. Any mixing across positions must happen in attention, because the FFN cannot produce it.

Concept 3: Expansion Dimension and Activation Choice Matter a Lot

Concrete example / mini-scenario: A Transformer might use d_model = 768 and an FFN hidden width of 3072. That means each token is first expanded into a much larger space before being compressed back.

Intuition: The expansion step gives the model room to compute richer intermediate features than would fit in the original representation size.

Technical structure (how it works):

The classic Transformer used a ReLU activation with d_ff = 4 * d_model (for example, 512 expanded to 2048).

Modern variants often use smoother activations such as GELU, or gated designs such as SwiGLU, sometimes with different expansion ratios.

These changes matter because the FFN is not a small side component. In many Transformer families, it is one of the major consumers of parameters, compute, and memory.

So design choices around width and activation influence both quality and efficiency.
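
A back-of-the-envelope Python sketch using the example sizes above; the attention count assumes the standard four projection matrices (Q, K, V, output), a common layout rather than something the lesson specifies.

    # Rough per-layer parameter counts (biases included for the FFN).
    d_model, d_ff = 768, 3072

    ffn_params = d_model * d_ff + d_ff + d_ff * d_model + d_model  # W1, b1, W2, b2
    attn_params = 4 * d_model * d_model        # Q, K, V, output projections

    print(f"FFN:       {ffn_params:,}")        # 4,722,432 (~4.7M)
    print(f"Attention: {attn_params:,}")       # 2,359,296 (~2.4M)
    print(f"FFN share: {ffn_params / (ffn_params + attn_params):.0%}")  # 67%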

Practical implications: widening d_ff raises capacity, but parameters and compute grow roughly linearly with the extra width, and the activation choice affects both model quality and how efficiently the layer runs on hardware.

Fundamental trade-off: a wider intermediate space can capture more structure, but every extra hidden unit costs parameters, FLOPs, and memory at both training and inference time.

Mental model: The FFN is like a temporary expansion chamber: widen the representation, perform nonlinear transformation, then compress back into a useful dense form.

Connection to other fields: Similar to low-rank versus expanded intermediate representations in systems and numerical methods: wider intermediate spaces can capture more structure, but they are not free.

When to use it: whenever you size a model or tune it for deployment, treat d_ff and the activation function as first-class design levers alongside depth and head count.


Troubleshooting

Issue: "If the Transformer is famous for attention, why does it need an FFN at all?"

Why it happens / is confusing: The architecture is often described as if attention were the whole layer.

Clarification / Fix: Attention is the communication mechanism, not the full computation. The FFN adds nonlinear feature processing after contextual information has been gathered.

Issue: "Does the FFN mix information between tokens?"

Why it happens / is confusing: The block is often mentally treated as one blended operation.

Clarification / Fix: No. Token mixing happens in attention. The FFN is applied independently at each position with shared weights.

Issue: "Why does the FFN often have more parameters than I expected?"

Why it happens / is confusing: Attention gets most of the conceptual attention, so the FFN can feel secondary.

Clarification / Fix: In practice, the FFN is often very wide and can dominate a substantial part of parameter count and compute inside a layer.
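
A quick sanity check, ignoring biases and assuming the standard four attention projections: with the classic d_ff = 4 * d_model, the FFN holds 2 * d_model * d_ff = 8 * d_model^2 parameters per layer, while attention holds 4 * d_model^2, so the FFN carries roughly twice as many parameters as attention.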


Advanced Connections

Connection 1: FFN <-> Division of Labor in the Transformer Block

The parallel: The block separates concerns cleanly: attention handles communication across positions, while the FFN handles nonlinear computation within each position.

Real-world case: This separation is part of why the Transformer scales well as a reusable architectural template.

Connection 2: FFN <-> Model Efficiency Work

The parallel: Many efficiency improvements do not only target attention's quadratic cost; they also target FFN width, activation design, sparsity, and parameter sharing.

Real-world case: In production models, optimizing the FFN can matter as much as optimizing attention, especially at inference scale.



Key Insights

  1. The FFN gives the Transformer local nonlinear computation after attention has gathered context.
  2. It is position-wise and shared across tokens, so it transforms each token independently rather than mixing positions.
  3. Its width and activation are major design levers, affecting both model capacity and deployment cost.
