LESSON
Day 294: Feed-Forward Networks
The core idea: attention tells tokens what other tokens matter; the feed-forward network then transforms each token's internal features with a small nonlinear network applied independently at every position.
Today's "Aha!" Moment
The insight: A Transformer block is not "just attention." Attention handles communication across positions, but the block still needs a way to do richer computation inside each token representation after that communication happens.
Why this matters: Without the feed-forward part, the layer would mostly be:
- weighted mixing across tokens
- plus linear projections
That is useful, but not enough. The model also needs nonlinear feature transformation per token.
Concrete anchor: Imagine one token has already gathered context from the rest of the sentence. It now needs to convert that contextual information into a better internal representation: maybe amplify some features, suppress others, and combine them nonlinearly. That is the job of the feed-forward network.
The practical sentence to remember:
Attention mixes tokens; the feed-forward network mixes features inside each token.
Why This Matters
By now the Transformer stack has acquired:
- token-to-token interaction through attention
- multiple relational views through multi-head attention
- control over legality and stability through masking and scaling
- order information through positional encoding
What is still missing is a strong per-token computation step.
This matters because attention alone is not the full story of representation learning. After tokens exchange information, the model needs a local computation that can:
- reshape the representation
- introduce nonlinearity
- expand and compress feature space
- build more useful abstractions layer by layer
That is what the position-wise feed-forward network does.
Operational payoff:
- higher representational capacity
- better feature extraction after contextualization
- a clean division of labor inside the Transformer block
Learning Objectives
By the end of this session, you should be able to:
- Explain why Transformer blocks need a feed-forward stage in addition to attention.
- Describe the standard FFN computation as expansion, activation, and projection back to model dimension.
- Evaluate what the FFN buys and what it costs, including nonlinear capacity, per-token independence, and a large share of model parameters and compute.
Core Concepts Explained
Concept 1: The FFN Adds Nonlinear Computation After Attention
Concrete example / mini-scenario: After attention, the representation of the token "bank" may already encode whether nearby words suggest a riverbank or a financial institution. But that contextualized representation still needs to be transformed into more useful internal features for the next layer.
Intuition: Attention decides what information arrives at each token. The feed-forward network decides how that token internally processes what it now knows.
Technical structure (how it works): In a Transformer block, attention gives each position a contextualized vector of size d_model. The FFN then applies the same small MLP to every position independently.
The usual pattern is:
FFN(x) = W2 * activation(W1 * x + b1) + b2
where:
- W1 expands the representation to a larger hidden dimension
- the activation adds nonlinearity
- W2 projects back to d_model
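A minimal sketch of this computation in PyTorch, assuming illustrative sizes (d_model = 512, d_ff = 2048); the class and variable names are ours, not a fixed API:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: expand, apply a nonlinearity, project back to d_model."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # W1, b1: expansion
        self.w2 = nn.Linear(d_ff, d_model)   # W2, b2: projection back
        self.activation = nn.ReLU()          # nonlinearity between the two

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); the same weights apply at every position
        return self.w2(self.activation(self.w1(x)))

x = torch.randn(2, 10, 512)   # batch of 2 sequences, 10 tokens each
y = FeedForward()(x)
print(y.shape)                # torch.Size([2, 10, 512])
```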
Practical implications:
- the model gains nonlinear capacity beyond weighted averaging
- each token can refine its own feature representation after attending to context
- the block can build richer abstractions over layers
Fundamental trade-off: You gain expressive power, but the FFN often contributes a large fraction of parameters and floating-point work.
Mental model: Attention is the conversation; the FFN is what each participant does privately after hearing the room.
Connection to other fields: Similar to message-passing systems where nodes first exchange information and then run local computation on the aggregated state.
When to use it:
- Best fit: any standard Transformer block that needs more than pure token mixing.
- Misuse pattern: treating attention as if it alone already provides enough nonlinear reasoning.
Concept 2: The Standard Transformer FFN Is Position-Wise but Shared Across Positions
Concrete example / mini-scenario: If a sequence has 128 tokens, the same FFN weights are applied to token 1, token 2, token 3, and so on, independently.
Intuition: The FFN does not mix across positions. It applies the same learned transformation to each token vector separately.
That is why people often call it:
- position-wise feed-forward network
Technical structure (how it works):
If the input matrix is:
X in R^(n x d_model)
then the FFN is applied row by row:
Y_i = W2 * activation(W1 * X_i + b1) + b2
for each token position i.
This means:
- attention is where positions communicate
- FFN is where each position transforms itself
In many implementations, this is equivalent to a 1x1 convolution or batched MLP applied over the last dimension.
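A small sketch (PyTorch, tiny illustrative sizes) showing that applying the shared FFN row by row gives exactly the same result as applying it to the whole sequence at once, which is why it parallelizes so cleanly:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff, n = 8, 32, 5   # tiny illustrative sizes

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

X = torch.randn(n, d_model)   # n token positions, each a d_model vector

# Apply the whole matrix at once (how frameworks actually do it)
Y_batched = ffn(X)

# Apply the same weights to each position independently
Y_rowwise = torch.stack([ffn(X[i]) for i in range(n)])

print(torch.allclose(Y_batched, Y_rowwise))  # True: positions never interact
```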
Practical implications:
- the same transformation is reused across sequence positions
- computation parallelizes well across all tokens
- positional interaction stays cleanly separated from local feature transformation
Fundamental trade-off: Sharing weights across positions is efficient and elegant, but it means the FFN itself does not explicitly model cross-token structure. That remains attention's job.
Mental model: Every token goes through the same local workshop, but the workshop does not let tokens talk to each other while inside.
Connection to other fields: Similar to applying the same filter or MLP independently over every item in a batch.
When to use it:
- Best fit: Transformer blocks where token interaction and local nonlinear processing are intentionally separated.
- Misuse pattern: expecting the FFN itself to capture long-range token dependencies.
Concept 3: Expansion Dimension and Activation Choice Matter a Lot
Concrete example / mini-scenario: A Transformer might use d_model = 768 and an FFN hidden width of 3072. That means each token is first expanded into a much larger space before being compressed back.
Intuition: The expansion step gives the model room to compute richer intermediate features than would fit in the original representation size.
Technical structure (how it works):
The classic Transformer used:
- a large hidden expansion
- a ReLU activation
Modern variants often use:
- GELU
- gated forms such as GEGLU or SwiGLU
These changes matter because the FFN is not a small side component. In many Transformer families, it is one of the major consumers of:
- parameters
- memory bandwidth
- inference time
So design choices around width and activation influence both quality and efficiency.
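As a hedged sketch of what a gated variant like the SwiGLU mentioned above can look like (the layer names and default sizes here are ours, not a fixed specification), the FFN uses three weight matrices instead of two, with one branch gating the other:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated FFN variant: the hidden activation is gated by a second projection."""
    def __init__(self, d_model: int = 768, d_ff: int = 2048):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # gate branch
        self.w_up = nn.Linear(d_model, d_ff, bias=False)    # value branch
        self.w_down = nn.Linear(d_ff, d_model, bias=False)  # projection back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU (Swish) on the gate branch, elementwise product with the value branch
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 10, 768)
print(SwiGLUFeedForward()(x).shape)  # torch.Size([2, 10, 768])
```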
Practical implications:
- wider FFNs increase capacity but also cost
- activation choice changes optimization behavior and expressiveness
- FFN design becomes a major target for efficiency work, pruning, and distillation later
Fundamental trade-off:
- more width and richer activations buy capacity
- but they also increase latency, memory pressure, and deployment cost
Mental model: The FFN is like a temporary expansion chamber: widen the representation, perform nonlinear transformation, then compress back into a useful dense form.
Connection to other fields: Similar to low-rank versus expanded intermediate representations in systems and numerical methods: wider intermediate spaces can capture more structure, but they are not free.
When to use it:
- Best fit: standard Transformer designs where capacity per layer matters.
- Misuse pattern: focusing only on attention optimization while ignoring that FFNs are often a major compute budget owner.
Troubleshooting
Issue: "If the Transformer is famous for attention, why does it need an FFN at all?"
Why it happens / is confusing: The architecture is often described as if attention were the whole layer.
Clarification / Fix: Attention is the communication mechanism, not the full computation. The FFN adds nonlinear feature processing after contextual information has been gathered.
Issue: "Does the FFN mix information between tokens?"
Why it happens / is confusing: The block is often mentally treated as one blended operation.
Clarification / Fix: No. Token mixing happens in attention. The FFN is applied independently at each position with shared weights.
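A quick check of this claim, as a sketch under the same assumptions as the earlier snippets: perturbing one token's vector before the FFN changes only that token's output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ffn = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))

X = torch.randn(5, 8)       # 5 token positions
X_perturbed = X.clone()
X_perturbed[2] += 1.0       # change only token 2

diff = (ffn(X) - ffn(X_perturbed)).abs().sum(dim=-1)
print(diff)                 # nonzero only at position 2: no cross-token mixing
```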
Issue: "Why does the FFN often have more parameters than I expected?"
Why it happens / is confusing: Attention gets most of the conceptual attention, so the FFN can feel secondary.
Clarification / Fix: In practice, the FFN is often very wide and can dominate a substantial part of parameter count and compute inside a layer.
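A rough back-of-the-envelope count, assuming d_model = 768, an FFN hidden width of 3072, and ignoring biases and layer norms, makes the point concrete:

```python
d_model, d_ff = 768, 3072

# Attention projections: W_Q, W_K, W_V, W_O, each d_model x d_model
attention_params = 4 * d_model * d_model   # ~2.4M

# Standard FFN: W1 (d_model x d_ff) and W2 (d_ff x d_model)
ffn_params = 2 * d_model * d_ff            # ~4.7M

print(f"attention: {attention_params:,}")  # 2,359,296
print(f"ffn:       {ffn_params:,}")        # 4,718,592
print(f"ffn share: {ffn_params / (attention_params + ffn_params):.0%}")  # ~67%
```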
Advanced Connections
Connection 1: FFN <-> Division of Labor in the Transformer Block
The parallel: The block separates concerns cleanly:
- attention for cross-token communication
- FFN for per-token nonlinear transformation
Real-world case: This separation is part of why the Transformer scales well as a reusable architectural template.
Connection 2: FFN <-> Model Efficiency Work
The parallel: Many efficiency improvements do not only target attention's quadratic cost; they also target FFN width, activation design, sparsity, and parameter sharing.
Real-world case: In production models, optimizing the FFN can matter as much as optimizing attention, especially at inference scale.
Resources
Suggested Resources
- [PAPER] Attention Is All You Need - arXiv
  Focus: the original Transformer block layout, including the position-wise feed-forward network.
- [DOC] The Annotated Transformer - Harvard NLP
  Focus: a practical walkthrough showing exactly where the FFN sits in the block.
- [PAPER] GLU Variants Improve Transformer - arXiv
  Focus: useful context for why modern FFN variants often move beyond simple ReLU.
Key Insights
- The FFN gives the Transformer local nonlinear computation after attention has gathered context.
- It is position-wise and shared across tokens, so it transforms each token independently rather than mixing positions.
- Its width and activation are major design levers, affecting both model capacity and deployment cost.