LESSON
Day 294: Feed-Forward Networks
The core idea: attention tells tokens what other tokens matter; the feed-forward network then transforms each token's internal features with a small nonlinear network applied independently at every position.
Today's "Aha!" Moment
The insight: A Transformer block is not "just attention." Attention handles communication across positions, but the block still needs a way to do richer computation inside each token representation after that communication happens.
Why this matters: Without the feed-forward part, the layer would mostly be:
- weighted mixing across tokens
- plus linear projections
That is useful, but not enough. The model also needs nonlinear feature transformation per token.
Concrete anchor: Imagine one token has already gathered context from the rest of the sentence. It now needs to convert that contextual information into a better internal representation: maybe amplify some features, suppress others, and combine them nonlinearly. That is the job of the feed-forward network.
The practical sentence to remember:
Attention mixes tokens; the feed-forward network mixes features inside each token.
Why This Matters
By now the Transformer stack has acquired:
- token-to-token interaction through attention
- multiple relational views through multi-head attention
- control over legality and stability through masking and scaling
- order information through positional encoding
What is still missing is a strong per-token computation step.
This matters because attention alone is not the full story of representation learning. After tokens exchange information, the model needs a local computation that can:
- reshape the representation
- introduce nonlinearity
- expand and compress feature space
- build more useful abstractions layer by layer
That is what the position-wise feed-forward network does.
Operational payoff:
- higher representational capacity
- better feature extraction after contextualization
- a clean division of labor inside the Transformer block
Learning Objectives
By the end of this session, you should be able to:
- Explain why Transformer blocks need a feed-forward stage in addition to attention.
- Describe the standard FFN computation as expansion, activation, and projection back to model dimension.
- Evaluate what the FFN buys and what it costs, including nonlinear capacity, per-token independence, and a large share of model parameters and compute.
Core Concepts Explained
Concept 1: The FFN Adds Nonlinear Computation After Attention
Concrete example / mini-scenario: After attention, the representation of the token "bank" may already encode whether nearby words suggest a riverbank or a financial institution. But that contextualized representation still needs to be transformed into more useful internal features for the next layer.
Intuition: Attention decides what information arrives at each token. The feed-forward network decides how that token internally processes what it now knows.
Technical structure (how it works): In a Transformer block, attention gives each position a contextualized vector of size d_model. The FFN then applies the same small MLP to every position independently.
The usual pattern is:
FFN(x) = W2 * activation(W1 * x + b1) + b2
where:
- W1 expands the representation to a larger hidden dimension
- the activation adds nonlinearity
- W2 projects back to d_model
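A minimal sketch of this computation in PyTorch, assuming illustrative sizes (d_model = 512, d_ff = 2048); the class and variable names are ours, not a fixed API:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: expand, apply a nonlinearity, project back to d_model."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # W1, b1: expansion
        self.w2 = nn.Linear(d_ff, d_model)   # W2, b2: projection back
        self.activation = nn.ReLU()          # nonlinearity between the two

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); the same weights apply at every position
        return self.w2(self.activation(self.w1(x)))

x = torch.randn(2, 10, 512)   # batch of 2 sequences, 10 tokens each
y = FeedForward()(x)
print(y.shape)                # torch.Size([2, 10, 512])
```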
Practical implications:
- the model gains nonlinear capacity beyond weighted averaging
- each token can refine its own feature representation after attending to context
- the block can build richer abstractions over layers
Fundamental trade-off: You gain expressive power, but the FFN often contributes a large fraction of parameters and floating-point work.
Mental model: Attention is the conversation; the FFN is what each participant does privately after hearing the room.
Connection to other fields: Similar to message-passing systems where nodes first exchange information and then run local computation on the aggregated state.
When to use it:
- Best fit: any standard Transformer block that needs more than pure token mixing.
- Misuse pattern: treating attention as if it alone already provides enough nonlinear reasoning.
Concept 2: The Standard Transformer FFN Is Position-Wise but Shared Across Positions
Concrete example / mini-scenario: If a sequence has 128 tokens, the same FFN weights are applied to token 1, token 2, token 3, and so on, independently.
Intuition: The FFN does not mix across positions. It applies the same learned transformation to each token vector separately.
That is why people often call it:
- position-wise feed-forward network
Technical structure (how it works):
If the input matrix is:
X in R^(n x d_model)
then the FFN is applied row by row:
Y_i = W2 * activation(W1 * X_i + b1) + b2
for each token position i.
This means:
- attention is where positions communicate
- FFN is where each position transforms itself
In many implementations, this is equivalent to a 1x1 convolution or batched MLP applied over the last dimension.
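A small sketch (PyTorch, tiny illustrative sizes) showing that applying the shared FFN row by row gives exactly the same result as applying it to the whole sequence at once, which is why it parallelizes so cleanly:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff, n = 8, 32, 5   # tiny illustrative sizes

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

X = torch.randn(n, d_model)   # n token positions, each a d_model vector

# Apply the whole matrix at once (how frameworks actually do it)
Y_batched = ffn(X)

# Apply the same weights to each position independently
Y_rowwise = torch.stack([ffn(X[i]) for i in range(n)])

print(torch.allclose(Y_batched, Y_rowwise))  # True: positions never interact
```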
Practical implications:
- the same transformation is reused across sequence positions
- computation parallelizes well across all tokens
- positional interaction stays cleanly separated from local feature transformation
Fundamental trade-off: Sharing weights across positions is efficient and elegant, but it means the FFN itself does not explicitly model cross-token structure. That remains attention's job.
Mental model: Every token goes through the same local workshop, but the workshop does not let tokens talk to each other while inside.
Connection to other fields: Similar to applying the same filter or MLP independently over every item in a batch.
When to use it:
- Best fit: Transformer blocks where token interaction and local nonlinear processing are intentionally separated.
- Misuse pattern: expecting the FFN itself to capture long-range token dependencies.
Concept 3: Expansion Dimension and Activation Choice Matter a Lot
Concrete example / mini-scenario: A Transformer might use d_model = 768 and an FFN hidden width of 3072. That means each token is first expanded into a much larger space before being compressed back.
Intuition: The expansion step gives the model room to compute richer intermediate features than would fit in the original representation size.
Technical structure (how it works):
The classic Transformer used:
- a large hidden expansion
- a ReLU activation
Modern variants often use:
- GELU
- gated forms such as GEGLU or SwiGLU
These changes matter because the FFN is not a small side component. In many Transformer families, it is one of the major consumers of:
- parameters
- memory bandwidth
- inference time
So design choices around width and activation influence both quality and efficiency.
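As a hedged sketch of what a gated variant like the SwiGLU mentioned above can look like (the layer names and default sizes here are ours, not a fixed specification), the FFN uses three weight matrices instead of two, with one branch gating the other:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated FFN variant: the hidden activation is gated by a second projection."""
    def __init__(self, d_model: int = 768, d_ff: int = 2048):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # gate branch
        self.w_up = nn.Linear(d_model, d_ff, bias=False)    # value branch
        self.w_down = nn.Linear(d_ff, d_model, bias=False)  # projection back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU (Swish) on the gate branch, elementwise product with the value branch
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 10, 768)
print(SwiGLUFeedForward()(x).shape)  # torch.Size([2, 10, 768])
```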
Practical implications:
- wider FFNs increase capacity but also cost
- activation choice changes optimization behavior and expressiveness
- FFN design becomes a major target for efficiency work, pruning, and distillation later
Fundamental trade-off:
- more width and richer activations buy capacity
- but they also increase latency, memory pressure, and deployment cost
Mental model: The FFN is like a temporary expansion chamber: widen the representation, perform nonlinear transformation, then compress back into a useful dense form.
Connection to other fields: Similar to low-rank versus expanded intermediate representations in systems and numerical methods: wider intermediate spaces can capture more structure, but they are not free.
When to use it:
- Best fit: standard Transformer designs where capacity per layer matters.
- Misuse pattern: focusing only on attention optimization while ignoring that FFNs are often a major compute budget owner.
Troubleshooting
Issue: "If the Transformer is famous for attention, why does it need an FFN at all?"
Why it happens / is confusing: The architecture is often described as if attention were the whole layer.
Clarification / Fix: Attention is the communication mechanism, not the full computation. The FFN adds nonlinear feature processing after contextual information has been gathered.
Issue: "Does the FFN mix information between tokens?"
Why it happens / is confusing: The block is often mentally treated as one blended operation.
Clarification / Fix: No. Token mixing happens in attention. The FFN is applied independently at each position with shared weights.
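A quick check of this claim, as a sketch under the same assumptions as the earlier snippets: perturbing one token's vector before the FFN changes only that token's output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ffn = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))

X = torch.randn(5, 8)       # 5 token positions
X_perturbed = X.clone()
X_perturbed[2] += 1.0       # change only token 2

diff = (ffn(X) - ffn(X_perturbed)).abs().sum(dim=-1)
print(diff)                 # nonzero only at position 2: no cross-token mixing
```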
Issue: "Why does the FFN often have more parameters than I expected?"
Why it happens / is confusing: Attention gets most of the conceptual attention, so the FFN can feel secondary.
Clarification / Fix: In practice, the FFN is often very wide and can dominate a substantial part of parameter count and compute inside a layer.
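A rough back-of-the-envelope count, assuming d_model = 768, an FFN hidden width of 3072, and ignoring biases and layer norms, makes the point concrete:

```python
d_model, d_ff = 768, 3072

# Attention projections: W_Q, W_K, W_V, W_O, each d_model x d_model
attention_params = 4 * d_model * d_model   # ~2.4M

# Standard FFN: W1 (d_model x d_ff) and W2 (d_ff x d_model)
ffn_params = 2 * d_model * d_ff            # ~4.7M

print(f"attention: {attention_params:,}")  # 2,359,296
print(f"ffn:       {ffn_params:,}")        # 4,718,592
print(f"ffn share: {ffn_params / (attention_params + ffn_params):.0%}")  # ~67%
```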
Advanced Connections
Connection 1: FFN <-> Division of Labor in the Transformer Block
The parallel: The block separates concerns cleanly:
- attention for cross-token communication
- FFN for per-token nonlinear transformation
Real-world case: This separation is part of why the Transformer scales well as a reusable architectural template.
Connection 2: FFN <-> Model Efficiency Work
The parallel: Many efficiency improvements do not only target attention's quadratic cost; they also target FFN width, activation design, sparsity, and parameter sharing.
Real-world case: In production models, optimizing the FFN can matter as much as optimizing attention, especially at inference scale.
Resources
Suggested Resources
- [PAPER] Attention Is All You Need - arXiv
  Focus: the original Transformer block layout, including the position-wise feed-forward network.
- [DOC] The Annotated Transformer - Harvard NLP
  Focus: a practical walkthrough showing exactly where the FFN sits in the block.
- [PAPER] GLU Variants Improve Transformer - arXiv
  Focus: useful context for why modern FFN variants often move beyond simple ReLU.
Key Insights
- The FFN gives the Transformer local nonlinear computation after attention has gathered context.
- It is position-wise and shared across tokens, so it transforms each token independently rather than mixing positions.
- Its width and activation are major design levers, affecting both model capacity and deployment cost.