
Day 290: Self-Attention Deep Dive

The core idea: in self-attention, each token builds its representation by looking at other tokens in the same sequence and deciding which ones matter most.


Today's "Aha!" Moment

The insight: Generic attention said "look back at relevant information." Self-attention makes that idea internal to the sequence itself: every token both asks for relevant information and supplies it, so the sequence attends to itself.

Why this matters: This is the step that turns attention from a sequence-to-sequence helper into the main computation of the Transformer. Once this clicks, the rest of the month stops feeling like a pile of components and starts feeling like one architecture.

Concrete anchor: In the sentence "The animal didn't cross the street because it was too tired," the token "it" should connect strongly to "animal," not "street." Self-attention gives the model a direct way to build that relation instead of hoping a recurrent hidden state preserves it.

The practical sentence to remember:
Self-attention is contextualization by relevance inside the same sequence.


Why This Matters

The jump from generic attention to self-attention is bigger than it first looks.

With encoder-decoder attention, a decoder state attends over encoded input states. With self-attention, the sequence itself becomes both the thing that attends (the source of the queries) and the thing being attended to (the source of the keys and values).

That creates three big consequences:

  1. every token can directly interact with every other token
  2. those interactions can be computed in parallel across the sequence
  3. the architecture no longer gets order "for free" from recurrence, so position must be added explicitly later

Operational payoff: the whole layer reduces to a few large matrix multiplications, which can be computed in parallel across the sequence and map well onto modern accelerator hardware.


Learning Objectives

By the end of this session, you should be able to:

  1. Explain what makes self-attention different from generic encoder-decoder attention.
  2. Describe the self-attention computation from input embeddings through Q, K, V, scores, weights, and contextualized outputs.
  3. Evaluate the strengths and limits of self-attention, especially its parallelism, expressiveness, quadratic cost, and lack of built-in positional bias.

Core Concepts Explained

Concept 1: Self-Attention Lets Each Token Rebuild Itself Using the Rest of the Sequence

Concrete example / mini-scenario: In the phrase "bank of the river" versus "bank approved the loan," the token bank should end up with different representations depending on nearby context.

Intuition: A token should not carry the same meaning everywhere. Its meaning depends on what surrounds it. Self-attention lets each token update itself by consulting the other tokens that help disambiguate it.

Technical structure (how it works): Start from an input sequence of token representations:

X = [x1, x2, x3, ..., xn]

For each token, the model creates three learned projections from its current representation: a query (q_i = x_i Wq) expressing what the token is looking for, a key (k_i = x_i Wk) expressing what the token offers for matching, and a value (v_i = x_i Wv) carrying the information the token passes along if selected.

The important difference from the previous lesson is that all of these come from the same source sequence.
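
To make that concrete, here is a minimal NumPy sketch of the three projections; the dimensions and the random weight matrices are illustrative stand-ins for what a trained model would learn:

import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4              # toy sizes, purely illustrative
X = rng.normal(size=(n, d_model))      # one row per token

Wq = rng.normal(size=(d_model, d_k))   # learned during training; random here
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_k))

Q = X @ Wq   # queries: what each token is looking for
K = X @ Wk   # keys: what each token offers for matching
V = X @ Wv   # values: what each token contributes if selected
print(Q.shape, K.shape, V.shape)       # all derived from the same X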

Practical implications: the same surface word (such as "bank") ends up with a different vector in different sentences, because each occurrence mixes in information from a different set of neighbors.

Fundamental trade-off: You gain direct contextualization, but every token now participates in many pairwise comparisons.

Mental model: A meeting where every participant briefly checks who in the room is most relevant to their current question, then updates their view based on those voices.

Connection to other fields: This resembles message passing in graphs: each node updates itself by aggregating information from other nodes it finds relevant.

When to use it: whenever a token's interpretation depends on other tokens in the same sequence; disambiguation, coreference, and agreement are typical cases.

Concept 2: Mechanically, Self-Attention Is a Full Relevance Matrix Over the Sequence

Concrete example / mini-scenario: For a sentence with n tokens, token i should be able to score every token j and decide how much to use its information.

Intuition: Self-attention builds a matrix of "who should listen to whom."

Technical structure (how it works):

  1. Project the input matrix X into queries, keys, and values:
Q = XWq
K = XWk
V = XWv
  2. Compute pairwise similarity scores:
scores = QK^T
  3. Scale and normalize those scores to get attention weights (dividing by sqrt(d_k) keeps the dot products from growing with the key dimension, which would otherwise push the softmax toward near-one-hot rows):
weights = softmax(scores / sqrt(d_k))
  4. Use those weights to mix values:
output = weights V

This means each output token is a weighted combination of value vectors from the whole sequence.
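
Putting the four steps together, the following is a small, self-contained NumPy sketch of single-head self-attention; the helper names, toy dimensions, and random weights are illustrative assumptions, not a reference implementation:

import numpy as np

def softmax(z, axis=-1):
    # subtract the row max for numerical stability before exponentiating
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # all three projections come from the same input sequence X
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T                                    # (n, n) pairwise relevance
    weights = softmax(scores / np.sqrt(d_k), axis=-1)   # each row sums to 1
    return weights @ V                                  # each output mixes all value rows

# Toy example: 5 tokens, model dim 8, head dim 4 (sizes are arbitrary)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)   # (5, 4): one contextualized vector per input token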

Practical implications: each row of the weight matrix sums to 1, the whole computation is a handful of dense matrix multiplications, and the score matrix holds one entry for every ordered pair of tokens.

Fundamental trade-off: The full relevance matrix is expressive and hardware-friendly, but it grows quadratically with sequence length and becomes expensive for long contexts.
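
To put rough, purely illustrative numbers on that growth: the score matrix has n x n entries per head, so

n = 1,024 tokens  ->  about 1.0 million scores
n = 8,192 tokens  ->  about 67 million scores

Doubling the context length quadruples both the memory for the score matrix and the work to fill it.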

Mental model: A similarity table where each row is one token asking, "Which other tokens should influence me?"

Connection to other fields: This is close to dense all-to-all communication. Powerful, but the communication surface grows fast as the system gets larger.

When to use it: for sequence lengths where the n x n score matrix fits the compute and memory budget; once contexts get very long, the quadratic term starts to dominate.

Concept 3: Self-Attention Is Powerful Because It Shortens Information Paths

Concrete example / mini-scenario: In recurrent models, if token 1 must influence token 100, that signal has to travel through many intermediate steps. In self-attention, token 100 can attend directly to token 1 in one layer.

Intuition: Long-distance interaction becomes easier when relevant positions can connect directly.

Technical structure (how it works): In a recurrent network, dependencies are mediated through sequential hidden-state updates. In self-attention, dependencies are mediated through attention weights, which can create direct token-to-token influence regardless of distance.
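
A rough way to quantify the difference (numbers purely for illustration): for a sequence of 100 tokens, a signal from token 1 reaching token 100 passes through 99 sequential hidden-state updates in a recurrent network, but only a single attention step inside one self-attention layer.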

That gives Transformers two major structural advantages:

  1. the path between any two tokens is a single attention step, regardless of how far apart they are
  2. the computation for all positions can run in parallel rather than unrolling one step at a time

But it also creates a structural gap: the layer itself has no built-in notion of token order.

That is why the next lessons need positional encodings, to put order information back in, and multi-head attention, to let the model track several kinds of relevance at once.

Practical implications: long-range dependencies no longer rely on information surviving a long chain of recurrent updates, and training can process every position of the sequence at once.

Fundamental trade-off: you get short information paths and full parallelism, but you pay the quadratic all-pairs cost and must supply positional information explicitly.

Mental model: Self-attention replaces long relay chains with direct, weighted communication links.

Connection to other fields: Similar to network topology design: shorter communication paths can improve expressiveness and throughput, but dense connectivity is more expensive to maintain.

When to use it: when long-range dependencies matter and the compute budget can absorb the all-pairs cost of attention.


Troubleshooting

Issue: "If every token attends to every other token, does that mean all tokens become the same?"

Why it happens / is confusing: The weighted-sum idea can sound like indiscriminate averaging.

Clarification / Fix: The weights are different for each token and learned from its query. Each row in the attention matrix can look very different, so contextualization is selective, not uniform.

Issue: "Why do we need separate Q, K, and V if they all come from the same input?"

Why it happens / is confusing: It seems redundant at first.

Clarification / Fix: They play different roles. The model learns one projection for asking, one for matching, and one for passing information forward. Using the same raw vector for all three would be much less flexible.
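
One way to see the extra flexibility: if the raw vectors served as both queries and keys, the score matrix would be XX^T, which is symmetric by construction, so token i would rate token j exactly as relevant as token j rates token i. Separate learned projections remove that constraint. A small NumPy check, with random matrices standing in for learned weights:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))            # 4 tokens, illustrative dimensions
Wq = rng.normal(size=(6, 6))
Wk = rng.normal(size=(6, 6))

raw_scores = X @ X.T                   # query = key = raw vector: symmetric by construction
proj_scores = (X @ Wq) @ (X @ Wk).T    # separate projections: i->j can differ from j->i
print(np.allclose(raw_scores, raw_scores.T))    # True
print(np.allclose(proj_scores, proj_scores.T))  # False (with overwhelming probability)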

Issue: "Does self-attention understand token order automatically?"

Why it happens / is confusing: The input is a sequence, so it feels like order should already be obvious.

Clarification / Fix: No. Self-attention by itself only sees a set of vectors and their learned pairwise relevance. Position must be added explicitly, which is why positional encoding is necessary later.
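
A quick way to convince yourself: without positional information, self-attention is permutation-equivariant, meaning that shuffling the input tokens simply shuffles the outputs the same way. A minimal NumPy sketch, with all weights random and purely illustrative:

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                              # 6 tokens, illustrative sizes
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

perm = rng.permutation(6)                                # a shuffled token order
out_original = self_attention(X, Wq, Wk, Wv)
out_shuffled = self_attention(X[perm], Wq, Wk, Wv)

# Same vectors, just reordered: the layer itself never used token positions.
print(np.allclose(out_original[perm], out_shuffled))     # True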


Advanced Connections

Connection 1: Self-Attention <-> Graph Message Passing

The parallel: Each token acts like a node that updates itself by aggregating messages from other relevant nodes.

Real-world case: Many graph neural network ideas feel familiar once you see self-attention as learned weighted communication over a complete graph.

Connection 2: Self-Attention <-> Systems Parallelism

The parallel: Recurrent models serialize token interactions over time, while self-attention exposes a computation that can be parallelized across the whole sequence.

Real-world case: This is one of the reasons Transformer-style models fit modern accelerator hardware so well despite their high raw compute cost.


Key Insights

  1. Self-attention makes each token context-aware by letting it read from other tokens in the same sequence.
  2. Mechanically, self-attention is a learned relevance matrix built from Q, K, and V projections of the same input.
  3. Its power comes from direct token-to-token interaction and parallelism, but it needs positional information and pays quadratic cost as context grows.
