Vision Transformers (ViT) - Transformers for Images

LESSON · LLM Foundations · Lesson 012 · 30 min · Intermediate

Day 300: Vision Transformers (ViT) - Transformers for Images

The core idea: a Vision Transformer treats an image as a sequence of learned patch tokens, then applies Transformer-style sequence modeling to that patch sequence.


Today's "Aha!" Moment

The insight: ViT becomes easy to place once you stop thinking "images need a completely different deep learning family" and start thinking "an image is just another token sequence, where each token is an image patch."

That is the key move. Instead of convolving over pixels with strong local biases from the start, ViT:

  1. chops the image into patches
  2. embeds each patch as a token
  3. adds position information
  4. runs a Transformer over the resulting sequence

Why this matters: ViT is one of the clearest examples of the Transformer becoming a general architecture pattern, not just an NLP tool.

Concrete anchor: A 224x224 image with 16x16 patches becomes a grid of 14x14 = 196 patches. Those 196 patches become the sequence the Transformer processes.

The practical sentence to remember:
ViT turns images into token sequences so the Transformer can reason over visual parts the way it reasons over words.


Why This Matters

ViT matters because it tests a major hypothesis: can a mostly unmodified Transformer, with minimal image-specific design, handle vision as well as CNNs?

The answer turned out to be: yes, given enough data and compute.

But the route there is not magic. ViT works because it:

  1. converts the image into a sequence of patch tokens
  2. restores spatial information with positional encodings
  3. lets self-attention model both local and global visual interactions

This makes ViT important both conceptually and practically: conceptually, it shows the Transformer is a general architecture pattern rather than an NLP-only tool; practically, it provides a vision backbone that plugs naturally into text and multimodal Transformer systems.


Learning Objectives

By the end of this session, you should be able to:

  1. Explain how ViT converts images into Transformer-friendly inputs using patch embeddings and positional encodings.
  2. Describe what ViT gains and loses compared with CNN-style inductive biases.
  3. Evaluate when ViT is a good fit, especially in relation to data scale, transfer learning, and long-range visual interactions.

Core Concepts Explained

Concept 1: ViT Starts by Converting the Image into Patch Tokens

Concrete example / mini-scenario: An RGB image of shape 224 x 224 x 3 is split into non-overlapping 16 x 16 patches.

Intuition: A Transformer expects a sequence of vectors, not a 2D image grid. So ViT first tokenizes the image into patches.

Technical structure (how it works):

If the patch size is P x P, then an image of size H x W yields N = (H / P) * (W / P) patches.

Each patch is flattened into a vector of length P * P * C (for C channels) and then linearly projected into the model's embedding dimension.

So the pipeline begins like this:

image -> patches -> flatten -> linear projection -> patch embeddings

A learned [CLS] token is often prepended, just as in BERT-style models, to support image-level classification.
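The pipeline above can be sketched in plain NumPy. The sizes match the lesson's 224x224 / 16x16 example; the random projection matrix and [CLS] vector are illustrative stand-ins for weights that would be learned in a real model:

```python
import numpy as np

# Sizes from the lesson's example; D is an assumed embedding dimension.
H, W, C = 224, 224, 3   # image height, width, channels
P = 16                  # patch size
D = 768                 # model embedding dimension (assumed)

N = (H // P) * (W // P)          # number of patches: 14 * 14 = 196

rng = np.random.default_rng(0)
image = rng.standard_normal((H, W, C))

# 1. Chop the image into non-overlapping P x P patches.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
# 2. Flatten each patch into a vector of length P * P * C.
patches = patches.reshape(N, P * P * C)        # (196, 768)

# 3. Linear projection into the model dimension (stand-in for a learned layer).
W_proj = rng.standard_normal((P * P * C, D)) * 0.02
patch_embeddings = patches @ W_proj            # (196, D)

# 4. Prepend a [CLS] token for image-level classification.
cls_token = rng.standard_normal((1, D)) * 0.02
tokens = np.concatenate([cls_token, patch_embeddings], axis=0)

print(N, tokens.shape)  # 196 patches, sequence of 197 tokens including [CLS]
```

The Transformer encoder then sees `tokens` exactly as it would see a sequence of word embeddings.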

Practical implications: the patch size directly sets the sequence length, and therefore the memory and compute cost of attention. For the 224x224 / 16x16 example, the Transformer sees a sequence of 196 tokens.

Fundamental trade-off: Large patches are cheaper but lose fine detail; small patches keep more local structure but make the token sequence much longer.

Mental model: ViT turns the image into a sentence of visual tiles.

Connection to other fields: Similar to document chunking in NLP: the chunk size controls both information granularity and compute cost.

When to use it: patch tokenization is the standard entry point whenever you want a Transformer backbone to consume images.

Concept 2: Positional Encoding and Self-Attention Rebuild Spatial Context

Concrete example / mini-scenario: Once the image has been chopped into patch tokens, the model still needs to know where each patch came from in the original 2D image.

Intuition: Just as in text, attention alone does not encode order or position. In vision, that missing information is even more obvious because spatial layout is central to meaning.

Technical structure (how it works):

After patch embedding, ViT adds positional information to each patch token:

patch_token = patch_embedding + positional_encoding

Then a standard Transformer encoder stack (multi-head self-attention plus feed-forward layers, with residual connections and layer normalization) processes the sequence.

This gives each patch access to every other patch, nearby neighbors and distant regions alike, all through repeated attention layers.
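A minimal single-head sketch of this step, assuming random stand-in weights and a deliberately small model dimension; a real ViT uses multiple heads and stacks many such layers:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, D = 196, 64                       # assumed: 196 patch tokens, small dim

patch_embeddings = rng.standard_normal((N, D))
positional_encoding = rng.standard_normal((N, D)) * 0.02  # learned in practice

# patch_token = patch_embedding + positional_encoding
tokens = patch_embeddings + positional_encoding

# One single-head self-attention layer: every patch attends to every patch.
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.02 for _ in range(3))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
attn = softmax(Q @ K.T / np.sqrt(D))   # (N, N): all-to-all attention weights
out = attn @ V                          # each patch mixes information globally

print(attn.shape, out.shape)
```

The (N, N) attention matrix is the "global interaction" in code form: row i holds patch i's weights over every patch in the image, including itself.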

Practical implications: because attention is all-to-all, even the first layer can relate distant regions of the image, whereas a CNN only builds that reach gradually as its receptive field grows across layers.

Fundamental trade-off: Attention makes global interaction easier, but loses some of the strong local spatial bias CNNs get "for free."

Mental model: Every image patch can directly "look at" other patches and decide which visual regions matter for interpreting itself.

Connection to other fields: Similar to graph-style all-to-all message passing over visual regions instead of only local neighborhood filtering.

When to use it: this combination matters most when the task depends on relating distant parts of an image, such as scene-level understanding rather than purely local texture.

Concept 3: ViT Trades CNN Inductive Bias for Architectural Generality and Scaling Behavior

Concrete example / mini-scenario: A CNN starts with strong assumptions: locality (nearby pixels matter most), translation equivariance through shared convolutional filters, and hierarchy (features build from small to large).

ViT starts with weaker built-in assumptions and lets more of that structure be learned from data.

Intuition: This is the real trade-off. CNNs know more about images before training; ViTs know less, but are more general and often scale better with data and model size.

Technical structure (how it works):

Compared with CNNs, ViTs typically encode weaker spatial priors, so more of the visual structure must be learned from data, usually through large-scale pretraining.

This means ViTs often shine when plenty of data and compute are available, when pretrained weights can be transferred, and when long-range interactions across the image matter.
Historically, this is why early ViTs were strongest in large-data regimes and then became broadly practical through better pretraining and data-efficient variants.

Practical implications: with limited data and no pretraining, a CNN's built-in priors can still win; with large-scale pretraining, ViT backbones transfer broadly and integrate cleanly with text models.

Fundamental trade-off: stronger priors buy data efficiency; weaker priors buy generality and better scaling with data and model size.
Mental model: CNNs arrive with image-specific instincts; ViTs arrive with a more general reasoning mechanism and learn more of those instincts from data.

Connection to other fields: Similar to general-purpose systems versus specialized systems: specialization buys efficiency and priors, while generality buys reuse and scaling flexibility.

When to use it: prefer ViT-style backbones when large-scale pretraining (or access to pretrained weights) is available and you want a uniform architecture across vision, text, and multimodal tasks; lean on CNN-style priors when data and compute are tight.


Troubleshooting

Issue: "If ViT just turns images into tokens, doesn't it lose too much spatial detail?"

Why it happens / is confusing: Flattening patches sounds destructive.

Clarification / Fix: Some fine detail is lost at the patch boundary, yes. That is why patch size matters so much. Smaller patches keep more detail, but increase sequence length and cost.
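That cost is easy to quantify. A quick sketch for a 224x224 image, counting tokens and the entries in the resulting all-to-all attention matrix:

```python
# Sequence length and attention cost for different patch sizes on a 224x224 image.
H = W = 224
for P in (32, 16, 8):
    N = (H // P) * (W // P)
    print(f"P={P:2d}: {N:4d} tokens, attention matrix {N}x{N} = {N * N:,} entries")
```

Halving the patch size quadruples the token count and multiplies the attention cost by sixteen, which is why patch size is the first knob to consider.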

Issue: "Why not just use CNNs if images are naturally 2D?"

Why it happens / is confusing: CNNs have strong built-in visual priors and remain highly effective.

Clarification / Fix: CNNs are still strong. ViT matters because it offers a more uniform Transformer-based backbone that can scale well and integrate naturally with text and multimodal systems.

Issue: "Does ViT prove that inductive bias is unnecessary?"

Why it happens / is confusing: Strong ViT results can make it sound like data alone always wins.

Clarification / Fix: No. ViT shows that weaker priors can work extremely well at scale, not that priors are irrelevant. In low-data or efficiency-constrained settings, inductive bias still matters a lot.


Advanced Connections

Connection 1: ViT <-> Multimodal Models

The parallel: Once images are represented as token-like patch embeddings, visual inputs become much easier to connect with text-style Transformer systems.

Real-world case: Many multimodal models build on exactly this representational move: tokenize images, then process or align them with text tokens.

Connection 2: ViT <-> The Limits of Quadratic Attention

The parallel: Patch sequences are still sequences, so ViT inherits the same attention-scaling issues as NLP Transformers.

Real-world case: This is part of why efficient Transformer variants matter beyond language and why the next lesson on breaking the O(n^2) barrier follows naturally.




Key Insights

  1. ViT works by tokenizing the image into patches, then treating those patches as a sequence for a Transformer encoder.
  2. Self-attention gives ViT direct global interaction across visual regions, but positional encoding is still needed to preserve spatial structure.
  3. ViT trades strong built-in visual bias for a more general and scalable architecture, which can be very powerful when enough data and compute are available.

PREVIOUS: T5 - Text-to-Text Transfer Transformer | NEXT: Efficient Transformers - Breaking the O(n²) Barrier
