LESSON
Day 300: Vision Transformers (ViT) - Transformers for Images
The core idea: a Vision Transformer treats an image as a sequence of learned patch tokens, then applies Transformer-style sequence modeling to that patch sequence.
Today's "Aha!" Moment
The insight: ViT becomes easy to place once you stop thinking "images need a completely different deep learning family" and start thinking:
- what if we turn the image into a sequence first?
That is the key move. Instead of convolving over pixels with strong local biases from the start, ViT:
- chops the image into patches
- embeds each patch as a token
- adds position information
- runs a Transformer over the resulting sequence
Why this matters: ViT is one of the clearest examples of the Transformer becoming a general architecture pattern, not just an NLP tool.
Concrete anchor: A 224x224 image with 16x16 patches becomes a grid of 14x14 = 196 patches. Those 196 patches become the sequence the Transformer processes.
The practical sentence to remember:
ViT turns images into token sequences so the Transformer can reason over visual parts the way it reasons over words.
Why This Matters
ViT matters because it tests a major hypothesis:
- can a relatively generic Transformer architecture compete with CNNs and their hand-crafted visual inductive biases?
The answer turned out to be:
- yes, especially at sufficient scale and data volume
But the route there is not magic. ViT works because it:
- redefines the image as a patch sequence
- preserves some spatial information through positional encoding
- lets attention model long-range interactions directly across the visual field
This makes ViT important both conceptually and practically:
- conceptually, it shows Transformer ideas transfer beyond text
- practically, it changes how we think about visual backbones, pretraining, scaling, and multimodal systems
Learning Objectives
By the end of this session, you should be able to:
- Explain how ViT converts images into Transformer-friendly inputs using patch embeddings and positional encodings.
- Describe what ViT gains and loses compared with CNN-style inductive biases.
- Evaluate when ViT is a good fit, especially in relation to data scale, transfer learning, and long-range visual interactions.
Core Concepts Explained
Concept 1: ViT Starts by Converting the Image into Patch Tokens
Concrete example / mini-scenario: An RGB image of shape 224 x 224 x 3 is split into non-overlapping 16 x 16 patches.
Intuition: A Transformer expects a sequence of vectors, not a 2D image grid. So ViT first tokenizes the image into patches.
Technical structure (how it works):
If the patch size is P x P, then an image of size H x W yields:
N = (H / P) * (W / P)
patches.
Each patch is:
- flattened
- linearly projected into an embedding vector
So the pipeline begins like this:
image -> patches -> flatten -> linear projection -> patch embeddings
A learned [CLS] token is often prepended, just as in BERT-style models, to support image-level classification.
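A minimal PyTorch-style sketch of that pipeline, assuming a 224x224 RGB input, 16x16 patches, and a 768-dimensional embedding (the class name PatchEmbed and the hyperparameters are illustrative, not a reference implementation):

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    # Sketch: split a 224x224 RGB image into 16x16 patches, flatten each patch,
    # and project it to an embedding vector; prepend a learned [CLS] token.
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # 14 * 14 = 196
        # A strided convolution is equivalent to "flatten + linear projection" per patch.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, img):                        # img: (B, 3, 224, 224)
        x = self.proj(img)                         # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)           # (B, 196, 768) patch embeddings
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1)          # (B, 197, 768) with [CLS] prepended

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                                # torch.Size([2, 197, 768])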
Practical implications:
- the image becomes a token sequence
- patch size determines sequence length and local granularity
- smaller patches preserve more detail but increase attention cost
Fundamental trade-off: Large patches are cheaper but lose fine detail; small patches keep more local structure but make the token sequence much longer.
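To make that trade-off concrete, a quick back-of-the-envelope calculation for a 224x224 input (the patch sizes are just illustrative):

# How patch size affects sequence length and quadratic attention cost for a 224x224 image.
for patch_size in (32, 16, 8):
    num_tokens = (224 // patch_size) ** 2   # patch tokens per image
    attn_pairs = num_tokens ** 2            # pairwise interactions per attention layer
    print(patch_size, num_tokens, attn_pairs)
# 32 -> 49 tokens -> 2,401 pairs
# 16 -> 196 tokens -> 38,416 pairs
#  8 -> 784 tokens -> 614,656 pairs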
Mental model: ViT turns the image into a sentence of visual tiles.
Connection to other fields: Similar to document chunking in NLP: the chunk size controls both information granularity and compute cost.
When to use it:
- Best fit: visual pipelines where representing the image as a sequence is acceptable and global interactions matter.
- Misuse pattern: choosing patch size without thinking about the cost-detail trade-off.
Concept 2: Positional Encoding and Self-Attention Rebuild Spatial Context
Concrete example / mini-scenario: Once the image has been chopped into patch tokens, the model still needs to know where each patch came from in the original 2D image.
Intuition: Just as in text, attention alone does not encode order or position. In vision, that missing information is even more obvious because spatial layout is central to meaning.
Technical structure (how it works):
After patch embedding, ViT adds positional information to each patch token:
patch_token = patch_embedding + positional_encoding
Then a standard Transformer encoder stack processes the sequence:
- multi-head self-attention
- feed-forward network
- residual connections
- layer normalization
This gives each patch access to:
- nearby patches
- distant patches
- whole-image context
all through repeated attention layers.
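A minimal sketch of this stage using PyTorch's built-in encoder layers, assuming the 197-token sequence from the earlier patch-embedding sketch (depth, heads, and widths are roughly ViT-Base-like but illustrative, not prescriptive):

import torch
import torch.nn as nn

embed_dim, num_tokens = 768, 197                 # 196 patches + 1 [CLS] token

# Learned positional encodings, added to every token before attention.
pos_embed = nn.Parameter(torch.zeros(1, num_tokens, embed_dim))

# Standard Transformer encoder: multi-head self-attention + feed-forward,
# with residual connections and layer normalization in every layer.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, dim_feedforward=3072,
                               batch_first=True, norm_first=True),  # ViT uses pre-norm blocks
    num_layers=12,
)

tokens = torch.randn(2, num_tokens, embed_dim)   # stand-in for patch embeddings
x = tokens + pos_embed                           # patch_token = patch_embedding + positional_encoding
x = encoder(x)                                   # every patch attends to every other patch
print(x.shape)                                   # torch.Size([2, 197, 768])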
Practical implications:
- long-range visual relationships are easy to model directly
- the architecture is essentially the same as text-style Transformers, keeping the stack uniform
- global context does not require many stacked local convolutions to propagate
Fundamental trade-off: Attention makes global interaction easier, but loses some of the strong local spatial bias CNNs get "for free."
Mental model: Every image patch can directly "look at" other patches and decide which visual regions matter for interpreting itself.
Connection to other fields: Similar to graph-style all-to-all message passing over visual regions instead of only local neighborhood filtering.
When to use it:
- Best fit: tasks where whole-image or long-range interactions matter and enough scale is available.
- Misuse pattern: assuming positional encoding alone fully replaces all useful visual inductive bias.
Concept 3: ViT Trades CNN Inductive Bias for Architectural Generality and Scaling Behavior
Concrete example / mini-scenario: A CNN starts with strong assumptions:
- locality matters
- translation structure matters
- nearby pixels interact first
ViT starts with weaker built-in assumptions and lets more of that structure be learned from data.
Intuition: This is the real trade-off. CNNs know more about images before training; ViTs know less, but are more general and often scale better with data and model size.
Technical structure (how it works):
Compared with CNNs, ViTs typically:
- have weaker locality bias at initialization
- rely more heavily on data or pretraining
- use self-attention to model broad interactions across patches
This means ViTs often shine when:
- pretraining data is large
- transfer learning is available
- model scale is high enough to compensate for weaker inductive priors
Historically, this is why early ViTs were strongest in large-data regimes and then became broadly practical through better pretraining and data-efficient variants.
Practical implications:
- ViTs can be very strong backbones
- they often integrate naturally into multimodal systems
- they may need more data or stronger pretraining than CNNs in low-data settings
Fundamental trade-off:
- more architectural uniformity and strong scaling behavior
- less built-in visual prior, which can hurt in smaller-data or efficiency-sensitive regimes
Mental model: CNNs arrive with image-specific instincts; ViTs arrive with a more general reasoning mechanism and learn more of those instincts from data.
Connection to other fields: Similar to general-purpose systems versus specialized systems: specialization buys efficiency and priors, while generality buys reuse and scaling flexibility.
When to use it:
- Best fit: large-scale vision pretraining, transfer learning, and multimodal architectures.
- Misuse pattern: assuming ViT always dominates CNNs regardless of data size, compute budget, or deployment constraints.
Troubleshooting
Issue: "If ViT just turns images into tokens, doesn't it lose too much spatial detail?"
Why it happens / is confusing: Flattening patches sounds destructive.
Clarification / Fix: Some fine detail is compressed when each patch is flattened and projected to a single vector, yes. That is why patch size matters so much. Smaller patches keep more detail, but increase sequence length and cost.
Issue: "Why not just use CNNs if images are naturally 2D?"
Why it happens / is confusing: CNNs have strong built-in visual priors and remain highly effective.
Clarification / Fix: CNNs are still strong. ViT matters because it offers a more uniform Transformer-based backbone that can scale well and integrate naturally with text and multimodal systems.
Issue: "Does ViT prove that inductive bias is unnecessary?"
Why it happens / is confusing: Strong ViT results can make it sound like data alone always wins.
Clarification / Fix: No. ViT shows that weaker priors can work extremely well at scale, not that priors are irrelevant. In low-data or efficiency-constrained settings, inductive bias still matters a lot.
Advanced Connections
Connection 1: ViT <-> Multimodal Models
The parallel: Once images are represented as token-like patch embeddings, visual inputs become much easier to connect with text-style Transformer systems.
Real-world case: Many multimodal models build on exactly this representational move: tokenize images, then process or align them with text tokens.
Connection 2: ViT <-> The Limits of Quadratic Attention
The parallel: Patch sequences are still sequences, so ViT inherits the same attention-scaling issues as NLP Transformers.
Real-world case: This is part of why efficient Transformer variants matter beyond language and why the next lesson on breaking the O(n^2) barrier follows naturally.
Resources
Suggested Resources
- [PAPER] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale - arXiv
Focus: the original ViT paper and the patch-tokenization idea.
- [DOC] Hugging Face ViT model docs - Documentation
Focus: practical mapping from paper concepts to implementation.
- [PAPER] How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers - arXiv
Focus: useful context on why ViT behavior depends strongly on training recipe and scale.
Key Insights
- ViT works by tokenizing the image into patches, then treating those patches as a sequence for a Transformer encoder.
- Self-attention gives ViT direct global interaction across visual regions, but positional encoding is still needed to preserve spatial structure.
- ViT trades strong built-in visual bias for a more general and scalable architecture, which can be very powerful when enough data and compute are available.