LESSON
Day 300: Vision Transformers (ViT) - Transformers for Images
The core idea: a Vision Transformer treats an image as a sequence of learned patch tokens, then applies Transformer-style sequence modeling to that patch sequence.
Today's "Aha!" Moment
The insight: ViT becomes easy to place once you stop thinking "images need a completely different deep learning family" and start thinking:
- what if we turn the image into a sequence first?
That is the key move. Instead of convolving over pixels with strong local biases from the start, ViT:
- chops the image into patches
- embeds each patch as a token
- adds position information
- runs a Transformer over the resulting sequence
Why this matters: ViT is one of the clearest examples of the Transformer becoming a general architecture pattern, not just an NLP tool.
Concrete anchor: A 224x224 image with 16x16 patches becomes a grid of 14x14 = 196 patches. Those 196 patches become the sequence the Transformer processes.
The practical sentence to remember:
ViT turns images into token sequences so the Transformer can reason over visual parts the way it reasons over words.
Why This Matters
ViT matters because it tests a major hypothesis:
- can a relatively generic Transformer architecture compete with CNNs and their hand-crafted visual inductive biases?
The answer turned out to be:
- yes, especially at sufficient scale and data volume
But the route there is not magic. ViT works because it:
- redefines the image as a patch sequence
- preserves some spatial information through positional encoding
- lets attention model long-range interactions directly across the visual field
This makes ViT important both conceptually and practically:
- conceptually, it shows Transformer ideas transfer beyond text
- practically, it changes how we think about visual backbones, pretraining, scaling, and multimodal systems
Learning Objectives
By the end of this session, you should be able to:
- Explain how ViT converts images into Transformer-friendly inputs using patch embeddings and positional encodings.
- Describe what ViT gains and loses compared with CNN-style inductive biases.
- Evaluate when ViT is a good fit, especially in relation to data scale, transfer learning, and long-range visual interactions.
Core Concepts Explained
Concept 1: ViT Starts by Converting the Image into Patch Tokens
Concrete example / mini-scenario: An RGB image of shape 224 x 224 x 3 is split into non-overlapping 16 x 16 patches.
Intuition: A Transformer expects a sequence of vectors, not a 2D image grid. So ViT first tokenizes the image into patches.
Technical structure (how it works):
If the patch size is P x P, then an image of size H x W yields:
N = (H / P) * (W / P)
patches.
Each patch is:
- flattened
- linearly projected into an embedding vector
So the pipeline begins like this:
image -> patches -> flatten -> linear projection -> patch embeddings
A learned [CLS] token is often prepended, just as in BERT-style models, to support image-level classification.
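A minimal PyTorch-style sketch of that pipeline, assuming a 224x224 RGB input, 16x16 patches, and a 768-dimensional embedding (the class name PatchEmbed and the hyperparameters are illustrative, not a reference implementation):

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    # Sketch: split a 224x224 RGB image into 16x16 patches, flatten each patch,
    # and project it to an embedding vector; prepend a learned [CLS] token.
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # 14 * 14 = 196
        # A strided convolution is equivalent to "flatten + linear projection" per patch.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, img):                        # img: (B, 3, 224, 224)
        x = self.proj(img)                         # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)           # (B, 196, 768) patch embeddings
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1)          # (B, 197, 768) with [CLS] prepended

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                                # torch.Size([2, 197, 768])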
Practical implications:
- the image becomes a token sequence
- patch size determines sequence length and local granularity
- smaller patches preserve more detail but increase attention cost
Fundamental trade-off: Large patches are cheaper but lose fine detail; small patches keep more local structure but make the token sequence much longer.
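To make that trade-off concrete, a quick back-of-the-envelope calculation for a 224x224 input (the patch sizes are just illustrative):

# How patch size affects sequence length and quadratic attention cost for a 224x224 image.
for patch_size in (32, 16, 8):
    num_tokens = (224 // patch_size) ** 2   # patch tokens per image
    attn_pairs = num_tokens ** 2            # pairwise interactions per attention layer
    print(patch_size, num_tokens, attn_pairs)
# 32 -> 49 tokens -> 2,401 pairs
# 16 -> 196 tokens -> 38,416 pairs
#  8 -> 784 tokens -> 614,656 pairs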
Mental model: ViT turns the image into a sentence of visual tiles.
Connection to other fields: Similar to document chunking in NLP: the chunk size controls both information granularity and compute cost.
When to use it:
- Best fit: visual pipelines where representing the image as a sequence is acceptable and global interactions matter.
- Misuse pattern: choosing patch size without thinking about the cost-detail trade-off.
Concept 2: Positional Encoding and Self-Attention Rebuild Spatial Context
Concrete example / mini-scenario: Once the image has been chopped into patch tokens, the model still needs to know where each patch came from in the original 2D image.
Intuition: Just as in text, attention alone does not encode order or position. In vision, that missing information is even more obvious because spatial layout is central to meaning.
Technical structure (how it works):
After patch embedding, ViT adds positional information to each patch token:
patch_token = patch_embedding + positional_encoding
Then a standard Transformer encoder stack processes the sequence:
- multi-head self-attention
- feed-forward network
- residual connections
- layer normalization
This gives each patch access to:
- nearby patches
- distant patches
- whole-image context
all through repeated attention layers.
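A minimal sketch of this stage using PyTorch's built-in encoder layers, assuming the 197-token sequence from the earlier patch-embedding sketch (depth, heads, and widths are roughly ViT-Base-like but illustrative, not prescriptive):

import torch
import torch.nn as nn

embed_dim, num_tokens = 768, 197                 # 196 patches + 1 [CLS] token

# Learned positional encodings, added to every token before attention.
pos_embed = nn.Parameter(torch.zeros(1, num_tokens, embed_dim))

# Standard Transformer encoder: multi-head self-attention + feed-forward,
# with residual connections and layer normalization in every layer.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, dim_feedforward=3072,
                               batch_first=True, norm_first=True),  # ViT uses pre-norm blocks
    num_layers=12,
)

tokens = torch.randn(2, num_tokens, embed_dim)   # stand-in for patch embeddings
x = tokens + pos_embed                           # patch_token = patch_embedding + positional_encoding
x = encoder(x)                                   # every patch attends to every other patch
print(x.shape)                                   # torch.Size([2, 197, 768])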
Practical implications:
- long-range visual relationships are easy to model directly
- the architecture is essentially the same as text-style Transformers, keeping the stack uniform
- global context does not require many stacked local convolutions to propagate
Fundamental trade-off: Attention makes global interaction easier, but loses some of the strong local spatial bias CNNs get "for free."
Mental model: Every image patch can directly "look at" other patches and decide which visual regions matter for interpreting itself.
Connection to other fields: Similar to graph-style all-to-all message passing over visual regions instead of only local neighborhood filtering.
When to use it:
- Best fit: tasks where whole-image or long-range interactions matter and enough scale is available.
- Misuse pattern: assuming positional encoding alone fully replaces all useful visual inductive bias.
Concept 3: ViT Trades CNN Inductive Bias for Architectural Generality and Scaling Behavior
Concrete example / mini-scenario: A CNN starts with strong assumptions:
- locality matters
- translation structure matters
- nearby pixels interact first
ViT starts with weaker built-in assumptions and lets more of that structure be learned from data.
Intuition: This is the real trade-off. CNNs know more about images before training; ViTs know less, but are more general and often scale better with data and model size.
Technical structure (how it works):
Compared with CNNs, ViTs typically:
- have weaker locality bias at initialization
- rely more heavily on data or pretraining
- use self-attention to model broad interactions across patches
This means ViTs often shine when:
- pretraining data is large
- transfer learning is available
- model scale is high enough to compensate for weaker inductive priors
Historically, this is why early ViTs were strongest in large-data regimes and then became broadly practical through better pretraining and data-efficient variants.
Practical implications:
- ViTs can be very strong backbones
- they often integrate naturally into multimodal systems
- they may need more data or stronger pretraining than CNNs in low-data settings
Fundamental trade-off:
- more architectural uniformity and strong scaling behavior
- less built-in visual prior, which can hurt in smaller-data or efficiency-sensitive regimes
Mental model: CNNs arrive with image-specific instincts; ViTs arrive with a more general reasoning mechanism and learn more of those instincts from data.
Connection to other fields: Similar to general-purpose systems versus specialized systems: specialization buys efficiency and priors, while generality buys reuse and scaling flexibility.
When to use it:
- Best fit: large-scale vision pretraining, transfer learning, and multimodal architectures.
- Misuse pattern: assuming ViT always dominates CNNs regardless of data size, compute budget, or deployment constraints.
Troubleshooting
Issue: "If ViT just turns images into tokens, doesn't it lose too much spatial detail?"
Why it happens / is confusing: Flattening patches sounds destructive.
Clarification / Fix: Some fine detail is compressed when each patch is flattened and projected to a single vector, yes. That is why patch size matters so much. Smaller patches keep more detail, but increase sequence length and cost.
Issue: "Why not just use CNNs if images are naturally 2D?"
Why it happens / is confusing: CNNs have strong built-in visual priors and remain highly effective.
Clarification / Fix: CNNs are still strong. ViT matters because it offers a more uniform Transformer-based backbone that can scale well and integrate naturally with text and multimodal systems.
Issue: "Does ViT prove that inductive bias is unnecessary?"
Why it happens / is confusing: Strong ViT results can make it sound like data alone always wins.
Clarification / Fix: No. ViT shows that weaker priors can work extremely well at scale, not that priors are irrelevant. In low-data or efficiency-constrained settings, inductive bias still matters a lot.
Advanced Connections
Connection 1: ViT <-> Multimodal Models
The parallel: Once images are represented as token-like patch embeddings, visual inputs become much easier to connect with text-style Transformer systems.
Real-world case: Many multimodal models build on exactly this representational move: tokenize images, then process or align them with text tokens.
Connection 2: ViT <-> The Limits of Quadratic Attention
The parallel: Patch sequences are still sequences, so ViT inherits the same attention-scaling issues as NLP Transformers.
Real-world case: This is part of why efficient Transformer variants matter beyond language and why the next lesson on breaking the O(n^2) barrier follows naturally.
Resources
Suggested Resources
- [PAPER] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale - arXiv
Focus: the original ViT paper and the patch-tokenization idea.
- [DOC] Hugging Face ViT model docs - Documentation
Focus: practical mapping from paper concepts to implementation.
- [PAPER] How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers - arXiv
Focus: useful context on why ViT behavior depends strongly on training recipe and scale.
Key Insights
- ViT works by tokenizing the image into patches, then treating those patches as a sequence for a Transformer encoder.
- Self-attention gives ViT direct global interaction across visual regions, but positional encoding is still needed to preserve spatial structure.
- ViT trades strong built-in visual bias for a more general and scalable architecture, which can be very powerful when enough data and compute are available.