LESSON
Day 298: GPT - Generative Pre-trained Transformer
The core idea: GPT turns the Transformer decoder into a general-purpose generative model by training it to predict the next token from all previous tokens.
Today's "Aha!" Moment
The insight: If BERT showed that a pretrained encoder can become a reusable understanding backbone, GPT showed the mirror image:
- a pretrained decoder can become a reusable generation engine
Why this matters: GPT is not just a model that writes text. It is an architectural bet that:
- left-to-right next-token prediction
- over huge corpora
- with enough scale
can teach a model broad world knowledge, language structure, and task behavior without needing a different objective for every downstream use.
Concrete anchor: Give GPT the prompt "Translate to French: good morning ->". The model does not switch into a special translation architecture. It keeps doing the same thing it always does:
- predict the next token conditioned on the prompt prefix
That simple interface is part of why GPT-style models became so versatile.
The practical sentence to remember:
GPT treats many tasks as conditional continuation.
Why This Matters
GPT is the natural contrast to BERT:
- BERT: encoder-only, bidirectional, optimized for understanding visible input
- GPT: decoder-only, causal, optimized for generating the next token
That difference is not cosmetic. It changes:
- what context the model may use
- what pretraining objective it sees
- what downstream behaviors come naturally
Because GPT is trained autoregressively, it is naturally aligned with:
- continuation
- completion
- instruction following through prompting
- dialogue and free-form generation
This makes GPT the core architecture behind modern LLM-style assistants and generative systems.
Learning Objectives
By the end of this session, you should be able to:
- Explain why GPT uses a decoder-only causal architecture and why that suits generation.
- Describe the autoregressive training loop and how prompting turns many tasks into next-token prediction.
- Evaluate where GPT excels and where it is limited, especially compared with encoder-style models like BERT.
Core Concepts Explained
Concept 1: GPT Is a Decoder-Only Transformer with Causal Self-Attention
Concrete example / mini-scenario: If the model is predicting token 8, it may use tokens 1 through 7, but nothing at position 8 or later.
Intuition: GPT behaves like a writer moving forward through text. At each step, it sees the prefix and must continue it.
Technical structure (how it works):
GPT is built from stacked Transformer decoder-style blocks, which means:
- self-attention is causal
- each position can attend only to earlier positions and itself
- there is no bidirectional peek into the future
Architecturally, GPT still uses familiar pieces:
- token embeddings
- positional information
- masked self-attention
- feed-forward networks
- residuals and layer norm
But the causal mask changes the whole behavior of the stack.
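A minimal sketch of that causal mask, in illustrative PyTorch rather than any specific GPT implementation: attention scores toward future positions are set to -inf before the softmax, so each position ends up attending only to itself and to what came before it.
```python
import torch
import torch.nn.functional as F

# Toy setup: a 5-token sequence with 16-dimensional representations.
seq_len, d_model = 5, 16
x = torch.randn(seq_len, d_model)
q, k, v = x, x, x                          # a real block uses learned projections

# Raw attention scores between every pair of positions.
scores = q @ k.T / d_model ** 0.5          # shape: (seq_len, seq_len)

# Causal mask: position i may attend to positions 0..i, never to i+1 and beyond.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
scores = scores.masked_fill(~causal_mask, float("-inf"))

weights = F.softmax(scores, dim=-1)        # masked (future) positions get weight 0
output = weights @ v

print(weights)                             # upper triangle is all zeros
```
The printed weights have an all-zero upper triangle: that triangle is exactly the "future" the model is never allowed to see.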
Practical implications:
- the model is naturally suited to left-to-right generation
- inference can proceed token by token
- it cannot use right-context (future tokens) the way BERT can
Fundamental trade-off: Causality makes generation coherent and operationally natural, but it gives up the full bidirectional context that helps understanding models.
Mental model: GPT is always writing from the current cursor position forward, never rereading future words that have not been written yet.
Connection to other fields: Similar to online algorithms that must decide using only past and present information, not future observations.
When to use it:
- Best fit: generation, completion, dialogue, code synthesis, and prompt-conditioned continuation.
- Misuse pattern: expecting decoder-only models to be the most efficient choice for every pure understanding task.
Concept 2: Next-Token Prediction Is a Simple Objective with Surprisingly Broad Power
Concrete example / mini-scenario: Given the prefix:
"The capital of France is"
the model should assign high probability to the next token:
" Paris"
Intuition: If a model becomes very good at predicting what comes next across enough varied text, it must learn:
- syntax
- semantics
- common facts
- discourse patterns
- latent task structure
Technical structure (how it works):
GPT's pretraining objective is autoregressive language modeling:
maximize P(x_t | x_1, x_2, ..., x_{t-1})
for every token position t (equivalently, maximize the summed log-likelihood of the sequence).
Training proceeds by:
- feeding a prefix of tokens
- predicting the next token distribution
- comparing with the actual next token
- repeating over huge text corpora
This seems simple, but it is powerful because it turns almost all text into supervision.
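A minimal sketch of one such training step, assuming a hypothetical toy_lm that stands in for the real stack of causal Transformer blocks: the targets are simply the input tokens shifted left by one, and cross-entropy against those shifted targets is the entire objective.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 64

# Hypothetical toy language model: embedding + linear head.
# (A real GPT puts a stack of causal self-attention blocks in between.)
toy_lm = nn.Sequential(nn.Embedding(vocab_size, d_model),
                       nn.Linear(d_model, vocab_size))

tokens = torch.randint(0, vocab_size, (1, 32))        # one batch of 32 token ids

logits = toy_lm(tokens)                               # (1, 32, vocab_size)

# Next-token objective: position t must predict the token at position t+1,
# so the targets are the inputs shifted left by one.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = tokens[:, 1:].reshape(-1)

loss = F.cross_entropy(pred, target)                  # = -mean log P(x_{t+1} | prefix)
loss.backward()                                       # one step of the training loop
```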
Practical implications:
- pretraining data is abundant
- one training objective scales to many domains
- broad capabilities can emerge without task-specific architecture changes
Fundamental trade-off: The objective is simple and general, but it is also expensive at large scale and not perfectly aligned with every task you might care about.
Mental model: GPT learns language by playing an enormous, continuous game of "what comes next?"
Connection to other fields: Similar to self-supervised objectives elsewhere that look simple locally but force broad structure learning globally.
When to use it:
- Best fit: large-scale pretraining when flexible generative behavior is the goal.
- Misuse pattern: assuming next-token prediction automatically guarantees factuality, reasoning reliability, or task correctness.
Concept 3: Prompting Turns the Prefix into a General Task Interface
Concrete example / mini-scenario: These all fit the same basic GPT interface:
- "Summarize this article:"
- "Translate to German:"
- "Write Python code that..."
- "Question: ... Answer:"
Intuition: If the model is always predicting the next token from a prefix, then the prefix becomes a programmable interface. You do not need a new architecture for each task; you shape behavior through context.
Technical structure (how it works):
At inference time, GPT receives a prompt prefix and repeatedly:
- computes a next-token distribution
- samples or selects a token
- appends it to the context
- repeats until stopping
This means tasks can often be reformulated as:
- write the right prefix
- let the model continue
That is the bridge from language modeling to modern prompting.
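A sketch of that inference loop, again assuming the transformers library and the gpt2 checkpoint (production code would normally call model.generate with key-value caching; the explicit loop just makes the mechanics visible):
```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# The prompt prefix is the "program": it conditions every later prediction.
prompt = "Translate to French: good morning ->"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(20):                                   # generate up to 20 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits              # (1, seq_len, vocab_size)
    next_token_logits = logits[0, -1]

    # Greedy selection for simplicity; sampling or top-p are common alternatives.
    next_token = torch.argmax(next_token_logits).unsqueeze(0).unsqueeze(0)

    input_ids = torch.cat([input_ids, next_token], dim=-1)   # append and repeat
    if next_token.item() == tokenizer.eos_token_id:
        break

print(tokenizer.decode(input_ids[0]))
```
A small checkpoint like gpt2 will not translate reliably; the point is that the interface never changes, only the prefix does.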
Practical implications:
- one model can serve many tasks
- prompt design becomes part of system behavior
- model capability and prompt quality interact strongly
But it also creates limits:
- outputs may drift
- hallucinations remain possible
- long generation compounds earlier mistakes
Fundamental trade-off:
- huge interface flexibility
- but less deterministic control than task-specific supervised pipelines
Mental model: Prompting is task specification by context rather than by architecture.
Connection to other fields: Similar to treating a shell or API as a general interface where the same engine behaves differently depending on the input program or command.
When to use it:
- Best fit: flexible generative tasks, assistant behavior, open-ended completion, and instruction following.
- Misuse pattern: assuming prompting alone replaces evaluation, grounding, or task-specific safeguards.
Troubleshooting
Issue: "If GPT is trained only to predict the next token, how can it do translation, QA, or coding?"
Why it happens / is confusing: The training objective sounds too narrow to produce broad behavior.
Clarification / Fix: The objective is narrow locally but broad in aggregate. Across huge corpora, next-token prediction forces the model to internalize many latent structures and task patterns.
Issue: "Why isn't GPT automatically better than BERT at every NLP task?"
Why it happens / is confusing: GPT models are highly capable, so it is tempting to flatten architectural differences.
Clarification / Fix: GPT is naturally aligned with generation and prompt-conditioned continuation. Encoder-style models may still be more efficient or appropriate for some pure understanding or retrieval tasks.
Issue: "Does prompting mean the model truly understands the task?"
Why it happens / is confusing: Good outputs can make the model seem more reliable than it is.
Clarification / Fix: Prompting is a powerful interface, but capability is still probabilistic. You still need evaluation, grounding, constraints, and system design around the model.
Advanced Connections
Connection 1: GPT <-> The Rise of Prompt Programming
The parallel: Once next-token models became strong enough, the input prefix itself became a control surface for behavior.
Real-world case: This is why prompt design, system prompts, tool-calling instructions, and in-context examples later become practical engineering concerns.
Connection 2: GPT <-> Scaling as a Capability Strategy
The parallel: GPT-style models showed that one simple objective, pushed far enough in model size and data scale, could unlock surprisingly broad behaviors.
Real-world case: Much of the modern LLM ecosystem inherits that scaling-first philosophy, then layers instruction tuning, alignment, and tooling on top.
Resources
Suggested Resources
- [PAPER] Improving Language Understanding by Generative Pre-Training - OpenAI PDF
  Focus: the original GPT framing of generative pretraining.
- [PAPER] Language Models are Unsupervised Multitask Learners - OpenAI PDF
  Focus: a key step in showing broad task behavior emerging from scaled language modeling.
- [DOC] Hugging Face GPT-2 model docs - Documentation
  Focus: practical mapping from the decoder-only paper idea to modern implementation usage.
Key Insights
- GPT is a decoder-only causal Transformer, naturally aligned with generation rather than bidirectional understanding.
- Autoregressive next-token prediction is the core pretraining objective, and its simplicity is part of its power.
- Prompting works because GPT treats the prefix as task-conditioning context, turning many tasks into controlled continuation.