LESSON
Day 298: GPT - Generative Pre-trained Transformer
The core idea: GPT turns the Transformer decoder into a general-purpose generative model by training it to predict the next token from all previous tokens.
Today's "Aha!" Moment
The insight: If BERT showed that a pretrained encoder can become a reusable understanding backbone, GPT showed the mirror image:
- a pretrained decoder can become a reusable generation engine
Why this matters: GPT is not just a model that writes text. It is an architectural bet that:
- left-to-right next-token prediction
- over huge corpora
- with enough scale
can teach a model broad world knowledge, language structure, and task behavior without needing a different objective for every downstream use.
Concrete anchor: Give GPT the prompt "Translate to French: good morning ->". The model does not switch into a special translation architecture. It keeps doing the same thing it always does:
- predict the next token conditioned on the prompt prefix
That simple interface is part of why GPT-style models became so versatile.
The practical sentence to remember:
GPT treats many tasks as conditional continuation.
Why This Matters
GPT is the natural contrast to BERT:
- BERT: encoder-only, bidirectional, optimized for understanding visible input
- GPT: decoder-only, causal, optimized for generating the next token
That difference is not cosmetic. It changes:
- what context the model may use
- what pretraining objective it sees
- what downstream behaviors come naturally
Because GPT is trained autoregressively, it is naturally aligned with:
- continuation
- completion
- instruction following through prompting
- dialogue and free-form generation
This makes GPT the core architecture behind modern LLM-style assistants and generative systems.
Learning Objectives
By the end of this session, you should be able to:
- Explain why GPT uses a decoder-only causal architecture and why that suits generation.
- Describe the autoregressive training loop and how prompting turns many tasks into next-token prediction.
- Evaluate where GPT excels and where it is limited, especially compared with encoder-style models like BERT.
Core Concepts Explained
Concept 1: GPT Is a Decoder-Only Transformer with Causal Self-Attention
Concrete example / mini-scenario: If the model is predicting token 8, it may use tokens 1 through 7, but nothing at position 8 or later.
Intuition: GPT behaves like a writer moving forward through text. At each step, it sees the prefix and must continue it.
Technical structure (how it works):
GPT is built from stacked Transformer decoder-style blocks, which means:
- self-attention is causal
- each position can attend only to earlier positions and itself
- there is no bidirectional peek into the future
Architecturally, GPT still uses familiar pieces:
- token embeddings
- positional information
- masked self-attention
- feed-forward networks
- residuals and layer norm
But the causal mask changes the whole behavior of the stack.
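A minimal sketch of that causal mask, in illustrative PyTorch rather than any specific GPT implementation: attention scores toward future positions are set to -inf before the softmax, so each position ends up attending only to itself and to what came before it.
```python
import torch
import torch.nn.functional as F

# Toy setup: a 5-token sequence with 16-dimensional representations.
seq_len, d_model = 5, 16
x = torch.randn(seq_len, d_model)
q, k, v = x, x, x                          # a real block uses learned projections

# Raw attention scores between every pair of positions.
scores = q @ k.T / d_model ** 0.5          # shape: (seq_len, seq_len)

# Causal mask: position i may attend to positions 0..i, never to i+1 and beyond.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
scores = scores.masked_fill(~causal_mask, float("-inf"))

weights = F.softmax(scores, dim=-1)        # masked (future) positions get weight 0
output = weights @ v

print(weights)                             # upper triangle is all zeros
```
The printed weights have an all-zero upper triangle: that triangle is exactly the "future" the model is never allowed to see.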
Practical implications:
- the model is naturally suited to left-to-right generation
- inference can proceed token by token
- it cannot use right-context (future tokens) the way BERT can
Fundamental trade-off: Causality makes generation coherent and operationally natural, but it gives up the full bidirectional context that helps understanding models.
Mental model: GPT is always writing from the current cursor position forward, never rereading future words that have not been written yet.
Connection to other fields: Similar to online algorithms that must decide using only past and present information, not future observations.
When to use it:
- Best fit: generation, completion, dialogue, code synthesis, and prompt-conditioned continuation.
- Misuse pattern: expecting decoder-only models to be the most efficient choice for every pure understanding task.
Concept 2: Next-Token Prediction Is a Simple Objective with Surprisingly Broad Power
Concrete example / mini-scenario: Given the prefix:
"The capital of France is"
the model should assign high probability to the next token:
" Paris"
Intuition: If a model becomes very good at predicting what comes next across enough varied text, it must learn:
- syntax
- semantics
- common facts
- discourse patterns
- latent task structure
Technical structure (how it works):
GPT's pretraining objective is autoregressive language modeling:
maximize P(x_t | x_1, x_2, ..., x_{t-1})
for every token position t (equivalently, maximize the summed log-likelihood of the sequence).
Training proceeds by:
- feeding a prefix of tokens
- predicting the next token distribution
- comparing with the actual next token
- repeating over huge text corpora
This seems simple, but it is powerful because it turns almost all text into supervision.
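A minimal sketch of one such training step, assuming a hypothetical toy_lm that stands in for the real stack of causal Transformer blocks: the targets are simply the input tokens shifted left by one, and cross-entropy against those shifted targets is the entire objective.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 64

# Hypothetical toy language model: embedding + linear head.
# (A real GPT puts a stack of causal self-attention blocks in between.)
toy_lm = nn.Sequential(nn.Embedding(vocab_size, d_model),
                       nn.Linear(d_model, vocab_size))

tokens = torch.randint(0, vocab_size, (1, 32))        # one batch of 32 token ids

logits = toy_lm(tokens)                               # (1, 32, vocab_size)

# Next-token objective: position t must predict the token at position t+1,
# so the targets are the inputs shifted left by one.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = tokens[:, 1:].reshape(-1)

loss = F.cross_entropy(pred, target)                  # = -mean log P(x_{t+1} | prefix)
loss.backward()                                       # one step of the training loop
```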
Practical implications:
- pretraining data is abundant
- one training objective scales to many domains
- broad capabilities can emerge without task-specific architecture changes
Fundamental trade-off: The objective is simple and general, but it is also expensive at large scale and not perfectly aligned with every task you might care about.
Mental model: GPT learns language by playing an enormous, continuous game of "what comes next?"
Connection to other fields: Similar to self-supervised objectives elsewhere that look simple locally but force broad structure learning globally.
When to use it:
- Best fit: large-scale pretraining when flexible generative behavior is the goal.
- Misuse pattern: assuming next-token prediction automatically guarantees factuality, reasoning reliability, or task correctness.
Concept 3: Prompting Turns the Prefix into a General Task Interface
Concrete example / mini-scenario: These all fit the same basic GPT interface:
- "Summarize this article:"
- "Translate to German:"
- "Write Python code that..."
- "Question: ... Answer:"
Intuition: If the model is always predicting the next token from a prefix, then the prefix becomes a programmable interface. You do not need a new architecture for each task; you shape behavior through context.
Technical structure (how it works):
At inference time, GPT receives a prompt prefix and repeatedly:
- computes a next-token distribution
- samples or selects a token
- appends it to the context
- repeats until stopping
This means tasks can often be reformulated as:
- write the right prefix
- let the model continue
That is the bridge from language modeling to modern prompting.
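A sketch of that inference loop, again assuming the transformers library and the gpt2 checkpoint (production code would normally call model.generate with key-value caching; the explicit loop just makes the mechanics visible):
```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# The prompt prefix is the "program": it conditions every later prediction.
prompt = "Translate to French: good morning ->"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(20):                                   # generate up to 20 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits              # (1, seq_len, vocab_size)
    next_token_logits = logits[0, -1]

    # Greedy selection for simplicity; sampling or top-p are common alternatives.
    next_token = torch.argmax(next_token_logits).unsqueeze(0).unsqueeze(0)

    input_ids = torch.cat([input_ids, next_token], dim=-1)   # append and repeat
    if next_token.item() == tokenizer.eos_token_id:
        break

print(tokenizer.decode(input_ids[0]))
```
A small checkpoint like gpt2 will not translate reliably; the point is that the interface never changes, only the prefix does.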
Practical implications:
- one model can serve many tasks
- prompt design becomes part of system behavior
- model capability and prompt quality interact strongly
But it also creates limits:
- outputs may drift
- hallucinations remain possible
- long generation compounds earlier mistakes
Fundamental trade-off:
- huge interface flexibility
- but less deterministic control than task-specific supervised pipelines
Mental model: Prompting is task specification by context rather than by architecture.
Connection to other fields: Similar to treating a shell or API as a general interface where the same engine behaves differently depending on the input program or command.
When to use it:
- Best fit: flexible generative tasks, assistant behavior, open-ended completion, and instruction following.
- Misuse pattern: assuming prompting alone replaces evaluation, grounding, or task-specific safeguards.
Troubleshooting
Issue: "If GPT is trained only to predict the next token, how can it do translation, QA, or coding?"
Why it happens / is confusing: The training objective sounds too narrow to produce broad behavior.
Clarification / Fix: The objective is narrow locally but broad in aggregate. Across huge corpora, next-token prediction forces the model to internalize many latent structures and task patterns.
Issue: "Why isn't GPT automatically better than BERT at every NLP task?"
Why it happens / is confusing: GPT models are highly capable, so it is tempting to flatten architectural differences.
Clarification / Fix: GPT is naturally aligned with generation and prompt-conditioned continuation. Encoder-style models may still be more efficient or appropriate for some pure understanding or retrieval tasks.
Issue: "Does prompting mean the model truly understands the task?"
Why it happens / is confusing: Good outputs can make the model seem more reliable than it is.
Clarification / Fix: Prompting is a powerful interface, but capability is still probabilistic. You still need evaluation, grounding, constraints, and system design around the model.
Advanced Connections
Connection 1: GPT <-> The Rise of Prompt Programming
The parallel: Once next-token models became strong enough, the input prefix itself became a control surface for behavior.
Real-world case: This is why prompt design, system prompts, tool-calling instructions, and in-context examples later become practical engineering concerns.
Connection 2: GPT <-> Scaling as a Capability Strategy
The parallel: GPT-style models showed that one simple objective, pushed far enough in model size and data scale, could unlock surprisingly broad behaviors.
Real-world case: Much of the modern LLM ecosystem inherits that scaling-first philosophy, then layers instruction tuning, alignment, and tooling on top.
Resources
Suggested Resources
- [PAPER] Improving Language Understanding by Generative Pre-Training - OpenAI PDF
  Focus: the original GPT framing of generative pretraining.
- [PAPER] Language Models are Unsupervised Multitask Learners - OpenAI PDF
  Focus: a key step in showing broad task behavior emerging from scaled language modeling.
- [DOC] Hugging Face GPT-2 model docs - Documentation
  Focus: practical mapping from the decoder-only paper idea to modern implementation usage.
Key Insights
- GPT is a decoder-only causal Transformer, naturally aligned with generation rather than bidirectional understanding.
- Autoregressive next-token prediction is the core pretraining objective, and its simplicity is part of its power.
- Prompting works because GPT treats the prefix as task-conditioning context, turning many tasks into controlled continuation.