Day 298: GPT - Generative Pre-trained Transformer

The core idea: GPT turns the Transformer decoder into a general-purpose generative model by training it to predict the next token from all previous tokens.


Today's "Aha!" Moment

The insight: If BERT showed that a pretrained encoder can become a reusable understanding backbone, GPT showed the mirror image: a pretrained decoder can become a reusable generation backbone.

Why this matters: GPT is not just a model that writes text. It is an architectural bet that one simple objective, next-token prediction over massive and varied text, can teach a model broad world knowledge, language structure, and task behavior without needing a different objective for every downstream use.

Concrete anchor: Give GPT the prompt "Translate to French: good morning ->". The model does not switch into a special translation architecture. It keeps doing the same thing it always does: predicting the next token given the prefix, which here makes "bonjour" a highly probable continuation.

That simple interface is part of why GPT-style models became so versatile.

The practical sentence to remember:
GPT treats many tasks as conditional continuation.


Why This Matters

GPT is the natural contrast to BERT: BERT is a bidirectional encoder pretrained for understanding, while GPT is a causal decoder pretrained for generation.

That difference is not cosmetic. It changes what each token can attend to, what the pretraining objective looks like, and what interface the model exposes at inference time.

Because GPT is trained autoregressively, it is naturally aligned with open-ended generation: completion, dialogue, drafting, and any task that can be phrased as continuing a prefix.

This makes GPT the core architecture behind modern LLM-style assistants and generative systems.


Learning Objectives

By the end of this session, you should be able to:

  1. Explain why GPT uses a decoder-only causal architecture and why that suits generation.
  2. Describe the autoregressive training loop and how prompting turns many tasks into next-token prediction.
  3. Evaluate where GPT excels and where it is limited, especially compared with encoder-style models like BERT.

Core Concepts Explained

Concept 1: GPT Is a Decoder-Only Transformer with Causal Self-Attention

Concrete example / mini-scenario: If the model is predicting token 8, it may use tokens 1 through 7, but not token 9 or beyond.

Intuition: GPT behaves like a writer moving forward through text. At each step, it sees the prefix and must continue it.

Technical structure (how it works):

GPT is built from stacked Transformer decoder-style blocks, which means each position is processed with causal (masked) self-attention: position t can attend to positions 1 through t, and never to later ones.

Architecturally, GPT still uses familiar pieces: token and position embeddings, multi-head self-attention, feed-forward layers, residual connections, and layer normalization.

But the causal mask changes the whole behavior of the stack.
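To make the mask concrete, here is a minimal single-head sketch in NumPy (the function name, shapes, and weights are illustrative assumptions, not any library's API): scores strictly above the diagonal are set to negative infinity before the softmax, so position t can only attend to positions 1 through t.

    import numpy as np

    def causal_self_attention(x, W_q, W_k, W_v):
        # x: (T, d) token vectors; W_q, W_k, W_v: (d, d) projection matrices
        T, d = x.shape
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        scores = (Q @ K.T) / np.sqrt(d)                    # (T, T) attention logits
        mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True strictly above diagonal
        scores = np.where(mask, -np.inf, scores)           # hide all future positions
        scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)     # softmax over visible past
        return weights @ V                                 # row t mixes only tokens 1..t

    # Example: 5 tokens of width 8; out[t] depends only on x[:t+1]
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 8))
    W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
    out = causal_self_attention(x, W_q, W_k, W_v)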

Practical implications: training can score every position in parallel under the mask, while inference must generate one token at a time, each conditioned on everything produced so far.

Fundamental trade-off: Causality makes generation coherent and operationally natural, but it gives up the full bidirectional context that helps understanding models.

Mental model: GPT is always writing from the current cursor position forward, never rereading future words that have not been written yet.

Connection to other fields: Similar to online algorithms that must decide using only past and present information, not future observations.

When to use it: reach for a causal decoder whenever the end product is generated text, such as completion, dialogue, drafting, or code.

Concept 2: Next-Token Prediction Is a Simple Objective with Surprisingly Broad Power

Concrete example / mini-scenario: Given the prefix:

"The capital of France is"

the model should assign high probability to the next token:

" Paris"

Intuition: If a model becomes very good at predicting what comes next across enough varied text, it must learn facts, grammar, discourse structure, and the recurring formats of many tasks.

Technical structure (how it works):

GPT's pretraining objective is autoregressive language modeling:

maximize P(x_t | x_1, x_2, ..., x_{t-1})

for every token position t.
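In log space, maximizing these conditionals jointly is the same as maximizing the log-likelihood of the whole sequence, since the chain rule factorizes it exactly (in LaTeX notation):

    \log P(x_1, \dots, x_T) = \sum_{t=1}^{T} \log P(x_t \mid x_1, \dots, x_{t-1})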

Training proceeds by:

  1. feeding a prefix of tokens
  2. predicting the next token distribution
  3. comparing with the actual next token
  4. repeating over huge text corpora

This seems simple, but it is powerful because it turns almost all text into supervision.
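As a sketch of that loop in PyTorch, assume a hypothetical model that maps token ids of shape (batch, T) to next-token logits of shape (batch, T, vocab); the key detail is that the targets are simply the inputs shifted left by one position.

    import torch
    import torch.nn.functional as F

    def next_token_loss(model, tokens):
        # tokens: (batch, T) integer ids sampled from the corpus
        inputs = tokens[:, :-1]                  # the prefix ending at each position
        targets = tokens[:, 1:]                  # the actual next token at each position
        logits = model(inputs)                   # (batch, T-1, vocab_size)
        return F.cross_entropy(                  # mean negative log-likelihood
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
        )

    # One training step (optimizer and batch assumed):
    # loss = next_token_loss(model, batch_tokens)
    # loss.backward(); optimizer.step(); optimizer.zero_grad()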

Practical implications: pretraining needs no labels, so essentially any text corpus becomes supervision, which is what makes web-scale pretraining practical.

Fundamental trade-off: The objective is simple and general, but it is also expensive at large scale and not perfectly aligned with every task you might care about.

Mental model: GPT learns language by playing an enormous, continuous game of "what comes next?"

Connection to other fields: Similar to self-supervised objectives elsewhere that look simple locally but force broad structure learning globally.

When to use it: when you want one pretrained model whose knowledge can be reused across many downstream tasks instead of designing a separate objective for each.

Concept 3: Prompting Turns the Prefix into a General Task Interface

Concrete example / mini-scenario: These all fit the same basic GPT interface:

"Translate to French: good morning ->"

"Q: Who wrote Hamlet? A:"

"Summarize the following article: ... TL;DR:"

Intuition: If the model is always predicting the next token from a prefix, then the prefix becomes a programmable interface. You do not need a new architecture for each task; you shape behavior through context.

Technical structure (how it works):

At inference time, GPT receives a prompt prefix and repeatedly:

  1. computes a next-token distribution
  2. samples or selects a token
  3. appends it to the context
  4. repeats until stopping

This means tasks can often be reformulated as: write a prefix that specifies the task, then let the model continue it.

That is the bridge from language modeling to modern prompting.
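A minimal sampling loop, again sketched in PyTorch with a hypothetical causal model (real systems add batching, caching, and smarter decoding strategies):

    import torch

    @torch.no_grad()
    def generate(model, prompt_ids, max_new_tokens=50, temperature=1.0, eos_id=None):
        # prompt_ids: (1, T) token ids encoding the task as a prefix
        ids = prompt_ids
        for _ in range(max_new_tokens):
            logits = model(ids)[:, -1, :]            # 1. next-token distribution
            probs = torch.softmax(logits / temperature, dim=-1)
            next_id = torch.multinomial(probs, 1)    # 2. sample (or argmax to select)
            ids = torch.cat([ids, next_id], dim=1)   # 3. append to the context
            if eos_id is not None and next_id.item() == eos_id:
                break                                # 4. stop at end-of-sequence
        return ids

Nothing in this loop is task-specific: translation, QA, and summarization differ only in what prompt_ids encodes, which is the whole point of prompting as an interface.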

Practical implications: a single deployed model can serve many tasks, and its behavior can be redirected by editing the prompt rather than retraining.

But it also creates limits: outputs are sensitive to prompt wording, the context window bounds how much task specification fits, and fluent continuations are not guaranteed to be correct.

Fundamental trade-off: prompting is flexible and cheap to iterate on, but it is less precise and less reliable than training a model for one specific task.

Mental model: Prompting is task specification by context rather than by architecture.

Connection to other fields: Similar to treating a shell or API as a general interface where the same engine behaves differently depending on the input program or command.

When to use it: for fast task adaptation, prototyping, or tasks where labeled data is scarce but the task can be described in the prompt.


Troubleshooting

Issue: "If GPT is trained only to predict the next token, how can it do translation, QA, or coding?"

Why it happens / is confusing: The training objective sounds too narrow to produce broad behavior.

Clarification / Fix: The objective is narrow locally but broad in aggregate. Across huge corpora, next-token prediction forces the model to internalize many latent structures and task patterns.

Issue: "Why isn't GPT automatically better than BERT at every NLP task?"

Why it happens / is confusing: GPT models are highly capable, so it is tempting to assume architectural differences no longer matter.

Clarification / Fix: GPT is naturally aligned with generation and prompt-conditioned continuation. Encoder-style models may still be more efficient or appropriate for some pure understanding or retrieval tasks.

Issue: "Does prompting mean the model truly understands the task?"

Why it happens / is confusing: Good outputs can make the model seem more reliable than it is.

Clarification / Fix: Prompting is a powerful interface, but capability is still probabilistic. You still need evaluation, grounding, constraints, and system design around the model.


Advanced Connections

Connection 1: GPT <-> The Rise of Prompt Programming

The parallel: Once next-token models became strong enough, the input prefix itself became a control surface for behavior.

Real-world case: This is why prompt design, system prompts, tool-calling instructions, and in-context examples later become practical engineering concerns.

Connection 2: GPT <-> Scaling as a Capability Strategy

The parallel: GPT-style models showed that one simple objective, pushed far enough in model size and data scale, could unlock surprisingly broad behaviors.

Real-world case: Much of the modern LLM ecosystem inherits that scaling-first philosophy, then layers instruction tuning, alignment, and tooling on top.



Key Insights

  1. GPT is a decoder-only causal Transformer, naturally aligned with generation rather than bidirectional understanding.
  2. Autoregressive next-token prediction is the core pretraining objective, and its simplicity is part of its power.
  3. Prompting works because GPT treats the prefix as task-conditioning context, turning many tasks into controlled continuation.
