Day 305: Dataset Curation & Tokenization - The Foundation of LLMs

The core idea: an LLM is not trained on "raw text." It is trained on a curated stream of token sequences. That means dataset curation and tokenization are not preprocessing details around the model; they are part of the model's real behavior, cost, and limits.


Today's "Aha!" Moment

The insight: Before we talk about trillion-parameter training, distributed optimizers, or alignment, we need to lock in something more basic: which data the model trains on, and how that data becomes tokens.

Those choices determine what the model sees, what it forgets, what it overfits, how long contexts become in practice, and how much compute each training step costs.

Why this matters: Teams often talk about model architecture as if it were the main source of behavior, but the earliest decisions are upstream: which sources enter the corpus, how they are filtered and mixed, and how the text is broken into tokens.

Concrete anchor: If you train two identical Transformer stacks with different corpora or different tokenizers, you have not built "the same model with different inputs." You have built two models with different internal statistics, different cost profiles, and often different strengths and weaknesses.

Keep this mental hook in view: Your tokenizer and your data mix are part of the model, not just part of the pipeline.


Why This Matters

Month 19 closed with the idea that ChatGPT-style systems are built from:

  1. pretraining
  2. fine-tuning
  3. alignment
  4. runtime product layers

This month opens by going one step earlier: what does the model actually train on, and how does raw text become the token stream it learns from?

That question is not cosmetic. Pretraining quality depends heavily on:

  1. the composition and balance of the corpus
  2. how much duplication, boilerplate, and junk survive cleaning
  3. how well the tokenizer fits the text it has to represent

If those foundations are weak, later optimization work mostly scales noise faster.


Learning Objectives

By the end of this session, you should be able to:

  1. Explain why dataset curation is a model design problem rather than a simple scraping step.
  2. Describe what tokenization changes mechanically in both learning dynamics and training cost.
  3. Evaluate trade-offs between dataset quality, vocabulary design, sequence length, and compute budget in LLM training.

Core Concepts Explained

Concept 1: Dataset Curation Decides What the Model Is Allowed to Generalize From

For example, a team trains a new base model on a giant web crawl plus some books and code. The model writes fluent prose, but it hallucinates documentation formats, reproduces near-duplicates, and behaves oddly on specialized domains that were underrepresented.

At a high level, more data is not automatically better data. A large corpus is only useful if its composition matches the behaviors you want to buy.

Mechanically: Dataset curation usually includes decisions like:

  1. which sources to collect (web crawl, books, code, specialized domains)
  2. how aggressively to filter boilerplate, spam, and low-quality text
  3. how to remove exact and near-duplicate documents
  4. how to weight the mixture of domains and languages

Each one changes the empirical distribution the model sees during pretraining.

That means curation controls things like:

  1. which domains the model sees often enough to generalize well
  2. how much near-duplicate text it memorizes and reproduces
  3. which specialized areas stay underrepresented and brittle

In practice: measure usable tokens, duplication rates, and source balance rather than trusting gross corpus size; a minimal measurement sketch follows below.
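Here is a minimal sketch of that kind of audit, assuming the corpus is just a list of (source, text) pairs; the record format and the tiny demo corpus are illustrative, not a real pipeline.

```python
import hashlib
from collections import Counter

def normalize(text: str) -> str:
    # Cheap normalization so trivially different copies hash identically.
    return " ".join(text.lower().split())

def corpus_report(records):
    """records: iterable of (source, text) pairs -- a stand-in for a real corpus."""
    seen, duplicates = set(), 0
    source_counts = Counter()
    for source, text in records:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
        source_counts[source] += 1
    total = sum(source_counts.values())
    return {
        "documents": total,
        "exact_duplicate_rate": duplicates / max(total, 1),
        "source_mix": {s: n / total for s, n in source_counts.items()},
    }

# Hypothetical three-document corpus.
records = [
    ("web", "The quick brown fox."),
    ("web", "the quick  brown fox."),  # near-identical copy, caught by normalization
    ("books", "Call me Ishmael."),
]
print(corpus_report(records))
```

Real pipelines add near-duplicate detection (for example MinHash) on top of exact hashing, but even this level of measurement beats trusting raw size.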

The trade-off is clear: Broader corpora buy coverage, but they also introduce more inconsistency, more low-quality text, and more cleaning work.

A useful mental model is: The dataset is the environment the model grows up in. If the environment is noisy, repetitive, or badly balanced, scale alone does not fix that upbringing.

Use this lens when:

  1. deciding whether a new data source is worth adding
  2. comparing base models trained on different mixtures
  3. diagnosing odd behavior in underrepresented domains

Concept 2: Tokenization Changes Both What the Model Learns and What It Costs to Learn It

For example, two teams train on the same raw text. One tokenizer breaks common words and code symbols efficiently; the other fragments them badly into many small pieces. The second model needs more tokens to say the same thing and spends more compute per training example.

At a high level, the model never sees characters or "words" directly. It sees token IDs. So the tokenizer defines the alphabet of reusable building blocks the model gets to think in.
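Here is a toy illustration of that fragmentation cost, using a greedy longest-match segmenter and two made-up vocabularies; real tokenizers are learned from data, so the vocabularies here are assumptions for demonstration only.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation; unknown characters fall back to themselves."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

rich = {"return", " "}      # keeps a common word as one unit
poor = {"re", "tu", "rn"}   # fragments the same word into three pieces

text = "return return return"
print(tokenize(text, rich))       # ['return', ' ', 'return', ' ', 'return'] -> 5 tokens
print(len(tokenize(text, poor)))  # 11 tokens for the exact same text
```

Same text, more than twice the tokens: every attention step, gradient update, and context-window slot now pays for that fragmentation.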

Mechanically: A tokenizer typically:

  1. normalizes text
  2. learns or applies a vocabulary
  3. segments text into subword units
  4. maps those units to token IDs

Common families include byte-pair encoding (BPE), WordPiece, and SentencePiece-style unigram models, all of which learn subword vocabularies from data rather than relying on fixed word lists.
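To make the BPE family concrete, here is a toy sketch of merge learning: repeatedly fuse the most frequent adjacent pair of symbols. It starts from characters and skips the byte-level handling and word-boundary markers real implementations use.

```python
from collections import Counter

def learn_bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    words = [list(w) for w in corpus]  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in words:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        for symbols in words:          # apply the merge everywhere
            i = 0
            while i < len(symbols) - 1:
                if symbols[i] == a and symbols[i + 1] == b:
                    symbols[i:i + 2] = [a + b]
                i += 1
    return merges

print(learn_bpe_merges(["low", "lower", "lowest", "low", "low"], 3))
# e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')] -- frequent fragments become units
```

The key property: whatever is frequent in the training corpus becomes a cheap single token, which is exactly why tokenizer and corpus choices are entangled.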

The tokenizer design affects:

  1. how many tokens a given text costs, and therefore effective sequence length
  2. the size of the vocabulary and the embedding table
  3. how evenly different languages and formats are compressed
  4. which patterns the model can represent as single reusable units

In practice: inspect how a candidate tokenizer handles code, math, non-English text, whitespace, and rare terms before committing to it; fragmentation there becomes a recurring tax on every training step and every request.

The trade-off is clear: Better compression of text into tokens can reduce sequence length, but vocabulary growth and language-specific bias create their own costs.

A useful mental model is: Tokenization is a lossy but structured compression scheme that decides which patterns the model can represent cheaply.

Use this lens when:

  1. choosing or training a tokenizer for a new model
  2. budgeting context windows and per-step compute
  3. debugging why code or non-English text costs more tokens than expected

Concept 3: Data Curation and Tokenization Together Define the Real Training Budget

For example, a team says it has "five trillion words" of data, but after cleaning, deduplication, and tokenization the usable corpus is much smaller and the actual training budget is constrained by total tokens, sequence packing efficiency, and optimizer throughput.

At a high level, the training budget is not paid in abstract documents. It is paid in useful tokens that fit through the model efficiently.
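A back-of-envelope version of that accounting, where every factor is an illustrative assumption rather than a measured value:

```python
# Rough token-budget arithmetic (all factors are illustrative assumptions).
raw_words = 5e12            # the advertised "five trillion words"
survives_cleaning = 0.60    # fraction left after filtering and normalization
survives_dedup = 0.70       # fraction left after deduplication
tokens_per_word = 1.3       # depends entirely on tokenizer and language mix
packing_efficiency = 0.90   # fraction of sequence slots holding real tokens

usable = raw_words * survives_cleaning * survives_dedup * tokens_per_word
effective = usable * packing_efficiency
print(f"usable tokens:    {usable:.2e}")     # ~2.7e12
print(f"effective tokens: {effective:.2e}")  # ~2.5e12
```

Under these assumptions, "five trillion words" becomes roughly 2.5 trillion effective training tokens, far below what the headline figure suggests, before a single optimizer step runs.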

Mechanically: In practice, the pipeline looks more like:

  1. collect raw documents
  2. filter and normalize them
  3. remove junk and duplicates
  4. tokenize into IDs
  5. pack IDs into training sequences
  6. sample from mixtures during training

This is where several hidden costs emerge:

  1. usable tokens shrink sharply after filtering and deduplication
  2. poor sequence packing wastes compute on padding
  3. mixture sampling can starve some sources and overexpose others

In practice: measure the budget in effective tokens per step rather than in raw documents; the packing sketch below shows where slots get wasted.
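A minimal packing sketch, assuming documents arrive as lists of token IDs and that documents may be split across training sequences (a common pretraining choice); the pad ID and lengths are arbitrary:

```python
def pack_sequences(docs: list[list[int]], seq_len: int):
    """Greedily concatenate tokenized docs into fixed-length sequences,
    padding only the final partial sequence with 0 (a stand-in pad ID)."""
    sequences, current = [], []
    for doc in docs:
        current.extend(doc)
        while len(current) >= seq_len:
            sequences.append(current[:seq_len])
            current = current[seq_len:]
    if current:
        sequences.append(current + [0] * (seq_len - len(current)))
    real_tokens = sum(len(d) for d in docs)
    efficiency = real_tokens / (len(sequences) * seq_len)
    return sequences, efficiency

docs = [[1] * 700, [2] * 300, [3] * 100]  # three tokenized documents
seqs, eff = pack_sequences(docs, seq_len=512)
print(len(seqs), f"{eff:.0%}")            # 3 sequences, ~72% real tokens
```

Naive per-document padding instead of packing would score far worse on the same input; either way, the efficiency number multiplies directly into the compute bill.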

The trade-off is clear: Investing in better corpus construction and tokenizer fit slows down the start of training, but usually buys far better data efficiency and more trustworthy model behavior later.

A useful mental model is: Architecture is the engine, but curation and tokenization determine the fuel and how efficiently it burns.

Use this lens when:

  1. estimating how many optimizer steps a corpus actually buys
  2. comparing token budgets across tokenizers or mixtures
  3. deciding whether to invest more in curation before scaling up


Troubleshooting

Issue: "We already have a huge corpus, so data curation is probably fine."

Why it happens / is confusing: Raw size feels like evidence of coverage.

Clarification / Fix: Large corpora often contain duplication, boilerplate, spam, and skewed domain mixtures. Measure usable tokens, duplication rates, and source balance instead of trusting gross size.

Issue: "Tokenization is just a compression detail."

Why it happens / is confusing: Tokenizers are easy to treat as preprocessing utilities rather than part of model design.

Clarification / Fix: Inspect how your tokenizer handles code, math, non-English text, whitespace, and rare terms. If frequent patterns are fragmented badly, you are paying recurring compute and quality costs.

Issue: "Benchmark gains prove our corpus is better."

Why it happens / is confusing: Better scores can hide contamination or domain overspecialization.

Clarification / Fix: Separate true generalization from benchmark leakage. Check contamination, hold out realistic domains, and evaluate behavior outside the most represented sources.


Advanced Connections

Connection 1: Dataset Curation <-> Alignment

Alignment begins later, but pretraining data already shapes what the model considers normal, plausible, or common. Fine-tuning can redirect behavior, but it works on top of priors created upstream by the corpus.

Connection 2: Tokenization <-> Systems Cost

Tokenization is not only an NLP choice. It affects:

  1. sequence lengths, and with them attention compute and KV-cache memory
  2. the size of the embedding and output-projection tables
  3. tokens per request, and therefore serving throughput and cost

That makes it a systems decision as much as a modeling one.
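Two quick illustrations of that systems impact, with dimensions chosen as plausible assumptions rather than any specific model's configuration:

```python
# Rough systems-cost arithmetic (illustrative assumptions, not measured numbers).
d_model = 4096
bytes_per_param = 2  # fp16/bf16

for vocab in (32_000, 128_000):
    emb_gib = vocab * d_model * bytes_per_param / 2**30
    print(f"vocab {vocab:>7}: embedding table ~{emb_gib:.2f} GiB")
# vocab   32000: ~0.24 GiB; vocab  128000: ~0.98 GiB (double it if the
# output projection is untied from the input embedding)

# If a worse tokenizer needs 20% more tokens for the same text, every
# token-proportional cost -- training steps, KV-cache memory, serving
# latency -- grows by roughly that same 20%.
print(f"relative token-proportional cost: ~{1.20:.0%}")
```

Neither number is exotic, but both recur on every training run and every served request, which is why tokenizer decisions belong in systems reviews, not just modeling ones.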




Key Insights

  1. Curating data is part of model design - The corpus defines what the model is exposed to, repeats, and prioritizes.
  2. Tokenization is part of model efficiency - The units you choose determine sequence length, compute cost, and how patterns are represented.
  3. Scaling only helps if the input stream deserves scaling - Better distributed training cannot rescue badly curated or badly tokenized data.
