LESSON
Day 305: Dataset Curation & Tokenization - The Foundation of LLMs
The core idea: an LLM is not trained on "raw text." It is trained on a curated stream of token sequences. That means dataset curation and tokenization are not preprocessing details around the model; they are part of the model's real behavior, cost, and limits.
Today's "Aha!" Moment
The insight: Before we talk about trillion-parameter training, distributed optimizers, or alignment, we need to lock in something more basic:
- what text enters the system
- in what proportions
- in what cleaned form
- and how that text is broken into tokens
Those choices determine what the model sees, what it forgets, what it overfits, how long contexts become in practice, and how much compute each training step costs.
Why this matters: Teams often talk about model architecture as if it were the main source of behavior, but the earliest decisions are upstream:
- the data mixture shapes what worlds the model inhabits
- the tokenizer shapes what units of language the model can reuse efficiently
Concrete anchor: If you train two identical Transformer stacks with different corpora or different tokenizers, you have not built "the same model with different inputs." You have built two models with different internal statistics, different cost profiles, and often different strengths and weaknesses.
Keep this mental hook in view: Your tokenizer and your data mix are part of the model, not just part of the pipeline.
Why This Matters
Month 19 closed with the idea that ChatGPT-style systems are built from:
- pretraining
- fine-tuning
- alignment
- runtime product layers
This month opens by going one step earlier:
- what exactly are we pretraining on?
That question is not cosmetic. Pretraining quality depends heavily on:
- document quality
- duplication rate
- domain balance
- contamination control
- multilingual coverage
- code vs prose ratio
- tokenization efficiency
If those foundations are weak, later optimization work mostly scales noise faster.
Learning Objectives
By the end of this session, you should be able to:
- Explain why dataset curation is a model design problem rather than a simple scraping step.
- Describe what tokenization changes mechanically in both learning dynamics and training cost.
- Evaluate trade-offs between dataset quality, vocabulary design, sequence length, and compute budget in LLM training.
Core Concepts Explained
Concept 1: Dataset Curation Decides What the Model Is Allowed to Generalize From
For example, a team trains a new base model on a giant web crawl plus some books and code. The model writes fluent prose, but it hallucinates documentation formats, reproduces near-duplicates, and behaves oddly on specialized domains that were underrepresented.
At a high level, more data is not automatically better data. A large corpus is only useful if its composition matches the behaviors you want to buy.
Mechanically: Dataset curation usually includes decisions like:
- source selection
- quality filtering
- deduplication
- language identification
- document normalization
- domain weighting
- contamination checks against evals or downstream benchmarks
Each one changes the empirical distribution the model sees during pretraining.
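To make two of these steps concrete, here is a minimal sketch of exact-hash deduplication and heuristic quality filtering. The thresholds are illustrative assumptions, not values from any real pipeline; production systems typically add fuzzy deduplication (for example MinHash) and learned quality classifiers.

```python
import hashlib

def normalize(doc: str) -> str:
    # Collapse whitespace and lowercase so trivially different copies hash identically.
    return " ".join(doc.lower().split())

def looks_like_junk(doc: str, min_words: int = 50, min_alpha_ratio: float = 0.6) -> bool:
    # Two illustrative heuristics: drop very short documents and documents that are
    # mostly non-alphabetic characters (boilerplate, markup debris, link farms).
    words = doc.split()
    if len(words) < min_words:
        return True
    alpha_chars = sum(ch.isalpha() for ch in doc)
    return alpha_chars / max(len(doc), 1) < min_alpha_ratio

def curate(docs):
    # Exact-hash deduplication; real pipelines usually add near-duplicate removal on top.
    seen = set()
    for doc in docs:
        if looks_like_junk(doc):
            continue
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        yield doc
```

Filtering and deduplication at this stage directly change which documents, and in what proportions, the optimizer ever sees.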
That means curation controls things like:
- what styles are common
- what domains dominate
- how much redundancy exists
- how much junk the optimizer wastes steps on
In practice:
- aggressive deduplication can improve data efficiency and reduce memorization pressure
- domain reweighting can make a model much stronger on code, math, legal text, or multilingual tasks
- contamination mistakes can give you falsely optimistic benchmark results
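The contamination point can be approximated with n-gram overlap between training documents and evaluation items. Below is a simplified, illustrative sketch using exact 8-gram matching on whitespace-split words; real setups differ in n-gram length, normalization, and how flagged documents are handled.

```python
def word_ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_eval_index(eval_items, n: int = 8) -> set:
    # Union of all n-grams that occur anywhere in the benchmark.
    index = set()
    for item in eval_items:
        index |= word_ngrams(item, n)
    return index

def is_contaminated(doc: str, eval_index: set, n: int = 8) -> bool:
    # A training document sharing even one long n-gram with an eval item is flagged.
    return not word_ngrams(doc, n).isdisjoint(eval_index)
```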
The trade-off is clear: Broader corpora buy coverage, but they also introduce more inconsistency, more low-quality text, and more cleaning work.
A useful mental model is: The dataset is the environment the model grows up in. If the environment is noisy, repetitive, or badly balanced, scale alone does not fix that upbringing.
Use this lens when:
- Best fit: designing pretraining corpora, continued pretraining, or domain adaptation datasets.
- Misuse pattern: assuming scraping more tokens is equivalent to adding more knowledge.
Concept 2: Tokenization Changes Both What the Model Learns and What It Costs to Learn It
For example, two teams train on the same raw text. One tokenizer splits common words and code symbols into few tokens; the other fragments them into many small pieces. The second model needs more tokens to say the same thing and spends more compute per training example.
At a high level, the model never sees characters or "words" directly. It sees token IDs. So the tokenizer defines the alphabet of reusable building blocks the model gets to think in.
Mechanically: A tokenizer typically:
- normalizes text
- learns or applies a vocabulary
- segments text into subword units
- maps those units to token IDs
Common families include:
- BPE-style tokenization
- unigram language-model tokenization
- byte-level approaches
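To make the BPE family concrete, here is a small sketch that trains a toy byte-level BPE tokenizer with the Hugging Face tokenizers library (listed in the resources below). The corpus, vocabulary size, and special token are placeholder assumptions.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# A toy corpus standing in for a real document stream.
corpus = [
    "def tokenize(text): return text.split()",
    "Tokenization decides which patterns are cheap to represent.",
]

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=1000,                      # illustrative; real vocabularies are tens to hundreds of thousands
    special_tokens=["<|endoftext|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

enc = tokenizer.encode("def tokenize(text):")
print(enc.tokens)   # the learned subword pieces
print(enc.ids)      # the token IDs the model would actually see
```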
The tokenizer design affects:
- average sequence length
- vocabulary size
- efficiency on rare words
- multilingual sharing
- code handling
- whitespace and punctuation behavior
In practice:
- shorter tokenized sequences mean more usable context and cheaper training per document
- bad tokenization of code or non-English text can silently handicap the model (see the sketch after this list)
- larger vocabularies reduce fragmentation but increase embedding/output matrix size
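To see fragmentation directly, the sketch below counts tokens for the same strings under two vocabularies using the tiktoken library from the resources. The sample strings are arbitrary; what matters is how the token count per string changes with the vocabulary.

```python
import tiktoken

samples = {
    "english prose": "The dataset is the environment the model grows up in.",
    "python code": "def pack(ids, n):\n    return [ids[i:i+n] for i in range(0, len(ids), n)]",
    # German sentence meaning "Tokenization determines which patterns the model can represent cheaply."
    "non-english": "Tokenisierung bestimmt, welche Muster das Modell guenstig darstellen kann.",
}

for name in ("gpt2", "cl100k_base"):          # an older and a newer, larger vocabulary
    enc = tiktoken.get_encoding(name)
    for label, text in samples.items():
        ids = enc.encode(text)
        print(f"{name:12s} {label:14s} {len(ids):3d} tokens for {len(text)} characters")
```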
The trade-off is clear: Better compression of text into tokens can reduce sequence length, but vocabulary growth and language-specific bias create their own costs.
A useful mental model is: Tokenization is a lossy but structured compression scheme that decides which patterns the model can represent cheaply.
Use this lens when:
- Best fit: choosing or auditing a tokenizer for a new base model, multilingual model, or code model.
- Misuse pattern: treating tokenization as fixed infrastructure that can be copied blindly from another project with different data.
Concept 3: Data Curation and Tokenization Together Define the Real Training Budget
For example, a team says it has "five trillion words" of data, but after cleaning, deduplication, and tokenization the usable corpus is much smaller and the actual training budget is constrained by total tokens, sequence packing efficiency, and optimizer throughput.
At a high level, the training budget is not paid in abstract documents. It is paid in useful tokens that fit through the model efficiently.
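As a back-of-envelope illustration with assumed rates (roughly 1.3 tokens per English word and about half the raw text surviving cleaning and deduplication):

```python
raw_words       = 5e12   # headline corpus size in words (assumed)
tokens_per_word = 1.3    # rough English BPE ratio (assumed; varies by tokenizer and language)
survival_rate   = 0.5    # fraction surviving filtering and deduplication (assumed)

raw_tokens    = raw_words * tokens_per_word
usable_tokens = raw_tokens * survival_rate
print(f"raw tokens: {raw_tokens:.2e}, usable tokens: {usable_tokens:.2e}")
# about 6.5e12 raw vs 3.3e12 usable under these assumptions
```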
Mechanically: in practice, the pipeline looks less like "collect text and train" and more like:
- collect raw documents
- filter and normalize them
- remove junk and duplicates
- tokenize into IDs
- pack IDs into training sequences (sketched after this list)
- sample from mixtures during training
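Here is a minimal sketch of the packing step, assuming token IDs already exist: documents are concatenated with an end-of-document token and cut into fixed-length training sequences. The sequence length and separator ID are illustrative; one common alternative is to avoid splitting documents and pad instead, at the cost of wasted context.

```python
from typing import Iterable, List

def pack_sequences(docs_ids: Iterable[List[int]], seq_len: int = 2048, eod_id: int = 0) -> List[List[int]]:
    # seq_len and eod_id are illustrative; real runs use the model's context length
    # and the tokenizer's actual end-of-document token.
    buffer: List[int] = []
    sequences: List[List[int]] = []
    for ids in docs_ids:
        buffer.extend(ids)
        buffer.append(eod_id)                # mark the document boundary
        while len(buffer) >= seq_len:
            sequences.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    # The leftover tail is either dropped or padded; dropping it is the simplest choice here.
    return sequences
```

Even in this toy form, you can see where tokens and context window space get spent.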
This is where several hidden costs emerge:
- low-quality text still consumes tokens
- poor tokenization inflates sequence length
- bad packing wastes context window space
- badly mixed domains distort gradient allocation
In practice:
- a cleaner smaller corpus can outperform a dirtier bigger one
- "tokens seen" only becomes meaningful after curation and tokenizer choice
- throughput, memory pressure, and final quality are all downstream of these early decisions
The trade-off is clear: Investing in better corpus construction and tokenizer fit slows down the start of training, but usually buys far better data efficiency and more trustworthy model behavior later.
A useful mental model is: Architecture is the engine, but curation and tokenization determine the fuel and how efficiently it burns.
Use this lens when:
- Best fit: planning pretraining budgets, comparing corpora, or diagnosing why scaling runs are underperforming.
- Misuse pattern: jumping directly to distributed training tricks before checking whether the input stream is worth scaling.
Troubleshooting
Issue: "We already have a huge corpus, so data curation is probably fine."
Why it happens / is confusing: Raw size feels like evidence of coverage.
Clarification / Fix: Large corpora often contain duplication, boilerplate, spam, and skewed domain mixtures. Measure usable tokens, duplication rates, and source balance instead of trusting gross size.
Issue: "Tokenization is just a compression detail."
Why it happens / is confusing: Tokenizers are easy to treat as preprocessing utilities rather than part of model design.
Clarification / Fix: Inspect how your tokenizer handles code, math, non-English text, whitespace, and rare terms. If frequent patterns are fragmented badly, you are paying recurring compute and quality costs.
Issue: "Benchmark gains prove our corpus is better."
Why it happens / is confusing: Better scores can hide contamination or domain overspecialization.
Clarification / Fix: Separate true generalization from benchmark leakage. Check contamination, hold out realistic domains, and evaluate behavior outside the most represented sources.
Advanced Connections
Connection 1: Dataset Curation <-> Alignment
Alignment begins later, but pretraining data already shapes what the model considers normal, plausible, or common. Fine-tuning can redirect behavior, but it works on top of priors created upstream by the corpus.
Connection 2: Tokenization <-> Systems Cost
Tokenization is not only an NLP choice. It affects:
- tokens per example
- memory footprint
- effective context usage
- throughput during training and inference
That makes it a systems decision as much as a modeling one.
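A quick sanity check on one of these costs: embedding and output matrices scale linearly with vocabulary size, so vocabulary growth adds parameters and memory before any quality effect is considered. The hidden size below is an assumed round number.

```python
d_model = 4096                      # assumed hidden size
for vocab_size in (32_000, 64_000, 128_000):
    # Input embedding plus an untied output projection, counted in parameters.
    params = 2 * vocab_size * d_model
    print(f"vocab {vocab_size:>7,}: {params / 1e6:,.0f}M embedding/output parameters")
```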
Resources
Optional Deepening Resources
- [PAPER] The Pile: An 800GB Dataset of Diverse Text for Language Modeling
- Focus: Why corpus composition and diversity matter in large-scale language model training.
-
- Focus: A foundational reference for modern subword tokenization design.
- [DOC] Hugging Face Tokenizers Documentation
- Focus: Practical tokenizer design, training, and inspection workflows.
- [DOC] tiktoken Repository
- Focus: A real production tokenizer implementation optimized for LLM workloads.
Key Insights
- Curating data is part of model design - The corpus defines what the model is exposed to, repeats, and prioritizes.
- Tokenization is part of model efficiency - The units you choose determine sequence length, compute cost, and how patterns are represented.
- Scaling only helps if the input stream deserves scaling - Better distributed training cannot rescue badly curated or badly tokenized data.