LESSON
Day 305: Dataset Curation & Tokenization - The Foundation of LLMs
The core idea: an LLM is not trained on "raw text." It is trained on a curated stream of token sequences. That means dataset curation and tokenization are not preprocessing details around the model; they are part of the model's real behavior, cost, and limits.
Today's "Aha!" Moment
The insight: Before we talk about trillion-parameter training, distributed optimizers, or alignment, we need to lock in something more basic:
- what text enters the system
- in what proportions
- in what cleaned form
- and how that text is broken into tokens
Those choices determine what the model sees, what it forgets, what it overfits, how long contexts become in practice, and how much compute each training step costs.
Why this matters: Teams often talk about model architecture as if it were the main source of behavior, but the earliest decisions are upstream:
- the data mixture shapes what worlds the model inhabits
- the tokenizer shapes what units of language the model can reuse efficiently
Concrete anchor: If you train two identical Transformer stacks with different corpora or different tokenizers, you have not built "the same model with different inputs." You have built two models with different internal statistics, different cost profiles, and often different strengths and weaknesses.
Keep this mental hook in view: Your tokenizer and your data mix are part of the model, not just part of the pipeline.
Why This Matters
Month 19 closed with the idea that ChatGPT-style systems are built from:
- pretraining
- fine-tuning
- alignment
- runtime product layers
This month opens by going one step earlier:
- what exactly are we pretraining on?
That question is not cosmetic. Pretraining quality depends heavily on:
- document quality
- duplication rate
- domain balance
- contamination control
- multilingual coverage
- code vs prose ratio
- tokenization efficiency
If those foundations are weak, later optimization work mostly scales noise faster.
Learning Objectives
By the end of this session, you should be able to:
- Explain why dataset curation is a model design problem rather than a simple scraping step.
- Describe what tokenization changes mechanically in both learning dynamics and training cost.
- Evaluate trade-offs between dataset quality, vocabulary design, sequence length, and compute budget in LLM training.
Core Concepts Explained
Concept 1: Dataset Curation Decides What the Model Is Allowed to Generalize From
For example, a team trains a new base model on a giant web crawl plus some books and code. The model writes fluent prose, but it hallucinates documentation formats, reproduces near-duplicates, and behaves oddly on specialized domains that were underrepresented.
At a high level, more data is not automatically better data. A large corpus is only useful if its composition matches the behaviors you want to buy.
Mechanically: Dataset curation usually includes decisions like:
- source selection
- quality filtering
- deduplication
- language identification
- document normalization
- domain weighting
- contamination checks against evals or downstream benchmarks
Each one changes the empirical distribution the model sees during pretraining.
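To make two of these steps concrete, here is a minimal sketch of exact-hash deduplication and heuristic quality filtering. The thresholds are illustrative assumptions, not values from any real pipeline; production systems typically add fuzzy deduplication (for example MinHash) and learned quality classifiers.

```python
import hashlib

def normalize(doc: str) -> str:
    # Collapse whitespace and lowercase so trivially different copies hash identically.
    return " ".join(doc.lower().split())

def looks_like_junk(doc: str, min_words: int = 50, min_alpha_ratio: float = 0.6) -> bool:
    # Two illustrative heuristics: drop very short documents and documents that are
    # mostly non-alphabetic characters (boilerplate, markup debris, link farms).
    words = doc.split()
    if len(words) < min_words:
        return True
    alpha_chars = sum(ch.isalpha() for ch in doc)
    return alpha_chars / max(len(doc), 1) < min_alpha_ratio

def curate(docs):
    # Exact-hash deduplication; real pipelines usually add near-duplicate removal on top.
    seen = set()
    for doc in docs:
        if looks_like_junk(doc):
            continue
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        yield doc
```

Filtering and deduplication at this stage directly change which documents, and in what proportions, the optimizer ever sees.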
That means curation controls things like:
- what styles are common
- what domains dominate
- how much redundancy exists
- how much junk the optimizer wastes steps on
In practice:
- aggressive deduplication can improve data efficiency and reduce memorization pressure
- domain reweighting can make a model much stronger on code, math, legal text, or multilingual tasks
- contamination mistakes can give you falsely optimistic benchmark results
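The contamination point can be approximated with n-gram overlap between training documents and evaluation items. Below is a simplified, illustrative sketch using exact 8-gram matching on whitespace-split words; real setups differ in n-gram length, normalization, and how flagged documents are handled.

```python
def word_ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_eval_index(eval_items, n: int = 8) -> set:
    # Union of all n-grams that occur anywhere in the benchmark.
    index = set()
    for item in eval_items:
        index |= word_ngrams(item, n)
    return index

def is_contaminated(doc: str, eval_index: set, n: int = 8) -> bool:
    # A training document sharing even one long n-gram with an eval item is flagged.
    return not word_ngrams(doc, n).isdisjoint(eval_index)
```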
The trade-off is clear: Broader corpora buy coverage, but they also introduce more inconsistency, more low-quality text, and more cleaning work.
A useful mental model is: The dataset is the environment the model grows up in. If the environment is noisy, repetitive, or badly balanced, scale alone does not fix that upbringing.
Use this lens when:
- Best fit: designing pretraining corpora, continued pretraining, or domain adaptation datasets.
- Misuse pattern: assuming scraping more tokens is equivalent to adding more knowledge.
Concept 2: Tokenization Changes Both What the Model Learns and What It Costs to Learn It
For example, two teams train on the same raw text. One tokenizer splits common words and code symbols into few tokens; the other fragments them into many small pieces. The second model needs more tokens to say the same thing and spends more compute per training example.
At a high level, the model never sees characters or "words" directly. It sees token IDs. So the tokenizer defines the alphabet of reusable building blocks the model gets to think in.
Mechanically: A tokenizer typically:
- normalizes text
- learns or applies a vocabulary
- segments text into subword units
- maps those units to token IDs
Common families include:
- BPE-style tokenization
- unigram language-model tokenization
- byte-level approaches
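To make the BPE family concrete, here is a small sketch that trains a toy byte-level BPE tokenizer with the Hugging Face tokenizers library (listed in the resources below). The corpus, vocabulary size, and special token are placeholder assumptions.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# A toy corpus standing in for a real document stream.
corpus = [
    "def tokenize(text): return text.split()",
    "Tokenization decides which patterns are cheap to represent.",
]

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=1000,                      # illustrative; real vocabularies are tens to hundreds of thousands
    special_tokens=["<|endoftext|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

enc = tokenizer.encode("def tokenize(text):")
print(enc.tokens)   # the learned subword pieces
print(enc.ids)      # the token IDs the model would actually see
```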
The tokenizer design affects:
- average sequence length
- vocabulary size
- efficiency on rare words
- multilingual sharing
- code handling
- whitespace and punctuation behavior
In practice:
- shorter tokenized sequences mean more usable context and cheaper training per document
- bad tokenization of code or non-English text can silently handicap the model (see the sketch after this list)
- larger vocabularies reduce fragmentation but increase embedding/output matrix size
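To see fragmentation directly, the sketch below counts tokens for the same strings under two vocabularies using the tiktoken library from the resources. The sample strings are arbitrary; what matters is how the token count per string changes with the vocabulary.

```python
import tiktoken

samples = {
    "english prose": "The dataset is the environment the model grows up in.",
    "python code": "def pack(ids, n):\n    return [ids[i:i+n] for i in range(0, len(ids), n)]",
    # German sentence meaning "Tokenization determines which patterns the model can represent cheaply."
    "non-english": "Tokenisierung bestimmt, welche Muster das Modell guenstig darstellen kann.",
}

for name in ("gpt2", "cl100k_base"):          # an older and a newer, larger vocabulary
    enc = tiktoken.get_encoding(name)
    for label, text in samples.items():
        ids = enc.encode(text)
        print(f"{name:12s} {label:14s} {len(ids):3d} tokens for {len(text)} characters")
```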
The trade-off is clear: Better compression of text into tokens can reduce sequence length, but vocabulary growth and language-specific bias create their own costs.
A useful mental model is: Tokenization is a lossy but structured compression scheme that decides which patterns the model can represent cheaply.
Use this lens when:
- Best fit: choosing or auditing a tokenizer for a new base model, multilingual model, or code model.
- Misuse pattern: treating tokenization as fixed infrastructure that can be copied blindly from another project with different data.
Concept 3: Data Curation and Tokenization Together Define the Real Training Budget
For example, a team says it has "five trillion words" of data, but after cleaning, deduplication, and tokenization the usable corpus is much smaller and the actual training budget is constrained by total tokens, sequence packing efficiency, and optimizer throughput.
At a high level, the training budget is not paid in abstract documents. It is paid in useful tokens that fit through the model efficiently.
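As a back-of-envelope illustration with assumed rates (roughly 1.3 tokens per English word and about half the raw text surviving cleaning and deduplication):

```python
raw_words       = 5e12   # headline corpus size in words (assumed)
tokens_per_word = 1.3    # rough English BPE ratio (assumed; varies by tokenizer and language)
survival_rate   = 0.5    # fraction surviving filtering and deduplication (assumed)

raw_tokens    = raw_words * tokens_per_word
usable_tokens = raw_tokens * survival_rate
print(f"raw tokens: {raw_tokens:.2e}, usable tokens: {usable_tokens:.2e}")
# about 6.5e12 raw vs 3.3e12 usable under these assumptions
```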
Mechanically: in practice, the pipeline looks less like "collect text and train" and more like:
- collect raw documents
- filter and normalize them
- remove junk and duplicates
- tokenize into IDs
- pack IDs into training sequences (sketched after this list)
- sample from mixtures during training
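Here is a minimal sketch of the packing step, assuming token IDs already exist: documents are concatenated with an end-of-document token and cut into fixed-length training sequences. The sequence length and separator ID are illustrative; one common alternative is to avoid splitting documents and pad instead, at the cost of wasted context.

```python
from typing import Iterable, List

def pack_sequences(docs_ids: Iterable[List[int]], seq_len: int = 2048, eod_id: int = 0) -> List[List[int]]:
    # seq_len and eod_id are illustrative; real runs use the model's context length
    # and the tokenizer's actual end-of-document token.
    buffer: List[int] = []
    sequences: List[List[int]] = []
    for ids in docs_ids:
        buffer.extend(ids)
        buffer.append(eod_id)                # mark the document boundary
        while len(buffer) >= seq_len:
            sequences.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    # The leftover tail is either dropped or padded; dropping it is the simplest choice here.
    return sequences
```

Even in this toy form, you can see where tokens and context window space get spent.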
This is where several hidden costs emerge:
- low-quality text still consumes tokens
- poor tokenization inflates sequence length
- bad packing wastes context window space
- badly mixed domains distort gradient allocation
In practice:
- a cleaner smaller corpus can outperform a dirtier bigger one
- "tokens seen" only becomes meaningful after curation and tokenizer choice
- throughput, memory pressure, and final quality are all downstream of these early decisions
The trade-off is clear: Investing in better corpus construction and tokenizer fit slows down the start of training, but usually buys far better data efficiency and more trustworthy model behavior later.
A useful mental model is: Architecture is the engine, but curation and tokenization determine the fuel and how efficiently it burns.
Use this lens when:
- Best fit: planning pretraining budgets, comparing corpora, or diagnosing why scaling runs are underperforming.
- Misuse pattern: jumping directly to distributed training tricks before checking whether the input stream is worth scaling.
Troubleshooting
Issue: "We already have a huge corpus, so data curation is probably fine."
Why it happens / is confusing: Raw size feels like evidence of coverage.
Clarification / Fix: Large corpora often contain duplication, boilerplate, spam, and skewed domain mixtures. Measure usable tokens, duplication rates, and source balance instead of trusting gross size.
Issue: "Tokenization is just a compression detail."
Why it happens / is confusing: Tokenizers are easy to treat as preprocessing utilities rather than part of model design.
Clarification / Fix: Inspect how your tokenizer handles code, math, non-English text, whitespace, and rare terms. If frequent patterns are fragmented badly, you are paying recurring compute and quality costs.
Issue: "Benchmark gains prove our corpus is better."
Why it happens / is confusing: Better scores can hide contamination or domain overspecialization.
Clarification / Fix: Separate true generalization from benchmark leakage. Check contamination, hold out realistic domains, and evaluate behavior outside the most represented sources.
Advanced Connections
Connection 1: Dataset Curation <-> Alignment
Alignment begins later, but pretraining data already shapes what the model considers normal, plausible, or common. Fine-tuning can redirect behavior, but it works on top of priors created upstream by the corpus.
Connection 2: Tokenization <-> Systems Cost
Tokenization is not only an NLP choice. It affects:
- tokens per example
- memory footprint
- effective context usage
- throughput during training and inference
That makes it a systems decision as much as a modeling one.
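A quick sanity check on one of these costs: embedding and output matrices scale linearly with vocabulary size, so vocabulary growth adds parameters and memory before any quality effect is considered. The hidden size below is an assumed round number.

```python
d_model = 4096                      # assumed hidden size
for vocab_size in (32_000, 64_000, 128_000):
    # Input embedding plus an untied output projection, counted in parameters.
    params = 2 * vocab_size * d_model
    print(f"vocab {vocab_size:>7,}: {params / 1e6:,.0f}M embedding/output parameters")
```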
Resources
Optional Deepening Resources
- [PAPER] The Pile: An 800GB Dataset of Diverse Text for Language Modeling
- Focus: Why corpus composition and diversity matter in large-scale language model training.
-
- Focus: A foundational reference for modern subword tokenization design.
- [DOC] Hugging Face Tokenizers Documentation
- Focus: Practical tokenizer design, training, and inspection workflows.
- [DOC] tiktoken Repository
- Focus: A real production tokenizer implementation optimized for LLM workloads.
Key Insights
- Curating data is part of model design - The corpus defines what the model is exposed to, repeats, and prioritizes.
- Tokenization is part of model efficiency - The units you choose determine sequence length, compute cost, and how patterns are represented.
- Scaling only helps if the input stream deserves scaling - Better distributed training cannot rescue badly curated or badly tokenized data.