
Day 136: Text Generation with RNNs

Text generation with RNNs matters because it shows what it means for a model to produce a sequence one token at a time while conditioning on its own past outputs.


Today's "Aha!" Moment

The previous lessons explained how recurrent models keep state over time and how encoder-decoder systems transform one sequence into another. Text generation makes the autoregressive idea fully explicit.

At every step, the model receives a prefix and predicts a probability distribution for the next token. Then one token is chosen, added to the prefix, and the process repeats. That means generation is not just classification repeated many times. It is a feedback loop in which the model's own output becomes part of the next input context.

This is why text generation is so revealing. During training, the model often sees the correct previous token. During inference, it must live with its own previous decisions. One slightly bad token can shift the context, which can then distort later predictions too.

That is the aha. Text generation is where recurrence, memory, probability, and error accumulation all meet in one concrete process.


Why This Matters

Suppose the warehouse platform wants a small assistant that turns short fault-code sequences into operator-friendly status messages. Even in a simple setup, the model is not choosing one label from a menu. It is generating a sentence token by token.

That means you care about things that do not appear in ordinary classification:

  • How an error at one step changes the context for every later step.
  • Which decoding policy to use at inference time, such as greedy decoding or temperature sampling.
  • Whether the output stays coherent across many tokens, rather than being judged once.

This is why text generation with RNNs is such a useful teaching case. It exposes the core mechanics of autoregressive generation in a form that is easier to grasp than later large language models.


Learning Objectives

By the end of this session, you will be able to:

  1. Explain how autoregressive generation works - Understand next-token prediction as the core loop.
  2. Distinguish training from inference in generation - Recognize why teacher forcing and free-running generation behave differently.
  3. Reason about output quality and sampling - Understand repetition, drift, and diversity as consequences of the generation loop.

Core Concepts Explained

Concept 1: An RNN Language Model Learns Next-Token Prediction

The simplest text-generation setup uses a recurrent model as a language model. Given a sequence prefix, the model predicts the next token.

If the observed sequence is:

the package is damaged

the model is trained on shifted input-target pairs like:

input:  the             -> target: package
input:  the package     -> target: is
input:  the package is  -> target: damaged

In practice, this is done in parallel across time steps, but conceptually it is still "use the previous context to predict the next token."
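The shifted input-target pairs above can be sketched in a few lines of Python. This is a minimal illustration, assuming a simple whitespace tokenizer; a real pipeline would map tokens to integer IDs and batch them.

```python
# Sketch: building next-token training pairs from a tokenized sentence.
# The whitespace tokenizer is an illustrative assumption.

def make_pairs(tokens):
    """Return (prefix, target) pairs for next-token prediction."""
    pairs = []
    for i in range(1, len(tokens)):
        prefix = tokens[:i]   # everything seen so far
        target = tokens[i]    # the token to predict
        pairs.append((prefix, target))
    return pairs

tokens = "the package is damaged".split()
for prefix, target in make_pairs(tokens):
    print(" ".join(prefix), "->", target)
```

Each pair is one training example: the model conditions on the prefix and is scored on the target token.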

The RNN hidden state is what lets the model carry forward the prefix information. At each step, the network updates its internal state and emits logits over the vocabulary for the next token.

This is the key foundation for generation. If the model learns a good next-token distribution, you can sample from it repeatedly to produce new text.

Concept 2: Generation Is a Feedback Loop, Not Just Repeated Prediction

At inference time, the model starts with an initial token or prompt, predicts the next-token distribution, chooses a token, feeds that token back in, and repeats.

start token
   -> predict next-token distribution
   -> choose one token
   -> append it
   -> use it as part of the next input
   -> repeat

This loop is what makes generation feel creative, but it is also what makes it fragile. A classifier predicts once. A generator predicts, consumes its own prediction, and then must continue from that new state.

That means mistakes compound. If the model picks an odd token early, the hidden state evolves under that altered context, and later predictions may drift further away.

This is also where training and inference diverge. During training, teacher forcing usually feeds the true previous token to the model. During inference, the model has to consume its own predicted tokens instead. So the generation environment is harsher than the training environment.
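The free-running loop described above can be sketched as follows. The "model" here is a toy bigram table standing in for an RNN's next-token distribution (the table entries are made up for illustration); in a real system this call would update a hidden state and return logits.

```python
import random

# Toy stand-in for an RNN language model: a hand-written bigram table.
# The vocabulary and probabilities below are illustrative assumptions.
BIGRAMS = {
    "<s>": {"the": 1.0},
    "the": {"package": 0.7, "belt": 0.3},
    "package": {"is": 1.0},
    "belt": {"is": 1.0},
    "is": {"damaged": 0.5, "jammed": 0.5},
}

def next_token_dist(prev_token):
    # Unknown contexts end the sequence.
    return BIGRAMS.get(prev_token, {"<eos>": 1.0})

def generate(max_len=10, seed=0):
    rng = random.Random(seed)
    tokens = []
    prev = "<s>"
    for _ in range(max_len):
        dist = next_token_dist(prev)          # predict next-token distribution
        choices, weights = zip(*dist.items())
        tok = rng.choices(choices, weights=weights)[0]  # choose one token
        if tok == "<eos>":
            break
        tokens.append(tok)                    # append it...
        prev = tok                            # ...and feed it back as input
    return tokens

print(" ".join(generate()))
```

Note that the model's own choice (`package` vs `belt`) determines which context all later predictions see, which is exactly the feedback loop the text describes.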

Concept 3: Sampling Strategy Changes the Personality of the Output

The model does not directly emit one token. It emits a probability distribution over the vocabulary. How you choose the next token from that distribution matters.

Three common options:

  • Greedy decoding - always take the highest-probability token.
  • Pure sampling - draw the next token from the model's predicted distribution as-is.
  • Temperature-scaled sampling - divide the logits by a temperature before the softmax, then sample; this sharpens or flattens the distribution.

The intuition is:

low temperature  -> safer, more repetitive, more deterministic
high temperature -> more diverse, more risky, more chaotic

Greedy decoding can make text dull or repetitive because it always chooses the locally safest token. Sampling can make text more varied, but too much randomness can make it incoherent.

This is one of the most important lessons from text generation. Output quality is not determined only by the model weights. It is also affected by the decoding policy used at inference time.

For RNN generators, this becomes especially visible because local mistakes or repetitive loops can quickly dominate the sequence. Sampling is not just decoration; it is part of the system behavior.
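A minimal sketch of greedy decoding versus temperature-scaled sampling, assuming made-up logit values over a three-token vocabulary:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits, vocab):
    """Always pick the highest-probability token."""
    return vocab[logits.index(max(logits))]

def sample(logits, vocab, temperature=1.0, rng=random):
    """Draw one token from the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs)[0]

# Illustrative logits; in a real model these come from the RNN output layer.
vocab = ["damaged", "jammed", "fine"]
logits = [2.0, 1.5, 0.2]

print(greedy(logits, vocab))              # argmax token: "damaged"
print(softmax(logits, temperature=0.5))   # sharper: mass concentrates on argmax
print(softmax(logits, temperature=2.0))   # flatter: more diversity, more risk
```

Comparing the two printed distributions makes the low/high temperature intuition above concrete: the same weights yield very different decoding behavior.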

Troubleshooting

Issue: The generated text starts fine but quickly becomes repetitive or nonsensical.

Why it happens / is confusing: The model may be locally reasonable at next-token prediction, but once it consumes its own outputs, small errors can compound.

Clarification / Fix: Remember that generation is a feedback loop. Check sampling strategy, model capacity, and whether the training setup creates too much train/inference mismatch.

Issue: Greedy decoding produces bland text.

Why it happens / is confusing: Choosing the most likely token feels like the safest strategy.

Clarification / Fix: Greedy decoding is often too conservative. Controlled sampling can improve diversity when the model distribution is sensible.

Issue: High-temperature sampling produces chaotic output.

Why it happens / is confusing: More diversity can sound attractive, especially in demos.

Clarification / Fix: Higher temperature flattens the distribution and increases the chance of low-probability tokens. Diversity and coherence trade off against each other.

Issue: Training loss looks good, but free-running generation still feels weak.

Why it happens / is confusing: Teacher forcing makes the training task easier than inference.

Clarification / Fix: Good next-token training does not automatically imply stable long free-running generations.


Advanced Connections

Connection 1: Text Generation ↔ Autoregressive Modeling

The parallel: The generator always predicts the next token conditioned on a prefix, which is the same high-level idea used by later language models.

Real-world case: Understanding this loop is one of the clearest ways to understand what "autoregressive" really means.

Connection 2: Text Generation ↔ Exposure Bias

The parallel: The difference between teacher-forced training and self-fed inference is an early example of exposure bias.

Real-world case: This issue appears broadly in sequence generation and is one reason decoding behavior matters so much.




Key Insights

  1. Text generation with RNNs is next-token prediction in a loop - The model predicts one token, consumes it, and continues.
  2. Inference is harder than training - Teacher forcing hides part of the instability that appears when the model feeds on its own outputs.
  3. Decoding policy changes behavior - Sampling and temperature directly affect repetition, diversity, and coherence.

Knowledge Check (Test Questions)

  1. What is the core training objective of an RNN language model for text generation?

    • A) Predict the next token given the prefix seen so far.
    • B) Predict the full sequence in one matrix multiplication.
    • C) Remove the need for hidden state.
  2. Why can free-running generation degrade even when training loss looks reasonable?

    • A) Because inference requires the model to condition on its own past predictions, which can accumulate errors.
    • B) Because RNNs cannot output probabilities.
    • C) Because text generation does not use recurrence.
  3. What is the effect of increasing sampling temperature?

    • A) It makes the distribution sharper and more deterministic.
    • B) It flattens the distribution, increasing diversity and also riskier token choices.
    • C) It removes low-probability tokens completely.

Answers

1. A: The model learns to predict the next token from the previous context.

2. A: Inference is a self-feeding process, so local mistakes can distort later context.

3. B: Higher temperature makes the model more exploratory, which can improve diversity but also reduce coherence.


