Day 135: Sequence-to-Sequence Models
Sequence-to-sequence models matter because some tasks do not ask for one label; they ask you to read one sequence and generate another.
Today's "Aha!" Moment
So far, the recurrent story has mostly been about consuming a sequence and producing some summary or stepwise interpretation. But many important tasks do something harder: they transform one sequence into another.
Translation is the classic example. A sentence in French comes in, a sentence in English comes out. But the same pattern appears in many other places: summarization, converting a speech signal into text, rewriting user commands into structured actions, or turning a stream of events into a predicted next-step sequence.
Sequence-to-sequence models solve this by splitting the job in two. An encoder reads the input sequence and turns it into an internal representation. A decoder then generates the output sequence step by step from that representation.
That is the aha. A seq2seq model is not just "an RNN that keeps going." It is an architecture for sequence transformation: understand first, generate second.
Why This Matters
Imagine the warehouse platform now wants to convert raw scanner event codes into short human-readable fault descriptions for operators. The input is a sequence of low-level events. The output is another sequence, this time natural-language tokens. A single label is not enough, and the output length may not match the input length.
That is exactly the kind of problem seq2seq models were built for. They give you a way to map variable-length input to variable-length output while preserving order and context.
This matters historically because encoder-decoder models were the bridge from simple recurrent classification to full neural sequence generation. They also made one important limitation painfully visible: compressing an entire input sequence into one fixed context vector is often too restrictive. That bottleneck is a big part of why attention became so important later.
Learning Objectives
By the end of this session, you will be able to:
- Explain what problem seq2seq models solve - Articulate why sequence transformation is different from sequence classification.
- Describe the encoder-decoder workflow - Trace how input encoding and output generation fit together.
- Recognize the classic bottleneck of early seq2seq models - Explain why a single fixed context vector can become limiting.
Core Concepts Explained
Concept 1: Seq2Seq Models Map Variable-Length Input Sequences to Variable-Length Output Sequences
A standard classifier reads an input and predicts one label. A seq2seq model reads a sequence and produces another sequence. That means the model needs to support:
- variable input length
- variable output length
- order-sensitive context on both sides
You can picture the high-level task like this:
input sequence -> internal representation -> output sequence
Examples:
- translation: source sentence -> target sentence
- summarization: document -> shorter summary
- speech recognition: audio frames -> token sequence
- structured prediction: event sequence -> action sequence
The key conceptual difference is that the model is not choosing from a fixed output set. It is generating the output one step at a time.
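As a toy illustration of that difference, consider the warehouse scenario: a few events in, a differently sized phrase out. The event codes and phrases below are hypothetical, and the lookup table is a stand-in for what a real seq2seq model would learn:

```python
# Hypothetical event codes and fault phrases; a real seq2seq model would
# *learn* this mapping instead of looking it up.
def fault_description(events):
    table = {
        ("E17", "E17", "E03"): ["jam", "detected", "on", "belt"],
        ("E42",): ["scanner", "offline"],
    }
    return table[tuple(events)]

print(fault_description(["E17", "E17", "E03"]))  # 3 events in, 4 tokens out
print(fault_description(["E42"]))                # 1 event in, 2 tokens out
```

The point of the sketch is only the shape of the problem: input length and output length are decoupled, and the output is an ordered sequence rather than a class label.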
Concept 2: The Encoder Reads, the Decoder Writes
The classic seq2seq architecture has two main parts.
The encoder consumes the input sequence and updates its hidden state as it goes. By the end, it produces a context representation intended to summarize the input.
The decoder starts from that context and generates output tokens step by step. At each step, it uses:
- the current decoder state
- the previously generated token (or, during training, the previous ground-truth token)
- the context coming from the encoder
That workflow looks like this:
input:  x1 -> x2 -> x3 -> x4
         |    |    |    |
        [ encoder recurrent states ]
                  |
           context vector
                  |
target: y1 -> y2 -> y3 -> y4
        [ decoder generates one token at a time ]
The decoder stops when it emits a special end-of-sequence token.
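The encoder-read, decoder-write workflow can be sketched in PyTorch. This is a minimal greedy-decoding sketch, not a reference implementation: the vocabulary sizes, hidden width, and the SOS/EOS token ids are all toy assumptions, and the networks here are untrained.

```python
import torch
import torch.nn as nn

VOCAB_IN, VOCAB_OUT, HID = 20, 15, 32  # toy sizes (assumptions)
SOS, EOS = 0, 1                        # assumed special token ids

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB_IN, HID)
        self.rnn = nn.GRU(HID, HID, batch_first=True)
    def forward(self, x):                  # x: (batch, src_len)
        _, h = self.rnn(self.emb(x))       # h: (1, batch, HID), final state
        return h                           # this *is* the fixed context

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB_OUT, HID)
        self.rnn = nn.GRU(HID, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB_OUT)
    def forward(self, y_prev, h):          # one step: previous token + state
        o, h = self.rnn(self.emb(y_prev), h)
        return self.out(o), h              # logits for the next token

def greedy_decode(enc, dec, src, max_len=10):
    h = enc(src)                           # encode once
    y = torch.full((src.size(0), 1), SOS)  # start from SOS
    tokens = []
    for _ in range(max_len):
        logits, h = dec(y, h)
        y = logits.argmax(-1)              # feed back own prediction
        tokens.append(y)
        if (y == EOS).all():               # stop at end-of-sequence
            break
    return torch.cat(tokens, dim=1)

src = torch.randint(2, VOCAB_IN, (1, 6))   # one source sequence, length 6
out = greedy_decode(Encoder(), Decoder(), src)
print(out.shape)  # output length is decided by the decoder, not the input
```

Notice that the decoder's loop length is governed by its own EOS prediction (or a safety cap), which is exactly how output length becomes independent of input length.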
This architecture is elegant because it separates two jobs clearly:
- encoding: build a useful representation of the source
- decoding: turn that representation into the target sequence
That separation also made seq2seq models widely reusable across tasks with different input and output modalities.
Concept 3: The Fixed-Context Bottleneck Is the Main Weakness of Early Seq2Seq
The cleanest version of encoder-decoder seq2seq compresses the whole input into one context vector. That works surprisingly well for shorter or simpler sequences, but it becomes a bottleneck as the input grows longer or more complex.
Why? Because the decoder is asked to generate an entire, detailed output sequence from a single compressed summary.
long input sequence
-> one fixed vector
-> many output steps depend on that one summary
This creates a natural failure mode: the model may capture the broad idea of the input but lose details needed for later output steps.
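You can see the bottleneck directly in code: no matter how long the input is, a recurrent encoder's final state has the same fixed size. A small sketch (the feature and hidden dimensions here are arbitrary):

```python
import torch
import torch.nn as nn

HID = 32
rnn = nn.GRU(8, HID, batch_first=True)    # a stand-in encoder

short_seq = torch.randn(1, 5, 8)          # 5-step input
long_seq = torch.randn(1, 500, 8)         # 500-step input

_, h_short = rnn(short_seq)
_, h_long = rnn(long_seq)
print(h_short.shape, h_long.shape)        # identical: same-sized summary
```

A 500-step input and a 5-step input both end up in a vector of the same size, which is why details needed for later output steps can get squeezed out.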
That is why early seq2seq models are such an important teaching point. They solve a real problem elegantly, but they also make a structural limitation visible. Once you see that limitation, the motivation for attention becomes obvious: instead of forcing the decoder to rely only on one fixed context, let it look back at encoder states selectively.
A second practical detail worth noticing is teacher forcing during training. The decoder is often trained with the true previous target token as input at each step, even though at inference time it must consume its own previous predictions. That train/inference mismatch, often called exposure bias, is another important part of how seq2seq systems behave.
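Teacher forcing can be sketched as follows. The sizes are toy assumptions, and the zero initial state stands in for an encoder context. Because the gold target tokens are known during training, all decoder steps can be computed in parallel:

```python
import torch
import torch.nn as nn

VOCAB, HID = 15, 32                       # toy sizes (assumptions)
emb = nn.Embedding(VOCAB, HID)
rnn = nn.GRU(HID, HID, batch_first=True)
proj = nn.Linear(HID, VOCAB)

tgt = torch.randint(2, VOCAB, (4, 7))     # (batch, tgt_len) gold tokens
h0 = torch.zeros(1, 4, HID)               # stand-in for encoder context

# Teacher forcing: feed the *true* previous token at every step,
# then predict the next gold token.
dec_in = tgt[:, :-1]                      # tokens 0..L-2 as inputs
out, _ = rnn(emb(dec_in), h0)
logits = proj(out)                        # predictions for tokens 1..L-1
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), tgt[:, 1:].reshape(-1))
loss.backward()                           # one parallel training step
```

At inference time this shortcut is unavailable: the decoder must run step by step, feeding its own argmax (or sampled) token back in, which is where prediction errors can compound.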
Troubleshooting
Issue: Treating seq2seq as if it were just sequence classification with a longer output.
Why it happens / is confusing: Both consume sequences and use recurrent components.
Clarification / Fix: Seq2seq must model the output sequence autoregressively, one token at a time, rather than choosing a single label.
Issue: Assuming the encoder's final state can perfectly summarize any input length.
Why it happens / is confusing: The architecture diagram makes the single context vector look authoritative.
Clarification / Fix: That fixed summary is exactly the classical bottleneck. Longer or information-dense sequences make it harder for one vector to carry everything.
Issue: Confusing training behavior with inference behavior.
Why it happens / is confusing: During training, the decoder often gets the correct previous token; during inference, it gets its own predicted token.
Clarification / Fix: Remember teacher forcing. A decoder can look good in training while still compounding errors at generation time.
Issue: Thinking seq2seq requires the input and output lengths to match.
Why it happens / is confusing: Many toy examples use similarly sized sequences.
Clarification / Fix: One of the key strengths of seq2seq is that the input and output lengths can differ.
Advanced Connections
Connection 1: Seq2Seq ↔ Autoregressive Generation
The parallel: The decoder generates the target one step at a time, always conditioning on what has been produced so far.
Real-world case: This same autoregressive idea later appears everywhere from language models to speech decoders.
Connection 2: Seq2Seq ↔ Attention Motivation
The parallel: The fixed-context bottleneck of encoder-decoder RNNs is one of the clearest motivations for attention mechanisms.
Real-world case: Attention did not appear out of nowhere; it was an answer to the weakness of early seq2seq compression.
Resources
Optional Deepening Resources
- [DOCS] PyTorch Translation with a Seq2Seq Network and Attention
- Link: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
- Focus: See encoder, decoder, and later attention in one practical tutorial.
- [COURSE] CS231n: Recurrent Neural Networks
- Link: https://cs231n.github.io/rnn/
- Focus: Review the sequence-modeling setup that leads naturally into encoder-decoder architectures.
- [PAPER] Sequence to Sequence Learning with Neural Networks
- Link: https://arxiv.org/abs/1409.3215
- Focus: Read the classic encoder-decoder seq2seq paper.
- [PAPER] Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
- Link: https://arxiv.org/abs/1406.1078
- Focus: See an early encoder-decoder formulation for translation.
Key Insights
- Seq2seq models are built for sequence transformation - They map one ordered sequence into another, often with different lengths.
- Encoder and decoder play different roles - One summarizes the source; the other generates the target autoregressively.
- The fixed context vector is the classic bottleneck - It makes early seq2seq elegant but limited, especially for longer inputs.
Knowledge Check (Test Questions)
1. What problem are seq2seq models designed to solve?
- A) Predicting one label from one input vector.
- B) Transforming one variable-length sequence into another variable-length sequence.
- C) Removing the need for recurrence entirely.
2. What is the role of the decoder in a classic seq2seq model?
- A) To read the whole source sequence again from scratch.
- B) To generate the target sequence one step at a time using context and previous outputs.
- C) To replace the loss function.
3. Why was attention such a natural next step after early seq2seq models?
- A) Because the decoder relying on one fixed context vector creates a bottleneck for long or detailed inputs.
- B) Because encoder-decoder models cannot use variable-length outputs.
- C) Because RNNs cannot share parameters across time.
Answers
1. B: Seq2seq models are for mapping one ordered sequence into another, not just assigning a single class.
2. B: The decoder generates the target autoregressively from context and prior generated information.
3. A: Attention is motivated directly by the weakness of compressing the entire source into one fixed vector.