Day 133: RNN Fundamentals
RNNs matter because some data is not just a bag of features; it is a sequence where meaning depends on what came before.
Today's "Aha!" Moment
The CNN lessons worked well because images can often be processed as spatial structures available all at once. A sequence is different. If you read the word "bank" in a sentence, its meaning depends on earlier words. If you monitor a machine sensor, the current reading matters differently depending on the recent trend. If you generate text, the next token depends on the whole prefix, not just the current character.
A recurrent neural network addresses that by reusing the same computation at each time step while carrying forward a hidden state. Instead of processing every position independently, the model repeatedly updates an internal summary of what it has seen so far.
That is the core shift. A recurrent model is not just looking at the current input. It is combining the current input with a memory-like state inherited from the previous step.
That is the aha. RNNs are the first major deep-learning architecture where state across time is the point, not an implementation detail.
Why This Matters
Imagine the warehouse system now wants to classify short event sequences from a scanner line: for example, sensor readings that indicate whether a package is moving normally, slipping, or jamming. A single measurement is often not enough. What matters is the pattern over time.
You could flatten the whole sequence into one large vector, but that loses the nice idea that the same type of local decision happens at each time step. You would also have trouble with variable-length sequences. A recurrent model says something more natural: reuse the same update rule for each new event and let the hidden state carry forward what matters from the past.
This is why RNNs were so important historically. They made sequential modeling feel structurally similar to the data itself: process one step, update memory, move to the next. Even though newer architectures now dominate many sequence tasks, RNNs are still the clearest entry point for understanding neural sequence models.
Learning Objectives
By the end of this session, you will be able to:
- Explain why recurrent networks exist: understand what sequences demand that feedforward models handle awkwardly.
- Describe the RNN update mechanism: see how the input and hidden state combine at each time step.
- Recognize both the usefulness and the limits of vanilla RNNs, especially their difficulty with long-range dependencies.
Core Concepts Explained
Concept 1: An RNN Reuses the Same Cell at Every Time Step
The basic RNN idea is simple: use one recurrent cell and apply it repeatedly across the sequence.
At time step t, the cell receives:
- the current input x_t
- the previous hidden state h_(t-1)
It then produces:
- a new hidden state h_t
- optionally, an output y_t
You can picture it like this:
h0, x1 -> [RNN cell] -> h1
h1, x2 -> [RNN cell] -> h2
h2, x3 -> [RNN cell] -> h3
If you unroll it over time, you see the same parameters reused again and again. That is the sequential analogue of weight sharing in convolution, except now the sharing is across time steps rather than spatial positions.
This matters because it makes the model naturally handle sequences of different lengths while using one consistent update rule. The network learns how to update memory, not just how to classify one fixed-size input.
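The reuse of one cell across time can be sketched in a few lines of numpy. This is a minimal illustration, not a trainable implementation; the sizes, random initialization, and function name are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4  # illustrative sizes

# One set of parameters, shared across every time step.
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

def rnn_forward(xs):
    """Apply the same cell at every step; return the final hidden state."""
    h = np.zeros(hidden_size)            # h_0: empty memory
    for x_t in xs:                       # xs has shape (T, input_size)
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h

# The same cell handles sequences of different lengths.
short_seq = rng.normal(size=(2, input_size))
long_seq = rng.normal(size=(7, input_size))
print(rnn_forward(short_seq).shape, rnn_forward(long_seq).shape)  # (4,) (4,)
```

Note that the loop body never changes with the sequence length: that is exactly the weight sharing across time described above.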
Concept 2: The Hidden State Is a Running Summary of the Past
The hidden state is the core mechanism of a vanilla RNN. It acts like a compressed summary of what the model thinks still matters from earlier steps.
A common simplified update looks like this:
h_t = tanh(W_x x_t + W_h h_(t-1) + b)
This equation says:
- transform the current input
- transform the previous hidden state
- combine them
- squash the result into the new hidden state
That is the whole recurrent idea in one line. The current interpretation depends on both the present observation and the remembered past.
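The update equation can be checked term by term in a short numpy sketch; the sizes and values here are arbitrary, chosen only to show the shapes.

```python
import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size = 3, 4  # illustrative sizes
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

x_t = rng.normal(size=input_size)   # current input
h_prev = np.zeros(hidden_size)      # previous hidden state h_(t-1)

# h_t = tanh(W_x x_t + W_h h_(t-1) + b), piece by piece:
input_part = W_x @ x_t              # transform the current input
memory_part = W_h @ h_prev          # transform the previous hidden state
h_t = np.tanh(input_part + memory_part + b)  # combine, then squash

print(h_t)  # every entry lies in (-1, 1) because of tanh
```

The tanh squashing keeps each hidden unit bounded, which is one reason the fixed-size hidden vector acts as a compressed summary rather than an ever-growing log.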
For a sentence, the hidden state might gradually accumulate syntactic or semantic context. For time series, it might track trend, momentum, or phase. For sequence labeling, it might keep enough context to decide how to interpret the next token.
The trade-off is compression. The hidden state is useful because it carries history forward, but it is also a bottleneck. A fixed-size vector has to summarize everything the model wants to keep.
Concept 3: Vanilla RNNs Struggle When Important Information Must Survive for Many Steps
The elegant part of a basic RNN is also its weakness. At every time step, the hidden state is updated again. That means old information has to survive repeated transformations if it is going to remain useful later.
In short sequences, that can work fine. In longer sequences, it becomes difficult for training to preserve the right information across many steps. This is closely tied to vanishing and exploding gradients during backpropagation through time.
important fact at step 3
-> many recurrent updates
-> by step 50 it may be diluted or unstable
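The dilution effect can be made concrete with a simplified, purely linear recurrence. In that simplification the gradient of the final hidden state with respect to an early one involves repeated multiplication by the recurrent matrix, so its norm shrinks geometrically when that matrix's largest singular value is below 1 (the real tanh case is even more attenuated, since the tanh derivative is at most 1). The scaling factor 0.9 here is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)
hidden_size = 4

# Scale the recurrent matrix so its largest singular value is 0.9.
W_h = rng.normal(size=(hidden_size, hidden_size))
W_h *= 0.9 / np.linalg.svd(W_h, compute_uv=False)[0]

# In a linear recurrence h_t = W_h h_(t-1), the sensitivity of h_T to h_t
# is W_h^(T - t). Watch its spectral norm decay as the gap grows.
grad = np.eye(hidden_size)
for steps in range(1, 51):
    grad = W_h @ grad
    if steps in (1, 10, 50):
        print(steps, np.linalg.norm(grad, 2))
```

By 50 steps the norm is tiny, which is the vanishing-gradient picture: the learning signal from step 3 barely reaches the parameters by step 50. With a largest singular value above 1, the same product explodes instead.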
This is why vanilla RNNs are historically important but not the end of the story. They reveal the right architecture idea, shared computation plus state across time, while also exposing the need for better memory control. That need is exactly what motivates LSTMs and GRUs in the next lesson.
So the honest mental model is:
- RNNs are a natural first sequence model
- they are good for understanding recurrence
- their main practical limit is long-range dependency handling
Troubleshooting
Issue: Thinking an RNN is just a feedforward network applied many times.
Why it happens / is confusing: The same cell is reused repeatedly, so it can look like simple repetition.
Clarification / Fix: The crucial difference is the hidden state. Each step depends not only on the current input, but also on the evolving summary of earlier steps.
Issue: Assuming the hidden state stores the full past perfectly.
Why it happens / is confusing: The language of "memory" makes it sound like exact retention.
Clarification / Fix: The hidden state is a compressed summary, not a perfect log. Important information can fade or be overwritten.
Issue: Confusing sequence length handling with true long-range reasoning.
Why it happens / is confusing: RNNs can process variable-length inputs, so it feels like they should automatically handle long dependencies well.
Clarification / Fix: Variable-length processing is easy for RNNs. Reliably preserving useful information over many steps is not.
Issue: Forgetting that recurrence changes the training story too.
Why it happens / is confusing: The forward pass is easy to understand, so the optimization difficulty can be missed.
Clarification / Fix: Training requires backpropagation through time, which is exactly where vanishing and exploding gradient problems appear.
Advanced Connections
Connection 1: RNNs ↔ State Machines
The parallel: An RNN can be read as a learned state machine whose state update is parameterized rather than handwritten.
Real-world case: This perspective is helpful for reasoning about why sequences with hidden context need something more than independent per-step classification.
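To make the state-machine reading concrete, here is a handwritten state machine for the warehouse scenario from earlier: flag a "jam" after three consecutive high readings. The threshold, states, and labels are invented for illustration; the point is that an RNN learns a parameterized version of exactly this kind of update instead of having it handcoded.

```python
def step(state, reading, threshold=0.8):
    """Count consecutive high readings, capped at 3; reset on a low one."""
    return min(state + 1, 3) if reading > threshold else 0

def classify(readings):
    state = 0                  # analogous to the RNN's initial hidden state
    for r in readings:         # the same update rule, reused at every step
        state = step(state, r)
    return "jam" if state == 3 else "normal"

print(classify([0.2, 0.9, 0.9, 0.9]))  # jam
print(classify([0.9, 0.1, 0.9, 0.1]))  # normal
```

No single reading decides the label; only the carried state does, which is why independent per-step classification cannot express this rule.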
Connection 2: RNNs ↔ Sequence Modeling More Broadly
The parallel: Even if transformers dominate many modern tasks, they solve the same high-level problem: representing history so the current step can depend on it.
Real-world case: Understanding RNNs makes later sequence architectures easier to understand because it isolates the core question: how is past context represented and updated?
Resources
Optional Deepening Resources
- [DOCS] PyTorch RNN
- Link: https://pytorch.org/docs/stable/generated/torch.nn.RNN.html
- Focus: See the basic recurrent API and tensor shapes used in practice.
- [COURSE] CS231n: Recurrent Neural Networks
- Link: https://cs231n.github.io/rnn/
- Focus: Read an intuition-first explanation of hidden state and unrolling through time.
- [BOOK] Dive into Deep Learning: Recurrent Neural Networks
- Link: https://d2l.ai/chapter_recurrent-neural-networks/index.html
- Focus: Connect the recurrence equations to training and sequence tasks.
- [PAPER] Learning Long-Term Dependencies with Gradient Descent is Difficult
- Link: https://www.iro.umontreal.ca/~lisa/pointeurs/ieeetnn94.pdf
- Focus: Read a classic reference on recurrent learning and temporal credit assignment.
Key Insights
- RNNs process sequences by carrying state across time - The current step depends on both the current input and remembered past context.
- The hidden state is useful but limited - It is a compressed summary, not a perfect memory.
- Vanilla RNNs expose the right idea and the main problem - They make sequence modeling natural but struggle with long-range dependencies.
Knowledge Check (Test Questions)
1. What makes an RNN different from processing each sequence element independently?
- A) It uses the same hidden state update across time, so each step can depend on previous context.
- B) It removes the need for weights.
- C) It only works for fixed-length sequences.
2. What is the role of the hidden state in a vanilla RNN?
- A) To store the full past exactly as a log of all previous inputs.
- B) To act as a running summary of what the model wants to carry forward from earlier steps.
- C) To replace the output layer entirely.
3. Why are vanilla RNNs limited on long sequences?
- A) Because recurrence cannot process more than ten time steps.
- B) Because information and gradients can become hard to preserve across many recurrent updates.
- C) Because they cannot share parameters across time.
Answers
1. A: Recurrence means each step uses both the present input and an evolving summary of the past.
2. B: The hidden state is a compressed memory-like representation, not a perfect transcript of history.
3. B: Long-range dependencies are difficult because repeated updates make information and gradients harder to preserve.