Sequence Modeling

So far, vanilla neural networks usually assume fixed-size input and fixed-size output.

But many real tasks are naturally in the form of sequence:

image $\to$ sequence of words
sequence of video frames $\to$ one action class
sequence of video frames $\to$ sequence of words
sequence $\to$ sequence with the same length

Common sequence modeling forms:

one to many: image captioning
many to one: action classification from video
many to many: video captioning / frame-level classification

The key difference is that the sequence length can be variable.

Recurrent Neural Networks

High-level idea: RNN has an internal state / hidden state that is updated when the sequence is processed.

So at time step $t$, the new hidden state depends on:

previous hidden state $h_{t-1}$
current input $x_t$

A general formulation is: $$ h_t = f_h(W_{hh} h_{t-1} + W_{xh} x_t) $$ $$ y_t = f_y(W_{hy} h_t) $$

where:

$h_t$ is the hidden state at time step $t$
$x_t$ is the input at time step $t$
$y_t$ is the output at time step $t$
$W_{hh}$ transforms previous hidden state to current hidden state
$W_{xh}$ transforms current input to hidden state space
$W_{hy}$ transforms hidden state to output space

Important property:

the same function is used at every time step
the same weights are reused at every time step

This is why RNN can process variable-length input without increasing model size.

Unrolled RNN

The compact recurrent diagram is sometimes not easy to reason about. So we often unroll the RNN across time.

Then the dependency becomes explicit:

$h_1$ depends on $h_0$ and $x_1$
$h_2$ depends on $h_1$ and $x_2$
$h_3$ depends on $h_2$ and $x_3$
…

This view is especially useful for understanding forward pass and backpropagation through time.

Vanilla RNN

A standard vanilla RNN often uses tanh: $$ h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t) $$ $$ y_t = W_{hy} h_t $$

Why tanh is commonly used:

it is zero-centered
it can represent both positive and negative values
it is bounded in $[-1, 1]$

But this bounded nonlinearity also contributes to vanishing gradients later.

toy example: manually build an RNN to detect repeated 1s in a binary sequence.

Example task:

input: $0,1,0,1,1,1,0,1,1,\dots$
output: give $1$ when there are two consecutive 1s, otherwise output $0$

This is a many to many problem.

What should the hidden state remember?

To decide whether the current output should be 1, the model mainly needs:

the current input
the previous input

So the hidden state can store: $$ h_t = (\text{Current}, \text{Previous}, 1) $$

The constant 1 is added to make output computation easier.

We initialize: $h_0 = (0,0,1)$ For simplicity, the lecture uses ReLU instead of tanh in this toy construction.

Hand-crafted weights

The hidden state update is: $$ h_t = \operatorname{ReLU}(W_{hh} h_{t-1} + W_{xh} x_t) $$ $$ y_t = \operatorname{ReLU}(W_{hy} h_t) $$

A valid construction is: $$ W_{xh} = \begin{bmatrix}1\0\0\end{bmatrix}, \qquad W_{hh} = \begin{bmatrix} 0 & 0 & 0 \ 1 & 0 & 0 \ 0 & 0 & 1 \end{bmatrix}, \qquad W_{hy} = \begin{bmatrix}1 & 1 & -1\end{bmatrix} $$

The intuition:

$W_{xh}$ writes the current input into the first entry of the hidden state
the second row of $W_{hh}$ copies the previous “current” value into the new “previous” slot
the third row of $W_{hh}$ keeps the constant 1

Therefore the output becomes: $$ y_t = \max(\text{Current} + \text{Previous} - 1, 0) $$

So only when both current and previous are 1, the output becomes 1.

This example is useful because it shows:

what the hidden state is storing
what each weight matrix is doing
how the output is produced from the hidden state

Computational Graph and Shared Weights

When an RNN is unrolled in time, it looks like a deep computational graph.

But unlike an ordinary deep network, the weight matrix $W$ is shared across all time steps.

That means:

conceptually, each step contributes a gradient to the same parameter
practically, the total gradient for that parameter is the sum over time steps

Different task forms:

many to many: one output for each input, maybe one loss for each time step
many to one: only use final state or pooled states for one final output
one to many: start from one input, then generate a sequence

Backpropagation Through Time

Backpropagation Through Time (BPTT): forward through the whole sequence to compute the loss, then backward through the whole sequence to compute the gradient.

In many-to-many tasks, the total loss can be written as: $$ L = \sum_t L_t $$

Because the same weights are reused at every time step, gradients from different time steps are accumulated onto the same parameter.

Truncated BPTT

In practice, sequences may be too long.

If we backprop through the whole sequence:

memory cost becomes very high
computation becomes very slow

So a common trick is truncated BPTT.

High-level idea:

carry the hidden states forward for the full sequence
but only backpropagate through a smaller time window

This is essentially chunking the sequence into manageable pieces.

Benefits:

lower memory cost
more practical training on long sequences

Cost:

gradient is only approximate compared with full BPTT
long-range dependency across chunks is weakened

Character-level Language Model

A practical use of RNN is language modeling.

Example vocabulary: $[h, e, l, o]$

Example sequence: $\text{“hello”}$

At each time step:

input the current character
predict the next character

So this is essentially a time-step-wise classification problem.

One-hot input and softmax output

Each character can be represented as a one-hot vector.

Example:

$h \to [1,0,0,0]$
$e \to [0,1,0,0]$
$l \to [0,0,1,0]$
$o \to [0,0,0,1]$

The output logits are passed through softmax to produce a probability distribution over the vocabulary.

At training time, we compare the predicted distribution with the ground-truth next character.

Sampling at test time

At test time:

sample one character from the output distribution
feed it back as the next input
repeat until the sequence is finished

So generation is autoregressive.

If we always choose the maximum probability, that is greedy decoding. In practice, sampling is often better because it gives more diverse outputs.

Embedding layer

Using one-hot vectors directly is usually not the best formulation.

A common trick is to insert an embedding layer before the RNN.

Key idea:

multiplying an embedding matrix by a one-hot vector is equivalent to selecting one row / column from the matrix
therefore each token gets mapped to a learnable dense vector

Why embeddings are useful:

dense vector input works better than raw discrete index / one-hot representation
embeddings are learnable
the model can place similar tokens in nearby vector space

A tiny char-RNN can already generate text that gradually becomes more coherent during training. This is historically important because it already showed the basic recipe of modern language modeling:

predict next item in a sequence
feed generated item back into the model
train on huge corpora without manual labeling

Interpretable Hidden Cells

One very interesting thing in RNNs is that some hidden units can become somewhat interpretable.

Examples shown in lecture:

quote detector cell: activates inside quotation marks
line length tracking cell: tracks when a newline may come
if statement cell: tracks whether we are inside an if block
quote/comment cell: detects whether we are inside comment / quote structure
code depth cell: tracks code indentation depth

This suggests that some hidden states are automatically learning useful discrete-like structural signals.

RNN Tradeoffs

Advantages:

can process inputs of arbitrary length
model size does not increase with sequence length
the same weights are reused at each time step
in theory, information from many steps back can influence the current state
at inference / recurrence level, compute scales linearly with sequence length

Disadvantages:

recurrent computation is inherently sequential, so it is slow
training is hard because later states depend on all previous states
hidden state has fixed size, so long-range information is hard to preserve in practice
gradients can vanish or explode over many time steps

Computer Vision Applications of RNNs

Before transformers dominated, RNNs were widely used in vision-language tasks.

Examples:

image captioning
visual question answering (VQA)
visual dialog
vision-language navigation

Image captioning

A common formulation is:

use a CNN to encode the image
use the visual feature as an extra input to the RNN hidden state update
generate one word/token at a time until an <END> token is sampled

A simple modified hidden state formula is: $$ h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + W_{ih} v) $$

where $v$ is the visual feature from the image encoder.

These systems already worked quite well for simple scenes, but also had obvious failure modes:

hallucinating objects that are common in a scene
confusing visually similar contexts
over-relying on dataset co-occurrence bias

Multilayer RNNs

Just like CNNs or MLPs, RNNs can also be stacked in depth.

High-level idea:

each layer has its own hidden states across time
recurrence happens within the same layer over time
output of a lower layer becomes input to the upper layer

So RNN has two axes: time and depth.

This makes the computation graph much more expensive than a shallow recurrent model.

Gradient Flow in Vanilla RNN

The core training problem of vanilla RNN is gradient flow across many time steps.

Suppose: $$ h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t) $$

Then during backpropagation, gradients repeatedly multiply by:

the derivative of tanh
the recurrent weight matrix $W_{hh}$

Why gradients vanish

The derivative of tanh is at most 1, and usually less than 1.

So across many time steps, repeated multiplication tends to shrink gradients.

This leads to vanishing gradients.

Why gradients explode

Even if we ignore the nonlinearity, repeated multiplication by $W_{hh}$ can still be problematic.

The key is the largest singular value of the matrix:

if largest singular value $> 1$, gradients may explode
if largest singular value $< 1$, gradients may vanish

So the recurrent chain is unstable over long time horizons.

Gradient clipping

For exploding gradients, a common trick is gradient clipping.

High-level idea:

if the gradient norm is too large, rescale it

This helps stabilize training, but it does not solve vanishing gradients.

Long Short-Term Memory (LSTM)

LSTM is a more complicated RNN variant designed to make long-term information flow easier.

Instead of only one hidden state, LSTM introduces:

cell state
hidden state
several gates controlling write / forget / reveal behavior

The four common gates are:

$i$: input gate — whether to write to the cell
$f$: forget gate — whether to erase old cell information
$o$: output gate — how much cell information to reveal
$g$: candidate content to write into the cell

The most important idea is that LSTM creates a more direct pathway for information flow in the cell state.

So from $c_t$ back to $c_{t-1}$, the gradient mainly goes through elementwise multiplication by the forget gate, instead of repeated matrix multiplication plus tanh at every step.

This makes long-range gradient flow much easier.

So compared with vanilla RNN:

vanilla RNN has to keep information by repeatedly applying the same unstable transformation
LSTM gives the model a more explicit mechanism to keep / erase / expose information

But LSTM does not completely guarantee no vanishing / exploding gradients. It only makes long-term dependency learning much easier in practice.

important intuition:

in ResNet, skip connections help information pass across depth
in LSTM, cell-state pathway helps information pass across time

So both ideas create an easier path for signal propagation.

Modern RNNs

Although vanilla RNNs are old, the hidden-state idea did not disappear.

Modern architectures such as RWKV and Mamba can also be viewed in the broad direction of recurrent / state-space modeling.

Main attractions:

unlimited context length in principle
linear scaling with sequence length
hidden-state style sequential summary of the past

So current research is in some sense trying to get the best of both worlds:

transformer-level performance
recurrent/state-space efficiency on long sequences

Glossary

recurrent adj. 循环的，递归的
sequence n. 序列
hidden state n. 隐状态
unroll v. 展开（计算图）
recurrence n. 递推，递归关系
autoregressive adj. 自回归的
one-hot adj. 独热的
embedding n. 嵌入向量
interpretable adj. 可解释的
singular value n. 奇异值
clipping n. 裁剪
gate n. 门控
captioning n. 图像描述生成
hallucination n. 幻觉，虚构内容
state space model n. 状态空间模型

Sequence Modeling#

Recurrent Neural Networks#

Unrolled RNN#

Vanilla RNN#

What should the hidden state remember?#

Hand-crafted weights#

Computational Graph and Shared Weights#

Backpropagation Through Time#

Truncated BPTT#

Character-level Language Model#

One-hot input and softmax output#

Sampling at test time#

Embedding layer#

Interpretable Hidden Cells#

RNN Tradeoffs#

Computer Vision Applications of RNNs#

Image captioning#

Multilayer RNNs#

Gradient Flow in Vanilla RNN#

Why gradients vanish#

Why gradients explode#

Gradient clipping#

Long Short-Term Memory (LSTM)#

Modern RNNs#

Glossary#

Sequence Modeling

Recurrent Neural Networks

Unrolled RNN

Vanilla RNN

What should the hidden state remember?

Hand-crafted weights

Computational Graph and Shared Weights

Backpropagation Through Time

Truncated BPTT

Character-level Language Model

One-hot input and softmax output

Sampling at test time

Embedding layer

Interpretable Hidden Cells

RNN Tradeoffs

Computer Vision Applications of RNNs

Image captioning

Multilayer RNNs

Gradient Flow in Vanilla RNN

Why gradients vanish

Why gradients explode

Gradient clipping

Long Short-Term Memory (LSTM)

Modern RNNs

Glossary