From Seq2Seq RNN to Attention

For sequence-to-sequence tasks like machine translation, we may want to input one sequence and output another sequence.

Input: English sentence
Output: Italian / French sentence
The lengths of input and output sequences may be different.

A classic solution before Transformers was the encoder-decoder RNN.

The encoder RNN reads the input sequence and updates hidden states
The decoder RNN generates output tokens one by one
We often use the final hidden state of the encoder as the context vector $c$

The bottleneck problem

The basic Seq2Seq RNN summarizes the whole input sequence into one fixed-size context vector $c$.

This may work for short sequences.
But for very long sequences, it is difficult to compress everything into one vector.
Therefore, the information of the input sequence is bottlenecked by $c$.

High-level idea: Instead of forcing the whole input into one vector, let the decoder look back at the whole input sequence at each step.

Attention in Seq2Seq

For each decoder timestep $t$, we compare the previous decoder hidden state $s_{t-1}$ with every encoder hidden state $h_i$.

Step 1: alignment scores: We compute scalar alignment scores $$ e_{t,i} = f_{att}(s_{t-1}, h_i) $$ where $f_{att}$ can be a learnable function such as a linear layer.

Step 2: attention weights: apply softmax over the input positions $i$ to get attention weights $$ a_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})} $$ with properties:

$0 < a_{t,i} < 1$
$\sum_i a_{t,i} = 1$

Step 3: context vector

We compute the context vector as a weighted sum of encoder hidden states $$ c_t = \sum_i a_{t,i} h_i $$

Step 4: decode with context

The decoder now uses the current context vector $$ s_t = g_U(y_{t-1}, s_{t-1}, c_t) $$
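Steps 1 to 3 can be sketched for a single decoder timestep as follows. This is a minimal illustration, not the lecture's implementation: the sizes, the random weights, and the additive form $f_{att}(s, h) = v^\top \tanh(W_s s + W_h h)$ are all assumptions chosen for the example.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)            # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
T_in, D_h, D_s, D_a = 5, 4, 3, 8       # illustrative sizes (assumed)
h = rng.normal(size=(T_in, D_h))       # encoder hidden states h_i
s_prev = rng.normal(size=D_s)          # previous decoder state s_{t-1}

# Hypothetical parameters of f_att (additive form, an assumption)
W_s = rng.normal(size=(D_s, D_a))
W_h = rng.normal(size=(D_h, D_a))
v = rng.normal(size=D_a)

# Step 1: one scalar alignment score per input position
e = np.tanh(s_prev @ W_s + h @ W_h) @ v    # shape (T_in,)

# Step 2: softmax over input positions
a = softmax(e)                              # shape (T_in,), sums to 1

# Step 3: context vector as a weighted sum of encoder states
c = a @ h                                   # shape (D_h,)

# Step 4 would feed c, together with y_{t-1} and s_{t-1}, into the decoder cell g_U.
print(a, c)
```

Because the weights are positive and sum to one, $c_t$ is a convex combination of the encoder states.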

How to understand attention:

for each output timestep, the decoder can attend to the relevant parts of the input sequence.
different output tokens may attend to different input positions.
the whole mechanism is differentiable, so we can backprop through everything.
no explicit supervision on alignment is needed.

Visualizing attention

Attention weights are interpretable.

If the attention map is close to diagonal, the input and output words roughly correspond in order.
Non-diagonal structure may indicate word reordering.
Therefore, attention also gives us some understanding of what the model is using.

Attention Layer

The attention mechanism inside Seq2Seq RNN can be abstracted into a more general operator.

Single query version

Inputs:

Query vector: $q \in \mathbb R^{D_Q}$
Data vectors: $X \in \mathbb R^{N_X \times D_X}$

Computation:

Similarities: $$ e_i = f_{att}(q, X_i) $$
Attention weights: $$ a = \operatorname{softmax}(e) $$
Output vector: $$ y = \sum_i a_i X_i $$

So the attention layer takes a query and summarizes a set of data vectors into one output vector.

Scaled Dot-Product Attention

A simple choice for similarity is dot product. $$ e_i = q \cdot X_i $$

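In practice, we usually use scaled dot-product: $$ e_i = \frac{q \cdot X_i}{\sqrt{D_Q}} $$ Here the query and data vectors must have the same dimension, $D_Q = D_X$.

Why scale?

When the dimension is large, dot products can become very large: if the components of $q$ and $X_i$ are independent with zero mean and unit variance, $q \cdot X_i$ has variance $D_Q$.
Then softmax may saturate.
Saturated softmax leads to very small gradients.
So dividing by $\sqrt{D_Q}$ keeps the scores at roughly unit scale and helps stabilize training.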
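The effect of scaling can be checked numerically. The sketch below assumes standard normal entries and an arbitrary dimension of 512; it compares the spread of raw and scaled scores and how peaked the resulting softmax is.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)            # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
D, N = 512, 1000                 # illustrative sizes (assumed)
q = rng.normal(size=D)
X = rng.normal(size=(N, D))

raw = X @ q                      # unscaled dot products, std close to sqrt(D)
scaled = raw / np.sqrt(D)        # scaled dot products, std close to 1

print("std of raw scores:    ", raw.std())
print("std of scaled scores: ", scaled.std())
print("largest weight, raw:   ", softmax(raw).max())
print("largest weight, scaled:", softmax(scaled).max())
```

With raw scores, almost all of the attention mass tends to land on a single position, which is the saturated regime where gradients are small.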

Multiple Queries, Keys and Values

We can generalize attention to multiple queries.

Let

$Q \in \mathbb R^{N_Q \times D_Q}$ be the query matrix
$X \in \mathbb R^{N_X \times D_X}$ be the data matrix

Then attention can be written as $$ E = \frac{QX^T}{\sqrt{D_Q}}, \qquad A = \operatorname{softmax}(E), \qquad Y = AX $$

But usually we further separate the role of data vectors into keys and values.

Keys and values

We learn two projections $$ K = XW_K, \qquad V = XW_V $$ with $W_K \in \mathbb R^{D_X \times D_Q}$ and $W_V \in \mathbb R^{D_X \times D_V}$, where

$K$ is used for similarity matching with queries
$V$ is used for constructing outputs

So the final attention formula becomes $$ E = \frac{QK^T}{\sqrt{D_Q}}, \qquad A = \operatorname{softmax}(E), \qquad Y = AV $$

Intuition for keys and values:

keys decide where to attend
values decide what information to retrieve
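The key/value formula can be written directly in NumPy. This is a minimal sketch with random weights and made-up sizes; the function name `attention` and the value dimension `D_V` are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, X, W_K, W_V):
    K = X @ W_K                                   # (N_X, D_Q): used for matching
    V = X @ W_V                                   # (N_X, D_V): used for outputs
    E = Q @ K.T / np.sqrt(Q.shape[-1])            # (N_Q, N_X) similarities
    A = softmax(E, axis=-1)                       # each row sums to 1
    return A @ V, A                               # (N_Q, D_V) outputs

rng = np.random.default_rng(0)
N_Q, N_X, D_Q, D_X, D_V = 2, 5, 4, 6, 3           # illustrative sizes (assumed)
Q = rng.normal(size=(N_Q, D_Q))
X = rng.normal(size=(N_X, D_X))
W_K = rng.normal(size=(D_X, D_Q))
W_V = rng.normal(size=(D_X, D_V))

Y, A = attention(Q, X, W_K, W_V)
print(Y.shape, A.shape)
```

Note that the data dimension $D_X$ no longer has to match the query dimension, because $W_K$ maps the data into the query space.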

Cross-Attention and Self-Attention

Cross-Attention

In cross-attention:

queries come from one source
keys and values come from another source

So each query produces an output by mixing information from another set of vectors.

This is natural in the encoder-decoder setting:

decoder states provide queries
encoder states provide keys and values

Self-Attention

In self-attention, all queries, keys, and values are computed from the same input vectors $X$.

We learn three projections: $$ Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V $$

Then compute $$ E = \frac{QK^T}{\sqrt{D_Q}}, \qquad A = \operatorname{softmax}(E), \qquad Y = AV $$

This means:

each input vector produces one query, one key, and one value.
each output vector is a weighted sum of all value vectors.
therefore, each input can interact with all other inputs.

In practice, $Q,K,V$ are often computed in one fused matrix multiplication: $$ [Q\ K\ V] = X[W_Q\ W_K\ W_V] $$
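A quick check that the fused projection gives the same $Q, K, V$ as three separate ones. The sketch assumes all three projections share the same output width so the result can be split evenly; sizes and weights are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, D_X, D = 6, 8, 4                               # illustrative sizes (assumed)
X = rng.normal(size=(N, D_X))
W_Q, W_K, W_V = (rng.normal(size=(D_X, D)) for _ in range(3))

# Three separate projections
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# One fused projection: concatenate the weights column-wise, multiply once, split
W_QKV = np.concatenate([W_Q, W_K, W_V], axis=1)   # (D_X, 3D)
Q_f, K_f, V_f = np.split(X @ W_QKV, 3, axis=1)

# Self-attention output from the fused projections
A = softmax(Q_f @ K_f.T / np.sqrt(D))             # (N, N)
Y = A @ V_f                                       # (N, D)
print(Y.shape)
```

The fused form does the same arithmetic but as one larger matrix multiply, which is usually friendlier to hardware.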

Permutation Equivariance and Positional Encoding

A pure self-attention layer is permutation equivariant.

If we permute the inputs, then queries, keys, values, similarities, attention weights, and outputs are also permuted in the same way.
So self-attention itself does not know the order of a sequence.
In that sense, self-attention works on a set of vectors.

Formally, $$ F(\sigma(X)) = \sigma(F(X)) $$
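The equivariance property can be verified numerically. In this sketch, $\sigma$ is a random row permutation and $F$ is a single-head self-attention layer with random weights (both assumptions for the example).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ V

rng = np.random.default_rng(0)
N, D = 5, 4                                       # illustrative sizes (assumed)
X = rng.normal(size=(N, D))
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))

perm = rng.permutation(N)                         # sigma: a reordering of the rows

lhs = self_attention(X[perm], W_Q, W_K, W_V)      # F(sigma(X))
rhs = self_attention(X, W_Q, W_K, W_V)[perm]      # sigma(F(X))
print(np.allclose(lhs, rhs))
```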

The problem

For language or time series, order matters.

The solution: positional encoding

We add positional encoding to each input vector.

It is a vector that depends on the position index.
After adding it, the model can distinguish different positions.
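The notes do not say which positional encoding is used. As one possible choice, the sketch below builds the fixed sinusoidal encoding from the original Transformer paper; learned position embeddings are another common option. The function name and sizes are assumptions.

```python
import numpy as np

def sinusoidal_positional_encoding(N, D):
    """Return an (N, D) matrix whose row n encodes position n. D must be even."""
    pos = np.arange(N)[:, None]                   # (N, 1) position indices
    i = np.arange(0, D, 2)[None, :]               # (1, D/2) even channel indices
    angles = pos / (10000 ** (i / D))             # (N, D/2)
    pe = np.zeros((N, D))
    pe[:, 0::2] = np.sin(angles)                  # even channels
    pe[:, 1::2] = np.cos(angles)                  # odd channels
    return pe

rng = np.random.default_rng(0)
N, D = 10, 8                                      # illustrative sizes (assumed)
X = rng.normal(size=(N, D))
pe = sinusoidal_positional_encoding(N, D)
X_pos = X + pe                                    # inputs now carry position information
print(X_pos.shape)
```

Since every row of `pe` is different, two identical input vectors at different positions no longer look the same to the attention layer.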

Masked Self-Attention

Sometimes we do not want an input to look at all other positions.

For autoregressive language modeling, a token should not look ahead to future tokens.

token 1 can only see token 1
token 2 can only see tokens 1, 2
token 3 can only see tokens 1, 2, 3

So we use masking:

overwrite forbidden similarity entries with $-\infty$
after softmax, those positions get attention weight $0$

This produces causal / masked self-attention.
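A minimal sketch of the masking step, using single-head self-attention with random weights (an assumption). Entries above the diagonal of the similarity matrix correspond to future tokens and are overwritten with $-\infty$ before softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)                                 # exp(-inf) evaluates to 0
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    N = X.shape[0]
    E = Q @ K.T / np.sqrt(Q.shape[-1])            # (N, N)
    future = np.triu(np.ones((N, N), dtype=bool), k=1)   # True where key index > query index
    E = np.where(future, -np.inf, E)              # forbid looking ahead
    A = softmax(E)
    return A @ V, A

rng = np.random.default_rng(0)
N, D = 4, 3                                       # illustrative sizes (assumed)
X = rng.normal(size=(N, D))
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))

Y, A = causal_self_attention(X, W_Q, W_K, W_V)
print(np.round(A, 3))                             # lower-triangular attention weights
```

The first row of `A` puts all its weight on token 1, matching the "token 1 can only see token 1" rule.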

Multiheaded Self-Attention

Instead of using one self-attention layer, we often run several self-attention layers in parallel. These are called heads.

If there are $H$ heads:

each head has its own $W_Q, W_K, W_V$
each head computes its own attention output
then we concatenate all head outputs and apply an output projection

If the model dimension is $D$, the head dimension is often $$ D_H = D / H $$ so that input and output dimensions stay the same.

Why multihead?

different heads may focus on different relations
some heads may learn local interactions
some heads may learn long-range interactions
it increases model capacity without changing the high-level structure

Self-Attention as Four Matrix Multiplies

Multiheaded self-attention can be viewed as four matrix multiplications.

QKV projection $$ [N \times D][D \times 3HD_H] \Rightarrow [N \times 3HD_H] $$
QK similarity $$ [H \times N \times D_H][H \times D_H \times N] \Rightarrow [H \times N \times N] $$
V-weighting $$ [H \times N \times N][H \times N \times D_H] \Rightarrow [H \times N \times D_H] $$
Output projection $$ [N \times HD_H][HD_H \times D] \Rightarrow [N \times D] $$

So although attention looks conceptually rich, most of the computation is still large matrix multiplication.
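The four multiplies can be sketched with batched NumPy matmuls. The weight layout (all query columns, then all key columns, then all value columns, each grouped by head) is an assumption of this example, as are the sizes.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multihead_self_attention(X, W_QKV, W_O, H):
    N, D = X.shape
    D_H = W_QKV.shape[1] // (3 * H)

    # 1) QKV projection: [N x D][D x 3HD_H] => [N x 3HD_H]
    Q, K, V = np.split(X @ W_QKV, 3, axis=-1)     # each [N x HD_H]

    def split_heads(M):                           # [N x HD_H] => [H x N x D_H]
        return M.reshape(N, H, D_H).transpose(1, 0, 2)

    Q, K, V = split_heads(Q), split_heads(K), split_heads(V)

    # 2) QK similarity: [H x N x D_H][H x D_H x N] => [H x N x N]
    E = Q @ K.transpose(0, 2, 1) / np.sqrt(D_H)
    A = softmax(E, axis=-1)

    # 3) V-weighting: [H x N x N][H x N x D_H] => [H x N x D_H]
    Y = A @ V

    # Concatenate heads: [H x N x D_H] => [N x HD_H]
    Y = Y.transpose(1, 0, 2).reshape(N, H * D_H)

    # 4) Output projection: [N x HD_H][HD_H x D] => [N x D]
    return Y @ W_O

rng = np.random.default_rng(0)
N, D, H = 5, 8, 2                                 # illustrative sizes (assumed)
D_H = D // H
X = rng.normal(size=(N, D))
W_QKV = rng.normal(size=(D, 3 * H * D_H))
W_O = rng.normal(size=(H * D_H, D))

Y = multihead_self_attention(X, W_QKV, W_O, H)
print(Y.shape)
```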

Complexity of Self-Attention

The main problem of self-attention is the $N \times N$ attention matrix.

Compute complexity

The compute of self-attention grows as $$ O(N^2) $$ with sequence length $N$.

Memory complexity

The memory also grows as $$ O(N^2) $$ if we explicitly store the whole attention matrix.

This becomes expensive for very long sequences.

Flash Attention

Flash Attention avoids storing the full attention matrix explicitly.

It computes the softmax and the weighted sum of values together, block by block, in one fused kernel.
Therefore, memory can be reduced to approximately $O(N)$, while compute is still $O(N^2)$.
This makes much longer sequences feasible in practice.
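The core trick can be illustrated in NumPy with a streaming ("online") softmax over blocks of keys. This is only a sketch of the idea, not the actual Flash Attention kernel, which also tiles the queries and runs as a fused GPU kernel; the function names and block size are assumptions.

```python
import numpy as np

def attention_reference(Q, K, V):
    E = Q @ K.T / np.sqrt(Q.shape[-1])            # materializes the full (N_Q, N_K) matrix
    E = E - E.max(axis=-1, keepdims=True)
    P = np.exp(E)
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def attention_streaming(Q, K, V, block=4):
    N_Q, D = Q.shape
    m = np.full((N_Q, 1), -np.inf)                # running row-wise max of the scores
    l = np.zeros((N_Q, 1))                        # running softmax denominator
    acc = np.zeros((N_Q, V.shape[1]))             # running unnormalized output
    for start in range(0, K.shape[0], block):
        K_b, V_b = K[start:start + block], V[start:start + block]
        E_b = Q @ K_b.T / np.sqrt(D)              # only an (N_Q, block) slice of scores
        m_new = np.maximum(m, E_b.max(axis=-1, keepdims=True))
        scale = np.exp(m - m_new)                 # rescale what was accumulated so far
        P_b = np.exp(E_b - m_new)
        l = l * scale + P_b.sum(axis=-1, keepdims=True)
        acc = acc * scale + P_b @ V_b
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 8))
K = rng.normal(size=(10, 8))
V = rng.normal(size=(10, 5))
out_ref = attention_reference(Q, K, V)
out_stream = attention_streaming(Q, K, V, block=4)
print(np.allclose(out_ref, out_stream))
```

The streaming version never holds more than one block of scores at a time, yet produces the same output as the reference.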

Three Ways of Processing Sequences

Recurrent Neural Network

works on 1D ordered sequences
theoretically good at long sequences: $O(N)$ compute and memory
but not parallelizable because hidden states must be computed sequentially

Convolution

works on multi-dimensional grids
outputs can be computed in parallel
but long-range interaction is hard because receptive field grows slowly

Self-Attention

works on sets of vectors
each output depends directly on all inputs
highly parallelizable
but expensive because of quadratic complexity

Transformer Block

The Transformer uses self-attention as its core primitive.

A standard transformer block contains:

Multiheaded self-attention
Residual connection
Layer normalization
MLP / FFN on each vector independently
Another residual connection
Another layer normalization

Important understanding

Self-Attention is the only part where different vectors directly interact.
LayerNorm and MLP operate on each vector independently.
Most computation is just matrix multiplications:
- 4 from self-attention
- 2 from the MLP

MLP / FFN

The MLP is usually a two-layer feed-forward network applied independently to each vector: $$ D \Rightarrow 4D \Rightarrow D $$

This means:

self-attention mixes information across tokens
MLP increases nonlinear processing capacity for each token

A full Transformer is just a stack of many identical transformer blocks.
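The block structure listed above can be sketched as follows. To keep it short, the example makes several simplifying assumptions: a single attention head, LayerNorm without learned scale and shift, a ReLU nonlinearity in the MLP, and small random weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)           # statistics per vector, not across tokens
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x, p):
    Q, K, V = x @ p["W_Q"], x @ p["W_K"], x @ p["W_V"]
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # the only place tokens interact
    return (A @ V) @ p["W_O"]

def mlp(x, p):
    return np.maximum(x @ p["W_1"], 0.0) @ p["W_2"]   # D => 4D => D, per token

def transformer_block(x, p):
    x = layer_norm(x + self_attention(x, p))      # attention, residual, norm
    x = layer_norm(x + mlp(x, p))                 # MLP, residual, norm
    return x

def init_params(rng, D):
    shapes = {"W_Q": (D, D), "W_K": (D, D), "W_V": (D, D), "W_O": (D, D),
              "W_1": (D, 4 * D), "W_2": (4 * D, D)}
    return {name: 0.1 * rng.normal(size=s) for name, s in shapes.items()}

rng = np.random.default_rng(0)
N, D, num_blocks = 5, 8, 3                        # illustrative sizes (assumed)
layers = [init_params(rng, D) for _ in range(num_blocks)]

x = rng.normal(size=(N, D))
for p in layers:                                  # same structure, separate weights
    x = transformer_block(x, p)
print(x.shape)
```

The six weight matrices in `init_params` correspond to the six matrix multiplies per block: four in self-attention and two in the MLP.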

Transformers for Language Modeling

For language modeling:

Learn an embedding matrix of shape $[V \times D]$ to map tokens to vectors.
Add positional encoding.
Pass the sequence through many transformer blocks.
Use masked attention so each token only sees previous tokens.
At the end, project from $D$ to vocabulary size $V$.
Apply softmax and cross-entropy loss to predict the next token.
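The input and output plumbing around the transformer blocks can be sketched as below. The blocks themselves and the positional encoding are omitted, and the sizes and random weights are assumptions; the point is the embedding lookup, the output projection, and the one-step shift between predictions and targets.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)         # numerical stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def next_token_loss(logits, tokens):
    # Position n predicts tokens[n + 1], so drop the last row of logits
    # and the first token when forming (prediction, target) pairs.
    logp = log_softmax(logits[:-1])
    targets = tokens[1:]
    return -logp[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
V, D, N = 11, 8, 6                                # vocabulary size, model dim, sequence length
tokens = rng.integers(0, V, size=N)               # token ids

W_embed = rng.normal(size=(V, D))                 # embedding matrix [V x D]
W_out = rng.normal(size=(D, V))                   # output projection D => V

x = W_embed[tokens]                               # (N, D): one vector per token
# Positional encoding and the masked transformer blocks would transform x here (omitted).
logits = x @ W_out                                # (N, V): vocabulary scores per position
loss = next_token_loss(logits, tokens)
print(loss)
```

A model that outputs the same score for every token gets a loss of $\log V$, which is a useful sanity check at initialization.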

Takeaways about LLMs from this lecture:

an LLM is essentially a masked transformer for next-token prediction.
the embedding matrix converts tokens into vectors.
the output projection converts vectors back to vocabulary scores.

Vision Transformers (ViT)

Transformers can also process images.

Main idea: Instead of treating the image as a grid for CNN filters, we split the image into patches and treat each patch like a token.

For example, for a $224 \times 224 \times 3$ image:

break it into $14 \times 14 = 196$ patches of size $16 \times 16 \times 3$
flatten each patch into a vector of length $16 \cdot 16 \cdot 3 = 768$
apply a linear transform $768 \Rightarrow D$
use the resulting $D$-dimensional patch vectors as transformer inputs

Additional details:

Use positional encoding to tell the model the 2D location of each patch.
Do not use causal masking, because every patch can attend to every other patch.
The transformer outputs one vector per patch.
Then we can average pool the patch outputs and use a linear layer for classification.
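The patch pipeline can be sketched with array reshapes. The transformer blocks and positional encoding are omitted; the helper name `patchify`, the model dimension, the number of classes, and the random weights are assumptions for the example.

```python
import numpy as np

def patchify(image, P):
    """Split an (H, W, C) image into non-overlapping P x P patches, one flat vector each."""
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0
    x = image.reshape(H // P, P, W // P, P, C)    # split rows and columns into blocks
    x = x.transpose(0, 2, 1, 3, 4)                # (H/P, W/P, P, P, C)
    return x.reshape(-1, P * P * C)               # (num_patches, P*P*C)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patches = patchify(image, 16)                     # (196, 768)

D, num_classes = 32, 10                           # illustrative sizes (assumed)
W_embed = 0.02 * rng.normal(size=(768, D))
tokens = patches @ W_embed                        # (196, D) transformer inputs

# Positional encoding and unmasked transformer blocks would go here (omitted).

pooled = tokens.mean(axis=0)                      # average pool over patches: (D,)
W_cls = rng.normal(size=(D, num_classes))
logits = pooled @ W_cls                           # (num_classes,) class scores
print(patches.shape, tokens.shape, logits.shape)
```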

Common Tweaks of Transformers

Although the core transformer architecture has changed little, some modifications have become common.

Pre-Norm

Move normalization before self-attention and MLP, inside the residual branches.

training is more stable
optimization is usually easier
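The difference between the two layouts is only where the normalization sits. The sketch below uses a parameter-free LayerNorm and passes the sublayers in as plain functions (both assumptions). With sublayers that output zero, the pre-norm block reduces to the identity, which shows that its residual path is left untouched; this is one commonly cited reason it tends to be easier to optimize.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, attn, mlp):
    x = layer_norm(x + attn(x))                   # norm sits on the residual path
    return layer_norm(x + mlp(x))

def pre_norm_block(x, attn, mlp):
    x = x + attn(layer_norm(x))                   # norm sits inside the residual branch
    return x + mlp(layer_norm(x))

rng = np.random.default_rng(0)
X = 3.0 * rng.normal(size=(4, 8)) + 1.0

zero = lambda x: np.zeros_like(x)                 # stand-in sublayer that contributes nothing

out_pre = pre_norm_block(X, zero, zero)
out_post = post_norm_block(X, zero, zero)
print(np.allclose(out_pre, X), np.allclose(out_post, X))
```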

RMSNorm

Replace LayerNorm with RMSNorm.

Given input $x \in \mathbb R^D$, RMSNorm computes $$ y_i = \frac{x_i}{\operatorname{RMS}(x)} \gamma_i $$ where $$ \operatorname{RMS}(x) = \sqrt{\epsilon + \frac{1}{D} \sum_{i=1}^{D} x_i^2} $$

Compared with LayerNorm:

it does not subtract mean
it is a bit simpler
it is often more stable in practice
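A minimal sketch of RMSNorm applied to each row of a batch of vectors, averaging over the $D$ components of each vector. The value of `eps` and the test data are assumptions.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Root mean square over the feature dimension, computed per vector
    rms = np.sqrt(eps + np.mean(x ** 2, axis=-1, keepdims=True))
    return x / rms * gamma                        # no mean subtraction, unlike LayerNorm

rng = np.random.default_rng(0)
N, D = 4, 16                                      # illustrative sizes (assumed)
x = 2.0 * rng.normal(size=(N, D)) + 3.0           # deliberately not zero-mean
gamma = np.ones(D)                                # learned scale, set to 1 here

y = rms_norm(x, gamma)
print(np.mean(y ** 2, axis=-1))                   # close to 1 for every row
print(np.mean(y, axis=-1))                        # generally nonzero: the mean is kept
```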

SwiGLU MLP

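Instead of the classic MLP $$ Y = \sigma(XW_1)W_2 $$ we can use SwiGLU: $$ Y = \big(\sigma(XW_1) \odot XW_2\big) W_3 $$ where $\odot$ is elementwise multiplication and $\sigma$ in the SwiGLU formula is the Swish (SiLU) activation.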

This adds a gating effect and often improves performance.
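A minimal sketch of the SwiGLU MLP, with the Swish (SiLU) activation written out explicitly. The hidden width of $8D/3$ is an assumption of this example: it is a common choice because it gives the three-matrix SwiGLU the same parameter count as a two-matrix $D \Rightarrow 4D \Rightarrow D$ MLP.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))                 # Swish / SiLU: z * sigmoid(z)

def swiglu_mlp(X, W_1, W_2, W_3):
    gate = silu(X @ W_1)                          # (N, D_hidden) gating branch
    value = X @ W_2                               # (N, D_hidden) linear branch
    return (gate * value) @ W_3                   # elementwise gate, then project back to D

rng = np.random.default_rng(0)
N, D = 5, 6                                       # illustrative sizes (assumed)
D_hidden = 8 * D // 3                             # 16 here
X = rng.normal(size=(N, D))
W_1 = rng.normal(size=(D, D_hidden))
W_2 = rng.normal(size=(D, D_hidden))
W_3 = rng.normal(size=(D_hidden, D))

Y = swiglu_mlp(X, W_1, W_2, W_3)

classic_params = 2 * D * (4 * D)                  # W_1 and W_2 of the classic MLP
swiglu_params = W_1.size + W_2.size + W_3.size
print(Y.shape, classic_params, swiglu_params)
```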

Mixture of Experts (MoE)

Instead of one MLP, learn $E$ different experts.

each token is routed only to $A < E$ active experts
parameters increase a lot
compute increases much less

So MoE can greatly expand model size without increasing computation proportionally.
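The routing idea can be sketched as follows. The notes do not specify the router, so this example assumes a linear router, top-$A$ selection per token, and a softmax gate over the selected experts; each expert is a small ReLU MLP. All names and sizes are illustrative.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def moe_layer(X, W_router, experts, A):
    scores = X @ W_router                         # (N, E) one score per token and expert
    top = np.argsort(scores, axis=-1)[:, -A:]     # indices of the A best experts per token
    Y = np.zeros_like(X)
    for n in range(X.shape[0]):
        gate = softmax(scores[n, top[n]])         # normalize over the active experts only
        for g, e in zip(gate, top[n]):
            W_1, W_2 = experts[e]
            Y[n] += g * (np.maximum(X[n] @ W_1, 0.0) @ W_2)
    return Y, top

rng = np.random.default_rng(0)
N, D, E, A = 6, 4, 8, 2                           # tokens, model dim, experts, active experts
X = rng.normal(size=(N, D))
W_router = rng.normal(size=(D, E))
experts = [(rng.normal(size=(D, 4 * D)), rng.normal(size=(4 * D, D))) for _ in range(E)]

Y, top = moe_layer(X, W_router, experts, A)

params_per_expert = 2 * D * 4 * D
total_params = E * params_per_expert              # grows with E
active_params = A * params_per_expert             # per-token compute grows with A only
print(Y.shape, total_params, active_params)
```

With these sizes the layer stores 8 experts' worth of parameters, but each token only pays for 2 of them.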

Glossary

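primitive n. basic operation; low-level building block
permutation equivariant adj. permuting the inputs permutes the outputs in the same way
positional encoding n. a vector that encodes the position index of an input
masked attention n. attention in which some positions are hidden from a query
causal adj. depending only on the current and earlier positions
head dim n. the dimension of each attention head
receptive field n. the region of the input that can influence a given output
routing n. assigning each token to experts
expert n. one of several alternative MLP modules in a Mixture of Experts layer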
