
An Alternate Formulation of Transformers
Residual Stream Perspective

Tanmoy Chakraborty
Associate Professor, IIT Delhi
https://tanmoychak.com/

Introduction to Large Language Models


Recall: Masked Self-Attention in Decoders
Self-Attention: Scaled dot-product attention

Attention(Q, K, V) = softmax( Q K^T / \sqrt{d_k} ) V

where Q = X W_Q, K = X W_K, and V = X W_V
Problem: While training autoregressive models (with a next-word-prediction objective), Transformers ‘can see the future’.
• For a current token x_i, the attention scores are computed with all tokens in the sequence, including those that come after x_i (as the whole sequence is available to us during training).
Solution: Masking



Recall: Masked Self-Attention in Decoders
Masking: ‘Masked’ scaled dot-product attention

MaskedAttention(Q, K, V) = softmax( Q K^T / \sqrt{d_k} + M ) V

where the masking matrix M is defined as:

M_{ij} = 0 if j ≤ i, and M_{ij} = -∞ if j > i

For future tokens, the attention scores become zero after applying the softmax [softmax(-∞) = 0].
• Effectively, after masking, the query is the current token x_i, and the keys and values come from the tokens before it, including itself (i.e., x_j, j ≤ i).
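A minimal NumPy sketch of this masked scaled dot-product attention for a single head (the function names and shapes are illustrative, not taken from the lecture):

import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability; exp(-inf) = 0,
    # so masked positions receive zero attention weight.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention(X, W_Q, W_K, W_V):
    # X: (seq_len, d_model); W_Q, W_K, W_V: (d_model, d_k)
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_len, seq_len)
    # Masking matrix M: 0 for j <= i, -inf for j > i, so token i
    # attends only to itself and the tokens before it.
    M = np.triu(np.full(scores.shape, -np.inf), k=1)
    A = softmax(scores + M)                       # attention weights; rows sum to 1
    return A @ V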



Re-writing the Masked Self-Attention Equation
Now let’s re-write the masked attention equation for a current token x_i.
• Assume that we are considering attention head h of layer l.
• Let’s denote the matrix of output hidden representations from layer k for the previous tokens x_j, j ≤ i, as X_{≤i}^k.
Thus, for calculating the attention scores of attention head h in layer l, the input to the attention sub-layer is the output representation from the previous layer l-1:
• Query: x_i^{l-1} W_Q^{l,h}
• Keys: X_{≤i}^{l-1} W_K^{l,h}
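Putting these together (a sketch that assumes the standard scaled dot-product form recalled earlier), the masked attention weights of token x_i for head h of layer l are:

a_i^{l,h} = softmax( (x_i^{l-1} W_Q^{l,h}) (X_{≤i}^{l-1} W_K^{l,h})^T / \sqrt{d_k} )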
Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



QK Circuit


QK (query-key) circuit: W_{QK}^h = W_Q^h (W_K^h)^T
• QK circuits are responsible for reading from the residual stream.
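With this grouping, the pre-softmax score between the query of token x_i and the key of an earlier token x_j is simply a rewriting of q_i k_j^T (layer indices dropped for readability):

(x_i W_Q^h)(x_j W_K^h)^T / \sqrt{d_k} = x_i W_{QK}^h x_j^T / \sqrt{d_k}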

Let’s now look at the residual stream


Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



Residual Stream Perspective
• Each input embedding gets updated via vector additions from the attention and feed-forward blocks, producing residual stream states (or intermediate representations).

• The final-layer residual stream state is then projected into the vocabulary space via the unembedding matrix W_U ∈ R^{d×|V|} and normalized via the softmax.
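A minimal sketch of this residual stream view, in the same NumPy style as before (the layer objects with attention and ffn methods are illustrative placeholders; layer normalization and biases are omitted):

def residual_stream_forward(x, layers, W_U):
    # x: (seq_len, d_model) input embeddings, i.e., the initial residual stream states
    for layer in layers:
        x = x + layer.attention(x)   # attention block writes its output into the stream
        x = x + layer.ffn(x)         # feed-forward block writes its output into the stream
    logits = x @ W_U                 # unembedding: project final states into vocabulary space
    return softmax(logits)           # normalize via softmax (helper from the earlier sketch)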

Elhage et al., A Mathematical Framework for Transformer Circuits



Combining the Output of Multiple Attention Heads
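As a sketch of the form used by Ferrando et al., the output of head h in layer l for token x_i can be written in terms of its masked attention weights a_{i,j}^{l,h} as:

z_i^{l,h} = \sum_{j ≤ i} a_{i,j}^{l,h} x_j^{l-1} W_V^{l,h} W_O^{l,h}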

Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



OV Circuit

OV (output-value) circuit: W_{OV}^{l,h} = W_V^{l,h} W_O^{l,h}

• OV circuits are responsible for writing to the residual stream.

Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



Attention Block Output
The attention block output is the sum of the outputs of the individual attention heads, and it is subsequently added back into the residual stream.
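In symbols (a sketch consistent with the per-head outputs above, assuming H heads per layer):

Attn^l(X_{≤i}^{l-1}) = \sum_{h=1}^{H} z_i^{l,h},    x_i^{mid,l} = x_i^{l-1} + Attn^l(X_{≤i}^{l-1})

where x_i^{mid,l} is the residual stream state that the FFN block of layer l reads from.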

Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



Feed-Forward Network (FFN)

• W_in^l reads from the residual stream state x_i^{mid,l}.
• Its result is passed through an element-wise non-linear activation function g, producing the neuron activations.
• These get transformed by W_out^l to produce the output FFN^l(x_i^{mid,l}), which is then added back to the residual stream.
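As an equation (a sketch matching this description, with biases omitted):

FFN^l(x_i^{mid,l}) = g( x_i^{mid,l} W_in^l ) W_out^l,    x_i^l = x_i^{mid,l} + FFN^l(x_i^{mid,l})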
Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



Prediction as a Sum of Component Outputs
• The prediction head of a Transformer consists of an unembedding matrix W_U ∈ R^{d×|V|}.
We can rearrange the traditional forward-pass formulation to separate the contribution of each model component to the output logits:
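A sketch of this rearrangement (ignoring layer normalization for readability), obtained by unrolling the residual stream into everything that was written into it:

x_i^L = x_i^0 + \sum_{l=1}^{L} Attn^l(X_{≤i}^{l-1}) + \sum_{l=1}^{L} FFN^l(x_i^{mid,l})

logits(x_i) = x_i^L W_U = x_i^0 W_U + \sum_{l=1}^{L} Attn^l(X_{≤i}^{l-1}) W_U + \sum_{l=1}^{L} FFN^l(x_i^{mid,l}) W_U

so every attention block and every FFN block contributes its own additive term to the output logits.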

Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



Prediction as an Ensemble of Shallow Networks
• Residual networks work as ensembles of shallow networks, where each subnetwork defines a path in the computational graph.
Consider a two-layer attention-only Transformer, where each attention head is characterized just by its OV matrix:
We can decompose the forward pass as:
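A sketch of this decomposition (following Elhage et al.), with A^{l,h} denoting the attention pattern of head h in layer l, X^0 the input embeddings, and W_U the unembedding matrix:

logits_i = x_i^0 W_U
         + \sum_h (A^{1,h} X^0)_i W_{OV}^{1,h} W_U
         + \sum_h (A^{2,h} X^0)_i W_{OV}^{2,h} W_U
         + \sum_{h_1, h_2} (A^{2,h_2} A^{1,h_1} X^0)_i W_{OV}^{1,h_1} W_{OV}^{2,h_2} W_U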

Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



Prediction as an Ensemble of Shallow Networks

• This term links the input embedding to the unembedding matrix and is referred to as the direct path.
• It shows the contribution of the input embedding towards the output logit of the next token to be predicted.

Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



Prediction as an Ensemble of Shallow Networks

• These terms depict paths traversing a single OV circuit, and are named full OV circuits.
• They show the contribution of each OV circuit towards the output logit of the next token to be predicted.

Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



Prediction as an Ensemble of Shallow Networks

• This term depicts the path involving both attention heads, and is referred to as a virtual attention head doing V-composition.
• It is called ‘composition’ since the sequential writing and reading by the two heads is seen as their OV matrices composing together.
• The amount of composition can be measured as:

||W_{OV}^1 W_{OV}^2||_F / ( ||W_{OV}^1||_F ||W_{OV}^2||_F )
Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



Prediction as an Ensemble of Shallow Networks
• In full Transformer models, Q-composition and K-composition, i.e., compositions of W_Q and W_K with the W_OV outputs of previous layers, can also be found.

• Such decomposition enables us to localize the inputs or model components responsible for a particular prediction.



Why Do We Need Such a Formulation?
• By decomposing the Transformer into simpler components, such as the query-key circuit W_QK and the output-value circuit W_OV, we can better understand the information flow within Transformer-based LLMs.
• This formulation reveals how each layer incrementally transforms token representations.
• It also shows how attention heads and feed-forward networks contribute to language modeling.
• Breaking down the contributions of individual circuits allows us to interpret which aspects of the model influence specific predictions.
Thus, through this formulation, the behavior of attention heads, the interaction between tokens, and the role of the residual stream can be explored more clearly.

