
An Alternate Formulation of Transformers
Residual Stream Perspective

Tanmoy Chakraborty
Associate Professor, IIT Delhi
https://tanmoychak.com/

Introduction to Large Language Models


Recall: Masked Self-Attention in Decoders
Self-Attention: Scaled dot-product attention

Attention(Q, K, V) = softmax( Q K^T / \sqrt{d_k} ) V

where Q = X W_Q, K = X W_K, and V = X W_V
Problem: While training autoregressive models (with a next-word-prediction objective), Transformers ‘can see the future’.
• For a current token x_i, the attention scores are computed with all tokens in the sequence, including those that come after x_i (as the whole sequence is available to us during training).
Solution: Masking



Recall: Masked Self-Attention in Decoders
Masking: ‘Masked’ scaled dot-product attention

MaskedAttention(Q, K, V) = softmax( Q K^T / \sqrt{d_k} + M ) V

where the masking matrix M is defined as:

M_{ij} = 0 if j ≤ i, and M_{ij} = -∞ if j > i

For future tokens, the attention scores become zero after applying the softmax [softmax(-∞) = 0].
• Effectively, after masking, the query is the current token x_i, and the keys and values come from the tokens before it, including itself (i.e., x_j, j ≤ i).
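A minimal NumPy sketch of this masked scaled dot-product attention for a single head (the function names and shapes are illustrative, not taken from the lecture):

import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability; exp(-inf) = 0,
    # so masked positions receive zero attention weight.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention(X, W_Q, W_K, W_V):
    # X: (seq_len, d_model); W_Q, W_K, W_V: (d_model, d_k)
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_len, seq_len)
    # Masking matrix M: 0 for j <= i, -inf for j > i, so token i
    # attends only to itself and the tokens before it.
    M = np.triu(np.full(scores.shape, -np.inf), k=1)
    A = softmax(scores + M)                       # attention weights; rows sum to 1
    return A @ V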



Re-writing the Masked Self-Attention Equation
Now let’s re-write the masked attention equation for a current token x_i.
• Assume that we are considering attention head h of layer l.
• Let’s denote the matrix of output hidden representations from layer k for the previous tokens x_j, j ≤ i, as X_{≤i}^k.
Thus, for calculating the attention scores of attention head h in layer l, the input to the attention sub-layer is the output representation from the previous layer l-1:
• Query: x_i^{l-1} W_Q^{l,h}
• Keys: X_{≤i}^{l-1} W_K^{l,h}
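Putting these together (a sketch that assumes the standard scaled dot-product form recalled earlier), the masked attention weights of token x_i for head h of layer l are:

a_i^{l,h} = softmax( (x_i^{l-1} W_Q^{l,h}) (X_{≤i}^{l-1} W_K^{l,h})^T / \sqrt{d_k} )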
Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



QK Circuit


QK (query-key) circuit: W_{QK}^h = W_Q^h (W_K^h)^T
• QK circuits are responsible for reading from the residual stream.
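With this grouping, the pre-softmax score between the query of token x_i and the key of an earlier token x_j is simply a rewriting of q_i k_j^T (layer indices dropped for readability):

(x_i W_Q^h)(x_j W_K^h)^T / \sqrt{d_k} = x_i W_{QK}^h x_j^T / \sqrt{d_k}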

Let’s now look at the residual stream


Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



Residual Stream Perspective
• Each input embedding gets updated via vector additions from the attention and feed-forward blocks, producing residual stream states (or intermediate representations).

• The final-layer residual stream state is then projected into the vocabulary space via the unembedding matrix W_U ∈ R^{d×|V|} and normalized via the softmax.
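A minimal sketch of this residual stream view, in the same NumPy style as before (the layer objects with attention and ffn methods are illustrative placeholders; layer normalization and biases are omitted):

def residual_stream_forward(x, layers, W_U):
    # x: (seq_len, d_model) input embeddings, i.e., the initial residual stream states
    for layer in layers:
        x = x + layer.attention(x)   # attention block writes its output into the stream
        x = x + layer.ffn(x)         # feed-forward block writes its output into the stream
    logits = x @ W_U                 # unembedding: project final states into vocabulary space
    return softmax(logits)           # normalize via softmax (helper from the earlier sketch)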

Elhage et al., A Mathematical Framework for Transformer Circuits



Combining the Output of Multiple Attention Heads
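As a sketch of the form used by Ferrando et al., the output of head h in layer l for token x_i can be written in terms of its masked attention weights a_{i,j}^{l,h} as:

z_i^{l,h} = \sum_{j ≤ i} a_{i,j}^{l,h} x_j^{l-1} W_V^{l,h} W_O^{l,h}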

Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



OV Circuit

OV (output-value) circuit: W_{OV}^{l,h} = W_V^{l,h} W_O^{l,h}

• OV circuits are responsible for writing to the residual stream.

Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



Attention Block Output
The attention block output is the sum of the outputs of the individual attention heads, and it is subsequently added back into the residual stream.
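In symbols (a sketch consistent with the per-head outputs above, assuming H heads per layer):

Attn^l(X_{≤i}^{l-1}) = \sum_{h=1}^{H} z_i^{l,h},    x_i^{mid,l} = x_i^{l-1} + Attn^l(X_{≤i}^{l-1})

where x_i^{mid,l} is the residual stream state that the FFN block of layer l reads from.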

Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



Feed-Forward Network (FFN)

• W_in^l reads from the residual stream state x_i^{mid,l}.
• Its result is passed through an element-wise non-linear activation function g, producing the neuron activations.
• These get transformed by W_out^l to produce the output FFN^l(x_i^{mid,l}), which is then added back to the residual stream.
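As an equation (a sketch matching this description, with biases omitted):

FFN^l(x_i^{mid,l}) = g( x_i^{mid,l} W_in^l ) W_out^l,    x_i^l = x_i^{mid,l} + FFN^l(x_i^{mid,l})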
Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



Prediction as a Sum of Component Outputs
• The prediction head of a Transformer consists of an unembedding matrix W_U ∈ R^{d×|V|}.
We can rearrange the traditional forward-pass formulation to separate the contribution of each model component to the output logits:
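A sketch of this rearrangement (ignoring layer normalization for readability), obtained by unrolling the residual stream into everything that was written into it:

x_i^L = x_i^0 + \sum_{l=1}^{L} Attn^l(X_{≤i}^{l-1}) + \sum_{l=1}^{L} FFN^l(x_i^{mid,l})

logits(x_i) = x_i^L W_U = x_i^0 W_U + \sum_{l=1}^{L} Attn^l(X_{≤i}^{l-1}) W_U + \sum_{l=1}^{L} FFN^l(x_i^{mid,l}) W_U

so every attention block and every FFN block contributes its own additive term to the output logits.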

Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



Prediction as an Ensemble of Shallow Networks
• Residual networks work as ensembles of shallow networks, where each subnetwork defines a path in the computational graph.
Consider a two-layer attention-only Transformer, where each attention head is characterized just by its OV matrix:
We can decompose the forward pass as:
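A sketch of this decomposition (following Elhage et al.), with A^{l,h} denoting the attention pattern of head h in layer l, X^0 the input embeddings, and W_U the unembedding matrix:

logits_i = x_i^0 W_U
         + \sum_h (A^{1,h} X^0)_i W_{OV}^{1,h} W_U
         + \sum_h (A^{2,h} X^0)_i W_{OV}^{2,h} W_U
         + \sum_{h_1, h_2} (A^{2,h_2} A^{1,h_1} X^0)_i W_{OV}^{1,h_1} W_{OV}^{2,h_2} W_U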

Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



Prediction as an Ensemble of Shallow Networks

• This term links the input embedding to the unembedding matrix and is referred to as the direct path.
• It shows the contribution of the input embedding towards the output logit of the next token to be predicted.

Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



Prediction as an Ensemble of Shallow Networks

• These terms depict paths traversing a single OV circuit, and are named full OV circuits.
• They show the contribution of each OV circuit towards the output logit of the next token to be predicted.

Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



Prediction as an Ensemble of Shallow Networks

• This term depicts the path involving both attention heads, and is referred to as a virtual attention head doing V-composition.
• It is called ‘composition’ since the sequential writing and reading by the two heads is seen as their OV matrices composing together.
• The amount of composition can be measured as:

||W_{OV}^1 W_{OV}^2||_F / ( ||W_{OV}^1||_F ||W_{OV}^2||_F )
Ferrando et al., A Primer on the Inner Workings of Transformer-based Language Models



Prediction as an Ensemble of Shallow Networks
• In full Transformer models, Q-composition and K-composition, i.e., compositions of W_Q and W_K with the W_OV outputs of previous layers, can also be found.

• Such decomposition enables us to localize the inputs or model components responsible for a particular prediction.



Why Do We Need Such a Formulation?
• By decomposing the Transformer into simpler components, such as the query-key circuit W_QK and the output-value circuit W_OV, we can better understand the information flow within Transformer-based LLMs.
• This formulation reveals how each layer incrementally transforms token representations.
• It also shows how attention heads and feed-forward networks contribute to language modeling.
• Breaking down the contributions of individual circuits allows us to interpret which aspects of the model influence specific predictions.
Thus, through this formulation, the behavior of attention heads, the interaction between tokens, and the role of the residual stream can be explored more clearly.

