Lecture 10 Recap
LeNet
• Digit recognition: 10 classes, ~60k parameters
• Conv -> Pool -> Conv -> Pool -> Conv -> FC
• As we go deeper: width and height decrease, number of filters increases
AlexNet
• Softmax for 1000 classes
[Krizhevsky et al., NIPS’12] AlexNet
VGGNet
• Striving for simplicity
– Conv -> Pool -> Conv -> Pool -> Conv -> FC
– Conv=3x3, s=1, same; Maxpool=2x2, s=2
• As we go deeper: width and height decrease, number of filters increases
• Called VGG-16: 16 layers with weights, 138M parameters
• Large, but its simplicity makes it appealing
[Simonyan et al., ICLR’15] VGGNet
Residual Block
• Two layers: Input 𝑥𝐿−1 -> Linear -> 𝑥𝐿 -> Linear -> 𝑥𝐿+1
• Without skip connection: 𝑥𝐿+1 = 𝑓(𝑊𝐿+1 𝑥𝐿 + 𝑏𝐿+1)
• With skip connection: 𝑥𝐿+1 = 𝑓(𝑊𝐿+1 𝑥𝐿 + 𝑏𝐿+1 + 𝑥𝐿−1)
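A minimal PyTorch sketch of such a two-layer residual block with fully connected layers (module name, dimensions, and the choice of ReLU for 𝑓 are assumptions, not from the slides):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two linear layers with a skip connection from the block input."""
    def __init__(self, dim):
        super().__init__()
        self.linear1 = nn.Linear(dim, dim)
        self.linear2 = nn.Linear(dim, dim)

    def forward(self, x):
        out = torch.relu(self.linear1(x))   # x_L = f(W_L x_{L-1} + b_L)
        out = self.linear2(out) + x         # add the skip connection x_{L-1}
        return torch.relu(out)              # x_{L+1} = f(W_{L+1} x_L + b_{L+1} + x_{L-1})

x = torch.randn(8, 64)
y = ResidualBlock(64)(x)
```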
Inception Layer
[Szegedy et al., CVPR’15] GoogLeNet
Lecture 11
Transfer Learning
Transfer Learning
• Training your own model can be difficult with limited data and other resources
  – e.g., it is a laborious task to manually annotate your own training dataset
• Why not reuse already pre-trained models?
Transfer Learning
• Train on a large dataset drawn from a distribution P1, then use what has been learned for another setting: a small dataset drawn from a different distribution P2
Transfer Learning for Images
[Zeiler et al., ECCV’14] Visualizing and Understanding Convolutional Networks
Transfer Learning
• A network trained on ImageNet can be reused for feature extraction on a new task
[Donahue et al., ICML’14] DeCAF,
[Razavian et al., CVPRW’14] CNN Features off-the-shelf
Transfer Learning
• A network trained on ImageNet learns a feature hierarchy:
  – Edges
  – Simple geometrical shapes (circles, etc.)
  – Parts of an object (wheel, window)
  – Decision layers
[Donahue et al., ICML’14] DeCAF,
[Razavian et al., CVPRW’14] CNN Features off-the-shelf
Transfer Learning
• Keep the layers pre-trained on ImageNet FROZEN and TRAIN only new decision layers on the new dataset with C classes
[Donahue et al., ICML’14] DeCAF,
[Razavian et al., CVPRW’14] CNN Features off-the-shelf
Transfer Learning
• If the new dataset is big enough, TRAIN more layers with a low learning rate and keep only the early layers FROZEN
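A hedged sketch of this recipe in PyTorch/torchvision (the `weights` string assumes torchvision >= 0.13; the class count C and the choice of which block to unfreeze are placeholders):

```python
import torch.nn as nn
from torchvision import models

# Load a model pre-trained on ImageNet (older torchvision versions use pretrained=True).
model = models.resnet18(weights="IMAGENET1K_V1")

for param in model.parameters():                 # FROZEN: keep pre-trained features fixed
    param.requires_grad = False

C = 10                                           # number of classes in the new, small dataset
model.fc = nn.Linear(model.fc.in_features, C)    # TRAIN: a fresh decision layer

# If the new dataset is big enough, also unfreeze later blocks and fine-tune them
# with a low learning rate:
# for param in model.layer4.parameters():
#     param.requires_grad = True
```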
When Transfer Learning Makes Sense
• When tasks T1 and T2 have the same kind of input (e.g., an RGB image)
• When you have more data for task T1 than for task T2
• When the low-level features for T1 could be useful to
learn T2
Now you are:
• Ready to perform image classification on any dataset
• Ready to design your own architecture
• Ready to deal with other problems such as semantic
segmentation (Fully Convolutional Network)
Representation Learning
Learning Good Features
• Good features are essential for successful machine learning
• (Supervised) deep learning depends on the training data used: inputs and target labels
• Changes in the inputs (noise, irregularities, etc.) can result in drastically different results
Representation Learning
• Allows for discovery of representations required for
various tasks
• Deep representation learning: model maps input 𝑋 to
output 𝑌
Deep Representation Learning
• Intuitively, deep networks learn multiple levels of
abstraction
How to Learn Good Features?
• Determine desired feature invariances
• Teach machines to distinguish between similar and
dissimilar things
Source: https://amitness.com/2020/03/illustrated-simclr/
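For illustration, a minimal sketch of a SimCLR-style contrastive objective (NT-Xent) in PyTorch; the function name and temperature are assumptions, and the full SimCLR pipeline additionally includes data augmentations and a projection head:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """Contrastive loss sketch for two augmented views of the same N images.

    z1, z2: (N, D) embeddings. (i, i) are positive pairs; all other embeddings
    in the batch act as negatives ("dissimilar things").
    """
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2N x D, unit length
    sim = z @ z.t() / temperature                        # cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # ignore self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])  # index of the positive
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(16, 128), torch.randn(16, 128))
```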
How to Learn Good Features?
[Chen et al., ICML’20] SimCLR
Source: https://amitness.com/2020/03/illustrated-simclr/
Apply to Downstream Tasks
[Chen et al., ICML’20] SimCLR
Source: https://amitness.com/2020/03/illustrated-simclr/
Transfer & Representation Learning
• Transfer learning can be done via representation
learning
• Effectiveness of representation learning often
demonstrated by transfer learning performance (but
also other factors, e.g., smoothness of the manifold)
Recurrent Neural Networks
Processing Sequences
• Recurrent neural networks process sequence data
• Input/output can be sequences
RNNs are Flexible
• Classical neural networks for image classification (one to one)
• Image captioning (one to many)
• Language recognition (many to one)
• Machine translation (many to many)
• Event classification (many to many)
Source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Basic Structure of an RNN
• Multi-layer RNN: inputs -> hidden states -> outputs
Basic Structure of an RNN
• Multi-layer RNN: the hidden states have their own internal dynamics -> a more expressive model!
Basic Structure of an RNN
• We want to have a notion of “time” or “sequence”
𝑨𝑡 = 𝜽𝑐 𝑨𝑡−1 + 𝜽𝑥 𝒙𝑡
(𝑨𝑡: hidden state, 𝑨𝑡−1: previous hidden state, 𝒙𝑡: input)
[Olah, https://colah.github.io ’15] Understanding LSTMs
Basic Structure of an RNN
• We want to have a notion of “time” or “sequence”
𝑨𝑡 = 𝜽𝑐 𝑨𝑡−1 + 𝜽𝑥 𝒙𝑡
(𝑨𝑡: hidden state; 𝜽𝑐, 𝜽𝑥: parameters to be learned)
[Olah, https://colah.github.io ’15] Understanding LSTMs
Basic Structure of an RNN
• We want to have a notion of “time” or “sequence”
Hidden state: 𝑨𝑡 = 𝜽𝑐 𝑨𝑡−1 + 𝜽𝑥 𝒙𝑡
Output: 𝒉𝑡 = 𝜽𝒉 𝑨𝑡
(Note: non-linearities ignored for now)
[Olah, https://colah.github.io ’15] Understanding LSTMs
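A tiny PyTorch sketch of exactly this linear recurrence (non-linearities ignored, as noted above); all sizes are made up for the example:

```python
import torch

theta_c = torch.randn(8, 8)          # hidden-to-hidden weights
theta_x = torch.randn(8, 4)          # input-to-hidden weights
theta_h = torch.randn(3, 8)          # hidden-to-output weights

A = torch.zeros(8)                   # initial hidden state A_0
for x_t in torch.randn(5, 4):        # a sequence of 5 inputs
    A = theta_c @ A + theta_x @ x_t  # A_t = theta_c A_{t-1} + theta_x x_t
    h = theta_h @ A                  # h_t = theta_h A_t
```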
Basic Structure of an RNN
• We want to have a notion of “time” or “sequence”
𝑨𝑡 = 𝜽𝑐 𝑨𝑡−1 + 𝜽𝑥 𝒙𝑡,   𝒉𝑡 = 𝜽𝒉 𝑨𝑡
• Same parameters for each time step = generalization!
[Olah, https://colah.github.io ’15] Understanding LSTMs
Basic Structure of an RNN
• Unrolling RNNs: the same function is applied to the hidden layers at every time step
[Olah, https://colah.github.io ’15] Understanding LSTMs
Basic Structure of an RNN
• Unrolling RNNs as feedforward nets
Weights are the same!
Backprop through an RNN
• Unrolling RNNs as feedforward nets
– Chain rule, all the way back to 𝑡 = 0
– Add up the derivatives at the different time steps for each weight
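A minimal PyTorch sketch of backpropagation through an unrolled RNN: the same weights are reused at every step, and `backward()` applies the chain rule all the way back to t = 0, summing the contributions of all time steps into one gradient per weight (shapes and the tanh non-linearity are assumptions):

```python
import torch

theta_c = torch.randn(8, 8, requires_grad=True)   # hidden-to-hidden weights
theta_x = torch.randn(8, 4, requires_grad=True)   # input-to-hidden weights

A = torch.zeros(8)                                # initial hidden state A_0
for x_t in torch.randn(10, 4):                    # unroll over 10 time steps
    A = torch.tanh(theta_c @ A + theta_x @ x_t)   # the SAME weights at every step

loss = A.sum()                                    # dummy loss on the final state
loss.backward()                                   # chain rule all the way to t = 0
# theta_c.grad and theta_x.grad now hold the summed contributions of all time steps
```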
Long-term Dependencies
I moved to Germany … so I speak German fluently.
[Olah, https://colah.github.io ’15] Understanding LSTMs
Long-term Dependencies
• Simple recurrence: 𝑨𝑡 = 𝜽𝑐 𝑨𝑡−1 + 𝜽𝑥 𝒙𝑡
• Let us forget the input: 𝑨𝑡 = (𝜽𝑐)ᵗ 𝑨0
  – The same weights are multiplied over and over again
Long-term Dependencies
• Simple recurrence: 𝑨𝑡 = (𝜽𝑐)ᵗ 𝑨0
  – What happens to small weights? Vanishing gradients
  – What happens to large weights? Exploding gradients
Long-term Dependencies
• Simple recurrence: 𝑨𝑡 = (𝜽𝑐)ᵗ 𝑨0
• If 𝜽𝑐 admits an eigendecomposition 𝜽𝑐 = 𝑸𝚲𝑸ᵀ, where 𝑸 is the matrix of eigenvectors and the diagonal of 𝚲 contains the eigenvalues
Long-term Dependencies
• Simple recurrence: 𝑨𝑡 = (𝜽𝑐)ᵗ 𝑨0
• If 𝜽𝑐 admits an eigendecomposition 𝜽𝑐 = 𝑸𝚲𝑸ᵀ with orthogonal 𝑸, the recurrence simplifies to
  𝑨𝑡 = 𝑸𝚲ᵗ𝑸ᵀ 𝑨0
Long-term Dependencies
• Simple recurrence: 𝑨𝑡 = 𝑸𝚲ᵗ𝑸ᵀ 𝑨0
  – What happens to eigenvalues with magnitude less than one? Vanishing gradients
  – What happens to eigenvalues with magnitude larger than one? Exploding gradients -> gradient clipping
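A short PyTorch sketch of gradient clipping; the model, loss, and `max_norm` value are placeholders:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(2, 50, 4)            # long sequences -> risk of exploding gradients
output, h_n = model(x)
loss = output.pow(2).mean()          # dummy loss

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
optimizer.step()
```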
Long-term Dependencies
• Simple recurrence: 𝑨𝑡 = (𝜽𝑐)ᵗ 𝑨0
• Idea: use a matrix with eigenvalues equal to 1 -> allow the cell to maintain its “state”
Vanishing Gradient
1. From the weights: 𝑨𝑡 = (𝜽𝑐)ᵗ 𝑨0
2. From the activation functions (𝑡𝑎𝑛ℎ)
[Olah, https://colah.github.io ’15] Understanding LSTMs
Vanishing Gradient
1. From the weights: 𝑨𝑡 = (𝜽𝑐)ᵗ 𝑨0
2. From the activation functions (𝑡𝑎𝑛ℎ): the derivative of tanh is at most 1 and approaches 0 when the input saturates
[Olah, https://colah.github.io ’15] Understanding LSTMs
Long Short-Term Memory
[Hochreiter et al., Neural Computation’97] Long Short-Term Memory
Long Short-Term Memory Units
• A simple RNN has tanh as its non-linearity
[Olah, https://colah.github.io ’15] Understanding LSTMs
Long Short-Term Memory Units (LSTM)
[Olah, https://colah.github.io ’15] Understanding LSTMs
Long Short-Term Memory Units
• Key ingredients
  – Cell = transports the information through the unit
[Olah, https://colah.github.io ’15] Understanding LSTMs
Long Short-Term Memory Units
• Key ingredients
  – Cell = transports the information through the unit
  – Gate = removes or adds information to the cell state (uses a sigmoid)
[Olah, https://colah.github.io ’15] Understanding LSTMs
LSTM: Step by Step
• Forget gate 𝒇𝑡 = 𝑠𝑖𝑔𝑚(𝜽𝑥𝑓 𝒙𝑡 + 𝜽ℎ𝑓 𝒉𝑡−1 + 𝒃𝑓 )
  – Decides when to erase the cell state
  – Sigmoid = output between 0 (forget) and 1 (keep)
[Olah, https://colah.github.io ’15] Understanding LSTMs
LSTM: Step by Step
• Input gate 𝒊𝑡 = 𝑠𝑖𝑔𝑚(𝜽𝑥𝑖 𝒙𝑡 + 𝜽ℎ𝑖 𝒉𝑡−1 + 𝒃𝑖 )
  – Decides which values will be updated
  – The new cell-state candidate 𝒈𝑡 is the output of a tanh, in (−1, 1)
[Olah, https://colah.github.io ’15] Understanding LSTMs
LSTM: Step by Step
• Cell state update (element-wise operations):
𝑪𝑡 = 𝒇𝑡 ⊙ 𝑪𝑡−1 + 𝒊𝑡 ⊙ 𝒈𝑡
  – combines the previous cell state 𝑪𝑡−1 with the current candidate 𝒈𝑡
[Olah, https://colah.github.io ’15] Understanding LSTMs
LSTM: Step by Step
• Output gate 𝒉𝑡 = 𝒐𝑡 ⊙ tanh 𝑪𝑡
  – Decides which values will be output
  – Output from a tanh, in (−1, 1)
[Olah, https://colah.github.io ’15] Understanding LSTMs
LSTM: Step by Step
• Forget gate 𝒇𝑡 = 𝑠𝑖𝑔𝑚(𝜽𝑥𝑓 𝒙𝑡 + 𝜽ℎ𝑓 𝒉𝑡−1 + 𝒃𝑓 )
• Input gate 𝒊𝑡 = 𝑠𝑖𝑔𝑚(𝜽𝑥𝑖 𝒙𝑡 + 𝜽ℎ𝑖 𝒉𝑡−1 + 𝒃𝑖 )
• Output gate 𝒐𝑡 = 𝑠𝑖𝑔𝑚(𝜽𝑥𝑜 𝒙𝑡 + 𝜽ℎ𝑜 𝒉𝑡−1 + 𝒃𝑜 )
• Cell update 𝒈𝑡 = 𝑡𝑎𝑛ℎ(𝜽𝑥𝑔 𝒙𝑡 + 𝜽ℎ𝑔 𝒉𝑡−1 + 𝒃𝑔 )
• Cell 𝑪𝑡 = 𝒇𝑡 ⊙𝑪𝑡−1 +𝒊𝑡 ⊙𝒈𝑡
• Output 𝒉𝑡 = 𝒐𝑡 ⊙ tanh 𝑪𝑡
LSTM: Step by Step
• Forget gate 𝒇𝑡 = 𝑠𝑖𝑔𝑚(𝜽𝑥𝑓 𝒙𝑡 + 𝜽ℎ𝑓 𝒉𝑡−1 + 𝒃𝑓 )
• Input gate 𝒊𝑡 = 𝑠𝑖𝑔𝑚(𝜽𝑥𝑖 𝒙𝑡 + 𝜽ℎ𝑖 𝒉𝑡−1 + 𝒃𝑖 )
• Output gate 𝒐𝑡 = 𝑠𝑖𝑔𝑚(𝜽𝑥𝑜 𝒙𝑡 + 𝜽ℎ𝑜 𝒉𝑡−1 + 𝒃𝑜 )
• Cell update 𝒈𝑡 = 𝑡𝑎𝑛ℎ(𝜽𝑥𝑔 𝒙𝑡 + 𝜽ℎ𝑔 𝒉𝑡−1 + 𝒃𝑔 )
• Cell 𝑪𝑡 = 𝒇𝑡 ⊙𝑪𝑡−1 +𝒊𝑡 ⊙𝒈𝑡
• Output 𝒉𝑡 = 𝒐𝑡 ⊙ tanh 𝑪𝑡
All of these parameters are learned through backpropagation.
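A minimal sketch of one LSTM step in PyTorch that mirrors these equations one-to-one (parameter shapes and sizes are assumptions):

```python
import torch

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the equations above."""
    Wxf, Whf, bf, Wxi, Whi, bi, Wxo, Who, bo, Wxg, Whg, bg = params
    f_t = torch.sigmoid(Wxf @ x_t + Whf @ h_prev + bf)   # forget gate
    i_t = torch.sigmoid(Wxi @ x_t + Whi @ h_prev + bi)   # input gate
    o_t = torch.sigmoid(Wxo @ x_t + Who @ h_prev + bo)   # output gate
    g_t = torch.tanh(Wxg @ x_t + Whg @ h_prev + bg)      # cell update
    c_t = f_t * c_prev + i_t * g_t                       # cell state
    h_t = o_t * torch.tanh(c_t)                          # output / hidden state
    return h_t, c_t

d_in, d_h = 4, 8
params = [torch.randn(d_h, d_in) if i % 3 == 0 else      # W_x* matrices
          torch.randn(d_h, d_h) if i % 3 == 1 else       # W_h* matrices
          torch.randn(d_h) for i in range(12)]           # biases
h, c = lstm_step(torch.randn(d_in), torch.zeros(d_h), torch.zeros(d_h), params)
```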
LSTM
• Highway for the gradient to flow
[Olah, https://colah.github.io ’15] Understanding LSTMs
LSTM: Dimensions
• Cell update 𝒈𝑡 = 𝑡𝑎𝑛ℎ(𝜽𝑥𝑔 𝒙𝑡 + 𝜽ℎ𝑔 𝒉𝑡−1 + 𝒃𝑔 )
• When coding an LSTM, we have to define the size of the hidden state (e.g., 128); the dimensions of all gates and states need to match
• What operation do we need to apply to the input to get a 128-dimensional vector representation?
[Olah, https://colah.github.io ’15] Understanding LSTMs
LSTM in code
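A hedged PyTorch sketch using `nn.LSTM` with a hidden size of 128, as on the previous slide. Internally, the 𝜽𝑥· matrices are learned linear projections, i.e., a matrix multiplication is the operation that maps each input vector to the 128-dimensional hidden representation. Input size, batch size, and sequence length are made up:

```python
import torch
import torch.nn as nn

# Hidden state of size 128 as on the previous slide.
lstm = nn.LSTM(input_size=10, hidden_size=128, num_layers=1, batch_first=True)

x = torch.randn(4, 25, 10)       # 4 sequences, 25 time steps, 10 features each
output, (h_n, c_n) = lstm(x)     # output: (4, 25, 128); h_n, c_n: (1, 4, 128)
```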
Attention
Attention is all you need
• ~62,000 citations in 5 years!
Attention vs convolution
Long-Term Dependencies
I moved to Germany … so I speak German fluently.
Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Attention: Intuition
Context: I moved to Germany … so I speak German fluently.
Attention: Architecture
• A decoder processes the information
• Decoders take as input:
  – the previous decoder hidden state
  – the previous output
  – attention (the context)
Transformers
Deep Learning Revolution
                   Deep Learning        Deep Learning 2.0
Main idea          Convolution          Attention
Field invented     Computer vision      NLP
Started            NeurIPS 2012         NeurIPS 2017
Paper              AlexNet              Transformers
Conquered vision   Around 2014-2015     Around 2020-2021
Replaced           Traditional ML/CV    CNNs, RNNs (augmented)
Transformers
• Fully connected layers
• Multi-Head Attention on the “encoder”
• Masked Multi-Head Attention on the “decoder”
Multi-Head Attention
Intuition: take the query Q, find the most similar key K, and then find the value V that corresponds to that key.

In other words, learn V, K, Q where:
  V – here is a bunch of interesting things.
  K – here is how we can index some things.
  Q – I would like to know this interesting thing.

Loosely connected to Neural Turing Machines (Graves et al.).
Multi-Head Attention
• Multiply the queries with the keys to index the values via a differentiable operator, then gather the values:

Attention(𝑄, 𝐾, 𝑉) = softmax(𝑄𝐾ᵀ / √𝑑𝑘) 𝑉

• To train this well, divide by √𝑑𝑘, “probably” because for large values of the keys’ dimension the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients.
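A direct implementation sketch of this formula in PyTorch (function name and shapes are assumptions):

```python
import math
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # query-key similarities
    weights = F.softmax(scores, dim=-1)                # distribution over the values
    return weights @ V                                 # weighted sum of the values

out = attention(torch.randn(10, 64), torch.randn(10, 64), torch.randn(10, 64))
```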
Multi-Head Attention
Adapted from Y. Kilcher
Multi-Head Attention
(Figure: a query Q is compared against keys K1–K5.)
Multi-Head Attention
(Figure: each key K1–K5 has an associated value V1–V5; the query Q is compared against the keys.)
Multi-Head Attention
Essentially, we take the dot products ⟨Q, K1⟩, ⟨Q, K2⟩, ⟨Q, K3⟩, ⟨Q, K4⟩, ⟨Q, K5⟩.
Multi-Head Attention
softmax(𝑄𝐾ᵀ / √𝑑𝑘) simply induces a distribution over the values: the larger a score is, the higher its softmax weight. It can be interpreted as differentiable soft indexing.
Multi-Head Attention
softmax(𝑄𝐾ᵀ / √𝑑𝑘) selects the values V where the network needs to attend.
Transformers – a closer look
K parallel attention heads.
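For illustration, PyTorch's built-in multi-head attention module run as self-attention; d_model = 512 and 8 heads follow the paper, the other shapes are made up (`batch_first` requires a recent PyTorch):

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(2, 10, d_model)                   # batch of 2 sequences of length 10
out, attn_weights = mha(query=x, key=x, value=x)  # self-attention: Q, K, V from the same input
```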
Transformers – a closer look
Good old fully-connected layers.
Transformers – a closer look
N layers of attention followed by fully-connected layers.
Transformers – a closer look
Same as multi-head attention, but masked. This ensures that the predictions for position i can depend only on the known outputs at positions less than i.
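A minimal sketch of such a causal mask in PyTorch: positions above the diagonal are set to −∞ before the softmax, so each position can only attend to itself and earlier positions (the scores here are dummy values):

```python
import torch

T = 5                                                               # sequence length
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)   # True above the diagonal
scores = torch.randn(T, T)                                          # dummy attention scores
scores = scores.masked_fill(mask, float("-inf"))                    # position i cannot see positions > i
weights = torch.softmax(scores, dim=-1)                             # rows sum to 1 over allowed positions
```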
Transformers – a closer look
Multi-head attention between the encoder and the decoder.
Transformers – a closer look
Projection and prediction.
What is missing from self-attention?
• Convolution: a different linear transformation for each relative position, which allows the model to distinguish what information came from where.
• Self-attention: a weighted average.
Transformers – a closer look
Uses a fixed positional encoding based on trigonometric functions, so that the model can make use of the order of the sequence:

𝑃𝐸(𝑝𝑜𝑠, 2𝑖) = sin(𝑝𝑜𝑠 / 10000^(2𝑖/𝑑model))
𝑃𝐸(𝑝𝑜𝑠, 2𝑖+1) = cos(𝑝𝑜𝑠 / 10000^(2𝑖/𝑑model))
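A small PyTorch sketch computing this encoding (the function name is made up; assumes an even d_model):

```python
import torch

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding following the formulas above."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    angles = pos / (10000 ** (i / d_model))                         # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)                                 # sin on even dims
    pe[:, 1::2] = torch.cos(angles)                                 # cos on odd dims
    return pe

pe = positional_encoding(max_len=50, d_model=512)
```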
Transformers – a final look
Self-attention: complexity
Per-layer complexity: self-attention is O(n²·d), a recurrent layer is O(n·d²), a convolutional layer is O(k·n·d²), and restricted self-attention is O(r·n·d), where n is the sequence length, d is the representation dimension, k is the convolutional kernel size, and r is the size of the neighborhood.
Self-attention: complexity
Considering that most sentences are shorter (n) than the representation dimension (d = 512 in the paper), self-attention is very efficient.
Transformers – training tricks
• Adam optimizer with a learning-rate schedule that warms up and then decays proportionally to the inverse square root of the step (see the sketch after this list)
• Residual dropout
• Label smoothing
• Checkpoint averaging
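A plain-Python sketch of that learning-rate schedule; the default d_model and warmup_steps follow the values reported in the paper, and the function name is made up:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warm-up followed by inverse square-root decay."""
    step = max(step, 1)                       # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Usage: scale the optimizer's base learning rate by transformer_lr(step) each step.
print(transformer_lr(100), transformer_lr(4000), transformer_lr(40000))
```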
Transformers - results
Transformers - summary
• Significantly improved SOTA in machine translation
• Launched a new deep-learning revolution in NLP
• Building block of NLP models like BERT (Google) or
GPT/ChatGPT (OpenAI)
• BERT has been heavily used in Google Search
• And eventually made its way to computer vision (and
other related fields)
See you next time!