Automated Image Captioning with
ConvNets and Recurrent Nets
Andrej Karpathy, Fei-Fei Li
natural language
“images of me scuba diving next to turtle”
Very hard task
(to a Neural Networks practitioner it might as well read: “vzntrf bs zr fphon qvivat arkg gb ghegyr”)
Describing images
Recurrent Neural Network
Convolutional Neural Network
Convolutional Neural Networks
image (32x32 numbers) -> differentiable function -> class probabilities (10 numbers)
[LeCun et al., 1998]
[Krizhevsky, Sutskever, Hinton, 2012] 16.4% error
[Zeiler and Fergus, 2013] 11.1% error
[Simonyan and Zisserman, 2014] 7.3% error
[Szegedy et al., 2014] 6.6% error
Human error: ~5.1%
Optimistic human error: ~3%
read more on my blog:
karpathy.github.io
“Very Deep Convolutional Networks for Large-Scale Visual Recognition”
[Simonyan and Zisserman, 2014]
“VGGNet” or “OxfordNet”
Very simple and homogeneous.
(And available in Caffe.)
A stack of CONV, POOL and FULLY-CONNECTED layers maps the image [224x224x3] to class scores [1000].
Every layer of a ConvNet has the same API:
- Takes a 3D volume of numbers
- Outputs a 3D volume of numbers
- Constraint: function must be differentiable
image [224x224x3] -> ... -> probabilities [1x1x1000]
Fully Connected Layer
[7x7x512] -> [1x1x4096] “neurons”
Every “neuron” in the output:
1. computes a dot product between the input and its weights
2. thresholds it at zero
The whole layer can be implemented very efficiently as:
1. a single matrix multiply
2. elementwise thresholding at zero
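The two points above in a minimal numpy sketch (sizes follow the slide; the weights here are random stand-ins, not trained VGGNet parameters):

```python
import numpy as np

# Fully connected layer as: (1) a single matrix multiply, (2) elementwise
# thresholding at zero (ReLU). Sizes follow the slide: [7x7x512] -> [1x1x4096].
def fc_layer(x_volume, W, b):
    x = x_volume.reshape(-1)             # flatten [7x7x512] to a 25088-vector
    return np.maximum(0, W.dot(x) + b)   # dot products + threshold at zero

x_volume = np.random.randn(7, 7, 512)    # random stand-in for the input volume
W = 0.01 * np.random.randn(4096, 7 * 7 * 512)
b = np.zeros(4096)
out = fc_layer(x_volume, W, b)           # 4096 "neurons"
```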
Convolutional Layer
(figure: [224x224x3] input volume, D=3 -> [224x224x64] output volume)
Every blue neuron is connected to a 3x3x3 array of inputs.
Can be implemented efficiently with convolutions.
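A naive (and slow) numpy sketch of this connectivity, assuming 3x3 filters, stride 1 and zero padding of 1; a real implementation would use optimized convolution routines:

```python
import numpy as np

# Each output neuron looks at a 3x3xD window of the (zero-padded) input; one
# filter per output depth slice. ReLU is applied by a separate layer.
def conv_layer(x, filters, b):
    H, W, D = x.shape                       # e.g. 224 x 224 x 3
    F = filters.shape[0]                    # e.g. 64 filters of shape [3, 3, D]
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, F))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + 3, j:j + 3, :]             # 3x3xD array of inputs
            out[i, j, :] = (filters * patch).reshape(F, -1).sum(axis=1) + b
    return out

x = np.random.randn(16, 16, 3)              # small spatial size to keep it quick
filters = 0.01 * np.random.randn(64, 3, 3, 3)
y = conv_layer(x, filters, np.zeros(64))    # -> [16 x 16 x 64]
```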
Pooling Layer
Performs (spatial) downsampling, e.g. [224x224x64] -> [112x112x64]
Max Pooling Layer
Single depth slice, max pool with 2x2 filters and stride 2:

1 1 2 4          6 8
5 6 7 8    ->    3 4
3 2 1 0
1 2 3 4
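The same example as a short numpy sketch:

```python
import numpy as np

# Max pooling over non-overlapping 2x2 blocks of a single depth slice.
def max_pool_2x2(x):
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool_2x2(x))   # [[6 8]
                         #  [3 4]]
```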
What do the neurons learn?
[Taken from Yann LeCun slides]
Example activation maps
CONV/ReLU - CONV/ReLU - POOL - CONV/ReLU - CONV/ReLU - POOL - CONV/ReLU - CONV/ReLU - POOL - FC (Fully-connected)
(tiny VGGNet trained with ConvNetJS)
image [224x224x3] -> differentiable function -> [1000] class probabilities
e.g. cat 0.2, dog 0.4, chair 0.09, bagel 0.01, banana 0.3, ...
Training
Loop until tired:
1. Sample a batch of data
2. Forward it through the network to get predictions
3. Backprop the errors
4. Update the weights
[image credit: Karen Simonyan]
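The four steps above in a minimal, self-contained numpy sketch; the tiny linear softmax classifier and the random data are stand-ins for the real ConvNet and dataset:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 50)              # fake dataset: 1000 examples, 50 features
y = np.random.randint(0, 10, size=1000)    # fake labels for 10 classes
W = 0.01 * np.random.randn(50, 10)
lr, batch_size = 0.1, 32

for step in range(100):                    # "loop until tired"
    # 1. sample a batch of data
    idx = np.random.choice(len(X), batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    # 2. forward it through the network to get predictions
    scores = xb.dot(W)
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    # 3. backprop the errors (gradient of the cross-entropy loss)
    dscores = probs.copy()
    dscores[np.arange(batch_size), yb] -= 1
    dW = xb.T.dot(dscores) / batch_size
    # 4. update the weights
    W -= lr * dW
```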
Summary so far:
Convolutional Networks express a single
differentiable function from raw image pixel
values to class probabilities.
Recurrent Neural Network
Convolutional Neural Network
Plug
- Fei-Fei and I are teaching CS231n (A Convolutional Neural Networks Class) at Stanford this quarter: cs231n.stanford.edu
- All the notes are online: cs231n.github.io
- Assignments are on terminal.com
Recurrent Neural Network
Recurrent Networks are good at modeling sequences...
Generating Sequences With Recurrent Neural Networks
[Alex Graves, 2014]
Recurrent Networks are good at modeling sequences...
Word-level language model. Similar to:
Recurrent Neural Network Based Language Model
[Tomas Mikolov, 2010]
Recurrent Networks are good at modeling sequences...
Machine Translation model
French words English words
Sequence to Sequence Learning with Neural Networks
[Ilya Sutskever, Oriol Vinyals, Quoc V. Le, 2014]
RecurrentJS 2-layer LSTM
train recurrent networks in Javascript!*
*if you have a lot of time :)
Character-level Paul Graham Wisdom Generator:
Suppose we had the training sentence “cat sat on mat”
We want to train a language model:
P(next word | previous words)
i.e. want these to be high:
P(cat | [<S>])
P(sat | [<S>, cat])
P(on | [<S>, cat, sat])
P(mat | [<S>, cat, sat, on])
“cat sat on mat”

RNN unrolled over the sentence:

inputs:  x0 = <START>, x1 = “cat”, x2 = “sat”, x3 = “on”, x4 = “mat”
         (300 learnable numbers associated with each word)
hidden:  h0 h1 h2 h3 h4
         “hidden” representation mediates the contextual information (e.g. 200 numbers)
         h4 = max(0, Wxh * x4 + Whh * h3)
outputs: y0 y1 y2 y3 y4
         10,001 numbers each (logprobs for 10,000 words in vocabulary and a special <END> token)
         y4 = Why * h4
targets: y0 -> P(word | [<S>]), y1 -> P(word | [<S>, cat]), y2 -> P(word | [<S>, cat, sat]),
         y3 -> P(word | [<S>, cat, sat, on]), y4 -> P(word | [<S>, cat, sat, on, mat])
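A minimal numpy sketch of one recurrence step using exactly the two equations above (random weights and random word vectors as stand-ins):

```python
import numpy as np

# Sizes follow the slides: 300-d word vectors, 200-d hidden state,
# 10,001 outputs (10,000 words + <END>).
word_dim, hidden_dim, vocab_size = 300, 200, 10001
Wxh = 0.01 * np.random.randn(hidden_dim, word_dim)
Whh = 0.01 * np.random.randn(hidden_dim, hidden_dim)
Why = 0.01 * np.random.randn(vocab_size, hidden_dim)

def rnn_step(x, h_prev):
    h = np.maximum(0, Wxh.dot(x) + Whh.dot(h_prev))  # hidden state
    y = Why.dot(h)                                   # unnormalized log probs over the vocab
    return h, y

# unroll over "<START> cat sat on mat" (random vectors stand in for the word vectors)
h = np.zeros(hidden_dim)
for x in np.random.randn(5, word_dim):
    h, y = rnn_step(x, h)
```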
Training this on a lot of sentences would give us a language model: a way to predict
P(next word | previous words)

Sampling from the model:
- feed in x0 = <START>, compute h0 and y0, sample a word from y0 (“cat”), feed it in as x1
- compute h1 and y1, sample again (“sat”), feed it in as x2
- repeat for “on”, “mat”, ...
- the network samples <END>? done.
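A sketch of that sampling loop, reusing rnn_step (and its sizes) from the previous snippet; word_vectors, START and END are hypothetical (an embedding matrix and two token indices):

```python
import numpy as np

def sample_sentence(word_vectors, START, END, max_len=20):
    h = np.zeros(hidden_dim)                         # hidden_dim from the previous sketch
    token, sentence = START, []
    for _ in range(max_len):
        h, y = rnn_step(word_vectors[token], h)      # advance the RNN by one step
        p = np.exp(y - y.max()); p /= p.sum()        # softmax over the 10,001 outputs
        token = int(np.random.choice(len(p), p=p))   # sample!
        if token == END:                             # sampled <END>? done.
            break
        sentence.append(token)
    return sentence
```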
Recurrent Neural Network
Convolutional Neural Network
“straw hat” training example

image X is plugged into the RNN:

inputs:  x0 = <START>, x1 = “straw”, x2 = “hat”
hidden:  h0 h1 h2
outputs: y0 y1 y2

before:
h0 = max(0, Wxh * x0)
now:
h0 = max(0, Wxh * x0 + Wih * v)   (v: the CNN code for image X)
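A sketch of the changed first step, reusing Wxh, word_dim and hidden_dim from the earlier RNN snippet; the 4096-d image code v is assumed to come from the CNN’s fully-connected layer:

```python
import numpy as np

img_dim = 4096
Wih = 0.01 * np.random.randn(hidden_dim, img_dim)   # new: projects the image code

def first_step(x0, v):
    # before: h0 = max(0, Wxh*x0); now the image code v also feeds into h0
    return np.maximum(0, Wxh.dot(x0) + Wih.dot(v))

v = np.random.randn(img_dim)       # stand-in for the CNN code of image X
x0 = np.random.randn(word_dim)     # stand-in for the <START> token vector
h0 = first_step(x0, v)
```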
test image

At test time:
- plug in the test image, feed x0 = <START>
- compute h0 and y0, sample! -> “straw” becomes x1
- compute h1 and y1, sample! -> “hat” becomes x2
- compute h2 and y2, sample! -> <END> token => finish.

- Don’t have to do greedy word-by-word sampling, can also search over longer phrases with beam search
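A rough sketch of beam search over this model, reusing rnn_step, first_step and Why from the earlier snippets (word_vectors, START, END remain hypothetical); instead of greedily sampling one word at a time it keeps the beam_size highest-scoring partial captions:

```python
import numpy as np

def log_softmax(y):
    y = y - y.max()
    return y - np.log(np.exp(y).sum())

def beam_search(word_vectors, v, START, END, beam_size=5, max_len=20):
    h0 = first_step(word_vectors[START], v)              # image-conditioned first step
    beams = [(0.0, [], h0, log_softmax(Why.dot(h0)))]    # (logp, words, h, next-word logprobs)
    for _ in range(max_len):
        candidates = []
        for logp, words, h, logprobs in beams:
            if words and words[-1] == END:               # finished caption, keep as-is
                candidates.append((logp, words, h, logprobs))
                continue
            for t in np.argsort(logprobs)[-beam_size:]:  # expand the most likely next words
                t = int(t)
                h_new, y_new = rnn_step(word_vectors[t], h)
                candidates.append((logp + logprobs[t], words + [t], h_new, log_softmax(y_new)))
        beams = sorted(candidates, key=lambda c: c[0])[-beam_size:]
    return max(beams, key=lambda b: b[0])[1]             # highest-scoring word sequence
```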
RNN vs. LSTM

RNN: “hidden” representation (e.g. 200 numbers)
h1 = max(0, Wxh * x1 + Whh * h0)

LSTM changes the form of the equation for h1 such that:
1. more expressive multiplicative interactions
2. gradients flow nicer
3. network can explicitly decide to reset the hidden state
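For reference, a sketch of one common LSTM step formulation (the exact variant used in the talk’s code may differ in details); the gates are what give the multiplicative interactions and the ability to reset state:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wx, Wh, b):
    # Wx: [4H x D], Wh: [4H x H], b: [4H]; the four row-blocks are the i, f, o, g gates
    H = h_prev.shape[0]
    a = Wx.dot(x) + Wh.dot(h_prev) + b
    i = sigmoid(a[0*H:1*H])          # input gate
    f = sigmoid(a[1*H:2*H])          # forget gate: lets the net reset its state
    o = sigmoid(a[2*H:3*H])          # output gate
    g = np.tanh(a[3*H:4*H])          # candidate update
    c = f * c_prev + i * g           # multiplicative interactions, additive cell
    h = o * np.tanh(c)               # the additive cell path helps gradients flow
    return h, c
```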
Image Sentence Datasets
Microsoft COCO
[Tsung-Yi Lin et al. 2014]
mscoco.org
currently:
~120K images
~5 sentences each
Training an RNN/LSTM...
- Clip the gradients (important!). 5 worked ok
- RMSprop adaptive learning rate worked nicely
- Initialize softmax biases with the log word frequency distribution
- Train for a long time
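A sketch of the first three tips (gradient clipping at 5, an RMSprop-style update, bias init from word frequencies); the decay rate and learning rate here are my choices, not values from the talk:

```python
import numpy as np

def rmsprop_update(w, dw, cache, lr=1e-3, decay=0.99, eps=1e-8, clip=5.0):
    dw = np.clip(dw, -clip, clip)                    # clip the gradients (important!)
    cache = decay * cache + (1 - decay) * dw ** 2    # running average of squared grads
    w = w - lr * dw / (np.sqrt(cache) + eps)         # per-parameter adaptive step
    return w, cache

# Softmax bias init from word frequencies (counts over the training set, hypothetical):
# b_softmax = np.log(counts / counts.sum())
```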
+ Transfer Learning

“straw hat” training example (same diagram as before):
- use CNN weights pretrained on ImageNet
- use word vectors pretrained with word2vec [1]

[1] Mikolov et al., 2013
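A sketch of the word-vector side of this: initialize the RNN’s input word embeddings from pretrained word2vec vectors where available (pretrained is a hypothetical {word: 300-d vector} dict loaded from a word2vec model):

```python
import numpy as np

def init_word_vectors(vocab, pretrained, dim=300):
    We = 0.01 * np.random.randn(len(vocab), dim)   # random init for words not in word2vec
    for i, word in enumerate(vocab):
        if word in pretrained:
            We[i] = pretrained[word]               # copy the pretrained vector
    return We
```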
Summary of the approach
We wanted to describe images with sentences.
1. Define a single function from input -> output
2. Initialize parts of net from elsewhere if possible
3. Get some data
4. Train with SGD
Wow I can’t believe that worked
Well, I can kind of see it
Not sure what happened there...
See predictions on
1000 COCO images:
http://bit.ly/neuraltalkdemo
What this approach Doesn’t do:
- There is no reasoning
- A single glance is taken at the image, no
objects are detected, etc.
- We can’t just describe any image
NeuralTalk
- Code on Github
- Both RNN/LSTM
- Python+numpy (CPU)
- Matlab+Caffe if you want to run on new images (for now)
Ranking model
web demo:
http://bit.ly/rankingdemo
Summary
Recurrent Neural Network + Convolutional Neural Network

Neural Networks:
- input->output end-to-end optimization
- stackable / composable like Lego
- easily support Transfer Learning
- work very well.

1. image -> sentence
2. sentence (natural language) -> image
Thank you!