Automated Image Captioning with ConvNets and Recurrent Nets
Andrej Karpathy, Fei-Fei Li

The document discusses using convolutional neural networks (CNNs) and recurrent neural networks (RNNs) for automated image captioning. CNNs are used to extract visual features from images, while RNNs are employed to generate natural language captions by modeling the sequence of words. Together, CNNs and RNNs form an end-to-end model that can learn relationships between visual content and associated text to produce captions for new images.
natural language

“images of me scuba diving next to turtle”

Very hard task. To a machine with no understanding of language or images, the same query is as opaque as its ROT13 encoding:

“vzntrf bs zr fphon qvivat arkg gb ghegyr”

Neural Networks practitioner
Describing images
Recurrent Neural Network

Convolutional Neural Network


Convolutional Neural Networks

image (32x32 numbers) → differentiable function → class probabilities (10 numbers)

[LeCun et al., 1998]


ImageNet classification error (top-5), year by year:

[Krizhevsky, Sutskever, Hinton, 2012]   16.4% error
[Zeiler and Fergus, 2013]               11.1% error
[Simonyan and Zisserman, 2014]           7.3% error
[Szegedy et al., 2014]                   6.6% error

Human error: ~5.1%
Optimistic human error: ~3%

read more on my blog: karpathy.github.io
“Very Deep Convolutional Networks for Large-Scale Visual Recognition”
[Simonyan and Zisserman, 2014]

“VGGNet” or “OxfordNet”
Very simple and homogeneous. (And available in Caffe.)

Input [224x224x3] → a stack of CONV, POOL and FULLY-CONNECTED layers → [1000] class scores
Every layer of a ConvNet has the same API (a minimal sketch follows below):
- Takes a 3D volume of numbers
- Outputs a 3D volume of numbers
- Constraint: function must be differentiable

image [224x224x3] → … → probabilities [1x1x1000]
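As a side illustration (not code from the deck), here is a minimal Python/numpy sketch of that “3D volume in, 3D volume out, differentiable” contract for one simple layer; the class and method names are assumptions:

```python
import numpy as np

class ReLULayer:
    """Illustrative layer obeying the ConvNet layer API:
    takes a 3D volume of numbers, returns a 3D volume of numbers,
    and is differentiable so gradients can flow through it."""

    def forward(self, x):            # x: 3D volume, e.g. shape (224, 224, 3)
        self.x = x                   # cache the input for the backward pass
        return np.maximum(0, x)      # same-shaped 3D volume out

    def backward(self, dout):        # dout: gradient of the loss w.r.t. the output
        return dout * (self.x > 0)   # gradient w.r.t. the input, same shape as x
```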
Fully Connected Layer

input [7x7x512] → output [1x1x4096] “neurons”

Every “neuron” in the output:
1. computes a dot product between the input and its weights
2. thresholds it at zero

The whole layer can be implemented very efficiently as:
1. a single matrix multiply
2. elementwise thresholding at zero
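For concreteness, a minimal numpy sketch of that recipe using the [7x7x512] → [1x1x4096] shapes above; the weights are random stand-ins, not the trained VGG parameters:

```python
import numpy as np

x = np.random.randn(7, 7, 512)                   # input volume [7x7x512]
W = np.random.randn(4096, 7 * 7 * 512) * 0.01    # one row of weights per output "neuron"
                                                 # (~100M weights, like the real VGG fc layer)
b = np.zeros(4096)

# 1. single matrix multiply   2. elementwise thresholding at zero
h = np.maximum(0, W.dot(x.reshape(-1)) + b)
print(h.shape)                                   # (4096,) i.e. the [1x1x4096] output volume
```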
Convolutional Layer

input [224x224x3] (D=3) → output [224x224x64]

Every (blue) neuron in the output is connected to a 3x3x3 array of inputs.
The whole layer can be implemented efficiently with convolutions.
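A deliberately naive, loop-based numpy sketch of that local connectivity (real implementations use fast convolution routines); shapes follow the 3x3x3 example above and everything here is illustrative:

```python
import numpy as np

x = np.random.randn(224, 224, 3)                 # input volume (H, W, D=3)
filters = np.random.randn(64, 3, 3, 3) * 0.01    # 64 filters, each looking at a 3x3x3 patch

xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))         # zero-pad spatially so the output stays 224x224
out = np.zeros((224, 224, 64))

for i in range(224):
    for j in range(224):
        patch = xp[i:i+3, j:j+3, :]              # the 3x3x3 array of inputs for this location
        # dot product of the patch with each of the 64 filters
        out[i, j, :] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2]))
```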


Pooling Layer

input [224x224x64] → output [112x112x64]

Performs (spatial) downsampling: 224x224 → 112x112, depth unchanged.
Max Pooling Layer

Single depth slice:

  1 1 2 4
  5 6 7 8      max pool with        6 8
  3 2 1 0      2x2 filters,    →    3 4
  1 2 3 4      stride 2
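A small numpy snippet reproducing exactly this example (2x2 windows, stride 2), purely for illustration:

```python
import numpy as np

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

# 2x2 max pooling with stride 2: reshape into 2x2 blocks and take the max of each block
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6 8]
                #  [3 4]]
```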
What do the neurons learn?

[Taken from Yann LeCun slides]


Example activation maps

CONV → ReLU → CONV → ReLU → POOL → CONV → ReLU → CONV → ReLU → POOL → CONV → ReLU → CONV → ReLU → POOL → FC (Fully-connected)

(tiny VGGNet trained with ConvNetJS)

image [224x224x3] → differentiable function → [1000] class probabilities

e.g.  cat: 0.2   dog: 0.4   chair: 0.09   bagel: 0.01   banana: 0.3
Training

Loop until tired (sketched below):
1. Sample a batch of data
2. Forward it through the network to get predictions
3. Backprop the errors
4. Update the weights

[image credit: Karen Simonyan]
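Purely as an illustration (not the deck's actual training code), a self-contained numpy sketch of that loop, using a tiny linear classifier on random data as a stand-in for the real ConvNet:

```python
import numpy as np

# Tiny stand-in for the real network: one linear layer with a softmax loss, random data.
np.random.seed(0)
X = np.random.randn(1000, 3072)            # fake "images" (e.g. 32x32x3, flattened)
y = np.random.randint(0, 10, size=1000)    # fake labels, 10 classes
W = np.random.randn(3072, 10) * 0.001      # the weights we will train

learning_rate = 1e-3
for step in range(200):                    # "loop until tired"
    idx = np.random.choice(len(X), 64)     # 1. sample a batch of data
    xb, yb = X[idx], y[idx]
    scores = xb.dot(W)                     # 2. forward pass -> predictions
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(64), yb]).mean()
    dscores = probs.copy()                 # 3. backprop the errors
    dscores[np.arange(64), yb] -= 1
    dscores /= 64
    dW = xb.T.dot(dscores)
    W -= learning_rate * dW                # 4. update the weights
```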
Summary so far:
Convolutional Networks express a single
differentiable function from raw image pixel
values to class probabilities.
Recurrent Neural Network

Convolutional Neural Network


Plug
- Fei-Fei and I are teaching CS231n (A Convolutional Neural Networks Class) at Stanford this quarter: cs231n.stanford.edu
- All the notes are online: cs231n.github.io
- Assignments are on terminal.com
Recurrent Neural Network

Recurrent Networks are good at modeling sequences...

- “Generating Sequences With Recurrent Neural Networks” [Alex Graves, 2014]
- Word-level language model. Similar to: “Recurrent Neural Network Based Language Model” [Tomas Mikolov, 2010]
- Machine Translation model (French words → English words): “Sequence to Sequence Learning with Neural Networks” [Ilya Sutskever, Oriol Vinyals, Quoc V. Le, 2014]
RecurrentJS: a 2-layer LSTM — train recurrent networks in Javascript!*

*if you have a lot of time :)

Character-level Paul Graham Wisdom Generator
Suppose we had the training sentence “cat sat on mat”

We want to train a language model:

P(next word | previous words)

i.e. we want these to be high:

P(cat | [<S>])
P(sat | [<S>, cat])
P(on | [<S>, cat, sat])
P(mat | [<S>, cat, sat, on])
“cat sat on mat”

Unrolled RNN for this sentence:

inputs  x0..x4:  <START>, “cat”, “sat”, “on”, “mat” — each word is represented by 300 (learnable) numbers

hidden  h0..h4:  the “hidden” representation mediates the contextual information (e.g. 200 numbers)
                 h4 = max(0, Wxh * x4 + Whh * h3)

outputs y0..y4:  y0 predicts P(word | [<S>]), y1 predicts P(word | [<S>, cat]), ...,
                 y4 predicts P(word | [<S>, cat, sat, on, mat])
                 each y is 10,001 numbers (logprobs for 10,000 words in the vocabulary and a special <END> token)
                 y4 = Why * h4
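A minimal numpy sketch of one step of this recurrence, using the dimensions quoted above (300-d word vectors, 200-d hidden state, 10,001 outputs); the weight names follow the slide, everything else is an illustrative assumption:

```python
import numpy as np

D, H, V = 300, 200, 10001           # word vector size, hidden size, vocab size (+ <END>)
Wxh = np.random.randn(H, D) * 0.01  # input-to-hidden weights
Whh = np.random.randn(H, H) * 0.01  # hidden-to-hidden weights
Why = np.random.randn(V, H) * 0.01  # hidden-to-output weights

def rnn_step(x, h_prev):
    """One time step: x is the 300-d vector of the current word,
    h_prev is the 200-d hidden state from the previous step."""
    h = np.maximum(0, Wxh.dot(x) + Whh.dot(h_prev))   # h_t = max(0, Wxh*x_t + Whh*h_{t-1})
    y = Why.dot(h)                                    # y_t = Why * h_t  (10,001 scores)
    return h, y

h = np.zeros(H)                      # initial hidden state
x = np.random.randn(D)               # stand-in for the <START> token's vector
h, y = rnn_step(x, h)
```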
Training this on a lot of sentences would give us a language model: a way to predict

P(next word | previous words)

To generate from the model, we run the recurrence forward and sample (see the sketch below):

1. Feed in x0 = <START>, compute h0 and y0, and sample a word from y0 (e.g. “cat”).
2. Feed the sampled word back in as x1, compute h1 and y1, and sample again (e.g. “sat”).
3. Repeat, feeding each sample back in as the next input: “on”, “mat”, ...
4. When the model samples the special <END> token, we are done.
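Continuing the previous sketch, a minimal sampling loop; the word-vector table, vocabulary, and <START>/<END> indices are illustrative assumptions:

```python
# Continues the previous sketch (rnn_step, D, H, V already defined).
word_vectors = np.random.randn(V, D) * 0.01    # one learnable 300-d vector per token
START, END = 0, V - 1                          # illustrative indices for <START> and <END>

def softmax(y):
    p = np.exp(y - y.max())
    return p / p.sum()

h = np.zeros(H)
token = START
sentence = []
while True:
    h, y = rnn_step(word_vectors[token], h)     # one step of the recurrence
    token = np.random.choice(V, p=softmax(y))   # sample the next word from P(word | ...)
    if token == END or len(sentence) >= 20:     # samples <END>? done (length cap for safety)
        break
    sentence.append(token)
```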
Recurrent Neural Network

Convolutional Neural Network


“straw hat” — training example

The training example pairs an image X with the sentence “straw hat”. We unroll the RNN over the inputs x0 = <START>, x1 = “straw”, x2 = “hat” (hidden states h0, h1, h2; outputs y0, y1, y2), exactly as before, except that the image now also feeds into the recurrence:

before:  h0 = max(0, Wxh * x0)
now:     h0 = max(0, Wxh * x0 + Wih * v)

where v is the image representation coming out of the ConvNet, brought in through the new weights Wih.
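Extending the earlier RNN sketch, a minimal illustration of how the image enters the first step; the 4096-d CNN feature size echoes the VGG fully-connected layer mentioned earlier, and Wih is the new image-to-hidden matrix from the slide (the rest is assumed for illustration):

```python
# Continues the previous sketches (D, H, Wxh, Whh defined there).
F = 4096                                 # size of the CNN feature vector for the image
Wih = np.random.randn(H, F) * 0.01       # new weights that map the image into the hidden state

v = np.random.randn(F)                   # stand-in for the CNN features of the training image
x0 = np.random.randn(D)                  # vector for the <START> token

h0 = np.maximum(0, Wxh.dot(x0) + Wih.dot(v))   # now: h0 = max(0, Wxh*x0 + Wih*v)
# subsequent steps use the ordinary recurrence: h_t = max(0, Wxh*x_t + Whh*h_{t-1})
```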


test image

At test time we condition the recurrence on the (new) test image, feed in x0 = <START>, and sample as before: y0 → “straw” (fed back in as x1), y1 → “hat” (fed back in as x2), y2 → <END> token ⇒ finish.

- We don’t have to do greedy word-by-word sampling; we can also search over longer phrases with beam search.
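A compact, illustrative sketch of beam search over this model (not the NeuralTalk implementation); it reuses rnn_step, softmax, word_vectors, Wxh, Wih, Why and the <START>/<END> indices from the earlier sketches, and the beam width and length cap are arbitrary choices:

```python
def beam_search(v, beam_width=5, max_len=20):
    """Search for a high-probability caption for an image with CNN features v (illustrative)."""
    # First step: the image enters together with the <START> token.
    h0 = np.maximum(0, Wxh.dot(word_vectors[START]) + Wih.dot(v))
    logp0 = np.log(softmax(Why.dot(h0)) + 1e-10)
    # Each beam entry: (cumulative log-probability, word indices so far, hidden state)
    beams = [(logp0[t], [int(t)], h0) for t in np.argsort(-logp0)[:beam_width]]
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, words, h in beams:
            if words[-1] == END:                    # this phrase already ended
                finished.append((logp, words))
                continue
            h_new, y = rnn_step(word_vectors[words[-1]], h)
            logprobs = np.log(softmax(y) + 1e-10)
            for t in np.argsort(-logprobs)[:beam_width]:
                candidates.append((logp + logprobs[t], words + [int(t)], h_new))
        if not candidates:
            break
        candidates.sort(key=lambda c: -c[0])        # keep the best partial phrases
        beams = candidates[:beam_width]
    finished.extend((logp, words) for logp, words, _ in beams)
    return max(finished, key=lambda c: c[0])[1]     # word indices of the best caption found
```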
RNN vs. LSTM

RNN: the “hidden” representation (e.g. 200 numbers) is updated as
h1 = max(0, Wxh * x1 + Whh * h0)

LSTM changes the form of the equation for h1 such that:
1. more expressive multiplicative interactions
2. gradients flow nicer
3. the network can explicitly decide to reset the hidden state
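The deck does not spell the LSTM equations out; for reference, a minimal numpy sketch of one standard formulation (input/forget/output gates plus a cell state), with the same illustrative dimensions as before:

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative LSTM parameters: one combined weight matrix for the 4 gates.
Wlstm = np.random.randn(4 * H, D + H) * 0.01
blstm = np.zeros(4 * H)

def lstm_step(x, h_prev, c_prev):
    """One LSTM time step: x (D,), previous hidden h_prev (H,), previous cell c_prev (H,)."""
    z = Wlstm.dot(np.concatenate([x, h_prev])) + blstm
    i = sigmoid(z[0:H])        # input gate
    f = sigmoid(z[H:2*H])      # forget gate: can explicitly decide to reset the state
    o = sigmoid(z[2*H:3*H])    # output gate
    g = np.tanh(z[3*H:4*H])    # candidate cell update
    c = f * c_prev + i * g     # multiplicative interactions; additive path helps gradients flow
    h = o * np.tanh(c)
    return h, c
```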
Image Sentence Datasets

Microsoft COCO
[Tsung-Yi Lin et al. 2014]
mscoco.org

currently:
~120K images
~5 sentences each
Training an RNN/LSTM...
- Clip the gradients (important!). Clipping at 5 worked ok (see the sketch below)
- RMSprop adaptive learning rate worked nicely
- Initialize the softmax biases with the log word frequency distribution
- Train for a long time
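A small numpy sketch of the first two tips, elementwise gradient clipping at 5 followed by an RMSprop update; the learning rate and decay below are illustrative choices, not values from the deck:

```python
def clipped_rmsprop_update(param, grad, cache, learning_rate=1e-3, decay=0.99, eps=1e-8):
    """One update: clip the gradient elementwise at 5, then apply RMSprop."""
    grad = np.clip(grad, -5, 5)                        # clip the gradients; 5 worked ok
    cache = decay * cache + (1 - decay) * grad ** 2    # running average of squared gradients
    param = param - learning_rate * grad / (np.sqrt(cache) + eps)
    return param, cache

# usage (illustrative; dWxh is a hypothetical gradient): cache starts as zeros
# Wxh, cache = clipped_rmsprop_update(Wxh, dWxh, np.zeros_like(Wxh))
```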
+ Transfer Learning

Same “straw hat” training example as before, but:
- use weights pretrained on ImageNet (for the ConvNet)
- use word vectors pretrained with word2vec [1] (for the word inputs x)

[1] Mikolov et al., 2013
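A tiny sketch of the second point, initializing the learnable word-vector table from pretrained embeddings before fine-tuning; `vocab` and `pretrained` here are hypothetical inputs:

```python
# Initialize the learnable word vectors from pretrained word2vec-style embeddings.
# `vocab` (list of words) and `pretrained` (word -> 300-d vector dict) are hypothetical inputs.
def init_word_vectors(vocab, pretrained, dim=300):
    W = np.random.randn(len(vocab), dim) * 0.01    # fallback: small random init
    for idx, word in enumerate(vocab):
        if word in pretrained:
            W[idx] = pretrained[word]              # copy the pretrained vector
    return W                                       # fine-tuned further during training
```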
Summary of the approach
We wanted to describe images with sentences.

1. Define a single function from input -> output


2. Initialize parts of net from elsewhere if possible
3. Get some data
4. Train with SGD
Example results, from best to worst:
- “Wow I can’t believe that worked”
- “Well, I can kind of see it”
- “Not sure what happened there...”
See predictions on
1000 COCO images:
http://bit.ly/neuraltalkdemo
What this approach Doesn’t do:
- There is no reasoning
- A single glance is taken at the image; no objects are detected, etc.
- We can’t just describe any image
NeuralTalk
- Code on Github
- Both RNN/LSTM
- Python+numpy (CPU)
- Matlab+Caffe if you want to run on new images (for now)
Ranking model
web demo: http://bit.ly/rankingdemo
Summary

Convolutional Neural Network + Recurrent Neural Network

Neural Networks:
- input -> output end-to-end optimization
- stackable / composable like Lego
- easily support Transfer Learning
- work very well.

1. image -> sentence
2. sentence -> image (natural language)
Thank you!
