Automated Image Captioning with
ConvNets and Recurrent Nets
Andrej Karpathy, Fei-Fei Li
natural language
“images of me scuba diving next to turtle”
Very hard task
(to a Neural Networks practitioner it might as well read: “vzntrf bs zr fphon qvivat arkg gb ghegyr”)
Describing images
Recurrent Neural Network
Convolutional Neural Network
Convolutional Neural Networks
image (32x32 numbers) -> differentiable function -> class probabilities (10 numbers)
[LeCun et al., 1998]
[Krizhevsky, Sutskever, Hinton, 2012] 16.4% error
[Zeiler and Fergus, 2013] 11.1% error
[Simonyan and Zisserman, 2014] 7.3% error
[Szegedy et al., 2014] 6.6% error
Human error: ~5.1%
Optimistic human error: ~3%
read more on my blog:
karpathy.github.io
“Very Deep Convolutional Networks for Large-Scale Visual Recognition”
[Simonyan and Zisserman, 2014]
“VGGNet” or “OxfordNet”
Very simple and homogeneous.
(And available in Caffe.)
A stack of CONV, POOL and FULLY-CONNECTED layers maps the image [224x224x3] to class scores [1000].
Every layer of a ConvNet has the same API:
- Takes a 3D volume of numbers
- Outputs a 3D volume of numbers
- Constraint: function must be differentiable
image [224x224x3] -> ... -> probabilities [1x1x1000]
Fully Connected Layer
[7x7x512] -> [1x1x4096] “neurons”
Every “neuron” in the output:
1. computes a dot product between the input and its weights
2. thresholds it at zero
The whole layer can be implemented very efficiently as:
1. a single matrix multiply
2. elementwise thresholding at zero
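The two points above in a minimal numpy sketch (sizes follow the slide; the weights here are random stand-ins, not trained VGGNet parameters):

```python
import numpy as np

# Fully connected layer as: (1) a single matrix multiply, (2) elementwise
# thresholding at zero (ReLU). Sizes follow the slide: [7x7x512] -> [1x1x4096].
def fc_layer(x_volume, W, b):
    x = x_volume.reshape(-1)             # flatten [7x7x512] to a 25088-vector
    return np.maximum(0, W.dot(x) + b)   # dot products + threshold at zero

x_volume = np.random.randn(7, 7, 512)    # random stand-in for the input volume
W = 0.01 * np.random.randn(4096, 7 * 7 * 512)
b = np.zeros(4096)
out = fc_layer(x_volume, W, b)           # 4096 "neurons"
```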
Convolutional Layer
(figure: [224x224x3] input volume, D=3 -> [224x224x64] output volume)
Every blue neuron is connected to a 3x3x3 array of inputs.
Can be implemented efficiently with convolutions.
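A naive (and slow) numpy sketch of this connectivity, assuming 3x3 filters, stride 1 and zero padding of 1; a real implementation would use optimized convolution routines:

```python
import numpy as np

# Each output neuron looks at a 3x3xD window of the (zero-padded) input; one
# filter per output depth slice. ReLU is applied by a separate layer.
def conv_layer(x, filters, b):
    H, W, D = x.shape                       # e.g. 224 x 224 x 3
    F = filters.shape[0]                    # e.g. 64 filters of shape [3, 3, D]
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, F))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + 3, j:j + 3, :]             # 3x3xD array of inputs
            out[i, j, :] = (filters * patch).reshape(F, -1).sum(axis=1) + b
    return out

x = np.random.randn(16, 16, 3)              # small spatial size to keep it quick
filters = 0.01 * np.random.randn(64, 3, 3, 3)
y = conv_layer(x, filters, np.zeros(64))    # -> [16 x 16 x 64]
```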
Pooling Layer
Performs (spatial) downsampling, e.g. [224x224x64] -> [112x112x64]
Max Pooling Layer
Single depth slice, max pool with 2x2 filters and stride 2:

1 1 2 4          6 8
5 6 7 8    ->    3 4
3 2 1 0
1 2 3 4
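The same example as a short numpy sketch:

```python
import numpy as np

# Max pooling over non-overlapping 2x2 blocks of a single depth slice.
def max_pool_2x2(x):
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool_2x2(x))   # [[6 8]
                         #  [3 4]]
```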
What do the neurons learn?
[Taken from Yann LeCun slides]
Example activation maps
CONV/ReLU - CONV/ReLU - POOL - CONV/ReLU - CONV/ReLU - POOL - CONV/ReLU - CONV/ReLU - POOL - FC (Fully-connected)
(tiny VGGNet trained with ConvNetJS)
image [224x224x3] -> differentiable function -> [1000] class probabilities
e.g. cat 0.2, dog 0.4, chair 0.09, bagel 0.01, banana 0.3, ...
Training
Loop until tired:
1. Sample a batch of data
2. Forward it through the network to get predictions
3. Backprop the errors
4. Update the weights
[image credit: Karen Simonyan]
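The four steps above in a minimal, self-contained numpy sketch; the tiny linear softmax classifier and the random data are stand-ins for the real ConvNet and dataset:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 50)              # fake dataset: 1000 examples, 50 features
y = np.random.randint(0, 10, size=1000)    # fake labels for 10 classes
W = 0.01 * np.random.randn(50, 10)
lr, batch_size = 0.1, 32

for step in range(100):                    # "loop until tired"
    # 1. sample a batch of data
    idx = np.random.choice(len(X), batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    # 2. forward it through the network to get predictions
    scores = xb.dot(W)
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    # 3. backprop the errors (gradient of the cross-entropy loss)
    dscores = probs.copy()
    dscores[np.arange(batch_size), yb] -= 1
    dW = xb.T.dot(dscores) / batch_size
    # 4. update the weights
    W -= lr * dW
```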
Summary so far:
Convolutional Networks express a single
differentiable function from raw image pixel
values to class probabilities.
Recurrent Neural Network
Convolutional Neural Network
Plug
- Fei-Fei and I are teaching CS231n (A Convolutional Neural Networks Class) at Stanford this quarter: cs231n.stanford.edu
- All the notes are online: cs231n.github.io
- Assignments are on terminal.com
Recurrent Neural Network
Recurrent Networks are good at modeling sequences...
Generating Sequences With Recurrent Neural Networks
[Alex Graves, 2014]
Recurrent Networks are good at modeling sequences...
Word-level language model. Similar to:
Recurrent Neural Network Based Language Model
[Tomas Mikolov, 2010]
Recurrent Networks are good at modeling sequences...
Machine Translation model
French words English words
Sequence to Sequence Learning with Neural Networks
[Ilya Sutskever, Oriol Vinyals, Quoc V. Le, 2014]
RecurrentJS 2-layer LSTM
train recurrent networks in Javascript!*
*if you have a lot of time :)
Character-level Paul Graham Wisdom Generator:
Suppose we had the training sentence “cat sat on mat”
We want to train a language model:
P(next word | previous words)
i.e. want these to be high:
P(cat | [<S>])
P(sat | [<S>, cat])
P(on | [<S>, cat, sat])
P(mat | [<S>, cat, sat, on])
“cat sat on mat”

RNN unrolled over the sentence:

inputs:  x0 = <START>, x1 = “cat”, x2 = “sat”, x3 = “on”, x4 = “mat”
         (300 learnable numbers associated with each word)
hidden:  h0 h1 h2 h3 h4
         “hidden” representation mediates the contextual information (e.g. 200 numbers)
         h4 = max(0, Wxh * x4 + Whh * h3)
outputs: y0 y1 y2 y3 y4
         10,001 numbers each (logprobs for 10,000 words in vocabulary and a special <END> token)
         y4 = Why * h4
targets: y0 -> P(word | [<S>]), y1 -> P(word | [<S>, cat]), y2 -> P(word | [<S>, cat, sat]),
         y3 -> P(word | [<S>, cat, sat, on]), y4 -> P(word | [<S>, cat, sat, on, mat])
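A minimal numpy sketch of one recurrence step using exactly the two equations above (random weights and random word vectors as stand-ins):

```python
import numpy as np

# Sizes follow the slides: 300-d word vectors, 200-d hidden state,
# 10,001 outputs (10,000 words + <END>).
word_dim, hidden_dim, vocab_size = 300, 200, 10001
Wxh = 0.01 * np.random.randn(hidden_dim, word_dim)
Whh = 0.01 * np.random.randn(hidden_dim, hidden_dim)
Why = 0.01 * np.random.randn(vocab_size, hidden_dim)

def rnn_step(x, h_prev):
    h = np.maximum(0, Wxh.dot(x) + Whh.dot(h_prev))  # hidden state
    y = Why.dot(h)                                   # unnormalized log probs over the vocab
    return h, y

# unroll over "<START> cat sat on mat" (random vectors stand in for the word vectors)
h = np.zeros(hidden_dim)
for x in np.random.randn(5, word_dim):
    h, y = rnn_step(x, h)
```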
Training this on a lot of sentences would give us a language model: a way to predict
P(next word | previous words)

Sampling from the model:
- feed in x0 = <START>, compute h0 and y0, sample a word from y0 (“cat”), feed it in as x1
- compute h1 and y1, sample again (“sat”), feed it in as x2
- repeat for “on”, “mat”, ...
- the network samples <END>? done.
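A sketch of that sampling loop, reusing rnn_step (and its sizes) from the previous snippet; word_vectors, START and END are hypothetical (an embedding matrix and two token indices):

```python
import numpy as np

def sample_sentence(word_vectors, START, END, max_len=20):
    h = np.zeros(hidden_dim)                         # hidden_dim from the previous sketch
    token, sentence = START, []
    for _ in range(max_len):
        h, y = rnn_step(word_vectors[token], h)      # advance the RNN by one step
        p = np.exp(y - y.max()); p /= p.sum()        # softmax over the 10,001 outputs
        token = int(np.random.choice(len(p), p=p))   # sample!
        if token == END:                             # sampled <END>? done.
            break
        sentence.append(token)
    return sentence
```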
Recurrent Neural Network
Convolutional Neural Network
“straw hat” training example

image X is plugged into the RNN:

inputs:  x0 = <START>, x1 = “straw”, x2 = “hat”
hidden:  h0 h1 h2
outputs: y0 y1 y2

before:
h0 = max(0, Wxh * x0)
now:
h0 = max(0, Wxh * x0 + Wih * v)   (v: the CNN code for image X)
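A sketch of the changed first step, reusing Wxh, word_dim and hidden_dim from the earlier RNN snippet; the 4096-d image code v is assumed to come from the CNN’s fully-connected layer:

```python
import numpy as np

img_dim = 4096
Wih = 0.01 * np.random.randn(hidden_dim, img_dim)   # new: projects the image code

def first_step(x0, v):
    # before: h0 = max(0, Wxh*x0); now the image code v also feeds into h0
    return np.maximum(0, Wxh.dot(x0) + Wih.dot(v))

v = np.random.randn(img_dim)       # stand-in for the CNN code of image X
x0 = np.random.randn(word_dim)     # stand-in for the <START> token vector
h0 = first_step(x0, v)
```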
test image

At test time:
- plug in the test image, feed x0 = <START>
- compute h0 and y0, sample! -> “straw” becomes x1
- compute h1 and y1, sample! -> “hat” becomes x2
- compute h2 and y2, sample! -> <END> token => finish.

- Don’t have to do greedy word-by-word sampling, can also search over longer phrases with beam search
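A rough sketch of beam search over this model, reusing rnn_step, first_step and Why from the earlier snippets (word_vectors, START, END remain hypothetical); instead of greedily sampling one word at a time it keeps the beam_size highest-scoring partial captions:

```python
import numpy as np

def log_softmax(y):
    y = y - y.max()
    return y - np.log(np.exp(y).sum())

def beam_search(word_vectors, v, START, END, beam_size=5, max_len=20):
    h0 = first_step(word_vectors[START], v)              # image-conditioned first step
    beams = [(0.0, [], h0, log_softmax(Why.dot(h0)))]    # (logp, words, h, next-word logprobs)
    for _ in range(max_len):
        candidates = []
        for logp, words, h, logprobs in beams:
            if words and words[-1] == END:               # finished caption, keep as-is
                candidates.append((logp, words, h, logprobs))
                continue
            for t in np.argsort(logprobs)[-beam_size:]:  # expand the most likely next words
                t = int(t)
                h_new, y_new = rnn_step(word_vectors[t], h)
                candidates.append((logp + logprobs[t], words + [t], h_new, log_softmax(y_new)))
        beams = sorted(candidates, key=lambda c: c[0])[-beam_size:]
    return max(beams, key=lambda b: b[0])[1]             # highest-scoring word sequence
```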
RNN vs. LSTM

RNN: “hidden” representation (e.g. 200 numbers)
h1 = max(0, Wxh * x1 + Whh * h0)

LSTM changes the form of the equation for h1 such that:
1. more expressive multiplicative interactions
2. gradients flow nicer
3. network can explicitly decide to reset the hidden state
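For reference, a sketch of one common LSTM step formulation (the exact variant used in the talk’s code may differ in details); the gates are what give the multiplicative interactions and the ability to reset state:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wx, Wh, b):
    # Wx: [4H x D], Wh: [4H x H], b: [4H]; the four row-blocks are the i, f, o, g gates
    H = h_prev.shape[0]
    a = Wx.dot(x) + Wh.dot(h_prev) + b
    i = sigmoid(a[0*H:1*H])          # input gate
    f = sigmoid(a[1*H:2*H])          # forget gate: lets the net reset its state
    o = sigmoid(a[2*H:3*H])          # output gate
    g = np.tanh(a[3*H:4*H])          # candidate update
    c = f * c_prev + i * g           # multiplicative interactions, additive cell
    h = o * np.tanh(c)               # the additive cell path helps gradients flow
    return h, c
```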
Image Sentence Datasets
Microsoft COCO
[Tsung-Yi Lin et al. 2014]
mscoco.org
currently:
~120K images
~5 sentences each
Training an RNN/LSTM...
- Clip the gradients (important!). 5 worked ok
- RMSprop adaptive learning rate worked nicely
- Initialize softmax biases with the log word frequency distribution
- Train for a long time
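A sketch of the first three tips (gradient clipping at 5, an RMSprop-style update, bias init from word frequencies); the decay rate and learning rate here are my choices, not values from the talk:

```python
import numpy as np

def rmsprop_update(w, dw, cache, lr=1e-3, decay=0.99, eps=1e-8, clip=5.0):
    dw = np.clip(dw, -clip, clip)                    # clip the gradients (important!)
    cache = decay * cache + (1 - decay) * dw ** 2    # running average of squared grads
    w = w - lr * dw / (np.sqrt(cache) + eps)         # per-parameter adaptive step
    return w, cache

# Softmax bias init from word frequencies (counts over the training set, hypothetical):
# b_softmax = np.log(counts / counts.sum())
```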
+ Transfer Learning

“straw hat” training example (same diagram as before):
- use CNN weights pretrained on ImageNet
- use word vectors pretrained with word2vec [1]

[1] Mikolov et al., 2013
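A sketch of the word-vector side of this: initialize the RNN’s input word embeddings from pretrained word2vec vectors where available (pretrained is a hypothetical {word: 300-d vector} dict loaded from a word2vec model):

```python
import numpy as np

def init_word_vectors(vocab, pretrained, dim=300):
    We = 0.01 * np.random.randn(len(vocab), dim)   # random init for words not in word2vec
    for i, word in enumerate(vocab):
        if word in pretrained:
            We[i] = pretrained[word]               # copy the pretrained vector
    return We
```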
Summary of the approach
We wanted to describe images with sentences.
1. Define a single function from input -> output
2. Initialize parts of net from elsewhere if possible
3. Get some data
4. Train with SGD
Wow I can’t believe that worked
Well, I can kind of see it
Not sure what happened there...
See predictions on
1000 COCO images:
http://bit.ly/neuraltalkdemo
What this approach Doesn’t do:
- There is no reasoning
- A single glance is taken at the image, no
objects are detected, etc.
- We can’t just describe any image
NeuralTalk
- Code on Github
- Both RNN/LSTM
- Python+numpy (CPU)
- Matlab+Caffe if you want to run on new images (for now)
Ranking model
web demo:
http://bit.ly/rankingdemo
Summary
Recurrent Neural Network + Convolutional Neural Network

Neural Networks:
- input->output end-to-end optimization
- stackable / composable like Lego
- easily support Transfer Learning
- work very well.

1. image -> sentence
2. sentence (natural language) -> image
Thank you!