Visual Recognition
Introduction to image recognition
[Figure: a street scene annotated with object labels: buildings, a tree, lamps, umbrellas, people, a market stall, and the ground. Adapted from Fei-Fei Li.]
Detection, semantic segmentation,
instance segmentation
Image classification
The statistical learning framework
• Apply a prediction function to a feature representation of
the image to get the desired output:
f( ) = “apple”
f( ) = “tomato”
f( ) = “cow”
Classical statistical learning methods
• Some classical statistical learning methods include:
• Linear Regression
• Logistic Regression
• Decision Trees
• Naive Bayes
• Support Vector Machines
• K-Nearest Neighbors
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
The statistical learning framework
y = f(x)
• y: the output (predicted label)
• f: the prediction function
• x: the feature representation of the image
Testing: extract features from the test image and apply the learned model f to obtain a prediction.
Slide credit: D. Hoiem
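As a rough illustration of this framework in code (a sketch only: the color-histogram features and logistic-regression classifier below are arbitrary stand-ins, not the specific choices discussed in this lecture):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(image):
    # Toy feature representation x: a per-channel color histogram.
    # (Any hand-crafted or learned features could be used instead.)
    hist = [np.histogram(image[..., c], bins=8, range=(0, 255))[0]
            for c in range(3)]
    return np.concatenate(hist).astype(np.float32)

# Training: learn the prediction function f from labeled examples.
train_images = [np.random.randint(0, 256, (32, 32, 3)) for _ in range(20)]
train_labels = np.random.randint(0, 3, 20)          # e.g., apple/tomato/cow
X_train = np.stack([extract_features(im) for im in train_images])
f = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

# Testing: y = f(x) on the features of a new image.
test_image = np.random.randint(0, 256, (32, 32, 3))
y = f.predict(extract_features(test_image)[None, :])[0]
print("predicted label:", y)
```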
“Classic” recognition pipeline
Hand-crafted feature representation: Bag of words features
• Example: each group of patches belongs to the same visual word in the appearance codebook
Source: B. Leibe
Bag of features: Outline
1. Extract local features
2. Learn “visual vocabulary”
3. Quantize local features using visual vocabulary
4. Represent images by frequencies of “visual words”
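A minimal sketch of this outline in Python, assuming scikit-learn's KMeans for the vocabulary and flattened grayscale patches as a stand-in for real local descriptors such as SIFT; the patch size and vocabulary size are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.cluster import KMeans

def local_features(image, patch=8):
    # 1. Extract local features: flattened grayscale patches
    #    (a stand-in for descriptors such as SIFT).
    h, w = image.shape
    return np.stack([image[i:i+patch, j:j+patch].ravel()
                     for i in range(0, h - patch + 1, patch)
                     for j in range(0, w - patch + 1, patch)])

images = [np.random.rand(64, 64) for _ in range(10)]

# 2. Learn the "visual vocabulary" by clustering all local features.
all_feats = np.vstack([local_features(im) for im in images])
vocab = KMeans(n_clusters=16, n_init=10, random_state=0).fit(all_feats)

# 3.-4. Quantize each image's features against the vocabulary and represent
#       the image as a histogram of visual-word frequencies.
def bow_histogram(image):
    words = vocab.predict(local_features(image))
    return np.bincount(words, minlength=vocab.n_clusters)

print(bow_histogram(images[0]))
```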
“Classic” recognition pipeline
Classifiers: Nearest neighbor
[Figure: a test example is compared against training examples from class 1 and class 2 and assigned the label of its nearest neighbor.]
• L1 distance: D(h1, h2) = Σ_{i=1}^{N} |h1(i) − h2(i)|
• χ² distance: D(h1, h2) = Σ_{i=1}^{N} (h1(i) − h2(i))² / (h1(i) + h2(i))
[Figure: nearest-neighbor (K = 1) decision boundaries on a 2D toy dataset.]
Slide adapted from https://cs231n.stanford.edu/
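The two histogram distances above can be written directly in NumPy; this is a small illustrative sketch, and the eps term guarding against empty bins is an added implementation detail, not part of the slide's formula:

```python
import numpy as np

def l1_distance(h1, h2):
    # L1 distance: sum of absolute differences of histogram bins.
    return np.sum(np.abs(h1 - h2))

def chi2_distance(h1, h2, eps=1e-10):
    # Chi-squared distance: squared differences normalized by total bin mass.
    # eps avoids division by zero for empty bins (an implementation choice).
    return np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

h1 = np.array([3., 0., 5., 2.])
h2 = np.array([1., 1., 4., 4.])
print(l1_distance(h1, h2), chi2_distance(h1, h2))
```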
K-nearest neighbor classifier
• For a new point, find the k closest points from the training data
• Assign the class label by a majority vote among the labels of those k points
[Figure: classifying a new point using its k = 5 nearest neighbors.]
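A hedged sketch of the k-nearest-neighbor rule itself (the L1 distance and the toy 2-D data are illustrative choices):

```python
import numpy as np
from collections import Counter

def l1(a, b):
    return np.sum(np.abs(a - b))

def knn_predict(x, X_train, y_train, k=5, dist=l1):
    # Distances from the query x to every training example.
    dists = np.array([dist(x, xt) for xt in X_train])
    # Indices of the k closest training points.
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels.
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Toy usage on random 2-D points with two classes.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 2)) + np.repeat([[0, 0], [3, 3]], 10, axis=0)
y_train = np.repeat([0, 1], 10)
print(knn_predict(np.array([2.5, 2.5]), X_train, y_train, k=5))
```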
K-Nearest Neighbors
http://vision.stanford.edu/teaching/cs231n-demos/knn/
Slide adapted from https://cs231n.stanford.edu/
K-nearest neighbor classifier
• The best choice of k and of the distance metric is very problem/dataset-dependent.
• You must try them out (e.g., on held-out validation data) and see what works best.
• k-NN is useful for small datasets, but is not used very frequently in deep learning.
Slide adapted from https://cs231n.stanford.edu/
Best practices for training classifiers
[Figure: hyperparameter search with cross-validation: split the training data into folds, choose the best hyperparameters on the validation folds, then retrain and evaluate once on the held-out test set.]
https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-and-model-selection
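Following the linked scikit-learn workflow, a possible sketch of choosing k by cross-validation on the training split and evaluating once on the test set (the digits dataset and the candidate values of k are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Choose k by 5-fold cross-validation on the training split only.
scores = {}
for k in [1, 3, 5, 7, 9, 15]:
    cv = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                         X_train, y_train, cv=5)
    scores[k] = cv.mean()
best_k = max(scores, key=scores.get)

# Retrain with the best hyperparameter and evaluate once on the test set.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("best k:", best_k, "test accuracy:", final.score(X_test, y_test))
```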
Example Dataset: CIFAR10
• 10 classes
• 50,000 training images
• 10,000 testing images
[Figure: test images and their nearest neighbors in the training set.]
Alex Krizhevsky, “Learning Multiple Layers of Features from Tiny Images”, Technical Report, 2009.
[Figure: pixel-distance pitfall: all three images on the right have the same pixel distance to the one on the left.]
Original image is CC0 public domain. Slide adapted from https://cs231n.stanford.edu/
Parametric Approach: Linear Classifier
f(x) = sgn(w·x + b)
Support vector machines
• Find the hyperplane that maximizes the margin between the positive and negative examples
• xᵢ positive (yᵢ = 1):  xᵢ·w + b ≥ 1
• xᵢ negative (yᵢ = −1): xᵢ·w + b ≤ −1
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
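As a sketch of the max-margin idea in practice (not the lecture's own code), scikit-learn's SVC with a linear kernel and a large C approximates the hard-margin formulation above; the toy data below is illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, size=(20, 2)),
               rng.normal([3, 3], 0.5, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# A linear SVM maximizes the margin; a large C approximates the hard margin.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w:", w, "b:", b)
# Support vectors lie on the margin, so y_i (w.x_i + b) is close to 1 for them.
print("margin values:", y[clf.support_] * (X[clf.support_] @ w + b))
```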
Parametric Approach: Linear Classifier
• Input x: an array of 32×32×3 numbers (3072 numbers total), flattened into a 3072×1 vector
• W: the parameters, or weights
Cat image by Nikita is licensed under CC-BY 2.0
Slide adapted from https://cs231n.stanford.edu/
Example with an image with 4 pixels and 3 classes (cat/dog/ship): Algebraic Viewpoint
• Flatten the input tensor into a vector: x = [56, 231, 24, 2]
• Cat score: 0.2·56 + (−0.5)·231 + 0.1·24 + 2.0·2 + 1.1 = −96.8
• Dog score: 1.5·56 + 1.3·231 + 2.1·24 + 0.0·2 + 3.2 = 437.9
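The same worked example as a few lines of NumPy; only the cat and dog rows of W appear on the slide, so the sketch uses just those two:

```python
import numpy as np

# Flattened 4-pixel image from the slide.
x = np.array([56., 231., 24., 2.])

# Weight rows and biases shown on the slide (cat and dog rows only).
W = np.array([[0.2, -0.5, 0.1, 2.0],    # cat template
              [1.5,  1.3, 2.1, 0.0]])   # dog template
b = np.array([1.1, 3.2])

scores = W @ x + b                       # f(x) = Wx + b
print(scores)                            # [-96.8, 437.9]
```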
“Shallow” recognition pipeline
“Deep” recognition pipeline
Image pixels → Layer 1 → Layer 2 → Layer 3 → Simple Classifier
Neural networks vs. SVMs
(a.k.a. “deep” vs. “shallow” learning)
Brief history of neural networks
Important events
• AlexNet: Winner of ImageNet 2012
• Microsoft: Speech Recognition Breakthrough for the
Spoken, Translated Word, 2012.
• MIT 10 Breakthrough Technologies, 2013.
• Explosive growth of AI startups, since 2013.
• Deep-learning-based face recognition surpasses human performance, 2014.
• Wide deployment of face recognition techniques, 2015.
• AlphaGo, 2016.
• ResNet, 2016.
• Coming soon… (it could be your work!)
DALL-E 2
• “Teddy bears working on new AI research on the moon in the 1980s.”
• “Rabbits attending a college seminar on human anatomy.”
• “A wise cat meditating in the Himalayas searching for enlightenment.”
In a fantastical setting, a highly detailed furry humanoid skunk with piercing eyes confidently poses in a medium shot, wearing an animal hide jacket. The artist has masterfully rendered the character in digital art, capturing the intricate details of fur and clothing texture.
An illustration from a graphic novel. A bustling city street under the shine of a full moon. The sidewalks bustling with pedestrians enjoying the nightlife. At the corner stall, a young woman with fiery red hair, dressed in a signature velvet cloak, is haggling with the grumpy old vendor. The grumpy vendor, a tall, sophisticated man wearing a sharp suit, who sports a noteworthy mustache, is animatedly conversing on his steampunk telephone.
GPT-4
Segment Anything Model
Neural Networks (NN), also called Artificial Neural Networks, are named after the fact that they artificially model the working of a human being’s nervous system.
In simple terms, each neuron takes input from numerous other neurons through its dendrites. It then performs the required processing on the input and sends another electrical pulse through the axon to the terminal nodes, from where it is transmitted to numerous other neurons.
https://www.analyticsvidhya.com/blog/2016/03/introduction-deep-learning-fundamentals-neural-networks/
[Figure: a biological neuron. Dendrites carry impulses toward the cell body; the axon carries them to the presynaptic terminals.]
• A neuron is a function f, known as the activation function. It makes a neural network extremely flexible and imparts the capability to estimate complex non-linear relationships in the data.
1. x1, x2, …, xN: inputs to the neuron. These can either be actual observations from the input layer or intermediate values from one of the hidden layers.
2. x0: bias unit. This is a constant value added to the input of the activation function. It works similarly to an intercept term and typically has the value +1.
3. w0, w1, w2, …, wN: weights on each input. Note that even the bias unit has a weight.
4. a: output of the neuron, which is calculated as a = f(Σ_{i=0}^{N} wᵢ xᵢ).
Fundamental Function Examples
We will model these like linear classifiers with the following activation function:
• Let us implement a fundamental function, AND, using neural networks.
• This will help us understand how they work. You can treat these as classification problems where we predict the output (0 or 1) for different combinations of inputs.
Fundamental Function Examples
We will model these like linear classifiers with the following activation function:
• OR:  a = f(−0.5 + x1 + x2)
• NOT: a = f(1 − 2·x1)
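A minimal sketch of these threshold units in Python, assuming a step activation f(z) = 1 if z ≥ 0 and 0 otherwise (consistent with the OR and NOT weights above); the AND weights (−1.5, 1, 1) are an assumption that follows the same pattern and are not given on the slide:

```python
import numpy as np

def f(z):
    # Assumed step activation: fires (outputs 1) when the weighted sum is >= 0.
    return 1 if z >= 0 else 0

def neuron(weights, inputs):
    # a = f(w0*x0 + w1*x1 + ... + wN*xN), with the bias unit x0 fixed to +1.
    return f(np.dot(weights, [1, *inputs]))

OR  = lambda x1, x2: neuron([-0.5, 1, 1], [x1, x2])
NOT = lambda x1:     neuron([1, -2], [x1])
AND = lambda x1, x2: neuron([-1.5, 1, 1], [x1, x2])   # assumed weights

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "OR:", OR(x1, x2), "AND:", AND(x1, x2), "NOT x1:", NOT(x1))
```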
From a Neuron to the Multilayer Perceptron (MLP)
• x (3072-dim input) → W1 → h (100 hidden units) → W2 → s (10 class scores)
• Learn 100 lower-level templates instead of 10.
• (In practice we will usually add a learnable bias at each layer as well.)
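A minimal NumPy sketch of this two-layer forward pass; the ReLU nonlinearity and random initialization are assumptions for illustration, since the diagram itself does not fix them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the diagram: 3072 -> 100 -> 10.
W1 = rng.normal(0, 0.01, size=(100, 3072))
b1 = np.zeros(100)
W2 = rng.normal(0, 0.01, size=(10, 100))
b2 = np.zeros(10)

def forward(x):
    # Hidden layer: 100 lower-level templates, followed by a nonlinearity
    # (ReLU assumed here; the slide does not fix a particular choice).
    h = np.maximum(0, W1 @ x + b1)
    # Output layer: 10 class scores.
    s = W2 @ h + b2
    return s

x = rng.normal(size=3072)          # a flattened 32x32x3 image
print(forward(x).shape)            # (10,)
```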
Activation functions: Sigmoid, tanh, ELU, Maxout
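These activations can be sketched in NumPy as follows; the Maxout shown takes the maximum over two pre-activations, which is one common formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def elu(z, alpha=1.0):
    # Exponential Linear Unit: identity for z > 0, alpha*(e^z - 1) otherwise.
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def maxout(z1, z2):
    # Maxout over two linear pieces, given their pre-activations z1 and z2.
    return np.maximum(z1, z2)

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), elu(z), maxout(z, -z), sep="\n")
```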
Forward pass
Gradient descent
E(w) = Σᵢ l(xᵢ, yᵢ; w)
Dealing with multiple classes
• If we need to classify inputs into C different classes, we put
C units in the last layer to produce C one-vs.-others scores
𝑓1 , 𝑓2 , … , 𝑓𝐶
• Apply softmax function to convert these scores to
probabilities:
softmax(f₁, …, f_C) = ( exp(f₁) / Σⱼ exp(fⱼ), …, exp(f_C) / Σⱼ exp(fⱼ) )
• If one of the inputs is much larger than the others, then the
corresponding softmax value will be close to 1 and others will be
close to 0
• Use the negative log likelihood (cross-entropy) loss:
l(xᵢ, yᵢ; w) = −log P_w(yᵢ | xᵢ)
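A small NumPy sketch of the softmax and cross-entropy loss above; subtracting the maximum score before exponentiating is a standard numerical-stability detail added here, not part of the slide's formula:

```python
import numpy as np

def softmax(f):
    # Subtracting the max is a standard trick for numerical stability;
    # it does not change the result of the formula above.
    e = np.exp(f - np.max(f))
    return e / e.sum()

def cross_entropy(f, y):
    # l(x, y; w) = -log P_w(y | x), where P_w comes from the softmax scores.
    return -np.log(softmax(f)[y])

scores = np.array([2.0, 5.0, -1.0])   # f_1, ..., f_C for one example
print(softmax(scores))                # probabilities summing to 1
print(cross_entropy(scores, y=1))     # small loss: class 1 has the largest score
```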
Cross-Entropy Loss
By author of https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e.
• The categorical cross-entropy is computed as CE = −Σ_{c=1}^{C} y_c log(ŷ_c), where y is the one-hot ground-truth label vector and ŷ is the vector of predicted probabilities.
By author of https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e.
Training multi-layer networks: Gradient Descent
• Find network weights to minimize the prediction loss between
true and estimated labels of training examples:
E(w) = Σᵢ l(xᵢ, yᵢ; w)
• Update weights by gradient descent: w ← w − η ∂E/∂w
[Figure: gradient descent steps descending a loss surface over weights w1 and w2.]
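A minimal sketch of this update rule on a toy problem; the linear model, squared-error loss, and learning rate below are illustrative choices, not the lecture's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy inputs x_i
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)   # toy targets y_i

w = np.zeros(3)                               # initial weights
eta = 0.1                                     # learning rate

for step in range(200):
    # E(w) = sum_i l(x_i, y_i; w) with a squared-error l (illustrative choice).
    grad = 2 * X.T @ (X @ w - y) / len(y)     # dE/dw, averaged over examples
    w = w - eta * grad                        # w <- w - eta * dE/dw

print(w)                                      # close to w_true
```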
Vanilla Gradient Descent
https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c
Vanilla Gradient Descent
https://commons.wikimedia.org/wiki/File:Gradient_descent.gif
Training multi-layer networks: Gradient Descent
• Find network weights to minimize the prediction loss between true and estimated labels of training examples: E(w) = Σᵢ l(xᵢ, yᵢ; w)
• Update weights by gradient descent: w ← w − η ∂E/∂w
Learning rate
A Visual Explanation of Gradient Descent Methods
(Momentum, AdaGrad, RMSProp, Adam)
https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c
Gradient descent, how neural networks learn | Chapter 2, Deep learning (youtube.com): https://www.youtube.com/watch?v=IHZwWFHWa-w
Training multi-layer networks: Back Propagation
Backpropagation: a simple example
Worked example with inputs x = −2, y = 5, z = −4.
Want: the gradients ∂f/∂x, ∂f/∂y, ∂f/∂z, obtained step by step with the chain rule.
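A hedged sketch of such a step-by-step backward pass, assuming the common two-gate expression f(x, y, z) = (x + y)·z for this example (the exact expression appears only in the slide figure, not in the text):

```python
# Forward pass through the two gates.
x, y, z = -2.0, 5.0, -4.0
q = x + y                 # add gate:      q = 3
f = q * z                 # multiply gate: f = -12

# Backward pass: chain rule, starting from df/df = 1.
df_df = 1.0
df_dq = z * df_df         # local gradient of the multiply gate w.r.t. q: -4
df_dz = q * df_df         # local gradient of the multiply gate w.r.t. z:  3
df_dx = 1.0 * df_dq       # the add gate passes the upstream gradient through: -4
df_dy = 1.0 * df_dq       # likewise for y: -4

print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```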
At each node f of the computational graph, the “downstream gradients” are obtained by multiplying the node’s “local gradient” by the “upstream gradient” flowing back from the output.
Training multi-layer networks: Regularization
Figure source: “Stay away from overfitting: L2-norm Regularization, Weight Decay and L1-norm Regularization techniques” by Inara Koppert-Anisimova, unpack, Medium
Underfitting and overfitting
Training multi-layer networks: Regularization
• It is common to add a penalty (e.g., quadratic) on weight
magnitudes to the objective function:
E(w) = Σᵢ l(xᵢ, yᵢ; w) + λ‖w‖²
• Quadratic penalty encourages network to use all of its inputs “a little” rather
than a few inputs “a lot”
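A sketch of how the quadratic penalty enters training, reusing the toy gradient-descent setup from above; the value of λ is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

w, eta, lam = np.zeros(3), 0.1, 0.01

for step in range(200):
    data_grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of sum_i l(x_i, y_i; w)
    reg_grad = 2 * lam * w                       # gradient of lambda * ||w||^2
    w = w - eta * (data_grad + reg_grad)         # shrinks all weights "a little"

print(w)
```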
Multi-Layer Network Demo
http://playground.tensorflow.org/
References
Many slides, images and contents of this
lecture are adapted from:
• CS 231n: Deep Learning for Computer Vision
https://cs231n.stanford.edu/schedule.html
• CS 376: Computer Vision
http://vision.cs.utexas.edu/376-spring2018/#Syllabus
• 16-385: Computer Vision
http://www.cs.cmu.edu/~16385/