Visual Recognition
Introduction to image recognition
[Figure: a street scene annotated with object labels: buildings, a tree, lamps, umbrellas, people, a market stall, and the ground. Adapted from Fei-Fei Li.]
Detection, semantic segmentation,
instance segmentation
Image classification
The statistical learning framework
• Apply a prediction function to a feature representation of
the image to get the desired output:
f( ) = “apple”
f( ) = “tomato”
f( ) = “cow”
Classical statistical learning methods
• Some classical statistical learning methods include:
• Linear Regression
• Logistic Regression
• Decision Trees
• Naive Bayes
• Support Vector Machines
• K-Nearest Neighbors
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
The statistical learning framework
y = f(x)
• y: the output (predicted label)
• f: the prediction function
• x: the feature representation of the image
Testing: extract features from the test image and apply the learned model f to obtain a prediction.
Slide credit: D. Hoiem
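As a rough illustration of this framework in code (a sketch only: the color-histogram features and logistic-regression classifier below are arbitrary stand-ins, not the specific choices discussed in this lecture):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(image):
    # Toy feature representation x: a per-channel color histogram.
    # (Any hand-crafted or learned features could be used instead.)
    hist = [np.histogram(image[..., c], bins=8, range=(0, 255))[0]
            for c in range(3)]
    return np.concatenate(hist).astype(np.float32)

# Training: learn the prediction function f from labeled examples.
train_images = [np.random.randint(0, 256, (32, 32, 3)) for _ in range(20)]
train_labels = np.random.randint(0, 3, 20)          # e.g., apple/tomato/cow
X_train = np.stack([extract_features(im) for im in train_images])
f = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

# Testing: y = f(x) on the features of a new image.
test_image = np.random.randint(0, 256, (32, 32, 3))
y = f.predict(extract_features(test_image)[None, :])[0]
print("predicted label:", y)
```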
“Classic” recognition pipeline
Hand-crafted feature representation: Bag of words features
• Example: each group of patches belongs to the same visual word in the appearance codebook
Source: B. Leibe
Bag of features: Outline
1. Extract local features
2. Learn “visual vocabulary”
3. Quantize local features using visual vocabulary
4. Represent images by frequencies of “visual words”
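A minimal sketch of this outline in Python, assuming scikit-learn's KMeans for the vocabulary and flattened grayscale patches as a stand-in for real local descriptors such as SIFT; the patch size and vocabulary size are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.cluster import KMeans

def local_features(image, patch=8):
    # 1. Extract local features: flattened grayscale patches
    #    (a stand-in for descriptors such as SIFT).
    h, w = image.shape
    return np.stack([image[i:i+patch, j:j+patch].ravel()
                     for i in range(0, h - patch + 1, patch)
                     for j in range(0, w - patch + 1, patch)])

images = [np.random.rand(64, 64) for _ in range(10)]

# 2. Learn the "visual vocabulary" by clustering all local features.
all_feats = np.vstack([local_features(im) for im in images])
vocab = KMeans(n_clusters=16, n_init=10, random_state=0).fit(all_feats)

# 3.-4. Quantize each image's features against the vocabulary and represent
#       the image as a histogram of visual-word frequencies.
def bow_histogram(image):
    words = vocab.predict(local_features(image))
    return np.bincount(words, minlength=vocab.n_clusters)

print(bow_histogram(images[0]))
```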
“Classic” recognition pipeline
Classifiers: Nearest neighbor
[Figure: a test example is compared against training examples from class 1 and class 2 and assigned the label of its nearest neighbor.]
• L1 distance: D(h1, h2) = Σ_{i=1}^{N} |h1(i) − h2(i)|
• χ² distance: D(h1, h2) = Σ_{i=1}^{N} (h1(i) − h2(i))² / (h1(i) + h2(i))
[Figure: nearest-neighbor (K = 1) decision boundaries on a 2D toy dataset.]
Slide adapted from https://cs231n.stanford.edu/
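The two histogram distances above can be written directly in NumPy; this is a small illustrative sketch, and the eps term guarding against empty bins is an added implementation detail, not part of the slide's formula:

```python
import numpy as np

def l1_distance(h1, h2):
    # L1 distance: sum of absolute differences of histogram bins.
    return np.sum(np.abs(h1 - h2))

def chi2_distance(h1, h2, eps=1e-10):
    # Chi-squared distance: squared differences normalized by total bin mass.
    # eps avoids division by zero for empty bins (an implementation choice).
    return np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

h1 = np.array([3., 0., 5., 2.])
h2 = np.array([1., 1., 4., 4.])
print(l1_distance(h1, h2), chi2_distance(h1, h2))
```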
K-nearest neighbor classifier
• For a new point, find the k closest points from the training data
• Assign the class label by a majority vote among the labels of those k points
[Figure: classifying a new point using its k = 5 nearest neighbors.]
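A hedged sketch of the k-nearest-neighbor rule itself (the L1 distance and the toy 2-D data are illustrative choices):

```python
import numpy as np
from collections import Counter

def l1(a, b):
    return np.sum(np.abs(a - b))

def knn_predict(x, X_train, y_train, k=5, dist=l1):
    # Distances from the query x to every training example.
    dists = np.array([dist(x, xt) for xt in X_train])
    # Indices of the k closest training points.
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels.
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Toy usage on random 2-D points with two classes.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 2)) + np.repeat([[0, 0], [3, 3]], 10, axis=0)
y_train = np.repeat([0, 1], 10)
print(knn_predict(np.array([2.5, 2.5]), X_train, y_train, k=5))
```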
K-Nearest Neighbors
http://vision.stanford.edu/teaching/cs231n-demos/knn/
Slide adapted from https://cs231n.stanford.edu/
K-nearest neighbor classifier
• The best choice of k and of the distance metric is very problem/dataset-dependent.
• You must try them out (e.g., on held-out validation data) and see what works best.
• k-NN is useful for small datasets, but is not used very frequently in deep learning.
Slide adapted from https://cs231n.stanford.edu/
Best practices for training classifiers
[Figure: hyperparameter search with cross-validation: split the training data into folds, choose the best hyperparameters on the validation folds, then retrain and evaluate once on the held-out test set.]
https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-and-model-selection
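Following the linked scikit-learn workflow, a possible sketch of choosing k by cross-validation on the training split and evaluating once on the test set (the digits dataset and the candidate values of k are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Choose k by 5-fold cross-validation on the training split only.
scores = {}
for k in [1, 3, 5, 7, 9, 15]:
    cv = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                         X_train, y_train, cv=5)
    scores[k] = cv.mean()
best_k = max(scores, key=scores.get)

# Retrain with the best hyperparameter and evaluate once on the test set.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("best k:", best_k, "test accuracy:", final.score(X_test, y_test))
```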
Example Dataset: CIFAR10
• 10 classes
• 50,000 training images
• 10,000 testing images
[Figure: test images and their nearest neighbors in the training set.]
Alex Krizhevsky, “Learning Multiple Layers of Features from Tiny Images”, Technical Report, 2009.
[Figure: pixel-distance pitfall: all three images on the right have the same pixel distance to the one on the left.]
Original image is CC0 public domain. Slide adapted from https://cs231n.stanford.edu/
Parametric Approach: Linear Classifier
f(x) = sgn(w·x + b)
Support vector machines
• Find the hyperplane that maximizes the margin between the positive and negative examples
• xᵢ positive (yᵢ = 1):  xᵢ·w + b ≥ 1
• xᵢ negative (yᵢ = −1): xᵢ·w + b ≤ −1
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
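As a sketch of the max-margin idea in practice (not the lecture's own code), scikit-learn's SVC with a linear kernel and a large C approximates the hard-margin formulation above; the toy data below is illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, size=(20, 2)),
               rng.normal([3, 3], 0.5, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# A linear SVM maximizes the margin; a large C approximates the hard margin.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w:", w, "b:", b)
# Support vectors lie on the margin, so y_i (w.x_i + b) is close to 1 for them.
print("margin values:", y[clf.support_] * (X[clf.support_] @ w + b))
```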
Parametric Approach: Linear Classifier
• Input x: an array of 32×32×3 numbers (3072 numbers total), flattened into a 3072×1 vector
• W: the parameters, or weights
Cat image by Nikita is licensed under CC-BY 2.0
Slide adapted from https://cs231n.stanford.edu/
Example with an image with 4 pixels and 3 classes (cat/dog/ship): Algebraic Viewpoint
• Flatten the input tensor into a vector: x = [56, 231, 24, 2]
• Cat score: 0.2·56 + (−0.5)·231 + 0.1·24 + 2.0·2 + 1.1 = −96.8
• Dog score: 1.5·56 + 1.3·231 + 2.1·24 + 0.0·2 + 3.2 = 437.9
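The same worked example as a few lines of NumPy; only the cat and dog rows of W appear on the slide, so the sketch uses just those two:

```python
import numpy as np

# Flattened 4-pixel image from the slide.
x = np.array([56., 231., 24., 2.])

# Weight rows and biases shown on the slide (cat and dog rows only).
W = np.array([[0.2, -0.5, 0.1, 2.0],    # cat template
              [1.5,  1.3, 2.1, 0.0]])   # dog template
b = np.array([1.1, 3.2])

scores = W @ x + b                       # f(x) = Wx + b
print(scores)                            # [-96.8, 437.9]
```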
“Shallow” recognition pipeline
“Deep” recognition pipeline
Image pixels → Layer 1 → Layer 2 → Layer 3 → Simple Classifier
Neural networks vs. SVMs
(a.k.a. “deep” vs. “shallow” learning)
Brief history of neural networks
Important events
• AlexNet: Winner of ImageNet 2012
• Microsoft: Speech Recognition Breakthrough for the
Spoken, Translated Word, 2012.
• MIT 10 Breakthrough Technologies, 2013.
• Explosive growth of AI startups, since 2013.
• Deep-learning-based face recognition surpasses human performance, 2014.
• Wide deployment of face recognition techniques, 2015.
• AlphaGo, 2016.
• ResNet, 2016.
• Coming soon… (it could be your work!)
DALL-E 2
• “Teddy bears working on new AI research on the moon in the 1980s.”
• “Rabbits attending a college seminar on human anatomy.”
• “A wise cat meditating in the Himalayas searching for enlightenment.”
In a fantastical setting, a highly detailed furry humanoid skunk with piercing eyes confidently poses in a medium shot, wearing an animal hide jacket. The artist has masterfully rendered the character in digital art, capturing the intricate details of fur and clothing texture.
An illustration from a graphic novel. A bustling city street under the shine of a full moon. The sidewalks bustling with pedestrians enjoying the nightlife. At the corner stall, a young woman with fiery red hair, dressed in a signature velvet cloak, is haggling with the grumpy old vendor. The grumpy vendor, a tall, sophisticated man wearing a sharp suit, who sports a noteworthy mustache, is animatedly conversing on his steampunk telephone.
GPT-4
Segment Anything Model
Neural Networks (NN), also called Artificial Neural Networks, are named after the fact that they artificially model the working of a human being’s nervous system.
In simple terms, each neuron takes input from numerous other neurons through its dendrites. It then performs the required processing on the input and sends another electrical pulse through the axon to the terminal nodes, from where it is transmitted to numerous other neurons.
https://www.analyticsvidhya.com/blog/2016/03/introduction-deep-learning-fundamentals-neural-networks/
[Figure: a biological neuron. Dendrites carry impulses toward the cell body; the axon carries them to the presynaptic terminals.]
• A neuron is a function f, known as the activation function. It makes a neural network extremely flexible and imparts the capability to estimate complex non-linear relationships in the data.
1. x1, x2, …, xN: inputs to the neuron. These can either be actual observations from the input layer or intermediate values from one of the hidden layers.
2. x0: bias unit. This is a constant value added to the input of the activation function. It works similarly to an intercept term and typically has the value +1.
3. w0, w1, w2, …, wN: weights on each input. Note that even the bias unit has a weight.
4. a: output of the neuron, which is calculated as a = f(Σ_{i=0}^{N} wᵢ xᵢ).
Fundamental Function Examples
We will model these like linear classifiers with the following activation function:
• Let us implement a fundamental function, AND, using neural networks.
• This will help us understand how they work. You can treat these as classification problems where we predict the output (0 or 1) for different combinations of inputs.
Fundamental Function Examples
We will model these like linear classifiers with the following activation function:
• OR:  a = f(−0.5 + x1 + x2)
• NOT: a = f(1 − 2·x1)
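A minimal sketch of these threshold units in Python, assuming a step activation f(z) = 1 if z ≥ 0 and 0 otherwise (consistent with the OR and NOT weights above); the AND weights (−1.5, 1, 1) are an assumption that follows the same pattern and are not given on the slide:

```python
import numpy as np

def f(z):
    # Assumed step activation: fires (outputs 1) when the weighted sum is >= 0.
    return 1 if z >= 0 else 0

def neuron(weights, inputs):
    # a = f(w0*x0 + w1*x1 + ... + wN*xN), with the bias unit x0 fixed to +1.
    return f(np.dot(weights, [1, *inputs]))

OR  = lambda x1, x2: neuron([-0.5, 1, 1], [x1, x2])
NOT = lambda x1:     neuron([1, -2], [x1])
AND = lambda x1, x2: neuron([-1.5, 1, 1], [x1, x2])   # assumed weights

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "OR:", OR(x1, x2), "AND:", AND(x1, x2), "NOT x1:", NOT(x1))
```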
From a Neuron to the Multilayer Perceptron (MLP)
• x (3072-dim input) → W1 → h (100 hidden units) → W2 → s (10 class scores)
• Learn 100 lower-level templates instead of 10.
• (In practice we will usually add a learnable bias at each layer as well.)
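A minimal NumPy sketch of this two-layer forward pass; the ReLU nonlinearity and random initialization are assumptions for illustration, since the diagram itself does not fix them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the diagram: 3072 -> 100 -> 10.
W1 = rng.normal(0, 0.01, size=(100, 3072))
b1 = np.zeros(100)
W2 = rng.normal(0, 0.01, size=(10, 100))
b2 = np.zeros(10)

def forward(x):
    # Hidden layer: 100 lower-level templates, followed by a nonlinearity
    # (ReLU assumed here; the slide does not fix a particular choice).
    h = np.maximum(0, W1 @ x + b1)
    # Output layer: 10 class scores.
    s = W2 @ h + b2
    return s

x = rng.normal(size=3072)          # a flattened 32x32x3 image
print(forward(x).shape)            # (10,)
```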
Activation functions: Sigmoid, tanh, ELU, Maxout
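These activations can be sketched in NumPy as follows; the Maxout shown takes the maximum over two pre-activations, which is one common formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def elu(z, alpha=1.0):
    # Exponential Linear Unit: identity for z > 0, alpha*(e^z - 1) otherwise.
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def maxout(z1, z2):
    # Maxout over two linear pieces, given their pre-activations z1 and z2.
    return np.maximum(z1, z2)

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), elu(z), maxout(z, -z), sep="\n")
```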
Forward pass
Gradient descent
E(w) = Σᵢ l(xᵢ, yᵢ; w)
Dealing with multiple classes
• If we need to classify inputs into C different classes, we put
C units in the last layer to produce C one-vs.-others scores
𝑓1 , 𝑓2 , … , 𝑓𝐶
• Apply softmax function to convert these scores to
probabilities:
softmax(f₁, …, f_C) = ( exp(f₁) / Σⱼ exp(fⱼ), …, exp(f_C) / Σⱼ exp(fⱼ) )
• If one of the inputs is much larger than the others, then the
corresponding softmax value will be close to 1 and others will be
close to 0
• Use the negative log likelihood (cross-entropy) loss:
l(xᵢ, yᵢ; w) = −log P_w(yᵢ | xᵢ)
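A small NumPy sketch of the softmax and cross-entropy loss above; subtracting the maximum score before exponentiating is a standard numerical-stability detail added here, not part of the slide's formula:

```python
import numpy as np

def softmax(f):
    # Subtracting the max is a standard trick for numerical stability;
    # it does not change the result of the formula above.
    e = np.exp(f - np.max(f))
    return e / e.sum()

def cross_entropy(f, y):
    # l(x, y; w) = -log P_w(y | x), where P_w comes from the softmax scores.
    return -np.log(softmax(f)[y])

scores = np.array([2.0, 5.0, -1.0])   # f_1, ..., f_C for one example
print(softmax(scores))                # probabilities summing to 1
print(cross_entropy(scores, y=1))     # small loss: class 1 has the largest score
```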
Cross-Entropy Loss
By author of https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e.
• The categorical cross-entropy is computed as CE = −Σ_{c=1}^{C} y_c log(ŷ_c), where y is the one-hot ground-truth label vector and ŷ is the vector of predicted probabilities.
By author of https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e.
Training multi-layer networks: Gradient Descent
• Find network weights to minimize the prediction loss between
true and estimated labels of training examples:
E(w) = Σᵢ l(xᵢ, yᵢ; w)
• Update weights by gradient descent: w ← w − η ∂E/∂w
[Figure: gradient descent steps descending a loss surface over weights w1 and w2.]
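A minimal sketch of this update rule on a toy problem; the linear model, squared-error loss, and learning rate below are illustrative choices, not the lecture's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy inputs x_i
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)   # toy targets y_i

w = np.zeros(3)                               # initial weights
eta = 0.1                                     # learning rate

for step in range(200):
    # E(w) = sum_i l(x_i, y_i; w) with a squared-error l (illustrative choice).
    grad = 2 * X.T @ (X @ w - y) / len(y)     # dE/dw, averaged over examples
    w = w - eta * grad                        # w <- w - eta * dE/dw

print(w)                                      # close to w_true
```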
Vanilla Gradient Descent
https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c
Vanilla Gradient Descent
https://commons.wikimedia.org/wiki/File:Gradient_descent.gif
Training multi-layer networks: Gradient Descent
• Find network weights to minimize the prediction loss between true and estimated labels of training examples: E(w) = Σᵢ l(xᵢ, yᵢ; w)
• Update weights by gradient descent: w ← w − η ∂E/∂w
Learning rate
A Visual Explanation of Gradient Descent Methods
(Momentum, AdaGrad, RMSProp, Adam)
https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c
Gradient descent, how neural networks learn | Chapter 2, Deep learning (youtube.com): https://www.youtube.com/watch?v=IHZwWFHWa-w
Training multi-layer networks: Back Propagation
Backpropagation: a simple example
Worked example with inputs x = −2, y = 5, z = −4.
Want: the gradients ∂f/∂x, ∂f/∂y, ∂f/∂z, obtained step by step with the chain rule.
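A hedged sketch of such a step-by-step backward pass, assuming the common two-gate expression f(x, y, z) = (x + y)·z for this example (the exact expression appears only in the slide figure, not in the text):

```python
# Forward pass through the two gates.
x, y, z = -2.0, 5.0, -4.0
q = x + y                 # add gate:      q = 3
f = q * z                 # multiply gate: f = -12

# Backward pass: chain rule, starting from df/df = 1.
df_df = 1.0
df_dq = z * df_df         # local gradient of the multiply gate w.r.t. q: -4
df_dz = q * df_df         # local gradient of the multiply gate w.r.t. z:  3
df_dx = 1.0 * df_dq       # the add gate passes the upstream gradient through: -4
df_dy = 1.0 * df_dq       # likewise for y: -4

print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```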
At each node f of the computational graph, the “downstream gradients” are obtained by multiplying the node’s “local gradient” by the “upstream gradient” flowing back from the output.
Training multi-layer networks: Regularization
Figure source: “Stay away from overfitting: L2-norm Regularization, Weight Decay and L1-norm Regularization techniques” by Inara Koppert-Anisimova, unpack, Medium
Underfitting and overfitting
Training multi-layer networks: Regularization
• It is common to add a penalty (e.g., quadratic) on weight
magnitudes to the objective function:
E(w) = Σᵢ l(xᵢ, yᵢ; w) + λ‖w‖²
• Quadratic penalty encourages network to use all of its inputs “a little” rather
than a few inputs “a lot”
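A sketch of how the quadratic penalty enters training, reusing the toy gradient-descent setup from above; the value of λ is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

w, eta, lam = np.zeros(3), 0.1, 0.01

for step in range(200):
    data_grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of sum_i l(x_i, y_i; w)
    reg_grad = 2 * lam * w                       # gradient of lambda * ||w||^2
    w = w - eta * (data_grad + reg_grad)         # shrinks all weights "a little"

print(w)
```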
Multi-Layer Network Demo
http://playground.tensorflow.org/
References
Many slides, images and contents of this
lecture are adapted from:
• CS 231n: Deep Learning for Computer Vision
https://cs231n.stanford.edu/schedule.html
• CS 376: Computer Vision
http://vision.cs.utexas.edu/376-spring2018/#Syllabus
• 16-385: Computer Vision
http://www.cs.cmu.edu/~16385/