
COMP 5523 Lecture 712

Computer Vision and Image Processing


Visual Recognition
Fall, 2024
Instructor: WU, Xiao-Ming
For internal use only,
please do not distribute!
Outline
• Visual recognition tasks
• “Shallow” recognition pipeline
• “Bag of features” representation
• Classifiers: K-NN, linear classifier
• “Deep” recognition pipeline
• Multi-layer neural networks
• Training multi-layer neural networks

2
Introduction to image recognition

Source: Charley Harper 3


Common recognition tasks

Adapted from Fei-Fei Li


4
Image classification and tagging
• outdoor
• mountains
• city
• Asia
• Lhasa
•…

Adapted from Fei-Fei Li


5
Object detection
• find pedestrians

Adapted from Fei-Fei Li


6
Activity recognition
• walking
• shopping
• rolling a cart
• sitting
• talking
•…

Adapted from Fei-Fei Li


7
Semantic segmentation

Adapted from Fei-Fei Li


8
Semantic segmentation
Example region labels: sky, mountain, building, tree, lamp, umbrella, person, market stall, ground

Adapted from Fei-Fei Li

9
Detection, semantic segmentation,
instance segmentation

(Figure panels: image classification, object detection, semantic segmentation, instance segmentation)


Image source 10
Image description
This is a busy street in an Asian city.
Mountains and a large palace or
fortress loom in the background. In the
foreground, we see colorful souvenir
stalls and people walking around and
shopping. One person in the lower left
is pushing an empty cart, and a couple
of people in the middle are sitting,
possibly posing for a photograph.

Adapted from Fei-Fei Li


11
Outline
• Visual recognition tasks
• “Shallow” recognition pipeline
• “Bag of features” representation
• Classifiers: K-NN, linear classifier
• “Deep” recognition pipeline
• Multi-layer neural networks
• Training multi-layer neural networks

12
Image classification

13
The statistical learning framework
• Apply a prediction function to a feature representation of
the image to get the desired output:

f( ) = “apple”
f( ) = “tomato”
f( ) = “cow”
14
Classical statistical learning methods
• Some classical statistical learning methods include:
• Linear Regression
• Logistic Regression
• Decision Trees
• Naive Bayes
• Support Vector Machines
• K-Nearest Neighbors
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)

15
The statistical learning framework

y = f(x)
where y is the output, f is the prediction function, and x is the feature representation

• Training: given a training set of labeled examples


{(x1,y1), …, (xN,yN)}, estimate the prediction function f by
minimizing the prediction error on the training set
• Testing: apply f to a never-before-seen test example x and output the predicted value y = f(x)
16
Steps
Training: Training Images → Image Features → Training (with Training Labels) → Learned model

Testing: Test Image → Image Features → Learned model → Prediction

Slide credit: D. Hoiem 17
“Classic” recognition pipeline

Image pixels → Feature representation → Trainable classifier → Class label

• Hand-crafted feature representation


• Off-the-shelf trainable classifier

18
Hand-crafted feature representation:
Bag of words features

Visual words: main idea


19
Visual words

• Example: each
group of patches
belongs to the
same visual word

Figure from Sivic & Zisserman, ICCV 2003 20


Visual vocabularies


Appearance codebook
Source: B. Leibe 21
Bag of features: Outline
1. Extract local features
2. Learn “visual vocabulary”
3. Quantize local features using visual vocabulary
4. Represent images by frequencies of “visual words”

22
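A minimal sketch of these four steps, using k-means for the visual vocabulary. Here extract_local_descriptors is a hypothetical stand-in for a real local feature extractor (e.g., dense SIFT); it is not defined in the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptor_sets, vocab_size=100):
    # 2. learn the "visual vocabulary" by clustering all local descriptors
    all_desc = np.vstack(descriptor_sets)
    return KMeans(n_clusters=vocab_size, n_init=10).fit(all_desc)

def bag_of_features(descriptors, vocab):
    # 3. quantize each local feature to its nearest visual word
    words = vocab.predict(descriptors)
    # 4. represent the image by the (normalized) frequency of each visual word
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# usage sketch (extract_local_descriptors is assumed, not defined here):
# descriptor_sets = [extract_local_descriptors(img) for img in training_images]
# vocab = build_vocabulary(descriptor_sets, vocab_size=200)
# X_train = np.array([bag_of_features(d, vocab) for d in descriptor_sets])
```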
“Classic” recognition pipeline

Image pixels → Feature representation → Trainable classifier → Class label

• Hand-crafted feature representation


• Off-the-shelf trainable classifier

23
Classifiers: Nearest neighbor

(Figure: training examples from class 1, training examples from class 2, and a test example)

f(x) = label of the training example nearest to x

All we need is a distance or similarity function for our inputs


No training required!
24
Functions for comparing histograms
• L1 distance:
  D(h1, h2) = Σ_{i=1..N} |h1(i) − h2(i)|

• χ² distance:
  D(h1, h2) = Σ_{i=1..N} (h1(i) − h2(i))² / (h1(i) + h2(i))

• Quadratic distance (cross-bin distance):
  D(h1, h2) = Σ_{i,j} A_ij (h1(i) − h2(j))²

• Histogram intersection (similarity function):
  I(h1, h2) = Σ_{i=1..N} min(h1(i), h2(i))
25
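A minimal numpy sketch of these four comparison functions, assuming h1 and h2 are equal-length non-negative histograms and A is a bin-similarity matrix for the quadratic distance:

```python
import numpy as np

def l1_distance(h1, h2):
    # sum of absolute bin-wise differences
    return np.sum(np.abs(h1 - h2))

def chi2_distance(h1, h2, eps=1e-10):
    # (h1(i) - h2(i))^2 / (h1(i) + h2(i)), summed over bins;
    # eps avoids division by zero for empty bins
    return np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def quadratic_distance(h1, h2, A):
    # cross-bin distance: sum_ij A_ij (h1(i) - h2(j))^2
    d = h1[:, None] - h2[None, :]
    return np.sum(A * d ** 2)

def histogram_intersection(h1, h2):
    # similarity (not a distance): sum of bin-wise minima
    return np.sum(np.minimum(h1, h2))
```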
What does this look like?

Slide adapted from https://cs231n.stanford.edu/


1-nearest neighbor 26
K-Nearest Neighbors: Distance Metric
L1 (Manhattan) distance L2 (Euclidean) distance

K=1 K=1
Slide adapted from https://cs231n.stanford.edu/ 27
K-nearest neighbor classifier
• For a new point, find the k closest points from training data
• Vote for class label with labels of the k points

k=5

28
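A minimal numpy k-NN classifier following these two steps (L2 distance is assumed here; any of the histogram comparison functions above could be substituted):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # 1. find the k closest training points (L2 distance assumed)
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    # 2. take a majority vote over their class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
```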
K-Nearest Neighbors

K=1 K=3 K=5

Instead of copying label from nearest neighbor,


take majority vote from K closest points
Slide adapted from https://cs231n.stanford.edu/ 29
K-Nearest Neighbors: try it yourself!

http://vision.stanford.edu/teaching/cs231n-demos/knn/
Slide adapted from https://cs231n.stanford.edu/ 30
K-nearest neighbor classifier

Which classifier is more robust to outliers?

Credit: Andrej Karpathy, http://cs231n.github.io/classification/ 31


Hyperparameters
• What is the best value of k to use? What is the
best distance metric to use?

• These are hyperparameters: choices about the algorithms


themselves.

• Very problem/dataset-dependent.
• Must try them all out and see what works best.

Slide adapted from https://cs231n.stanford.edu/ 32


Setting Hyperparameters
Idea #1: Choose hyperparameters that work best on the training data.
  (split: train)
  BAD: K = 1 always works perfectly on training data.

Idea #2: Choose hyperparameters that work best on the test data.
  (split: train | test)
  BAD: No idea how the algorithm will perform on new data.

Idea #3: Split data into train, val; choose hyperparameters on val and evaluate on test.
  (split: train | validation | test)
  Better!
Slide adapted from https://cs231n.stanford.edu/
33
Setting Hyperparameters
Idea #4: Cross-Validation: Split the data into folds, try each fold as validation and average the results.

  (split: fold 1 | fold 2 | fold 3 | fold 4 | fold 5 | test — each fold takes a turn as the validation set)

Useful for small datasets, but not used too frequently in deep learning
Slide adapted from https://cs231n.stanford.edu/ 34
Best practices for training classifiers

• Goal: obtain a classifier with good generalization, i.e., good performance on never-before-seen data

1. Learn parameters on the training set


2. Tune hyperparameters (implementation choices) on
the held-out validation set
3. Evaluate performance on the test set
• Crucial: do not peek at the test set when iterating
steps 1 and 2!

35
Best practices for training classifiers
(Figure: cross-validated search to find the best hyperparameters, followed by final evaluation on the test set)
https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-and-model-selection 36
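A hedged sketch of this model-selection workflow with scikit-learn, choosing k for a k-NN classifier by 5-fold cross-validation. The synthetic data here is only a stand-in for real image features:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# placeholder data standing in for real image features and labels
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)

X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

best_k, best_score = None, -np.inf
for k in [1, 3, 5, 7, 9, 15]:
    # 5-fold cross-validation on the train+val portion only (never touch the test set here)
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_trainval, y_trainval, cv=5)
    if scores.mean() > best_score:
        best_k, best_score = k, scores.mean()

# evaluate the chosen hyperparameter once on the held-out test set
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_trainval, y_trainval)
print(best_k, final.score(X_test, y_test))
```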
Example Dataset: CIFAR10
10 classes
50,000 training images
10,000 testing images

(Figure: test images and their nearest neighbors)

Alex Krizhevsky, “Learning Multiple Layers of Features from Tiny Images”, Technical Report, 2009.

Slide adapted from https://cs231n.stanford.edu/


37
Setting Hyperparameters
Example of 5-fold cross-validation for the value of k.

Each point: a single outcome.

The line goes through the mean; bars indicate the standard deviation.

(It seems that k ≈ 7 works best for this data.)
Slide adapted from https://cs231n.stanford.edu/ 38
k-Nearest Neighbor with pixel distance is never used in practice.
- Distance metrics on raw pixels are not informative
Original Occluded Shifted (1 pixel) Tinted

(All three images on the right have the same pixel distances to the one on the left)

39
Original image is CC0 public domain Slide adapted from https://cs231n.stanford.edu/
Parametric Approach: Linear Classifier

Find a linear function to separate the classes:

f(x) = sgn(w · x + b)
40
Support vector machines
• Find hyperplane that maximizes the margin between the
positive and negative examples
xi positive (yi = 1):  xi · w + b ≥ 1
xi negative (yi = −1): xi · w + b ≤ −1

For support vectors, xi · w + b = ±1

Distance between a point and the hyperplane: |xi · w + b| / ||w||

Therefore, the margin is 2 / ||w||

(Figure: support vectors and the margin)

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining
41
and Knowledge Discovery, 1998
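A small scikit-learn sketch illustrating the margin 2/||w|| on toy 2-D data. The synthetic blobs and the large C value (approximating the hard-margin case above) are assumptions, not from the slide:

```python
import numpy as np
from sklearn.svm import SVC

# toy linearly separable data: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, -2], 0.5, (50, 2)),
               rng.normal([+2, +2], 0.5, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# a large C approximates the hard-margin SVM described above
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)   # margin = 2 / ||w||
print("margin:", margin)
# support vectors should satisfy y_i (w·x_i + b) ≈ 1
print(y[clf.support_] * (X[clf.support_] @ w + b))
```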
Parametric Approach: Linear Classifier
f(x, W) = Wx + b

x: the image, an array of 32x32x3 numbers (3072 numbers total), flattened to a 3072x1 vector
W: 10x3072 matrix of parameters (weights)
b: 10x1 bias vector
f(x, W): 10x1 vector — 10 numbers giving the class scores

Cat image by Nikita is licensed under CC-BY 2.0
Slide adapted from https://cs231n.stanford.edu/ 53
Example with an image with 4 pixels and 3 classes (cat/dog/ship) — Algebraic Viewpoint
Flatten the input tensor into a vector:

Input image pixels: x = [56, 231, 24, 2]

W = [[0.2, -0.5,  0.1,  2.0],      b = [ 1.1,
     [1.5,  1.3,  2.1,  0.0],            3.2,
     [0.0,  0.25, 0.2, -0.3]]           -1.2]

Wx + b = [-96.8 (cat score), 437.9 (dog score), 61.95 (ship score)]
Slide adapted from https://cs231n.stanford.edu/ 58
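A quick numpy check of this worked example. Note that W·x + b gives 60.75 for the ship class once the −1.2 bias is included; the 61.95 shown on the slide appears to be the ship score before the bias is added:

```python
import numpy as np

W = np.array([[0.2, -0.5,  0.1,  2.0],
              [1.5,  1.3,  2.1,  0.0],
              [0.0,  0.25, 0.2, -0.3]])
b = np.array([1.1, 3.2, -1.2])
x = np.array([56.0, 231.0, 24.0, 2.0])   # flattened 4-pixel image

scores = W @ x + b
print(scores)   # ≈ [-96.8, 437.9, 60.75] -> cat, dog, ship scores
```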
Interpreting a Linear Classifier

Slide adapted from https://cs231n.stanford.edu/ 59


Interpreting a Linear Classifier: Visual Viewpoint

Slide adapted from https://cs231n.stanford.edu/ 60


Interpreting a Linear Classifier: Visual Viewpoint

Plot created using Wolfram Cloud

Slide adapted from https://cs231n.stanford.edu/ 61


Hard cases for a linear classifier
Case 1 — Class 1: first and third quadrants; Class 2: second and fourth quadrants
Case 2 — Class 1: 1 <= L2 norm <= 2; Class 2: everything else
Case 3 — Class 1: three modes; Class 2: everything else

Slide adapted from https://cs231n.stanford.edu/


47
Nearest neighbor vs. linear classifiers
• NN pros:
• Simple to implement
• Decision boundaries not necessarily linear
• Works for any number of classes
• Nonparametric method
• NN cons:
• Need good distance function
• Slow at test time
• Linear pros:
• Low-dimensional parametric representation
• Very fast at test time
• Linear cons:
• Basic formulation only works for two classes
• How to train the linear function?
• What if data is not linearly separable? 48
Outline
• Visual recognition tasks
• “Shallow” recognition pipeline
• “Bag of features” representation
• Classifiers: K-NN, linear classifier
• “Deep” recognition pipeline
• Multi-layer neural networks
• Training multi-layer neural networks

49
“Shallow” recognition pipeline

Image pixels → Feature representation → Trainable classifier → Class label

• Hand-crafted feature representation


• Off-the-shelf trainable classifier

50
“Deep” recognition pipeline

Image pixels → Layer 1 → Layer 2 → Layer 3 → Simple classifier

• Learn a feature hierarchy from pixels to classifier


• Each layer extracts features from the output of
previous layer
• Train all layers jointly

51
Neural networks vs. SVMs
(a.k.a. “deep” vs. “shallow” learning)

52
Brief history of neural networks

53
Important events
• AlexNet: Winner of ImageNet 2012
• Microsoft: Speech Recognition Breakthrough for the
Spoken, Translated Word, 2012.
• MIT 10 Breakthrough Technologies, 2013.
• Explosive Growth of AI Startups, since 2013.
• Deep learning-based face recognition surpasses human performance, 2014
• Wide deployment of face recognition techniques, 2015
• AlphaGo, 2016.
• ResNet, 2016.
• Coming … (it can be your work!)

54
DALL-E 2

“Teddy bears working on new AI research on the moon in the 1980s.”
“Rabbits attending a college seminar on human anatomy.”
“A wise cat meditating in the Himalayas searching for enlightenment.”

Image source: Sam Altman, https://openai.com/dall-e-2/, https://twitter.com/sama/status/1511724264629678084 55


DALL-E 3

In a fantastical setting, a
highly detailed furry
humanoid skunk with
piercing eyes confidently
poses in a medium shot,
wearing an animal hide
jacket. The artist has
masterfully rendered the
character in digital art,
capturing the intricate details
of fur and clothing texture.

Betker, James, et al. "Improving image generation with better captions." https://cdn.openai.com/papers/dall-e-3.pdf (2023).

56
An illustration from a graphic novel.
A bustling city street under the shine
of a full moon. The sidewalks
bustling with pedestrians enjoying
the nightlife. At the corner stall, a
young woman with fiery red hair,
dressed in a signature velvet cloak, is
haggling with the grumpy old
vendor. The grumpy vendor, a tall,
sophisticated man wearing a sharp
suit, who sports a noteworthy
mustache is animatedly conversing
on his steampunk telephone.

Betker, James, et al. "Improving image generation with better captions." https://cdn.openai.com/papers/dall-e-3.pdf (2023).

57
GPT-4

Image source: https://openai.com/research/gpt-4

58
Segment Anything Model

Kirillov et al., Segment Anything, 2023 59


Sora

Introducing Sora — OpenAI’s text-to-video model Lecture 4 - 60


https://www.youtube.com/watch?v=bfmFfD2RIcg 61
62
https://www.youtube.com/watch?v=aircAruvnKk But what is a neural network? | Chapter 1, Deep learning (youtube.com)
Starting from a Simple Neuron

Neural Networks (NN), also called Artificial Neural Networks, are named after the way they artificially model the working of the human nervous system.

The nervous system comprises millions of nerve cells, or neurons. A neuron has the following structure:

In simple terms, each neuron takes input from numerous other neurons through its dendrites. It then performs the required processing on the input and sends another electrical pulse through the axon into the terminal nodes, from where it is transmitted to numerous other neurons.

https://www.analyticsvidhya.com/blog/2016/03/introduction-deep-learning-fundamentals-neural-networks/
63
(Figure: a biological neuron — dendrites carry impulses toward the cell body; the axon carries impulses away from the cell body to the presynaptic terminals — alongside its artificial counterpart, which sums weighted inputs and applies a sigmoid activation function.)

This image by Fotis Bobolas is licensed under CC-BY 2.0
This image by Felipe Perucho is licensed under CC-BY 3.0

Slide adapted from https://cs231n.stanford.edu/ Lecture 4 - 64


Biological neurons: complex connectivity patterns.
Neurons in a neural network: organized into regular layers for computational efficiency.

This image is CC0 Public Domain

Slide adapted from https://cs231n.stanford.edu/


65
How Does a Single Artificial Neuron Work?

• A neuron applies a function f, known as the activation function. It makes a neural network extremely flexible and imparts the capability to estimate complex non-linear relationships in the data.

1. x1, x2, …, xN: inputs to the neuron. These can be either actual observations from the input layer or intermediate values from one of the hidden layers.
2. x0: bias unit. This is a constant value added to the input of the activation function. It works like an intercept term and typically has the value +1.
3. w0, w1, w2, …, wN: weights on each input. Note that even the bias unit has a weight.
4. a: output of the neuron, which is calculated as a = f(w0·x0 + w1·x1 + … + wN·xN).

66
We will model these like linear classifiers
with the following activation function:
Fundamental
Function Examples

• Let us implement a fundamental function – AND – using Neural Networks.
• This will help us understand how they work. You can think of these as classification problems where we predict the output (0 or 1) for different combinations of inputs.

67
We will model these like linear classifiers with
the following activation function:
Fundamental
Function Examples

• Similarly, we will have


OR and NOT using
Neural Networks.

OR:  a = f(-0.5 + x1 + x2)
NOT: a = f(1 − 2·x1)

68
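A small Python sketch of these gates with a single threshold neuron. The unit-step activation f(z) = 1 if z ≥ 0 else 0 and the AND weights (−1.5 + x1 + x2) are assumptions consistent with the OR and NOT formulas above, since the AND formula itself is not shown in the text:

```python
def f(z):
    # assumed unit-step activation: fires (1) when the weighted sum is non-negative
    return 1 if z >= 0 else 0

def OR(x1, x2):
    return f(-0.5 + x1 + x2)      # formula from the slide

def NOT(x1):
    return f(1 - 2 * x1)          # formula from the slide

def AND(x1, x2):
    return f(-1.5 + x1 + x2)      # assumed weights; fires only when both inputs are 1

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
print("NOT 0:", NOT(0), "NOT 1:", NOT(1))
```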
From
A Neuron
to
Multilayer Perceptron (MLP)

• An ANN works in a very similar fashion. A multilayer perceptron (MLP) is a class of feedforward artificial neural network. An MLP consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer.

1. Input Layer: the training observations are fed in through these neurons.
2. Hidden Layers: the intermediate layers between input and output, which help the Neural Network learn the complicated relationships in the data.
3. Output Layer: the final output is extracted from the previous two layers. For example, for a classification problem with 5 classes, the output layer will have 5 neurons.
69
Neural networks: also called fully connected network
(Before) Linear score function: f = Wx (the original linear classifier)
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1 x))

(In practice we will usually add a learnable bias at each layer as well)

Slide adapted from https://cs231n.stanford.edu/ Lecture 4 - 70


Neural networks: hierarchical computation
(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x), i.e., x (3072-dim) → W1 → h (100-dim) → W2 → s (10 class scores)

Share templates between classes: learn 100 lower-level templates instead of 10.

Slide adapted from https://cs231n.stanford.edu/


71
Neural networks: why is max operator important?
(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

The function max(0, ·) is called the activation function.

Q: What if we try to build a neural network without one?

A: We end up with a linear classifier again!

Slide adapted from https://cs231n.stanford.edu/ Lecture 4 - 72


Activation functions

Sigmoid, tanh, ReLU (Rectified Linear Unit), Leaky ReLU, ELU, Maxout

ReLU: range 0 to infinity. ReLU is a good default choice for most problems.
Leaky ReLU: range −infinity to infinity. The leak helps to increase the range of the ReLU function.
Slide adapted from https://cs231n.stanford.edu/
73
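A minimal numpy sketch of the most common of these activations (the 0.01 leak slope is a typical default, not taken from the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                      # squashes to (-1, 1)

def relu(z):
    return np.maximum(0, z)                # range [0, infinity)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)   # small slope for z < 0 extends the range
```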
Neural networks: Architectures

“Fully-connected” layers

“2-layer Neural Net”, or “1-hidden-layer Neural Net”
“3-layer Neural Net”, or “2-hidden-layer Neural Net”

Slide adapted from https://cs231n.stanford.edu/


74
Example feed-forward computation of a neural network

Slide adapted from https://cs231n.stanford.edu/ Lecture 4 - 75


Full implementation of training a 2-layer Neural Network needs ~20 lines:

Define the network

Forward pass

Calculate the analytical gradients

Gradient descent

Slide adapted from https://cs231n.stanford.edu/ Lecture 4 - 76
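A minimal reimplementation sketch of such a ~20-line training loop, following the four labeled steps above. The random data, sigmoid hidden units, quadratic loss, and 1e-4 step size are assumptions in the spirit of the cs231n example, not the slide's own code:

```python
import numpy as np
from numpy.random import randn

# define the network: random data and weights
N, D_in, H, D_out = 64, 1000, 100, 10
x, y = randn(N, D_in), randn(N, D_out)
w1, w2 = randn(D_in, H), randn(H, D_out)

for t in range(2000):
    # forward pass
    h = 1.0 / (1.0 + np.exp(-x.dot(w1)))   # hidden layer (sigmoid activation)
    y_pred = h.dot(w2)                      # output scores
    loss = np.square(y_pred - y).sum()      # quadratic loss

    # calculate the analytical gradients (backpropagation)
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))

    # gradient descent update
    learning_rate = 1e-4
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
```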


Training multi-layer networks: Loss Function
• Find network weights to minimize the prediction loss between
true and estimated labels of training examples:

E(w) = Σ_i l(x_i, y_i; w)

• Possible losses (for binary problems):
  • Quadratic loss: l(x_i, y_i; w) = ||f_w(x_i) − y_i||²
  • Log likelihood loss: l(x_i, y_i; w) = −log P_w(y_i | x_i)
  • Hinge loss: l(x_i, y_i; w) = max(0, 1 − y_i f_w(x_i))

77
Dealing with multiple classes
• If we need to classify inputs into C different classes, we put C units in the last layer to produce C one-vs.-others scores f_1, f_2, …, f_C
• Apply the softmax function to convert these scores to probabilities:

  softmax(f_1, …, f_C) = ( exp(f_1) / Σ_j exp(f_j), …, exp(f_C) / Σ_j exp(f_j) )

• If one of the inputs is much larger than the others, then the corresponding softmax value will be close to 1 and the others will be close to 0
• Use the log likelihood (cross-entropy) loss: l(x_i, y_i; w) = −log P_w(y_i | x_i)

78
Cross-Entropy Loss

Input image source: Photo by Victor Grabarczyk on Unsplash . Diagram by author of


https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e.
79
Cross-Entropy Loss
• Cross-entropy is defined as L_CE = −Σ_i t_i log(p_i), where t_i is the true label and p_i is the predicted (softmax) probability for the i-th class

By author of https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e.

80
Cross-Entropy Loss
• The categorical cross-entropy is computed as follows

81
By author of https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e.
Cross-Entropy Loss

• Assume that after some iterations of model training the model


outputs the following vector of logits

By author of https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e. 82
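A small numpy sketch of the softmax and cross-entropy computations described above (the example logits here are made up for illustration, since the original vector did not appear in the text):

```python
import numpy as np

def softmax(logits):
    # subtract the max for numerical stability; the result sums to 1
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def cross_entropy(probs, true_class):
    # -log of the probability assigned to the true class
    return -np.log(probs[true_class])

logits = np.array([2.0, 1.0, 0.1])    # hypothetical scores for 3 classes
probs = softmax(logits)                # ≈ [0.659, 0.242, 0.099]
print(probs, cross_entropy(probs, true_class=0))
```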
Training multi-layer networks: Gradient Descent
• Find network weights to minimize the prediction loss between
true and estimated labels of training examples:

E(w) = Σ_i l(x_i, y_i; w)

• Update weights by gradient descent: w ← w − η ∂E/∂w (η is the learning rate)

(Figure: gradient descent trajectory on the loss surface over weights w1 and w2)
83
Vanilla Gradient Descent

https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c 84
Vanilla Gradient Descent

Gradient descent is a simple method


to find the minimum of a function,
where at each iteration a small step is
made in the direction of the steepest
descent. It tends to get stuck in a
local minimum, so it is often run with
several initial conditions.

https://commons.wikimedia.org/wiki/File:Gradient_descent.gif

85
Training multi-layer networks: Gradient Descent
• Find network weights to minimize the prediction loss between true
and estimated labels of training examples:

E(w) = Σ_i l(x_i, y_i; w)

• Update weights by gradient descent: w ← w − η ∂E/∂w

• Back-propagation: gradients are computed in the direction from


output to input layers and combined using chain rule
• Stochastic gradient descent: compute the weight update w.r.t.
one training example (or a small batch of examples) at a time, cycle
through training examples in random order in multiple epochs

86
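A schematic mini-batch SGD loop matching this description. The gradient function grad_fn is a placeholder standing in for back-propagation through the model; shuffling gives the random order of examples in each epoch:

```python
import numpy as np

def sgd(w, X, y, grad_fn, learning_rate=0.1, batch_size=32, epochs=10):
    """grad_fn(w, X_batch, y_batch) is assumed to return dE/dw for the batch
    (computed by back-propagation inside the model)."""
    n = len(X)
    for epoch in range(epochs):
        order = np.random.permutation(n)           # cycle through examples in random order
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # one small batch of examples
            w = w - learning_rate * grad_fn(w, X[idx], y[idx])
    return w
```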
Learning rate

Bad learning rate

87
Learning rate

88
A Visual Explanation of Gradient Descent Methods
(Momentum, AdaGrad, RMSProp, Adam)

Animation of 5 gradient descent


methods on a surface: gradient
descent (cyan), momentum
(magenta), AdaGrad (white),
RMSProp (green), Adam (blue).
Left well is the global minimum;
right well is a local minimum.

https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c
89
90
https://www.youtube.com/watch?v=IHZwWFHWa-w Gradient descent, how neural networks learn | Chapter 2, Deep learning (youtube.com)
Training multi-layer networks: Back Propagation

91
Backpropagation: a simple example

e.g. x = -2, y = 5, z = -4

Want: the gradients of f with respect to x, y, and z

Chain rule: downstream gradient = upstream gradient × local gradient

Slides adapted from https://cs231n.stanford.edu/


At each node f of the computational graph, backpropagation multiplies the "upstream gradient" arriving from the output side by the node's "local gradient" to produce the "downstream gradients" passed back to its inputs.

Slides adapted from https://cs231n.stanford.edu/
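A minimal Python walk-through of this example, assuming the standard cs231n function f(x, y, z) = (x + y)·z for the inputs shown above (only the input values appear in the text):

```python
# Manual forward and backward pass for f(x, y, z) = (x + y) * z
# (the function is assumed; the inputs x = -2, y = 5, z = -4 come from the slides)
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y          # q = 3
f = q * z          # f = -12

# backward pass (chain rule: downstream = upstream * local)
df_df = 1.0                 # gradient of f with respect to itself
df_dq = z * df_df           # local gradient of q*z w.r.t. q is z  -> -4
df_dz = q * df_df           # local gradient of q*z w.r.t. z is q  ->  3
df_dx = 1.0 * df_dq         # local gradient of x+y w.r.t. x is 1  -> -4
df_dy = 1.0 * df_dq         # local gradient of x+y w.r.t. y is 1  -> -4

print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```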


114
https://www.youtube.com/watch?v=Ilg3gGewQ5U What is backpropagation really doing? | Chapter 3, Deep learning (youtube.com)
Training multi-layer networks: Regularization

• Neural networks with at least one hidden


layer are universal function approximators

115
Training multi-layer networks: Regularization

• Hidden layer size and network capacity:

Network with a single hidden layer

Source: http://cs231n.github.io/neural-networks-1/ 116


Underfitting and overfitting
• Underfitting: training and test error are both high
• Model does an equally poor job on the training and the test set
• The model is too “simple” to represent the data or the model
is not trained well
• Overfitting: Training error is low but test error is high
• Model fits irrelevant characteristics (noise) in the training data
• Model is too complex or amount of training data is insufficient
Underfitting Good tradeoff Overfitting

117
Figure source
118
Stay away from overfitting: L2-norm Regularization, Weight Decay and L1-norm Regularization techniques | by Inara Koppert-Anisimova | unpack | Medium
Underfitting and overfitting

119
Training multi-layer networks: Regularization
• It is common to add a penalty (e.g., quadratic) on weight
magnitudes to the objective function:

E(w) = Σ_i l(x_i, y_i; w) + λ||w||²

• The quadratic penalty encourages the network to use all of its inputs “a little” rather than a few inputs “a lot”

Source: http://cs231n.github.io/neural-networks-1/ 120
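A small sketch of how this penalty can enter training code; reg (standing for λ) is an assumed hyperparameter value and the helper names are illustrative:

```python
import numpy as np

def l2_regularized_loss(data_loss, weights, reg=1e-3):
    # add lambda * ||w||^2, summed over all weight matrices, to the data loss
    penalty = sum(np.square(w).sum() for w in weights)
    return data_loss + reg * penalty

def add_weight_decay(grads, weights, reg=1e-3):
    # gradient of the quadratic penalty: d/dw (reg * ||w||^2) = 2 * reg * w
    return [g + 2 * reg * w for g, w in zip(grads, weights)]
```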


Neural networks: Pros and cons
• Pros
• Flexible and general function approximation framework
• Can build extremely powerful models by adding more layers
• Cons
• Hard to analyze theoretically (e.g., training is prone to local
optima)
• Huge amount of training data, computing power may be required
to get good performance
• The space of implementation choices is huge (network
architectures, parameters)

121
Multi-Layer Network Demo

http://playground.tensorflow.org/
122
References
Many slides, images and contents of this
lecture are adapted from:
• CS 231n: Deep Learning for Computer Vision
https://cs231n.stanford.edu/schedule.html
• CS 376: Computer Vision
http://vision.cs.utexas.edu/376-
spring2018/#Syllabus
• 16-385: Computer Vision
http://www.cs.cmu.edu/~16385/

123
