Lec14 CNNRNNModels

The document provides an introduction to neural networks, focusing on training methodologies such as feedforward networks, backpropagation, and various gradient descent techniques. It discusses the architecture of convolutional neural networks (CNNs) and their applications in image processing, as well as the use of recurrent neural networks (RNNs) for sequential prediction tasks. Additionally, it covers autoencoders for unsupervised learning and dimensionality reduction.

INTRODUCTION TO MACHINE LEARNING

Neural Networks II

Giovanni Iacca

(credits: Elisa Ricci)


Training a Neural Network
Feedforward networks
The function f is a composition of multiple functions, e.g., $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$ for a network with three layers.
Feedforward networks
● Goal: Approximate some unknown ideal function
● Feedforward Network:
○ Define parametric mapping $f(x_i; \Theta)$
○ Learn the parameters $\Theta$ to get a good approximation of the ideal function from the available samples
● The computation can be described by a Directed Acyclic Graph (DAG)
○ Information flow in function evaluation begins at input, and then flows through
intermediate computations to produce the output
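The composition described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the layer sizes, the ReLU activation, and the helper names `make_layer`, `f1`, `f2`, `f3` are assumptions, not part of the lecture.

```python
import numpy as np

def relu(a):
    # Elementwise activation function
    return np.maximum(a, 0.0)

def make_layer(W, b, activation):
    # One layer: an affine map followed by an activation
    return lambda x: activation(W @ x + b)

rng = np.random.default_rng(0)
# Theta collects the weights and biases of every layer (sizes chosen arbitrarily)
f1 = make_layer(rng.normal(size=(4, 3)), np.zeros(4), relu)
f2 = make_layer(rng.normal(size=(4, 4)), np.zeros(4), relu)
f3 = make_layer(rng.normal(size=(2, 4)), np.zeros(2), lambda a: a)  # linear output

def f(x):
    # Information flows input -> f1 -> f2 -> f3 -> output (a DAG, no cycles)
    return f3(f2(f1(x)))

x = np.array([1.0, -2.0, 0.5])
y_hat = f(x)
```

Each layer only consumes the output of the previous one, which is what makes the computation graph acyclic.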
Modeling Choices
● Need to choose:
○ Cost function
○ Form of output
○ Activation functions
○ Architecture (number of layers etc)
○ Optimizer (for training)
Training a Neural Network
● Learning = Optimization
● Main idea: Given training samples $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, adjust all
the weights of the network $\Theta$ such that a cost function is minimized

$$\min_{\Theta} \sum_{i=1}^{N} L(y_i, f(x_i; \Theta))$$

● Choose your loss function (e.g., square loss, cross-entropy loss, etc.)
● Update the weights of each layer with gradient descent
● Use backpropagation to compute the gradient efficiently
So Far: Backpropagation
1. Forward propagation: sum inputs, produce activations, feed-forward
2. Error estimation: compare the output $f(x_i; \Theta)$ with the target $y_i$
3. Backpropagate the error signal and use it to update the weights
Gradient Descent
Feedforward neural networks can be trained with vanilla gradient descent.

Gradient descent update rule: $w \leftarrow w - \eta \nabla_w L(w)$, where $\eta$ is the learning rate.
Gradient
● The gradient is the vector of partial derivatives of the loss
with respect to all the coordinates of the weights: $\nabla_w L = \left( \frac{\partial L}{\partial w_1}, \ldots, \frac{\partial L}{\partial w_n} \right)$
● Each partial derivative measures how fast the loss changes in one direction.
● When the gradient is zero, i.e., all the partial
derivatives are zero, the loss is not changing in
any direction.
● Issues: local minima, saddle points
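To make the definition concrete, the partial derivatives can be approximated numerically, one coordinate at a time. The sketch below assumes a toy quadratic loss and a hypothetical helper `numerical_gradient`; neither comes from the lecture.

```python
import numpy as np

def loss(w):
    # Toy quadratic loss with its minimum at w = (1, -2)
    return (w[0] - 1.0) ** 2 + 3.0 * (w[1] + 2.0) ** 2

def numerical_gradient(fn, w, h=1e-6):
    # Central finite differences: one partial derivative per coordinate
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = h
        grad[j] = (fn(w + step) - fn(w - step)) / (2.0 * h)
    return grad

g_start = numerical_gradient(loss, np.array([0.0, 0.0]))   # approximately (-2, 12)
g_min = numerical_gradient(loss, np.array([1.0, -2.0]))    # approximately (0, 0)
```

At the minimum every partial derivative vanishes, so the loss is locally flat in all directions. The same is true at saddle points, which is why a zero gradient alone does not guarantee a minimum.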
Gradient Descent
● Gradient descent searches for the set of parameters that makes the loss as small as
possible
● The change of parameters depends on the gradients of the loss with respect to the
network weights
● Backpropagation is a method for computing gradients
● What we will see now: Stochastic Gradient Descent (SGD) and other
optimization methods
Batch Gradient Descent (BGD)
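Input: learning rate $\eta$, initial parameters w

while stopping criterion not met do
    Compute gradient estimate over all N examples: $g \leftarrow \frac{1}{N} \nabla_w \sum_{i=1}^{N} L(y_i, f(x_i; w))$
    Apply update: $w \leftarrow w - \eta g$
end while

The learning rate can change at each step; typically it is decayed linearly.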
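The loop above could look as follows for a linear model with square loss. This is a minimal sketch under stated assumptions: the synthetic data, the fixed step budget used as stopping criterion, and the linear decay schedule (`lr0`, `steps`) are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 2))
w_true = np.array([2.0, -1.0])
y = X @ w_true                      # noiseless targets, for illustration only

def batch_gradient(w):
    # Gradient of the mean square loss over all N examples
    return 2.0 / N * X.T @ (X @ w - y)

w = np.zeros(2)                     # initial parameters
lr0, steps = 0.1, 200               # hypothetical learning-rate schedule
for k in range(steps):              # stopping criterion: fixed number of steps
    lr = lr0 * (1.0 - 0.9 * k / steps)   # linear decay towards 0.1 * lr0
    w = w - lr * batch_gradient(w)  # one update uses the whole training set
```

Every update touches all N examples, which is what makes the estimate stable but each step expensive.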


Batch Gradient Descent

● Pros: Gradient estimates are stable


● Cons: Need to compute gradients over the entire training dataset for one update
Stochastic gradient descent (SGD)
Input: learning rate $\eta$, initial parameters w

while stopping criterion not met do
    Sample one datapoint $(x_i, y_i)$ from the training set
    Compute gradient estimate: $g \leftarrow \nabla_w L(y_i, f(x_i; w))$
    Apply update: $w \leftarrow w - \eta g$
end while

The learning rate can change at each step; typically it is decayed linearly.
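A possible NumPy version of this loop, again for a linear model with square loss. The data, the step budget, and the decay schedule are assumptions made for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 2))
w_true = np.array([2.0, -1.0])
y = X @ w_true                                # noiseless targets, for illustration only

w = np.zeros(2)                               # initial parameters
lr0, steps = 0.02, 2000                       # hypothetical schedule
for k in range(steps):
    i = rng.integers(N)                       # sample one datapoint
    g = 2.0 * (X[i] @ w - y[i]) * X[i]        # gradient of the square loss on it
    lr = lr0 * (1.0 - 0.9 * k / steps)        # linear decay
    w = w - lr * g                            # cheap but noisy update
```

Each update costs the same regardless of N, at the price of a noisier gradient estimate.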


BGD vs. SGD
[Figure: comparison of BGD and SGD]
MiniBatches
● Problem: SGD gradient estimates can be very noisy
● One obvious solution is to use mini-batches (small sets of samples)
● Advantages:
○ Computation time per update does not depend on the number of training examples N
○ It permits computation on extremely large datasets
○ It often allows a parallel implementation
○ With GPUs, it is common to use power-of-2 batch sizes, since some kinds of
hardware achieve better runtime with specific array sizes
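As a sketch of the idea, the single sampled datapoint of SGD is replaced by a small random subset. The batch size of 32, the learning rate, and the synthetic data below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, batch_size = 1000, 32            # power-of-2 batch size
X = rng.normal(size=(N, 2))
w_true = np.array([2.0, -1.0])
y = X @ w_true                      # noiseless targets, for illustration only

w = np.zeros(2)
lr = 0.05
for step in range(500):
    idx = rng.choice(N, size=batch_size, replace=False)   # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    # Averaging over the batch reduces noise; the cost depends on batch_size, not on N
    g = 2.0 / batch_size * Xb.T @ (Xb @ w - yb)
    w = w - lr * g
```

The batch gradient is a single matrix product, which is the part that parallelizes well on GPUs.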
Momentum
Problem with SGD: on some error surfaces, progress is very slow along flat directions,
while the updates jitter along steep ones!
Momentum
Introduce a new variable v: the velocity
The velocity is an exponentially decaying moving average of the negative gradient

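Input: learning rate $\eta$, initial parameters w, initial velocity v, momentum parameter $\alpha$

while stopping criterion not met do
    Sample one datapoint $(x_i, y_i)$ from the training set
    Compute gradient estimate: $g \leftarrow \nabla_w L(y_i, f(x_i; w))$
    Compute velocity update: $v \leftarrow \alpha v - \eta g$
    Apply update: $w \leftarrow w + v$
end while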
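A minimal sketch of SGD with momentum on a linear model with square loss. The values of the learning rate and of the momentum parameter (`lr`, `alpha`), as well as the data, are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 2))
w_true = np.array([2.0, -1.0])
y = X @ w_true                  # noiseless targets, for illustration only

w = np.zeros(2)                 # initial parameters
v = np.zeros(2)                 # initial velocity
lr, alpha = 0.005, 0.9          # learning rate and momentum parameter (assumed values)
for step in range(2000):
    i = rng.integers(N)                     # sample one datapoint
    g = 2.0 * (X[i] @ w - y[i]) * X[i]      # gradient estimate
    v = alpha * v - lr * g                  # decaying average of negative gradients
    w = w + v                               # move along the velocity
```

Because the velocity accumulates past gradients, consistent directions are reinforced while components that flip sign from step to step tend to cancel out.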
Adaptive Learning Rate Methods
● So far we have assigned the same learning rate to all features
● If the features vary in importance and frequency, is this a good idea?
● The learning rate is one of the most difficult hyperparameters to set in neural networks

[Figure: an easier case, where all the features are important, vs. a harder case]


Different Methods

http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
Convolutional Neural Networks
Structured Data
● Some applications naturally deal with an input space which is locally structured,
i.e., spatial or temporal.
● Images, language, etc. vs. arbitrary input features.
● Neural networks are extremely powerful in this case.
From Pixels to Labels
● Learn a hierarchy of features
● Each layer of hierarchy extracts features from output of previous layer
● Train all layers jointly

[Figure: Layer1 → Layer2 → Layer3 → Simple Classifier → "Female"]
In Convolutional Neural Networks…

[Figure: Layer1 → Layer2 → Layer3 → "Female"]


Convolutional Neural Networks
Convolutional networks are simply neural networks that use convolution in place of
general matrix multiplication in at least one of their layers.
What is a convolution?
Recap: Convolution
● Convolution is a general purpose filter operation for images.
● A kernel matrix is applied to an image.
● It works by determining the value of a central pixel by adding the weighted
values of all its neighbors together.
● The output is a new modified filtered image.

● Can be used to smooth, sharpen, enhance…


● It is a commutative operation.
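The operation recapped above can be written directly with nested loops. The sketch below assumes "valid" borders (no padding) and uses a 3x3 box-blur kernel as the smoothing example; the function name `convolve2d` is hypothetical.

```python
import numpy as np

def convolve2d(image, kernel):
    # "Valid" 2D convolution: flip the kernel, then slide it over the image
    k = np.flip(kernel)                      # flip both axes (true convolution)
    kh, kw = k.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # Weighted sum of the pixel neighborhood under the kernel
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * k)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
box_blur = np.full((3, 3), 1.0 / 9.0)        # smoothing kernel
smoothed = convolve2d(image, box_blur)       # 3x3 output of local averages
```

Without the kernel flip the same loop computes a cross-correlation, which is what many deep learning libraries implement under the name "convolution"; for learned filters the difference is immaterial.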
Convolutional Neural Networks
Inspired by mammalian visual cortex.

https://neurdiness.wordpress.com/2018/05/17/deep-convolutional-neural-networks-as-models-of-the-visual-system-qa/
Visual Cortex
● The visual cortex contains a complex arrangement of cells, each sensitive to a
small sub-region of the visual field, called its receptive field.
● These cells act as local filters over the input space and are well-suited to exploit
the strong spatially local correlation present in natural images.
● Two basic cell types:
○ Simple cells respond maximally to specific edge-like patterns within their receptive field.
○ Complex cells have larger receptive fields and are locally invariant to the exact position of the
pattern.
CNN: Architecture
● Feedforward neural network with specialized connectivity structure
● Typically CNN layers transform the input matrix into an output class
prediction.
● There are a few distinct types of operations:
○ Convolution
○ Non-linearity
○ Pooling
CNN Motif
[Figure: Convolution (learned) → Nonlinearity → Spatial pooling, mapping an input to a feature/activation map]

Convolution
● Convolutional layer : core layer of CNNs.
● Consists of a set of learned filters.
● Each filter covers a spatially small portion of the input data (receptive field).
● Each filter is convolved across the dimensions of the input data, producing a multi-
dimensional feature map.
● Intuition: the network will learn filters that activate when they see some specific
type of feature at some spatial position in the input.
CNN: Architecture
Convolution (learned) → Nonlinearity → Spatial pooling

Nonlinearity: applied elementwise. It increases the nonlinearity of the entire architecture without affecting the receptive fields
of the convolution layer.
CNN: Architecture
Convolution (learned) → Nonlinearity → Spatial pooling

Pooling: provides invariance to small translations

Pooling
By progressively reducing the spatial size of the representation, we reduce the number
of parameters and the amount of computation in the network, and also control overfitting.
Example: max pooling
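A small sketch of 2x2 max pooling on a single feature map. It assumes non-overlapping windows and a height and width divisible by the window size; `max_pool2d` is a hypothetical helper name.

```python
import numpy as np

def max_pool2d(x, size=2):
    # Non-overlapping max pooling; assumes height and width are multiples of `size`
    h, w = x.shape
    blocks = x.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1., 3., 2., 0.],
                        [4., 2., 1., 1.],
                        [0., 0., 5., 6.],
                        [1., 2., 7., 8.]])
pooled = max_pool2d(feature_map)   # [[4., 2.], [2., 8.]]
```

The 4x4 map shrinks to 2x2, and moving the maximum within a window would leave the output unchanged, which is the source of the local translation invariance.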
Convolutional Neural Network Architectures
LeNet - 1998

[LeCun, Bottou, Bengio, Haffner 1998]


AlexNet - 2012
● Similar framework to LeNet but…
● Bigger model (7 hidden layers, 650K units, 60M params)
● More data ($10^6$ vs. $10^3$ images)
● GPU implementation (50x speedup over CPU) - Trained on two GPUs for a week

A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
Going Deeper
Classification: ImageNet Challenge top-5 error
VGG - 2014
Similar motif to AlexNet
GoogLeNet
● Has 12x fewer parameters than AlexNet
● Gets rid of fully connected layers
● Inception Module
ResNet
● Residual Block: improved performance of very
deep nets
● Solves the degradation problem by enabling the
deeper layers to receive information
from the shallow layers directly through identity
mappings.
● Uses batch normalization to improve
training
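The identity mapping can be illustrated with a simplified residual block. The sketch below uses fully connected weight layers and omits batch normalization, so it is an assumption-laden simplification of the convolutional block used in ResNet.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def residual_block(x, W1, W2):
    # F(x): two weight layers with a ReLU in between (normalization omitted)
    fx = W2 @ relu(W1 @ x)
    # Identity shortcut: the input is added back to the block's output
    return relu(fx + x)

x = np.array([1.0, 2.0, 3.0])
zero = np.zeros((3, 3))
out = residual_block(x, zero, zero)   # with F(x) = 0 the block passes x through
```

Since a block with zero weights already passes its non-negative input through unchanged, adding more blocks does not have to make the network worse, which is the intuition behind addressing the degradation problem.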
Beyond Classification
Detection
● First approach: R-CNN (Regions with CNN features)
● Trained on ImageNet classification
● Fine-tune CNN on PASCAL-VOC
● Nowadays more sophisticated methods exist

[Girshick et al. CVPR 2014]


Beyond Classification
● Semantic Segmentation

[Long et al. CVPR 2015]


Beyond Classification
● Structured Regression

[Toshev and Szegedy CVPR 2014]


CNN: SUMMARY
● In a feedforward neural network, units are organized into layers and the units at
a given layer only get input from units in the layer below.
● CNNs are feedforward networks. However, unlike standard vanilla feedforward
networks, units in a CNN have a spatial arrangement.
● At each layer, units are organized into 2D grids, the feature maps.
● Each feature map is the result of a convolution. The same convolutional filter is
applied at each location. The weights are different across feature maps.
● A unit at a particular location on the 2D grid can only receive input from units at a
similar location at the layer below.
● Need (a lot of) labeled data: supervised learning model!
● Flexible to many applications.
Other Neural Networks
Many Models for different needs

https://www.asimovinstitute.org/neural-network-zoo/
Sequential Prediction Tasks
● So far, we focused mainly on prediction problems with fixed-sized inputs and outputs.
● We discussed the flexibility of CNNs to address a wide range of tasks.
● But what if the input and/or output is a variable-length sequence ?
● There are many applications where we need this, e.g., document classification, sentiment analysis, image captioning


Example: Video Frame Prediction
What is new?
● Single2Single: Feedforward Network
● Multiple2Multiple: Recurrent Network
Recurrent Neural Network (RNN)
RNNs can address a wide range of tasks:
● Multiple2Single: Sentiment Analysis
● Single2Multiple: Image Captioning
● Multiple2Multiple: Machine Translation
Recurrent Neural Network (RNN)
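● Introduces cycles, recurrences
● Input at time t: $x_t$
● Hidden representation at time t: $h_t = f_W(x_t, h_{t-1})$, where $h_t$ is the new state, $f_W$ is a function with parameters W (the hidden layer), $x_t$ is the input at time t, and $h_{t-1}$ is the old state
● Output at time t: $y_t$, computed from $h_t$ by a classifier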
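One possible instance of this recurrence is sketched below. The tanh nonlinearity, the linear classifier, the dimensions, and the weight names `W_xh`, `W_hh`, `W_hy` are assumptions; the lecture only specifies that the new state is a function of the input and the old state.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, num_classes = 3, 4, 2      # arbitrary sizes
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.5, size=(num_classes, hidden_dim))

def rnn_step(x_t, h_prev):
    # New state = function with weights W of (input at time t, old state)
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
    y_t = W_hy @ h_t                              # classifier scores at time t
    return h_t, y_t

h = np.zeros(hidden_dim)                          # initial state h0
outputs = []
for x_t in rng.normal(size=(5, input_dim)):       # a sequence of length 5
    h, y_t = rnn_step(x_t, h)                     # same weights reused at every step
    outputs.append(y_t)
```

The loop works for a sequence of any length because the same weights are applied at every time step.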
Recurrent Neural Network (RNN)
RNNs can be trained with backpropagation

[Figure: the RNN unrolled over time steps t=1, 2, 3. At each step the hidden layer takes the input xt and the previous state (starting from h0) and produces ht; a classifier maps each ht to the output yt.]
Unsupervised Learning: Autoencoders

https://www.asimovinstitute.org/neural-network-zoo/
Autoencoders: Dimensionality Reduction
● Unsupervised approach for learning a lower-dimensional feature representation
from unlabeled training data
● Features should capture meaningful factors of variation in the data: the features z usually have lower dimension
than the input x

[Figure: Input data x → Encoder → Features z]
Autoencoders

Encoder:
● Originally: Linear + nonlinearity (sigmoid)
● Later: Deep, fully-connected
● Later: CNN with ReLU

[Figure: Input data → Encoder → Features]
Autoencoders
How to learn this feature representation?
● Train such that features can be used to reconstruct original data
● “Autoencoding” - encoding itself

[Figure: Input data → Encoder → Features → Decoder → Reconstructed input data]
Autoencoders
● Originally: Linear + nonlinearity (sigmoid)
● Later: Deep, fully-connected
● Later: ReLU CNN (upconv)

[Figure: Input data → Encoder → Features → Decoder → Reconstructed input data]
Autoencoders
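Doesn’t use labels!

L2 loss function: $\lVert x - \hat{x} \rVert^2$, where $\hat{x}$ is the reconstructed input data

Encoder: 4-layer conv
Decoder: 4-layer upconv

[Figure: Input data → Encoder → Features → Decoder → Reconstructed input data, with an example of reconstructed data]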
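Training with the reconstruction loss can be sketched on a toy problem. Note the simplification: the sketch uses a purely linear encoder and decoder trained by gradient descent, not the 4-layer conv/upconv networks mentioned above, and the data, sizes, and learning rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 200, 5, 2
# Toy data lying in a k-dimensional subspace of R^d
X = rng.normal(size=(N, k)) @ rng.normal(size=(k, d))

W_enc = rng.normal(scale=0.1, size=(d, k))   # encoder: x -> z
W_dec = rng.normal(scale=0.1, size=(k, d))   # decoder: z -> x_hat

def l2_loss(W_enc, W_dec):
    # Mean squared L2 distance between inputs and their reconstructions
    X_hat = (X @ W_enc) @ W_dec
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

initial_loss = l2_loss(W_enc, W_dec)
lr = 0.01
for step in range(2000):
    Z = X @ W_enc                     # features
    R = Z @ W_dec - X                 # reconstruction error (no labels involved)
    grad_dec = 2.0 / N * Z.T @ R
    grad_enc = 2.0 / N * X.T @ (R @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
final_loss = l2_loss(W_enc, W_dec)
```

The only supervision signal is the input itself: the features `Z` are useful to the extent that the decoder can rebuild `X` from them.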
Autoencoders
● After training, we throw away the decoder
● The encoder can be used to initialize a supervised model

[Figure: Input data → Encoder → Features; the decoder and the reconstructed input data are discarded]
Autoencoders
● After training, we throw away the decoder
● The encoder can be used to initialize a supervised model
● Fine-tune the encoder jointly with the classifier

[Figure: Input data → Encoder → Features → Classifier → Predicted label → Loss function]
QUESTIONS?
