Artificial Neural Network and Deep Learning
Lecture 7
Convolutional Neural Networks (CNN)
Agenda
• Convolution Layer
• ReLU
• Pooling Layers
• Fully Connected layer & Classification
• Training
• Dropout
• Neural Networks in Practice: Mini-batches
• Batch Norm layer
CNN
• A CNN consists of an input and an output layer, as well as multiple hidden layers. The hidden layers of a CNN typically consist of convolution layers, pooling layers, fully connected layers and normalization layers.
• Before: all neural net activations were arranged as flat 1D vectors. Now: activations are arranged in 3 dimensions.
• For example, a CIFAR-10 image is a 32x32x3 volume: 32 width, 32 height, 3 depth (RGB channels).
Convolution Layer
• Connect neurons only to local receptive fields.
• The filter must have the same depth as the input.
• Image: 32x32x3 volume.
  Before (full connectivity): each neuron has 32x32x3 weights.
  Now: one neuron connects to, e.g., a 5x5x3 chunk and has only 5x5x3 weights.
• Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products".
• Each position gives 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product + bias).
Convolution Layer
Input volume: 32x32x3, filter 5x5x3, stride 1
Output size: (N - F) / stride + 1 = (32 - 5) / 1 + 1 = 28
• Produces a new mapping of the image named an activation map or feature map, which is a 28x28 sheet of neuron outputs:
  1. Each is connected to a small region in the input.
  2. All of them share parameters.
• "5x5 filter" -> "5x5 receptive field for each neuron"
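For concreteness, here is a minimal NumPy sketch of the operation described above: sliding a single 5x5x3 filter over a 32x32x3 image with stride 1 and taking a 75-dimensional dot product plus bias at each position, which yields a 28x28 activation map. The array values are random placeholders, not lecture data.

import numpy as np

image = np.random.randn(32, 32, 3)   # e.g. one CIFAR-10 sized image (random placeholder values)
filt = np.random.randn(5, 5, 3)      # one 5x5x3 filter
bias = 0.1                           # illustrative bias value

activation_map = np.zeros((28, 28))  # (32 - 5) / 1 + 1 = 28
for i in range(28):
    for j in range(28):
        chunk = image[i:i+5, j:j+5, :]                      # local 5x5x3 receptive field
        activation_map[i, j] = np.sum(chunk * filt) + bias  # 75-dim dot product + bias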
Convolution Layer
• Consider a second filter.
• For example, if we had 6 5x5 filters, we'll get 6 separate activation maps.
• E.g. with 5 filters, the CONV layer consists of neurons arranged in a 3D grid (28x28x5). There will be 5 different neurons all looking at the same region in the input volume.
• Each filter focuses on specific patterns in the image (e.g. vertical edges, horizontal edges, color, etc.) and produces a new mapping of the image named an activation map or feature map.
Preview: a ConvNet is a sequence of Convolution Layers, interspersed with activation functions.
• Output size after the first CONV layer: (32 - 5) / 1 + 1 = 28
• Output size after the second CONV layer: (28 - 5) / 1 + 1 = 24
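A short PyTorch sketch of such a stack (the filter counts here are illustrative, not specified on the slide):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)            # one 32x32x3 input
conv1 = nn.Conv2d(3, 6, kernel_size=5)   # 6 filters of size 5x5x3, stride 1, no padding
conv2 = nn.Conv2d(6, 10, kernel_size=5)  # 10 filters of size 5x5x6

h = torch.relu(conv1(x))                 # -> [1, 6, 28, 28]
y = torch.relu(conv2(h))                 # -> [1, 10, 24, 24]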
Higher-level features
• In general, the more convolution steps we have,
  the more complicated features our network will
  be able to learn to recognize.
• In Image Classification, a ConvNet may learn to
  detect edges from raw pixels in the first layer,
  then use the edges to detect simple shapes in
  the second layer, and then use these shapes to
  detect higher-level features, such as facial
  shapes in higher layers.
Example 1
Input volume: 32x32x3
Receptive fields FxF: 5x5, stride 1
Number of filters: 10
• Output size: (N - F) / stride + 1
Output volume size: (32 - 5) / 1 + 1 = 28, so: 28x28x10
• Number of parameters in this layer?
Now (CNN, parameter sharing): each filter has 5*5*3 + 1 = 76 params (+1 for bias)
76*10 = 760
Before (no parameter sharing): each of the 28*28*10 = 7,840 neurons would carry its own 76 parameters, i.e. 7,840 * 76 ≈ 600,000 weights :\
Example 1, cont.
Input volume: 32x32x3
Receptive fields FxF: 5x5, stride 2
Number of filters: 10
Output volume: ? Cannot be formed: (32 - 5)/2 + 1 = 14.5 is not an integer, so a 5x5 filter with stride 2 does not fit this input :\
Example 1, cont.
Input volume: 32x32x3
Receptive fields FxF: 5x5, stride 3
Number of filters: 10
• Output volume size: ? (32 - 5) / 3 + 1 = 10, so: 10x10x10
• Number of parameters in this layer?
each filter has 5*5*3 + 1 = 76 params    (+1 for bias)
76*10 = 760
(unchanged)
Example 2
Assume a 32x32x3 input image.
If we had 30 filters with receptive fields 5x5, applied with stride 1 and pad 2:
=> output volume: [32x32x30] (32*32*30 = 30,720 neurons)
Each filter has 5*5*3 = 75 weights plus 1 bias (76 parameters).
=> Number of parameters in the layer: (30 * 75) + 30 = 2,280 (the +30 is the biases, one for each filter).
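A quick PyTorch check of this parameter count (the layer definition mirrors the example above):

import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=30, kernel_size=5, stride=1, padding=2)
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)  # 30*5*5*3 + 30 = 2280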
Output volume size
Input volume of size [W1 x H1 x D1],
using K filters with receptive fields F x F and applying them at stride S, gives
Output volume size: [W2 x H2 x D2], where
       W2 = (W1 - F)/S + 1
       H2 = (H1 - F)/S + 1
       D2 = K
F*F*D1 weights per filter, for a total of (F*F*D1)*K weights and K biases.
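These formulas are easy to wrap in a small helper (a sketch; the function name and the padding argument are my own additions):

def conv_output_size(n, f, stride, pad=0):
    """Spatial output size for an n x n input and an f x f filter."""
    return (n + 2 * pad - f) // stride + 1

print(conv_output_size(32, 5, 1))         # 28  (Example 1, stride 1)
print(conv_output_size(32, 5, 3))         # 10  (Example 1, stride 3)
print(conv_output_size(32, 5, 1, pad=2))  # 32  (Example 2, stride 1, pad 2)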
ReLU
• It's common to apply a linear rectification nonlinearity: y_i = max(z_i, 0).
Why might we do this?
• Convolution is a linear operation.
• Therefore, we need a non-linearity; otherwise 2 convolution layers would be no more powerful than 1.
• ReLU is used after every convolution operation.
What are the advantages of ReLU over the sigmoid function in deep neural networks?
ReLU is h(a) = max(0, a), where a = Wx + b.
1. ReLU is more computationally efficient than the sigmoid, since ReLU just needs to pick max(0, a) and does not perform the expensive exponential operations of sigmoids.
2. In practice, networks with ReLU tend to show better convergence performance than sigmoid.
3. Sigmoid: tends to vanish the gradient (there is a mechanism that reduces the gradient as the sigmoid input u increases). The gradient of the sigmoid is S′(u) = S(u)(1 − S(u)); when u grows infinitely large, S′(u) = S(u)(1 − S(u)) = 1 × (1 − 1) = 0.
   ReLU: no vanishing gradient (reduced likelihood of a vanishing gradient).
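A quick numeric illustration of point 3 (the sample inputs are arbitrary): the sigmoid gradient S′(u) = S(u)(1 − S(u)) shrinks toward 0 as u grows, while the ReLU gradient stays 1 for any positive input.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

for u in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(u)
    sigmoid_grad = s * (1 - s)         # vanishes: 0.25, 0.105, 0.0066, 4.5e-05
    relu_grad = 1.0 if u > 0 else 0.0  # stays 1 for positive inputs
    print(u, sigmoid_grad, relu_grad)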
• Advantage:
  • Sigmoid: does not blow up the activation.
  • ReLU: no vanishing gradient.
  • ReLU: more computationally efficient than the sigmoid, since ReLU just needs to pick max(0, x) and does not perform the expensive exponential operations of sigmoids.
  • ReLU: in practice, networks with ReLU tend to show better convergence performance than sigmoid. (Krizhevsky et al.)
• Disadvantage:
  • Sigmoid: tends to vanish the gradient.
  • ReLU: tends to blow up the activation (there is no mechanism to constrain the output of the neuron, as "a" itself is the output).
  • ReLU: dying ReLU problem - if too many activations fall below zero, then most of the units (neurons) in a network with ReLU will simply output zero, in other words, die, thereby prohibiting learning. (This can be handled, to some extent, by using Leaky-ReLU instead.)
Pooling Layers
                     • A pooling layer is another building block of a CNN.
                     • These layers reduce the spatial dimensionality of each feature
                       map (but not depth) (reduce the amount of parameters and
                       computation in the network) and build in invariance to small
                       transformations.
MAX Pooling
• Pooling retains the most important information.
• Spatial Pooling can be of different types: Max, Average, Sum, L2 norm, Weighted
  average based on the distance from the central pixel, etc.
• The most common type of pooling is the max-pooling layer, which slides a window over the feature map, like a normal convolution, and takes the biggest value in the window as the output.
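A tiny NumPy example of 2x2 max pooling with stride 2 on one 4x4 feature map (the values are made up for illustration):

import numpy as np

fmap = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 0],
                 [3, 1, 1, 2],
                 [0, 2, 4, 8]])

# Split into non-overlapping 2x2 windows and keep the maximum of each.
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 5]
#  [3 8]]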
Pooling layer
• Pooling operation is applied separately to each feature map.
Input volume of size [W1 x H1 x D1]
Pooling unit receptive fields F x F and applying them at strides of S gives:
       Output volume: [W2, H2, D1]
               W2 = (W1-F)/S+1
               H2 = (H1-F)/S+1
Advantage of Pooling layer
• Makes the input representations (feature dimension) smaller and more
  manageable.
• Reduces the number of parameters and computations in the network, therefore,
  controlling Overfitting.
• Makes the network invariant to small transformations, distortions and
  translations in the input image (a small distortion in input will not change the
  output of Pooling – since we take the maximum / average value in a local
  neighborhood).
Fully Connected layer & Classification
• The Fully Connected layer is a traditional Multilayer Perceptron (MLP) that uses a Softmax activation function in the output layer.
• The output from the convolutional and pooling layers represents high-level features of the input image. The purpose of the Fully Connected layer is to use these features for classifying the input image into various classes based on the training dataset.
• Apart from classification, adding a fully-connected layer is also a (usually) cheap way of learning non-linear combinations of these features. Most of the features from the convolutional and pooling layers may be good for the classification task, but combinations of those features might be even better.
Training
• Step 1: We initialize all filters and parameters (weights) with random values.
• Step 2: Compute the convolution, ReLU and pooling operations along with forward propagation in the fully connected layer, and find the output probabilities for each class.
• Step 3: Calculate the total error at the output layer (summation over all 4 classes in the example): Total Error = ∑ ½ (target probability – output probability)²
• Step 4: Use Backpropagation to calculate the gradients and update the weights.
• Step 5: Repeat Step 2 to Step 4 with all images in the training set.
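Below is a minimal PyTorch sketch of Steps 1-5, assuming a toy 4-class setup; the architecture, the dummy data loader and the learning rate are illustrative placeholders, and the slide's squared-error loss is written out explicitly.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Step 1: filters and weights are initialized randomly when the layers are created.
model = nn.Sequential(
    nn.Conv2d(3, 10, 5), nn.ReLU(), nn.MaxPool2d(2, 2),  # conv -> ReLU -> pool
    nn.Flatten(), nn.Linear(10 * 14 * 14, 4)             # fully connected, 4 classes
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Dummy data standing in for the training set: 8 random 32x32x3 images with labels.
loader = [(torch.randn(8, 3, 32, 32), torch.randint(0, 4, (8,)))]

for images, labels in loader:                          # Step 5: loop over all training images
    probs = torch.softmax(model(images), dim=1)        # Step 2: forward pass -> class probabilities
    target = F.one_hot(labels, num_classes=4).float()  # target probability per class
    loss = 0.5 * ((target - probs) ** 2).sum()         # Step 3: Total Error = sum 1/2 (target - output)^2
    optimizer.zero_grad()
    loss.backward()                                    # Step 4: backpropagation computes gradients
    optimizer.step()                                   #         and the weights are updated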
Training
• Note:
Parameters like
       the number of filters,
       filter sizes,
       the architecture of the network, etc.
have all been fixed before Step 1 and do not change during the training process.
Only the
       values of the filter matrices and connection weights
get updated.
 Test
• When a new (unseen) image is input into the ConvNet, the network
  would go through the forward propagation step and output a
  probability for each class (for a new image, the output probabilities
  are calculated using the weights which have been optimized to
  correctly classify all the previous training examples).
• If our training set is large enough, the network will
  (hopefully) generalize well to new images and classify them into
  correct categories.
Typical ConvNets look like:
[CONV-RELU-POOL]xN,[FC-RELU]xM,FC,SOFTMAX
or
[CONV-RELU-CONV-RELU-POOL]xN,[FC-RELU]xM,FC,SOFTMAX
N >= 0, M >=0
Note:
(last FC layer should not have RELU - these are the class scores)
CIFAR-10 example
input: [32x32x3]
CONV with 10 3x3 filters, stride 1, pad 1: gives [32x32x10], new parameters: (3*3*3)*10 + 10 = 280
RELU
CONV with 10 3x3 filters, stride 1, pad 1: gives [32x32x10], new parameters: (3*3*10)*10 + 10 = 910
RELU
POOL with 2x2 filters, stride 2: gives [16x16x10], parameters: 0
CONV with 10 3x3 filters, stride 1, pad 1: gives [16x16x10], new parameters: (3*3*10)*10 + 10 = 910
RELU
CONV with 10 3x3 filters, stride 1, pad 1: gives [16x16x10], new parameters: (3*3*10)*10 + 10 = 910
RELU
POOL with 2x2 filters, stride 2: gives [8x8x10], parameters: 0
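The same architecture written as a PyTorch sketch, confirming the shapes and parameter counts listed above:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 10, 3, stride=1, padding=1), nn.ReLU(),   # [32x32x10], 280 params
    nn.Conv2d(10, 10, 3, stride=1, padding=1), nn.ReLU(),  # [32x32x10], 910 params
    nn.MaxPool2d(2, 2),                                     # [16x16x10], 0 params
    nn.Conv2d(10, 10, 3, stride=1, padding=1), nn.ReLU(),  # [16x16x10], 910 params
    nn.Conv2d(10, 10, 3, stride=1, padding=1), nn.ReLU(),  # [16x16x10], 910 params
    nn.MaxPool2d(2, 2),                                     # [8x8x10],   0 params
)
print(model(torch.randn(1, 3, 32, 32)).shape)      # torch.Size([1, 10, 8, 8])
print(sum(p.numel() for p in model.parameters()))  # 280 + 910*3 = 3010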
Neural Networks in Practice: Mini-batches
Cost function
• The error function (cost function) is minimized by moving from the current solution in the direction of the negative of the gradient.
• The cost function often decomposes as a sum of per-sample loss functions.
• As the training set size grows to billions, the time taken for a single gradient step becomes prohibitively long.
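In symbols (notation mine, not from the slide), with N training samples, per-sample loss L, parameters θ and learning rate η:

       J(θ) = (1/N) ∑ᵢ L(f(xᵢ; θ), yᵢ),        θ ← θ − η ∇θ J(θ)

so one full gradient step requires touching all N samples.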
[Figures: Gradient Descent and Stochastic Gradient Descent]
© Alexander Amini and Ava Soleimany, MIT 6.S191: Introduction to Deep Learning, IntroToDeepLearning.com
Mini-batches while training
• Mini-batch: Only use a small portion of the training set to compute
  the gradient.
• Common mini-batch sizes are ~ 100 examples.
• More accurate estimation of the gradient (than single-example SGD).
   • Smoother convergence.
   • Allows for larger learning rates.
• Mini-batches lead to fast training.
   • Can parallelize computation + achieve significant speed increases on GPUs.
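A minimal NumPy sketch of mini-batch SGD on a toy linear least-squares problem (the data, model and learning rate are illustrative placeholders): each step estimates the gradient from 100 randomly sampled examples instead of the full set.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))                     # toy training set
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=100_000)

w, lr, batch_size = np.zeros(10), 0.1, 100             # common mini-batch size ~100 examples

for step in range(1_000):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size              # gradient of the mini-batch MSE
    w -= lr * grad                                            # gradient step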
Batch Norm layer - Motivation
• The range of values of raw training data often varies widely.
• In general, gradient descent converges much faster with feature scaling than without it.
Internal covariate shift
 During training, the distribution of the outputs of layer ‘k-1’ keeps changing as its weights are updated. That is also the input for layer ‘k’. In other words, that layer receives input data that has a different distribution than before.
 It is now forced to learn to fit to this new input.
 As we can see, each layer ends up trying to learn from a constantly shifting input, thus taking longer to converge and slowing down the training.
Common normalizations
 Two methods are usually used for rescaling or normalizing data:
 1. Scaling all numeric variables to the range [0, 1]. One possible formula: x' = (x - min(x)) / (max(x) - min(x)).
 2. Transforming the data to have zero mean and unit variance: x' = (x - μ) / σ.
 • In the NN community this is called whitening.
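The same two rescalings on a toy feature column (the values are made up for illustration):

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

min_max = (x - x.min()) / (x.max() - x.min())  # 1. scale to the range [0, 1]
standard = (x - x.mean()) / x.std()            # 2. zero mean, unit variance
print(min_max, standard)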
Proposed Solution: Batch Normalization (BN)
• Batch Normalization (BN) is a normalization method/layer for neural networks.
• Batch Norm is a normalization technique applied between the layers of a Neural Network instead of to the raw data.
• Moreover, Batch Norm helps to stabilize these shifting distributions from one iteration to the next, and thus speeds up training.
• Batch Normalization is a process that normalizes each scalar feature independently, by making
   • it have a mean of zero and a variance of 1,
   • and then scaling and shifting the normalized value for each training mini-batch,
  thus reducing internal covariate shift by fixing the distribution of the layer inputs x as the training progresses.
Batch
[Figure: a batch of three examples processed in parallel:
 x¹, x², x³ → W¹ → z¹, z², z³ → Sigmoid → a¹, a², a³ → W² → …]
Batch normalization
[Figure: x¹, x², x³ → W¹ → z¹, z², z³, from which the batch statistics are computed:]
       μ = (1/3) ∑ᵢ zⁱ
       σ = sqrt( (1/3) ∑ᵢ (zⁱ − μ)² )
Note: batch normalization cannot be applied on a small batch.
μ and σ depend on zⁱ.

Batch normalization
[Figure: each zⁱ is normalized and then passed through the Sigmoid:]
       ẑⁱ = (zⁱ − μ) / σ,   then aⁱ = Sigmoid(ẑⁱ)
μ and σ depend on zⁱ. How to do backpropagation through them?
Batch normalization
• It is done along mini-batches instead of the full data set. It serves to speed up training and allows higher learning rates, making learning easier.
• Normally, large learning rates may increase the scale of layer parameters, which then amplify the gradient during backpropagation and lead to model explosion.
• However, with Batch Normalization, backpropagation through a layer is unaffected by the scale of its parameters.
• The output of the batch norm layer has γ and β parameters. Those parameters will be learned to best represent your activations. They allow a learnable scale and shift:   z̃ⁱ = γ ⊙ ẑⁱ + β
• γ scales and β shifts the normalized value, so they control the standard deviation and the mean of the output, respectively.
Batch normalization
[Figure: x¹, x², x³ → W¹ → z¹, z², z³ → normalize with μ, σ → ẑ¹, ẑ², ẑ³ → scale and shift with γ, β → z̃¹, z̃², z̃³]
       ẑⁱ = (zⁱ − μ) / σ
       z̃ⁱ = γ ⊙ ẑⁱ + β
μ and σ depend on zⁱ.
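A minimal NumPy sketch of this forward pass for one mini-batch (the small epsilon for numerical stability and the feature dimension are my additions; γ and β start at 1 and 0, as is standard):

import numpy as np

z = np.random.randn(3, 4)           # mini-batch of 3 examples (z¹, z², z³), 4 features each
gamma, beta, eps = np.ones(4), np.zeros(4), 1e-5

mu = z.mean(axis=0)                 # μ: mean over the mini-batch
sigma = z.std(axis=0)               # σ: standard deviation over the mini-batch
z_hat = (z - mu) / (sigma + eps)    # ẑⁱ = (zⁱ - μ) / σ
z_out = gamma * z_hat + beta        # z̃ⁱ = γ ⊙ ẑⁱ + β  (learnable scale and shift)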
The proposed solution: add an extra layer
                               A new layer is added so the gradient can “see” the
                               normalization and make adjustments if needed.
Where to use the Batch-Norm layer in CNN
• The batch norm layer is used after linear layers (i.e. FC, conv), and before the non-linear layers (ReLU).
• There are actually 2 batch norm implementations: one for FC layers and the other for conv layers.
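In PyTorch this placement looks like the sketch below (the layer sizes are illustrative); BatchNorm2d is the conv-layer variant and BatchNorm1d the FC-layer variant:

import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 10, 3, padding=1),
    nn.BatchNorm2d(10),            # batch norm after the conv layer, before the ReLU
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(10 * 32 * 32, 100),
    nn.BatchNorm1d(100),           # FC-layer batch norm, again before the ReLU
    nn.ReLU(),
)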
Batch normalization: Other benefits in practice
• BN reduces training times (because of less covariate shift and less exploding/vanishing gradients), and makes very deep nets trainable.
• BN reduces the demand for regularization (for generalization), e.g. dropout or L2 norm.
• BN enables training with saturating nonlinearities in deep networks, e.g. sigmoid (because the normalization prevents them from getting stuck in saturating ranges, e.g. very high/low values for sigmoid).