Unit – III
Syllabus
Convolution Neural Networks: Introduction, Convolution Operation, Motivation, Pooling,
Normalization, Applications in Computer Vision – ImageNet, Sequence Modelling – VGGNet,
LeNet, Recurrent Neural Networks, RNN topologies – Difficulty in Training RNN, Long Short
Term Memory, Bidirectional LSTMs, Bidirectional RNNs, Application case study – Handwritten
digits recognition using deep learning.
Convolution Operation :-
The convolution operation re-estimates a given input value as the weighted average of the inputs
around it. We assign weights to the neighboring values and take the weighted sum of those
neighbors to estimate the value of the current input/pixel.
For a 2D input, the classic example is an image, where we re-calculate the value of every pixel
by taking the weighted sum of the pixels (neighbors) around it. For example, let's say the input
image is as given below:
Input Image
Now, in this input image, we calculate the value of each and every pixel by considering the
weighted sum of the pixels around it.
Here we are calculating the value of the circled pixel by considering the 3 neighbors around it;
assume that the weights w1, w2, w3, w4 are associated with these 4 pixels (the pixel itself and its
3 neighbors) respectively.
Now, this matrix of weights is referred to as the Kernel or Filter. In the above case, we have
the kernel of size 2X2.
We compute the output (the re-estimated value of the current pixel) using the following formula:
output(i, j) = Σa Σb input(i + a, j + b) · kernel(a, b), with a running from 0 to m − 1 and b from 0 to n − 1
Here m refers to the number of rows of the kernel (which is 2 in this case) and n refers to the
number of columns (which is also 2 in this case).
Now we place the 2X2 filter over the first 2X2 portion of the image and take the weighted sum
and that would give the new value of the first pixel.
We map the 2X2 kernel/filter over the 2X2 portion of the input.
The output of this operation would be: (aw + bx + ey + fz)
Then we move the filter horizontally by one and place it over the next 2 X 2 portion of the input;
in this case the pixels of interest would be b, c, f, g, and computing the output using the same
technique we would get: (bw + cx + fy + gz)
And then again we move the kernel/filter by 1 in the horizontal direction and take the weighted
sum.
So, after this, the output from the first layer would look like:
Then we move the kernel down by 1 in the vertical direction, calculate the output, and again move
the kernel in the horizontal direction. In general we move the kernel like this: first, we start at the
top-left portion of the image and move the filter in the horizontal direction until we have covered
the row completely; then we move the filter down by one step in the vertical direction, again stride
it horizontally through the entire row, and continue like this. In essence, we move the kernel left
to right, top to bottom.
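To make the sliding and the weighted-sum computation concrete, here is a small NumPy sketch of this operation (top-left-anchored, no padding, stride 1); the function name and the example values are only for illustration and are not part of the original notes:

import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel left to right, top to bottom, and take the
    weighted sum of the input values it covers (no padding, stride 1)."""
    m, n = kernel.shape                       # m rows, n columns of the kernel
    H, W = image.shape
    out = np.zeros((H - m + 1, W - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + m, j:j + n] * kernel)
    return out

image = np.array([[1., 2., 3., 4.],
                  [5., 6., 7., 8.],
                  [9., 10., 11., 12.]])
kernel = np.array([[1., 2.],
                   [3., 4.]])                 # the weights w, x, y, z of a 2X2 kernel
print(conv2d_valid(image, kernel))            # a 2 x 3 output, smaller than the input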
Instead of considering pixels only in the forward direction, we can consider the previous
neighbors as well.
To include the previous neighbors, the formula for computing the output becomes:
output(i, j) = Σa Σb input(i + a, j + b) · kernel(a, b), with a running from −m/2 to m/2 and b from −n/2 to n/2
That is, we take half of the rows from the previous neighbors and the other half from the forward
direction (forward neighbors), and the same applies in the other direction (−n/2 to n/2).
Typically, we use an odd-dimensional kernel, so that the current pixel sits exactly at the center.
Convolutional Operation in practice
Let the input image be as given below:
and we use kernel/filter of size 3X3 and for each pixel, we take the 3 X 3 neighborhood around
it(pixel itself is a part of this 3 X 3 neighborhood and would be at the center) just like in the
below image:
Input Image, we consider 3X3 portions of this image as the kernel is of size 3X3
Let’s say this input is a 30X30 image, we go over every pixel systematically, place the filter such
that the pixel is at the center of the kernel and re-estimate the value of that pixel as the weighted
sum of pixels around it.
So, in this way, we get back the re-estimated value of all the pixels.
We have all seen the convolution operation in practice. Let's say the kernel that we are using is
as below:
Kernel
So, we move this kernel all over the image and re-compute every pixel as the weighted sum of its
neighborhood. In this case, since all nine weights are 1/9, the re-estimated value of each pixel is
simply the average of the 9 pixels over which the kernel is placed.
Averaging every pixel with its neighborhood in this way dilutes the values and blurs the image,
and the output we get by applying this convolution operation is:
So, the blur operation that we all might have used in any of the photo editing application actually
applies the convolution operation behind the scenes.
Now, in the below-mentioned scenario, we use 5 as the weight for the central pixel, 0 for the
corner pixels and −1 for the remaining pixels. The net effect is that the value/color intensity of the
central pixel is boosted while its neighborhood information is subtracted, so the result is that this
kernel sharpens the image.
Let's take one more example: in the below case, the weight for the central pixel is −8 and for all
the remaining pixels it is 1, so the weights sum to zero.
So, wherever the 3X3 portion has the same color throughout (some sample regions are marked in
the below image), i.e. the neighbors are exactly the same as the current pixel, we get an output
intensity of 0.
So, in effect, wherever there is a boundary (yellow highlighted in the below image), the
neighboring pixels cannot all be the same as the current pixel; only in such regions do we get a
non-zero value, and everywhere else we get a zero value. In effect, we end up detecting the edges
in the image.
Once we complete an entire row, we slide the kernel down by one step in the vertical direction
and start again from the left side.
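As a sketch of the blur, sharpen and edge kernels discussed above, the following snippet applies them to a random grayscale array using SciPy (scipy.signal.convolve2d is used here only for convenience; the image values are made up):

import numpy as np
from scipy import signal

blur    = np.full((3, 3), 1.0 / 9.0)                 # average of the 3X3 neighborhood
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)      # boost the center, subtract the neighbors
edge    = np.array([[ 1,  1,  1],
                    [ 1, -8,  1],
                    [ 1,  1,  1]], dtype=float)      # weights sum to 0, so flat regions give 0

image = np.random.rand(30, 30)                        # stand-in for a 30X30 grayscale image
for name, k in [("blur", blur), ("sharpen", sharpen), ("edge", edge)]:
    out = signal.convolve2d(image, k, mode="same", boundary="symm")
    print(name, out.shape)                            # each output is again 30X30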
In the case of a 3D input (an image is also a 3D input, as it has 3 channels corresponding to Red,
Green and Blue; these channels are superimposed on each other to give the final image, i.e. every
pixel has 3 values associated with it, which we can view as the depth), we have 3 channels (depth),
one corresponding to each of R, G and B. We use a filter of the same depth as the input, place the
filter over the input, and compute the weighted sum of all the values in the 3D neighborhood
covered by the filter.
In most cases when we use convolution on 3D inputs, we use a 3D convolution filter (as depicted
in the below image). That means if we place the filter at a given location in the image, we take a
weighted average of its 3D neighborhood, but we are not going to slide it along the depth. In other
words, the kernel has the same depth as the original input, so there is no scope to move it through
the depth: for example, if the input image depth is 3 and the kernel depth is also 3, there is no
room left to move along the depth.
In this case, also, we move the filter horizontally and vertically as in the 2D case. We don’t move
the filter along the depth as the input image depth is the same as the filter depth and there is no
scope to move across the depth.
So, what we do in practice is take this 3D kernel and start moving it, first along the horizontal
direction, and keep doing this over the entire image until we reach the last position (moving left
to right and top to bottom). At the end of this, although our input was 3-dimensional, we get back
a 2D output.
Points to consider:
Input is 3D
The filter is also 3D
The convolutional operation that we perform is 2D as we are sliding the filter horizontally
and vertically and not along the depth
This is because the depth of the filter is the same as the depth of the input
In practice, we apply multiple kernels/filters to the same input and get different
representations/outputs from the same input depending on the kernel used; for example, one filter
might detect the vertical edges in the input, a second might detect the horizontal edges, another
might blur the image, and so on.
In the above image, we are using 3 different filters and we get 3 outputs, one corresponding to
each filter. We can combine these different output representations into one single volume (each
output representation has a width and a height, and by stacking all of the representations we get
the depth as well). So, if we apply 3 filters to the input, we get an output of depth 3; if we apply
100 filters, we get an output of depth 100.
Points to consider:
Each filter applied to a 3D input would give a 2D output.
Combining the output of multiple such filters would result in a 3D output.
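A quick way to see these two points is to apply a Keras convolution layer to a small 3D input; the sizes below are arbitrary illustration values, not taken from the notes:

import numpy as np
import tensorflow as tf

# One 7X7 RGB image: shape (batch, height, width, depth) = (1, 7, 7, 3).
x = np.random.rand(1, 7, 7, 3).astype("float32")

# Each 3X3 filter automatically has depth 3 (the input depth), so each filter
# produces one 2D map; stacking the maps of all filters gives the output depth.
print(tf.keras.layers.Conv2D(filters=3, kernel_size=3)(x).shape)    # (1, 5, 5, 3)
print(tf.keras.layers.Conv2D(filters=100, kernel_size=3)(x).shape)  # (1, 5, 5, 100)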
Terminology
Let's define some terminology and find the relation between the input dimensions and the output
dimensions. Let the input have width Wi, height Hi and depth Di.
The spatial extent of a filter, F (the extent of the neighborhood we are looking at), gives the
dimension of the filter: the filter is 'F X F'. Usually we have an odd-dimensional filter, and the
depth of the filter is the same as the depth of the input (Di in this case).
Now we want to relate the output dimensions with the input dimensions:
Let’s take 2D input of dimension ‘7 X 7’ and we have a filter of size ‘3 X 3’ over it.
As we slide the filter over it(from left to right and top to bottom), we keep computing the output
values, and it's very clear that the output is smaller than the input.
This is how we slide the filter over the image:
The reason this happens is obvious: we can't place the kernel at the corners, as it would cross the
boundary.
We can’t place the filter at the crossed pixel(below image) because if we place it there then
yellow highlighted portion would be undefined:
And in practice, we would stop at the crossed pixel(as in the below image) when the filter
completely lies inside the image:
And this is why we get the smaller output because we would not be able to apply the filter in any
part in the shaded region in the below image:
Hence we are not computing a re-estimated value for every pixel of the input, and therefore the
number of pixels in the output is less than in the input: with no padding, an input of width W and
a filter of size F give an output of width W − F + 1 (so a '7 X 7' input with a '3 X 3' kernel gives
a '5 X 5' output).
This was the case for ‘3 X 3’ kernel, now let’s see what happens when we have ‘5 X 5’ kernel:
Now we can not place the kernel at the crossed pixel in the above image. We can not place the
kernel at the yellow highlighted pixel as well. So, in this case, we can not place the kernel at any
of the shaded regions in the below image:
Padding(P): To avoid this shrinking, we pad the input with 0s all around the image; applying the
3X3 filter over this padded input gives an output of the same dimension as the input.
If we place the kernel at the crossed pixel in the below image, we now have 5 artificial pixels
with a value of 0 and we would be able to re-estimate the value of this crossed pixel.
Now the output would be again ‘7 X 7’ as we have introduced this artificial boundary around the
original input and this boundary contains all the values as 0.
If we have a ‘5 X 5’ filter, it would still go outside the image even after this artificial padding
So, in this case, we need to increase padding. Earlier we added padding of 1(meaning 1 row at the
top, 1 at the bottom, 1 at the left and 1 at the right). And it’s obvious from the above image that if
we want to use a ‘5 X 5’, then we should use the padding of 2.
The bigger the kernel size, the larger the padding required, and the updated relation between the
input and output dimensions (with stride 1) is:
W_out = W − F + 2P + 1 and H_out = H − F + 2P + 1, where W X H is the input size, F the filter
size and P the padding.
Stride(S): The stride defines the interval at which the filter is applied. Till now we have discussed
all the cases considering the stride to be 1, i.e. we move the filter by 1 in the horizontal and vertical
directions, as depicted in the below image:
In some cases we may not want this; say we don't want a full replica of the image and just need
a summary of it. In that case, we may choose to apply the filter only at alternate locations in the
input.
Here we use S = 2 i.e we move the filter by 2 in the horizontal as well as the vertical direction
This interval between two successive pixels where we apply the kernel is termed as the Stride.
And in the above case, the output would be roughly half the size of the input, as we are skipping
every alternate location in the image.
Now, if we are using a stride S, the formula to compute the output width and height is:
W_out = (W − F + 2P)/S + 1 and H_out = (H − F + 2P)/S + 1 (taking the floor when the division
is not exact).
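As a small sketch, this relation can be wrapped in a helper function (the name conv_output_size is ours, not from the notes) and checked against the cases discussed above:

def conv_output_size(w_in, f, p, s):
    """Output width/height for input size w_in, filter f, padding p, stride s."""
    return (w_in - f + 2 * p) // s + 1

print(conv_output_size(7, 3, 0, 1))  # 5: a 3X3 filter with no padding shrinks a 7X7 input
print(conv_output_size(7, 3, 1, 1))  # 7: padding of 1 keeps the size
print(conv_output_size(7, 5, 2, 1))  # 7: a 5X5 filter needs padding of 2
print(conv_output_size(7, 3, 1, 2))  # 4: stride 2 roughly halves the size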
Deep learning ultimately stacks multiple trainable stages, so that the internal representation is
structured hierarchically. Especially for images, such a representation turns out to be very
powerful: low-level stages detect primitive features such as edges, while high-level stages
combine this information to describe where and how objects are positioned within the scene.
A common CNN model architecture is therefore to have a number of convolution and pooling
layers stacked one after the other.
Why use Pooling Layers?
Pooling layers are used to reduce the dimensions of the feature maps. Thus, it reduces the
number of parameters to learn and the amount of computation performed in the network.
The pooling layer summarises the features present in a region of the feature map
generated by a convolution layer. So, further operations are performed on summarised
features instead of precisely positioned features generated by the convolution layer. This
makes the model more robust to variations in the position of the features in the input
image.
1. Max Pooling
Max pooling is a pooling operation that selects the maximum element from the region of
the feature map covered by the filter. Thus, the output after the max-pooling layer would be a
feature map containing the most prominent features of the previous feature map.
2. Average Pooling
Average pooling computes the average of the elements present in the region of the feature map
covered by the filter. Thus, while max pooling gives the most prominent feature in a particular
patch of the feature map, average pooling gives the average of the features present in a patch.
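A minimal sketch of max and average pooling with Keras (the 4X4 feature map below is a made-up example, not from the notes):

import numpy as np
import tensorflow as tf

# One 4X4 single-channel feature map: shape (batch, height, width, channels).
fmap = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 2],
                 [7, 9, 8, 3],
                 [1, 2, 4, 5]], dtype="float32").reshape(1, 4, 4, 1)

max_pooled = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(fmap)
avg_pooled = tf.keras.layers.AveragePooling2D(pool_size=(2, 2))(fmap)

print(max_pooled.numpy().reshape(2, 2))   # [[6. 5.] [9. 8.]]     most prominent value per patch
print(avg_pooled.numpy().reshape(2, 2))   # [[3.5 2.5] [4.75 5.]] average value per patch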
3. Global Pooling :-
Global pooling reduces each channel in the feature map to a single value. Thus, an nh x nw
x nc feature map is reduced to 1 x 1 x nc feature map. This is equivalent to using a filter of
dimensions nh x nw i.e. the dimensions of the feature map.
Further, it can be either global max pooling or global average pooling.
Code #3 : Performing Global Pooling using keras.
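The referenced code listing is not reproduced in these notes; a minimal Keras sketch of global pooling could look like this (the feature-map values are made up):

import numpy as np
import tensorflow as tf

# A feature map of shape (batch, nh, nw, nc) = (1, 4, 4, 3).
fmap = np.random.rand(1, 4, 4, 3).astype("float32")

gmp = tf.keras.layers.GlobalMaxPooling2D()(fmap)
gap = tf.keras.layers.GlobalAveragePooling2D()(fmap)

print(gmp.shape, gap.shape)   # (1, 3) and (1, 3): each channel reduced to a single value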
Normalization :-
To fully understand how Batch Norm works and why it is important, let’s start by talking
about normalization.
Normalization is a pre-processing technique used to standardize data. In other words, it brings
data from different sources into the same range. Not normalizing the data before training can
cause problems in our network, making it drastically harder to train and decreasing its learning
speed.
For example, imagine we have a car rental service. Firstly, we want to predict a fair price
for each car based on competitors’ data. We have two features per car: the age in years and the
total amount of kilometers it has been driven for. These can have very different ranges, ranging
from 0 to 30 years, while distance could go from 0 up to hundreds of thousands of kilometers. We
don’t want features to have these differences in ranges, as the value with the higher range might
bias our models into giving them inflated importance.
There are two main methods to normalize our data. The most straightforward method is to scale
it down to a small fixed range:
X' = (X − m) / (Xmax − Xmin)
with X the data point to normalize, m the mean of the data set, Xmax the highest value, and Xmin
the lowest value. This technique is generally used on the inputs of the network. Non-normalized
data points with wide ranges can cause instability in neural networks: the relatively large inputs
can cascade down through the layers, causing problems such as exploding gradients.
The other technique used to normalize data is forcing the data points to have a mean of 0 and
a standard deviation of 1, using the following formula:
X' = (X − m) / S
being X the data point to normalize, m the mean of the data set, and S the standard deviation of
the data set. Now, each data point mimics a standard normal distribution. Having all the features
on this scale, none of them will have a bias, and therefore our models will learn better.
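As a small NumPy sketch of these two techniques on the car-rental example (the kilometer values are made up):

import numpy as np

kms = np.array([5_000., 20_000., 60_000., 120_000., 250_000., 400_000.])  # kilometers driven

range_scaled = (kms - kms.mean()) / (kms.max() - kms.min())   # first technique: scale by the range
standardized = (kms - kms.mean()) / kms.std()                 # second technique: mean 0, std 1

print(range_scaled.round(3))
print(standardized.round(3))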
In Batch Norm, we use this last technique to normalize batches of data inside the network
itself.
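In Keras, this is typically done by inserting a BatchNormalization layer between other layers; a minimal sketch (the layer sizes here are arbitrary):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, padding="same", input_shape=(28, 28, 1)),
    tf.keras.layers.BatchNormalization(),   # normalizes each mini-batch inside the network
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()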
Applications in Computer Vision :-
Most computer vision algorithms use something called a convolutional neural network, or
CNN. A CNN is a model used in machine learning to extract features, like texture and edges,
from spatial data.
ImageNet :-
The ImageNet project is a large visual database designed for use in visual object recognition
software research. More than 14 million images have been hand-annotated by the project to
indicate what objects are pictured and in at least one million of the images, bounding boxes are
also provided.
LeNet :-
This is also known as the classic neural network; it was designed by Yann LeCun,
Leon Bottou, Yoshua Bengio and Patrick Haffner for handwritten and machine-printed character
recognition in the 1990s, and they called it LeNet-5. The architecture was designed to identify
handwritten digits in the MNIST data-set, and it is pretty straightforward and simple to
understand. The input images were grayscale with dimensions 32*32*1, followed by two pairs
of a convolution layer (stride 1) and an average pooling layer (stride 2), and finally fully
connected layers with Softmax activation in the output layer. Traditionally, this network had
about 60,000 parameters in total. Refer to the original paper.
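A LeNet-5-style network can be sketched in Keras as follows (this is an approximation: tanh activations and a softmax output are used here in place of the original paper's sigmoid/RBF details):

import tensorflow as tf

lenet = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, kernel_size=5, activation="tanh", input_shape=(32, 32, 1)),
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),
    tf.keras.layers.Conv2D(16, kernel_size=5, activation="tanh"),
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="tanh"),
    tf.keras.layers.Dense(84, activation="tanh"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
lenet.summary()   # roughly 60,000 trainable parameters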
Recurrent Neural Networks :-
A Recurrent Neural Network (RNN) is a type of neural network where the output from the
previous step is fed as input to the current step. In traditional neural networks, all the inputs and
outputs are independent of each other, but in cases where we need to predict the next word of a
sentence, the previous words are required, and hence there is a need to remember them. Thus the
RNN came into existence, which solved this issue with the help of a hidden layer. The main and
most important feature of an RNN is the hidden state, which remembers some information about
the sequence.
An RNN has a “memory” which remembers all information about what has been calculated.
It uses the same parameters for each input, as it performs the same task on all the inputs or hidden
layers to produce the output. This reduces the complexity of the parameters, unlike other neural
networks.
How RNN works
The current hidden state is computed from the previous state and the current input:
h_t = f(h_t-1, x_t)
where:
h_t -> current state
h_t-1 -> previous state
x_t -> input state
Applying the activation function (tanh), this becomes:
h_t = tanh(Whh · h_t-1 + Wxh · x_t)
where:
Whh -> weight at recurrent neuron
Wxh -> weight at input neuron
The output is then computed as:
Y_t = Why · h_t
where:
Y_t -> output
Why -> weight at output layer
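A single forward step of this recurrence can be sketched in NumPy (the sizes below are arbitrary toy values chosen for illustration):

import numpy as np

rng = np.random.default_rng(0)
Wxh = rng.normal(size=(4, 3))    # weight at the input neuron
Whh = rng.normal(size=(4, 4))    # weight at the recurrent neuron
Why = rng.normal(size=(2, 4))    # weight at the output layer

h_prev = np.zeros(4)             # previous state h_{t-1}
x_t = rng.normal(size=3)         # input at time t

h_t = np.tanh(Whh @ h_prev + Wxh @ x_t)   # current state
y_t = Why @ h_t                            # output
print(h_t.shape, y_t.shape)                # (4,) (2,)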
RNN topologies :-

Long Short Term Memory (LSTM) :-
Long short-term memory (LSTM) is an artificial neural network used in the fields of
artificial intelligence and deep learning. Unlike standard feedforward neural networks, LSTM has
feedback connections. Such a recurrent neural network can process not only single data points
(such as images), but also entire sequences of data (such as speech or video). For example, LSTM
is applicable to tasks such as unsegmented, connected handwriting recognition, speech
recognition, machine translation, robot control, video games, and healthcare. LSTM has become
the most cited neural network of the 20th century.
A common LSTM unit is composed of a cell, an input gate, an output gate and a
forget gate. The cell remembers values over arbitrary time intervals and the three gates regulate
the flow of information into and out of the cell.
LSTM networks are well-suited to classifying, processing and making predictions
based on time series data, since there can be lags of unknown duration between important events
in a time series. LSTMs were developed to deal with the vanishing gradient problem that can be
encountered when training traditional RNNs. Relative insensitivity to gap length is an advantage
of LSTM over RNNs, hidden Markov models and other sequence learning methods in numerous
applications.
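A minimal LSTM model in Keras for a hypothetical time-series task, with sequences of 20 time steps and 8 features each (these sizes are illustrative assumptions):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(20, 8)),   # LSTM cell with input/output/forget gates
    tf.keras.layers.Dense(1),                        # one predicted value per sequence
])
model.compile(optimizer="adam", loss="mse")
model.summary()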
Bidirectional LSTMs :-
In a bidirectional LSTM, information flows through both a backward and a forward layer.
BI-LSTM is usually employed where sequence-to-sequence tasks are needed. This kind of network
can be used in text classification, speech recognition and forecasting models. A bidirectional
LSTM model can be built in Python, as sketched below.
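A sketch of such a bidirectional LSTM in Keras for a hypothetical binary text-classification task; the vocabulary size, sequence length, and dummy data are made-up assumptions:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=32),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),  # forward + backward LSTM
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

x = np.random.randint(0, 10000, size=(8, 100))   # 8 dummy sequences of 100 token ids
y = np.random.randint(0, 2, size=(8,))            # dummy binary labels
model.fit(x, y, epochs=1, verbose=0)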
Bidirectional RNNs :
Bidirectional recurrent neural networks (BRNN) connect two hidden layers of opposite
directions to the same output. With this form of generative deep learning, the output layer can get
information from past (backwards) and future (forward) states simultaneously. Invented in 1997
by Schuster and Paliwal, BRNNs were introduced to increase the amount of input information
available to the network. For example, multilayer perceptrons (MLPs) and time delay neural
networks (TDNNs) have limitations on input data flexibility, as they require their input data to
be fixed.
BRNNs are especially useful when the context of the input is needed. For example, in handwriting
recognition, the performance can be enhanced by knowledge of the letters located before and after
the current letter.
Application case study - Handwritten digits recognition using deep learning
The ability of computers to recognize human handwritten digits is referred to as handwritten digit
recognition. Handwritten digits are not perfect and can be written in many shapes, which makes
it a tedious task for machines to recognize them. So, in this case study, we will take an image of a
digit and recognize the digit present in that image.
About the project we are going to create:
In this project, we will be using a Convolutional Neural Network to create our model, which will
predict the digit present in an image. We will use the MNIST dataset, with the help of which we
will build our handwritten digit recognition project.
Project Prerequisites:
The libraries that should be installed on your computer are:
TensorFlow
NumPy (used below for np.argmax)
Matplotlib (used below to visualize the images)
Let’s start Building our deep learning project that is Handwritten Digit Recognition:
1) Import required libraries and load Dataset:
Let’s go step by step. We will import the libraries whenever we require, so first, we only import
tensorflow so that we can load our dataset, as I have told you that the MNIST dataset is already
present in tensorflow. So we can easily import the dataset and start working on it.
import tensorflow as tf
mnist = tf.keras.datasets.mnist
2) Splitting of Data:
Now we will split our training and testing data, and its corresponding labels, using
mnist.load_data() method. And by using x_train.shape we will get the shape of our training data
that is (60,000, 28, 28).
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train.shape
3) Visualisation of data:
Let’s visualize our data using matplotlib library, so firstly we have to import matplotlib, and then
we are able to see the first image of our training data using plt.imshow().
import matplotlib.pyplot as plt
plt.imshow(x_train[0])
4) Normalize Data:
We cannot feed our image directly into our model, so we have to perform some operations to
process the data to make it ready for our neural network. Firstly we have to normalize our data,
we will do this with the help of tf.keras.utils.normalize() method.
x_train = tf.keras.utils.normalize(x_train, axis=1)
x_test = tf.keras.utils.normalize(x_test, axis=1)
plt.imshow(x_train[0], cmap=plt.cm.binary)
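The notes skip the intermediate steps (5-10) that build, compile, and train the model. A minimal sketch of what those steps might look like, continuing the snippets above (the exact architecture, layer sizes, and epoch count here are illustrative assumptions, not the original code):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Reshape, Conv2D, MaxPooling2D, Flatten, Dense

# Hypothetical steps 5-10: build, compile and train a small CNN on the normalized data.
model = Sequential([
    Reshape((28, 28, 1), input_shape=(28, 28)),   # add a channel axis for the convolution
    Conv2D(32, kernel_size=3, activation="relu"),
    MaxPooling2D(pool_size=2),
    Flatten(),
    Dense(128, activation="relu"),
    Dense(10, activation="softmax"),              # one probability per digit 0-9
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_split=0.1)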
Our model accuracy is more than 99% on training data and more than 98% on our validation data.
11) Now, let's evaluate our model on our test data:
test_loss, test_acc = model.evaluate(x_test, y_test)
print('Test loss on 10,000 test samples', test_loss)
print('Test accuracy on 10,000 test samples', test_acc)
12) Predictions:
Our model is ready, and now we can predict the digits present in an image. Let's see the prediction
made by our model and the actual number in the image (NumPy is needed here for np.argmax):
import numpy as np
predictions = model.predict(x_test)
print(np.argmax(predictions[54]))
plt.imshow(x_test[54])