
SKN SINHGAD COLLEGE OF ENGINEERING

PANDHARPUR
Unit – III

Syllabus
Convolutional Neural Networks: Introduction, Convolution Operation, Motivation, Pooling, Normalization, Applications in Computer Vision – ImageNet; Sequence Modelling – VGGNet, LeNet; Recurrent Neural Networks, RNN Topologies, Difficulty in Training RNNs, Long Short-Term Memory, Bidirectional LSTMs, Bidirectional RNNs; Application case study – Handwritten digit recognition using deep learning.

 Convolutional Neural Networks Introduction:


Convolutional networks (LeCun, 1989), also known as convolutional neural networks
or CNNs, are a specialized kind of neural network for processing data that has a known, grid-like
topology. Examples include time-series data, which can be thought of as a 1D grid taking samples
at regular time intervals, and image data, which can be thought of as a 2D grid of pixels.
Convolutional networks have been tremendously successful in practical applications. The name
“convolutional neural network” indicates that the network employs a mathematical operation
called convolution. Convolution is a specialized kind of linear operation. Convolutional networks
are simply neural networks that use convolution in place of general matrix multiplication in at
least one of their layers.

 Convolution Operation:
The convolution operation re-estimates each input value as the weighted average of the inputs around it: we assign weights to the neighboring values and take the weighted sum of those neighbors to estimate the value of the current input/pixel.
For a 2D input, the classic example is an image, where we re-calculate the value of every pixel by taking the weighted sum of the pixels (neighbors) around it. For example, let's say the input image is as given below:


Input Image
Now, in this input image, we calculate the value of each pixel by considering the weighted sum of the pixels around it.

Here we are calculating the value of the circled pixel by considering 3 neighbors around it; assume that the weights w1, w2, w3, w4 are associated with these 4 pixels (the pixel itself and its 3 neighbors) respectively.
This matrix of weights is referred to as the Kernel or Filter. In the above case, we have a kernel of size 2X2.


We compute the output (the re-estimated value of the current pixel) using the following formula:

$$S(i, j) = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} I(i+a,\ j+b)\, K(a, b)$$

Here m refers to the number of rows of the kernel (which is 2 in this case) and n refers to the number of columns (which is also 2 in this case).
Now we place the 2X2 filter over the first 2X2 portion of the image and take the weighted sum, which gives the new value of the first pixel.

We map the 2X2 kernel/filter over the 2X2 portion of the input.
The output of this operation would be: (aw + bx + ey + fz)
Then we move the filter horizontally by one and place it over the next 2X2 portion of the input; in this case the pixels of interest would be b, c, f, g, and computing the output using the same technique we would get: (bw + cx + fy + gz)


And then again we move the kernel/filter by 1 in the horizontal direction and take the weighted sum.

So, after this, the output from the first layer would look like:

Then we move the kernel down by 1 in the vertical direction and calculate the output there. In general we move the kernel like this: first, we start at the top-left portion of the image, move the filter in the horizontal direction and cover the row completely; then we move the filter down in the vertical direction, again stride it horizontally through the entire row, and continue like this. In essence, we move the kernel left to right, top to bottom.
Instead of considering pixels only in the forward direction, we can consider the previous neighbors as well.


And to consider the previous neighbors, the formula for computing the output becomes:

$$S(i, j) = \sum_{a=-\lfloor m/2 \rfloor}^{\lfloor m/2 \rfloor} \ \sum_{b=-\lfloor n/2 \rfloor}^{\lfloor n/2 \rfloor} I(i+a,\ j+b)\, K(a, b)$$

We take the limits from -m/2 to m/2, i.e. we take half of the rows from the previous neighbors and the other half from the forward direction (forward neighbors), and the same is the case in the other direction (-n/2 to n/2).
Typically, we use an odd-dimensional kernel.
Convolutional Operation in practice
Let the input image be as given below:


and we use a kernel/filter of size 3X3; for each pixel, we take the 3X3 neighborhood around it (the pixel itself is part of this 3X3 neighborhood and sits at the center), just like in the below image:

Input image; we consider 3X3 portions of this image as the kernel is of size 3X3
Let's say this input is a 30X30 image. We go over every pixel systematically, place the filter such that the pixel is at the center of the kernel, and re-estimate the value of that pixel as the weighted sum of the pixels around it.


So, in this way, we get back the re-estimated value of all the pixels.
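This sliding-window computation is easy to express directly in NumPy. The following is a minimal illustrative sketch (the function and variable names are our own, not from any library): it applies a kernel to a 2D grayscale image with stride 1 and no padding, so border pixels where the kernel would cross the boundary are skipped. Strictly speaking it computes cross-correlation (the kernel is not flipped), which is what most deep learning libraries implement under the name convolution.

import numpy as np

def conv2d(image, kernel):
    # Re-estimate every pixel as the weighted sum of its m x n neighborhood.
    # Border pixels where the kernel would cross the image boundary are skipped,
    # so the output is smaller than the input (no padding, stride 1).
    m, n = kernel.shape
    H, W = image.shape
    out = np.zeros((H - m + 1, W - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + m, j:j + n] * kernel)
    return out

image = np.random.rand(30, 30)          # a toy 30X30 "image"
kernel = np.ones((3, 3)) / 9.0          # the 3X3 averaging (blur) kernel discussed below
print(conv2d(image, kernel).shape)      # (28, 28): smaller than the input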
We have all seen the convolution operation in practice. Let's say the kernel that we are using is as below:

Kernel
So, we move this kernel all over the image and re-compute every pixel as the weighted sum of its neighborhood. In this case, since all the weights are 1/9, the kernel computes the average of the 9 pixels over which it is placed.
That means for each pixel/color in the image we take the neighborhood average (the weighted sum divided by 9), which dilutes the value and blurs the image; the output we get by applying this convolution operation is:


So, the blur operation that we might have used in any photo-editing application actually applies a convolution operation behind the scenes.
Now, in the below-mentioned scenario, we use 5 as the weight for the central pixel, 0 for the corner pixels and -1 for the remaining neighboring pixels; the net effect is that the value/color intensity of the central pixel is boosted while its neighborhood information is subtracted, so the result is that it sharpens the image.

The output of the above convolution is:

Let's take one more example: in the below case, the weight for the central pixel is -8 and for all

other pixels it is 1. So if we have the same color across the 3X3 portion of the image (just like for the marked pixel in the below image), and the pixel intensity of this current pixel is denoted by 'x', then we get -8x from the central pixel and +8x from the weighted sum of all the other pixels, and the summation of these results in 0.

So, wherever we have the same color in the 3X3 portion (some sample regions are marked in the below image), or in other words wherever the neighbors are exactly the same as the current pixel, we get an output intensity of 0.

So, in effect, wherever there is a boundary (yellow highlighted in the below image), the neighboring pixels cannot all be the same as the current pixel; only in such regions do we get a non-zero value, and everywhere else we get zero. So, in effect, we end up detecting all the edges in the input image.
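The three kernels discussed above (blur, sharpen, edge detection) can be tried out with SciPy's standard 2D convolution routine; a short sketch, assuming SciPy is installed (the kernel values are the common textbook choices, and since these kernels are symmetric, the kernel flip performed by true convolution makes no difference here):

import numpy as np
from scipy.signal import convolve2d

blur = np.ones((3, 3)) / 9.0                 # average of the 9 pixels: blurs the image
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])           # boost the center, subtract the neighbors
edges = np.array([[ 1,  1,  1],
                  [ 1, -8,  1],
                  [ 1,  1,  1]])             # outputs 0 wherever the 3X3 region is flat

image = np.random.rand(30, 30)
for name, k in [("blur", blur), ("sharpen", sharpen), ("edges", edges)]:
    out = convolve2d(image, k, mode="same", boundary="fill")  # zero padding keeps 30X30
    print(name, out.shape)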

2D Convolution with 3D filter:


Below is a complete picture of how the 2D convolution operation is performed over the input: we start at the top-left corner, apply the kernel over that area, move the kernel horizontally towards the right, and once we have reached the end (completed the entire row) on the right side, we move the kernel downwards by some steps and again start from the left side and move towards the right:


We slide the kernel horizontally


Once we complete the entire row, we slide the kernel vertically downwards and start again from the left side.

We move from left to right and from top to bottom.

In the case of 3D input (an image is also a 3D input, as it has 3 channels corresponding to Red, Green and Blue; these channels are superimposed on each other to form the final image, so every pixel has 3 values associated with it, which we can regard as the depth), we have 3 channels (depth), one corresponding to each of R, G and B. We use a filter of the same depth as the input, place the filter over the input, and compute the

weighted sum across all the 3 dimensions.

In most cases when we use convolution for 3D inputs, we use a 3D convolution filter (as depicted in the below image). That means if we place the filter at a given location in the image, we take a weighted average of its 3D neighborhood, but we do not slide it along the depth. The kernel has the same depth as the original input, so there is no scope to move it through the depth: for example, if the input image depth is 3 and the kernel depth is also 3, there is no movement available there.


In this case also, we move the filter horizontally and vertically as in the 2D case. We don't move the filter along the depth, as the input image depth is the same as the filter depth.
So, what we do in practice is take this 3D kernel and start moving it, first along the horizontal direction, and keep doing this through the entire image until we reach the last box (we move from left to right and top to bottom). At the end of this, although our input was 3-dimensional, we get back a 2D output.


Points to consider:
 Input is 3D
 The filter is also 3D
 The convolutional operation that we perform is 2D as we are sliding the filter horizontally
and vertically and not along the depth
 This is because the depth of the filter is the same as the depth of the input

In practice, we apply multiple kernels/filters to the same input and get a different representation/output from the same input for each kernel used: for example, one filter might detect the vertical edges in the input, a second might detect the horizontal edges, another might blur the image, and so on.


In the above image, we are using 3 different filters and getting 3 outputs, one corresponding to each filter. We can combine these different output representations into one single volume (each output representation has a width and a height, and after combining all of the representations we get a depth as well). So, if we apply 3 filters to the input we get an output of depth 3, and if we apply 100 filters we get an output of depth 100.


Points to consider:
 Each filter applied to a 3D input would give a 2D output.
 Combining the output of multiple such filters would result in a 3D output.
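A small NumPy sketch (names are our own) makes the points above concrete: each 3D filter slides only horizontally and vertically, and the output depth equals the number of filters.

import numpy as np

def convolve3d(volume, kernels):
    # volume:  (H, W, D)     input of depth D
    # kernels: (K, F, F, D)  K filters, each of the same depth D as the input
    H, W, D = volume.shape
    K, F, _, _ = kernels.shape
    out = np.zeros((H - F + 1, W - F + 1, K))
    for k in range(K):                      # one 2D output map per filter
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # weighted sum over the full F x F x D neighborhood
                out[i, j, k] = np.sum(volume[i:i + F, j:j + F, :] * kernels[k])
    return out

x = np.random.rand(30, 30, 3)       # RGB image: depth 3
w = np.random.rand(100, 3, 3, 3)    # 100 filters of size 3X3X3
print(convolve3d(x, w).shape)       # (28, 28, 100): output depth = number of filters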

Terminology
Let's define some terminology and find out the relation between the input dimensions and the output dimensions. The symbols used in what follows are: the input width Wi, height Hi and depth Di; the spatial extent F of the filter; the stride S; the padding P; the number of filters K; and the output width Wo, height Ho and depth Do.

The spatial extent (the extent of the neighborhood we are looking at) of a filter (F) means the dimension of the filter: it would be 'F X F'. Usually we have an odd-dimensional filter, and the depth of the filter is the same as the depth of the input (Di in this case).


Now we want to relate the output dimensions to the input dimensions.
Let's take a 2D input of dimension '7 X 7' and a filter of size '3 X 3' over it.

As we slide the filter over it (from left to right and top to bottom), we keep computing the output values, and it's very clear that the output is smaller than the input.
This is how we slide the filter over the image:

The reason why this happens is obvious: we can't place the kernel at the corners, as it would cross the boundary.
We can't place the filter at the crossed pixel (below image) because if we placed it there, the yellow-highlighted portion would be undefined:


In practice, we stop at the crossed pixel (as in the below image), where the filter still lies completely inside the image:

And this is why we get a smaller output: we are not able to apply the filter in any part of the shaded region in the below image:

Hence we do not compute a re-estimated value for every pixel of the input, and therefore the number of pixels in the output is less than the number of pixels in the input.

This was the case for a '3 X 3' kernel; now let's see what happens when we have a '5 X 5' kernel:

Now we cannot place the kernel at the crossed pixel in the above image, nor at the yellow-highlighted pixel. So, in this case, we cannot place the kernel at any of the shaded regions in the below image:


The bigger the kernel used, the smaller the output.

So, the output dimension in terms of the input is:

$$W_o = W_i - F + 1, \qquad H_o = H_i - F + 1$$

What if we want the output to be of the same size as the input?

If we want the output to be the same size as the input, then we need to pad the input appropriately:


Here we pad the input with 0s all around the image, apply the 3X3 filter over the input, and get an output of the same dimension as the input.
If we place the kernel at the crossed pixel in the below image, we now have 5 artificial pixels with a value of 0, and we are able to re-estimate the value of this crossed pixel.

Now the output would again be '7 X 7', as we have introduced an artificial boundary around the original input, and this boundary contains all zeros.
If we have a '5 X 5' filter, it would still go outside the image even after this artificial padding:


So, in this case, we need to increase the padding. Earlier we added a padding of 1 (meaning 1 row at the top, 1 at the bottom, 1 column at the left and 1 at the right). And it's obvious from the above image that if we want to use a '5 X 5' filter, we should use a padding of 2.

The bigger the kernel size, the larger the padding required, and the updated formula for the relation between input and output dimensions (with padding P, stride 1) is:

$$W_o = W_i - F + 2P + 1, \qquad H_o = H_i - F + 2P + 1$$

Stride (S): The stride defines the interval at which the filter is applied. Till now we discussed all the cases considering the stride to be 1, as we moved the filter by 1 in the horizontal and vertical directions, as depicted in the below image:

In some cases we may not want this; say we don't want a full replica of the image, just a summary of it. In that case, we may choose to apply the filter only at alternate locations in the input.


Here we use S = 2, i.e. we move the filter by 2 in the horizontal as well as the vertical direction.
This interval between two successive locations where we apply the kernel is termed the Stride. In the above case, the output is roughly half the size of the input, as we skip every alternate position in the image.
Now, if we are using a stride 'S', then the formula to compute the output width and height is given by:

$$W_o = \frac{W_i - F + 2P}{S} + 1, \qquad H_o = \frac{H_i - F + 2P}{S} + 1$$

The higher the stride, the smaller the size of the output.


The depth of the output is going to be the same as the number of filters that we have.
Each 3D filter applied over the 3D input gives one 2D output; if we use K such filters, we get K such 2D outputs, and stacking up all these K outputs gives an output of depth K. So, the depth of the output is the same as the number of filters used: Do = K.
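These relations are easy to wrap in a small helper; an illustrative sketch (our own function, not a library call) that reproduces the examples from this section:

def conv_output_shape(Wi, Hi, F, P=0, S=1, K=1):
    # Output width/height: (input - filter + 2*padding)/stride + 1 (integer part);
    # output depth equals the number of filters K.
    Wo = (Wi - F + 2 * P) // S + 1
    Ho = (Hi - F + 2 * P) // S + 1
    return Wo, Ho, K

print(conv_output_shape(7, 7, F=3))                      # (5, 5, 1): no padding shrinks the output
print(conv_output_shape(7, 7, F=3, P=1))                 # (7, 7, 1): padding 1 preserves the size
print(conv_output_shape(7, 7, F=5, P=2))                 # (7, 7, 1): a 5X5 filter needs padding 2
print(conv_output_shape(30, 30, F=3, P=1, S=2, K=100))   # (15, 15, 100): stride 2 roughly halves it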


Deep learning ultimately leads to multiple trainable stages, so that the internal representation is structured hierarchically. Especially for images, such a representation turns out to be very powerful: low-level stages detect primitive features such as edges, while high-level stages combine information about where and how objects are positioned in the scene.

Figure 1: Typical convolutional neural network with two feature stages


 Pooling:
The pooling operation involves sliding a two-dimensional filter over each channel of the feature map and summarizing the features lying within the region covered by the filter.
For a feature map having dimensions nh x nw x nc, the dimensions of the output obtained after a
pooling layer is

$$\left(\left\lfloor \frac{n_h - f}{s} \right\rfloor + 1\right) \times \left(\left\lfloor \frac{n_w - f}{s} \right\rfloor + 1\right) \times n_c$$


where,
-> nh - height of feature map
-> nw - width of feature map
-> nc - number of channels in the feature map
-> f - size of filter
-> s - stride length.

A common CNN model architecture is to have a number of convolution and pooling layers
stacked one after the other.
 Why use Pooling Layers?
 Pooling layers are used to reduce the dimensions of the feature maps. Thus, it reduces the
number of parameters to learn and the amount of computation performed in the network.
 The pooling layer summarises the features present in a region of the feature map
generated by a convolution layer. So, further operations are performed on summarised
features instead of precisely positioned features generated by the convolution layer. This
makes the model more robust to variations in the position of the features in the input
image.

 Types of Pooling Layers:

1. Max Pooling
Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter. Thus, the output after a max-pooling layer is a feature map containing the most prominent features of the previous feature map.


2. Average Pooling
Average pooling computes the average of the elements present in the region of the feature map covered by the filter. Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling gives the average of the features present in a patch.
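Both operations can be sketched in a few lines of NumPy for a single channel (a minimal illustration with our own helper, not a library function):

import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    # Slide an f x f window over a 2D feature map with stride s and
    # summarize each window by its maximum or its average.
    H, W = x.shape
    Ho, Wo = (H - f) // s + 1, (W - f) // s + 1
    out = np.zeros((Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            window = x[i * s:i * s + f, j * s:j * s + f]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(fmap, mode="max"))   # most prominent feature in each 2X2 patch
print(pool2d(fmap, mode="avg"))   # average of each 2X2 patch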

3. Global Pooling
Global pooling reduces each channel in the feature map to a single value. Thus, an nh x nw x nc feature map is reduced to a 1 x 1 x nc feature map. This is equivalent to using a filter of dimensions nh x nw, i.e. the dimensions of the feature map.
Further, it can be either global max pooling or global average pooling.
Code #3: Performing Global Pooling using Keras:
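The original listing is not reproduced here; the following is a minimal sketch of the corresponding Keras layers, assuming TensorFlow/Keras is installed:

import numpy as np
from tensorflow.keras.layers import GlobalMaxPooling2D, GlobalAveragePooling2D

fmaps = np.random.rand(1, 28, 28, 64).astype("float32")  # a batch of nh x nw x nc feature maps
print(GlobalMaxPooling2D()(fmaps).shape)                 # (1, 64): one value per channel
print(GlobalAveragePooling2D()(fmaps).shape)             # (1, 64)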

 Normalization:
To fully understand how Batch Norm works and why it is important, let's start by talking about normalization.
Normalization is a pre-processing technique used to standardize data; in other words, to bring different sources of data into the same range. Not normalizing the data before training can cause problems in our network, making it drastically harder to train and decreasing its learning speed.
For example, imagine we have a car rental service. Firstly, we want to predict a fair price for each car based on competitors' data. We have two features per car: the age in years and the total number of kilometers it has been driven. These can have very different ranges: age goes from 0 to about 30 years, while distance can go from 0 up to hundreds of thousands of kilometers. We don't want features with such different ranges, as the feature with the larger range might bias our models into giving it inflated importance.
There are two main methods to normalize our data. The most straightforward method is to scale it to a range from 0 to 1:

$$X' = \frac{X - m}{X_{max} - X_{min}}$$

with X the data point to normalize, m the mean of the data set, Xmax the highest value, and Xmin the lowest value. This technique is generally used on the inputs of the network. Non-normalized data points with wide ranges can cause instability in neural networks: relatively large inputs can cascade down through the layers, causing problems such as exploding gradients.
The other technique used to normalize data is forcing the data points to have a mean of 0 and
a standard deviation of 1, using the following formula:

$$X' = \frac{X - m}{S}$$

with X the data point to normalize, m the mean of the data set, and S the standard deviation of the data set. Now each data point mimics a standard normal distribution. With all the features on this scale, none of them will have a bias, and therefore our models will learn better.
In Batch Norm, we use this last technique to normalize batches of data inside the network
itself.
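A quick NumPy sketch of both techniques on the car-rental features (the numbers are invented purely for illustration):

import numpy as np

age = np.array([1.0, 5.0, 12.0, 30.0])                # age in years (toy values)
km = np.array([8000.0, 60000.0, 150000.0, 300000.0])  # kilometers driven (toy values)

def rescale(x):
    # First technique: scale using the mean and the range of the data set
    return (x - x.mean()) / (x.max() - x.min())

def standardize(x):
    # Second technique: zero mean and unit standard deviation (used by Batch Norm)
    return (x - x.mean()) / x.std()

print(rescale(age), rescale(km))                      # both features now live in a comparable range
print(standardize(km).mean(), standardize(km).std())  # ~0.0 and ~1.0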


 Applications in Computer Vision:

Most computer vision algorithms use something called a convolutional neural network, or CNN. A CNN is a model used in machine learning to extract features, like texture and edges, from spatial data.

 ImageNet:
The ImageNet project is a large visual database designed for use in visual object recognition
software research. More than 14 million images have been hand-annotated by the project to
indicate what objects are pictured and in at least one million of the images, bounding boxes are
also provided.

 Sequence Modelling – VGGNet:

VGG Net solved the major shortcoming of AlexNet, its large number of hyper-parameters, by replacing the large kernel-sized filters (11 and 5 in the first and second convolution layers, respectively) with multiple 3×3 kernel-sized filters one after another. The architecture, developed by Simonyan and Zisserman, was the 1st runner-up of the ImageNet Visual Recognition Challenge of 2014.
The architecture consists of 3×3 convolutional filters with a stride of 1 and 2×2 max-pooling layers with a stride of 2, keeping the padding "same" to preserve the spatial dimensions. In total, there are 16 weight layers in the network. The input is an RGB image of dimension 224×224×3, which passes through 5 blocks of convolutions (filters: 64, 128, 256, 512, 512), each followed by max pooling. The output of these layers is fed into three fully connected layers, with a softmax function in the output layer. In total there are 138 million parameters in VGG Net.
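For reference, VGG16 ships with Keras, so the layer stack described above can be inspected directly (assuming TensorFlow is installed; weights=None avoids downloading the pretrained ImageNet weights):

from tensorflow.keras.applications import VGG16

model = VGG16(weights=None, input_shape=(224, 224, 3), classes=1000)
model.summary()  # 5 conv blocks (64, 128, 256, 512, 512 filters) + 3 dense layers, ~138M parameters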

Drawbacks of VGG Net:


1. Long training time
2. Heavy model
3. Computationally expensive
4. Vanishing/exploding gradient problem

 LeNet:
This is also known as the classic neural network, designed by Yann LeCun, Leon Bottou, Yoshua Bengio and Patrick Haffner for handwritten and machine-printed character recognition in the 1990s, which they called LeNet-5. The architecture was designed to identify handwritten digits in the MNIST data-set, and it is pretty straightforward and simple to understand. The input images were grayscale with dimension 32*32*1, followed by two pairs of convolution layers (stride 1) and average pooling layers (stride 2). Finally, there are fully connected layers with softmax activation in the output layer. Traditionally, this network had about 60,000 parameters in total. Refer to the original paper.
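A minimal Keras sketch of a LeNet-5-style network follows; it is a close variant for illustration, not an exact reproduction of the original paper (which used different connectivity and RBF output units):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, AveragePooling2D, Flatten, Dense

lenet = Sequential([
    Conv2D(6, (5, 5), activation="tanh", input_shape=(32, 32, 1)),  # C1: 6 feature maps
    AveragePooling2D((2, 2), strides=2),                            # S2: subsampling, stride 2
    Conv2D(16, (5, 5), activation="tanh"),                          # C3: 16 feature maps
    AveragePooling2D((2, 2), strides=2),                            # S4
    Flatten(),
    Dense(120, activation="tanh"),                                  # C5
    Dense(84, activation="tanh"),                                   # F6
    Dense(10, activation="softmax"),                                # output: 10 digit classes
])
lenet.summary()  # on the order of 60,000 parameters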


 Recurrent Neural Networks:

A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other, but in cases such as predicting the next word of a sentence, the previous words are required, and hence there is a need to remember them. Thus RNNs came into existence, solving this issue with the help of a hidden layer. The main and most important feature of an RNN is the hidden state, which remembers some information about the sequence.

An RNN has a "memory" which remembers all the information about what has been calculated. It uses the same parameters for each input, as it performs the same task on all the inputs or hidden layers to produce the output. This reduces the number of parameters, unlike other neural networks.
How an RNN works

The working of an RNN can be understood with the help of the below example:
Example:
Suppose there is a deeper network with one input layer, three hidden layers and one output layer. Then, like other neural networks, each hidden layer will have its own set of weights and biases: say (w1, b1) for hidden layer 1, (w2, b2) for the second hidden layer and (w3, b3) for the third hidden layer. This means that each of these layers is independent of the others, i.e. they do not memorize the previous outputs.

Now the RNN will do the following:


An RNN converts the independent activations into dependent activations by providing the same weights and biases to all the layers, thus reducing the number of parameters and memorizing each previous output by giving it as input to the next hidden layer.
Hence these three layers can be joined together into a single recurrent layer, such that the weights and biases of all the hidden layers are the same.


 Formula for calculating the current state:

$$h_t = f(h_{t-1},\ x_t)$$

where:
ht -> current state
ht-1 -> previous state
xt -> input state

 Formula for applying the activation function (tanh):

$$h_t = \tanh(W_{hh}\, h_{t-1} + W_{xh}\, x_t)$$

where:
whh -> weight at recurrent neuron
wxh -> weight at input neuron

 Formula for calculating the output:

$$y_t = W_{hy}\, h_t$$

where:
yt -> output
why -> weight at output layer
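The three formulas translate directly into a forward pass over a sequence; a toy NumPy sketch (all dimensions and names are invented for illustration):

import numpy as np

T, input_dim, hidden_dim, output_dim = 5, 4, 8, 3
rng = np.random.default_rng(0)
Wxh = rng.normal(size=(hidden_dim, input_dim))   # weight at input neuron
Whh = rng.normal(size=(hidden_dim, hidden_dim))  # weight at recurrent neuron
Why = rng.normal(size=(output_dim, hidden_dim))  # weight at output layer

xs = rng.normal(size=(T, input_dim))  # a sequence of T input vectors
h = np.zeros(hidden_dim)              # initial hidden state
for t in range(T):
    h = np.tanh(Whh @ h + Wxh @ xs[t])  # ht = tanh(Whh·ht-1 + Wxh·xt)
    y = Why @ h                         # yt = Why·ht
# Note that the same Wxh, Whh and Why are reused at every time step.
print(y)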

Training through RNN

1. A single time step of the input is provided to the network.
2. We then calculate the current state using the current input and the previous state.
3. The current ht becomes ht-1 for the next time step.
4. One can go as many time steps as the problem demands and join the information from all the previous states.
5. Once all the time steps are completed, the final current state is used to calculate the output.
6. The output is then compared to the actual (target) output, and the error is generated.
7. The error is back-propagated through the network to update the weights, and hence the network (RNN) is trained.
Advantages of Recurrent Neural Networks
An RNN remembers information through time, which is what makes it useful in time-series prediction: it can take previous inputs into account. (The long-memory variant of this idea is called Long Short-Term Memory.)
Recurrent neural networks are even used with convolutional layers to extend the effective pixel neighborhood.
Disadvantages of Recurrent Neural Networks
Gradient vanishing and exploding problems.
Training an RNN is a very difficult task.
It cannot process very long sequences when using tanh or relu as the activation function.

 RNN Topologies:


Sample RNN topologies: (a) Elman and (b) Jordan.

Difficulty in Training RNNs:

There are two widely known issues with properly training recurrent neural networks: the vanishing and the exploding gradient problems, detailed in Bengio et al. (1994). Pascanu et al. explore these problems from an analytical, a geometric and a dynamical-systems perspective, and use this analysis to justify a simple yet effective solution: a gradient-norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradient problem, both validated empirically in their experiments.
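As a rough sketch of the gradient-norm clipping idea (our own helper function; in Keras, the clipnorm argument of an optimizer provides a similar knob):

import numpy as np

def clip_by_global_norm(grads, threshold):
    # If the overall gradient norm exceeds the threshold, rescale all
    # gradients so that the global norm equals the threshold.
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads

grads = [np.random.randn(8, 8) * 100, np.random.randn(8) * 100]  # "exploding" gradients
clipped = clip_by_global_norm(grads, threshold=5.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))             # ~5.0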


 Long Short-Term Memory:

Long short-term memory (LSTM) is an artificial neural network used in the fields of artificial intelligence and deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections. Such a recurrent neural network can process not only single data points (such as images) but also entire sequences of data (such as speech or video). For example, LSTM is applicable to tasks such as unsegmented, connected handwriting recognition, speech recognition, machine translation, robot control, video games, and healthcare. LSTM has become the most cited neural network of the 20th century.
A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell.
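For reference, the standard update equations of such a unit are given below (σ is the logistic sigmoid and ⊙ denotes element-wise multiplication; this is the common formulation, stated here for completeness):

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \qquad \text{(forget gate)}$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \qquad \text{(input gate)}$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \qquad \text{(output gate)}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \qquad \text{(cell state)}$$
$$h_t = o_t \odot \tanh(c_t) \qquad \text{(hidden state)}$$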
LSTM networks are well-suited to classifying, processing and making predictions
based on time series data, since there can be lags of unknown duration between important events
in a time series. LSTMs were developed to deal with the vanishing gradient problem that can be
encountered when training traditional RNNs. Relative insensitivity to gap length is an advantage
of LSTM over RNNs, hidden Markov models and other sequence learning methods in numerous
applications.

Applications of LSTM include:
 Robot control
 Time series prediction
 Speech recognition
 Rhythm learning
 Music composition
 Grammar learning
 Handwriting recognition
 Human action recognition
 Sign language translation
 Protein homology detection
 Predicting subcellular localization of proteins
 Time series anomaly detection
 Several prediction tasks in the area of business process management
 Prediction in medical care pathways
 Semantic parsing
 Object co-segmentation
 Airport passenger management
 Short-term traffic forecast
 Drug design
 Market Prediction
 Bidirectional LSTMs:
A bidirectional long short-term memory (bi-LSTM) makes a neural network have the sequence information in both directions: backwards (future to past) and forwards (past to future).
In a bidirectional network, the input flows in two directions, which makes a bi-LSTM different from the regular LSTM, where the input flows in one direction only. The bi-directional flow lets the network preserve both the future and the past information. For a better explanation, let's take an example.
In the sentence "boys go to ….." we cannot fill in the blank space. But when we also have a future sentence, "boys come out of school", we can easily predict the missing word. That is exactly what we want our model to do, and a bidirectional LSTM allows the neural network to perform this.

Image for bi-LSTM

In the diagram, we can see the flow of information through the backward and forward layers. Bi-LSTMs are usually employed where sequence-to-sequence tasks are needed; this kind of network can be used in text classification, speech recognition and forecasting models. Below, we sketch a bi-directional LSTM model using Python.
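A minimal Keras sketch of such a model for text classification (the vocabulary size and layer widths are illustrative assumptions, not tuned values):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

model = Sequential([
    Embedding(input_dim=10000, output_dim=64),  # toy vocabulary of 10,000 words
    Bidirectional(LSTM(64)),                    # forward and backward passes over the sequence
    Dense(1, activation="sigmoid"),             # e.g. binary text classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()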

 Bidirectional RNNs:

Bidirectional recurrent neural networks (BRNNs) connect two hidden layers of opposite directions to the same output. With this form of generative deep learning, the output layer can get information from past (backward) and future (forward) states simultaneously. Invented in 1997 by Schuster and Paliwal, BRNNs were introduced to increase the amount of input information available to the network. For example, multilayer perceptrons (MLPs) and time delay neural

networks (TDNNs) have limitations on input-data flexibility, as they require their input data to be fixed. Standard recurrent neural networks (RNNs) also have restrictions, as the future input information cannot be reached from the current state. On the contrary, BRNNs do not require their input data to be fixed, and their future input information is reachable from the current state.

BRNNs are especially useful when the context of the input is needed. For example, in handwriting recognition, performance can be enhanced by knowledge of the letters located before and after the current letter.

 Applications of BRNN include:


 Speech Recognition (Combined with Long short-term memory)
 Translation
 Handwritten Recognition
 Protein Structure Prediction
 Part-of-speech tagging
 Dependency Parsing
 Entity Extraction


Application case study – Handwritten digit recognition using deep learning

The ability of computers to recognize human handwritten digits is referred to as handwritten digit recognition. Handwritten digits are not perfect and can be written in many shapes, which makes it a tedious task for machines to recognize. So here we will take an image of a digit and recognize the digit present in that image.
About the project we are going to create:
In this project, we will use a Convolutional Neural Network to create a model which will predict the digit present in an image, building our handwritten digit recognition project on the MNIST dataset.
Project Prerequisites:
The libraries that should be installed on your computer are:
 Tensorflow

 Numpy
 Matplotlib
 Keras
 Opencv
If you don't have these libraries installed, install them using pip (for example: pip install numpy, pip install matplotlib, pip install tensorflow, etc.).
About the dataset:
We will be using the MNIST dataset, which is very popular among machine learning and deep learning enthusiasts. It contains 60,000 training images of handwritten digits from zero to nine and 10,000 images for testing, so we have 10 different classes. The images are represented as 28 x 28 matrices where each cell contains a grayscale pixel value.
We don't have to download the dataset, as it is already available in the TensorFlow datasets; we just have to write tf.keras.datasets.mnist.

Let's start building our deep learning project, Handwritten Digit Recognition.
1) Import required libraries and load the dataset:
Let's go step by step. We will import the libraries whenever we require them, so first we only import tensorflow so that we can load our dataset; as mentioned above, the MNIST dataset is already present in tensorflow, so we can easily import it and start working on it.
import tensorflow as tf
mnist = tf.keras.datasets.mnist
2) Splitting of data:
Now we split the data into training and testing sets, with their corresponding labels, using the mnist.load_data() method. By using x_train.shape we get the shape of our training data, which is (60000, 28, 28).
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train.shape
3) Visualisation of data:
Let's visualize our data using the matplotlib library: first we import matplotlib, and then we can see the first image of our training data using plt.imshow().
import matplotlib.pyplot as plt
plt.imshow(x_train[0])

plt.show()
plt.imshow(x_train[0] , cmap = plt.cm.binary)

4) Normalize the data:
We cannot feed our images directly into the model, so we have to perform some operations to make the data ready for our neural network. First we normalize the data with the help of the tf.keras.utils.normalize() method.
x_train = tf.keras.utils.normalize(x_train, axis=1)
x_test = tf.keras.utils.normalize(x_test, axis=1)
plt.imshow(x_train[0], cmap=plt.cm.binary)

5) Let's check whether our data is normalized or not:

print(x_train[0])


6) Reshape the data:

We have to preprocess the data so that we can use it to create the model. As we have seen, the shape of our data is (60000, 28, 28); our model will require one more dimension (the channel), so we reshape the data to (60000, 28, 28, 1).
import numpy as np
img_size = 28
x_trainer = np.array(x_train).reshape(-1, img_size, img_size, 1)
x_tester = np.array(x_test).reshape(-1, img_size, img_size, 1)
print('Training shape', x_trainer.shape)
print('Testing shape', x_tester.shape)
7) Creating the model:
Now let's create our Convolutional Neural Network (CNN) model, importing the required layers. CNN models generally consist of convolutional layers and pooling layers, and as we know, CNNs work very well for image classification problems. The dropout layers deactivate some of the neurons during training, which reduces overfitting of the model.
You can see below how we create our model using the various layers.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, Conv2D, MaxPooling2D
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=x_trainer.shape[1:]))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
8) Model summary of what we have created above:
model.summary()

9) Compile the model:

# compile the model created for the handwritten digit recognition project
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
10) Model training:
Train the model: the model.fit() function starts the training. It takes the training data, the validation split, the number of epochs, and the batch size.
# fit x_trainer, y_train to the model to see the accuracy of the model:
model.fit(x_trainer, y_train, epochs=10, validation_split=0.3, batch_size=128, verbose=1)


Our model accuracy is more than 99% on the training data and more than 98% on the validation data.
11) Now, let's evaluate our model on the test data:
test_loss, test_acc = model.evaluate(x_tester, y_test)
print('Test loss on 10,000 test samples', test_loss)
print('Test accuracy on 10,000 test samples', test_acc)

12) Predictions:
Our model is ready, and now we can predict the digits present in an image. Let's compare the predictions made by our model with the actual numbers in the images.
predictions = model.predict(x_tester)
print(np.argmax(predictions[54]))
plt.imshow(x_test[54])

13) Save the model:
Our model is working accurately, so now we can save it and use it anywhere to predict handwritten digits.
model.save("digit_recogniser_model.h5")
14) Now let's predict the digit in a custom image to see how the model is working:
import cv2
img = cv2.imread('3.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
resize = cv2.resize(gray, (28, 28), interpolation=cv2.INTER_AREA)
new_img = tf.keras.utils.normalize(resize, axis=1)
new_img = np.array(new_img).reshape(-1, img_size, img_size, 1)
predictions = model.predict(new_img)
print(np.argmax(predictions))
Summary
We have successfully built our handwritten digit recognition project. We built and trained a Convolutional Neural Network model, which is very effective for image classification, and we correctly predicted the digit in a custom image as well.
