ASHOKA WOMEN’S ENGINEERING COLLEGE (AUTONOMOUS)
UNIT- 4
MULTILAYER PERCEPTRON (or) ANN (Artificial Neural Network) (or) Feed Forward:
       The Perceptron consists of an input layer and an output layer which are fully connected.
       A fully connected multi-layered neural network is known as a Multi-Layer Perceptron (MLP).
       A multi-layered neural network consists of multiple layers of artificial neurons or nodes.
       MLPs have the same input and output layers but may have multiple hidden layers in between, as seen below.
Sigmoid: takes real-valued input and squashes it to the range between 0 and 1.
     When we plot the output from sigmoid units given various weighted sums as input, it looks remarkably
     like a step function:
tanh: takes real-valued input and squashes it to the range [-1, 1].
ReLU: ReLU stands for Rectified Linear Unit. It takes real-valued input and thresholds it at 0 (replaces
negative values with 0). The three activation functions are sketched in code below.
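A minimal NumPy sketch of these three activation functions (the Python function names are illustrative, not part of the notes):

```python
import numpy as np

def sigmoid(x):
    """Squashes real-valued input to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Squashes real-valued input to the range [-1, 1]."""
    return np.tanh(x)

def relu(x):
    """Thresholds at 0: negative values are replaced with 0."""
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```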
    Example Multi-layer ANN with Sigmoid Units:
     We will concern ourselves here with ANNs containing only one hidden layer, as this makes
    describing the back propagation routine easier.
     Note that networks where you can feed in the input on the left and propagate it forward to get an
    output are called feed forward networks.
     Below is such an ANN, with two sigmoid units in the hidden layer. The weights have been set
    arbitrarily between all the units.
     Note that the sigma units have been identified with sigma signs in the node on the graph. As we did
    with perceptrons, we can give this network an input and determine the output. We can also look to see
    which units "fired", i.e., had a value closer to 1 than to 0.
     Suppose we input the values 10, 30, 20 into the three input units, from top to bottom. Then the
    weighted sum coming into H1 will be:
    SH1 = (0.2 * 10) + (-0.1 * 30) + (0.4 * 20) = 2 - 3 + 8 = 7.
     Then the σ function is applied to SH1 to give:
    σ(SH1) = 1/(1 + e^-7) = 1/(1 + 0.000912) = 0.999
     [Don't forget to negate S]. Similarly, the weighted sum coming into H2 will be:
    SH2 = (0.7 * 10) + (-1.2 * 30) + (1.2 * 20) = 7 - 36 + 24 = -5
V. ARUNA KUMARI-Asst. Professor-Dept of MCA
                                                       Page 2
                     ASHOKA WOMEN’S ENGINEERING COLLEGE (AUTONOMOUS)
     and σ applied to SH2 gives:
    σ(SH2) = 1/(1 + e^5) = 1/(1 + 148.4) = 0.0067
     From this, we can see that H1 has fired, but H2 has not. We can now calculate that the weighted
    sum going in to output unit O1 will be:
    SO1 = (1.1 * 0.999) + (0.1 * 0.0067) = 1.0996
     and the weighted sum going in to output unit O2 will be:
    SO2 = (3.1 * 0.999) + (1.17 * 0.0067) = 3.1047
     The output sigmoid unit in O1 will now calculate the output values from the network for O1:
    σ(SO1) = 1/(1 + e^-1.0996) = 1/(1 + 0.333) = 0.750
     and the output from the network for O2:
    σ(SO2) = 1/(1 + e^-3.1047) = 1/(1 + 0.045) = 0.957
     Therefore, if this network represented the learned rules for a categorisation problem, the input triple
    (10, 30, 20) would be categorised into the category associated with O2, because this has the larger
    output.
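The forward pass above can be reproduced with a short NumPy sketch; the weight matrices below are simply the example's arbitrarily chosen weights collected into arrays:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Weights from the worked example (input -> hidden, hidden -> output).
W_hidden = np.array([[0.2, -0.1, 0.4],    # weights into H1
                     [0.7, -1.2, 1.2]])   # weights into H2
W_output = np.array([[1.1, 0.1],          # weights into O1
                     [3.1, 1.17]])        # weights into O2

x = np.array([10, 30, 20])

h = sigmoid(W_hidden @ x)    # hidden activations: approx [0.999, 0.0067]
o = sigmoid(W_output @ h)    # output activations: approx [0.750, 0.957]

print(h, o)                  # the input is categorised as O2 (larger output)
```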
BACK PROPAGATION:
     With backpropagation, the weights of the model are adjusted while training: the error measured at the output layer is propagated backwards through the network, and each weight is changed in proportion to how much it contributed to that error.
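As a rough illustration (not the exact derivation from the notes), the following sketch shows one backpropagation update for a one-hidden-layer sigmoid network, assuming a squared-error loss and a learning rate eta:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def backprop_step(x, target, W_hidden, W_output, eta=0.1):
    """One gradient-descent update for a one-hidden-layer sigmoid network
    trained with squared error (a common textbook choice, assumed here)."""
    # Forward pass
    h = sigmoid(W_hidden @ x)            # hidden activations
    o = sigmoid(W_output @ h)            # output activations

    # Backward pass: the delta terms use the sigmoid derivative o*(1-o)
    delta_o = (o - target) * o * (1 - o)             # error signal at the outputs
    delta_h = (W_output.T @ delta_o) * h * (1 - h)   # error signal at the hidden units

    # Weight updates (gradient descent)
    W_output -= eta * np.outer(delta_o, h)
    W_hidden -= eta * np.outer(delta_h, x)
    return W_hidden, W_output
```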
  LOSS FUNCTIONS:
   Loss functions can be classified into two major categories depending upon the type of learning task
  we are dealing with: regression losses and classification losses.
  Loss functions for Classification:
  1. Binary Cross Entropy Loss:
  It is used when the model outputs a probability between 0 and 1 for a binary classification task. Cross-entropy
  calculates the average difference between the predicted probabilities and the actual labels.
  Mathematical formulation:-
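  For N examples with actual labels yi (0 or 1) and predicted probabilities pi, the standard form is:
  BCE = -(1/N) * Σ [ yi * log(pi) + (1 - yi) * log(1 - pi) ]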
  2. Hinge Loss:
      This type of loss is used when the target variable has 1 or -1 as class labels. It penalizes the model
  when there is a difference in the sign between the actual and predicted class values.
      Hinge loss is used for maximum-margin classification.
  Mathematical formulation:-
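  For an actual class label y (-1 or +1) and a predicted value ŷ, the standard form is:
  Hinge = max(0, 1 - y * ŷ)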
EPOCHS AND BATCH SIZES:
         An epoch means training the neural network with all the training data for one cycle.
         In an epoch, we use all of the data exactly once. A forward pass and a backward pass together are
  counted as one pass.
      An epoch is made up of one or more batches, where we use a part of the data set to train the neural network.
   Passing through a single batch of training examples is called an iteration.
      An epoch is sometimes confused with an iteration. To clarify the concepts, let's consider a simple example where we
   have 1000 data points as presented in the figure below:
      If the batch size is 1000, we can complete an epoch with a single iteration. Similarly, if the batch size is
   500, an epoch takes two iterations. So, if the batch size is 100, an epoch takes 10 iterations to complete. Simply,
  for each epoch, the required number of iterations times the batch size gives the number of data points.
      We can use multiple epochs in training. In this case, the neural network is fed the same data more than
  once.
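A small sketch of the arithmetic above (the helper name iterations_per_epoch is illustrative, not from the notes):

```python
import math

def iterations_per_epoch(num_examples, batch_size):
    """Number of batches (iterations) needed to see every example once."""
    return math.ceil(num_examples / batch_size)

# The example above: 1000 data points
print(iterations_per_epoch(1000, 1000))  # 1 iteration per epoch
print(iterations_per_epoch(1000, 500))   # 2 iterations per epoch
print(iterations_per_epoch(1000, 100))   # 10 iterations per epoch
```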
  RECURRENT NEURAL NETWORK (RNN):
    Types of Recurrent Neural Networks:
    There are four types of Recurrent Neural Networks:
    1. One to One
    2. One to Many
    3. Many to One
    4. Many to Many
  Applications of Recurrent Neural Networks:
      Image Captioning: RNNs are used to caption an image by analyzing the activities present.
   Time Series Prediction: Any time series problem, like predicting the prices of stocks in a
  particular month, can be solved using an RNN.
      Natural Language Processing: Text mining and Sentiment analysis can be carried out using an
  RNN for Natural Language Processing (NLP).
      Machine Translation: Given an input in one language, RNNs can be used to translate the input
  into different languages as output.
  Advantages of Recurrent Neural Network:
   1.   An RNN remembers information through time.
   2.   An RNN can be used with convolutional layers to extend the effective pixel neighborhood.
  Disadvantages of Recurrent Neural Network:
  1.   Gradient vanishing and exploding problems.
  2.   Training an RNN is a very difficult task.
   3.   It cannot process very long sequences when using tanh or ReLU as the activation function (see the recurrence sketched below).
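As a rough sketch (the weight names W_xh, W_hh and the bias b_h are placeholders, not from the notes), a vanilla RNN computes its hidden state recursively, which is where both its memory and its vanishing/exploding-gradient problems come from:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a vanilla RNN: the new hidden state depends on the
    current input and the previous hidden state (the network's memory)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def rnn_forward(xs, h0, W_xh, W_hh, b_h):
    """Process a whole sequence; the same weights are reused at every time
    step, and repeated multiplication by W_hh is what makes gradients
    vanish or explode on long sequences."""
    h = h0
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
    return h
```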
  LONG SHORT-TERM MEMORY (LSTM):
      Long Short-Term Memory (LSTM) networks are an extension of RNNs that extend the memory, which
   makes it easier to retain past data.
      LSTMs are used as the building blocks for the layers of an RNN.
      LSTMs assign the data "weights", which help the RNN to either let new information in, forget
   information, or give it enough importance to impact the output.
      The units of an LSTM are used as building units for the layers of a RNN, often called an LSTM
  network.
      LSTMs enable RNNs to remember inputs over a long period of time. This is because LSTMs
  contain information in a memory, much like the memory of a computer. The LSTM can read, write and
  delete information from its memory.
      In an LSTM you have three gates: input, forget and output gate. These gates determine whether or
  not to let new input in (input gate), delete the information because it isn’t important (forget gate), or let
  it impact the output at the current time step (output gate).
  Architecture of LSTM network:
      An LSTM network has a chain-like structure, but the repeating module is different from that of a plain
   RNN. Instead of having a single neural network layer, it has small parts connected to each other which
   handle the storing and removal of memory.
   1. Input gate- It discovers which values from the input should be used to modify the memory. A sigmoid
   function decides which values to let through (0 or 1), and a tanh function gives weightage to the values
   which are passed, deciding their level of importance in the range -1 to 1.
   2. Forget gate- It discovers which details should be discarded from the block. A sigmoid function decides this: it
   looks at the previous state (ht-1) and the current input (Xt) and outputs a number between 0 (omit this)
   and 1 (keep this) for each number in the cell state Ct-1.
   3. Output gate- The input and the memory of the block are used to decide the output. A sigmoid function decides
   which values to let through (0 or 1), and a tanh function gives weightage to the values which are passed,
   deciding their level of importance in the range -1 to 1; this is multiplied with the output of the sigmoid.
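A minimal sketch of one LSTM step, assuming the standard gate equations (the parameter names W, U and b are placeholders, not from the notes):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b are dicts of parameters for each gate:
    'i' input, 'f' forget, 'o' output, 'g' candidate memory."""
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate (0 to 1)
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate (0 to 1)
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate (0 to 1)
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # candidate values (-1 to 1)

    c_t = f * c_prev + i * g      # forget old memory, write new memory
    h_t = o * np.tanh(c_t)        # expose part of the memory as the output
    return h_t, c_t
```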
  CONVOLUTIONAL NEURAL NETWORK (CNN):
      A Convolutional Neural Network is a special kind of multi-layer neural network.
      Convolutional Neural Networks are one of the main categories of neural networks for image classification and image
   recognition. Scene labeling, object detection, face recognition, etc., are some
   of the areas where convolutional neural networks are widely used.
      A CNN takes an image as input, which is processed and classified under a certain category such as dog,
   cat, lion, tiger, etc. The computer sees an image as an array of pixels whose size depends on the resolution of
   the image. Based on the image resolution, it sees the image as h * w * d, where h = height, w = width and d =
   depth (number of channels).
      A fully-connected network architecture does not take the spatial structure of the image into account.
      In a CNN, each input image passes through a sequence of convolution layers with filters (also known as
   kernels), along with pooling and fully connected layers. After that, we apply the softmax function
   to classify the object with probabilistic values between 0 and 1.
  Why Convolutions:
     Parameter sharing: a feature detector (such as a vertical edge detector) that’s useful in one part of
  the image is probably useful in another part of the image.
   Sparsity of connections: In each layer, each output value depends only on a small number of inputs.
  Convolution Layer:
   The convolution layer is the first layer used to extract features from an input image. By learning image features
   using small squares of input data, the convolution layer preserves the relationship between pixels. It is
   a mathematical operation which takes two inputs, an image matrix and a kernel (or filter).
   o   The dimension of the image matrix is h × w × d.
   o   The dimension of the filter is fh × fw × d.
   o   The dimension of the output is (h - fh + 1) × (w - fw + 1) × 1.
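A minimal sketch of a "valid" convolution (no padding, stride 1) on a single-channel image, showing that the output dimension is (h - fh + 1) × (w - fw + 1). Strictly speaking this computes cross-correlation, which is what deep-learning libraries call convolution:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution (no padding, stride 1) of a single-channel
    image with a single filter; output shape is (h - fh + 1, w - fw + 1)."""
    h, w = image.shape
    fh, fw = kernel.shape
    out = np.zeros((h - fh + 1, w - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + fh, j:j + fw] * kernel)
    return out

image = np.arange(36).reshape(6, 6).astype(float)   # a toy 6x6 "image"
kernel = np.array([[1., 0., -1.],                   # a 3x3 vertical-edge filter
                   [1., 0., -1.],
                   [1., 0., -1.]])
print(conv2d_valid(image, kernel).shape)            # (4, 4) = (6-3+1, 6-3+1)
```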
  Stride:
   Stride means how many cells the filter is moved across the input to calculate the next cell in the result.
   When the stride equals 1, we move the filter 1 pixel at a time; similarly, if the stride equals 2, we move
   the filter 2 pixels at a time. The following figure shows how the
   convolution works with a stride of 2.
  Padding:
  1. It allows us to use a CONV layer without necessarily shrinking the height and width of the
  volumes. This is important for building deeper networks, since otherwise the height/width would shrink
  as we go to deeper layers.
   2. It helps us keep more of the information at the border of an image. Without padding, very few
   values at the next layer would be affected by the pixels at the edges of an image.
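Putting stride and padding together, the output height/width of a convolution can be computed with the standard formula floor((n + 2p - f)/s) + 1; a small sketch:

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Output height/width for input size n, filter size f,
    padding p and stride s: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * padding - f) // stride + 1

print(conv_output_size(6, 3))                       # 4: no padding, stride 1
print(conv_output_size(6, 3, padding=1, stride=1))  # 6: padding keeps the size
print(conv_output_size(7, 3, stride=2))             # 3: stride 2 roughly halves the size
```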
   Pooling Layer:
     Pooling layer is used to reduce the size of the representations and to speed up calculations.
      In conventional CNNs, the feature map from the convolutional layer is subsampled in a pooling
   layer before being passed on to the next convolutional layer.
     The pooling layer works to replace a small patch in the feature map with its summary statistic.
      For example, the popular max-pooling layer reduces the input patch to a single value, the
   maximum of all values within that patch. Other pooling strategies involve taking the average or a
   weighted average of the patch as a subsampling technique.
      Average Pooling: Down-scaling is performed through average pooling by dividing the input into
   rectangular pooling regions and computing the average value of each region.
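A minimal sketch of max pooling over 2 × 2 patches with stride 2 (the helper name max_pool2d is illustrative; frameworks provide this as a built-in layer):

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Replace each size x size patch of the feature map with its maximum."""
    h, w = feature_map.shape
    out_h, out_w = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[i * stride:i * stride + size,
                                j * stride:j * stride + size]
            out[i, j] = patch.max()
    return out

fm = np.array([[1., 3., 2., 1.],
               [4., 6., 5., 0.],
               [7., 2., 9., 8.],
               [1., 0., 3., 4.]])
print(max_pool2d(fm))   # [[6., 5.], [7., 9.]]
```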
  Advantages:
      Good at detecting patterns and features in images, videos, and audio signals, and robust to translation.
     Very High accuracy in image recognition problems.
     Automatically detects the important features without any human supervision.
  Disadvantages:
     Computationally expensive to train and require a lot of memory.
     Requires large amounts of labeled data.