AIML Unit-5
2 – MARKS
1. Differentiate computer & human brain [N/D-23]
2. Define neuron & neural networks and the categories of neural network structures.
3. Show the perceptron that calculates parity of its 3 inputs. [N/D-23]
4. Define Multi-Layer Perceptron with advantages & architectural diagram [A/M-23]
5. Define Activation function & its types [A/M-23]
6. Define Stochastic Gradient Descent (SGD). With pros & cons [A/M-24]
7. Why Rectified linear unit (ReLU) is better than softmax? Give equation [A/M-24]
8. Define Normalization & Batch Normalization.
9. Define Grid Search CV & Randomized Search CV.
10. Define Overfitting.
11. Difference between Shallow and Deep neural network.
12. What is meant by Training set & test set?
13. Difference between Data Mining and Machine learning.
14. Define Forward Pass & Backward Pass.
15. Define Tanh Function & Sigmoid Function
16. What is meant by Feed forward neural network?
17. Define Bias & Dropout
16 – MARKS
1. Explain in detail about single-Layer Perceptron & Multi-Layer Perceptron. With
architectural diagram [A/M-24]
2. Explain in Detail about Activation function.
3. Discuss in detail about how the network is training.
4. Discuss in detail about Gradient descent optimization Algorithm.
5. Explain in detail about Stochastic gradient descent.
6. Explain in detail about error backpropagation with its steps. [A/M-23]
7. Explain in detail about Unit saturation (aka the vanishing gradient problem).
8. Explain in detail about Rectified linear unit (ReLU). Elaborate the process of training
hidden layers. [N/D-23]
DR.NNCE II & III YR / II & IV SEM AIML QB
2. Define neuron & neural networks and the categories of neural network structures.
A neuron is a cell in the brain whose principal function is the collection, processing, and
dissemination of electrical signals.
The brain's information-processing capacity is thought to emerge primarily from networks of such
neurons. For this reason, some of the earliest AI work aimed to create artificial neural networks.
There are two main categories of neural network structures:
o acyclic or feed-forward networks
o cyclic or recurrent networks
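3. Show the perceptron that calculates parity of its 3 inputs. [N/D-23]
Parity (output 1 when an odd number of the 3 inputs are 1) is not linearly separable, so a single
perceptron cannot compute it. A multi-layer perceptron (MLP) is needed:
Structure:
Input layer: 3 neurons (A, B, C)
Hidden layer(s): neurons that build the XOR stages
Output layer: 1 neuron giving the final parity
Logic:
Stage 1: computes A XOR B
Stage 2: computes (A XOR B) XOR C → Final parity
Each XOR stage itself needs hidden neurons, because XOR is not linearly separable either.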
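A quick way to check a parity network is to enumerate all eight input combinations. The sketch below is one possible arrangement and not necessarily the one the question intends: it assumes threshold (step) units, input weights of 1, and three hidden units that fire when at least 1, at least 2, and all 3 inputs are on. The names step and parity3 are illustrative.

```python
from itertools import product

def step(z):
    """Threshold activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def parity3(a, b, c):
    """3-input parity with one hidden layer of three threshold units (assumed layout)."""
    s = a + b + c                    # every input weight is 1
    h1 = step(s - 0.5)               # fires when at least 1 input is on
    h2 = step(s - 1.5)               # fires when at least 2 inputs are on
    h3 = step(s - 2.5)               # fires when all 3 inputs are on
    return step(h1 - h2 + h3 - 0.5)  # output weights +1, -1, +1 and bias -0.5

# Truth table over all 8 input combinations
truth_table = {bits: parity3(*bits) for bits in product([0, 1], repeat=3)}
```

Comparing truth_table with (A + B + C) mod 2 for every row confirms that the output is 1 exactly when an odd number of inputs are 1.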
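4. Define Multi-Layer Perceptron with advantages & architectural diagram [A/M-23]
A Multi-Layer Perceptron (MLP) has one input layer, one output layer with a single node for each
output, and any number of hidden layers, each of which can have any number of nodes.
A schematic diagram of a Multi-Layer Perceptron (MLP).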
Advantages
o It can be used to solve complex nonlinear problems.
o It handles large amounts of input data well.
o Makes quick predictions after training.
o The same accuracy ratio can be achieved even with smaller samples.
Architectural diagram
6. Define Stochastic Gradient Descent (SGD). With pros & cons [A/M-24]
In Stochastic Gradient Descent, a few samples are selected randomly for each iteration instead of
the whole data set.
In Gradient Descent, the term "batch" denotes the total number of samples from the dataset that
are used to calculate the gradient in each iteration.
In typical Gradient Descent optimization, such as Batch Gradient Descent, the batch is taken to be
the whole dataset.
Advantages:
Speed: Each update uses only one or a few samples, so an SGD update is much cheaper than a Batch
Gradient Descent update.
Memory Efficiency: It is memory-efficient and can handle large datasets that cannot fit into
memory.
Avoidance of Local Minima: Due to the noisy updates in SGD, it has the ability to escape
from local minima, which can help it reach a better minimum.
Disadvantages:
Noisy updates: The updates in SGD are noisy and have a high variance, which can make the
optimization process less stable and lead to oscillations around the minimum.
Slow Convergence: SGD may require more iterations to converge to the minimum since it updates
the parameters for each training example one at a time.
7. Why Rectified linear unit (ReLU) is better than softmax? Give equation [A/M-24]
13. Difference between Data Mining and Machine learning.
Data mining: It is more of a research using methods like machine learning.
Machine learning: It is self-learned and trains the system to do the intelligent task.
14. Define Forward Pass & Backward Pass.
Forward propagation (the forward pass) moves from the input layer (left) to the output layer (right)
of the neural network. A neural network can be understood as a collection of connected input/output nodes.
In the backward pass, the flow is reversed: we start with the error at the output layer and propagate
it back through the hidden layer(s) until reaching the input layer.
The process of propagating the network error from the output layer to the input layer is called
backward propagation, or simply backpropagation.
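The two passes can be sketched on a very small network. This is a minimal illustration only; it assumes one input, one sigmoid hidden unit, one linear output, and a squared-error loss. The weight values and the function names forward and backward are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, b2):
    """Forward pass: input -> hidden (sigmoid) -> output (linear)."""
    h = sigmoid(w1 * x + b1)
    y_hat = w2 * h + b2
    return h, y_hat

def backward(x, y, h, y_hat, w2):
    """Backward pass: carry the output error back toward the input side."""
    d_out = y_hat - y                      # dLoss/dy_hat for loss = 0.5 * (y_hat - y)^2
    grad_w2 = d_out * h
    grad_b2 = d_out
    d_hidden = d_out * w2 * h * (1.0 - h)  # chain rule through the sigmoid
    grad_w1 = d_hidden * x
    grad_b1 = d_hidden
    return grad_w1, grad_b1, grad_w2, grad_b2

# Assumed example values
x, y = 0.5, 1.0
w1, b1, w2, b2 = 0.4, 0.1, -0.3, 0.2
h, y_hat = forward(x, w1, b1, w2, b2)
grads = backward(x, y, h, y_hat, w2)

# Sanity check: compare grad_w1 with a numerical (central difference) estimate
def loss_for_w1(w1_value):
    return 0.5 * (forward(x, w1_value, b1, w2, b2)[1] - y) ** 2

eps = 1e-6
numeric_grad_w1 = (loss_for_w1(w1 + eps) - loss_for_w1(w1 - eps)) / (2 * eps)
```

The numerical estimate should agree closely with grads[0], which is a common way to check a hand-written backward pass.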
Tanh Function
Sigmoid Function
16 – MARKS
1. Explain in detail about single-Layer Perceptron & Multi-Layer Perceptron. With
architectural diagram [A/M-24]
1. Single-Layer Perceptron (SLP)
➤ Definition:
A Single-Layer Perceptron is the simplest form of a neural network, consisting of only one layer of
output nodes connected directly to the input layer. It is primarily used for binary classification tasks.
➤ Architecture:
Here’s a simple architectural diagram:
➤ Working Principle:
1. Weighted Sum: Multiply each input with its corresponding weight and add the bias.
2. Activation Function: Apply an activation function to the sum to produce the output.
3. Training: Update weights using a learning rule like Perceptron Learning Rule or Gradient
Descent.
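The three steps above can be sketched with the Perceptron Learning Rule. This is a minimal example under stated assumptions: a step activation, a learning rate of 1.0, and the AND gate as a linearly separable training set. The names predict and train_perceptron are illustrative.

```python
def predict(weights, bias, inputs):
    """Steps 1 and 2: weighted sum plus bias, then a step activation."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total >= 0 else 0

def train_perceptron(samples, lr=1.0, epochs=20):
    """Step 3: Perceptron Learning Rule, w <- w + lr * (target - output) * x."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias

# AND gate: linearly separable, so a single-layer perceptron can learn it
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
predictions = [predict(weights, bias, inputs) for inputs, _ in and_gate]
```

Replacing the AND targets with XOR targets would keep the weights changing without settling, which illustrates the linear separability limitation.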
➤ Limitations:
Can only solve linearly separable problems.
2. Multi-Layer Perceptron (MLP)
➤ Architecture:
Here’s a typical MLP architecture diagram:
Detailed Example:
Input Layer → Hidden Layer 1 → Hidden Layer 2 → Output Layer
[x1] [h1, h2] [h3, h4] [y1, y2]
Each node in a layer is connected to every node in the next layer (fully connected).
Activation functions (e.g., ReLU, Sigmoid, Tanh) are used in hidden layers.
Final layer might use Softmax (for classification) or no activation (for regression).
➤ Working Principle:
1. Forward Propagation:
2. Backpropagation:
➤ Advantages:
Can learn non-linear and complex patterns.
➤ Applications:
Image recognition
Financial predictions
Medical diagnostics
✅ Comparison Table
Feature SLP MLP
2. Sigmoid Function
It is a function whose graph is 'S' shaped (refer to Figure 5.7).
Equation : A = 1/(1 + e^{-x})
Nature : Non-linear. Notice that for x values between -2 and 2, the Y values change very steeply.
This means small changes in x bring about large changes in the value of Y.
Value Range : 0 to 1
Uses : Usually used in the output layer of a binary classifier, where the result is either 0 or 1.
Since the value of the sigmoid function lies between 0 and 1 only, the result can be predicted
to be 1 if the value is greater than 0.5 and 0 otherwise.
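A minimal sketch of the sigmoid equation and the 0.5 decision rule described above. The function names and the sample x values are chosen only for illustration.

```python
import math

def sigmoid(x):
    """A = 1 / (1 + e^(-x)); the output always lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def classify(x, threshold=0.5):
    """Binary decision: 1 if the sigmoid output exceeds the threshold, else 0."""
    return 1 if sigmoid(x) > threshold else 0

# Sigmoid values at a few sample points (rounded for readability)
values = {x: round(sigmoid(x), 3) for x in (-4, -2, 0, 2, 4)}
labels = {x: classify(x) for x in (-2, 2)}
```

Most of the change in the output happens for x between -2 and 2; beyond that range the curve flattens toward 0 or 1.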
3. Tanh Function
The Tanh function, also known as the hyperbolic tangent function, often works better than the
sigmoid function in hidden layers. It is a scaled and shifted version of the sigmoid function.
Both are similar and can be derived from each other (see Figure 5.8).
Equation :- tanh(x) = 2/(1 + e^{-2x}) - 1 = 2 * sigmoid(2x) - 1
Value Range :- -1 to +1
Nature :- non-linear
Uses :- Usually used in the hidden layers of a neural network. Its values lie between -1 and 1,
so the mean of the hidden layer activations comes out to be 0 or very close to it. This helps center
the data by bringing the mean close to 0, which makes learning for the next layer much easier.
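The statement that Tanh and sigmoid can be derived from each other can be checked numerically using the identity tanh(x) = 2 * sigmoid(2x) - 1. The helper name tanh_from_sigmoid and the sample points are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    """Tanh written as a scaled and shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

xs = [-3.0, -1.0, 0.0, 0.5, 2.0]

# Largest difference between the library tanh and the sigmoid-based version
max_gap = max(abs(math.tanh(x) - tanh_from_sigmoid(x)) for x in xs)

# Tanh outputs are centered on zero, unlike sigmoid outputs
tanh_at_zero = tanh_from_sigmoid(0.0)
sigmoid_at_zero = sigmoid(0.0)
```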
4. ReLU Function
It stands for Rectified Linear Unit. It is the most widely used activation function and is chiefly
implemented in the hidden layers of a neural network.
In simple words, a network using ReLU learns much faster than one using the sigmoid or Tanh function.
5. Softmax Function
It is a generalization of the sigmoid function to more than two classes, and it comes in handy when
dealing with multiclass classification problems.
It is used frequently when managing several classes, and is typically present in the output nodes of
image classification networks. The softmax function exponentiates each output and divides it by the
sum of all exponentiated outputs, which squeezes the output for each category between 0 and 1 and
makes the outputs sum to 1.
The softmax function is best applied at the output unit of the classifier, where we are actually
attempting to obtain the probabilities that determine the class of each input.
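A minimal softmax sketch, assuming three made-up class scores. Subtracting the maximum score before exponentiating is a common numerical safeguard and does not change the result.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)                        # guards against overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed raw outputs of a 3-class classifier
probs = softmax([2.0, 1.0, 0.1])

# The predicted class is the one with the highest probability
predicted_class = probs.index(max(probs))
```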
Example:
1. Decide on the number of output classes (meaning the number of image classes, for example
two for cat vs dog).
2. Draw as many computation units as the number of output classes (congrats, you just created the
Output Layer of the ANN).
3. Add as many Hidden Layers as needed within the defined architecture.
4. Stack those Hidden Layers to the Output Layer using Neural Connections.
5. It is important to understand that the Input Layer is basically a layer of data ingestion.
6. Add an Input Layer that is adapted to ingest your data.
7. Assemble many Artificial Neurons together in a way where the output (axon) of a
Neuron on a given Layer is one of the inputs of another Neuron on a subsequent
Layer. As a consequence, the Input Layer is linked to the Hidden Layers, which are
then linked to the Output Layer using Neural Connections (also shown in Figure 5.12).
SGD Algorithm
In SGD, we find out the gradient of the cost function of a single example at each iteration instead of
the sum of the gradient of the cost function of all the examples.
In SGD, since only one sample from the dataset is chosen at random for each iteration, the path
taken by the algorithm to reach the minima is usually noisier than your typical Gradient Descent
algorithm. But that doesn’t matter all that much because the path taken by the algorithm does not
matter, as long as we reach the minima and with a significantly shorter training time (see in Figure
Because SGD is generally noisier than typical Gradient Descent, it usually takes a higher number of
iterations to reach the minimum, owing to the randomness in its descent.
Even though it requires a higher number of iterations to reach the minimum than typical Gradient
Descent, it is still computationally much less expensive, since each iteration processes only a single
sample. Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for optimizing a
learning algorithm.
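The single-sample update can be sketched on a small regression problem. This is an illustration under assumptions: a linear model y = w*x + b, a squared-error cost per example, noise-free data generated from y = 2x + 1, and a fixed learning rate. The function name sgd_linear_fit and all numeric settings are made up for the example.

```python
import random

def sgd_linear_fit(data, lr=0.05, epochs=200, seed=0):
    """Fit y = w*x + b, updating the parameters from one example at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)              # visit the examples in random order
        for x, y in samples:
            error = (w * x + b) - y       # gradient of 0.5 * error^2 w.r.t. the prediction
            w -= lr * error * x           # update uses this single example only
            b -= lr * error
    return w, b

# Assumed toy data: 21 points on the line y = 2x + 1
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = sgd_linear_fit(data)
```

Batch Gradient Descent would instead sum the gradient over all 21 points before making one update, which is the difference described above.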
Advantages:
Speed: Each update uses only one or a few samples, so an SGD update is much cheaper than a Batch
Gradient Descent update.
Memory Efficiency: It is memory-efficient and can handle large datasets that cannot fit into
memory.
Avoidance of Local Minima: Due to the noisy updates in SGD, it has the ability to escape
from local minima, which can help it reach a better minimum.
Disadvantages:
Noisy updates: The updates in SGD are noisy and have a high variance, which can make the
optimization process less stable and lead to oscillations around the minimum.
Slow Convergence: SGD may require more iterations to converge to the minimum since it
updates the parameters for each training example one at a time.
Sensitivity to Learning Rate: The choice of learning rate can be critical in SGD since using a high
learning rate can cause the algorithm to overshoot the minimum, while a low learning rate can make
the algorithm converge slowly.
Less Accurate: Due to the noisy updates, SGD may not converge to the exact global minimum and
can result in a suboptimal solution. This can be mitigated by using techniques such as learning rate
scheduling and momentum-based updates.
computation. It computes the gradient, but it does not define how the gradient is used. It generalizes
the computation in the delta rule (see Figure 5.16).
Figure 5.16 Backpropagation neural network
Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a static input for static
output. It is useful to solve static classification issues like optical character recognition.
Recurrent Backpropagation:
In recurrent backpropagation, activations are fed forward until a fixed value is achieved.
After that, the error is computed and propagated backward.
Advantages:
It does not have any parameters to tune except for the number of inputs.
It is highly adaptable and efficient and does not require any prior knowledge about the network.
It is a standard process that usually works well.
Disadvantages:
The performance of backpropagation relies very heavily on the training data.
Backpropagation needs a very large amount of time for training.
Backpropagation requires a matrix-based method instead of mini-batch.
7. Explain in detail about Unit saturation (aka the vanishing gradient problem).
The vanishing gradient problem is an issue that sometimes arises when training machine learning
algorithms through gradient descent. This most often occurs in neural networks that have several
neuronal layers such as in a deep learning system, but also occurs in recurrent neural networks.
The key point is that the partial derivatives used to compute the gradient become smaller as one goes
deeper into the network. Since the gradients control how much the network learns during training, if the
gradients are very small or zero, then little to no training can take place, leading to poor predictive
performance.
The problem:
As more layers using certain activation functions are added to neural networks, the gradients of the
loss function approach zero, making the network hard to train.
Why:
Certain activation functions, like the sigmoid function, squish a large input space into a small output
space between 0 and 1. Therefore, a large change in the input of the sigmoid function will cause only a
small change in the output. Hence, the derivative becomes small.
The sigmoid function and its derivative
As an example, Figure 5.17 below shows the sigmoid function and its derivative. Note how, when the
inputs of the sigmoid function become larger or smaller (when |x| becomes bigger), the derivative
becomes close to zero.
Figure 5.17 The sigmoid function and its derivative
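The shrinking effect can be seen with a few numbers. The sketch below uses the standard sigmoid derivative s(x) * (1 - s(x)), whose largest value is 0.25 at x = 0, and treats the product of n such factors as a rough upper bound on how much gradient survives n sigmoid layers (weights are ignored here, which is a simplifying assumption).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

peak = sigmoid_derivative(0.0)    # the largest value the derivative can take
tail = sigmoid_derivative(10.0)   # nearly zero once |x| is large

# Rough upper bound on the product of sigmoid derivatives across n layers
bound_by_depth = {n: peak ** n for n in (1, 5, 10, 20)}
```

Even in the best case the bound falls below one millionth by ten layers, which is why early layers of a deep sigmoid network receive almost no update.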
Solution:
The simplest solution is to use other activation functions, such as ReLU, which does not cause a
small derivative for positive inputs. Residual networks are another solution, as they provide residual
connections straight to earlier layers.
The residual connection directly adds the value at the beginning of the block, x, to the end of the
block (F(x) + x). This residual connection doesn't go through the activation functions that
"squash" the derivatives, resulting in a higher overall derivative of the block (see Figure 5.18).
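Finally, batch normalization layers can also mitigate the issue. As stated before, the problem
arises when a large input space is mapped to a small one, causing the derivatives to disappear.
Batch normalization reduces this by normalizing the inputs so that |x| stays away from the flat
outer regions of the sigmoid function.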
8. Explain in detail about Rectified linear unit (ReLU). Elaborate the process of
training hidden layers. [N/D-23]
In simple words, a network using ReLU learns much faster than one using the sigmoid or Tanh function.
An activation function for hidden units that has become popular recently with deep networks is the
rectified linear unit (ReLU), which is defined as
ReLU(a) = max(0, a)
Though ReLU is not differentiable at a = 0, we use it anyway; we use the left derivative:
ReLU'(a) = 1 if a > 0, and 0 otherwise
Leaky ReLU
In the leaky ReLU, the output is also linear on the negative side but with a smaller slope,
just enough to make sure that there will be updates for negative activations, albeit small:
leaky-ReLU(a) = a if a > 0, and αa otherwise
where α is a small positive slope (for example 0.01).
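ReLU, leaky ReLU, and their derivatives can be written in a few lines. This is a minimal sketch; the slope alpha = 0.01 is an assumed example value and the function names are illustrative. The derivative at a = 0 is taken as the left derivative, as stated above.

```python
def relu(a):
    """ReLU(a) = max(0, a)."""
    return max(0.0, a)

def relu_derivative(a):
    """1 for a > 0, otherwise 0 (left derivative used at a = 0)."""
    return 1.0 if a > 0 else 0.0

def leaky_relu(a, alpha=0.01):
    """Linear on the negative side too, but with a small slope alpha."""
    return a if a > 0 else alpha * a

def leaky_relu_derivative(a, alpha=0.01):
    return 1.0 if a > 0 else alpha

inputs = [-2.0, 0.0, 3.0]
relu_outputs = [relu(a) for a in inputs]
leaky_outputs = [leaky_relu(a) for a in inputs]
relu_grads = [relu_derivative(a) for a in inputs]
leaky_grads = [leaky_relu_derivative(a) for a in inputs]
```

For the negative input, relu_grads is 0 (no update) while leaky_grads is alpha, so a small update still reaches the weights.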
Advantage:
It does not saturate (unlike sigmoid and tanh), so updates can still be done for large positive a.
For some inputs, some hidden unit activations will be zero, meaning that we will have a sparse
representation.
Sparse representations lead to faster training.
Disadvantage:
The derivative is zero for a ≤ 0, so there is no further training if, for a hidden unit, the weighted
sum somehow becomes negative. This implies that one should be careful in initializing the weights so
that the initial activation for all hidden units is positive.
1. GridSearchCV
2. RandomizedSearchCV
GridSearchCV
In the GridSearchCV approach, the machine learning model is evaluated for a range of hyperparameter
values. The approach is called grid search because it searches for the best set of
hyperparameters from a grid of hyperparameter values.
For example, suppose we want to set two hyperparameters, C and Alpha, of the Logistic Regression
Classifier model, each with a different set of values. The grid search technique will construct many
versions of the model with all possible combinations of hyperparameters and will return the best one.
As in the image, take C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4]. For the
combination C=0.3 and Alpha=0.2, the performance score comes out to be 0.726 (the highest),
therefore it is selected (see Figure 5.20).
# Necessary imports
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
Output:
Tuned Logistic Regression Parameters: {'C': 3.7275937203149381}
Best score is 0.7708333333333334
Drawback:
GridSearchCV goes through all the combinations of hyperparameters in the grid, which makes grid
search computationally very expensive.
RandomizedSearchCV
RandomizedSearchCV addresses this drawback of GridSearchCV, as it goes through only a fixed
number of hyperparameter settings. It moves within the grid in a random fashion to find the best
set of hyperparameters. This approach reduces unnecessary computation, though it is not guaranteed
to find the best combination in the grid.
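The difference between the two strategies can be sketched without any library. The example below is only an illustration of the idea, not scikit-learn's implementation: toy_score is a hypothetical stand-in for a cross-validated score (built so that it peaks at C = 0.3, Alpha = 0.2 with value 0.726, matching the example grid), and n_iter = 8 is an assumed budget.

```python
import itertools
import random

C_values = [0.1, 0.2, 0.3, 0.4, 0.5]
alpha_values = [0.1, 0.2, 0.3, 0.4]

def toy_score(C, alpha):
    """Hypothetical score function standing in for cross-validation."""
    return 0.726 - (C - 0.3) ** 2 - (alpha - 0.2) ** 2

def grid_search(score_fn):
    """Evaluate every combination in the grid and keep the best one."""
    candidates = list(itertools.product(C_values, alpha_values))
    best = max(candidates, key=lambda params: score_fn(*params))
    return best, len(candidates)

def randomized_search(score_fn, n_iter=8, seed=0):
    """Evaluate only n_iter randomly chosen combinations from the grid."""
    rng = random.Random(seed)
    all_candidates = list(itertools.product(C_values, alpha_values))
    candidates = rng.sample(all_candidates, n_iter)
    best = max(candidates, key=lambda params: score_fn(*params))
    return best, len(candidates)

grid_best, grid_evals = grid_search(toy_score)
random_best, random_evals = randomized_search(toy_score)
```

Grid search evaluates all 20 combinations, while the randomized version evaluates 8, so its best score can be equal to or lower than the grid search result.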
Regularization:
Regularization is one of the most important concepts of machine learning. It is a technique to
prevent the model from overfitting by adding extra information to it. Regularization helps
choose a simple model rather than a complex one.
Generalization error is "a measure of how accurately an algorithm can predict outcome values
for previously unseen data." Regularization refers to the modifications that can be made to a
learning algorithm that help to reduce this generalization error and not the training error.
The identity of the object does not change when it is translated, rotated, or scaled. Note that this may
not always be true, or may be true up to a point: ‘b’ and ‘q’ are rotated versions of each other. These
are hints that can be incorporated into the learning process to make learning easier.
In image recognition, there are invariance hints: The identity of an object does not change when it is
rotated, translated, or scaled (see Figure 5.21). Hints are auxiliary information that can be used to
guide the learning process and are especially useful when the training set is limited.
There are different ways in which hints can be used:
Hints can be used to create virtual examples.
The hint may be incorporated into the network structure.
2. Weight decay:
Incentivize the network to use smaller weights by adding a penalty to the loss function.
Even if we start with a weight close to zero, because of some noisy instances, it may move away
from zero; the idea in weight decay is to add some small constant background force that always
pulls a weight toward zero, unless it is absolutely necessary that it be large (in magnitude) to
3. Ridge Regression
The Ridge regression technique is used to analyze models where the variables may exhibit
multicollinearity.
It shrinks the coefficients of the insignificant independent variables, though it does not remove
them completely.
This type of regularization uses the L2 norm for regularization.
4. Lasso Regression
Least Absolute Shrinkage and Selection Operator (or LASSO) Regression penalizes the
coefficients to the extent that some of them become zero. It thereby eliminates the insignificant
independent variables. This regularization technique uses the L1 norm for regularization.
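The penalties behind weight decay, Ridge, and Lasso can be written down directly. This is a minimal sketch under assumptions: the penalty strength is called lam, the L2 penalty is taken as lam times the sum of squared weights, and the example weights and learning rate are made up.

```python
def l2_penalty(weights, lam):
    """Ridge / weight decay term: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

def l1_penalty(weights, lam):
    """Lasso term: lam * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def weight_decay_step(weights, grads, lr, lam):
    """One gradient step on loss + lam * sum(w^2); the extra 2*lam*w pulls w toward zero."""
    return [w - lr * (g + 2.0 * lam * w) for w, g in zip(weights, grads)]

weights = [0.5, -2.0, 0.0]
ridge_term = l2_penalty(weights, lam=0.1)
lasso_term = l1_penalty(weights, lam=0.1)

# With a zero data gradient, the update only shrinks each weight toward zero
decayed = weight_decay_step(weights, grads=[0.0, 0.0, 0.0], lr=0.1, lam=0.1)
```

The L2 pull is proportional to the weight, so it shrinks weights without making them exactly zero, whereas the L1 penalty applies a constant pull that can drive small weights to zero.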
5. Dropout
"Dropout" in machine learning refers to the process of randomly ignoring certain nodes in a
layer during training.
In Figure 5.22, the neural network on the left represents a typical neural network where all
units are activated. On the right, the red units have been dropped out of the model: the values of
their weights and biases are not considered during training.
Figure 5.23 In dropout, the output of a random subset of the units are set to zero, and
In each batch or minibatch, for each unit independently we decide randomly to keep it or not.
Let us say that p = 0.25. So, on average, we remove a quarter of the units and we do
backpropagation as usual on the remaining network for that batch or minibatch. We need to
make up for the loss of units, though: In each layer, we divide the activation of the remaining
units by 1 − p to make sure that they provide a vector of similar magnitude to the next layer.
There is no dropout during testing.
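The procedure above can be sketched as follows. This is a minimal illustration, assuming p is the probability of dropping a unit (0.25 as in the text) and that the layer's activations are held in a plain list; the function name dropout and the layer size are made up.

```python
import random

def dropout(activations, p=0.25, training=True, rng=random):
    """Drop each unit independently with probability p; divide survivors by 1 - p."""
    if not training:
        return list(activations)          # no dropout during testing
    scale = 1.0 / (1.0 - p)
    return [a * scale if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10000                    # assumed activations of one hidden layer

train_out = dropout(hidden, p=0.25, training=True, rng=rng)
test_out = dropout(hidden, training=False)

# About a quarter of the units are removed, yet the mean activation stays near 1
dropped_fraction = train_out.count(0.0) / len(train_out)
mean_activation = sum(train_out) / len(train_out)
```

Dividing by 1 - p during training is what keeps the vector passed to the next layer at a similar magnitude, so nothing needs to change at test time.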
In each batch or minibatch, a smaller network (with smaller variance) is trained. Thus
dropout is effectively sampling from a pool of possible networks of different depths and
widths.
There is a version called dropconnect that decides independently for each connection whether to
drop it, which allows a larger set of possible networks to sample from, and this may be preferable
in smaller networks.