QB1 DL
DEEP LEARNING
UNIT 1 DEEP NETWORKS BASICS
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
6. Differentiate supervised, unsupervised and reinforcement learning. CO1 U 2
Ans
Supervised learning | Unsupervised learning | Reinforcement learning
a. It requires the target variable to be well defined, with sufficient values given | Typically the target variable is either unknown or available for only a few cases | This is learning what to do and how to map situations to actions; the learner is not told what action to take
b. Deals with classification and regression problems | Deals with clustering and association rule mining | Deals with exploitation or exploration, Markov decision processes, policy learning, etc.
c. Input data in supervised learning is labelled data | Uses unlabelled data | Data is not predefined
d. Learns by using labelled data | Trained using unlabelled data | Works by interacting with the environment
e. Maps labelled inputs to known outputs | Understands patterns and discovers output | Follows trial and error
7. What is Generalization and Training error? CO1 U 2
Ans The ability to perform well on previously unobserved inputs is called generalization.
When training a machine learning model we have access to a training set, and we can compute some error measure on that training set, called the training error.
8. What are Hyperparameters? CO1 R 2
Ans Hyperparameters are the settings used to configure a machine learning model and its training process; they are chosen before training rather than learned from the data (for example, the learning rate or the number of hidden layers).
9. What is Hyperparameter tuning? CO1 R 2
Ans A rigorous search over hyperparameter values to build an optimized model is known as hyperparameter tuning.
10. What is activation function? CO1 R 2
Ans The purpose of the activation function is to introduce non-linearity into the output of a neuron. In deep learning, a neuron computes a weighted sum of its inputs and adds a bias; the activation function applied to this value decides whether (and how strongly) the neuron should be activated.
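A minimal sketch of this in plain Python (the input, weight and bias values below are made-up for illustration), showing a weighted sum plus bias passed through two common activation functions:

```python
import math

def weighted_sum(inputs, weights, bias):
    # z = sum(w_i * x_i) + b
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def sigmoid(z):
    # squashes z into (0, 1), introducing non-linearity
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # passes positive values through, zeroes out negative ones
    return max(0.0, z)

# hypothetical example values
x = [0.5, -1.2, 3.0]
w = [0.4, 0.1, -0.6]
b = 0.2
z = weighted_sum(x, w, b)
print(sigmoid(z), relu(z))
```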
11. What are tasks in supervised learning? CO1 U 2
Ans Tasks:
Classification – When labels are categorical
Regression – When labels are real-valued
Structured prediction - When labels are complicated
12. What are tasks in unsupervised learning? CO1 U 2
Ans Tasks:
Clustering – Finding groups from data
Anomaly detection – Finding unusual instances
Representation learning – How dense are the data in different parts of
instance space
Topic discovery – Find a way to describe each instance as covering one or several
“topics”
13. What is the use of a softmax unit? CO1 R 2
Ans Softmax functions are often used as the output of a classifier, to represent the probability distribution over n different classes.
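A small sketch of a softmax output layer in plain Python (the class scores are made-up values); subtracting the maximum logit before exponentiating is a common trick for numerical stability:

```python
import math

def softmax(logits):
    # shift by the max for numerical stability (does not change the result)
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]   # hypothetical raw class scores from the last layer
probs = softmax(scores)
print(probs, sum(probs))    # the probabilities sum to 1
```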
14. Give the central challenges of Machine learning and when they occur. CO1 U 2
Ans Underfitting
Occurs when the model is not able to obtain sufficiently low error values on the
training set.
Overfitting
Occurs when the gap between training error and testing error is too large
15. Draw the balanced graph for the following overfitted graph. CO1 AP 2
Ans
16. What is Regularization? CO1 U 2
Ans It involves adding a regularization term to the loss function, which penalizes large
weights or complex model architectures. Regularization methods such as L1 and L2
regularization, dropout, and batch normalization help control model complexity and
improve its ability to generalize to unseen data.
17. What is L1 regularization and L2 regularization? CO1 U 2
Ans L1 regularization, also known as Lasso regularization, adds the sum of the absolute values of the weights to the loss function. It encourages sparsity by driving some weights exactly to zero, which acts as a form of feature selection.
L2 regularization, also known as Ridge regularization, adds the sum of the squared weights to the loss function; it shrinks large weights towards zero (without making them exactly zero) and discourages overly complex models.
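A minimal sketch in plain Python of how the L1 and L2 penalty terms are added to a base loss; the weight values, base loss and regularization strength are made-up for illustration:

```python
def l1_penalty(weights):
    # Lasso: sum of the absolute values of the weights
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    # Ridge: sum of the squared weights
    return sum(w * w for w in weights)

weights = [0.8, -0.05, 1.3, 0.0]   # hypothetical model weights
base_loss = 0.42                   # hypothetical data loss (e.g. MSE or cross-entropy)
lam = 0.01                         # regularization strength (a hyperparameter)

loss_with_l1 = base_loss + lam * l1_penalty(weights)
loss_with_l2 = base_loss + lam * l2_penalty(weights)
print(loss_with_l1, loss_with_l2)
```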
19. Give the properties of a convex function. CO1 U 2
Ans 1. A convex function is a function whose graph is shaped like a cup (U).
2. A twice-differentiable function of a single variable is convex if and only if its second derivative is non-negative. Example: the quadratic function f(x) = x² (a short check follows this list).
3. A strictly convex function has exactly one local minimum point, which is also the global minimum point.
4. The sum of two convex functions is also a convex function.
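A short worked check of property 2 for the quadratic example mentioned above:

```latex
f(x) = x^{2}, \qquad f'(x) = 2x, \qquad f''(x) = 2 \ge 0 \ \text{for all } x
```

so f is convex, and setting f'(x) = 0 gives its single (global) minimum at x = 0.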
20. Suppose you have inputs as x, y, and z with values -2, 5, and -4 CO1 AP 2
respectively. You have a neuron ‘q’ and neuron ‘f’ with functions:
q=x+y
f=q*z
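The printed question stops after the two function definitions. Assuming it asks for the forward values and, as in the usual form of this exercise, the gradients of f with respect to x, y and z, a worked computation:

```latex
q = x + y = -2 + 5 = 3, \qquad f = q \cdot z = 3 \cdot (-4) = -12
\frac{\partial f}{\partial q} = z = -4, \qquad \frac{\partial f}{\partial z} = q = 3
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x} = -4 \cdot 1 = -4, \qquad
\frac{\partial f}{\partial y} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial y} = -4 \cdot 1 = -4
```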
PART-B
S.No QUESTION AND ANSWERS CO RBT MARKS
1. Compare and Contrast Single Layer and Multilayer Perceptron. CO1 U 16
Ans Single Layer Perceptron:
A single-layer perceptron computes its output by applying a threshold (step) function to the weighted sum of its inputs.
How does a Perceptron work?
Step-1
In the first step, multiply all input values with their corresponding weight values and then add them up to determine the weighted sum. Mathematically, the weighted sum is calculated as follows:
∑ wi*xi = w1*x1 + w2*x2 + … + wn*xn
Add a special term called the bias 'b' to this weighted sum to improve the model's performance:
∑ wi*xi + b
Step-2
In the second step, an activation function is applied to the above weighted sum, which gives an output either in binary form or as a continuous value, as follows:
Y = f(∑ wi*xi + b)
This is one of the simplest types of artificial neural network (ANN).
The single-layer perceptron model consists of a feed-forward network and also includes a threshold transfer function inside the model.
The main objective of the single-layer perceptron model is to analyze linearly separable objects with binary outcomes.
The single-layer perceptron algorithm does not rely on previously recorded data, so it begins with randomly allocated weight parameters. It then sums up all the weighted inputs; if the total sum is more than a pre-determined threshold value, the model gets activated and shows the output value as +1.
If the outcome matches the desired or threshold value, the model's performance is considered satisfactory and the weights are left unchanged. However, discrepancies can be triggered when multiple weighted input values are fed into the model.
Hence, to obtain the desired output and minimize errors, some changes to the input weights may be necessary.
"Single-layer perceptrons can learn only linearly separable patterns."
Multilayer Perceptrons
Multilayer Perceptrons are feedforward artificial neural networks that generate outputs from a set of inputs. In a Multilayer Perceptron, multiple layers of nodes are connected as a directed graph between the input and output layers. The Multilayer Perceptron is a deep learning method that uses backpropagation to train the network.
Though Perceptrons are widely recognized as algorithms, they were originally designed for image recognition; the name comes from performing the human-like function of perceiving, seeing, and identifying images.
Multilayer Perceptrons are essentially feed-forward neural networks with three types of layers: input, output, and hidden. The input layer receives the input signal for processing, while the output layer performs tasks such as classification and prediction. The real computational engine of the Multilayer Perceptron is an arbitrary number of hidden layers placed between the input and output layers. As in any feed-forward network, data flows from the input layer to the output layer.
The neurons in a Multilayer Perceptron are trained using the backpropagation learning algorithm. Multilayer Perceptrons are designed to approximate any continuous function and can solve problems that are not linearly separable.
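A minimal sketch of a multilayer perceptron forward pass with a single hidden layer, in plain Python (the layer sizes, weights and inputs are made-up illustrative values; training by backpropagation is not shown):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    # each row of `weights` holds the incoming weights of one neuron in the layer
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# hypothetical 2-input -> 3-hidden -> 1-output network
x = [0.5, -1.0]
W_hidden = [[0.2, -0.4], [0.7, 0.1], [-0.3, 0.8]]
b_hidden = [0.0, 0.1, -0.2]
W_out = [[0.5, -0.6, 0.9]]
b_out = [0.05]

h = layer(x, W_hidden, b_hidden)   # hidden-layer activations
y = layer(h, W_out, b_out)         # output-layer activation
print(y)
```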
Stochastic Gradient Descent (SGD)
Initialization: Randomly initialize the parameters of the model.
Set Parameters: Determine the number of iterations and the learning rate
(alpha) for updating the parameters.
Stochastic Gradient Descent Loop: Repeat the following steps until the model
converges or reaches the maximum number of iterations:
a. Shuffle the training dataset to introduce randomness.
b. Iterate over each training example (or a small batch) in the shuffled order.
c. Compute the gradient of the cost function with respect to the model parameters
using the current training example (or batch).
d. Update the model parameters by taking a step in the direction of the negative
gradient, scaled by the learning rate.
e. Evaluate the convergence criteria, such as the change in the cost function between successive iterations, or the magnitude of the gradient.
Return Optimized Parameters: Once the convergence criteria are met or the
maximum number of iterations is reached, return the optimized model
parameters.
Random selection of training examples introduces randomness into the optimization process, hence the term “stochastic” in Stochastic Gradient Descent.
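A minimal sketch of this loop for simple linear regression, in plain Python (the toy data, learning rate and number of epochs are made-up illustrative values):

```python
import random

# toy data roughly following y = 2x + 1
data = [(x, 2 * x + 1 + random.uniform(-0.1, 0.1)) for x in range(10)]

w, b = 0.0, 0.0   # initialized parameters
lr = 0.01         # learning rate (alpha)
epochs = 100      # maximum number of passes over the data

for epoch in range(epochs):
    random.shuffle(data)            # (a) shuffle to introduce randomness
    for x, y in data:               # (b) iterate over single training examples
        y_hat = w * x + b
        # (c) gradient of the squared error 0.5*(y_hat - y)^2 w.r.t. w and b
        grad_w = (y_hat - y) * x
        grad_b = (y_hat - y)
        # (d) step in the direction of the negative gradient
        w -= lr * grad_w
        b -= lr * grad_b

print(w, b)   # should end up close to 2 and 1
```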
Ans Bias refers to the assumptions a model makes so that the target function is easier to learn; in practice it shows up as the error rate on the training data. When this error rate is high we call it high bias, and when it is low we call it low bias.
Variance is the difference between the error rate on the training data and on the testing data. If the difference is high it is called high variance, and when it is low it is called low variance. Usually, we want low variance so that the model generalizes well.
Mathematically, let the variable we are trying to predict be Y and the other covariates (independent variables) be X. We assume there is a relationship between the two such that
Y = f(X) + e
where e is the error term, normally distributed with mean 0. We build a model f̂(X) of f(X) using linear regression or any other modelling technique.
The expected squared error at a point x can then be decomposed as
Err(x) = (E[f̂(x)] − f(x))² + E[(f̂(x) − E[f̂(x)])²] + σe²
       = Bias² + Variance + Irreducible error
◦ In the diagram below, the center of the target is a model that perfectly predicts the correct values.
◦ As we move away from the bulls-eye, our predictions get worse and worse. We can repeat the process of model building to get separate hits on the target.
◦ In supervised learning, underfitting happens when a model is unable to capture the underlying pattern of the data. Such models usually have high bias and low variance; examples include linear and logistic regression.
– Decision trees break the input space into as many regions as there are leaves and use a separate parameter in each region.
Manifold Learning
A manifold is a connected region. Mathematically, it is a set of points associated with a neighborhood around each point, so that locally it appears to be a Euclidean space.
E.g., we experience the world as a 2-D plane, while it is in fact a spherical manifold in 3-D space.
Although the idea of probability concentrating near a manifold was introduced for continuous data and unsupervised learning, it can be generalized to discrete data and to the supervised learning setting.
Although a manifold is mathematically defined, in machine learning the term is used more loosely: a connected set of points that can be approximated well by considering only a small number of degrees of freedom, embedded in a higher-dimensional space.
5. (a) Solve to find Eigen Values and Eigen Vectors for: CO1 AP 16
y1 = −5x1 + 2x2
y2 = −9x1 + 6x2
(b) Find local minima for the function y = (x + 5)² starting from x = 3. Do at least 3 iterations assuming learning rate = 0.01.
Ans In matrix form, equations (1) and (2) can be written as
[y1; y2] = [−5 2; −9 6] [x1; x2]
i.e. in the form y = Av, with A = [−5 2; −9 6].
The eigenvalues are obtained from |A − λI| = 0:
|(−5 − λ) 2; −9 (6 − λ)| = (−5 − λ)(6 − λ) + 18 = λ² − λ − 12 = 0
EIGEN VALUES
λ1 = −3; λ2 = 4
Case 1: λ1 = −3
A − λ1I = [−5 + 3 2; −9 6 + 3] = [−2 2; −9 9]
Corresponding equations:
−2x1 + 2x2 = 0
−9x1 + 9x2 = 0
Adding, −11x1 + 11x2 = 0, i.e. x1 − x2 = 0, so x1 = x2 and an eigenvector for λ1 = −3 is [1, 1]ᵀ.
Case 2: λ2 = 4
A − λ2I = [−9 2; −9 2], giving −9x1 + 2x2 = 0, so x2 = (9/2)·x1 and an eigenvector for λ2 = 4 is [2, 9]ᵀ.
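A quick numerical cross-check of part (a), assuming NumPy is available:

```python
import numpy as np

A = np.array([[-5.0, 2.0],
              [-9.0, 6.0]])
values, vectors = np.linalg.eig(A)
print(values)    # eigenvalues -3 and 4 (order may vary)
print(vectors)   # columns are scaled eigenvectors, e.g. proportional to [1, 1] for lambda = -3
```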
STEP 2:
Initialize parameters: x0 = 3, L.R = 0.01
dy/dx = 2(x + 5)
Iteration 1: x1 = x0 − L.R·(dy/dx) = 3 − {0.01 · 2(3 + 5)} = 2.84
Iteration 2: x2 = x1 − L.R·(dy/dx) = 2.84 − {0.01 · 2(2.84 + 5)} ≈ 2.68
Iteration 3: x3 = x2 − L.R·(dy/dx) = 2.68 − {0.01 · 2(2.68 + 5)} ≈ 2.53
After three iterations, x has moved from 3 towards the true minimum of y = (x + 5)² at x = −5.
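The same iteration expressed as a short loop in plain Python; running it for more steps shows x approaching the minimum at -5:

```python
x = 3.0    # starting point
lr = 0.01  # learning rate

for i in range(1, 4):           # at least 3 iterations, as asked
    grad = 2 * (x + 5)          # dy/dx for y = (x + 5)^2
    x = x - lr * grad
    print(f"Iteration {i}: x = {x:.4f}")
# prints x = 2.8400, then approximately 2.6832 and 2.5295
```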
Linear regression models try to optimize the coefficients (β0 and the other β values) to minimize the cost function. For linear regression this loss function is called RSS, the Residual Sum of Squares:
RSS = Σi ( yi − (β0 + Σj βj·xij) )²
We add the regularization term to this loss and optimize the parameters so that the model can predict accurate values of Y.
Regularization types:
14
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
DEEP LEARNING
1. L2 Regularization / Ridge Regression
2. L1 Regularization / Lasso Regression
3. Dropout
4. Dropconnect
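A minimal sketch of the dropout idea from the list above, in plain Python (the keep probability and activations are made-up illustrative values; in practice frameworks provide dropout as a built-in layer):

```python
import random

def dropout(activations, keep_prob=0.8):
    # training-time (inverted) dropout: randomly zero units, rescale the survivors
    out = []
    for a in activations:
        if random.random() < keep_prob:
            out.append(a / keep_prob)   # rescale so the expected activation is unchanged
        else:
            out.append(0.0)             # this unit is dropped for this forward pass
    return out

h = [0.7, 1.2, -0.3, 0.9]   # hypothetical hidden-layer activations
print(dropout(h))            # at test time dropout is simply turned off
```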
Of all the optimization problems involved in deep learning, the most difficult is neural network training. It is quite common to invest days to months of time on hundreds of machines in order to solve even a single instance of the neural network training problem. The main challenges are:
1. Ill-Conditioning
2. Local Minima
3. Plateaus, Saddle Points and Other Flat Regions
4. Cliffs and Exploding Gradients
5. Long-Term Dependencies
Ill-Conditioning
The most prominent is ill-conditioning of the Hessian matrix H. The ill-conditioning
problem is generally believed to be present in neural network training problems. Ill-
conditioning can manifest by causing SGD to get “stuck” in the sense that even very
small steps increase the cost function.
Local Minima
Some convex functions have a flat region at the bottom rather than a single global
minimum point, but any point within such a flat region is an acceptable solution.
Neural networks and any models with multiple equivalently parametrized latent
variables all have multiple local minima because of the model identifiability problem. A
model is said to be identifiable if a sufficiently large training set can rule out all but
one setting of the model’s parameters.
Plateaus, Saddle Points and Other Flat Regions
At a saddle point, the Hessian matrix has both positive and negative eigenvalues. Points lying along eigenvectors associated with positive eigenvalues have greater cost than the saddle point, while points lying along eigenvectors associated with negative eigenvalues have lower cost. The gradient can often become very small near a saddle point; on the other hand, gradient descent empirically seems to be able to escape saddle points in many cases.
Cliffs and Exploding Gradients
Neural networks with many layers often have extremely steep regions resembling cliffs. These result from the multiplication of several large weights together. On the
face of an extremely steep cliff structure, the gradient update step can move the
parameters extremely far, usually jumping off of the cliff structure altogether.
Long-Term Dependencies
Another difficulty that neural network optimization algorithms must overcome
arises when the computational graph becomes extremely deep. Feedforward networks
with many layers have such deep computational graphs.
Vanishing gradients make it difficult to know which direction the parameters should move to improve the cost function, while exploding gradients can make learning unstable. The cliff structures described earlier, which motivate gradient clipping, are an example of the exploding gradient phenomenon.
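A minimal sketch of gradient clipping by norm, the remedy mentioned above for cliffs and exploding gradients, in plain Python (the gradient values and threshold are made-up illustrative values):

```python
import math

def clip_by_norm(grads, max_norm=5.0):
    # if the overall gradient norm exceeds max_norm, rescale it down to max_norm
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads

g = [30.0, -40.0, 12.0]   # hypothetical exploding gradient
print(clip_by_norm(g))     # same direction, but the norm is capped at 5
```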
8. Deep Networks CO1 R 16
(1) Methods and Variations
(2) Reasons
(3) Applications
(4) Differentiate AI, ML and DL
Ans
(3) Applications
(4) AI vs. ML vs. DL (table fragment): examples of AI include Artificial Super Intelligence (ASI), while deep learning architectures include Convolutional Neural Networks, Recurrent Neural Networks and Recursive Neural Networks.