QB1 DL

The document covers fundamental concepts in deep learning, including definitions of tensors, reinforcement learning, and the differences between supervised and unsupervised learning. It also discusses key techniques such as hyperparameter tuning, regularization, and activation functions, along with the properties of convex functions. Additionally, it compares single-layer and multilayer perceptrons, highlighting their structures, functionalities, and applications in machine learning.


DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

DEEP LEARNING
UNIT 1 DEEP NETWORKS BASICS

S.No QUESTION AND ANSWERS CO RBT MARKS
1. What are Tensors? Give tensor rank for the following: CO1 R 2
1. Vector
2. Matrix
Ans An array of numbers arranged on a regular grid with a variable number of axes is
known as a tensor.
Tensor ranks:
1. Vector -1
2. Matrix -2
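A minimal sketch of the rank idea, using NumPy (assumed here only for illustration); the rank of a tensor corresponds to its number of axes:

```python
import numpy as np

scalar = np.array(5.0)                         # rank 0: no axes
vector = np.array([1.0, 2.0, 3.0])             # rank 1: one axis
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])    # rank 2: two axes

print(vector.ndim)  # 1
print(matrix.ndim)  # 2
```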
2. What do Deep learning frameworks leverage and for what CO1 U 2
purpose?
Ans Deep learning frameworks leverage parallelism and hardware acceleration (e.g.,
GPUs) to perform computations on large tensors efficiently.

3. Define Broadcasting CO1 R 2


Ans In the deep learning context, a vector can be added to a matrix, i.e., in C = A + b,
where C_(i,j) = A_(i,j) + b_j.
The vector b is added to each row of the matrix A.
This implicit copying of b to many locations is called Broadcasting.
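A small NumPy sketch of the C = A + b example above (NumPy is an assumption made for illustration):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])       # matrix, shape (2, 3)
b = np.array([10, 20, 30])      # vector, shape (3,)

C = A + b                       # b is implicitly copied to every row of A
print(C)
# [[11 22 33]
#  [14 25 36]]
```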
4. Define Reinforcement learning with its learning elements. CO1 R 2
Ans Reinforcement learning deals with agents that must sense and act upon their
environment.
Learning elements:
1. Policy – Defines the learning agent's behaviour for a given time period.
2. Reward function – Defines the goal; maps each perceived state of the environment to a single number.
3. Value function – The total amount of reward an agent can expect to accumulate over the future.
4. Model of the environment – Used for planning.
5. Differentiate Supervised and Unsupervised learning. CO1 R 2
Ans
Supervised learning | Unsupervised learning
a. Uses training data to infer a model | No training data is used
b. Prediction from labelled data | Target output is not presented to the network
c. Desired output is given | Desired output is not given
d. Classification is supervised learning | Clustering is unsupervised learning
6. Differentiate Supervised, Unsupervised learning and CO1 R 2
Reinforcement Learning

Ans
a. Supervised: requires the target variable to be well defined with sufficient values given. | Unsupervised: typically the target variable is unknown or available for only a few cases. | Reinforcement: learning what to do and how to map situations to actions; the learner is uninformed on what action to take.
b. Supervised: deals with classification and regression problems. | Unsupervised: deals with clustering and associative rule mining. | Reinforcement: deals with exploitation vs. exploration, Markov decision processes, policy learning, etc.
c. Supervised: input data is labelled. | Unsupervised: uses unlabelled data. | Reinforcement: data is not predefined.
d. Supervised: learns by using labelled data. | Unsupervised: trained using unlabelled data. | Reinforcement: works by interacting with the environment.
e. Supervised: maps labelled inputs to known outputs. | Unsupervised: understands patterns and discovers output. | Reinforcement: follows the trial-and-error method.
7. What is Generalization and Training error? CO1 U 2
Ans The ability to perform well on previously unobserved inputs is called generalization.
When training a machine learning model, we have access to a training set and can compute
an error measure on it, called the training error.
8. What are Hyperparameters? CO1 R 2
Ans Hyperparameters are parameters whose values are set before training to control the
behaviour of the learning algorithm; they are not learned from the data itself.
9. What is Hyperparameter tuning? CO1 R 2
Ans Rigorous search for hyperparameters to build an optimized model is known as
Hyperparameter tuning.
10. What is activation function? CO1 R 2
Ans The purpose of the activation function is to introduce non-linearity into the
output of a neuron. The activation function in Deep Learning calculates a
weighted total and adds bias to decide whether a neuron should be activated
or not.
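A minimal sketch of this idea with hypothetical inputs, weights and bias (the sigmoid is chosen here only as an example of a non-linear activation):

```python
import numpy as np

def sigmoid(z):
    """Non-linear activation squashing the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 0.3])   # inputs (hypothetical values)
w = np.array([0.4, 0.7, -0.2])   # weights
b = 0.1                          # bias

z = np.dot(w, x) + b             # weighted total plus bias
y = sigmoid(z)                   # activation decides the neuron's output
print(z, y)
```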
11. What are tasks in supervised learning? CO1 U 2
Ans Tasks:
 Classification – When labels are categorical
 Regression – When labels are real-valued
Structured prediction - When labels are complicated
12. What are tasks in unsupervised learning? CO1 U 2
Ans Tasks:
 Clustering – Finding groups from data
 Anomaly detection – Finding unusual instances
 Density estimation – How dense are the data in different parts of
instance space
 Topic discovery – Find a way to describe each instance as covering one or several
“topics”
13. What is use of softmax unit? CO1 R 2

2
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

DEEP LEARNING
Ans Softmax functions are often used as the output of a classifier, to represent the probability
distribution over n different classes.
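A small sketch of a softmax over hypothetical classifier scores (the max-subtraction is a common numerical-stability detail, not stated in the answer):

```python
import numpy as np

def softmax(logits):
    """Convert raw classifier scores into a probability distribution over n classes."""
    shifted = logits - np.max(logits)   # subtract max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])      # hypothetical outputs for 3 classes
probs = softmax(scores)
print(probs, probs.sum())               # probabilities sum to 1
```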
14. Give central challenges of Machine learning and when will CO1 U 2
they occur?
Ans Underfitting
Occurs when the model is not able to obtain sufficiently low error values on the
training set.
Overfitting
Occurs when the gap between training error and testing error is too large
15. Draw the balanced graph for the following overfitted CO1 AP 2
graph?

Ans

16. What is regularization? CO1 R 2


Ans Regularization is a technique used in machine learning and deep learning to prevent
overfitting and improve the generalization performance of a model. It involves adding
a penalty term to the loss function during training.

It involves adding a regularization term to the loss function, which penalizes large
weights or complex model architectures. Regularization methods such as L1 and L2
regularization, dropout, and batch normalization help control model complexity and
improve its ability to generalize to unseen data.
17. What is L1 regularization and L2 regularization? CO1 U 2
Ans L1 regularization, also known as Lasso regularization, is a
method in deep learning that adds the sum of absolute values of
the weights to the loss function. It encourages sparsity by driving
some weights to zero, resulting in feature selection.

L2 regularization, also called Ridge regularization, adds the


sum of squared weights to the loss function, promoting smaller
but non-zero weights and preventing extreme values.
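A minimal sketch of the two penalty terms, using hypothetical weights and a placeholder data loss (values are assumptions for illustration only):

```python
import numpy as np

weights = np.array([0.8, -0.3, 0.0, 1.5])   # hypothetical model weights
lam = 0.01                                   # regularization strength (lambda)

l1_penalty = lam * np.sum(np.abs(weights))   # Lasso: sum of absolute weights
l2_penalty = lam * np.sum(weights ** 2)      # Ridge: sum of squared weights

data_loss = 0.42                             # placeholder value for the unregularized loss
loss_l1 = data_loss + l1_penalty             # regularized loss = data loss + penalty
loss_l2 = data_loss + l2_penalty
print(loss_l1, loss_l2)
```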
18. What is dropout in neural network? CO1 U 2
Ans Dropout is a regularization technique used in neural networks to
prevent overfitting. During training, a random subset of neurons
is “dropped out” by setting their outputs to zero with a certain
probability.
This forces the network to learn more robust and independent
features, as it cannot rely on specific neurons. Dropout improves
generalization and reduces the risk of overfitting.
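A minimal sketch of dropout as described above; the rescaling by 1/(1 − p) is a common "inverted dropout" implementation detail assumed here, not stated in the answer:

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True):
    """Randomly zero out a fraction p_drop of the activations during training,
    scaling the survivors so the expected activation value is unchanged."""
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) >= p_drop)
    return activations * mask / (1.0 - p_drop)

h = np.array([0.2, 1.3, -0.7, 0.9, 0.4])   # hypothetical hidden-layer outputs
print(dropout(h, p_drop=0.5))
```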

19. Give Properties of convex function CO1 U 2
Ans 1. A convex function is a function whose graph is shaped like a cup (U).
2. A twice-differentiable function of a single variable is convex if and only if its second
derivative is non-negative everywhere.
Example: the quadratic function x²
3. A strictly convex function has exactly one local minimum point, which is also the
global minimum point.
4. The sum of two convex functions is also a convex function.

20. Suppose you have inputs as x, y, and z with values -2, 5, and -4 CO1 AP 2
respectively. You have a neuron ‘q’ and neuron ‘f’ with functions:

q=x+y

f=q*z

Graphical representation of the functions is as follows:


What is the gradient of F with respect to x, y, and z?
(HINT: To calculate gradient, you must find (df/dx), (df/dy) and
(df/dz))

Ans. x = −2, y = 5, z = −4

f = q·z = (x + y)·z = xz + yz
∂f/∂x = z = −4
∂f/∂y = z = −4
∂f/∂z = x + y = −2 + 5 = 3

i.e., Ans = (−4, −4, 3)
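A tiny sketch of the same forward and backward (chain-rule) computation in code, matching the values above:

```python
# Forward pass
x, y, z = -2.0, 5.0, -4.0
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass (chain rule)
df_dq = z          # f = q*z  ->  df/dq = z
df_dz = q          # f = q*z  ->  df/dz = q = x + y
df_dx = df_dq * 1  # q = x+y  ->  dq/dx = 1
df_dy = df_dq * 1  # q = x+y  ->  dq/dy = 1

print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```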

PART-B
S.No QUESTION AND ANSWERS CO RBT MARKS
1. Compare and Contrast Single Layer and Multilayer Perceptron. CO1 U 16
Ans Single Layer Perceptron:

Perceptron is a building block of an Artificial Neural Network.


 Mr. Frank Rosenblatt invented the Perceptron for performing certain
calculations to detect input data capabilities or business intelligence.
 Perceptron is a linear Machine Learning algorithm used for supervised learning
for various binary classifiers.
 Perceptron is also understood as an Artificial Neuron or neural
network unit that helps to detect certain input data computations in
business intelligence.
 We can consider it as a single-layer neural network with four main parameters,
i.e., input values, weights and Bias, net sum, and an activation
function.
Basic Components of Perceptron:

Weight and Bias:


 Weight parameter represents the strength of the connection between units.
This is another most important parameter of Perceptron components. Weight is
directly proportional to the strength of the associated input neuron in deciding
the output. Further, Bias can be considered as the line of intercept in a linear
equation.
Activation Function:
 These are the final and important components that help to determine whether
the neuron will fire or not. Activation Function can be considered primarily as a

step function.
How does Perceptron work?
Step-1
 In the first step first, multiply all input values with corresponding weight values
and then add them to determine the weighted sum. Mathematically, we can
calculate the weighted sum as follows:
∑wi*xi = x1*w1 + x2*w2 + … + xn*wn
 Add a special term called bias 'b' to this weighted sum to improve the model's
performance.
∑wi*xi + b
Step-2
 In the second step, an activation function is applied with the above-mentioned
weighted sum, which gives us output either in binary form or a continuous
value as follows:
Y = f(∑wi*xi + b)
 This is one of the easiest Artificial neural networks (ANN) types.
 The single-layered perceptron model consists of a feed-forward network and also
includes a threshold transfer function inside the model.
 The main objective of the single-layer perceptron model is to analyze
linearly separable objects with binary outcomes.
 In a single-layer perceptron model, the algorithm does not use previously recorded
data, so it begins with randomly allocated values for the weight parameters. It then
computes the weighted sum of all inputs. If this sum exceeds a pre-determined
threshold value, the model is activated and shows the output value as +1.
 If the outcome matches the desired (threshold) value, the performance of the model
is stated as satisfactory and the weights are not changed. However, the model shows
some discrepancies when multiple weighted input values are fed into it.
 Hence, to obtain the desired output and minimize errors, some changes to the input
weights are necessary.
"Single-layer perceptron can learn only linearly separable patterns."

Multilayer Perceptrons
Multilayer Perceptrons are feedforward artificial neural networks that generate outputs
from a set of inputs. In a Multilayer Perceptron, multiple layers of input nodes are
connected as a directed graph between the input and output layers. The Multilayer
Perceptron is a deep learning method that uses backpropagation to train the network.
Though Perceptrons are widely recognized as algorithms, they were originally
designed for image recognition. It gets its name from performing the human-like
function of perceiving, seeing, and identifying images.
Multilayer Perceptrons are essentially feed-forward neural networks with three
types of layers: input, output, and hidden. The input layer receives the input signal
for processing. The output layer performs tasks such as classification
and prediction. The true computational engine of the Multilayer Perceptron consists of
an arbitrary number of hidden layers placed between the input and output layers, and
the data flows from the input layer to the output layer.
The neurons in Multilayer Perceptrons are trained using the backpropagation
learning algorithm. Multilayer Perceptrons can approximate any
continuous function and can solve problems that are not linearly separable.

Examples of Multilayer Perceptron


Multilayer Perceptrons are widely used to solve problems requiring supervised
learning and research into computational neuroscience and parallel distributed
processing. Examples include speech recognition, image recognition, and machine
translation.

Importance of Multilayer Perceptron:


Researchers often use Multilayer Perceptrons to solve complex problems
stochastically, allowing approximate solutions to challenging issues like fitness
estimation.
Using the perceptron model, machines can learn weight coefficients that help them
classify inputs. This linear binary classifier is highly effective in arranging and
categorizing input data into different classes, allowing probability-based
predictions and classifying items into multiple categories. Multilayer Perceptrons
have the advantage of learning non-linear models and the ability to train models in
real-time (online learning).

Other advantages of Multilayer Perceptrons are:


 It can be used to solve complex nonlinear problems.
 It handles large amounts of input data well.
 Makes quick predictions after training.
 The same accuracy ratio can be achieved even with smaller samples.

2. With the iterative optimization algorithm, derive CO1 U 16
Gradient Descent with cost function and convex
function, learning rate, steps and types relevant
to Deep learning for the figure below.

Ans COST FUNCTION (2)


 Cost Function – expressed as the difference or distance between the predicted value
and the actual value.
 The goal is to find the values of the parameters (coefficients) of the function f that
minimize the cost function.

PROPERTIES OF CONVEX FUNCTION (3)


 A convex function is a function whose graph is shaped like a cup (U).
 A twice-differentiable function of a single variable is convex if and only if its
second derivative is non-negative everywhere.
Example: the quadratic function x²
 A strictly convex function has exactly one local minimum point, which is
also the global minimum point.
 The sum of two convex functions is also a convex function.
LEARNING RATE (2)
Learning rate (also referred to as step size or alpha) is the size of the steps that
are taken to reach the minimum.
 High learning rates result in larger steps but risk overshooting the minimum.
 Small learning rates result in smaller, more precise steps but require more
iterations to reach the minimum.
STEPS (5)
 STEP 1 : The starting point is just an arbitrary point for us to evaluate the
performance.
 STEP 2 : From that starting point, we will find the derivative (or slope), and from
there, we can use a tangent line to observe the steepness of the slope.
 STEP 3 : The slope will inform the updates to the parameters—i.e. the weights
and bias.
 STEP 4 : The slope at the starting point will be steeper, but as new parameters
are generated, the steepness should gradually reduce until it reaches the lowest
point on the curve, known as the point of convergence.
TYPES (3)
 Stochastic Gradient Descent
In gradient descent, the gradient is a vector pointing in the direction of the
function's steepest rise at a particular point. The algorithm gradually moves towards
lower values of the function by stepping in the opposite direction of the gradient,
until it reaches the minimum of the function.
STEPS

 Initialization: Randomly initialize the parameters of the model.
 Set Parameters: Determine the number of iterations and the learning rate
(alpha) for updating the parameters.
 Stochastic Gradient Descent Loop: Repeat the following steps until the model
converges or reaches the maximum number of iterations:
a. Shuffle the training dataset to introduce randomness.
b. Iterate over each training example (or a small batch) in the shuffled order.
c. Compute the gradient of the cost function with respect to the model parameters
using the current training example (or batch).
d. Update the model parameters by taking a step in the direction of the negative
gradient, scaled by the learning rate.
e. Evaluate the convergence criteria, such as the change in the cost function
between iterations or the magnitude of the gradient.
 Return Optimized Parameters: Once the convergence criteria are met or the
maximum number of iterations is reached, return the optimized model
parameters.

Random selection introduces randomness into the optimization process, hence the
term “stochastic” in stochastic Gradient Descent
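A minimal sketch of the loop described above, applied to a hypothetical linear-regression problem (the data, learning rate and epoch count are assumptions for illustration only):

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, epochs=100):
    """Minimal SGD sketch for a linear model y_hat = X @ w + b with squared-error loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        order = np.random.permutation(n_samples)   # shuffle to introduce randomness
        for i in order:                             # one training example at a time
            y_hat = X[i] @ w + b
            error = y_hat - y[i]
            grad_w = 2 * error * X[i]               # gradient of (y_hat - y)^2 w.r.t. w
            grad_b = 2 * error
            w -= lr * grad_w                        # step in the negative gradient direction
            b -= lr * grad_b
    return w, b

# Hypothetical data generated from y = 3x + 1 with a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = 3 * X[:, 0] + 1 + 0.05 * rng.standard_normal(50)
print(sgd_linear_regression(X, y))   # should be close to ([3.], 1.)
```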

3. Illustrate how overfitting and underfitting are led by the CO1 AP 16
influence of bias and variance.

Ans Bias means the assumptions made by a model to make the target function easier to learn.
It corresponds to the error rate on the training data: when that error rate is high we call it
high bias, and when it is low we call it low bias.
Variance is the difference between the error rate on the training data and the error rate on
the testing data. If the difference is high it is called high variance; when the difference is
low it is called low variance. Usually, we want low variance so that our model generalizes.
Mathematically, let the variable we are trying to predict be Y and the other covariates
(independent variables) be X. We assume there is a relationship between the two such that
Y = f(X) + e,
where e is the error term, normally distributed with a mean of 0. We build a model f̂(X)
of f(X) using linear regression or any other modelling technique.
So the expected squared error at a point x can be decomposed as
Err(x) = E[(Y − f̂(x))²] = (E[f̂(x)] − f(x))² + E[(f̂(x) − E[f̂(x)])²] + σe²
       = Bias² + Variance + irreducible error.

◦ In the diagram below, the center of the target is a model that perfectly predicts
the correct values.
◦ As we move away from the bulls-eye, our predictions get worse and worse. We can
repeat our process of model building to get separate hits on the target.
◦ In supervised learning, underfitting happens when a model is unable to capture
the underlying pattern of the data. These models usually have high bias and
low variance. Examples: linear and logistic regression.

◦ In supervised learning, overfitting happens when our model captures the


noise along with the underlying pattern in data.
◦ It happens when we train our model for too long on a noisy dataset. These models
have low bias and high variance. They are typically complex models, such as decision
trees, which are prone to overfitting.

4. Elucidate the Challenges Motivating Deep Learning. CO1 U 16
Ans The Curse of Dimensionality
 The curse of dimensionality is a problem that arises when we are
working with data that has a large number of features, i.e., high-dimensional data.
The dimension of the data means the number of features or columns in our dataset.
 With the increase in dimensions, there are also more chances for the
occurrence of multi-collinearity.

 Suppose each feature in the dataset can take 4 distinct values. With only one
feature, the data can be represented along a line and the size of the configuration
space is 4.
 If we add one more feature, the configuration space grows to 4 × 4 = 16.
 Adding a third feature increases it to 4 × 4 × 4 = 64, and so on
(4 features give 4 × 4 × 4 × 4 = 256).
 So as the number of dimensions keeps increasing, the size of the configuration
space increases exponentially.
 High-dimensional data is responsible for the curse of dimensionality, but
why do we have such a huge number of dimensions in our data in the first place?
 One reason is feature encoding techniques (conversion of a categorical variable to
numerical features) such as one-hot encoding, which creates a dummy variable for
each category and thereby increases the number of dimensions.
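A tiny sketch of the exponential growth described above (the value of 4 configurations per feature follows the example in the text):

```python
# Number of possible configurations when each feature can take 4 distinct values:
# it grows exponentially with the number of features (dimensions).
values_per_feature = 4
for n_features in range(1, 6):
    print(n_features, "feature(s):", values_per_feature ** n_features, "configurations")
# 1 -> 4, 2 -> 16, 3 -> 64, 4 -> 256, 5 -> 1024
```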

Local Constancy and Smoothness Regularization


 Priors – To guide what kind of function they should learn
 Function should not change very much within a small region. Many
simpler algorithms rely exclusively on this prior to generalize well – Thus
fail to scale statistical challenges in AI tasks
 Deep learning introduces additional (explicit and implicit) priors in order
to reduce generalization error on sophisticated tasks
 We now explain why smoothness alone is insufficient
 Several methods encourage learning a function f* that satisfies the condition
f*(x) ≈ f*(x + ε) for most configurations x and small changes ε.
 If we know a good answer for input x then that answer is good in the
neighborhood of x
 An extreme example is k-nearest neighbor – Points having the same set
of nearest neighbors all have the same prediction – For k=1, no of
regions ≤ no of training examples
 A local kernel can be thought of as a similarity function that performs
template matching – By measuring how closely test example x
resembles training example x(i)
 Decision trees also suffer from exclusively smoothness-based learning –
they break the input space into as many regions as there are leaves and
use a separate parameter in each region.

Manifold Learning
 A manifold is a connected region –
 Mathematically, it is a set of points associated with a neighbourhood around each
point – from any given point, the manifold locally appears to be a Euclidean space.
 E.g., we experience the world as a 2-D plane while it is a spherical
manifold in 3-D space.
 Although introduced for continuous data and unsupervised learning, the
probability concentration idea can be generalized to discrete data and
supervised learning settings.
 Although manifold is mathematically defined, in machine learning it is
loosely defined: – A connected set of points that can be approximated
well by considering only a small no of degrees of freedom embedded in
a higher-dimensional space
5. (a) Solve to find Eigen Values and Eigen CO1 AP 16
Vector for:
y1 = −5x1 + 2x2
y2 = −9x1 + 6x2
(b) Find the local minimum for the function
y = (x + 5)², starting from x = 3. Do at least 3
iterations assuming learning rate = 0.01.
Ans In matrix form, equations 1 and 2 can be written as

[y1]   [−5  2] [x1]
[y2] = [−9  6] [x2]

i.e., in the form y = Av.

|A − λI| = 0
| −5−λ    2  |
| −9     6−λ | = (−5−λ)(6−λ) + 18 = λ² − λ − 12 = 0

EIGEN VALUES
λ1 = −3 ; λ2 = 4

Case 1: λ1 = −3
A − λ1I = [−5+3   2 ]   [−2  2]
          [−9   6+3 ] = [−9  9]

Corresponding equations:
−2x1 + 2x2 = 0
−9x1 + 9x2 = 0
Both reduce to x1 − x2 = 0, i.e., x1 = x2.

EIGEN VECTOR v = [1, 1]ᵀ
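As a quick numerical check (NumPy is assumed here for illustration), np.linalg.eig recovers the same eigenvalues, and the eigenvector for λ = −3 is proportional to [1, 1]:

```python
import numpy as np

A = np.array([[-5.0, 2.0],
              [-9.0, 6.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # -3 and 4 (order may vary)
print(eigenvectors)   # columns are the normalized eigenvectors
```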


(b) y = (x + 5)² reaches its minimum at x = −5.

STEP 1: Initialize x0 = 3 and find the gradient dy/dx = 2(x + 5), with learning rate L.R = 0.01.

STEP 2: Apply the update rule x_new = x_old − L.R · (dy/dx):

Iteration 1: x1 = x0 − L.R · 2(x0 + 5) = 3 − 0.01 · 2 · (3 + 5) = 2.84
Iteration 2: x2 = x1 − L.R · 2(x1 + 5) = 2.84 − 0.01 · 2 · (2.84 + 5) ≈ 2.6832
Iteration 3: x3 = x2 − L.R · 2(x2 + 5) = 2.6832 − 0.01 · 2 · (2.6832 + 5) ≈ 2.5295

With each iteration, x moves closer to the minimum at x = −5.
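A short loop reproducing the three iterations above:

```python
x = 3.0          # starting point
lr = 0.01        # learning rate

for i in range(1, 4):
    grad = 2 * (x + 5)       # dy/dx for y = (x + 5)^2
    x = x - lr * grad        # gradient descent update
    print(f"Iteration {i}: x = {x:.4f}")
# Iteration 1: x = 2.8400
# Iteration 2: x = 2.6832
# Iteration 3: x = 2.5295
```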

6. Any modification to a learning algorithm that helps CO1 U 16
to reduce its error on test data (rather than its
training error) addresses generalization error. This
concept of generalization error gives rise to
Regularization in Machine Learning. Give a detailed
account of 4 such concepts that also help in Deep
Learning.
Ans A scenario where a machine learning model tries to learn the details along with the
noise in the data and tries to fit each data point to a curve is called Overfitting.
In the figure below, we can see that the model fits every point in our data. If new
data is provided, the model curve may not match the patterns in the new data, and
the model may not predict very well.

Regularization refers to techniques used to calibrate machine learning models to


minimize the adjusted loss function and avoid overfitting or underfitting.

It involves adding a penalty term to the loss function during training.


This penalty discourages the model from becoming too complex or having large
parameter values, which helps in controlling the model’s ability to fit noise in the
training data.
Regularization methods include L1 and L2 regularization, dropout, early stopping, and
more. By applying regularization, models become more robust and better at making
accurate predictions on unseen data.
How does Regularization Work?
Regularization works by adding a penalty or complexity term to the complex model.
Let's consider the simple linear regression equation:
y = β0 + β1·x1 + β2·x2 + … + βn·xn + b
In the above equation, y represents the value to be predicted and x1, x2, …, xn are the
features for y.
β1, β2, …, βn are the weights or magnitudes attached to the features, respectively;
β0 represents the bias of the model, and b represents the intercept.

Linear regression models try to optimize β0 and b to minimize the cost function. The
loss function for linear regression, called RSS (Residual Sum of Squares), is:
RSS = Σ(i=1 to m) (yi − ŷi)²
Regularization adds a penalty term to this loss function and optimizes the parameters so
that the model can predict accurate values of y.

Regularization types:

1. L2 Regularization / Ridge Regression
2. L1 Regularization / Lasso Regression
3. Dropout
4. Dropconnect

1. L2 Regularization / Ridge Regression


Ridge regression is a regularization technique, which is used to reduce the complexity
of the model. It is also called as L2 regularization. In this technique, the cost function
is altered by adding the penalty term to it. The amount of bias added to the model is
called Ridge Regression penalty. We can calculate it by multiplying with the lambda
to the squared weight of each individual feature.
The equation for the cost function in ridge regression will be:
Cost = Σ(i=1 to m) (yi − ŷi)² + λ·Σ(j=0 to n) βj²,  where ŷi = Σ(j=1 to n) βj·xij
λ is multiplied by the square of the weight of each individual feature of the input data.
This term is the ridge regularization penalty.
o In the above equation, the penalty term regularizes the coefficients of the
model, and hence ridge regression reduces the amplitudes of the coefficients
that decreases the complexity of the model.
o As we can see from the above equation, if the values of λ tend to zero, the
equation becomes the cost function of the linear regression
model. Hence, for the minimum value of λ, the model will resemble the linear
regression model.
o A general linear or polynomial regression will fail if there is high collinearity
between the independent variables, so to solve such problems, Ridge
regression can be used.
o It helps to solve the problems if we have more parameters than samples.

2. L1 Regularization / Lasso Regression


o Lasso regression is another regularization technique to reduce the complexity
of the model. LASSO stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to the Ridge Regression except that the penalty term contains only
the absolute weights instead of a square of weights.
o Since it takes absolute values, hence, it can shrink the slope to 0, whereas
Ridge Regression can only shrink it near to 0. It is also called as L1
regularization.
o The equation for the cost function of Lasso regression will be:
Cost = Σ(i=1 to m) (yi − ŷi)² + λ·Σ(j=0 to n) |βj|,  where ŷi = Σ(j=1 to n) βj·xij

o Some of the features in this technique are completely neglected for model
evaluation.
o Hence, the Lasso regression can help us to reduce the overfitting in the model
as well as the feature selection.
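A brief usage sketch with scikit-learn (assumed available here; alpha plays the role of λ), illustrating how Lasso drives uninformative coefficients to exactly zero while Ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Hypothetical data: only the first two of five features are informative
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = 2 * X[:, 0] - 3 * X[:, 1] + 0.1 * rng.standard_normal(100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge coefficients:", ridge.coef_)   # all shrunk towards (but not exactly) zero
print("Lasso coefficients:", lasso.coef_)   # uninformative coefficients driven to zero
```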
3.Dropout

 In machine learning, “dropout” refers to the practice of disregarding certain


nodes in a layer at random during training. A dropout is a regularization
approach that prevents overfitting by ensuring that no units are codependent
with one another.
 To achieve dropout regularization, some neurons in the artificial neural network
are randomly disabled. That prevents them from being too dependent on one
another as they learn the correlations. Thus, the neurons work more
independently, and the artificial neural network learns multiple independent
correlations in the data based on different configurations of the neurons.
 After dropout regularization, the network cannot rely on any single feature
since at any given time the feature might be suppressed. The network then
spreads out the weights, which avoids putting too much weight on any one
feature. That prevents the neurons from learning too much, which can lead to
overfitting.
4. Dropconnect
DropConnect is a generalization of Dropout for regularizing large fully-connected
layers within neural networks.
It is used to add more noise to the network: instead of dropping neurons, it randomly
drops the connections (weights) between neurons. In the resulting sparsely connected
layer, the connections to drop are chosen at random during the training stage.

7. Challenges in Neural Network Optimization CO1 U 16


Ans Of all of the many optimization problems involved in deep learning, the most difficult is

neural network training. It is quite common to invest days to months of time on
hundreds of machines in order to solve even a single instance of the neural network
training problem.
1. Ill-Conditioning
2. Local Minima
3. Plateaus, Saddle Points and Other Flat Regions
4. Cliffs and Exploding Gradients
5. Long-Term Dependencies
Ill-Conditioning
The most prominent is ill-conditioning of the Hessian matrix H. The ill-conditioning
problem is generally believed to be present in neural network training problems. Ill-
conditioning can manifest by causing SGD to get “stuck” in the sense that even very
small steps increase the cost function.
Local Minima
Some convex functions have a flat region at the bottom rather than a single global
minimum point, but any point within such a flat region is an acceptable solution.
Neural networks and any models with multiple equivalently parametrized latent
variables all have multiple local minima because of the model identifiability problem. A
model is said to be identifiable if a sufficiently large training set can rule out all but
one setting of the model’s parameters.
Plateaus, Saddle Points and Other Flat Regions
At a saddle point, the Hessian matrix has both positive and negative eigenvalues.
Points lying along eigenvectors associated with positive eigenvalues have greater cost
than the saddle point, while points lying along eigenvectors associated with negative
eigenvalues have lower cost. The gradient can often become very small near a saddle
point. On the other hand, gradient descent empirically seems to be able to escape
saddle points in many cases.
Cliffs and Exploding Gradients
Neural networks with many layers often have extremely steep regions resembling
cliffs. These result from the multiplication of several large weights together. On the
face of an extremely steep cliff structure, the gradient update step can move the
parameters extremely far, usually jumping off of the cliff structure altogether.
Long-Term Dependencies
Another difficulty that neural network optimization algorithms must overcome
arises when the computational graph becomes extremely deep. Feedforward networks
with many layers have such deep computational graphs.

Vanishing gradients make it difficult to know which direction the parameters should
move to improve the cost function, while exploding gradients can make learning
unstable. The cliff structures described earlier that motivate gradient clipping are an
example of the exploding gradient phenomenon
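A minimal sketch of gradient clipping by norm, the remedy mentioned above for cliffs and exploding gradients (the max norm of 1.0 and the gradient values are assumptions for illustration):

```python
import numpy as np

def clip_gradient_by_norm(grad, max_norm=1.0):
    """Rescale the gradient if its norm exceeds max_norm, so a single update
    cannot jump the parameters off a 'cliff' in the loss surface."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])           # hypothetical exploding gradient (norm = 50)
print(clip_gradient_by_norm(g, 1.0))  # [ 0.6 -0.8]  -- direction kept, magnitude capped
```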
8. Deep Networks CO1 R 16
(1)Methods and Variations
(2)Reasons
(3)Applications
(4)Differentiate AI,ML,DL

Ans

Deep learning is a subset of Machine learning.

(1)Methods and Variations


(1) Unsupervised learning – Boltzmann machines for preliminary (pre-)training,
auto-encoders, generative adversarial networks.
(2) Supervised learning, such as CNNs.
(3) RNNs to train processes in time.
(4) Recursive NNs to include feedback between elements of the network in a chain.
(2) Reasons for using Deep learning

(1) Analysing unstructured data – DL Algorithms can be trained to look at


text data by analysing social media posts, etc
(2) Data labelling – DL requires labelled data for training
(3) Feature engineering – DL can save time because it does not require
humans to extract features manually from raw data
(4) Efficiency – When DL algorithm is properly trained, it can perform
thousands of tasks
(5) Training – NN in DL have ability to be applied to many different
datatypes and applications

(3) Applications

(4) Differentiate AI,ML,DL


AI | ML | DL

a. AI simulates human intelligence to perform tasks and make decisions. | ML is a subset of AI that uses algorithms to learn patterns from data. | DL is a subset of ML that employs artificial neural networks for complex tasks.

b. AI may or may not require large datasets; it can use predefined rules. | ML heavily relies on labeled data for training and making predictions. | DL requires extensive labeled data and performs exceptionally with big datasets.

c. Three broad categories/types of AI are: Artificial Narrow Intelligence (ANI), Artificial General Intelligence (AGI) and Artificial Super Intelligence (ASI). | Three broad categories/types of ML are: Supervised Learning, Unsupervised Learning and Reinforcement Learning. | DL can be considered as neural networks with a large number of parameters and layers, lying in one of the four fundamental network architectures: Unsupervised Pre-trained Networks, Convolutional Neural Networks, Recurrent Neural Networks and Recursive Neural Networks.

COURSE INCHARGE IQAC


HOD
