UNIT I
DEEP NETWORKS BASICS
Linear Algebra: Scalars - Vectors - Matrices and tensors; Probability
Distributions - Gradient-based Optimization - Machine Learning Basics:
Capacity - Overfitting and underfitting - Hyperparameters and validation sets
- Estimators - Bias and variance - Stochastic gradient descent - Challenges
motivating deep learning; Deep Networks: Deep feedforward networks;
Regularization - Optimization.
 
1. LINEAR ALGEBRA
The term linear algebra was first introduced in the early 18th century for
finding the unknowns in linear equations and solving those equations easily. It is
also a prerequisite for learning machine learning and data science.
Deep learning is a subdomain of machine learning concerned with algorithms, called
artificial neural networks, that imitate the structure and function of the brain.
Linear algebra is a form of continuous rather than discrete mathematics.
1.1. USES OF LINEAR ALGEBRA
1. Optimization of data.
2. Implementation of Linear Regression in Machine Learning.
3. Linear algebra is also used in neural networks and the data science field.
4. Better graphics experience
5. Improved statistics
6. Creating better Machine Learning algorithms
7. Estimating the forecast of Machine Learning
8. Easy to learn
1.1.1 Better Graphics Experience
Linear algebra helps to provide better graphical processing in machine learning
for data such as images, audio, and video, and for tasks such as edge detection.
Moreover, linear algebra helps to solve and compute large and complex data sets
through a specific set of techniques named matrix decomposition techniques.
1.1.2 Improved Statistics
Statistics is an important concept for organizing data in machine
learning. Linear algebra also helps to understand the concepts of statistics in a
better manner.
1.1.3 Creating better Machine Learning algorithms
A few supervised learning algorithms can be created using linear algebra:
1. Logistic Regression
2. Linear Regression
3. Decision Trees
4. Support Vector Machines (SVM)
Further, below are some unsupervised learning algorithms that can also be
created with the help of linear algebra:
1. Singular Value Decomposition (SVD)
2. Clustering
3. Components Analysis
1.1.4 Easy to Learn
 
mathematics and its applications.
1.2 EXAMPLES OF LINEAR ALGEBRA IN MACHINE LEARNING
Below are some popular examples of linear algebra in Machine learning:
1. Datasets and Data Files
2. Linear Regression
3. Recommender Systems
4. One-hot encoding
5. Regularization
6. Principal Component Analysis
7. Images and Photographs
8. Singular-Value Decomposition
9. Deep Learning
10. Latent Semantic Analysis
1.3. SCALARS
A scalar is just a single number, in contrast to most of the other objects studied
in linear algebra, which are usually arrays of multiple numbers. We
write scalars in italics. We usually give scalars lower-case variable names.
For example, we might say "Let $s \in \mathbb{R}$ be the slope of the line," while
defining a real-valued scalar, or "Let $n \in \mathbb{N}$ be the number of units,"
while defining a natural number scalar.
1.4 VECTORS
A vector is an array of numbers. The numbers are arranged in order. We can
identify each individual number by its index in that ordering.
Typically we give vectors lower-case names written in bold typeface, such as x.
The elements of the vector are identified by writing its name in italic typeface, with a
subscript. The first element of x is $x_1$, the second element is $x_2$, and so on.
1.5 MATRICES
A matrix is a 2-D array of numbers. We usually give matrices upper-case variable
names with bold typeface, such as A. If a real-valued matrix A has a height of $m$ and a
width of $n$, then we say that $A \in \mathbb{R}^{m \times n}$.
We usually identify the elements of a matrix using its name in italic but not bold
font, and the indices are listed with separating commas. For example, a matrix with
two rows and two columns is written
$A = \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix}$
1.6 TENSORS
 
In some cases we will need an array with more than two axes.
In the general case, an array of numbers arranged on a regular grid with a variable
number of axes is known as a tensor.
We denote a tensor named "A" with this typeface: A. We identify the element
of A at coordinates $(i, j, k)$ by writing $A_{i,j,k}$.
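The four objects defined so far can be sketched with NumPy arrays. This is only an illustration: the values and variable names are made up, and NumPy counts indices from 0, whereas the text counts from 1.

```python
import numpy as np

s = np.float64(2.5)                 # scalar: a single number, 0 axes
x = np.array([1.0, 2.0, 3.0])       # vector: 1 axis
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])          # matrix: 2 axes, shape (m, n) = (3, 2)
T = np.arange(24).reshape(2, 3, 4)  # tensor: 3 axes

# Number of axes of each object, then the (height, width) of the matrix
print(np.ndim(s), x.ndim, A.ndim, T.ndim)
print(A.shape)

# The text's x_1 is x[0] and its A_{1,2} is A[0, 1] in 0-based indexing.
first_element = x[0]
a_12 = A[0, 1]
t_ijk = T[1, 2, 3]   # element at 0-based coordinates (i, j, k) = (1, 2, 3)
```

The only difference between these objects, as far as the array library is concerned, is the number of axes.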
 
 
 
 
 
 
 
 
[Figure: a grid of numbers with six rows labelled 't', 'e', 'n', 's', 'o', 'r' and four columns.]
 
 
 
 
 
 
 
1.7 PROBABILITY DISTRIBUTIONS
Probability measures how likely an event is to occur. It is a mathematical
concept whose values lie between 0 and 1, where 0 means the event cannot occur
and 1 means it is certain. This fundamental theory of probability is also applied
to probability distributions.
1.7.1 Discrete Variable and Probability Mass Function
The probability mass function is the function which describes the probability
associated with the random variable X. This function is named P(X) or P(X = x) to
avoid confusion. P(X = x) corresponds to the probability that the random variable X
takes the value x.
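As a small illustration of a probability mass function, the sketch below assumes X is the outcome of one roll of a fair six-sided die. The die, the dictionary `pmf`, and the helper `P` are hypothetical choices made for this example.

```python
from fractions import Fraction

# Assumed example: X is the outcome of one roll of a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def P(x):
    """Return P(X = x); values outside the support have probability 0."""
    return pmf.get(x, Fraction(0))

total = sum(pmf.values())                # the probabilities sum to 1
p_even = sum(P(x) for x in (2, 4, 6))    # P(X is even)
print(P(3), total, p_even)
```

Every value of the function lies between 0 and 1, and the values over all possible outcomes sum to 1.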
1.8 GRADIENT-BASED OPTIMIZATION
1.8.1 Optimizer
Optimizers update the parameters of neural networks, such as the weights and the
learning rate, to minimize the loss function. Here, the loss function acts as a guide
to the terrain, telling the optimizer whether it is moving in the right direction to
reach the bottom of the valley, the global minimum.
1.8.2 The Intuition behind Optimizers with an Example
Let us imagine a climber hiking down a hill with no sense of direction. He
doesn't know the right way to reach the valley, but he can tell
whether he is moving closer to (going downhill) or further away from (uphill) his final
destination. If he keeps taking steps in the correct direction, he will reach his goal,
i.e., the valley.
This is exactly the intuition behind optimizers: to reach a global minimum
of the loss function.
1.8.3 Instances of Gradient-Based Optimizers
Different instances of gradient-descent-based optimizers are as follows:
• Batch Gradient Descent, also called Vanilla Gradient Descent or simply Gradient Descent (GD)
• Stochastic Gradient Descent (SGD)
• Mini-Batch Gradient Descent (MB-GD)
1.8.4 Batch Gradient Descent
Gradient descent is an optimization algorithm that is used when training deep
learning models.
It assumes a convex cost function and updates the parameters iteratively to drive
that function towards a local minimum. The update rule and its notation are
given below.
Gradient Descent:
$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$
In the above formula,
• $\alpha$ is the learning rate,
• $J$ is the cost function, and
• $\theta$ is the parameter to be updated.
As you can see, the gradient represents the partial derivative of $J$ (the cost function)
with respect to $\theta_j$.
Note that, as we get closer to the global minimum, the slope or gradient of
the curve becomes less and less steep, which results in a smaller value of the derivative,
which in turn reduces the step size automatically even though the learning rate stays fixed.
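The shrinking step size can be seen in a minimal sketch that assumes the illustrative cost J(θ) = θ², whose derivative is 2θ. The cost, the starting point, and the learning rate are arbitrary choices for this example.

```python
def grad_J(theta):
    """Derivative of the illustrative cost J(theta) = theta**2."""
    return 2.0 * theta

alpha = 0.1      # learning rate, held constant throughout
theta = 5.0      # arbitrary starting point
step_sizes = []
for _ in range(10):
    step = alpha * grad_J(theta)   # step = learning rate * gradient
    theta = theta - step
    step_sizes.append(abs(step))

print(step_sizes)
```

Each recorded step is smaller than the previous one because the gradient shrinks as θ approaches the minimum at 0, while `alpha` never changes.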
Gradient descent is the most basic yet most widely used optimizer: it uses the
derivative of the loss function and the learning rate directly to reduce the loss
and move towards the global minimum.
Thus, the Gradient Descent optimization algorithm has many applications,
including
• Linear Regression,
• Classification Algorithms,
• Backpropagation in Neural Networks, etc.
 
       
  
 
 
[Figure: cost versus weight curve, showing the initial weight and the incremental steps taken along the curve.]
Our aim is to reach the bottom of the graph (cost vs. weight), or a point
where we can no longer move downhill, i.e., a local minimum.
1.8.5 Role of Gradient
[Figure: cost curve with the gradient (derivative of the cost) marked on the curve and the minimum cost at the bottom.]
In general, the gradient represents the slope of the function, and gradients are partial
derivatives: they describe the change reflected in the loss function with respect to
a small change in the parameters of the function. This slight change in the loss function
can tell us about the next step to take to reduce the output of the loss function.
 
1.8.6 Role of Learning Rate
The learning rate represents the size of the steps our optimization algorithm takes to
reach the minimum. To ensure that the gradient descent algorithm reaches the
local minimum, we must set the learning rate to an appropriate value, which is neither
too low nor too high.
Taking very large steps, i.e., a large value of the learning rate, may skip over the
minimum, and the model may never reach the optimal value of the loss function. On the
contrary, taking very small steps, i.e., a small value of the learning rate, will take a
very long time to converge.
Thus, the size of the step depends on both the learning rate and the gradient value.
[Figure: gradient descent with a big learning rate versus a small learning rate.]
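The effect of the learning rate can be sketched on the same kind of toy problem. The cost J(θ) = θ², the three learning rates, and the function name `run_gradient_descent` are assumptions chosen for illustration, not values from the text.

```python
def run_gradient_descent(alpha, theta=5.0, steps=50):
    """Minimise the illustrative cost J(theta) = theta**2; return the final theta."""
    for _ in range(steps):
        theta = theta - alpha * 2.0 * theta   # dJ/dtheta = 2 * theta
    return theta

too_large = run_gradient_descent(alpha=1.1)     # overshoots further on every step
too_small = run_gradient_descent(alpha=0.001)   # barely moves in 50 steps
moderate = run_gradient_descent(alpha=0.1)      # approaches the minimum at 0
print(too_large, too_small, moderate)
```

With the large rate every update jumps past the minimum and lands further away than it started, while with the tiny rate θ is still close to its starting value after 50 steps.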
The gradient represents the direction of increase. But our aim is to find the
minimum point in the valley so we have to go in the opposite direction of the gradient.
Therefore, we update parameters in the negative gradient direction to minimize the
loss.
Algorithm: $\theta = \theta - \alpha \cdot \nabla J(\theta)$
In code, Batch Gradient Descent looks something like this:
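for epoch in range(epochs):
    # Gradient of the loss computed over the entire dataset
    params_gradient = find_gradient(loss_function, data, parameters)
    # Step in the negative gradient direction
    parameters = parameters - learning_rate * params_gradient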
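The loop above leaves `find_gradient`, the loss, and the data undefined. The following is one possible runnable sketch, assuming a linear model trained with mean squared error on synthetic, noise-free data. The simplified `find_gradient(X, y, parameters)` signature, the data, and the hyperparameters are all assumptions made for this example.

```python
import numpy as np

def find_gradient(X, y, parameters):
    """Gradient of the mean squared error of a linear model over the given rows."""
    residual = X @ parameters - y
    return 2.0 * X.T @ residual / len(y)

rng = np.random.default_rng(0)
# Synthetic data: a bias column of ones plus one feature
X = np.column_stack([np.ones(100), rng.uniform(-1.0, 1.0, 100)])
true_parameters = np.array([1.0, 3.0])
y = X @ true_parameters          # noise-free targets, so the minimum is known

parameters = np.zeros(2)
learning_rate = 0.1
for epoch in range(2000):
    # One update per epoch, computed from the whole dataset
    params_gradient = find_gradient(X, y, parameters)
    parameters = parameters - learning_rate * params_gradient

print(parameters)   # compare with true_parameters
```

Because the whole dataset enters every gradient, each epoch produces exactly one parameter update.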
Advantages of Batch Gradient Descent
1. Easy computation.
2. Easy to implement.
3. Easy to understand.
Disadvantages of Batch Gradient Descent
1. May get trapped at local minima.
2. Weights are changed only after calculating the gradient on the whole dataset, so
if the dataset is too large it may take a very long time to converge to the minimum.
3. Requires a large amount of memory to calculate the gradient on the whole dataset.
1.9 STOCHASTIC GRADIENT DESCENT
1. To overcome some of the disadvantages of the GD algorithm, the SGD
algorithm comes into the picture as an extension of Gradient Descent.
2. One of the disadvantages of the Gradient Descent algorithm is that it requires
a lot of memory to load the entire dataset at once to compute the derivative
of the loss function.
3. In the SGD algorithm, we instead compute the derivative using one data
point at a time, i.e., it updates the model's parameters more frequently.
4. Therefore, the model parameters are updated after the computation of the loss
on each training example.
5. For example, if a dataset contains 1000 rows, SGD will update the model
parameters 1000 times in one complete pass over the dataset, instead of
once as in Gradient Descent.
Algorithm: $\theta = \theta - \alpha \cdot \nabla J(\theta; x^{(i)}; y^{(i)})$
where $\{x^{(i)}, y^{(i)}\}$ are the training examples.
To make training faster still, we take a gradient descent step for
each training example. Let's see the implications in the image below:
Figure: SGD vs GD (left panel: Stochastic Gradient Descent; right panel: Gradient
Descent). '+' denotes a minimum of the cost. SGD leads to many oscillations to reach
convergence, but each step is a lot faster to compute for SGD than for GD, as it uses
only one training example (vs. the whole batch for GD).
 
Let's try to find some insights from the above diagram:
1. In the left diagram of the above picture, we have SGD, where we take a
Gradient Descent step for each example (1 step per example); the right
diagram is GD (1 step per entire training set).
2. SGD seems to be quite noisy, but at the same time each update is much faster
to compute, and it is possible that it does not converge to a minimum.
3. It is observed that in SGD the updates take more iterations compared to GD
to reach the minimum.
4. GD, by contrast, takes fewer steps to reach the minimum, whereas SGD
is noisier and takes more iterations because the frequently updated model
parameters have high variance, which makes the loss fluctuate with
varying intensity.
5. The code snippet for SGD simply adds a loop over the training examples and
finds the gradient with respect to each of the training examples.
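for epoch in range(epochs):
    np.random.shuffle(data)  # visit the examples in a new order each epoch
    for example in data:
        # Gradient of the loss on a single training example
        params_gradient = find_gradient(loss_function, example, parameters)
        parameters = parameters - learning_rate * params_gradient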
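A runnable version of this loop might look as follows. It is a sketch under stated assumptions: a linear model with squared error, synthetic noise-free data with 1000 rows (matching the 1000-row example in the text), and a hand-picked learning rate and epoch count. None of these specifics come from the original snippet.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic dataset with 1000 rows: a bias column of ones plus one feature
X = np.column_stack([np.ones(1000), rng.uniform(-1.0, 1.0, 1000)])
true_parameters = np.array([1.0, 3.0])
y = X @ true_parameters                      # noise-free targets

parameters = np.zeros(2)
learning_rate = 0.05
updates = 0
for epoch in range(20):
    order = rng.permutation(len(y))          # shuffle the examples each epoch
    for i in order:
        # Gradient of the squared error on the single example (x_i, y_i)
        params_gradient = 2.0 * (X[i] @ parameters - y[i]) * X[i]
        parameters = parameters - learning_rate * params_gradient
        updates += 1

print(updates, parameters)
```

With 1000 rows, each pass over the data performs 1000 parameter updates, so 20 epochs give 20000 updates, compared with 20 for batch gradient descent.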
Advantages of Stochastic Gradient Descent
1. Convergence takes less time than with the other variants, since the model
parameters are updated frequently.
2. Requires less memory, as there is no need to store the loss values for the whole dataset.
3. May find new minima.
Disadvantages of Stochastic Gradient Descent
1. High variance in model parameters.
2. Even after reaching the global minimum, it may overshoot.
3. To reach the same convergence as that of gradient descent, we need to slowly
reduce the value of the learning rate.
1.9.1 Mini-Batch Gradient Descent
1. To overcome the problem of large time complexity in the case of the SGD
algorithm, the MB-GD algorithm comes into the picture as an extension of
the SGD algorithm.
2. Not only that, it also overcomes the problems of Gradient Descent. Therefore,
it is considered the best among all the variations of gradient descent
algorithms. The MB-GD algorithm takes a batch, or subset, of points
from the dataset to compute the derivative.
3. It is observed that the derivative of the loss function for MB-GD is almost
the same as the derivative of the loss function for GD after some number of
iterations.
4. But the number of iterations needed to reach the minimum is larger for MB-GD
than for GD, and the cost of computation is also larger.
5. The weight update depends on the derivative of the loss for a
batch of points. The updates in the case of MB-GD are noisier than in GD because
the derivative does not always point towards the minimum.
6. It updates the model parameters after every batch: the algorithm divides
the dataset into batches and updates the parameters after each one.
[Figure: Stochastic Gradient Descent vs Mini-Batch Gradient Descent.]
Algorithm: $\theta = \theta - \alpha \cdot \nabla J(\theta; B^{(i)})$,
where $\{B^{(i)}\}$ are the batches of training examples.
In the code snippet, instead of iterating over examples, we now iterate over mini-
batches of size 30:
for epoch in range(epochs):
    np.random.shuffle(data)  # reshuffle so each epoch sees different batches
    for batch in get_batches(data, batch_size=30):
        # Gradient of the loss on one mini-batch
        params_gradient = find_gradient(loss_function, batch, parameters)
        parameters = parameters - learning_rate * params_gradient
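The snippet above relies on a `get_batches` helper that the text does not define. The sketch below shows one possible implementation together with a runnable training loop. The helper's signature, the linear model with mean squared error, the synthetic data, and the hyperparameters are all assumptions made for illustration.

```python
import numpy as np

def get_batches(X, y, batch_size):
    """Yield consecutive (X, y) mini-batches; the last one may be smaller."""
    for start in range(0, len(y), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(1000), rng.uniform(-1.0, 1.0, 1000)])
true_parameters = np.array([1.0, 3.0])
y = X @ true_parameters                      # noise-free targets

parameters = np.zeros(2)
learning_rate = 0.1
updates_per_epoch = 0
for epoch in range(200):
    order = rng.permutation(len(y))          # reshuffle before forming batches
    updates_per_epoch = 0
    for X_batch, y_batch in get_batches(X[order], y[order], batch_size=30):
        # Mean squared error gradient over one mini-batch
        residual = X_batch @ parameters - y_batch
        params_gradient = 2.0 * X_batch.T @ residual / len(y_batch)
        parameters = parameters - learning_rate * params_gradient
        updates_per_epoch += 1

print(updates_per_epoch, parameters)
```

With 1000 rows and batches of 30, an epoch consists of 33 full batches plus one batch of 10, i.e. 34 updates, which sits between the single update of batch gradient descent and the 1000 updates of SGD.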
Advantages of Mini Batch Gradient Descent
1. Updates the model parameters frequently and also has less variance.
2. Requires a moderate amount of memory, neither very low nor very high.
Disadvantages of Mini Batch Gradient Descent
1. The parameter updates in MB-GD are noisier than the weight
updates in the GD algorithm.
2. Compared to the GD algorithm, it takes a longer time to converge.
3. May get stuck at local minima.
1.9.2 Challenges with all types of Gradient-based Optimizers
1.9.2.1 Optimum Learning Rate
Choosing an optimum value of the learning rate is difficult. If we choose a learning
rate that is too small, gradient descent may take a very long time to converge; if it
is too large, the updates may overshoot the minimum. For more about this challenge,
refer to the section on the Role of Learning Rate in the discussion of the Gradient
Descent algorithm.
1.9.2.2 Constant Learning Rate
All the parameters share a constant learning rate, but there may be some
parameters that we do not want to change at the same rate.
1.10 MACHINE LEARNING BASICS: Capacity, Overfitting and
underfitting
1.10.1 Capacity of a model
Model capacity is the ability of a model to fit a wide variety of functions.
1. A model with low capacity may struggle to fit the training set.
2. A high-capacity model can overfit by memorizing the training set.
One way to control the capacity of a learning algorithm is by choosing its hypothesis
space, i.e., the set of functions that the learning algorithm is allowed to select as
being the solution.
E.g., the linear regression algorithm has the set of all linear functions of its input
as its hypothesis space.
We can generalize linear regression to include polynomials in its hypothesis space,
which increases the model's capacity.
1.10.1.1 Capacity of Polynomial Curve Fits
A polynomial of degree 1 gives a linear regression model with the prediction
$\hat{y} = b + wx$
By introducing $x^2$ as another feature provided to the regression model, we can
learn a model that is quadratic as a function of $x$:
$\hat{y} = b + w_1 x + w_2 x^2$
The output is still a linear function of the parameters, so we can use the normal
equations to train the model in closed form.
We can continue to add more powers of $x$ as additional features, e.g., a polynomial
of degree 9:
$\hat{y} = b + \sum_{i=1}^{9} w_i x^i$
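To make the closed-form training step concrete, the sketch below fits the quadratic model by solving the normal equations. The target function, the sample points, and the names `Phi` and `theta` are assumptions chosen for illustration. The data is noise-free so that the fitted coefficients can be compared with the known ones.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 20)
y = 0.5 + 2.0 * x - 3.0 * x**2       # assumed quadratic target, no noise

# Design matrix with columns [1, x, x^2]: x^2 is treated as just another feature.
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Normal equations: (Phi^T Phi) theta = Phi^T y, with theta = [b, w_1, w_2].
theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
b, w1, w2 = theta
print(b, w1, w2)
```

The model is quadratic in x but linear in the parameters b, w_1, and w_2, which is why a single linear solve is enough.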
1.10.1.2 Appropriate Capacity
1. Machine learning algorithms will perform well when their capacity is
appropriate for the true complexity of the task that they need to perform and
the amount of training data they are provided with.
2. Models with insufficient capacity are unable to solve complex tasks.
3. Models with high capacity can solve complex tasks, but when their capacity
is higher than needed to solve the present task, they may overfit.
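The three points above can be explored with a small experiment that fits polynomials of degree 1, 2, and 9 to a few noisy samples of a quadratic function and compares the error on the training points with the error on held-out points. Everything here is an assumption made for illustration: the target function, the noise level, the sample sizes, and the helper names.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Least-squares fit of a polynomial of the given degree; returns coefficients."""
    Phi = np.vander(x, degree + 1, increasing=True)   # columns 1, x, ..., x^degree
    coefficients, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return coefficients

def mean_squared_error(x, y, coefficients):
    """Mean squared error of the fitted polynomial on the points (x, y)."""
    Phi = np.vander(x, len(coefficients), increasing=True)
    return float(np.mean((Phi @ coefficients - y) ** 2))

rng = np.random.default_rng(0)

def sample(n):
    """Draw n noisy samples from an assumed quadratic target."""
    x = rng.uniform(-1.0, 1.0, n)
    y = 0.5 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.2, n)
    return x, y

x_train, y_train = sample(12)
x_test, y_test = sample(200)

errors = {}
for degree in (1, 2, 9):
    coefficients = fit_polynomial(x_train, y_train, degree)
    errors[degree] = (mean_squared_error(x_train, y_train, coefficients),
                      mean_squared_error(x_test, y_test, coefficients))
    print(degree, errors[degree])   # (training error, held-out error)
```

The training error can only go down as the degree grows, because each larger model contains the smaller ones. The held-out error is the quantity to watch: in experiments of this kind it is typically high for degree 1 (underfitting) and often rises again for degree 9 (overfitting), although the exact numbers depend on the random sample.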
1.10.1.3 Ordering Learning Machines by Capacity
The goal of learning is to choose an optimal element of a structure (e.g.,
polynomial degree) and estimate its coefficients from a given training sample.
For approximating functions linear in parameters, such as polynomials,
complexity is given by the number of free parameters.
For functions nonlinear in parameters, the complexity is defined as the VC
dimension. The optimal choice of model complexity provides the minimum
of the expected risk.
[Figure: classification error curves for the empirical risk, the confidence interval, and the true risk, spanning the underfitting and overfitting regimes.]
1.10.1.4 Representational and Effective Capacity
Representational capacity:
Specifies the family of functions the learning algorithm can choose from.
Effective capacity:
May be lower than the representational capacity, because imperfections in the
optimization algorithm can prevent it from finding the best function in that family.