Deep Learning Chapter 1

The document covers the basics of deep networks, focusing on linear algebra, probability distributions, and gradient-based optimization techniques in machine learning. It explains key concepts such as scalars, vectors, matrices, tensors, and various optimization algorithms like Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent. Additionally, it discusses the importance of model capacity, overfitting, and underfitting in machine learning.

UNIT I DEEP NETWORKS BASICS

Linear Algebra: Scalars - Vectors - Matrices and tensors; Probability Distributions - Gradient-based Optimization - Machine Learning Basics: Capacity - Overfitting and underfitting - Hyperparameters and validation sets - Estimators - Bias and variance - Stochastic gradient descent - Challenges motivating deep learning; Deep Networks: Deep feedforward networks; Regularization - Optimization.

1. LINEAR ALGEBRA

Linear algebra grew out of methods, developed from the 18th century onward, for finding the unknowns in systems of linear equations. It is also a prerequisite for learning machine learning and data science. Deep learning is a subdomain of machine learning concerned with algorithms that imitate the structure and function of the brain, called artificial neural networks. Linear algebra is a form of continuous rather than discrete mathematics.

1.1 USES OF LINEAR ALGEBRA

1. Optimization of data
2. Implementation of linear regression in machine learning
3. Use in neural networks and the data science field
4. Better graphics experience
5. Improved statistics
6. Creating better machine learning algorithms
7. Estimating the forecast of machine learning
8. Easy to learn

1.1.1 Better Graphics Experience

Linear algebra supports graphical processing in machine learning, for example in edge detection. Moreover, it lets us work with images, audio, video, and other data sets by representing them as matrices, and it handles large and complex data through matrix decomposition techniques.

1.1.2 Improved Statistics

Statistics is an important concept for organizing and integrating data in machine learning. Linear algebra also helps in understanding the concepts of statistics in a better manner.

1.1.3 Creating better Machine Learning algorithms

A few supervised learning algorithms that can be created using linear algebra are:

1. Logistic Regression
2. Linear Regression
3. Decision Trees
4. Support Vector Machines (SVM)

Further, some unsupervised learning algorithms that can also be created with the help of linear algebra are:

1. Singular Value Decomposition (SVD)
2. Clustering
3. Components Analysis

1.1.4 Easy to Learn

Linear algebra is a comparatively easy branch of mathematics to learn, and it is drawn on wherever more advanced mathematics and its applications are needed.

1.2 EXAMPLES OF LINEAR ALGEBRA IN MACHINE LEARNING

Below are some popular examples of linear algebra in machine learning:

1. Datasets and Data Files
2. Linear Regression
3. Recommender Systems
4. One-hot encoding
5. Regularization
6. Principal Component Analysis
7. Images and Photographs
8. Singular-Value Decomposition
9. Deep Learning
10. Latent Semantic Analysis

1.3 SCALARS

A scalar is just a single number, in contrast to most of the other objects studied in linear algebra, which are usually arrays of multiple numbers. We write scalars in italics and usually give them lower-case variable names. For example, we might say "Let $s \in \mathbb{R}$ be the slope of the line" while defining a real-valued scalar, or "Let $n \in \mathbb{N}$ be the number of units" while defining a natural-number scalar.

1.4 VECTORS

A vector is an array of numbers arranged in order. We can identify each individual number by its index in that ordering. Typically we give vectors lower-case names written in bold typeface, such as x. The elements of the vector are identified by writing its name in italic typeface, with a subscript. The first element of x is $x_1$, the second element is $x_2$, and so on.

1.5 MATRICES

A matrix is a 2-D array of numbers. We usually give matrices upper-case variable names with bold typeface, such as A. If a real-valued matrix A has a height of m and a width of n, then we say that $A \in \mathbb{R}^{m \times n}$. We usually identify the elements of a matrix using its name in italic but not bold font, with the indices listed and separated by commas:

$$A = \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix}$$

1.6 TENSORS

In some cases we will need an array with more than two axes.
In the general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor. We denote a tensor named "A" with this typeface: A. We identify the element of A at coordinates (i, j, k) by writing $A_{i,j,k}$.

Figure: a grid of numbers whose rows are labelled with the letters of the word "tensor":

    't'   3  1  4  1
    'e'   5  9  2  6
    'n'   5  3  5  8
    's'   9  7  9  3
    'o'   2  3  8  4
    'r'   6  2  6  4

1.7 PROBABILITY DISTRIBUTIONS

Probability is a mathematical concept that expresses how likely an event is to occur. Probability values lie between 0 and 1. This fundamental idea also underlies probability distributions.

1.7.1 Discrete Variable and Probability Mass Function

The probability mass function is the function that describes the probability associated with a discrete random variable X. This function is written P(X), or P(X = x) to avoid confusion. P(X = x) is the probability that the random variable X takes the value x.

1.8 GRADIENT-BASED OPTIMIZATION

1.8.1 Optimizer

Optimizers update the parameters of a neural network, such as its weights, and may also adjust the learning rate, in order to minimize the loss function. Here, the loss function acts as a guide to the terrain, telling the optimizer whether it is moving in the right direction to reach the bottom of the valley, the global minimum.

1.8.2 The Intuition behind Optimizers with an Example

Imagine a climber hiking down a hill with no sense of direction. He does not know the right way to reach the valley, but he can tell whether he is moving closer to his destination (going downhill) or further away from it (going uphill). If he keeps taking steps in the correct direction, he will reach his aim, i.e., the valley. This is exactly the intuition behind optimizers: to reach a global minimum of the loss function.
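The hiker analogy can be made concrete with a small sketch. The one-dimensional loss `loss(w) = (w - 3)**2`, the step size, and the function name `hike_downhill` are assumptions chosen purely for illustration. The hiker never sees the whole landscape; he only compares the loss at neighbouring positions and moves when a step goes downhill.

```python
def loss(w):
    """Hypothetical 1-D loss surface whose valley (minimum) is at w = 3."""
    return (w - 3.0) ** 2


def hike_downhill(w, step=0.1, max_steps=1000):
    """Step left or right, whichever lowers the loss; stop when neither does."""
    for _ in range(max_steps):
        candidates = [w - step, w + step]
        best = min(candidates, key=loss)
        if loss(best) >= loss(w):
            break  # every direction is uphill: we are at the bottom
        w = best
    return w


w_final = hike_downhill(w=0.0)
print(f"stopped at w = {w_final:.2f}, loss = {loss(w_final):.4f}")
```

Gradient-based optimizers refine this idea: instead of probing both directions, they use the slope of the loss to decide which way to step and how far.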
1.8.3 Instances of Gradient-Based Optimizers

Different instances of gradient descent based optimizers are as follows:

- Batch Gradient Descent, also called Vanilla Gradient Descent or simply Gradient Descent (GD)
- Stochastic Gradient Descent (SGD)
- Mini-Batch Gradient Descent (MB-GD)

1.8.4 Batch Gradient Descent

Gradient descent is an optimization algorithm used when training deep learning models. It is based on a convex function and updates its parameters iteratively to drive a given function towards its local minimum. The update rule is:

$$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

In the above formula,

- $\alpha$ is the learning rate,
- $J$ is the cost function, and
- $\theta$ is the parameter to be updated.

As you can see, the gradient is the partial derivative of J (the cost function) with respect to $\theta_j$. Note that, as we get closer to the global minimum, the slope of the curve becomes less and less steep, which results in a smaller derivative. This in turn reduces the step size automatically, even though the learning rate $\alpha$ itself stays fixed.

It is the most basic but most widely used optimizer: it directly uses the derivative of the loss function and the learning rate to reduce the loss and tries to reach the global minimum. The gradient descent algorithm therefore has many applications, including:

- Linear Regression,
- Classification Algorithms,
- Backpropagation in Neural Networks, etc.

[Figure: cost plotted against weight, with the initial weight and the incremental steps marked.]

Our aim is to reach the bottom of the graph (cost vs. weight), or a point where we can no longer move downhill: a local minimum.

1.8.5 Role of Gradient

[Figure: cost curve with the gradient (derivative of cost) and the minimum cost marked.]

In general, the gradient represents the slope of the function. Gradients are partial derivatives, and they describe the change in the loss function with respect to a small change in the parameters of the function. This slight change in the loss function tells us the next step to take to reduce the output of the loss function.
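The update rule of Section 1.8.4 can be tried out on a small problem. The following is a minimal sketch, assuming a toy linear-regression task with synthetic data, a mean-squared-error cost, and a learning rate of 0.5; the names `cost` and `gradient` and all numeric values are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Hypothetical toy data: y = 2x + 1 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + 1.0 + 0.01 * rng.normal(size=100)

# Design matrix with a column of ones, so theta = [bias, slope].
X = np.column_stack([np.ones_like(x), x])


def cost(theta):
    """Cost J(theta): half the mean squared error over the whole dataset."""
    residual = X @ theta - y
    return 0.5 * np.mean(residual ** 2)


def gradient(theta):
    """Vector of partial derivatives dJ/dtheta_j over the whole dataset."""
    return X.T @ (X @ theta - y) / len(y)


theta = np.zeros(2)
alpha = 0.5  # learning rate (an assumed value)
for _ in range(500):
    # theta_j = theta_j - alpha * dJ/dtheta_j, applied to all j at once
    theta = theta - alpha * gradient(theta)

print("theta:", theta, "cost:", cost(theta))
```

Because the gradient shrinks as the parameters approach the minimum, the steps become smaller on their own while `alpha` stays constant.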
1.8.6 Role of Learning Rate

The learning rate represents the size of the steps our optimization algorithm takes to reach the minimum. To ensure that gradient descent reaches the local minimum, we must set the learning rate to an appropriate value, neither too low nor too high. Taking very large steps, i.e., a large learning rate, may skip over the minimum, and the model will never reach the optimal value of the loss function. On the contrary, taking very small steps, i.e., a small learning rate, will take a very long time to converge. The size of each step also depends on the gradient value.

[Figure: two panels, "Big learning rate" and "Small learning rate".]

The gradient represents the direction of increase. Our aim is to find the minimum point in the valley, so we have to go in the opposite direction of the gradient. Therefore, we update the parameters in the negative gradient direction to minimize the loss.

Algorithm: $\theta = \theta - \alpha \cdot \nabla J(\theta)$

In code, Batch Gradient Descent looks something like this:

    for _ in range(epochs):
        # gradient of the loss computed over the entire dataset
        params_gradient = find_gradient(loss_function, data, parameters)
        # step in the direction opposite to the gradient
        parameters = parameters - learning_rate * params_gradient

Advantages of Batch Gradient Descent

1. Easy computation.
2. Easy to implement.
3. Easy to understand.

Disadvantages of Batch Gradient Descent

1. May get trapped at local minima.
2. Weights are changed only after calculating the gradient on the whole dataset, so if the dataset is too large, it may take a very long time to converge to the minimum.
3. Requires a large amount of memory to calculate the gradient on the whole dataset.

1.9 STOCHASTIC GRADIENT DESCENT

1. To overcome some of the disadvantages of the GD algorithm, the SGD algorithm comes into the picture as an extension of gradient descent.
2. One of the disadvantages of the gradient descent algorithm is that it requires a lot of memory to load the entire dataset at once to compute the derivative of the loss function.
3. In the SGD algorithm, we compute the derivative using one data point at a time, i.e., we update the model's parameters more frequently.
4. The model parameters are therefore updated after the computation of the loss on each training example.

For example, if a dataset contains 1000 rows, SGD updates the model parameters 1000 times in one complete pass over the dataset, instead of once as in gradient descent.

Algorithm: $\theta = \theta - \alpha \cdot \nabla J(\theta; x^{(i)}, y^{(i)})$, where $\{x^{(i)}, y^{(i)}\}$ are the training examples.

We want training to be even faster, so we take a gradient descent step for each training example. The figure below shows the implications.

Figure: SGD vs GD. Left panel: Stochastic Gradient Descent; right panel: Gradient Descent. "+" denotes a minimum of the cost. SGD leads to many oscillations to reach convergence, but each step is a lot faster to compute for SGD than for GD, as it uses only one training example (vs. the whole batch for GD).

Some insights from the figure:

1. The left panel shows SGD, where we take one gradient descent step per training example; the right panel shows GD, which takes one step per pass over the entire training set.
2. SGD is quite noisy, but each step is much cheaper, and it is possible that it does not settle exactly at a minimum.
3. SGD takes more iterations than GD to reach the minimum.
4. GD, by contrast, takes fewer steps to reach the minimum. SGD is noisier because its frequent parameter updates have high variance, which makes the loss fluctuate with varying intensity.
5. Its code snippet simply adds a loop over the training examples and finds the gradient with respect to each of them.
    import numpy as np

    for _ in range(epochs):
        np.random.shuffle(data)  # visit the examples in a new order each epoch
        for example in data:
            # gradient of the loss on a single training example
            params_gradient = find_gradient(loss_function, example, parameters)
            parameters = parameters - learning_rate * params_gradient

Advantages of Stochastic Gradient Descent

1. Convergence takes less time than with the other variants, since the model parameters are updated frequently.
2. Requires less memory, as the gradient is computed on one example at a time rather than on the whole dataset.
3. May find new minima, since the noisy updates can move the parameters out of a local minimum.

Disadvantages of Stochastic Gradient Descent

1. High variance in model parameters.
2. Even after reaching the global minimum, it may overshoot.
3. To reach the same convergence as gradient descent, we need to slowly reduce the value of the learning rate.

1.9.1 Mini-Batch Gradient Descent

1. To overcome the large time complexity of the SGD algorithm, the MB-GD algorithm comes into the picture as an extension of SGD.
2. That is not all: it also overcomes the problems of gradient descent. It is therefore often considered the best among the variations of gradient descent. MB-GD takes a batch, or subset, of points from the dataset to compute the derivative.

[Figure: two panels, "Stochastic Gradient Descent" and "Mini-Batch Gradient Descent".]

3. It is observed that the derivative of the loss function for MB-GD is almost the same as the derivative of the loss function for GD after some number of iterations.
4. However, the number of iterations needed to reach the minimum is larger for MB-GD than for GD, and the cost of computation is also larger.
5. The weight update depends on the derivative of the loss for a batch of points. The updates in MB-GD are noisier than in GD because the derivative does not always point towards the minimum.
6. It updates the model parameters after every batch: the algorithm divides the dataset into batches and updates the parameters after each one.

Algorithm: $\theta = \theta - \alpha \cdot \nabla J(\theta; B^{(i)})$, where $\{B^{(i)}\}$ are the batches of training examples.

In the code snippet, instead of iterating over examples, we now iterate over mini-batches of size 30:

    import numpy as np

    for _ in range(epochs):
        np.random.shuffle(data)
        for batch in get_batches(data, batch_size=30):
            # gradient of the loss on one mini-batch of 30 examples
            params_gradient = find_gradient(loss_function, batch, parameters)
            parameters = parameters - learning_rate * params_gradient

Advantages of Mini-Batch Gradient Descent

1. Updates the model parameters frequently and also has less variance.
2. Requires a medium amount of memory, more than SGD but less than GD.

Disadvantages of Mini-Batch Gradient Descent

1. The parameter updates in MB-GD are much noisier than the weight updates in the GD algorithm.
2. Compared to the GD algorithm, it takes longer to converge.
3. May get stuck at local minima.

1.9.2 Challenges with all types of Gradient-based Optimizers

1.9.2.1 Optimum Learning Rate

Choosing an optimum value of the learning rate is difficult. If we choose a learning rate that is too small, gradient descent may take a very long time to converge; if it is too large, the updates may skip over the minimum. For more about this challenge, see Section 1.8.6, Role of Learning Rate.

1.9.2.2 Constant Learning Rate

All the parameters share one constant learning rate, but there may be some parameters that we do not want to change at the same rate.

1.10 MACHINE LEARNING BASICS: Capacity, Overfitting and Underfitting

1.10.1 Capacity of a model

Model capacity is the ability to fit a wide variety of functions.

1. A model with low capacity struggles to fit the training set.
2. A high-capacity model can overfit by memorizing properties of the training set.

One way to control the capacity of a learning algorithm is by choosing its hypothesis space, i.e., the set of functions that the learning algorithm is allowed to select as the solution. E.g., the linear regression algorithm has the set of all linear functions of its input as its hypothesis space.
We can generalize linear regression to include polynomials in its hypothesis space, which increases model capacity.

1.10.1.1 Capacity of Polynomial Curve Fits

A polynomial of degree 1 gives a linear regression model with the prediction

$$\hat{y} = b + wx$$

By introducing $x^2$ as another feature provided to the regression model, we can learn a model that is quadratic as a function of x:

$$\hat{y} = b + w_1 x + w_2 x^2$$

The output is still a linear function of the parameters, so we can use the normal equations to train in closed form. We can continue to add more powers of x as additional features, e.g., a polynomial of degree 9:

$$\hat{y} = b + \sum_{i=1}^{9} w_i x^i$$

1.10.1.2 Appropriate Capacity

1. Machine learning algorithms perform well when their capacity is appropriate for the true complexity of the task they need to perform and the amount of training data they are provided with.
2. Models with insufficient capacity are unable to solve complex tasks.
3. Models with high capacity can solve complex tasks, but when their capacity is higher than needed to solve the present task, they may overfit.

1.10.1.3 Ordering Learning Machines by Capacity

The goal of learning is to choose an optimal element of a structure (e.g., polynomial degree) and estimate its coefficients from a given training sample. For approximating functions that are linear in their parameters, such as polynomials, complexity is given by the number of free parameters. For functions that are nonlinear in their parameters, complexity is defined as the VC-dimension. The optimal choice of model complexity provides the minimum of the expected risk.

[Figure: classification error from the underfitting regime to the overfitting regime, with curves labelled True Risk, Empirical Risk, and Confidence Interval.]

1.10.1.4 Representational and Effective Capacity

Representational capacity: specifies the family of functions the learning algorithm can choose from.

Effective capacity: may be lower than the representational capacity, because imperfections in the optimization algorithm can prevent it from finding the best function in that family.
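The polynomial curve fits of Section 1.10.1.1 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the quadratic ground truth `true_fn`, the noise level, the ten training points, and the helper names `features`, `fit`, and `mse` are all hypothetical choices made for the example. The weights are obtained with NumPy's least-squares solver, which gives the same closed-form solution as the normal equations when the feature matrix has full column rank.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_fn(x):
    """Hypothetical quadratic ground truth used to generate the data."""
    return 1.0 - 2.0 * x + 3.0 * x ** 2


x_train = np.linspace(-1.0, 1.0, 10)
y_train = true_fn(x_train) + 0.1 * rng.normal(size=x_train.size)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = true_fn(x_test)


def features(x, degree):
    """Columns 1, x, x^2, ..., x^degree (the bias b multiplies the first)."""
    return np.vander(x, degree + 1, increasing=True)


def fit(x, y, degree):
    """Closed-form least-squares fit of the polynomial weights."""
    weights, *_ = np.linalg.lstsq(features(x, degree), y, rcond=None)
    return weights


def mse(x, y, weights):
    """Mean squared error of the fitted polynomial on (x, y)."""
    degree = weights.size - 1
    return float(np.mean((features(x, degree) @ weights - y) ** 2))


train_mse, test_mse = {}, {}
for degree in (1, 2, 9):
    w = fit(x_train, y_train, degree)
    train_mse[degree] = mse(x_train, y_train, w)
    test_mse[degree] = mse(x_test, y_test, w)
    print(f"degree {degree}: train MSE {train_mse[degree]:.4f}, "
          f"test MSE {test_mse[degree]:.4f}")
```

With ten training points, the degree-9 model has enough capacity to pass through every point, so its training error is essentially zero; its test error is typically worse than that of the degree-2 model, which is the overfitting behaviour described above. The degree-1 model underfits, since a straight line cannot represent the quadratic trend.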
