
ANN-unit 3

Back propagation is a method used to train neural networks by adjusting weights based on the error rate from previous iterations, allowing for improved model reliability and generalization. It involves calculating the gradient of the loss function layer by layer and can be applied in both static and recurrent forms. The technique is efficient and widely recognized for its effectiveness in various applications, including deep learning and error-prone tasks like image recognition.

Uploaded by

Neelesh Bhardwaj


UNIT-3

Back Propagation
Back propagation is the essence of neural network training. It is the method of fine-tuning the
weights of a neural network based on the error rate obtained in the previous epoch (i.e.,
iteration). Proper tuning of the weights allows you to reduce error rates and make the model
reliable by increasing its generalization.
Back propagation in neural network is a short form for “backward propagation of errors.” It is a
standard method of training artificial neural networks. This method helps calculate the gradient
of a loss function with respect to all the weights in the network.

How Back propagation Algorithm Works

The Back propagation algorithm in a neural network computes the gradient of the loss function with
respect to each individual weight using the chain rule. It computes the gradient efficiently, one
layer at a time, unlike a naive direct computation carried out separately for every weight. It
computes the gradient, but it does not define how the gradient is used. It generalizes the
computation in the delta rule.

Consider the following Back propagation neural network example diagram to understand:


1. Inputs X arrive through the preconnected path.
2. The input is modeled using real-valued weights W. The weights are usually initialized randomly.
3. Calculate the output of every neuron, from the input layer through the hidden layers to the
output layer.
4. Calculate the error in the outputs: Error = Actual Output – Desired Output
5. Travel back from the output layer to the hidden layers, adjusting the weights so that the
error decreases.
6. Keep repeating the process until the desired output is achieved.

Most prominent advantages of Back propagation are:

• Back propagation is fast, simple and easy to program

• It has few parameters to tune beyond the network size (such as the number of inputs) and the learning rate
• It is a flexible method as it does not require prior knowledge about the network
• It is a standard method that generally works well
• It does not need any special mention of the features of the function to be learned.

What is a Feed Forward Network?

A feed forward neural network is an artificial neural network where the nodes never form a
cycle. This kind of neural network has an input layer, hidden layers, and an output layer. It is the
first and simplest type of artificial neural network.

Types of Back propagation Networks

Two Types of Back propagation Networks are:

• Static Back-propagation
• Recurrent Back propagation

Static back-propagation:
It is one kind of back propagation network which produces a mapping of a static input for static
output. It is useful to solve static classification issues like optical character recognition.

Recurrent Back propagation:

In recurrent back propagation, activations are fed forward until a fixed value is reached. After
that, the error is computed and propagated backward.

The main difference between the two methods is that the mapping is rapid (immediate) in static
back-propagation, while it is not static in recurrent back propagation.

History of Back propagation

• In 1960 and 1961, the basic concept of continuous back propagation was derived in the context of
control theory by Henry J. Kelley and Arthur E. Bryson.
• In 1969, Bryson and Ho gave a multi-stage dynamic system optimization method.
• In 1974, Werbos stated the possibility of applying this principle in an artificial neural
network.
• In 1982, Hopfield brought his idea of a neural network.
• In 1986, by the effort of David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams,
back propagation gained recognition.
• In 1993, Wan was the first person to win an international pattern recognition contest with
the help of the back propagation method.

Back propagation Key Points

• Simplifies the network structure by removing weighted links that have the least effect on
the trained network
• You need to study a group of input and activation values to develop the relationship
between the input and hidden unit layers.
• It helps to assess the impact that a given input variable has on a network output. The
knowledge gained from this analysis should be represented in rules.
• Back propagation is especially useful for deep neural networks working on error-prone
projects, such as image or speech recognition.
• Back propagation takes advantage of the chain and power rules, which allows it
to function with any number of outputs.

Best practice Back propagation

Back propagation in neural network can be explained with the help of “Shoe Lace” analogy

Too little tension =

• Not enough constraining and very loose

Too much tension =

• Too much constraint (overtraining)

• Taking too much time (relatively slow process)
• Higher likelihood of breaking

Pulling one lace more than other =

• Discomfort (bias)

Disadvantages of using Back propagation

• The actual performance of back propagation on a specific problem is dependent on the
input data.
• Back propagation algorithm in data mining can be quite sensitive to noisy data
• You need to use the matrix-based approach for back propagation instead of mini-batch.

Summary

• A neural network is a group of connected I/O units in which each connection has an
associated weight.
• Back propagation is a short form for “backward propagation of errors.” It is a standard
method of training artificial neural networks
• Back propagation algorithm in machine learning is fast, simple and easy to program
• A feed forward BPN network is an artificial neural network.
• Two Types of Back propagation Networks are 1)Static Back-propagation 2) Recurrent
Back propagation
• In 1960 and 1961, the basic concept of continuous back propagation was derived in the context of
control theory by Henry J. Kelley and Arthur E. Bryson.
• Back propagation in data mining simplifies the network structure by removing weighted
links that have a minimal effect on the trained network.
• It is especially useful for deep neural networks working on error-prone projects, such as
image or speech recognition.
• The biggest drawback of Back propagation is that it can be sensitive to noisy data.

Back propagation is an algorithm that propagates the errors backward from the output nodes to the
input nodes. Therefore, it is simply referred to as the backward propagation of errors. It is used
in many applications of neural networks in data mining, such as character recognition and
signature verification.

Hessian Matrix

The Hessian matrix is a square matrix of second-order partial derivatives of a function. It is
named after the German mathematician Ludwig Otto Hesse, who introduced it in the 19th
century. The Hessian matrix is an important tool in optimization and machine learning, as it
provides information about the curvature of a function and can be used to determine whether a
critical point is a local minimum, maximum, or saddle point.

The Hessian matrix H of a function f(x) is defined as:

H_ij = ∂^2f/∂x_i∂x_j

Where H_ij is the entry in the i-th row and j-th column of the matrix, and x_i and x_j are the
variables of the function. When the second partial derivatives are continuous, the Hessian matrix
is symmetric, meaning that H_ij = H_ji for all i and j.

The Hessian matrix provides information about the curvature of the function at a given point. At a
critical point (a point where the gradient is zero), if all the eigenvalues of the Hessian matrix
are positive, then the function has a local minimum at that point. If all the eigenvalues are
negative, then the function has a local maximum. If the eigenvalues have both positive and negative
values, then the point is a saddle point. If some eigenvalues are zero and the rest share one sign,
the test is inconclusive.
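
The eigenvalue test can be sketched in a few lines of Python. This is only an illustration for the two-variable case, where the eigenvalues of a symmetric 2x2 matrix have a closed form; the function names and the example matrices are made up for this sketch and do not come from the notes.

```python
import math


def eigenvalues_2x2(h):
    """Eigenvalues (smallest first) of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    a, b, d = h[0][0], h[0][1], h[1][1]
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius


def classify_critical_point(h):
    """Classify a critical point from the Hessian evaluated at that point."""
    lo, hi = eigenvalues_2x2(h)
    if lo > 0:
        return "local minimum"
    if hi < 0:
        return "local maximum"
    if lo < 0 < hi:
        return "saddle point"
    return "inconclusive"  # at least one zero eigenvalue, no sign change


# Hessians at the origin of four simple functions (hypothetical examples).
examples = {
    "x^2 + y^2": [[2.0, 0.0], [0.0, 2.0]],
    "-x^2 - y^2": [[-2.0, 0.0], [0.0, -2.0]],
    "x^2 - y^2": [[2.0, 0.0], [0.0, -2.0]],
    "x^4 + y^4": [[0.0, 0.0], [0.0, 0.0]],
}
for name, hessian in examples.items():
    print(name, "->", classify_critical_point(hessian))
```

For more than two variables the same rule applies, but the eigenvalues would normally come from a numerical linear algebra routine rather than a closed form.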

The Hessian matrix is used in optimization algorithms such as Newton's method, which uses the
second-order derivative information to iteratively find the minimum of a function. In machine
learning, the Hessian matrix is used in methods such as the Hessian-free optimization, which is a
variant of Newton's method that avoids the expensive computation of the full Hessian matrix.
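
As a rough illustration of how second-order information is used, here is a one-dimensional Newton iteration, where the Hessian reduces to the second derivative. The test function f(x) = x^4/4 - x, the starting point and the tolerance are assumptions chosen for this sketch.

```python
def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Minimize a 1-D function with Newton steps x <- x - f'(x) / f''(x)."""
    x = x0
    for _ in range(max_iter):
        step = grad(x) / hess(x)
        x -= step
        if abs(step) < tol:
            break
    return x


# Hypothetical example: f(x) = x^4 / 4 - x has its minimum at x = 1.
def f_grad(x):
    return x ** 3 - 1.0   # first derivative f'(x)


def f_hess(x):
    return 3.0 * x ** 2   # second derivative f''(x)


x_min = newton_minimize(f_grad, f_hess, x0=2.0)
print("minimum found near x =", x_min)
```

In several dimensions the division becomes a linear solve with the full Hessian, which is the expensive step that Hessian-free methods are designed to avoid.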

Generalization, Cross Validation
Generalization refers to the ability of a machine learning model to perform well on new, unseen
data that was not used during the training process. The ultimate goal of any machine learning
model is to generalize well, as the model's ability to make accurate predictions on new data is
what makes it useful in practice.

Cross-validation is a technique used to assess the generalization performance of a machine
learning model. In cross-validation, the available data is split into two sets: a training set and a
validation set. The model is trained on the training set and then evaluated on the validation set.
This process is repeated several times, with different splits of the data, to get an estimate of the
model's generalization performance.

One common approach to cross-validation is k-fold cross-validation, where the data is divided
into k equal-sized subsets, or folds. The model is trained on k-1 folds and evaluated on the
remaining fold. This process is repeated k times, with each fold being used once for validation.
The results of each fold can then be averaged to get an estimate of the model's generalization
performance.
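
The k-fold procedure described above can be sketched as follows. The helper names, the toy data and the deliberately trivial "predict the training mean" model are assumptions made for illustration; in practice the data would usually be shuffled before splitting.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(xs, ys, k, fit, predict):
    """Return (mean validation MSE, per-fold MSE list) for a fit/predict pair."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)          # train on the other k-1 folds
        errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in held_out]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / k, scores


# Hypothetical baseline model: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]
mean_mse, fold_mse = cross_validate(xs, ys, k=5, fit=fit_mean, predict=predict_mean)
print("per-fold MSE:", fold_mse)
print("cross-validation estimate:", mean_mse)
```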

Cross-validation can be used to tune hyper parameters of a machine learning model, such as the
learning rate or regularization strength, by evaluating the model's performance on the validation
set for different values of the hyper parameters.

Overall, cross-validation is an important tool for assessing the generalization performance of a
machine learning model and can be used to improve the model's accuracy on new, unseen data.

Network Pruning Techniques

Network pruning is a technique used in deep learning to reduce the size of a neural network by
removing unnecessary or redundant parameters, while maintaining or even improving the
accuracy of the model. Here are some common network pruning techniques:

Weight pruning: This technique involves identifying and removing the connections in the
network that have small or zero weights. This can be done by setting a threshold value below
which the weights are pruned.

Neuron pruning: This technique involves removing entire neurons from the network, typically
those with small or zero activation. This can be done by identifying the least important neurons
using sensitivity analysis, and then removing them from the network.

Channel pruning: This technique involves removing entire channels in a convolutional neural
network that are deemed to be unnecessary or redundant. This can be done by measuring the
importance of each channel, for example by looking at the magnitude of the weights associated
with each channel.

Filter pruning: This technique involves removing entire filters in a convolutional neural network
that are deemed to be unnecessary or redundant. This can be done by measuring the importance
of each filter, for example by looking at the average activation of the feature map produced by
each filter.

Structured pruning: This technique involves removing entire substructures from the network,
such as layers, blocks or modules. This can be done by applying the above pruning techniques to
the substructures, or by using more advanced methods such as spectral clustering.

Lottery ticket hypothesis: This is a more recent idea, which holds that a randomly initialized
dense network contains a much smaller subnetwork (a “winning ticket”) that can reach similar
accuracy when trained on its own. In practice, the network is trained from scratch with random
weights, the weights that matter least are pruned, and the remaining weights are reset to their
original initial values. This smaller network is then retrained to achieve accuracy similar to, or
better than, the original network.

Overall, network pruning techniques can be used to reduce the size of neural networks, making
them more efficient to deploy on resource-constrained devices, while still maintaining high
accuracy.
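
Of the techniques above, weight pruning is the simplest to sketch. The following is a minimal illustration of threshold-based magnitude pruning on a plain list-of-lists weight matrix; the function name, the example weights and the threshold value are all hypothetical.

```python
def prune_by_magnitude(weights, threshold):
    """Zero out every weight whose absolute value is below the threshold.

    Returns the pruned matrix and the resulting sparsity (fraction of zeros).
    """
    pruned = [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in pruned for w in row if w == 0.0)
    return pruned, zeros / total


# Hypothetical 2x3 weight matrix of a small layer.
layer_weights = [
    [0.8, -0.02, 0.3],
    [0.01, -0.6, 0.05],
]
pruned_weights, sparsity = prune_by_magnitude(layer_weights, threshold=0.1)
print(pruned_weights)
print("sparsity:", sparsity)
```

A real pipeline would typically fine-tune the network after pruning so that the remaining weights can compensate for the removed connections.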

Neural Network:

Neural networks are an information processing paradigm inspired by the human nervous system.
Just as the human nervous system has biological neurons, neural networks have artificial
neurons, which are mathematical functions derived from biological neurons. The human brain is
estimated to have tens of billions of neurons (a commonly cited figure is about 86 billion), each
connected to thousands of other neurons. Each neuron receives signals through synapses, which
control the effect of the signal on the neuron.

Back propagation:

Back propagation is a widely used algorithm for training feed forward neural networks. It
computes the gradient of the loss function with respect to the network weights, and does so far
more efficiently than naively computing the gradient with respect to each weight separately. This
efficiency makes it possible to use gradient methods to train multi-layer networks and update
weights to minimize loss; variants such as gradient descent or stochastic gradient descent are
often used.
The back propagation algorithm works by computing the gradient of the loss function with
respect to each weight via the chain rule, computing the gradient layer by layer, and iterating
backward from the last layer to avoid redundant computation of intermediate terms in the chain
rule.

Features of Back propagation:

1. It is the gradient descent method, as used in the case of a simple perceptron network with
differentiable units.
2. It differs from other networks in the process by which the weights are
calculated during the learning period of the network.
3. Training is done in three stages:
• the feed-forward of the input training pattern
• the calculation and back propagation of the error
• the updating of the weights
Working of Back propagation:
Neural networks use supervised learning to generate output vectors from the input vectors that the
network operates on. The network compares the generated output to the desired output and computes
an error if the two do not match. It then adjusts the weights according to that error to move
toward the desired output.

Back propagation Algorithm:

Step 1: Inputs X arrive through the preconnected path.

Step 2: The input is modeled using real-valued weights W. Weights are usually chosen randomly.

Step 3: Calculate the output of each neuron, from the input layer through the hidden layer to the
output layer.
Step 4: Calculate the error in the outputs:
Back propagation Error = Actual Output – Desired Output
Step 5: From the output layer, go back to the hidden layer and adjust the weights to reduce the
error.
Step 6: Repeat the process until the desired output is achieved.

Parameters:
• x = input training vector, x = (x1, x2, …, xn).
• t = target vector, t = (t1, t2, …, tm).
• δk = error at output unit k.
• δj = error at hidden unit j.
• α = learning rate.
• v0j = bias of hidden unit j.
Training Algorithm:
Step 1: Initialize the weights to small random values.
Step 2: While the stopping condition is false, do steps 3 to 9.
Step 3: For each training pair, do steps 4 to 8.
Feed-Forward:
Step 4: Each input unit receives the input signal xi and transmits it to all the units in the
hidden layer.
Step 5: Each hidden unit zj (j = 1 to a) sums its weighted input signals to calculate its net input:
zinj = v0j + Σ xi vij (i = 1 to n)
It applies the activation function, zj = f(zinj), and sends this signal to all units in the layer
above, i.e. the output units.
Each output unit yk (k = 1 to m) sums its weighted input signals:
yink = w0k + Σ zj wjk (j = 1 to a)
and applies its activation function to calculate the output signal:
yk = f(yink)

Back propagation Error:

Step 6: Each output unit yk (k = 1 to m) receives a target pattern corresponding to the input
pattern, and its error term is calculated as:
δk = (tk – yk) f′(yink)
Step 7: Each hidden unit zj (j = 1 to a) sums its delta inputs from all units in the layer above:
δinj = Σ δk wjk (k = 1 to m)
The error information term is calculated as:
δj = δinj f′(zinj)

Updating of weight and bias:

Step 8: Each output unit yk (k = 1 to m) updates its bias and weights (j = 1 to a). The weight
correction term is given by:
Δwjk = α δk zj
And the bias correction term is given by:
Δw0k = α δk
Therefore:
wjk(new) = wjk(old) + Δwjk
w0k(new) = w0k(old) + Δw0k
Each hidden unit zj (j = 1 to a) updates its bias and weights (i = 1 to n). The weight
correction term is given by:
Δvij = α δj xi
And the bias correction term is given by:
Δv0j = α δj
Therefore:
vij(new) = vij(old) + Δvij
v0j(new) = v0j(old) + Δv0j
Step 9: Test the stopping condition. The stopping condition can be the error falling below a chosen
threshold or a maximum number of epochs being reached.
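
The training algorithm above can be sketched in plain Python for a network with one hidden layer. Several choices here are assumptions rather than part of the notes: the sigmoid activation (so that f′ can be written as f(1 - f)), the use of the standard error terms δk = (tk - yk) f′(yink) and δj = δinj f′(zinj), the OR function as a toy dataset, and all names, the seed, the learning rate and the epoch count.

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def train(patterns, n_hidden=2, alpha=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(patterns[0][0]), len(patterns[0][1])
    # Step 1: small random weights; column 0 of each row is the bias (v0j, w0k).
    v = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    history = []
    for _ in range(epochs):                      # Steps 2 and 9: fixed epoch budget
        epoch_error = 0.0
        for x, t in patterns:                    # Step 3: each training pair
            # Steps 4-5: feed-forward through hidden and output layers.
            z = [sigmoid(v[j][0] + sum(x[i] * v[j][i + 1] for i in range(n_in)))
                 for j in range(n_hidden)]
            y = [sigmoid(w[k][0] + sum(z[j] * w[k][j + 1] for j in range(n_hidden)))
                 for k in range(n_out)]
            # Step 6: output error terms, using f'(s) = f(s) * (1 - f(s)).
            dk = [(t[k] - y[k]) * y[k] * (1.0 - y[k]) for k in range(n_out)]
            # Step 7: hidden error terms, computed before any weight changes.
            dj = [sum(dk[k] * w[k][j + 1] for k in range(n_out)) * z[j] * (1.0 - z[j])
                  for j in range(n_hidden)]
            # Step 8: update output-layer, then hidden-layer weights and biases.
            for k in range(n_out):
                w[k][0] += alpha * dk[k]
                for j in range(n_hidden):
                    w[k][j + 1] += alpha * dk[k] * z[j]
            for j in range(n_hidden):
                v[j][0] += alpha * dj[j]
                for i in range(n_in):
                    v[j][i + 1] += alpha * dj[j] * x[i]
            epoch_error += 0.5 * sum((t[k] - y[k]) ** 2 for k in range(n_out))
        history.append(epoch_error)
    return v, w, history


# Toy dataset (an assumption for this sketch): the logical OR function.
or_patterns = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
               ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
v, w, history = train(or_patterns)
print("error in first epoch:", history[0])
print("error in last epoch:", history[-1])
```

The stopping condition here is simply a fixed number of epochs; an error threshold could be checked on `epoch_error` instead.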

Need for Back propagation:

Back propagation is the “back propagation of errors” and is very useful for training neural networks.
It is fast, simple, and easy to implement. Back propagation has few parameters to set beyond the
network size (such as the number of inputs) and the learning rate. Back propagation is a flexible
method because no prior knowledge of the network is required.

Types of Back propagation

There are two types of back propagation networks.

• Static back propagation: Static back propagation is a network designed to map static inputs
to static outputs. These types of networks are capable of solving static classification
problems such as OCR (Optical Character Recognition).

• Recurrent back propagation: Recurrent back propagation is another network, used for
fixed-point learning. Activations in recurrent back propagation are fed forward until a fixed
value is reached. Static back propagation provides an instant mapping, while recurrent back
propagation does not provide an instant mapping.

Advantages:

• It is simple, fast, and easy to program.
• Few parameters need tuning beyond the network size (such as the number of inputs) and the learning rate.
• It is flexible and efficient.
• No need for users to learn any special functions.

Disadvantages:

• It is sensitive to noisy data and irregularities. Noisy data can lead to inaccurate results.
• Performance is highly dependent on input data.
• Training can take a long time.
• The matrix-based approach is preferred over a mini-batch.

The algorithm trains a neural network effectively by applying the chain rule. In simple terms,
after each forward pass through a network, back propagation performs a backward pass while
adjusting the model’s parameters (weights and biases).

Define the neural network model

The 4-layer neural network consists of 4 neurons for the input layer, 4 neurons for the hidden
layers (2 in each of the two hidden layers) and 1 neuron for the output layer.

Simple 4-layer neural network illustration

Input layer

The neurons, colored in purple, represent the input data. These can be as simple as scalars or
more complex like vectors or multidimensional matrices.

(a_i)¹ = x_i, for i = 1, 2, 3, 4

The first set of activations (a) is equal to the input values. NB: “activation” is the neuron’s value
after applying an activation function. See below.

Hidden layers

The final values at the hidden neurons, colored in green, are computed using z^l (the weighted
inputs in layer l) and a^l (the activations in layer l). For layers 2 and 3 the equations are:

• l = 2: z² = W¹x + b¹, a² = f(z²)

• l = 3: z³ = W²a² + b², a³ = f(z³)

W¹ and W² are the weight matrices feeding layers 2 and 3, while b¹ and b² are the corresponding
bias vectors. Activations a² and a³ are computed using an activation function f. Typically, this
function f is non-linear (e.g. sigmoid, ReLU, tanh) and allows the network to learn complex
patterns in data.

Looking carefully, you can see that all of x, z², a², z³, a³, W¹, W², b¹ and b² are missing the
subscripts presented in the 4-layer network illustration above. The reason is that we have
combined all parameter values in matrices, grouped by layers. This is the standard way of
working with neural networks and one should be comfortable with the calculations. However, I
will go over the equations to clear up any confusion.

• W¹ is a weight matrix of shape (n, m) where n is the number of output neurons (neurons in
the next layer) and m is the number of input neurons (neurons in the previous layer). For us, n
= 2 and m = 4.

W¹ = [ (W_11)¹  (W_12)¹  (W_13)¹  (W_14)¹ ]
     [ (W_21)¹  (W_22)¹  (W_23)¹  (W_24)¹ ]

NB: The first number in any weight’s subscript matches the index of the neuron in the next
layer (in our case this is the Hidden_1 layer) and the second number matches the index of the
neuron in the previous layer (in our case this is the Input layer).

• x is the input vector of shape (m, 1) where m is the number of input neurons. For us, m = 4.

x = [x_1, x_2, x_3, x_4]ᵀ

• b¹ is a bias vector of shape (n, 1) where n is the number of neurons in the current layer. For
us, n = 2.

b¹ = [(b_1)¹, (b_2)¹]ᵀ

Following the equation for z², we can use the above definitions of W¹, x and b¹ to write it out row by row:

(z_1)² = (W_11)¹x_1 + (W_12)¹x_2 + (W_13)¹x_3 + (W_14)¹x_4 + (b_1)¹
(z_2)² = (W_21)¹x_1 + (W_22)¹x_2 + (W_23)¹x_3 + (W_24)¹x_4 + (b_2)¹

Now carefully observe the neural network illustration from above (the Input and Hidden_1 layers).
You will see that z² can be expressed using (z_1)² and (z_2)², where (z_1)² and (z_2)² are the sums
of the products of every input x_i with the corresponding weight (W_ij)¹, plus the bias. This leads
to the same equation for z² and proves that the matrix representations for z², a², z³ and a³ are
correct.
Output layer

The final part of a neural network is the output layer, which produces the predicted value. In our
simple example, it is presented as a single neuron, colored in blue and evaluated as follows:

Equation for output s

Again, we are using the matrix representation to simplify the equation. One
can use the above techniques to understand the underlying logic.
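
To make the matrix notation concrete, here is a small forward pass through a 4-2-2-1 network in plain Python. All numeric weights and biases are made up for illustration, the activation f is assumed to be the sigmoid, and the output is assumed to be the plain weighted sum s = W³a³ because the notes do not reproduce the output equation.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def affine(W, a, b):
    """Compute W a + b for a list-of-rows matrix W and vectors a, b."""
    return [sum(w * ai for w, ai in zip(row, a)) + bias for row, bias in zip(W, b)]


def forward(x, W1, b1, W2, b2, W3):
    z2 = affine(W1, x, b1)               # z2 = W1 x + b1, shape (2,)
    a2 = [sigmoid(val) for val in z2]    # a2 = f(z2)
    z3 = affine(W2, a2, b2)              # z3 = W2 a2 + b2, shape (2,)
    a3 = [sigmoid(val) for val in z3]    # a3 = f(z3)
    s = sum(w * a for w, a in zip(W3[0], a3))  # assumed output: s = W3 a3
    return z2, a2, z3, a3, s


# Hypothetical parameters: W1 is (2, 4), W2 is (2, 2), W3 is (1, 2).
x = [1.0, 2.0, 3.0, 4.0]
W1 = [[0.1, 0.2, 0.3, 0.4],
      [0.0, 0.0, 0.0, 0.0]]
b1 = [0.0, 1.0]
W2 = [[0.5, -0.5],
      [0.25, 0.75]]
b2 = [0.0, 0.0]
W3 = [[1.0, -1.0]]

z2, a2, z3, a3, s = forward(x, W1, b1, W2, b2, W3)
print("z2 =", z2)
print("a3 =", a3)
print("s  =", s)
```

Each row of a weight matrix holds the weights arriving at one neuron of the next layer, which matches the subscript convention described above.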

Forward propagation and evaluation

The equations above form the network’s forward propagation. Here is a short overview:

Overview of forward propagation equations, colored by layer

The final step in a forward pass is to evaluate the predicted output s against an expected output
y. The output y is part of the training dataset (x, y) where x is the input (as we saw in the
previous section). Evaluation between s and y happens through a cost function. This can be as
simple as MSE (mean squared error) or more complex like cross-entropy.

We name this cost function C and denote it as follows:

C = cost(s, y)

The gradient of a function C(x_1, x_2, …, x_m) at a point x is the vector of the partial derivatives of C at x:

∂C/∂x = [∂C/∂x_1, ∂C/∂x_2, …, ∂C/∂x_m]

• The derivative of a function C measures the sensitivity to change of the function value
(output value) with respect to a change in its argument x (input value). In other words, the
derivative tells us the direction in which C is changing.
• The gradient shows how much the parameter x needs to change (in a positive or negative
direction) to minimize C.

Those gradients are computed using a technique called the chain rule.

For a single weight (w_jk) ^l, the gradient is:

Equations for derivative of C in a single weight (w_jk) ^l

Similar set of equations can be applied to (b_j)^l:

Equations for derivative of C in a single bias (b_j) ^l

The common part in both equations is often called “local gradient” and is expressed as follows:

Equation for local gradient

The “local gradient” can easily be determined using the chain rule. I won’t go over the process
now. The gradients allow us to optimize the model’s parameters:

Algorithm for optimizing weights and biases (also called “Gradient descent”)

• Initial values of w and b are randomly chosen.

• Epsilon (e) is the learning rate. It determines the gradient’s influence.
• W and b are matrix representations of the weights and biases. Derivative of C in w or b can
be calculated using partial derivatives of C in the individual weights or biases.
• Termination condition is met once the cost function is minimized.
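
The update loop described by these bullets can be sketched on a deliberately tiny model. The linear model s = w·x + b, the mean squared error cost, the toy data and the fixed step count (used in place of a "cost is minimized" check) are all assumptions made for this illustration.

```python
import random


def gradient_descent(xs, ys, epsilon=0.05, steps=5000, seed=0):
    """Fit s = w * x + b by gradient descent on the mean squared error C."""
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)   # initial values of w and b are chosen randomly
    b = rng.uniform(-1.0, 1.0)
    n = len(xs)
    for _ in range(steps):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        dC_dw = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        dC_db = 2.0 / n * sum(residuals)
        w -= epsilon * dC_dw     # step against the gradient, scaled by epsilon
        b -= epsilon * dC_db
    return w, b


# Toy data generated from y = 2x + 1, so the minimum of C is at w = 2, b = 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w_fit, b_fit = gradient_descent(xs, ys)
print("w =", w_fit, "b =", b_fit)
```

If epsilon is set too large the updates overshoot and diverge; if it is too small, convergence becomes slow.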
I would like to dedicate the final part of this section to a simple example in which we will
calculate the gradient of C with respect to a single weight (w_22)².

Let’s zoom in on the bottom part of the above neural network:

Visual representation of back propagation in a neural network

Weight (w_22)² connects (a_2)² and (z_2)³, so computing the gradient requires applying the chain
rule through (z_2)³ and (a_2)³:

∂C/∂(w_22)² = ∂C/∂(a_2)³ · ∂(a_2)³/∂(z_2)³ · ∂(z_2)³/∂(w_22)²

Calculating the final value of the derivative of C with respect to (a_2)³ requires knowledge of the
function C. Since C is dependent on (a_2)³, calculating the derivative should be fairly
straightforward. I hope this example manages to throw some light on the mathematics behind
computing gradients. To further enhance your skills, I strongly recommend watching.
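
The chain-rule product for (w_22)² can be checked numerically. In the sketch below the cost is assumed to be C = 0.5 · Σ((a_j)³ - y_j)² applied directly to the layer-3 activations, f is assumed to be the sigmoid, and every numeric value is made up; the analytic gradient is compared with a central finite difference.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def cost(W2, b2, a2, y):
    """Assumed cost C = 0.5 * sum((a3_j - y_j)^2) with a3 = sigmoid(W2 a2 + b2)."""
    total = 0.0
    for row, bias, target in zip(W2, b2, y):
        z3 = sum(w * a for w, a in zip(row, a2)) + bias
        total += 0.5 * (sigmoid(z3) - target) ** 2
    return total


def grad_w22(W2, b2, a2, y):
    """dC/dw22 = dC/da3_2 * da3_2/dz3_2 * dz3_2/dw22 (chain rule)."""
    z3_2 = sum(w * a for w, a in zip(W2[1], a2)) + b2[1]
    a3_2 = sigmoid(z3_2)
    dC_da = a3_2 - y[1]            # derivative of the assumed cost
    da_dz = a3_2 * (1.0 - a3_2)    # derivative of the sigmoid
    dz_dw = a2[1]                  # z3_2 depends on w22 through a2_2
    return dC_da * da_dz * dz_dw


def numeric_grad_w22(W2, b2, a2, y, h=1e-6):
    plus = [row[:] for row in W2]
    minus = [row[:] for row in W2]
    plus[1][1] += h
    minus[1][1] -= h
    return (cost(plus, b2, a2, y) - cost(minus, b2, a2, y)) / (2.0 * h)


# Hypothetical values; row index 1, column index 1 holds w22.
W2 = [[0.2, -0.4], [0.7, 0.1]]
b2 = [0.0, 0.3]
a2 = [0.6, 0.9]
y = [1.0, 0.0]

analytic = grad_w22(W2, b2, a2, y)
numeric = numeric_grad_w22(W2, b2, a2, y)
print("chain rule:       ", analytic)
print("finite difference:", numeric)
```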

What Is A Hessian Matrix?
The Hessian matrix is a square matrix of second-order partial derivatives of a function with
respect to its variables. In machine learning, the function is typically the loss function of a model
with respect to its parameters. The Hessian matrix is denoted by H and is defined as:

H_ij = ∂²f / (∂θ_i ∂θ_j)

Where f is the function, θ_i and θ_j are the ith and jth parameters of the model, and ∂²f / (∂θ_i
∂θ_j) is the second-order partial derivative of f with respect to θ_i and θ_j.

The Hessian matrix is a square matrix with the same number of rows and columns as the number
of parameters in the model. The element H_ij of the Hessian matrix describes how the gradient of
the loss with respect to the ith parameter changes as the jth parameter varies. A diagonal entry
H_ii measures the curvature of the loss along the ith parameter: a positive value means the loss
curves upward in that direction, and a negative value means it curves downward. An off-diagonal
entry H_ij measures the interaction between the two parameters, and a zero value means they have
no second-order interaction at that point.

The Hessian matrix provides important information about the curvature of the loss function,
particularly around critical points such as local minima, saddle points, and maxima. The
eigenvalues and eigenvectors of the Hessian matrix can be used to analyze the behavior of the model
at these critical points and to optimize the model using second-order optimization algorithms such
as Newton's method. The Hessian matrix is also relevant to the analysis of regularization
techniques such as weight decay, and to compression techniques such as Hessian-based pruning.
The determinant of the Hessian is also called the discriminant of f. For a two-variable function
f(x, y), it is given by:

D(x, y) = f_xx(x, y) f_yy(x, y) - (f_xy(x, y))^2
Examples of Hessian Matrices and Discriminants
Suppose we have the following function:

g(x, y) = x^3 + 2y^2 + 3xy^2

Then the Hessian H_g and the discriminant D_g are given by:

H_g(x, y) = [ 6x   6y     ]
            [ 6y   4 + 6x ]

D_g(x, y) = 6x(4 + 6x) - (6y)^2 = 36x^2 + 24x - 36y^2

Let’s evaluate the discriminant at different points:

D_g(0, 0) = 0

D_g(1, 0) = 36 + 24 = 60

D_g(0, 1) = -36

D_g(-1, 0) = 12
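
These values can be reproduced with a few lines of Python. The second derivatives of g are written out by hand (g_xx = 6x, g_xy = 6y, g_yy = 4 + 6x); the function names are chosen for this sketch only.

```python
def hessian_g(x, y):
    """Hessian of g(x, y) = x^3 + 2y^2 + 3xy^2, from hand-derived second partials."""
    g_xx = 6.0 * x
    g_xy = 6.0 * y
    g_yy = 4.0 + 6.0 * x
    return [[g_xx, g_xy], [g_xy, g_yy]]


def determinant_2x2(h):
    """Determinant of a 2x2 matrix: g_xx * g_yy - g_xy^2 for a Hessian."""
    return h[0][0] * h[1][1] - h[0][1] * h[1][0]


for point in [(0, 0), (1, 0), (0, 1), (-1, 0)]:
    value = determinant_2x2(hessian_g(*point))
    print("D_g", point, "=", value)
```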

What Do The Hessian And Discriminant Signify?

The Hessian and the corresponding discriminant are used to determine the local extreme points
of a function. Evaluating them helps in understanding a function of several variables. Here
are some important rules for a critical point (a, b), i.e. a point where f_x(a, b) = 0 and
f_y(a, b) = 0, and where the discriminant is D(a, b):

1. The function f has a local minimum if f_xx(a, b) > 0 and the discriminant D(a, b) > 0
2. The function f has a local maximum if f_xx(a, b) < 0 and the discriminant D(a, b) > 0
3. The function f has a saddle point if D(a, b) < 0
4. We cannot draw any conclusions if D(a, b) = 0 and need more tests
Example: g(x, y)

For the function g(x, y), the first derivatives are g_x = 3x^2 + 3y^2 and g_y = 4y + 6xy, so (0, 0)
is the only critical point. The other three points are not critical points, so there the signs
below describe only the local curvature of g, not extrema:

1. We cannot draw any conclusions for the point (0, 0), since D_g(0, 0) = 0
2. g_xx(1, 0) = 6 > 0 and D_g(1, 0) = 60 > 0, hence g is locally convex (curves upward in every
direction) at (1, 0)
3. D_g(0, 1) = -36 < 0, hence g has saddle-like curvature at (0, 1)
4. g_xx(-1, 0) = -6 < 0 and D_g(-1, 0) = 12 > 0, hence g is locally concave at (-1, 0)
The figure below shows a graph of the function g(x, y) and its corresponding contours.

Why Is The Hessian Matrix Important In Machine Learning?

The Hessian matrix is an important mathematical tool in machine learning, particularly in the
context of optimization algorithms. Here are some reasons why the Hessian matrix is important:

Second-order optimization: The Hessian matrix provides information about the curvature of the
loss surface, which is useful for second-order optimization algorithms. Second-order
optimization algorithms, such as Newton's method, use the Hessian matrix to optimize the model
parameters more efficiently and accurately than first-order optimization algorithms.

Understanding model behavior: The Hessian matrix can help us understand the behavior of the
model, particularly around critical points such as local minima, saddle points, and maxima. The
eigenvalues and eigenvectors of the Hessian matrix provide information about the direction and
curvature of the loss surface at these critical points.

Regularization: The Hessian matrix helps in analyzing regularization techniques such as weight
decay and Hessian-based early stopping. Weight decay is a regularization technique that
penalizes large weights by adding a term proportional to the squared L2 norm of the weights to the
loss. The Hessian of the loss determines how strongly this penalty shrinks each direction in weight
space: directions with small curvature (small eigenvalues) are shrunk the most. Hessian-based early
stopping involves stopping the training process when the magnitude of the Hessian becomes too
large, which is intended to help prevent overfitting.

Model compression: The Hessian matrix can also be used for model compression techniques
such as Hessian-based pruning. Hessian-based pruning uses second-order information to estimate
how much the loss would increase if a given weight were removed, and removes the weights with
the smallest estimated impact, which results in a compressed model with fewer parameters.

5-fold cross-validation model fits for a simulated land value prediction task. The quadratic form
has lower cross-validation error, so we’ll re-fit and deploy that one.

For our running example, we set K = 5 and use the squared error loss function. The quadratic
model has a lower CV error than the linear model, so we choose that model form to re-fit to the
full dataset then deploy. Nice and tidy, let’s ship it.

That’s pretty much where most instructional texts and most practitioners (including myself,
historically) leave things. Not so fast.

What is generalization error?

When we evaluate a model’s prediction accuracy with CV or simple data splitting, we are
estimating the generalization error of the model, also known as out-of-sample error, expected
prediction error, test error, prediction risk, and population risk. Generalization error is the
model’s average prediction error on new data points from the same process that generated our
original dataset, where the error is measured with a loss function. Think of it as asking: how well
does the model generalize to previously unseen data points?

Generalization error is an unknown quantity in real-world problems, so it’s useful to ask how
well we can estimate it with methods like cross-validation and data splitting.

Our final model for the land value simulation. To make business decisions about the model, we
need to know its generalization error, i.e. its average prediction error on new data points. The
cross-validation error helped with model selection but how useful is it for this purpose?

Why do we care about generalization error?

Generalization error is the best description of a predictive model’s accuracy in deployment. As
Hastie and Tibshirani (and Friedman) point out in their 2009 textbook Elements of Statistical
Learning (ESL), we use estimates of generalization error in two different ways:

1. Model selection: choose model architecture, hyperparameters, features, and early
stopping to maximize predictive performance. This is the easier of the two tasks because
we only need to know that one model is better than another, but not exactly how accurate
each model is.
2. Model assessment: estimate the generalization error of a model, as accurately as
possible.

1. It is a great reminder to re-read ESL chapter 7 on model assessment and some of the
more recent papers cited by Bates, et al. There are many surprising things about model
assessment that are easy to forget in the hustle of industry practice.
2. Neither industry practitioners nor academic sources seem to worry much about the
rationale for model assessment, but we should. Particularly if bad predictions can be
catastrophic—as in medicine, finance, insurance, or flight control systems, for example—
we need to understand the distribution of model errors. In these cases, the question may
not be which model is best, but is any model acceptably accurate at all?
3. If the assumptions of Bates, et al. do apply to your business problem, then consider
trying their nested CV method for the generalization error confidence interval.
Admittedly, most data science problems in industry today have plenty of data, so simple
train-validation-test set splits should suffice.
4. As this research topic gathers momentum, more results will be found. Be on the lookout
and be open to updating your model evaluation procedures.

A very high-level view of the model building and evaluation pipeline

There are several metrics that can be deduced from the confusion matrix, such as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = (2 x Precision x Recall) / (Precision + Recall)

where TP is True Positive, TN is True Negative, FP is False Positive, and FN is False Negative.

Precision is the fraction of the items you flagged as relevant that actually are relevant,
whereas recall is the fraction of the actually relevant items that you flagged. Recall is also
referred to as the sensitivity of your model, whereas precision is referred to as the positive
predictive value. Now that you have grasped the concept, let's understand how to do it with ease
using the scikit-learn API and a few lines of code.
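
The four formulas can be checked by hand with a small dependency-free sketch. The labels below are made-up illustration data, and the helper names `confusion_counts` and `classification_metrics` are hypothetical. In scikit-learn the same quantities are available from `sklearn.metrics` (`confusion_matrix`, `precision_score`, `recall_score`, `f1_score`).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Apply the four formulas; return 0.0 where a denominator is zero."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
metrics = classification_metrics(tp, tn, fp, fn)
print(tp, tn, fp, fn)
print(metrics)
```

Returning 0.0 for an undefined ratio is one possible convention; other tools raise a warning instead.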

Cross Validation

Cross validation is a technique for assessing how the results of a statistical analysis
generalize to an independent data set. It evaluates machine learning models by training several
models on subsets of the available input data and evaluating them on the complementary subset of
the data. Because every model is scored on data it was not trained on, cross-validation makes
overfitting much easier to detect.

There are several cross validation techniques, such as:
1. K-Fold Cross Validation
2. Leave P-out Cross Validation
3. Leave one-out Cross Validation
4. Repeated Random Sub-sampling Method
5. Holdout Method

Remove weights or neurons?

There are different ways to prune a neural network. (1) You can prune weights. This is done by
setting individual parameters to zero and making the network sparse. This would lower the
number of parameters in the model while keeping the architecture the same. (2) You can remove
entire nodes from the network. This would make the network architecture itself smaller, while
aiming to keep the accuracy of the initial larger network.

Visualization of pruning weights/synapses vs. nodes/neurons

Weight-based pruning is more popular as it is easier to do without hurting the performance of the
network. However, it requires sparse computations to be effective, which in turn requires
hardware support and a certain amount of sparsity to be efficient. Pruning nodes allows dense
computation, which is more optimized and more often better supported on hardware, so the network
can be run normally without sparse computation. However, removing entire neurons can more easily
hurt the accuracy of the neural network.

When to prune?

A major consideration in pruning is where to put it in the training/testing machine learning
timeline. If you are using a weight magnitude-based pruning approach, as described in the
previous section, you would want to prune after training. However, after pruning, you may
observe that the model performance has suffered. This can be fixed by fine-tuning, meaning
retraining the model after pruning to restore accuracy.

How to evaluate pruning?
Evaluating the effectiveness of pruning involves comparing the performance of the pruned model
to the original model. Here are some common methods for evaluating pruning:

Test accuracy: The most straightforward way to evaluate pruning is to compare the test accuracy
of the pruned model to the original model. If the pruned model has similar or better accuracy
than the original model, it can be considered a successful pruning.

FLOPs reduction: The number of floating-point operations (FLOPs) needed for one forward pass is a
measure of the computational complexity of a model. It should not be confused with FLOPS
(floating-point operations per second), which measures hardware throughput. Evaluating pruning
based on FLOPs reduction is useful when the goal is to reduce the computational resources required
to run the model. The effectiveness of pruning can be evaluated by comparing the FLOPs of the
pruned model to those of the original model.

Sparsity: Pruning can also be evaluated based on the sparsity of the pruned model. Sparsity is the
percentage of weights or connections that are set to zero after pruning. Higher sparsity indicates
more aggressive pruning. The effectiveness of pruning can be evaluated by comparing the
sparsity of the pruned model to the original model.

Compression ratio: The compression ratio is the ratio of the size of the pruned model to the size
of the original model, so a smaller value means stronger compression. Evaluating pruning based on
compression ratio is useful when the goal is to reduce the storage requirements of the model.

Transfer learning: Evaluating pruning using transfer learning involves using the pruned model as
a starting point for training a new model on a related task. If the pruned model generalizes well
to the new task, it can be considered an effective pruning.

Overall, evaluating pruning involves balancing the trade-off between model size and
performance. Pruning can be considered effective if it reduces the size of the model while
maintaining or improving its performance on a given task.
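
As a rough illustration of the sparsity and compression-ratio measures described above, the following sketch works on a weight matrix stored as a plain list of lists. The weights are made-up values, the helper names are hypothetical, and measuring model size by the number of nonzero parameters is an assumption (a real deployment would measure bytes on disk). Note that some authors report the inverse ratio, original size divided by pruned size.

```python
def sparsity(weights):
    """Fraction of entries that are exactly zero."""
    flat = [w for row in weights for w in row]
    return sum(1 for w in flat if w == 0.0) / len(flat)


def nonzero_count(weights):
    """Number of parameters that would still need to be stored."""
    return sum(1 for row in weights for w in row if w != 0.0)


# Made-up weights before and after pruning, for illustration only.
original = [[0.8, -0.05, 0.3],
            [0.02, -0.6, 0.01]]
pruned = [[0.8, 0.0, 0.3],
          [0.0, -0.6, 0.0]]

# Compression ratio as defined in the text: pruned size / original size.
compression_ratio = nonzero_count(pruned) / nonzero_count(original)

print("sparsity before:", sparsity(original))
print("sparsity after: ", sparsity(pruned))
print("compression ratio:", compression_ratio)
```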

What is Neural Network Pruning?

Neural network pruning is a technique used in machine learning to reduce the size of a neural
network model by removing unnecessary weights, connections, or entire neurons. The goal of
pruning is to simplify the model and improve its performance by reducing overfitting and
increasing generalization.

There are several types of pruning techniques that can be used in neural networks:

Weight pruning: Weight pruning involves removing small-weight connections in the network. In
this technique, the connections with the smallest absolute weights are removed. This is done
iteratively: after each pruning iteration, the network is retrained to fine-tune the remaining
weights.

Neuron pruning: Neuron pruning involves removing entire neurons from the network. In this
technique, neurons with the smallest impact on the network's output are identified and removed.
This is done iteratively: after each pruning iteration, the network is retrained to fine-tune
the remaining neurons.

Structured pruning: Structured pruning involves removing entire layers or sub-networks from the
network. In this technique, layers with the smallest impact on the network's output are identified
and removed. This is done iteratively: after each pruning iteration, the network is retrained
to fine-tune the remaining layers.

Neural network pruning is typically performed after training the original neural network, as a
post-processing step. The effectiveness of pruning is evaluated by comparing the performance of
the pruned model to the original model, using metrics such as accuracy, speed, or memory usage.

Pruning is a useful technique for reducing the size of large neural networks, which can be
computationally expensive to train and deploy. By removing unnecessary weights, connections,
or neurons, pruning can simplify the network and improve its performance. However, pruning
needs to be carefully optimized to achieve a good balance between model size and performance.

Network Pruning

Steps to be followed while pruning:

• Determine the significance of each neuron.

• Prioritize the neurons based on their value (assuming there is a clearly defined measure
for “importance”).
• Remove the neuron that is the least significant.
• Determine whether to prune further based on a termination condition (to be defined by
the user).
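
The steps above can be sketched as a simple loop. This is only an illustration under stated assumptions: the importance score (sum of the absolute outgoing weights of a neuron) and the termination condition (a minimum number of neurons to keep) are hypothetical choices, since the notes leave both to the user, and the retraining that would normally follow each removal is omitted.

```python
def neuron_importance(outgoing_weights):
    """Hypothetical importance score: L1 norm of a neuron's outgoing weights."""
    return sum(abs(w) for w in outgoing_weights)


def prune_neurons(layer, min_neurons):
    """Remove the least important neuron until only min_neurons remain.

    `layer` maps a neuron name to its list of outgoing weights.
    """
    layer = dict(layer)  # work on a copy so the caller's layer is untouched
    removed = []
    # Termination condition (assumed): stop once min_neurons are left.
    while len(layer) > min_neurons:
        # Determine the significance of each neuron and rank them.
        ranked = sorted(layer, key=lambda name: neuron_importance(layer[name]))
        # Remove the neuron that is the least significant.
        least = ranked[0]
        removed.append(least)
        del layer[least]
    return layer, removed


# Made-up hidden layer with four neurons and two outputs each.
hidden = {
    "h1": [0.9, -0.4],
    "h2": [0.01, 0.02],
    "h3": [-0.3, 0.2],
    "h4": [0.05, -0.1],
}
kept, removed = prune_neurons(hidden, min_neurons=2)
print("kept:", sorted(kept))
print("removed in order:", removed)
```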

• If unanticipated adjustments in data distribution may occur during deployment, don’t
prune.
• If you only have a partial understanding of the distribution shifts throughout training and
pruning, prune moderately.
• If you can account for all movements in the data distribution throughout training and
pruning, prune to the maximum extent possible.
• When retraining, specifically consider data augmentation to maximize the prune
potential.

Types of Pruning

Pruning can take many different forms, with the approach chosen based on our desired output. In
some circumstances, speed takes precedence over memory, whereas in others, reducing memory
matters more. Pruning approaches differ in how they handle sparsity structure, scoring,
scheduling, and fine-tuning.

Structured and Unstructured Pruning

Individual parameters are pruned using an unstructured pruning approach. This results in a sparse
neural network which, while smaller in terms of parameter count, may not be configured in a way
that promotes speed improvements.
Zeroing out scattered individual parameters saves memory but may not necessarily improve computing
performance because we end up conducting the same number of matrix multiplications as
before. Because we set specific weights in the weight matrix to zero, this is also known as
Weight Pruning.
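
A minimal sketch of unstructured weight pruning follows. It assumes a magnitude criterion (zero out the weights with the smallest absolute value), which is one common scoring rule rather than the only one, and the function name `magnitude_prune` and the example matrix are hypothetical. The shape of the matrix does not change, which is why the number of multiplications stays the same unless sparse kernels are used.

```python
def magnitude_prune(weights, fraction):
    """Zero out roughly `fraction` of the weights with the smallest magnitude.

    Ties at the threshold are all pruned, so slightly more than
    `fraction` of the weights can end up at zero.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(flat) * fraction)
    if n_prune == 0:
        return [row[:] for row in weights]
    threshold = flat[n_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row]
            for row in weights]


# Made-up 2 x 3 weight matrix, for illustration only.
W = [[0.5, -0.02, 0.3],
     [-0.7, 0.1, 0.04]]

W_pruned = magnitude_prune(W, fraction=0.5)
print(W_pruned)

# The matrix keeps its shape: only individual entries became zero.
same_shape = (len(W_pruned) == len(W)
              and all(len(a) == len(b) for a, b in zip(W_pruned, W)))
print("same shape:", same_shape)
```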

To make use of hardware and software that is specialized for dense processing, structured
pruning algorithms consider parameters in groups, deleting entire neurons, filters, or channels.
We set entire columns in the weight matrix to zero, thus removing the matching output neuron.
This is also known as Unit/Neuron Pruning. In a feed forward or convolutional layer, for example,
some of the neurons or channels are deleted, resulting in a direct reduction in computation.

Advantages

• Reduces inference and training time; the gain depends on the compression method and, of
course, the hardware
• As the neurons, connections between layers, and weights are reduced, there is a reduction
in storage requirement
• Reduces heat dissipation in deployed hardware such as mobile phones
• Saves power

Disadvantages

• Fewer pre-trained models and versions are available
• Difficulty in selecting a compression method, as we have to know the architecture of the
targeted hardware
• Not much to quantify beyond the original accuracy

Cross-validation
Cross-validation is a technique used in machine learning and statistical modeling to evaluate the
performance of a predictive model. The goal of cross-validation is to assess how well a model
will generalize to new data that it has not been trained on.

The basic idea of cross-validation is to divide the available data into two parts: a training set and
a validation set. The model is trained on the training set, and then its performance is evaluated on
the validation set. This process is repeated several times, with different subsets of the data used
for training and validation, and the results are averaged to get an overall estimate of the model's
performance.

The most commonly used form of cross-validation is k-fold cross-validation, which involves
dividing the data into k equally sized subsets (or "folds"). The model is trained on k-1 folds, and
then tested on the remaining fold. This process is repeated k times, with each fold used as the
validation set once. The performance of the model is then averaged across the k iterations to get
an overall estimate of its performance.
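
The k-fold procedure can be sketched without any library. The version below is an illustration under simplifying assumptions: the data are not shuffled before splitting (in practice they usually are), the "model" is a mean-only baseline chosen to keep the code short, and the function names are hypothetical. Setting k equal to the number of observations gives leave-one-out cross-validation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def cross_val_mse(ys, k):
    """Average squared error of a mean-only baseline across k folds."""
    fold_errors = []
    for train, validation in k_fold_indices(len(ys), k):
        # "Training" here is just taking the mean of the training targets.
        prediction = sum(ys[i] for i in train) / len(train)
        mse = sum((ys[i] - prediction) ** 2 for i in validation) / len(validation)
        fold_errors.append(mse)
    # Average the k validation errors into one overall estimate.
    return sum(fold_errors) / k


# Made-up targets, for illustration only.
ys = [2.0 * x for x in range(10)]

folds = list(k_fold_indices(len(ys), 5))
cv_error = cross_val_mse(ys, 5)
loo_error = cross_val_mse(ys, len(ys))  # leave-one-out: k equals n
print(len(folds), cv_error, loo_error)
```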

Another common form of cross-validation is leave-one-out cross-validation, which involves
using all but one observation for training and the remaining observation for validation. This
process is repeated for each observation in the data set, and the results are averaged to get an
estimate of the model's performance.

Cross-validation is a useful tool for assessing the performance of a model, as it provides an
estimate of how well the model will generalize to new data. It is particularly useful for detecting
overfitting, which occurs when a model fits the training data too closely and does not generalize
well to new data. By evaluating the model on a validation set that is separate from the training
data, cross-validation can provide an estimate of how well the model will perform on new data.

Types of Back propagation Network

There are two kinds of back propagation networks. They are categorized as below:

Static Back propagation

Static back propagation is one type of network that aims to produce a mapping from a static input
to a static output. These kinds of networks are capable of solving static classification problems
like optical character recognition (OCR).

Recurrent Back propagation

Recurrent back propagation is another type of network, employed in fixed-point learning. The
activations in recurrent back propagation are fed forward until they attain a fixed value.
Following this, an error is calculated and propagated backward. The NeuroSolutions software has
the ability to perform recurrent back propagation.
The key difference: static back propagation offers an immediate mapping, while the mapping in
recurrent back propagation is not immediate.

Disadvantages of Back propagation

Disadvantages of back propagation are:

• Back propagation can be sensitive to noisy data and irregularities
• Its performance is highly reliant on the input data
• It needs excessive time for training
• It needs a matrix-based method for back propagation instead of mini-batch

Applications of Back propagation

The applications are:

• The neural network is trained to enunciate each letter of a word and a sentence
• It is used in the field of speech recognition
• It is used in the field of character and face recognition

Virtues and Limitations of Back Propagation Learning

Back propagation is a widely used algorithm in the field of artificial neural networks and deep
learning. It is a supervised learning technique that is used to train feed forward neural networks.
Back propagation is often used in conjunction with gradient descent optimization to update the
weights and biases of the neural network during training.
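
To make the pairing of back propagation with gradient descent concrete, here is a minimal sketch on a deliberately tiny network: one input, one sigmoid hidden unit, and one linear output, trained on a single made-up example with a squared error loss. The architecture, starting parameters, and learning rate are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(params, x, target, lr):
    """One forward pass, one backward pass, one gradient descent update."""
    w1, b1, w2, b2 = params

    # Forward pass.
    z = w1 * x + b1
    h = sigmoid(z)
    y = w2 * h + b2
    loss = 0.5 * (y - target) ** 2

    # Backward pass: chain rule applied from the output back to the input.
    dy = y - target              # dL/dy
    dw2 = dy * h                 # dL/dw2
    db2 = dy                     # dL/db2
    dh = dy * w2                 # dL/dh
    dz = dh * h * (1.0 - h)      # dL/dz, since sigmoid'(z) = h * (1 - h)
    dw1 = dz * x                 # dL/dw1
    db1 = dz                     # dL/db1

    # Gradient descent: move each parameter against its gradient.
    new_params = (w1 - lr * dw1, b1 - lr * db1,
                  w2 - lr * dw2, b2 - lr * db2)
    return new_params, loss


params = (0.5, 0.0, -0.3, 0.1)   # made-up starting weights and biases
losses = []
for _ in range(200):
    params, loss = train_step(params, x=1.0, target=0.8, lr=0.5)
    losses.append(loss)

print("first loss:", losses[0])
print("last loss: ", losses[-1])
```

Back propagation only supplies the gradients (`dw1`, `db1`, `dw2`, `db2`); the update rule in the last lines of `train_step` is the separate gradient descent step.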

Here are some of the virtues and limitations of back propagation learning:

Virtues:

Flexibility: Back propagation can be used to train a wide range of neural network architectures,
making it a flexible algorithm that can be applied to many different types of problems.

Scalability: Back propagation can be applied to large datasets, making it an effective technique
for processing large amounts of data.

Generalization: Back propagation can be used to train neural networks to generalize well to
unseen data, making it a useful tool for tasks such as classification, regression, and image
recognition.

Efficiency: Back propagation can converge to a solution quickly, making it an efficient
algorithm for training neural networks.

Limitations:

Local Minima: Back propagation can get trapped in local minima and fail to find the global
minimum of the cost function.

Overfitting: Back propagation can overfit the training data, leading to poor performance on
unseen data.

Initialization: Back propagation can be sensitive to the initialization of the weights and biases of
the neural network, which can affect the convergence rate and the final solution.

Gradient Vanishing and Exploding: Back propagation can suffer from the gradient vanishing
and exploding problem, where the gradients become too small or too large, leading to slow
convergence or instability.
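
The vanishing and exploding behaviour can be seen with simple arithmetic. The sketch below assumes a chain of single-unit sigmoid layers that all share the same weight and the same pre-activation value, which is a strong simplification made only for illustration. Each layer multiplies the backpropagated gradient by the weight times the sigmoid derivative, and the sigmoid derivative is never larger than 0.25.

```python
import math


def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def backpropagated_scale(weight, pre_activation, depth):
    """Factor applied to a gradient after passing back through `depth` layers.

    Assumes every layer is one sigmoid unit with the same weight and the
    same pre-activation (an illustrative simplification).
    """
    per_layer = weight * sigmoid_derivative(pre_activation)
    return per_layer ** depth


# With weight 1 each layer scales the gradient by 0.25: it vanishes.
vanishing = backpropagated_scale(weight=1.0, pre_activation=0.0, depth=20)

# With weight 8 each layer scales the gradient by 2: it explodes.
exploding = backpropagated_scale(weight=8.0, pre_activation=0.0, depth=20)

print("vanishing factor:", vanishing)
print("exploding factor:", exploding)
```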

Accelerated Convergence

Accelerated convergence is a term used in mathematics and computer science to describe a
method that speeds up the convergence of an iterative algorithm. Convergence is the process by
which an iterative algorithm approaches a solution to a problem, and the rate of convergence
determines how quickly the algorithm converges to the solution.

Accelerated convergence methods are designed to improve the rate of convergence by modifying
the iterative algorithm in some way. There are many different techniques for accelerating
convergence, including:

Aitken's delta-squared method: This method takes successive differences between terms in the
sequence generated by an iterative algorithm and uses them to apply a correction to each term.
For a linearly convergent sequence, the result is a sequence that converges more quickly than the
original sequence.

Steffensen's method: This method applies Aitken's delta-squared method inside the fixed-point
iteration itself, restarting the iteration from each corrected value rather than only
post-processing the sequence of approximations generated by the algorithm. This can improve the
rate of convergence even further.
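
The two methods above can be compared on a small fixed-point problem. The sketch below is an illustration under stated assumptions: the test problem x = cos(x), the starting value, and the tolerances are arbitrary choices, and the function names are hypothetical. Aitken's correction from three successive iterates x0, x1, x2 is x0 - (x1 - x0)^2 / (x2 - 2*x1 + x0).

```python
import math


def aitken(x0, x1, x2):
    """Aitken's delta-squared extrapolation from three successive iterates."""
    denominator = x2 - 2.0 * x1 + x0
    if denominator == 0.0:
        return x2  # the iterates have numerically converged
    return x0 - (x1 - x0) ** 2 / denominator


def steffensen(g, x, tol=1e-12, max_iter=50):
    """Solve x = g(x), restarting from each Aitken-corrected value."""
    for iteration in range(1, max_iter + 1):
        x1 = g(x)
        x2 = g(x1)
        x_new = aitken(x, x1, x2)
        if abs(x_new - x) < tol:
            return x_new, iteration
        x = x_new
    return x, max_iter


def plain_fixed_point(g, x, tol=1e-12, max_iter=1000):
    """Unaccelerated iteration x <- g(x), for comparison."""
    for iteration in range(1, max_iter + 1):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new, iteration
        x = x_new
    return x, max_iter


root_plain, n_plain = plain_fixed_point(math.cos, 1.0)
root_fast, n_fast = steffensen(math.cos, 1.0)
print("plain iteration:", root_plain, "in", n_plain, "iterations")
print("Steffensen:     ", root_fast, "in", n_fast, "iterations")
```

Each Steffensen iteration evaluates `g` twice, so iteration counts alone slightly overstate the saving; the comparison is still lopsided for a linearly convergent problem like this one.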

Newton's method with line search: This method involves using a line search algorithm to
determine the step size in each iteration of Newton's method. This can significantly speed up the
convergence of the algorithm.
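
A small sketch of the idea, under stated assumptions: the objective f(x) = sqrt(1 + x^2) is a standard example where the full Newton step overshoots when started with |x| > 1, and the line search shown is a simple backtracking rule with an Armijo sufficient-decrease test, which is one common choice among several. The constants `c` and `shrink` and the function names are illustrative.

```python
import math


def f(x):
    return math.sqrt(1.0 + x * x)


def grad(x):
    return x / math.sqrt(1.0 + x * x)


def hess(x):
    return (1.0 + x * x) ** -1.5


def newton_line_search(x, tol=1e-10, max_iter=100, c=1e-4, shrink=0.5):
    """Minimize f with Newton directions and a backtracking step size."""
    for iteration in range(max_iter):
        g = grad(x)
        if abs(g) < tol:
            return x, iteration
        direction = -g / hess(x)   # full Newton step
        step = 1.0
        # Shrink the step until the Armijo sufficient-decrease test passes.
        while (step > 1e-12 and
               f(x + step * direction) > f(x) + c * step * g * direction):
            step *= shrink
        x += step * direction
    return x, max_iter


def pure_newton(x, n_steps):
    """Newton's method with the full step every time, for comparison."""
    for _ in range(n_steps):
        x -= grad(x) / hess(x)
    return x


x_min, n_iter = newton_line_search(2.0)
x_diverged = pure_newton(2.0, 3)
print("with line search:", x_min, "after", n_iter, "iterations")
print("full steps only: ", x_diverged)
```

In this example the line search mainly buys robustness: without it the iterates move away from the minimum at x = 0, while with it the first step is shortened and later steps take the full Newton length.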

Conjugate gradient method: This method is used for solving systems of linear equations whose
matrix is symmetric and positive definite, and it involves choosing a sequence of mutually
conjugate search directions to iteratively solve the system. For large, sparse systems of this
kind, the conjugate gradient method can converge much more quickly than other iterative methods
such as steepest descent.
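
A compact sketch of the conjugate gradient iteration in plain Python follows. It assumes the matrix is symmetric and positive definite and stored as a list of rows; the 2 x 2 system is a made-up example, and the helper names are hypothetical. In exact arithmetic the method reaches the solution of an n x n system in at most n iterations.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    n = len(b)
    max_iter = max_iter or n
    x = [0.0] * n
    r = b[:]                 # residual b - A x for the starting guess x = 0
    p = r[:]                 # first search direction
    rs_old = dot(r, r)
    for iteration in range(max_iter):
        if rs_old ** 0.5 < tol:
            return x, iteration
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)                   # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # The next direction is made conjugate to the previous ones.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter


# Made-up symmetric positive-definite system, for illustration only.
A = [[4.0, 1.0],
     [1.0, 3.0]]
b = [1.0, 2.0]

x, iterations = conjugate_gradient(A, b)
residual = [bi - axi for bi, axi in zip(b, matvec(A, x))]
print("solution:", x, "after", iterations, "iterations")
print("residual:", residual)
```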

Accelerated convergence methods are widely used in numerical analysis, scientific computing,
and optimization, where the speed of convergence can have a significant impact on the efficiency
of algorithms.
