ANN-unit 3
Back Propagation
Back propagation is the essence of neural network training. It is the method of fine-tuning the
weights of a neural network based on the error rate obtained in the previous epoch (i.e.,
iteration). Proper tuning of the weights allows you to reduce error rates and make the model
reliable by increasing its generalization.
In neural networks, back propagation is short for “backward propagation of errors.” It is a
standard method of training artificial neural networks: it calculates the gradient
of a loss function with respect to all the weights in the network.
Consider the following Back propagation neural network example diagram to understand:
Keep repeating the process until the desired output is achieved
• Static Back-propagation
• Recurrent Back propagation
Static back-propagation:
It is one kind of back propagation network which produces a mapping of a static input for static
output. It is useful to solve static classification issues like optical character recognition.
The main difference between the two methods is that the mapping is rapid (static) in static
back-propagation, while it is non-static in recurrent back propagation.
• In 1961, the basic concept of continuous back propagation was derived in the context of
control theory by Henry J. Kelley and Arthur E. Bryson.
• In 1969, Bryson and Ho gave a multi-stage dynamic system optimization method.
• In 1974, Werbos stated the possibility of applying this principle in an artificial neural
network.
• In 1982, Hopfield brought his idea of a neural network.
• In 1986, by the effort of David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams,
back propagation gained recognition.
• In 1993, Wan was the first person to win an international pattern recognition contest with
the help of the back propagation method.
Back propagation Key Points
• Simplifies the network structure by eliminating weighted links that have the least effect on
the trained network
• You need to study a group of input and activation values to develop the relationship
between the input and hidden unit layers.
• It helps to assess the impact that a given input variable has on a network output. The
knowledge gained from this analysis should be represented in rules.
• Back propagation is especially useful for deep neural networks working on error-prone
projects, such as image or speech recognition.
• Back propagation takes advantage of the chain and power rules, which allows it
to function with any number of outputs.
• Discomfort (bias)
Summary
• A neural network is a group of connected I/O units where each connection has a weight
associated with it.
• Back propagation is a short form for “backward propagation of errors.” It is a standard
method of training artificial neural networks
• Back propagation algorithm in machine learning is fast, simple and easy to program
• A feed forward BPN network is an artificial neural network.
• The two types of back propagation networks are 1) static back-propagation and 2) recurrent
back propagation
• In 1961, the basic concept of continuous back propagation was derived in the context of
control theory by Henry J. Kelley and Arthur E. Bryson.
• Back propagation in data mining simplifies the network structure by removing weighted
links that have a minimal effect on the trained network.
• It is especially useful for deep neural networks working on error-prone projects, such as
image or speech recognition.
• The biggest drawback of back propagation is that it can be sensitive to noisy data.
Back propagation is an algorithm that propagates the errors backward from the output nodes to the
input nodes. Therefore, it is simply referred to as the backward propagation of errors. It is used in
a wide range of neural network applications in data mining, such as character recognition and
signature verification.
Hessian Matrix
The Hessian matrix H of a function f collects all of its second-order partial derivatives:
H_ij = ∂²f / (∂x_i ∂x_j)
where H_ij is the entry in the i-th row and j-th column of the matrix, and x_i and x_j are the
variables of the function. When the second partial derivatives are continuous, the Hessian matrix is
symmetric, meaning that H_ij = H_ji for all i and j.
The Hessian matrix provides information about the curvature of the function at a given point. At a
critical point (where the gradient is zero), if all the eigenvalues of the Hessian matrix are positive,
then the function has a local minimum at that point. If all the eigenvalues are negative, then the
function has a local maximum. If the eigenvalues have both positive and negative values, then the
point is a saddle point.
The Hessian matrix is used in optimization algorithms such as Newton's method, which uses the
second-order derivative information to iteratively find the minimum of a function. In machine
learning, the Hessian matrix is used in methods such as the Hessian-free optimization, which is a
variant of Newton's method that avoids the expensive computation of the full Hessian matrix.
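To make the role of the Hessian in Newton's method concrete, here is a minimal sketch in plain Python. The quadratic function f(x, y) = x² + 3y² + xy − 4x and the starting point (5, 5) are assumptions chosen only for illustration; they do not come from these notes.

```python
def gradient(x, y):
    """Gradient of the illustrative function f(x, y) = x**2 + 3*y**2 + x*y - 4*x."""
    return (2 * x + y - 4, x + 6 * y)


def hessian(x, y):
    """Hessian of the same function; it is constant because f is quadratic."""
    return ((2.0, 1.0), (1.0, 6.0))


def newton_step(x, y):
    """One Newton update: (x, y) minus H^-1 times the gradient."""
    gx, gy = gradient(x, y)
    (a, b), (c, d) = hessian(x, y)
    det = a * d - b * c
    # Solve H * step = gradient for a 2x2 system with Cramer's rule.
    step_x = (d * gx - b * gy) / det
    step_y = (-c * gx + a * gy) / det
    return x - step_x, y - step_y


x_new, y_new = newton_step(5.0, 5.0)
print(x_new, y_new)             # the minimizer of f
print(gradient(x_new, y_new))   # approximately (0, 0)
```

Because the function is exactly quadratic, a single Newton step lands on the minimum. For a general loss the step is repeated, and the cost of forming and inverting the Hessian is what methods such as Hessian-free optimization try to avoid.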
Generalization, Cross Validation
Generalization refers to the ability of a machine learning model to perform well on new, unseen
data that was not used during the training process. The ultimate goal of any machine learning
model is to generalize well, as the model's ability to make accurate predictions on new data is
what makes it useful in practice.
One common approach to cross-validation is k-fold cross-validation, where the data is divided
into k equal-sized subsets, or folds. The model is trained on k-1 folds and evaluated on the
remaining fold. This process is repeated k times, with each fold being used once for validation.
The results of each fold can then be averaged to get an estimate of the model's generalization
performance.
Cross-validation can be used to tune the hyperparameters of a machine learning model, such as the
learning rate or regularization strength, by evaluating the model's performance on the validation
folds for different values of the hyperparameters.
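The k-fold procedure and its use for hyperparameter tuning can be sketched in plain Python as follows. The one-parameter ridge model, the synthetic data, and the candidate values of the regularization strength `lam` are all assumptions made for illustration.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_ridge_slope(xs, ys, lam):
    """Closed-form 1-D ridge regression through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cross_val_mse(xs, ys, lam, k=5):
    """Average validation MSE over k folds for one value of lam."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        w = fit_ridge_slope(train_x, train_y, lam)  # train on k-1 folds
        mse = sum((ys[i] - w * xs[i]) ** 2 for i in held_out) / len(held_out)
        fold_errors.append(mse)                     # evaluate on the held-out fold
    return sum(fold_errors) / k


# Synthetic data: y is roughly 2x with a small alternating disturbance.
xs = [0.5 * i for i in range(1, 21)]
ys = [2.0 * x + (0.3 if i % 2 == 0 else -0.3) for i, x in enumerate(xs)]

candidates = [0.0, 1.0, 10.0, 100.0]
scores = {lam: cross_val_mse(xs, ys, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)
print(scores, best_lam)
```

The folds here are contiguous to keep the sketch short; in practice the data is usually shuffled before it is split.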
Network pruning is a technique used in deep learning to reduce the size of a neural network by
removing unnecessary or redundant parameters, while maintaining or even improving the
accuracy of the model. Here are some common network pruning techniques:
Weight pruning: This technique involves identifying and removing the connections in the
network that have small or zero weights. This can be done by setting a threshold value below
which the weights are pruned.
Neuron pruning: This technique involves removing entire neurons from the network, typically
those with small or zero activation. This can be done by identifying the least important neurons
using sensitivity analysis, and then removing them from the network.
Channel pruning: This technique involves removing entire channels in a convolutional neural
network that are deemed to be unnecessary or redundant. This can be done by measuring the
importance of each channel, for example by looking at the magnitude of the weights associated
with each channel.
Filter pruning: This technique involves removing entire filters in a convolutional neural network
that are deemed to be unnecessary or redundant. This can be done by measuring the importance
of each filter, for example by looking at the average activation of the feature map produced by
each filter.
Structured pruning: This technique involves removing entire substructures from the network,
such as layers, blocks or modules. This can be done by applying the above pruning techniques to
the substructures, or by using more advanced methods such as spectral clustering.
Lottery ticket hypothesis: This is a more recent idea. A neural network is initialized with random
weights and trained, and the subset of weights that matters most for accuracy is identified (for
example, by magnitude). The remaining weights are pruned, the surviving weights are reset to their
original initial values, and this smaller subnetwork (the "winning ticket") is retrained, often
reaching accuracy similar to that of the original network.
Overall, network pruning techniques can be used to reduce the size of neural networks, making
them more efficient to deploy on resource-constrained devices, while still maintaining high
accuracy.
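As a small illustration of threshold-based weight pruning from the list above, the sketch below zeroes out every weight whose magnitude falls below a threshold. The weight values and the threshold of 0.05 are made up for the example.

```python
def prune_by_magnitude(weights, threshold):
    """Return a copy of a 2-D weight list with small-magnitude entries set to 0."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def sparsity(weights):
    """Fraction of entries that are exactly zero."""
    flat = [w for row in weights for w in row]
    return sum(1 for w in flat if w == 0.0) / len(flat)


weights = [[0.80, -0.02, 0.30],
           [0.01, -0.65, 0.04]]
pruned = prune_by_magnitude(weights, threshold=0.05)
print(pruned)            # [[0.8, 0.0, 0.3], [0.0, -0.65, 0.0]]
print(sparsity(pruned))  # 0.5
```

In a real workflow the pruned network would normally be fine-tuned afterwards to recover any lost accuracy.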
Neural Network:
Neural networks are an information processing paradigm inspired by the human nervous system.
Just as the human nervous system is built from biological neurons, neural networks are built from
artificial neurons, which are mathematical functions modeled on biological neurons. The human
brain is estimated to have about 10 billion neurons, each connected to an average of 10,000 other
neurons. Each neuron receives signals through synapses, which control the effect of each signal
on the neuron.
Back propagation:
Back propagation is a widely used algorithm for training feed forward neural networks. It
computes the gradient of the loss function with respect to the network weights, and it does so far
more efficiently than naively computing the gradient with respect to each weight separately. This
efficiency makes it possible to use gradient methods to train multi-layer networks, updating the
weights to minimize the loss; gradient descent and its variants, such as stochastic gradient
descent, are often used.
The back propagation algorithm works by computing the gradient of the loss function with
respect to each weight via the chain rule, computing the gradient layer by layer, and iterating
backward from the last layer to avoid redundant computation of intermediate terms in the chain
rule.
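The layer-by-layer use of the chain rule can be checked on a very small example. The sketch below assumes a network with one input, one sigmoid hidden unit, one linear output and a squared-error loss (none of which is prescribed by these notes), and compares the back-propagated gradients with finite-difference estimates.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(params, x, t):
    """Forward pass and squared-error loss for parameters [w1, b1, w2, b2]."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y = w2 * h + b2
    return 0.5 * (y - t) ** 2


def backprop(params, x, t):
    """Gradients of the loss, computed backward from the output layer."""
    w1, b1, w2, b2 = params
    # Forward pass, keeping the intermediate values.
    h = sigmoid(w1 * x + b1)
    y = w2 * h + b2
    # Backward pass: each line reuses the term computed just before it.
    dL_dy = y - t
    dL_dw2 = dL_dy * h
    dL_db2 = dL_dy
    dL_dh = dL_dy * w2
    dL_dz1 = dL_dh * h * (1.0 - h)   # sigmoid'(z1) = h * (1 - h)
    dL_dw1 = dL_dz1 * x
    dL_db1 = dL_dz1
    return [dL_dw1, dL_db1, dL_dw2, dL_db2]


def numerical_gradient(params, x, t, eps=1e-6):
    """Central finite differences, one parameter at a time (the naive way)."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, t) - loss(down, x, t)) / (2 * eps))
    return grads


params = [0.5, -0.3, 0.8, 0.1]
analytic = backprop(params, x=1.5, t=1.0)
numeric = numerical_gradient(params, x=1.5, t=1.0)
print(analytic)
print(numeric)
```

The finite-difference version needs two extra forward passes per parameter, while the backward pass obtains all gradients from a single forward pass, which is the efficiency argument made above.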
1. It is the gradient descent method as used in the case of a simple perceptron network with
differentiable units.
2. It differs from other networks in the process by which the weights are
calculated during the learning period of the network.
3. Training is done in three stages:
• the feed-forward of input training pattern
• the calculation and back propagation of the error
• updating of the weight
Working of Back propagation:
Neural networks use supervised learning to generate output vectors from the input vectors that the
network operates on. The network compares the generated output to the desired output and computes
an error if the two do not match. It then adjusts the weights according to that error to move
toward the desired output.
Step 3: Calculate the output of each neuron from the input layer to the hidden layer to the output
layer.
Step 4: Calculate the error in the outputs
Back propagation error = Actual Output – Desired Output
Step 5: From the output layer, go back to the hidden layer to adjust the weights to reduce the
error.
Step 6: Repeat the process until the desired output is achieved.
Parameters:
• x = input training vector, x = (x1, x2, …, xn).
• t = target vector, t = (t1, t2, …, tm).
• δk = error at output unit k.
• δj = error at hidden unit j.
• α = learning rate.
• v0j = bias of hidden unit j.
• w0k = bias of output unit k.
Training Algorithm:
Step 1: Initialize weight to small random values.
Step 2: While the stopping condition is false, do steps 3 to 9.
Step 3: For each training pair, do steps 4 to 8 (steps 4 and 5 are the feed-forward phase).
Step 4: Each input unit receives the input signal xi and transmits it to all the units in the hidden layer.
Step 5: Each hidden unit zj (j = 1 to a) sums its weighted input signals to calculate its net input:
zinj = v0j + Σ xi vij (i = 1 to n)
It applies the activation function, zj = f(zinj), and sends this signal to all units in the layer above,
i.e. the output units.
Each output unit yk (k = 1 to m) sums its weighted input signals:
yink = w0k + Σ zj wjk (j = 1 to a)
and applies its activation function to calculate the output signal:
yk = f(yink)
Step 6: Each output unit yk (k = 1 to m) receives a target pattern corresponding to the input pattern,
and its error information term is calculated as:
δk = (tk – yk) f′(yink)
Step 7: Each hidden unit zj (j = 1 to a) sums its delta inputs from all units in the layer above:
δinj = Σ δk wjk (k = 1 to m)
The error information term is calculated as:
δj = δinj f′(zinj)
Step 8: Each output unit yk (k = 1 to m) updates its bias and weights (j = 0 to a). The weight
correction term is given by:
Δwjk = α δk zj
and the bias correction term is given by Δw0k = α δk.
Therefore wjk(new) = wjk(old) + Δwjk
w0k(new) = w0k(old) + Δw0k
Each hidden unit zj (j = 1 to a) updates its bias and weights (i = 0 to n). The weight
correction term is
Δvij = α δj xi
and the bias correction term is
Δv0j = α δj
Therefore vij(new) = vij(old) + Δvij
v0j(new) = v0j(old) + Δv0j
Step 9: Test the stopping condition. The stopping condition can be the error falling below a chosen
threshold or a maximum number of epochs being reached.
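The training algorithm above can be sketched in plain Python as follows. This is one possible reading of the steps, not a reference implementation: the sigmoid activation, the OR training set, the two hidden units, the learning rate of 0.5 and the fixed epoch count used as the stopping condition are all assumptions made for the example.

```python
import math
import random


def f(x):
    """Sigmoid activation (an assumed choice of f)."""
    return 1.0 / (1.0 + math.exp(-x))


def f_prime(out):
    """Sigmoid derivative written in terms of the sigmoid's output."""
    return out * (1.0 - out)


def train(samples, n_hidden=2, alpha=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(samples[0][0]), len(samples[0][1])
    # Step 1: small random weights. Row 0 of v and w holds the biases v0j, w0k.
    v = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in + 1)]
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_hidden + 1)]
    history = []
    for _ in range(epochs):                      # Step 2 (stop after `epochs`)
        total_error = 0.0
        for x, t in samples:                     # Step 3
            # Steps 4-5: feed-forward.
            z = [f(v[0][j] + sum(x[i] * v[i + 1][j] for i in range(n_in)))
                 for j in range(n_hidden)]
            y = [f(w[0][k] + sum(z[j] * w[j + 1][k] for j in range(n_hidden)))
                 for k in range(n_out)]
            total_error += 0.5 * sum((t[k] - y[k]) ** 2 for k in range(n_out))
            # Step 6: error terms at the output units.
            delta_k = [(t[k] - y[k]) * f_prime(y[k]) for k in range(n_out)]
            # Step 7: error terms at the hidden units (uses the old w).
            delta_j = [sum(delta_k[k] * w[j + 1][k] for k in range(n_out)) * f_prime(z[j])
                       for j in range(n_hidden)]
            # Step 8: update weights and biases.
            for k in range(n_out):
                w[0][k] += alpha * delta_k[k]
                for j in range(n_hidden):
                    w[j + 1][k] += alpha * delta_k[k] * z[j]
            for j in range(n_hidden):
                v[0][j] += alpha * delta_j[j]
                for i in range(n_in):
                    v[i + 1][j] += alpha * delta_j[j] * x[i]
        history.append(total_error)              # Step 9 would test this value
    return v, w, history


or_samples = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [1])]
v, w, history = train(or_samples)
print(history[0], history[-1])  # summed squared error: first epoch vs last epoch
```

A real stopping condition would usually compare `total_error` against a tolerance instead of always running a fixed number of epochs.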
Back propagation, short for “backward propagation of errors,” is very useful for training neural
networks. It is fast, simple, and easy to implement. Apart from the network size and the learning
rate α, it has few parameters to set, and it is a flexible method because no prior
knowledge of the network is required.
• Recurrent back propagation: Recurrent back propagation is another type of network, used for
fixed-point learning. Activations in recurrent back propagation are fed forward until a fixed
value is reached. Static back propagation provides an instant mapping, while recurrent back
propagation does not.
Advantages:
• It is Flexible and efficient.
• No need for users to learn any special functions.
Disadvantages:
• It is sensitive to noisy data and irregularities. Noisy data can lead to inaccurate results.
• Performance is highly dependent on input data.
• Training can take a long time.
• The matrix-based approach is preferred over a mini-batch.
The algorithm trains a neural network effectively by applying the chain rule. In simple terms,
after each forward pass through the network, back propagation performs a
backward pass while adjusting the model’s parameters (weights and biases).
The 4-layer neural network consists of 4 neurons for the input layer, 4 neurons for the hidden
Input layer
The neurons, colored in purple, represent the input data. These can be as simple as scalars or
more complex like vectors or multidimensional matrices.
Equation for input x_i: (a_i)¹ = x_i, for i = 1, …, 4
The first set of activations (a) are equal to the input values. NB: “activation” is the neuron’s value
Hidden layers
The final values at the hidden neurons, colored in green, are computed using z^l — weighted
inputs in layer l, and a^l— activations in layer l. For layer 2 and 3 the equations are:
• l = 2: z² = W¹x + b¹, a² = f(z²)
• l = 3: z³ = W²a² + b², a³ = f(z³)
W¹ and W² are the weights feeding layers 2 and 3, while b¹ and b² are the corresponding biases.
Activations a² and a³ are computed using an activation function f. Typically, this function f is
non-linear (e.g. sigmoid, ReLU, tanh) and allows the network to learn complex patterns in data.
Looking carefully, you can see that all of x, z², a², z³, a³, W¹, W², b¹ and b² are missing their
subscripts presented in the 4-layer network illustration above. The reason is that we have
combined all parameter values in matrices, grouped by layers. This is the standard way of
working with neural networks and one should be comfortable with the calculations. However, I
• W¹ is a weight matrix of shape (n, m) where n is the number of output neurons (neurons in
the next layer) and m is the number of input neurons (neurons in the previous layer). For us, n
= 2 and m = 4.
Equation for W¹
NB: The first number in any weight’s subscript matches the index of the neuron in the next
layer (in our case this is the Hidden_1 layer) and the second number matches the index of the
neuron in the previous layer (in our case this is the Input layer).
• x is the input vector of shape (m, 1) where m is the number of input neurons. For us, m = 4.
Equation for x
b¹ is a bias vector of shape (n , 1) where n is the number of neurons in the current layer. For us, n
= 2.
Equation for b¹
Following the equation for z², we can use the above definitions of W¹, x and b¹ to derive “Equation for z²”:
Equation for z²
Now carefully observe the neural network illustration from above.
Input and Hidden_1 layers
You will see that z² can be expressed using (z_1)² and (z_2)², where (z_1)² and (z_2)² are the
sums of the products of every input x_i with the corresponding weight (W_ij)¹. This leads to the
same “Equation for z²” and proves that the matrix representations for z², a², z³ and a³ are
correct.
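The matrix form described above can be written out explicitly. The block below is a reconstruction from the stated shapes (W¹ is 2×4, x is 4×1, b¹ is 2×1) and from the subscript convention in the NB, not a copy of the original figures; the layer index is written as a superscript.

```latex
W^{1} =
\begin{bmatrix}
w^{1}_{11} & w^{1}_{12} & w^{1}_{13} & w^{1}_{14} \\
w^{1}_{21} & w^{1}_{22} & w^{1}_{23} & w^{1}_{24}
\end{bmatrix},
\qquad
x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix},
\qquad
b^{1} = \begin{bmatrix} b^{1}_{1} \\ b^{1}_{2} \end{bmatrix}

z^{2} = W^{1} x + b^{1} =
\begin{bmatrix}
w^{1}_{11} x_1 + w^{1}_{12} x_2 + w^{1}_{13} x_3 + w^{1}_{14} x_4 + b^{1}_{1} \\
w^{1}_{21} x_1 + w^{1}_{22} x_2 + w^{1}_{23} x_3 + w^{1}_{24} x_4 + b^{1}_{2}
\end{bmatrix}
=
\begin{bmatrix} z^{2}_{1} \\ z^{2}_{2} \end{bmatrix}
```

Each row of the product is the weighted sum for one hidden neuron, which is why the matrix expression and the per-neuron sums agree.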
Output layer
The final part of a neural network is the output layer, which produces the predicted value. In our
simple example, it is presented as a single neuron, colored in blue and evaluated as follows:
Equation for output s
Again, we are using the matrix representation to simplify the equation. One
can use the above techniques to understand the underlying logic.
The equations above form the network’s forward propagation. Here is a short overview:
Overview of forward propagation equations colored by layer
The final step in a forward pass is to evaluate the predicted output s against an expected
output y. The output y is part of the training dataset (x, y) where x is the input (as we saw in
the previous section). Evaluation between s and y happens through a cost function. This can be as
simple as MSE (mean squared error) or more complex like cross-entropy.
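For concreteness, the two cost functions named above can be written in a few lines of plain Python. The predictions `s` and labels `y` below are made-up values, and the cross-entropy shown is the binary form, which is an assumption since the notes do not say which variant is meant.

```python
import math


def mean_squared_error(predictions, targets):
    """Average of the squared differences between prediction and target."""
    return sum((s - y) ** 2 for s, y in zip(predictions, targets)) / len(targets)


def binary_cross_entropy(probabilities, targets, eps=1e-12):
    """Average negative log-likelihood for 0/1 targets."""
    total = 0.0
    for p, y in zip(probabilities, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(targets)


s = [0.9, 0.2, 0.6]   # predicted outputs
y = [1, 0, 1]         # expected outputs
mse = mean_squared_error(s, y)
bce = binary_cross_entropy(s, y)
print(mse, bce)       # roughly 0.07 and 0.28
```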
The gradient of a function C(x_1, x_2, …, x_m) at a point x is the vector of the partial derivatives of C at x.
• The derivative of a function C measures the sensitivity of the function value
(output value) to a change in its argument x (input value). In other words, the
derivative tells us the direction in which C is changing.
• The gradient shows how much the parameter x needs to change (in the positive or negative
direction) to minimize C. Those gradients are computed using a technique called the chain rule.
The common part in both equations is often called “local gradient” and is expressed as follows:
Equation for local gradient
The “local gradient” can easily be determined using the chain rule. The
gradients allow us to optimize the model’s parameters:
Algorithm for optimizing weights and biases (also called “Gradient descent”)
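The missing equations can be sketched as follows. This is a reconstruction under the indexing used earlier in these notes, where z^{l+1} = W^{l} a^{l} + b^{l}; the symbol α for the learning rate is borrowed from the training algorithm section and is an assumption here.

```latex
% Local gradient: sensitivity of the cost to the weighted input of a layer
\delta^{\,l+1} \equiv \frac{\partial C}{\partial z^{\,l+1}}

% Gradients with respect to the weights and biases that feed that layer
\frac{\partial C}{\partial W^{l}} = \delta^{\,l+1} \left(a^{l}\right)^{\top},
\qquad
\frac{\partial C}{\partial b^{l}} = \delta^{\,l+1}

% Gradient descent update, repeated until the cost stops improving
W^{l} \leftarrow W^{l} - \alpha \, \frac{\partial C}{\partial W^{l}},
\qquad
b^{l} \leftarrow b^{l} - \alpha \, \frac{\partial C}{\partial b^{l}}
```

The local gradient is the shared factor in both derivatives, so it only needs to be computed once per layer.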
Weight (w_22)² connects (a_2)² and (z_2)², so computing the gradient requires applying the chain
Equation for derivative of C in (w_22)²
Calculating the final value of derivative of C in (a_2)³ requires knowledge of the function C.
Since C is dependent on (a_2)³, calculating the derivative should be fairly straightforward. I hope
this example manages to throw some light on the mathematics behind computing gradients. To
H_ij = ∂²f / (∂θ_i ∂θ_j)
where f is the function, θ_i and θ_j are the ith and jth parameters of the model, and ∂²f / (∂θ_i
∂θ_j) is the second-order partial derivative of f with respect to θ_i and θ_j.
The Hessian matrix is a square matrix with the same number of rows and columns as the number
of parameters in the model. The element H_ij describes how the gradient of the loss with respect
to the ith parameter changes as the jth parameter changes. A positive diagonal entry H_ii means
the loss curves upward along θ_i, while a negative one means it curves downward. A zero
off-diagonal entry H_ij indicates that, to second order, the two parameters do not interact.
The Hessian matrix provides important information about the curvature of the loss function,
particularly around critical points such as local minima, saddle points, and maxima. The
eigenvalues and eigenvectors of the Hessian matrix can be used to analyze the behavior of the
model at these critical points and to optimize the model more efficiently and accurately using
second-order optimization algorithms such as Newton's method. The Hessian matrix is also useful
for regularization and compression techniques such as weight decay, Hessian-based early stopping,
and Hessian-based pruning.
The determinant of the Hessian is also called the discriminant of f. For a two-variable function
f(x, y), it is given by:
D(x, y) = f_xx(x, y) f_yy(x, y) – (f_xy(x, y))²
Examples of Hessian Matrices and Discriminants
Suppose we have the following function:
Then the Hessian H_g and the discriminant D_g are given by:
D_g(0, 0) = 0
D_g(1, 0) = 36 + 24 = 60
D_g(0, 1) = -36
D_g(-1, 0) = 12
1. The function f has a local minimum at a critical point (a, b) if f_xx(a, b) > 0 and the discriminant D(a, b) > 0
2. The function f has a local maximum at a critical point (a, b) if f_xx(a, b) < 0 and the discriminant D(a, b) > 0
3. The function f has a saddle point at a critical point (a, b) if D(a, b) < 0
4. We cannot draw any conclusions if D(a, b) = 0 and need more tests
Example: g(x, y)
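The definition of g did not survive in these notes. The sketch below assumes g(x, y) = x³ + 2y² + 3xy², which is a guess: it was chosen because its discriminant reproduces all four values of D_g listed earlier. With that assumption, the Hessian and the second-derivative test can be coded directly.

```python
def hessian_g(x, y):
    """Hessian of the assumed g(x, y) = x**3 + 2*y**2 + 3*x*y**2."""
    g_xx = 6 * x
    g_xy = 6 * y
    g_yy = 4 + 6 * x
    return ((g_xx, g_xy), (g_xy, g_yy))


def discriminant(h):
    """Determinant of a 2x2 Hessian: f_xx * f_yy - f_xy**2."""
    (f_xx, f_xy), (_, f_yy) = h
    return f_xx * f_yy - f_xy ** 2


def second_derivative_test(h):
    """Classify a critical point from its Hessian (valid only where the gradient is zero)."""
    d = discriminant(h)
    f_xx = h[0][0]
    if d > 0 and f_xx > 0:
        return "local minimum"
    if d > 0 and f_xx < 0:
        return "local maximum"
    if d < 0:
        return "saddle point"
    return "inconclusive"


for point in [(0, 0), (1, 0), (0, 1), (-1, 0)]:
    print(point, discriminant(hessian_g(*point)))
# (0, 0) 0
# (1, 0) 60
# (0, 1) -36
# (-1, 0) 12

print(second_derivative_test(hessian_g(0, 0)))  # inconclusive
```

For this g, the gradient vanishes only at (0, 0), so that is the only point where the test applies, and there it falls under rule 4 (no conclusion).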
Second-order optimization: The Hessian matrix provides information about the curvature of the
loss surface, which is useful for second-order optimization algorithms. Second-order
optimization algorithms, such as Newton's method, use the Hessian matrix to optimize the model
parameters more efficiently and accurately than first-order optimization algorithms.
Understanding model behavior: The Hessian matrix can help us understand the behavior of the
model, particularly around critical points such as local minima, saddle points, and maxima. The
eigenvalues and eigenvectors of the Hessian matrix provide information about the direction and
curvature of the loss surface at these critical points.
Regularization: The Hessian matrix is relevant to regularization techniques such as weight
decay and Hessian-based early stopping. Weight decay is a regularization technique that
penalizes large weights by adding a term proportional to the squared L2 norm of the weights to
the loss; this adds a constant multiple of the identity matrix to the Hessian. Hessian-based
early stopping involves stopping the training process when the Hessian matrix becomes too large,
which helps prevent overfitting.
Model compression: The Hessian matrix can also be used for model compression techniques
such as Hessian-based pruning. Hessian-based pruning uses second-order information to estimate
how much the loss would increase if a given parameter were removed, and removes the parameters
with the smallest estimated effect, which results in a compressed model with fewer parameters.
5-fold cross-validation model fits for a simulated land value prediction task. The quadratic form
has lower cross-validation error, so we’ll re-fit and deploy that one.
For our running example, we set K=5 and use the squared error loss function. The quadratic
model has a lower CV error than the linear model, so we choose that model form to re-fit to the
full dataset and then deploy. Nice and tidy, let’s ship it.
That’s pretty much where most instructional texts and most practitioners (including myself,
historically) leave things. Not so fast.
Generalization error is an unknown quantity in real-world problems, so it’s useful to ask how
well we can estimate it with methods like cross-validation and data splitting.
Our final model for the land value simulation. To make business decisions about the model, we
need to know its generalization error, i.e. its average prediction error on new data points. The
cross-validation error helped with model selection but how useful is it for this purpose?
1. Model selection: choose model architecture, hyperparameters, features, and early
stopping to maximize predictive performance. This is the easier of the two tasks because
we only need to know that one model is better than another, but not exactly how accurate
each model is.
2. Model assessment: estimate the generalization error of a model, as accurately as
possible.
1. It is a great reminder to re-read ESL chapter 7 on model assessment and some of the
more recent papers cited by Bates, et al. There are many surprising things about model
assessment that are easy to forget in the hustle of industry practice.
2. Neither industry practitioners nor academic sources seem to worry much about the
rationale for model assessment, but we should. Particularly if bad predictions can be
catastrophic—as in medicine, finance, insurance, or flight control systems, for example—
we need to understand the distribution of model errors. In these cases, the question may
not be which model is best, but is any model acceptably accurate at all?
3. If the assumptions of Bates, et al. do apply to your business problem, then consider
trying their nested CV method for the generalization error confidence interval.
Admittedly, most data science problems in industry today have plenty of data, so simple
train-validation-test set splits should suffice.
4. As this research topic gathers momentum, more results will be found. Be on the lookout
and be open to updating your model evaluation procedures.
There are several metrics that can be deduced from the confusion matrix, such as —
Accuracy = (TP + TN) /(TP + TN + FP + FN)
Precision = (TP) / (TP + FP)
Recall = (TP) / (TP + FN)
F1 Score = (2 x Precision x Recall) / (Precision + Recall)
where TP is True Positive, FN is False Negative, and likewise for the rest.
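The four formulas above translate directly into code. The sketch below uses plain Python and made-up confusion-matrix counts; the function name and the zero-division fallbacks are choices made for this example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)
# accuracy 0.85, precision about 0.889, recall 0.8, F1 about 0.842
```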
Precision is the fraction of the items you flagged as relevant that are actually relevant, whereas
recall is the fraction of all actually relevant items that you flagged. Recall is also referred to
as the sensitivity of your model, whereas precision is referred to as the Positive Predictive Value.
Now that you have grasped the concept, let's understand how to do it with ease using the
scikit-learn API and a few lines of
Cross Validation
Cross validation is a technique for assessing how the results of a statistical analysis generalize
to an independent data set. It evaluates machine learning models by training several
models on subsets of the available input data and evaluating them on the complementary subset of
the data. Cross-validation makes it much easier to detect overfitting.
There are several cross validation techniques, such as:
1. K-Fold Cross Validation
2. Leave P-out Cross Validation
3. Leave one-out Cross Validation
4. Repeated Random Sub-sampling Method
5. Holdout Method
There are different ways to prune a neural network. (1) You can prune weights. This is done by
setting individual parameters to zero and making the network sparse. This would lower the
number of parameters in the model while keeping the architecture the same. (2) You can remove
entire nodes from the network. This would make the network architecture itself smaller, while
Visualization of pruning weights/synapses vs. nodes/neurons
Weight-based pruning is more popular as it is easier to do without hurting the performance of the
network. However, it requires sparse computations to be effective, which in turn requires hardware
support and a certain amount of sparsity to be efficient. Pruning nodes allows dense computation,
which is more optimized, so the network can be run normally without sparse computation.
Dense computation is usually better supported on hardware. However, removing entire
neurons can more easily hurt the accuracy of the neural network.
When to prune?
timeline. If you are using a weight magnitude-based pruning approach, as described in the
previous section, you would want to prune after training. However, after pruning, you may
observe that the model performance has suffered. This can be fixed by fine-tuning, meaning
How to evaluate pruning?
Evaluating the effectiveness of pruning involves comparing the performance of the pruned model
to the original model. Here are some common methods for evaluating pruning:
Test accuracy: The most straightforward way to evaluate pruning is to compare the test accuracy
of the pruned model to the original model. If the pruned model has similar or better accuracy
than the original model, it can be considered a successful pruning.
FLOPs reduction: The number of floating point operations (FLOPs) needed for one forward pass is
a measure of the computational complexity of a model (not to be confused with FLOPS, floating
point operations per second, which measures hardware speed). Evaluating pruning based on FLOPs
reduction is useful for reducing the computational resources required to run the model. The
effectiveness of pruning can be evaluated by comparing the FLOPs of the pruned model to the
original model.
Sparsity: Pruning can also be evaluated based on the sparsity of the pruned model. Sparsity is the
percentage of weights or connections that are set to zero after pruning. Higher sparsity indicates
more aggressive pruning. The effectiveness of pruning can be evaluated by comparing the
sparsity of the pruned model to the original model.
Compression ratio: The compression ratio is the ratio of the size of the pruned model to the size
of the original model. Evaluating pruning based on compression ratio is useful for reducing the
storage requirements of the model. The effectiveness of pruning can be evaluated by comparing
the compression ratio of the pruned model to the original model.
Transfer learning: Evaluating pruning using transfer learning involves using the pruned model as
a starting point for training a new model on a related task. If the pruned model generalizes well
to the new task, it can be considered an effective pruning.
Overall, evaluating pruning involves balancing the trade-off between model size and
performance. Pruning can be considered effective if it reduces the size of the model while
maintaining or improving its performance on a given task.
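Two of the metrics above, sparsity and compression ratio, can be computed with a few lines of plain Python. The sketch assumes that model size is measured as the number of non-zero parameters (as it would be with sparse storage) and follows the definition given above, pruned size divided by original size; the weight values are made up.

```python
def pruning_report(original, pruned):
    """Compare two flat weight lists of equal length."""
    total = len(original)
    nonzero_original = sum(1 for w in original if w != 0.0)
    nonzero_pruned = sum(1 for w in pruned if w != 0.0)
    return {
        # Fraction of weights that are zero after pruning.
        "sparsity": 1.0 - nonzero_pruned / total,
        # Size of the pruned model relative to the original (smaller is better).
        "compression_ratio": nonzero_pruned / nonzero_original,
    }


original = [0.80, -0.02, 0.30, 0.01, -0.65, 0.04]
pruned = [0.80, 0.0, 0.30, 0.0, -0.65, 0.0]
report = pruning_report(original, pruned)
print(report)  # {'sparsity': 0.5, 'compression_ratio': 0.5}
```

Some sources define the compression ratio the other way round (original size divided by pruned size), so it is worth stating which convention is in use when reporting results.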
There are several types of pruning techniques that can be used in neural networks:
Weight pruning: Weight pruning involves removing small-weight connections in the network. In
this technique, the connections with the smallest absolute weights are removed. This is done
iteratively: after each pruning iteration, the network is retrained to fine-tune the remaining
weights.
Neuron pruning: Neuron pruning involves removing entire neurons from the network. In this
technique, neurons with the smallest impact on the network's output are identified and removed.
This is done iteratively: after each pruning iteration, the network is retrained to fine-tune
the remaining neurons.
Structured pruning: Structured pruning involves removing entire layers or sub-networks from the
network. In this technique, layers with the smallest impact on the network's output are identified
and removed. This is done iteratively: after each pruning iteration, the network is retrained
to fine-tune the remaining layers.
Neural network pruning is typically performed after training the original neural network, as a
post-processing step. The effectiveness of pruning is evaluated by comparing the performance of
the pruned model to the original model, using metrics such as accuracy, speed, or memory usage.
Pruning is a useful technique for reducing the size of large neural networks, which can be
computationally expensive to train and deploy. By removing unnecessary weights, connections,
or neurons, pruning can simplify the network and improve its performance. However, pruning
needs to be carefully optimized to achieve a good balance between model size and performance.
Network Pruning
• If unanticipated adjustments in data distribution may occur during deployment, don’t
prune.
• If you only have a partial understanding of the distribution shifts throughout training and
pruning, prune moderately.
• If you can account for all movements in the data distribution throughout training and
pruning, prune to the maximum extent possible.
• When retraining, specifically consider data augmentation to maximize the pruning
potential.
Types of Pruning
Pruning can take many different forms, with the approach chosen based on our desired output. In
some circumstances, speed takes precedence over memory, whereas in others, memory is
sacrificed. Pruning approaches differ in how they handle sparsity structure, scoring, scheduling,
and fine-tuning.
Structured and Unstructured Pruning
Individual parameters are pruned using an unstructured pruning approach. This results in a sparse
neural network, which, while lower in terms of parameter count, may not be configured in a way
that promotes speed improvements.
Randomly zeroing out the parameters saves memory but may not necessarily improve computing
performance because we end up conducting the same number of matrix multiplications as
before. Because we set specific weights in the weight matrix to zero, this is also known as
Weight Pruning.
To make use of hardware and software that is specialized for dense processing, structured
pruning algorithms consider parameters in groups, deleting entire neurons, filters, or channels.
We set entire columns in the weight matrix to zero, thus removing the matching output neuron.
This is also known as Unit/Neuron Pruning. For example, deleting some of the neurons in a
feed-forward layer, or some of the channels in a convolutional layer, results in a direct
reduction in computation.
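The difference between the two approaches can be sketched in a few lines of plain Python. This is
only an illustration under stated assumptions: the helper names prune_weights and prune_neurons
are hypothetical, the weight matrix is laid out as inputs (rows) by output neurons (columns), and
magnitude is used as the pruning score, which is one common choice rather than the only one.

```python
# Hypothetical helpers for illustration; not taken from any library.

def prune_weights(weights, fraction):
    """Unstructured pruning: zero the smallest-magnitude individual weights."""
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(len(magnitudes) * fraction)      # number of weights to zero
    if k == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[k - 1]            # k-th smallest magnitude
    # Ties at the threshold are zeroed too, so slightly more than k may go.
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def prune_neurons(weights, n_remove):
    """Structured pruning: delete the columns (output neurons) with the smallest L1 norm."""
    n_cols = len(weights[0])
    norms = [sum(abs(row[j]) for row in weights) for j in range(n_cols)]
    by_norm = sorted(range(n_cols), key=lambda j: norms[j])
    keep = sorted(by_norm[n_remove:])        # surviving columns, original order
    return [[row[j] for j in keep] for row in weights]


# Assumed layout: rows are inputs, columns are output neurons.
W = [[0.9, -0.05, 0.4],
     [-0.7, 0.02, 0.01],
     [0.3, -0.1, -0.8]]

sparse_W = prune_weights(W, 0.5)   # same 3x3 shape, some entries set to zero
small_W = prune_neurons(W, 1)      # 3x2 matrix, one whole neuron removed
```

Note that sparse_W keeps its original shape, so a dense matrix multiplication does the same
amount of work as before, whereas small_W is a genuinely smaller matrix.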
Advantages
• Reduces inference and training time; the gain depends on the compression method and on the
hardware
• As the neurons, connections between layers, and weights are reduced, the storage requirement
is reduced
• Reduces heat dissipation in deployed hardware such as mobile phones
• Power Saving
Disadvantages
Cross-validation
Cross-validation is a technique used in machine learning and statistical modeling to evaluate the
performance of a predictive model. The goal of cross-validation is to assess how well a model
will generalize to new data that it has not been trained on.
The basic idea of cross-validation is to divide the available data into two parts: a training set and
a validation set. The model is trained on the training set, and then its performance is evaluated on
the validation set. This process is repeated several times, with different subsets of the data used
for training and validation, and the results are averaged to get an overall estimate of the model's
performance.
The most commonly used form of cross-validation is k-fold cross-validation, which involves
dividing the data into k equally sized subsets (or "folds"). The model is trained on k-1 folds, and
then tested on the remaining fold. This process is repeated k times, with each fold used as the
validation set once. The performance of the model is then averaged across the k iterations to get
an overall estimate of its performance.
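The k-fold procedure described above can be sketched as follows. This is a minimal illustration,
not a reference implementation: the function names (k_fold_indices, cross_validate, fit_mean,
mse) are hypothetical, the toy "model" simply predicts the mean of the training targets, and the
data is split in order without the shuffling that is usually applied first.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)   # first `extra` folds get one more
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(xs, ys, k, fit, score):
    """Train on k-1 folds, validate on the remaining fold, and average the k scores."""
    scores = []
    for val_idx in k_fold_indices(len(xs), k):
        held_out = set(val_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    return sum(scores) / k


# Toy model (an assumption for illustration): always predict the training mean.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def mse(model, val_x, val_y):
    return sum((y - model) ** 2 for y in val_y) / len(val_y)


xs = list(range(6))
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
avg_mse = cross_validate(xs, ys, 3, fit_mean, mse)
```

Setting k equal to the number of samples makes every fold a single observation, which is the
leave-one-out form of cross-validation.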
process is repeated for each observation in the data set, and the results are averaged to get an
estimate of the model's performance.
Static back propagation is one type of network that aims to produce a mapping of a static input
to a static output. These kinds of networks are capable of solving static classification problems
like optical character recognition (OCR).
Recurrent back propagation is another type of network, used in fixed-point learning. The
activations in recurrent back propagation are fed forward until they settle at a fixed value.
Following this, an error is calculated and propagated backward. The NeuroSolutions software is
able to perform recurrent back propagation.
The key difference: static back propagation offers an immediate mapping, while the mapping in
recurrent back propagation is not immediate.
• The neural network is trained to enunciate each letter of a word and a sentence
• It is used in the field of speech recognition
• It is used in the field of character and face recognition
Here are some of the virtues and limitations of back propagation learning:
Virtues:
Flexibility: Back propagation can be used to train a wide range of neural network architectures,
making it a flexible algorithm that can be applied to many different types of problems.
Scalability: Back propagation can be applied to large datasets, making it an effective technique
for processing large amounts of data.
Generalization: Back propagation can be used to train neural networks to generalize well to
unseen data, making it a useful tool for tasks such as classification, regression, and image
recognition.
Limitations:
Local Minima: Back propagation can get trapped in local minima and fail to find the global
minimum of the cost function.
Overfitting: Back propagation can overfit the training data, leading to poor performance on
unseen data.
Initialization: Back propagation can be sensitive to the initialization of the weights and biases of
the neural network, which can affect the convergence rate and the final solution.
Gradient Vanishing and Exploding: Back propagation can suffer from vanishing and exploding
gradients, where the gradients become too small or too large as they are propagated back through
many layers, leading to slow convergence or instability.
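The vanishing-gradient limitation can be illustrated with a small numerical sketch. The setup is
an assumption chosen for simplicity: a chain of scalar sigmoid layers that all share the same
weight, with the hypothetical helper chain_gradient returning the derivative of the final output
with respect to the input. Since the sigmoid derivative is never larger than 0.25, the chain rule
multiplies in a factor of at most 0.25 per layer when the weight is 1.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def chain_gradient(x, weight, depth):
    """d(output)/d(input) for `depth` stacked scalar layers a = sigmoid(weight * a_prev)."""
    a, grad = x, 1.0
    for _ in range(depth):
        a = sigmoid(weight * a)
        grad *= weight * a * (1.0 - a)   # chain rule; sigmoid'(z) = a * (1 - a)
    return grad


# Gradient reaching the input for networks of increasing depth.
g1 = chain_gradient(0.5, 1.0, 1)
g5 = chain_gradient(0.5, 1.0, 5)
g10 = chain_gradient(0.5, 1.0, 10)
```

In this sketch the gradient shrinks with every added layer, which is why the earliest layers of a
deep sigmoid network can learn very slowly.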
Accelerated Convergence
Accelerated convergence is a term used in mathematics and computer science to describe a
method that speeds up the convergence of an iterative algorithm. Convergence is the process by
which an iterative algorithm approaches a solution to a problem, and the rate of convergence
determines how quickly the algorithm converges to the solution.
Accelerated convergence methods are designed to improve the rate of convergence by modifying
the iterative algorithm in some way. There are many different techniques for accelerating
convergence, including:
Aitken's delta-squared method: This method takes successive differences between terms in the
sequence generated by an iterative algorithm and uses them to apply a correction to each term.
For a linearly convergent sequence, the result is a sequence that converges more quickly than
the original sequence.
Steffensen's method: This method applies Aitken's delta-squared method inside the iteration
itself, feeding each accelerated value back into the function being iterated rather than only
post-processing the sequence of approximations generated by the algorithm. This can improve the
rate of convergence even further.
Newton's method with line search: This method involves using a line search algorithm to
determine the step size in each iteration of Newton's method. This can significantly speed up the
convergence of the algorithm.
Conjugate gradient method: This method is used for solving systems of linear equations whose
matrix is symmetric and positive definite, and it involves choosing a sequence of conjugate
directions to iteratively solve the system. The conjugate gradient method can converge much more
quickly than simpler iterative methods such as steepest descent.
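As a sketch of the first two techniques in the list above, the following applies the
delta-squared correction x_n - (x_{n+1} - x_n)^2 / (x_{n+2} - 2 x_{n+1} + x_n) to the fixed-point
iteration x = cos(x). The function names aitken and steffensen and the choice of test problem are
assumptions made for illustration.

```python
import math


def aitken(seq):
    """Apply the delta-squared transform to every three consecutive iterates."""
    out = []
    for x0, x1, x2 in zip(seq, seq[1:], seq[2:]):
        denom = x2 - 2.0 * x1 + x0            # second difference
        if denom == 0.0:
            out.append(x2)                    # differences vanished: already converged
        else:
            out.append(x0 - (x1 - x0) ** 2 / denom)
    return out


def steffensen(g, x, steps):
    """Fixed-point iteration for x = g(x) that restarts from each accelerated value."""
    for _ in range(steps):
        gx = g(x)
        ggx = g(gx)
        denom = ggx - 2.0 * gx + x
        if denom == 0.0:
            break                             # no further correction possible
        x = x - (gx - x) ** 2 / denom
    return x


# Plain fixed-point iteration x_{n+1} = cos(x_n), starting from 1.0.
xs = [1.0]
for _ in range(6):
    xs.append(math.cos(xs[-1]))

accelerated = aitken(xs)                      # post-processes the existing sequence
root = steffensen(math.cos, 1.0, 4)           # uses the correction inside the loop
fixed_point = 0.7390851332151607              # known solution of x = cos(x)
```

Here the last accelerated term is much closer to the fixed point than the last plain iterate,
and the Steffensen variant reaches it in only a few steps.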
Accelerated convergence methods are widely used in numerical analysis, scientific computing,
and optimization, where the speed of convergence can have a significant impact on the efficiency
of algorithms.
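The conjugate gradient method mentioned above can be sketched in plain Python for a small system
A x = b. The sketch assumes that A is symmetric and positive definite, which the method requires;
the helper names dot and matvec, the tolerance, and the 2x2 example are illustrative choices.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-12, max_iter=None):
    """Solve A x = b, assuming A is symmetric positive definite."""
    n = len(b)
    if max_iter is None:
        max_iter = n                          # exact arithmetic needs at most n steps
    x = [0.0] * n
    r = list(b)                               # residual b - A x for the start x = 0
    p = list(r)                               # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)           # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # New direction: residual plus a multiple of the previous direction,
        # chosen so that the directions are conjugate with respect to A.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
solution = conjugate_gradient(A, b)           # exact answer is [1/11, 7/11]
```

For this 2x2 system two iterations are enough, one per conjugate direction.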