
Unit-1

Definition of Artificial Intelligence

Artificial intelligence (AI) refers to the simulation or approximation of human
intelligence in machines. AI recreates human intelligence and behaviour using
algorithms, data, and models. It predicts, automates, and completes tasks typically
done by humans, with greater accuracy and precision, reduced bias, lower cost, and
time savings.

Key Components of AI
1. Machine learning: Machine learning is an AI application that automatically
learns and improves from previous sets of experiences without the requirement
for explicit programming.
2. Deep learning: Deep learning is a subset of ML that learns by processing data
with the help of artificial neural networks.
3. Neural network: Neural networks are computer systems that are loosely
modeled on neural connections in the human brain and enable deep learning.
4. Cognitive computing: Cognitive computing aims to recreate the human thought
process in a computer model. It seeks to imitate and improve the interaction
between humans and machines by understanding human language and the
meaning of images.
5. Natural language processing (NLP): NLP is a tool that allows computers to
comprehend, recognize, interpret, and produce human language and speech.
6. Computer vision: Computer vision employs deep learning and pattern
identification to interpret image content (graphs, tables, PDF pictures, and
videos).

How AI works
An AI system accepts data input in the form of speech, text, image, etc. The
system then processes data by applying various rules and algorithms,
interpreting, predicting, and acting on the input data. Upon processing, the
system provides an outcome, i.e., success or failure, on data input. The result is
then assessed through analysis, discovery, and feedback. Lastly, the system uses
its assessments to adjust input data, rules and algorithms, and target outcomes.
This loop continues until the desired result is achieved.
History of Machine Learning

1.Probabilistic modeling

Probabilistic modeling is the application of the principles of statistics to data analysis. It was
one of the earliest forms of machine learning, and it's still widely used to this day. One of the
best-known algorithms in this category is the Naive Bayes algorithm.

Naive Bayes is a type of machine-learning classifier based on applying Bayes' theorem while
assuming that the features in the input data are all independent (a strong, or "naive",
assumption).

A closely related model is logistic regression (logreg for short), which is sometimes
considered to be the "hello world" of modern machine learning. Logreg is a classification
algorithm rather than a regression algorithm. Much like Naive Bayes, logreg predates
computing by a long time, yet it's still useful to this day because of its simple and versatile
nature.
Early neural networks

Early iterations of neural networks have been completely supplanted by the modern
variants, but it's helpful to be aware of how deep learning originated. Although the core
ideas of neural networks were investigated in toy forms as early as the 1950s, the approach
took decades to get started. This changed in the mid-1980s, when multiple people
independently rediscovered the backpropagation algorithm, a way to train chains of
parametric operations using gradient-descent optimization.

The first successful practical application of neural nets came in 1989 from Bell Labs, when
Yann LeCun combined the earlier ideas of convolutional neural networks and
backpropagation, and applied them to the problem of classifying handwritten digits. The
resulting network, dubbed LeNet, was used by the United States Postal Service in the 1990s
to automate the reading of ZIP codes on mail envelopes.

Kernel methods

Kernel methods are a group of classification algorithms, the best known of which is the
support vector machine (SVM). The modern formulation of an SVM was developed by
Vladimir Vapnik and Corinna Cortes in the early 1990s at Bell Labs and published in 1995,
although an older linear formulation was published by Vapnik and Alexey Chervonenkis as
early as 1963. SVMs aim at solving classification problems by finding good decision
boundaries between two sets of points belonging to two different categories.

A decision boundary can be thought of as a line or surface separating your training data into
two spaces corresponding to two categories. To classify new data points, we just need to
check which side of the decision boundary they fall on.

SVMs proceed to find these boundaries in two steps:

1. The data is mapped to a new high-dimensional representation where the decision boundary
can be expressed as a hyperplane (if the data were two-dimensional, a hyperplane would be a
straight line).

2. A good decision boundary (a separating hyperplane) is computed by trying to maximize the
distance between the hyperplane and the closest data points from each class, a step called
maximizing the margin. This allows the boundary to generalize well to new samples outside
of the training dataset.

Kernel methods are used to convert the input data into a high-dimensional feature space, which
makes it simpler to distinguish between classes or generate predictions. Kernel methods
employ a kernel function to implicitly map the data into the feature space, as opposed to
manually computing the feature space.

The most popular kind of kernel approach is the Support Vector Machine (SVM), a binary
classifier that determines the best hyperplane that most effectively divides the two groups. In
order to efficiently locate the ideal hyperplane, SVMs map the input into a higher-
dimensional space using a kernel function.
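As a concrete illustration of this idea, the following is a minimal sketch of a kernel SVM using scikit-learn (the library, dataset, and parameter values are illustrative assumptions, not part of the original notes):

# Hedged sketch: SVM with an RBF kernel on a toy, non-linearly separable dataset.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=42)  # two interleaving half-moons
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# The RBF kernel implicitly maps the inputs to a high-dimensional space
# where a maximum-margin separating hyperplane is sought.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))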

Decision trees, random forests, and gradient boosting machines

Decision trees are flowchart-like structures that let you classify input data points or predict
output values given inputs. They’re easy to visualize and interpret.

The Random Forest algorithm introduced a robust, practical take on decision-tree learning that
involves building a large number of specialized decision trees and then ensembling their
outputs. Random forests are applicable to a wide range of problems and are often considered
the second-best algorithm for any shallow machine-learning task.

A gradient boosting machine, much like a random forest, is a machine-learning technique
based on ensembling weak prediction models, generally decision trees.

Gradient boosting is a way to improve any machine-learning model by iteratively training
new models that specialize in addressing the weak points of the previous models.
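A minimal sketch of both ensemble techniques with scikit-learn follows (the dataset and hyperparameters are illustrative assumptions only):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Random forest: many decorrelated decision trees whose predictions are ensembled.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Gradient boosting: trees are added one at a time, each correcting the errors of the previous ones.
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, random_state=0).fit(X_train, y_train)

print("Random forest accuracy:", rf.score(X_test, y_test))
print("Gradient boosting accuracy:", gbm.score(X_test, y_test))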
Fundamentals of Machine Learning

Definition

Machine learning enables a machine to automatically learn from data, improve performance
from experience, and predict things without being explicitly programmed.

With the help of sample historical data, known as training data, machine learning
algorithms build a mathematical model that helps in making predictions or decisions
without being explicitly programmed. Machine learning brings computer science and
statistics together for creating predictive models. Machine learning constructs or uses
algorithms that learn from historical data. The more information we provide, the higher
the performance will be.

How does Machine Learning work

A Machine Learning system learns from historical data, builds prediction models,
and whenever it receives new data, predicts the output for it. The accuracy of the predicted
output depends upon the amount of data, as a large amount of data helps to build a better
model which predicts the output more accurately.

Suppose we have a complex problem where we need to perform some predictions. Instead
of writing code for it, we just need to feed the data to generic algorithms, and with the help
of these algorithms, the machine builds the logic as per the data and predicts the output.
Machine learning has changed our way of thinking about such problems.

Features of Machine Learning:

o Machine learning uses data to detect various patterns in a given dataset.


o It can learn from past data and improve automatically.
o It is a data-driven technology.

Machine learning is similar to data mining, as both deal with huge amounts of data.
Four Forms of Machine Learning

 Supervised Learning
 Unsupervised Learning
 Self Supervised Learning
 Reinforcement Learning

1) Supervised Learning

Supervised learning is a type of machine learning method in which we provide sample
labeled data to the machine learning system in order to train it, and on that basis, it predicts
the output.

The system creates a model using labeled data to understand the datasets and learn about each
data point; once the training and processing are done, we test the model by providing sample
data to check whether it predicts the exact output or not.

The goal of supervised learning is to map input data to the output data. Supervised
learning is based on supervision, and it is the same as when a student learns things under the
supervision of a teacher. An example of supervised learning is spam filtering.

Supervised learning can be grouped further in two categories of algorithms:

o Classification
o Regression

Generally, almost all applications of deep learning that are in the spotlight these days belong
in this category, such as optical character recognition, speech recognition, image
classification, and language translation.

Although supervised learning mostly consists of classification and regression, there are more
exotic variants as well, including the following (with examples):

1. Sequence generation—Given a picture, predict a caption describing it. Sequence
generation can sometimes be reformulated as a series of classification problems (such
as repeatedly predicting a word or token in a sequence).
2. Syntax tree prediction—Given a sentence, predict its decomposition into a syntax
tree.
3. Object detection—Given a picture, draw a bounding box around certain objects
inside the picture. This can also be expressed as a classification problem (given many
candidate bounding boxes, classify the contents of each one) or as a joint
classification and regression problem, where the bounding-box coordinates are
predicted via vector regression.
4. Image segmentation—Given a picture, draw a pixel-level mask on a specific object.

2) Unsupervised Learning

Unsupervised learning is a learning method in which a machine learns without any
supervision.

The training is provided to the machine with the set of data that has not been labeled,
classified, or categorized, and the algorithm needs to act on that data without any supervision.
The goal of unsupervised learning is to restructure the input data into new features or a group
of objects with similar patterns.

In unsupervised learning, we don't have a predetermined result. The machine tries to find
useful insights from the huge amount of data. It can be further classified into two categories
of algorithms:

o Clustering
o Association

Unsupervised learning is the bread and butter of data analytics, and it’s often a
necessary step in better understanding a dataset before attempting to solve a
supervised-learning problem. Dimensionality reduction and clustering are well-known
categories of unsupervised learning.

3)Self Supervised Learning

This is a specific instance of supervised learning, but it’s different enough that it deserves its
own category. Self-supervised learning is supervised learning without human-annotated
labels. There are still labels involved (because the learning has to be supervised by
something), but they’re generated from the input data, typically using a heuristic algorithm.
For example, autoencoders are a well-known instance of self-supervised learning, where the
generated targets are the input, unmodified. In the same way, trying to predict the next frame
in a video given past frames, or the next word in a text given previous words, are instances
of self-supervised learning.

4) Reinforcement Learning

Reinforcement learning is a feedback-based learning method, in which a learning agent gets a
reward for each right action and gets a penalty for each wrong action. The agent learns
automatically from this feedback and improves its performance. In reinforcement learning,
the agent interacts with the environment and explores it. The goal of an agent is to get the
most reward points, and hence, it improves its performance.

A robotic dog that automatically learns the movement of its arms is an example of
reinforcement learning.

Evaluating machine-learning models

The train/test/validation split

We can properly evaluate our model only by not training it on the entire dataset. A typical
train/test split would be to use 70% of the data for training and 30% of the data for testing.

It's important to use new data when evaluating our model to reduce the likelihood of
overfitting to the training set.

To evaluate the model while still building and tuning it, we create a third subset of
the data known as the validation set. A typical train/test/validation split would be to use 60%
of the data for training, 20% of the data for validation, and 20% of the data for testing.

The reason for choosing a validation set is that developing a model always involves tuning its
configuration: for example, choosing the number of layers or the size of the layers (called the
hyperparameters of the model, to distinguish them from the parameters, which are the
network's weights).

We do this tuning by using as a feedback signal the performance of the model on the
validation data. In essence, this tuning is a form of learning: a search for a good configuration
in some parameter space. As a result, tuning the configuration of the model based on its
performance on the validation set can quickly result in overfitting to the validation set, even
though your model is never directly trained on it.

We need to use a completely different, never-before-seen dataset to evaluate the model: the
test dataset. Our model shouldn't have had access to any information about the test set, even
indirectly.
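A minimal sketch of the 60/20/20 split described above, using scikit-learn's train_test_split (the random data and random_state values are illustrative assumptions):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)        # illustrative features
y = np.random.randint(0, 2, 1000)   # illustrative binary labels

# First hold out the 20% test set; it is never used for training or tuning.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Then split the remaining 80% into 60% train / 20% validation (0.25 of 80% = 20% overall).
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200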

Classic evaluation methods

1. Simple hold-out validation

2. K-fold validation

3. Iterated K-fold validation

Refer textbook2
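For reference, a minimal sketch of K-fold validation with scikit-learn (K=5, the iris dataset, and logistic regression are illustrative assumptions; the full treatment is in the textbook):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each of the 5 folds is used once as validation data while the other 4 folds are used for training.
scores = cross_val_score(model, X, y, cv=5)
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())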
Overfitting and Underfitting

Bias and Variance in Machine Learning


 Bias: Assumptions made by a model to make a function easier to learn. It corresponds to the
error rate on the training data. When the error rate has a high value, we call it high bias,
and when the error rate has a low value, we call it low bias.
 Variance: The difference between the error rate on training data and on testing data is
called variance. If the difference is high then it's called high variance, and when the
difference in errors is low then it's called low variance. Usually, we want low
variance so that our model generalizes well.

Underfitting in Machine Learning


 A statistical model or a machine learning algorithm is said to have underfitting when
it cannot capture the underlying trend of the data, i.e., it performs poorly even on the
training data, and therefore also on testing data. (It's just like trying to fit
undersized pants!)
 Underfitting destroys the accuracy of our machine-learning model. Its occurrence
simply means that our model or the algorithm does not fit the data well enough.
 It usually happens when we have too little data to build an accurate model and also when
we try to build a linear model from non-linear data.
 In such cases, the rules of the machine learning model are too simple to
be applied to such data, and therefore the model will probably make a lot of
wrong predictions.
 Underfitting can be avoided by using more training data and by increasing the number of
features or the model's complexity.
 An underfitted model has high bias and low variance.

In a nutshell, underfitting refers to a model that neither performs well on the training
data nor generalizes to new data.
Reasons for Underfitting
1. High bias and low variance.
2. The size of the training dataset used is not enough.
3. The model is too simple.
4. Training data is not cleaned and also contains noise in it.
Techniques to Reduce Underfitting
1. Increase model complexity.
2. Increase the number of features, performing feature engineering.
3. Remove noise from the data.
4. Increase the number of epochs or increase the duration of training to get better results.

Overfitting in Machine Learning


 A statistical model is said to be overfitted when it performs well on the training data
but does not make accurate predictions on testing data.
 When a model gets trained on too much data, it starts learning from the noise and
inaccurate data entries in our data set. Testing with test data then results in
high variance.
 The model then does not categorize the data correctly, because of too many details
and noise. The causes of overfitting are non-parametric and non-linear methods,
because these types of machine learning algorithms have more freedom in building
the model based on the dataset and therefore they can build unrealistic
models.
 A solution to avoid overfitting is using a linear algorithm if we have linear data, or
using parameters like the maximal depth if we are using decision trees.
 An overfitted model has low bias and high variance.

Reasons for Overfitting:

1. High variance and low bias.


2. The model is too complex.
3. The size of the training data.
Techniques to Reduce Overfitting
1. Increase training data.
2. Reduce model complexity.
3. Early stopping during the training phase (have an eye over the loss over the training
period as soon as loss begins to increase stop training).
4. Ridge Regularization and Lasso Regularization .
5. Use dropout for neural networks to tackle overfitting.

Optimization and Generalization

The fundamental issue in machine learning is the tension between optimization and
generalization.

Optimization refers to the process of adjusting a model to get the best performance possible
on the training data, while
generalization refers to how well the trained model performs on data it has never seen
before.

The process of fighting overfitting is called regularization.

Regularization techniques

1. Reducing the network's size

Simplifying The Model


The first step when dealing with overfitting is to decrease the complexity of the model. To
decrease the complexity, we can simply remove layers or reduce the number of neurons to
make the network smaller. There is no general rule on how much to remove or how large
your network should be. But, if your neural network is overfitting, try making it smaller.

The simplest way to prevent overfitting is to reduce the size of the model: the number of
learnable parameters in the model (which is determined by the number of layers and the
number of units per layer). In deep learning, the number of learnable parameters in a model is
often referred to as the model’s capacity. Intuitively, a model with more parameters has more
memorization capacity and therefore can easily learn a perfect dictionary-like mapping
between training samples and their targets—a mapping without any generalization power.

Refer Textbook2
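As a minimal sketch of reducing capacity (assuming TensorFlow/Keras is available; the layer sizes and input shape are illustrative, not from the notes):

from tensorflow import keras
from tensorflow.keras import layers

# Higher-capacity model: more units per layer, more learnable parameters.
big_model = keras.Sequential([
    keras.Input(shape=(10000,)),
    layers.Dense(512, activation="relu"),
    layers.Dense(512, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

# Smaller model: fewer units means fewer parameters, less memorization capacity,
# and therefore less tendency to overfit the training data.
small_model = keras.Sequential([
    keras.Input(shape=(10000,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])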

2. Adding weight regularization


A common way to mitigate overfitting is to put constraints on the complexity
of a network by forcing its weights to take only small values, which makes the
distribution of weight values more regular. This is called weight regularization, and
it's done by adding to the loss function of the network a cost associated with having
large weights.

Regularization is a technique to reduce the complexity of the model. It does so by
adding a penalty term to the loss function. The most common techniques are known
as L1 and L2 regularization:

The L1 penalty penalizes the sum of the absolute values of the weights, while the L2
penalty penalizes the sum of the squared values of the weights.

If the data is too complex to be modeled accurately, then L2 is a better choice as it is
able to learn the inherent patterns present in the data, while L1 is better if the data is
simple enough to be modeled accurately.

1.L1 Regularization
2. L2 Regularization

A regression model that uses the L1 regularization technique is called Lasso Regression, and a
model which uses L2 is called Ridge Regression.

The key difference between these two is the penalty term.

Ridge Regression adds the "squared magnitude" of the coefficients as a penalty term to the loss
function:

Cost function: Loss = Σᵢ (yᵢ - ŷᵢ)² + λ Σⱼ wⱼ²

Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the "absolute value of
magnitude" of the coefficients as a penalty term to the loss function:

Cost function: Loss = Σᵢ (yᵢ - ŷᵢ)² + λ Σⱼ |wⱼ|

Refer Textbook2
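A minimal Keras sketch of adding an L2 (Ridge-style) weight penalty (assuming TensorFlow/Keras; the regularization strength 0.001 and layer sizes are illustrative):

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    keras.Input(shape=(10000,)),
    # l2(0.001): every weight adds 0.001 * weight**2 to the total loss,
    # pushing the weights toward small values.
    layers.Dense(16, activation="relu", kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(16, activation="relu", kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(1, activation="sigmoid"),
])

# regularizers.l1(0.001) applies a Lasso-style penalty, and
# regularizers.l1_l2(l1=0.001, l2=0.001) applies both at once.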

Adding dropout

Dropout is one of the most effective and most commonly used regularization techniques for
neural networks, developed by Geoff Hinton and his students at the University of Toronto

Dropout, applied to a layer, consists of randomly dropping out (setting to zero) a number of
output features of the layer during training.
During training, some layer outputs are ignored or dropped at random. This makes the layer
appear, and be treated, as if it had a different number of nodes and connectivity to the
preceding layer.

Dropout makes the training process noisy, requiring nodes within a layer to take on more or
less responsibility for the inputs on a probabilistic basis.

Dropout may break up situations in which network layers co-adapt to correct mistakes
made by prior layers, making the model more robust.

If neurons are randomly dropped out of the network during training, other neurons will have
to step in and handle the representation required to make predictions for the missing neurons.
This is believed to result in multiple independent internal representations being learned by the
network.

The effect is that the network becomes less sensitive to the specific weights of neurons. This,
in turn, results in a network capable of better generalization and less likely to overfit the
training data.
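A minimal Keras sketch of dropout layers (assuming TensorFlow/Keras; the 0.5 dropout rate and layer sizes are illustrative choices):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(10000,)),
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.5),  # randomly zeroes 50% of this layer's outputs, during training only
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])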

Basics of Neural Networks


Neural networks or artificial neural networks are fundamental tools in machine learning,
powering many state-of-the-art algorithms and applications across various domains, including
computer vision, natural language processing, robotics, and more.

A neural network consists of interconnected nodes, called neurons, organized into layers.
Each neuron receives input signals, performs a computation on them using an activation
function, and produces an output signal that may be passed to other neurons in the network.
An activation function determines the output of a neuron given its input. These functions
introduce nonlinearity into the network, enabling it to learn complex patterns in data.

The network is typically organized into layers, starting with the input layer, where data is
introduced, followed by hidden layers where computations are performed, and finally the
output layer where predictions or decisions are made.

Neurons in adjacent layers are connected by weighted connections, which transmit signals
from one layer to the next. The strength of these connections, represented by weights,
determines how much influence one neuron's output has on another neuron's input. During
the training process, the network learns to adjust its weights based on examples provided in a
training dataset. Additionally, each neuron typically has an associated bias, which allows the
neuron to adjust its output threshold.

Neural networks are trained using techniques called feedforward propagation
and backpropagation. During feedforward propagation, input data is passed through the
network layer by layer, with each layer performing a computation based on the inputs it
receives and passing the result to the next layer.

Backpropagation is an algorithm used to train neural networks by iteratively adjusting the
network's weights and biases in order to minimize the loss function. A loss function (also
known as a cost function or objective function) is a measure of how well the model's
predictions match the true target values in the training data. The loss function quantifies the
difference between the predicted output of the model and the actual output, providing a signal
that guides the optimization process during training.

The goal of training a neural network is to minimize this loss function by adjusting the
weights and biases. The adjustments are guided by an optimization algorithm, such as
gradient descent.

Types of Neural Networks

The simplest neural network is the 'perceptron'. It consists
of a single layer, which is the input layer, with multiple neurons with their own weights; there are no
hidden layers. The perceptron algorithm learns the weights for the input signals in order to draw a
linear decision boundary.

However, to solve more complicated, non-linear problems related to image processing, computer
vision, and natural language processing tasks, we work with deep neural networks.

Feedforward Neural Networks (FNN)

These are the simplest form of ANNs, where information flows in one direction, from input
to output. There are no cycles or loops in the network architecture. Multilayer perceptrons
(MLP) are a type of feedforward neural network.
Recurrent Neural Networks (RNN)

In RNNs, connections between nodes form directed cycles, allowing information to persist
over time. This makes them suitable for tasks involving sequential data, such as time series
prediction, natural language processing, and speech recognition.

Convolutional Neural Networks (CNN)

CNNs are designed to effectively process grid-like data, such as images. They consist of
layers of convolutional filters that learn hierarchical representations of features within the
input data. CNNs are widely used in tasks like image classification, object detection, and
image segmentation.

Long Short-Term Memory Networks (LSTM) and Gated Recurrent Units (GRU)

These are specialized types of recurrent neural networks designed to address the vanishing
gradient problem in traditional RNNs. LSTMs and GRUs incorporate gated mechanisms to
better capture long-range dependencies in sequential data, making them particularly effective
for tasks like speech recognition, machine translation, and sentiment analysis.

Autoencoder

It is designed for unsupervised learning and consists of an encoder network that compresses
the input data into a lower-dimensional latent space, and a decoder network that reconstructs
the original input from the latent representation. Autoencoders are often used for
dimensionality reduction, data denoising, and generative modeling.

Generative Adversarial Networks (GAN)

GANs consist of two neural networks, a generator and a discriminator, trained
simultaneously in a competitive setting. The generator learns to generate synthetic data
samples that are indistinguishable from real data, while the discriminator learns to distinguish
between real and fake samples. GANs have been widely used for generating realistic images,
videos, and other types of data.

Feed-Forward Neural Network in Deep Learning


A feedforward neural network is one of the simplest types of artificial neural
networks devised. In this network, the information moves in only one direction—
forward—from the input nodes, through the hidden nodes (if any), and to the
output nodes. There are no cycles or loops in the network. Feedforward neural
networks were the first type of artificial neural network invented and are simpler
than their counterparts like recurrent neural networks and convolutional neural
networks.
Architecture of Feedforward Neural Networks
The architecture of a feedforward neural network consists of three types of layers:
the input layer, hidden layers, and the output layer. Each layer is made up of units
known as neurons, and the layers are interconnected by weights.

 Input Layer: This layer consists of neurons that receive inputs and pass them on
to the next layer. The number of neurons in the input layer is determined by the
dimensions of the input data.
 Hidden Layers: These layers are not exposed to the input or output and can be
considered as the computational engine of the neural network. Each hidden layer's
neurons take the weighted sum of the outputs from the previous layer, apply
an activation function, and pass the result to the next layer. The network can have
zero or more hidden layers.
 Output Layer: The final layer that produces the output for the given inputs. The
number of neurons in the output layer depends on the number of possible outputs
the network is designed to produce.

Artificial neural networks (ANNs) are comprised of node layers,
containing an input layer, one or more hidden layers, and an output layer.
Each node, or artificial neuron, connects to others and has an associated
weight and threshold.
Input Layer
Neurons in the input layer don’t perform any calculations; they are
simply placeholders for input data. This placeholding is essential
because the use of artificial neural networks involves performing
computations on matrices that have predefined dimensions.
Dense Layers
There are many kinds of hidden layers; the most general type is the
dense layer, which can also be called a fully connected layer. Dense
layers are found in many deep learning architectures.
Each of the neurons in a given dense layer receives information from
every one of the neurons in the preceding layer of the network:
a dense layer is fully connected to the layer before it.
Dense layers are broadly useful, because they can nonlinearly
recombine the information provided by the preceding layer of the network.

Artificial Neuron
Anatomy of an artificial neuron.

Perceptron
A perceptron is a type of neural network that performs binary classification: it
maps input features to an output decision, usually classifying data into one of two
categories, such as 0 or 1.
A perceptron consists of a single layer of input nodes that are fully connected to a
layer of output nodes.
Types of Perceptron
1. Single-Layer Perceptron is a type of perceptron that is limited to learning
linearly separable patterns. It is effective for tasks where the data can be
divided into distinct categories by a straight line. While powerful in its
simplicity, it struggles with more complex problems where the relationship
between inputs and outputs is non-linear.
A single-layer perceptron (SLP) is the simplest type of artificial neural network. It
consists of a single layer of neurons that directly connect to the input features. It's a
foundational concept in understanding neural networks and is limited to learning
linearly separable patterns.

Key Characteristics:
 Single Layer:
The defining feature is the single layer of neurons. There are no hidden layers between the
input and output.
 Linear Separability:
SLPs can only classify data that can be divided by a straight line (in 2D) or a hyperplane (in
higher dimensions). They struggle with non-linearly separable data like XOR, circles, or
spirals.
 Input Connections:
Each neuron in the SLP is directly connected to all input features.
 Activation Function:
A threshold or step function is typically used to determine the output of each neuron based
on a weighted sum of the inputs.
How it works:
1. Weighted Sum:
The input features are multiplied by corresponding weights, and these weighted values are
summed up.
2. Activation Function:
The sum is then passed through an activation function, which determines the neuron's
output. The most common activation function for SLPs is a step function that outputs 1 if
the sum is above a threshold and 0 otherwise.
3. Output:
The output of the activation function is the final prediction of the perceptron.
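A minimal NumPy sketch of the three steps above, for a single perceptron with a step activation (the input values, weights, and bias are toy numbers chosen for illustration):

import numpy as np

def perceptron_predict(x, weights, bias, threshold=0.0):
    # 1. Weighted sum of the inputs plus the bias.
    weighted_sum = np.dot(x, weights) + bias
    # 2. Heaviside step activation: 1 if the sum exceeds the threshold, else 0.
    # 3. The result of the activation is the perceptron's prediction.
    return 1 if weighted_sum > threshold else 0

x = np.array([1.0, 0.0, 1.0])         # input features (toy values)
weights = np.array([0.4, -0.2, 0.6])  # learned weights (toy values)
bias = -0.5

print(perceptron_predict(x, weights, bias))  # 1, since 0.4 + 0.6 - 0.5 = 0.5 > 0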

2. Multi-Layer Perceptrons possess enhanced processing capabilities as they
consist of two or more layers, adept at handling more complex patterns and
relationships within the data.
Basic Components of Perceptron
A Perceptron is composed of key components that work together to process
information and make predictions.
 Input Features: The perceptron takes multiple input features, each
representing a characteristic of the input data.
 Weights: Each input feature is assigned a weight that determines its influence
on the output. These weights are adjusted during training to find the optimal
values.
 Each connection between neurons has a weight which is one of the
factors that is changed during training. The weight of the connection
affects how much input is passed between neurons. This behavior
follows the formula inputs * weights
Weights represent the strength of connections between neurons in different layers.
They determine how much influence one neuron's output has on the input of another
neuron.
Example:
In a simple linear equation (y = mx + b), the weight (m) determines the slope of
the line and how much x affects y.
During training, the network adjusts weights to minimize the error between predicted and
actual outputs.
 Summation Function: The perceptron calculates the weighted sum of its
inputs, combining them with their respective weights.
 Activation Function: Captures non-linear relationships between the
inputs and contributes to converting the input into a more usable output.
In the classic perceptron, the weighted sum is passed through the Heaviside step function,
comparing it to a threshold to produce a binary output (0 or 1).

 Output: The final output is determined by the activation function, often used
for binary classification tasks.
 Bias: The bias term helps the perceptron make adjustments independent of the
input, improving its flexibility in learning.
 Biases are constants added to the weighted sum of inputs before an activation function is
applied.
 In the linear equation (y = mx + b), the bias (b) represents the y-intercept and determines the
output when x is zero
 Like weights, biases are also learned during training.
 The output of a neuron is expressed by the formula
output = inputs * weights + bias

Importance:
Flexibility:
Weights and biases provide the neural network with the flexibility to learn complex
relationships in data.
Non-linearity:
They enable the network to model non-linear relationships, which is crucial for solving
many real-world problems.
Pattern Recognition:
By adjusting weights and biases, the network can learn to recognize specific patterns and
make accurate predictions

How they work together:


 The weighted sum of inputs is calculated for each neuron, and the bias is added to this sum.
This combined value is then passed through an activation function, which introduces non-linearity

 The output of the activation function becomes the input for the next layer, and the process
repeats.
 In essence, weights and biases are the knobs that a neural network uses to fine-tune
its behavior and learn from data

 Learning Algorithm: The perceptron adjusts its weights and bias using a
learning algorithm, such as the Perceptron Learning Rule , to minimize
prediction errors.

Artificial Neural Networks Architecture


Weight
The weight of each connection determines how much input is passed between
neurons (see the Weights discussion under Basic Components of Perceptron above).
Bias
Once a neuron receives inputs from all the other neurons connected to it,
a bias, a constant value, is added to the previous
computation involving the weights.

Improving Deep Networks

Weight initialization is an important consideration in the design of a
neural network model.
The nodes in neural networks are composed of parameters referred to
as weights, used to calculate a weighted sum of the inputs.

Neural network models are fit using an optimization algorithm called
stochastic gradient descent that incrementally changes the network
weights to minimize a loss function, hopefully resulting in a set of
weights for the model that is capable of making useful predictions.

This optimization algorithm requires a starting point in the space of
possible weight values from which to begin the optimization process.
Weight initialization is a procedure to set the weights of a neural
network to small random values that define the starting point for the
optimization (learning or training) of the neural network model.

A multilayer perceptron (MLP) is a type of feedforward artificial neural network
characterized by multiple layers of interconnected nodes, including an input layer,
one or more hidden layers, and an output layer. MLPs are capable of learning
complex, non-linear relationships within data, making them useful for a wide range
of tasks like classification and regression.

Key Features:
 Fully Connected: Each node in a layer is connected to every node in the adjacent layers.
 Feedforward: Information flows in one direction, from input to output, without cycles.
 Non-linear Activation Functions: Introduce non-linearity, allowing the network to model
complex relationships.
 Multiple Layers: At least one hidden layer is present, enabling the network to learn
hierarchical representations of data.
 Backpropagation: The learning algorithm used to adjust the network's weights and biases
based on the error between predicted and actual outputs.
How it works:
1. Input Layer: Receives the initial data or features.
2. Hidden Layers: Perform computations and feature extraction.
3. Output Layer: Produces the final prediction or classification.
4. Forward Pass: Data flows through the network, with each layer applying weights, biases,
and activation functions.
5. Error Calculation: The difference between the predicted and actual output is calculated.
6. Backward Pass (Backpropagation): The error is used to adjust the weights and biases,
minimizing the error.
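A minimal Keras sketch of the MLP workflow described above (assuming TensorFlow/Keras; the random data, layer sizes, and training settings are illustrative assumptions):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(1000, 20)             # 1000 samples, 20 features (toy data)
y = np.random.randint(0, 2, size=1000)   # binary labels (toy data)

model = keras.Sequential([
    keras.Input(shape=(20,)),              # input layer
    layers.Dense(32, activation="relu"),   # hidden layer 1
    layers.Dense(16, activation="relu"),   # hidden layer 2
    layers.Dense(1, activation="sigmoid"), # output layer for binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# The forward pass, error calculation, and backpropagation all happen inside fit().
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)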
Applications:
 Image Recognition: Classifying images into different categories.
 Natural Language Processing: Analyzing and understanding text.
 Speech Recognition: Converting spoken language into text.
 Regression Problems: Predicting continuous values.

Cross-entropy loss Estimation

Cross-entropy loss, also known as log loss, is a measure of the difference between
two probability distributions. In machine learning, particularly in classification
problems, it quantifies the dissimilarity between a model's predicted probability
distribution and the true distribution (often a one-hot encoded vector representing
the correct class). Lower cross-entropy loss indicates a better model fit, with 0
representing a perfect prediction.

 Measures Dissimilarity:
Cross-entropy loss quantifies how different the predicted probability distribution is from
the actual distribution.
 Common in Classification:
It's widely used as a loss function in classification tasks, including binary and multi-class
scenarios.
 Interpreting the Value:
The loss is non-negative; 0 is the best possible score (a perfect prediction), and the loss grows
without bound as predictions become worse.
 Optimization Goal:
During training, optimization algorithms aim to minimize the cross-entropy loss, pushing
the model's predictions closer to the true labels.
Mathematical Formulation:
In a binary classification problem with true label y (either 0 or 1) and predicted
probability p, the cross-entropy loss is calculated as:
 -(y * log(p) + (1 - y) * log(1 - p))

For multi-class classification, where the model predicts probabilities for multiple
classes, the formula becomes:
 -Σ (yᵢ * log(pᵢ)), where i ranges over all classes
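A minimal NumPy sketch of both formulas (the probabilities below are toy values chosen for illustration):

import numpy as np

# Binary case: true label y and predicted probability p for class 1.
y, p = 1, 0.9
binary_ce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
print(binary_ce)  # about 0.105

# Multi-class case: one-hot true labels and a predicted probability distribution.
y_true = np.array([0, 1, 0])        # the true class is the second one
p_pred = np.array([0.2, 0.7, 0.1])  # model's predicted probabilities
multiclass_ce = -np.sum(y_true * np.log(p_pred))
print(multiclass_ce)  # about 0.357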
Practical Applications:
 Image Classification:
Models predict the probability of an image belonging to different classes (e.g., cat, dog,
bird), and cross-entropy loss helps evaluate the accuracy of these predictions.
 Language Modeling:
In language models, cross-entropy loss measures how well the model predicts the next
word in a sequence.
 Recommender Systems:
It can be used to assess how well a model predicts user preferences or item
recommendations.

Entropy calculates the degree of randomness or disorder within a system. In the context of
information theory, the entropy of a random variable is the average uncertainty, surprise, or
information inherent to the possible outcomes. To put things simply, it measures the
uncertainty of an event.

The Shannon entropy equation: H(X) = -Σ p(x) log p(x), summed over all possible outcomes x.

Cross-entropy, also known as logarithmic loss or log loss, is a popular loss function used in machine
learning to measure the performance of a classification model.

The cross-entropy loss function is used to find the optimal solution by adjusting the weights
of a machine learning model during training. The objective is to minimize the error between
the actual and predicted outcomes. A lower cross-entropy value indicates better performance.

cross-entropy measures the average number of bits required to identify an

event from one probability distribution, P, using the optimal code for another probability distribution,
Q, and is typically used in machine learning to evaluate the performance of a model where the
objective is to minimize the error between the predicted probability distribution and true distribution.

Cross Entropy as a Loss Function


In machine learning, loss functions help models determine how wrong they are and improve
themselves based on that wrongness. They are mathematical functions that quantify the
difference between predicted and actual values in a machine learning model.

The measure of error from a loss function also serves as a guide during
the optimization process by providing feedback to the model on how well it fits the data.
Hence, most machine learning models implement a loss function during the optimization
phase, where the model parameters are chosen to help the model minimize the error and
arrive at an optimal solution – the smaller the error, the better the model.

We can measure the error between two probability distributions using the cross-entropy loss
function. For example, let’s assume we’re conducting a binary classification task (a
classification task with two classes, 0 and 1)
Binary cross entropy formula

Multiclass Cross Entropy Loss


Multiclass Cross-Entropy Loss, also known as categorical cross-entropy
or softmax loss, is a widely used loss function for training models in multiclass
classification problems. For a dataset with N instances and K classes, Multiclass Cross-Entropy
Loss is calculated as: L = -(1/N) Σᵢ Σₖ yᵢₖ log(pᵢₖ)

The cross-entropy loss is a scalar value that quantifies how far off the model's
predictions are from the true labels. For each sample in the dataset, the cross-
entropy loss reflects how well the model's prediction matches the true label. A
lower loss for a sample indicates a more
accurate prediction, while a higher loss suggests a larger discrepancy.
 Interpretability with Binary Classification:
o In binary classification, since there are only two classes (0 and 1), it is
straightforward to interpret the loss value:
o If the true label is 1, the loss is primarily influenced by how close
the predicted probability for class 1 is to 1.0.
o If the true label is 0, the loss is influenced by how close the
predicted probability for class 1 is to 0.0.

Interpretability with Multiclass Classification:


o In multiclass classification, only the true class contributes to the
loss; the terms for the other classes are zero and add nothing to the loss
function.
o Lower loss indicates that the model is assigning high probabilities to
the correct class and low probabilities to incorrect classes.

Entropy:

Entropy, also known as Shannon entropy, was formally introduced in 1948 by Claude Shannon.

Entropy calculates the degree of randomness or disorder within a system;
it measures the uncertainty of an event.

Example: In a binary classification problem, a model predicts the probability
of an email being spam. If the email is actually spam, and the model predicts a
probability of 0.9 (90% chance of being spam), the cross-entropy loss will be
relatively low. However, if the model predicts a probability of 0.1 (10% chance of
being spam), the cross-entropy loss will be much higher, indicating a poor
prediction.
Example: Consider a binary classification problem (e.g., cat vs. not cat) where your model
predicts the following probabilities for an image:
 Predicted probability of cat: 0.8
 Predicted probability of not cat: 0.2
And the true label for this image is:
 True label (cat): 1
 True label (not cat): 0
Calculation:
1. Binary Cross-Entropy Loss: The formula for binary cross-entropy loss is:
Loss = -(y * log(p) + (1 - y) * log(1 - p))

where:
 y is the true label (0 or 1)
 p is the predicted probability for the true class (e.g., if y = 1, then p is the probability of the cat
being present)
2. Applying the formula:
In our example: y = 1 (cat) and p = 0.8 (predicted probability of cat).
Therefore, the loss for this single sample is:
Loss = -(1 * log(0.8) + (1 - 1) * log(1 - 0.8))
Loss = -(log(0.8))
Loss ≈ -(-0.223)
Loss ≈ 0.223

Interpretation:

A lower cross-entropy loss (closer to 0) indicates a better prediction, meaning the
model's predicted probability is closer to the actual label. In our example, the loss
of 0.223 suggests the model is relatively confident (0.8 probability) and accurate
(since the true label was 1).
Several Python libraries are commonly used for estimating Cross-Entropy Loss,
particularly in the context of machine learning and deep learning.
 Scikit-learn (sklearn): For general machine learning tasks, sklearn.metrics.log_loss provides a
straightforward way to calculate cross-entropy loss (also known as log loss) for classification
problems.
from sklearn.metrics import log_loss
import numpy as np

y_true = np.array([0, 1, 0, 1])  # True labels
# Predicted probabilities, one row per sample: [P(class 0), P(class 1)]
y_pred = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])

loss = log_loss(y_true, y_pred)
print(f"Cross-Entropy Loss: {loss}")
Softmax Layer
A softmax layer in a neural network converts a vector of numbers into a
probability distribution. It is commonly used as the output layer in multi-class
classification problems, ensuring that the outputs are between 0 and 1 and sum up
to 1, representing the probabilities of the input belonging to each class.

Purpose:
 Probability Distribution:
The softmax function transforms raw output scores (logits) from the previous layer into
probabilities.
 Multi-class Classification:
It's particularly useful when an input can belong to one of several classes, allowing the
network to predict the most likely class.
 Decision Making:
By providing probabilities, the softmax layer facilitates making decisions about which class
the input belongs to.

How it works:
1. Exponentiation:
Each logit (output from the previous layer) is exponentiated, ensuring all values are
positive.
2. Normalization:
The exponentiated values are then divided by the sum of all exponentiated values. This
normalization step ensures that the outputs sum up to 1, forming a probability distribution.
Formula:
The softmax function is often represented by the following formula:

σ(z)_i = e^(z_i) / Σ_{j=1}^{K} e^(z_j)

Where:
 σ(z)_i is the i-th element of the softmax output vector.
 z_i is the i-th logit (input to the softmax layer).
 K is the number of classes.
 e is the exponential function.

Example: A neural network trained to classify images into three categories: cat, dog, and
bird. The softmax layer would take the raw output scores from the previous layer
and convert them into probabilities for each class. For example, the output might
be [0.2, 0.7, 0.1], indicating a 20% probability of being a cat, 70% probability of
being a dog, and 10% probability of being a bird.
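A minimal NumPy sketch of the softmax computation (the logits are toy values for the cat/dog/bird example above):

import numpy as np

def softmax(z):
    z = z - np.max(z)           # subtract the max for numerical stability
    exp_z = np.exp(z)           # exponentiate each logit
    return exp_z / exp_z.sum()  # normalize so the outputs sum to 1

logits = np.array([1.0, 2.2, 0.3])  # raw scores for cat, dog, bird (toy values)
print(softmax(logits))              # roughly [0.21, 0.69, 0.10]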
Gradient Descent

Gradient descent is an optimization algorithm used in machine learning to minimize the
cost function by iteratively adjusting parameters in the direction of the negative gradient,
aiming to find the optimal set of parameters.

This technique is widely used in training machine learning models, particularly in
linear regression, logistic regression, and neural networks.

Gradient Descent is an optimization algorithm that helps machine learning models
learn by updating parameters (such as weights in neural networks) to minimize the
cost function.

The cost function (or loss function) measures how well the model's predictions
match the actual data. By iteratively adjusting the parameters in the direction that
reduces the cost function, the model improves its accuracy.
1. Initialize Parameters – Start with random values for the parameters
(weights and biases).
2. Compute the Gradient – Calculate the derivative (gradient) of the loss
function with respect to each parameter.
3. Update Parameters – Adjust the parameters by moving in the opposite
direction of the gradient:

θ = θ – α * ∇J(θ)

Where:
θ represents the model parameters
∇J(θ) is the gradient/slope of the cost function
α (learning rate) controls the step size

4. Repeat Until Convergence – Continue updating parameters until the
change is minimal or a stopping criterion is met.

The learning rate is a hyperparameter that determines the size of the step taken in the
weight update. A small learning rate results in a slow convergence, while a large learning
rate can lead to overshooting the minimum and oscillating around the minimum. It’s
important to choose an appropriate learning rate that balances the speed of convergence and
the stability of the optimization.
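A minimal NumPy sketch of batch gradient descent fitting a straight line y ≈ w*x + b (the toy data, learning rate, and iteration count are illustrative assumptions):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])  # generated from y = 2x + 1

w, b = 0.0, 0.0   # 1. initialize parameters
alpha = 0.05      # learning rate (step size)

for _ in range(2000):
    y_pred = w * x + b
    error = y_pred - y
    # 2. gradients of the mean squared error with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # 3. update parameters in the opposite direction of the gradient: θ = θ - α * ∇J(θ)
    w -= alpha * grad_w
    b -= alpha * grad_b

print(w, b)  # 4. after convergence, approximately 2.0 and 1.0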
1) Batch Gradient Descent:
In batch gradient descent, the gradient of the loss function is computed with respect to the
weights for the entire training dataset, and the weights are updated after each iteration. This
provides a more accurate estimate of the gradient, but it can be computationally expensive
for large datasets.
2) Stochastic Gradient Descent (SGD):
In SGD, the gradient of the loss function is computed with respect to a single training
example, and the weights are updated after each example. SGD has a lower computational
cost per iteration compared to batch gradient descent, but it can be less stable and may not
converge to the optimal solution.
3) Mini-Batch Gradient Descent:
Mini-batch gradient descent is a compromise between batch gradient descent and SGD. The
gradient of the loss function is computed with respect to a small randomly selected subset
of the training examples (called a mini-batch), and the weights are updated after each mini-
batch. Mini-batch gradient descent provides a balance between the stability of batch
gradient descent and the computational efficiency of SGD.
4) Momentum:
Momentum is a variant of gradient descent that incorporates information from the previous
weight updates to help the algorithm converge more quickly to the optimal solution.
Momentum adds a term to the weight update that is proportional to the running average of
the past gradients, allowing the algorithm to move more quickly in the direction of the
optimal solution.
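A minimal sketch of the momentum update rule itself (the coefficients and toy values are illustrative assumptions, not a prescribed implementation):

import numpy as np

def momentum_step(theta, grad, velocity, alpha=0.01, beta=0.9):
    # velocity is a running average of past gradients; beta controls how much
    # history is kept, and alpha is the learning rate.
    velocity = beta * velocity - alpha * grad
    return theta + velocity, velocity

theta = np.array([1.0, -2.0])     # current parameters (toy values)
velocity = np.zeros_like(theta)   # accumulated velocity starts at zero
grad = np.array([0.5, -0.3])      # gradient of the loss at theta (toy values)

theta, velocity = momentum_step(theta, grad, velocity)
print(theta, velocity)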
Derivatives and Stochastic Gradient Descent
Neural Network Implementation Issues

 Data Dependency:
Neural networks, especially deep learning models, require large amounts of high-quality
data for effective training. Gathering and cleaning sufficient data can be time-consuming,
expensive, and sometimes impractical.

 Computational Costs:
Training large neural networks can be computationally expensive, demanding significant
processing power, often requiring specialized hardware like GPUs or TPUs.
 Overfitting:
Neural networks can memorize the training data instead of generalizing, leading to poor
performance on new data.
 Interpretability:
Neural networks are often described as "black boxes" due to the difficulty in understanding
how they arrive at their predictions. This lack of transparency can be problematic in fields
where explainability is crucial.
 Optimization Challenges:
Training neural networks involves finding optimal model parameters, which can be difficult
due to issues like vanishing or exploding gradients, local minima, and the need for proper
hyperparameter tuning.
 Hyperparameter Tuning:
Choosing the right hyperparameters (e.g., learning rate, batch size, network architecture) is
crucial for optimal performance, but this process can be time-consuming and complex.
 Vanishing and Exploding Gradients:
In deep neural networks, gradients can become very small (vanishing) or very large
(exploding) during backpropagation, hindering the learning process.
 Limited Data:
Many real-world applications lack the vast amounts of labeled data needed for effective
neural network training.
 Bias and Fairness:
Neural networks can inherit biases from the training data, leading to unfair or
discriminatory outcomes.
 Continual Learning:
Training neural networks on continuously arriving data can be challenging, as models may
struggle with "catastrophic forgetting" or interference.
 Scalability:
Scaling neural networks to handle large problem instances and real-world applications can
be a significant challenge.

 Hardware and Deployment Constraints:


Neural networks can have specific hardware and deployment requirements, which can be a
limiting factor in certain environments.
 Data Privacy and Security:
Protecting sensitive data used in training and deployment is a critical concern.
 Loss Function Issues:
Incorrect implementation or usage of loss functions can significantly impact network
performance. For example, in PyTorch, NLLLoss expects log-probabilities (the output of a
log-softmax), while CrossEntropyLoss takes raw logits and applies the softmax internally.
 Gradient Descent:
If implemented manually, ensure that backpropagation is working correctly.
 Hidden Dimension and Network Size:
Incorrect hidden dimensions or network size can lead to errors and suboptimal
performance.

Data Independence

Data independence in the context of neural networks refers to the ability to modify
the underlying data representation or structure without requiring changes to the
network's architecture or training process. This concept is analogous to data
independence in databases, where changes to the physical storage or logical
structure of data can be made without affecting the applications that use it. In
essence, it allows for greater flexibility and maintainability of neural networks by
decoupling the data from the model's implementation

 Data format changes:


Modifications to how data is preprocessed, normalized, or represented (e.g., switching from
one-hot encoding to embeddings) should not necessitate changes to the network's
architecture or training procedure.

 Data augmentation:
The application of data augmentation techniques (e.g., random rotations, crops) should not
require modifications to the core network architecture or training loop.
 Data source changes:
Switching to a different dataset or source of data should not require significant changes to
the network's structure or training strategy.
 Feature engineering:
Modifications to feature selection, extraction, or engineering should be possible without
requiring changes to the network's architecture.

Benefits

Increased flexibility:

Neural networks can be adapted to new data sources or tasks without requiring extensive
re-engineering.
 Improved maintainability:
Changes to the data pipeline or preprocessing steps can be made without affecting the core
network, simplifying maintenance and updates.
 Enhanced reusability:
Networks trained on one dataset can be more easily adapted to other datasets or tasks,
promoting reusability of trained models.
 Reduced development time:
Data independence allows for faster experimentation with different data representations and
preprocessing techniques.

Examples:
 Image classification:
A convolutional neural network trained on a dataset of images can be adapted to a new
dataset with different image resolutions or color spaces without requiring changes to the
convolutional layers or the overall architecture.
 Natural language processing:
A recurrent neural network trained on text data can be adapted to a new language or corpus
without requiring changes to the core RNN architecture or the training procedure.
 Time series analysis:
A neural network trained on time series data can be adapted to handle different time scales
or sampling frequencies without requiring changes to the network's architecture or training
loop

Achieving data independence:


 Using modular design:
Design neural networks with clearly defined modules for data input, processing, and output,
allowing for independent modification of each module.
 Employing data-agnostic architectures:
Utilize architectures that are inherently flexible and can adapt to different data formats and
structures.
 Leveraging data augmentation and preprocessing techniques:
Apply data augmentation and preprocessing techniques that can be easily integrated into the
data pipeline without affecting the core network.
 Using transfer learning:
Leverage pre-trained models on large datasets and fine-tune them on specific tasks,
reducing the need to train from scratch on new datasets.
