
UNIT-I

Human Learning - Types – Machine Learning - Types - Problems not to be solved - Applications -
Languages/Tools – Issues. Preparing to Model: Introduction - Machine Learning Activities - Types of data -
Exploring structure of data - Data quality and remediation - Data Pre-processing

Machine Learning

Machine learning is a growing technology which enables computers to learn automatically from past data.
Machine learning uses various algorithms for building mathematical models and making predictions using
historical data or information. Currently, it is being used for various tasks such as image
recognition, speech recognition, email filtering, Facebook auto-tagging, recommender system, and many
more.

What is Machine Learning

Machine Learning (ML) is the field of computer science that enables computer systems to make sense of data in much the same way human beings do.

In simple words, ML is a type of artificial intelligence that extracts patterns out of raw data by using an algorithm or method. The main focus of ML is to allow computer systems to learn from experience without being explicitly programmed or requiring human intervention.

In the real world, we are surrounded by humans who can learn everything from their experiences with their
learning capability, and we have computers or machines which work on our instructions. But can a machine
also learn from experiences or past data like a human does? So here comes the role of Machine Learning.

Machine Learning is a subset of artificial intelligence that is mainly concerned with the development of algorithms which allow a computer to learn from data and past experiences on its own. The term machine learning was first introduced by Arthur Samuel in 1959. We can define it in a summarized way as:

Machine learning enables a machine to automatically learn from data, improve performance from experiences,
and predict things without being explicitly programmed.

With the help of sample historical data, which is known as training data, machine learning algorithms build
a mathematical model that helps in making predictions or decisions without being explicitly programmed.
Machine learning brings computer science and statistics together for creating predictive models. Machine
learning constructs or uses the algorithms that learn from historical data. A machine has the ability to learn
if it can improve its performance by gaining more data.

How does Machine Learning work

A Machine Learning system learns from historical data, builds prediction models, and whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends upon the amount of data, as a huge amount of data helps to build a better model which predicts the output more accurately.

Suppose we have a complex problem where we need to make predictions. Instead of writing code for it, we just feed the data to generic algorithms, and with the help of these algorithms, the machine builds the logic from the data and predicts the output. Machine learning has changed our way of thinking about such problems.
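To make this idea concrete, here is a minimal sketch, assuming scikit-learn is installed: rather than hand-coding a rule, we feed labelled examples to a generic algorithm and let it build the logic. The data and the "small vs. large" rule are made up for illustration.

```python
# Instead of hand-coding the rule "a number is large if it exceeds 50",
# we let a generic algorithm learn it from labelled examples.
from sklearn.tree import DecisionTreeClassifier

X = [[10], [25], [40], [60], [75], [90]]   # input feature: a single number
y = [0, 0, 0, 1, 1, 1]                     # output label: 0 = small, 1 = large

model = DecisionTreeClassifier()
model.fit(X, y)                            # the machine builds the logic from data

print(model.predict([[55], [30]]))         # -> [1 0]
```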

Features of Machine Learning:

o Machine learning uses data to detect various patterns in a given dataset.


o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining, as both deal with huge amounts of data.

Need for Machine Learning

The need for machine learning is increasing day by day. The reason is that machine learning is capable of doing tasks that are too complex for a person to implement directly. As humans, we have limitations: we cannot manually process huge amounts of data, so we need computer systems, and here machine learning makes things easy for us.

We can train machine learning algorithms by providing them huge amounts of data and letting them explore the data, construct models, and predict the required output automatically. The performance of a machine learning algorithm depends on the amount of data, and it can be determined by the cost function. With the help of machine learning, we can save both time and money.

The importance of machine learning can be easily understood by its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions by Facebook, etc. Various top companies such as Netflix and Amazon have built machine learning models that use a vast amount of data to analyze user interest and recommend products accordingly.

The importance of Machine Learning:

o Rapid increase in the production of data
o Solving complex problems which are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from data

Classification of Machine Learning

At a broad level, machine learning can be classified into four types:

o Supervised learning
o Unsupervised learning
o Semi-supervised learning
o Reinforcement learning


1) Supervised Learning

Supervised learning is commonly used in real world applications, such as face and speech recognition,
products or movie recommendations, and sales forecasting. Supervised learning can be further classified into
two types - Regression and Classification.

Regression trains on and predicts a continuous-valued response, for example predicting real estate prices.

Classification attempts to find the appropriate class label, such as analyzing positive/negative sentiment, classifying persons as male or female, tumors as benign or malignant, loans as secured or unsecured, etc.

In supervised learning, learning data comes with description, labels, targets or desired outputs and the
objective is to find a general rule that maps inputs to outputs. This kind of learning data is called labeled data.
The learned rule is then used to label new data with unknown outputs.

Supervised learning involves building a machine learning model that is based on labeled samples. For
example, if we build a system to estimate the price of a plot of land or a house based on various features, such
as size, location, and so on, we first need to create a database and label it. We need to teach the algorithm what
features correspond to what prices. Based on this data, the algorithm will learn how to calculate the price of
real estate using the values of the input features.

Supervised learning deals with learning a function from available training data. Here, a learning algorithm
analyzes the training data and produces a derived function that can be used for mapping new examples. There
are many supervised learning algorithms such as Logistic Regression, Neural networks, Support Vector
Machines (SVMs), and Naive Bayes classifiers.

Common examples of supervised learning include classifying e-mails into spam and not-spam categories,
labeling webpages based on their content, and voice recognition.
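Below is a hedged sketch of the house-price example described above, assuming scikit-learn is available. The features and prices are made up for illustration; the point is that the algorithm learns the mapping from labeled samples.

```python
# Supervised learning: estimating house prices from labelled samples.
from sklearn.linear_model import LinearRegression

# Each sample: [size in square feet, number of bedrooms]
X_train = [[1000, 2], [1500, 3], [2000, 3], [2500, 4]]
y_train = [200_000, 280_000, 350_000, 440_000]   # known (labelled) prices

model = LinearRegression()
model.fit(X_train, y_train)        # learn the mapping from features to price

print(model.predict([[1800, 3]]))  # predicted price for an unseen house
```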

2) Unsupervised Learning

Unsupervised learning is used to detect anomalies and outliers, such as fraud or defective equipment, or to group customers with similar behaviours for a sales campaign. It is the opposite of supervised learning: there is no labeled data here.

When learning data contains only some indications without any description or labels, it is up to the coder or
to the algorithm to find the structure of the underlying data, to discover hidden patterns, or to determine how
to describe the data. This kind of learning data is called unlabeled data.

Suppose that we have a number of data points, and we want to classify them into several groups. We may not
exactly know what the criteria of classification would be. So, an unsupervised learning algorithm tries to
classify the given dataset into a certain number of groups in an optimum way.

Unsupervised learning algorithms are extremely powerful tools for analyzing data and for identifying patterns and trends. They are most commonly used for clustering similar input into logical groups. Unsupervised learning algorithms include K-means, hierarchical clustering, DBSCAN, and so on.
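A minimal K-means sketch, assuming scikit-learn: the points below are made up and carry no labels; the algorithm discovers the groups on its own.

```python
# Grouping unlabelled points into clusters, with no target values.
from sklearn.cluster import KMeans

points = [[1, 1], [1.5, 2], [8, 8], [8.5, 9], [1, 0.5], [9, 8]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)   # the algorithm discovers the groups

print(labels)                  # e.g. [0 0 1 1 0 1] -- cluster assignments
print(kmeans.cluster_centers_)
```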

3) Semi-supervised Learning

If some learning samples are labeled but others are not, it is semi-supervised learning. It makes use of a small amount of labeled data together with a large amount of unlabeled data for training. Semi-supervised learning is applied in cases where it is expensive to acquire a fully labeled dataset, while labeling a small subset is more practical. For example, it often requires skilled experts to label certain remote sensing images, and lots of field experiments to locate oil at a particular location, while acquiring unlabeled data is relatively easy.

4) Reinforcement Learning

Reinforcement learning is a feedback-based learning method, in which a learning agent gets a reward for each right action and a penalty for each wrong action. The agent learns automatically from this feedback and improves its performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of the agent is to get the most reward points, and in doing so, it improves its performance.

A robotic dog that automatically learns the movement of its limbs is an example of reinforcement learning.
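The following toy sketch (an illustration added here, not from the text) shows the reward/penalty loop: an agent on a small 1-D track learns to walk right towards a reward using the standard Q-learning update rule.

```python
# A toy reinforcement-learning environment: 5 cells, the rightmost is the goal.
import random

n_states, actions = 5, [-1, +1]          # move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma = 0.5, 0.9                  # learning rate, discount factor

for episode in range(200):
    s = 0
    while s < n_states - 1:              # rightmost cell ends the episode
        a = random.choice(actions)       # explore randomly
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else -0.1   # reward / penalty
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions)
                              - Q[(s, a)])
        s = s_next

# After training, the greedy action in every cell should be "move right" (+1).
print([max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)])
```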

Purpose of Machine Learning

Machine learning can be seen as a branch of AI or Artificial Intelligence, since the ability to change experience into expertise or to detect patterns in complex data is a mark of human or animal intelligence.

As a field of science, machine learning shares common concepts with other disciplines such as statistics,
information theory, game theory, and optimization.

As a subfield of information technology, its objective is to program machines so that they will learn.

However, it should be noted that the purpose of machine learning is not to build an automated duplication of intelligent behavior, but to use the power of computers to complement and supplement human intelligence.
For example, machine learning programs can scan and process huge databases detecting patterns that are
beyond the scope of human perception.

Machine Learning at present:

Machine learning has now made great advances in research, and it is present everywhere around us, such as in self-driving cars, Amazon Alexa, chatbots, recommender systems, and many more. It includes supervised, unsupervised, and reinforcement learning, with clustering, classification, decision tree, SVM algorithms, etc.

Modern machine learning models can be used for making various predictions, including weather prediction,
disease prediction, stock market analysis, etc.

Prerequisites

Before learning machine learning, you must have basic knowledge of the following so that you can easily understand its concepts:

o Fundamental knowledge of probability and linear algebra.
o The ability to code in any computer language, especially in Python.
o Knowledge of calculus, especially derivatives of single-variable and multivariate functions.

Challenges in Machine Learning

While Machine Learning is rapidly evolving and making significant strides in cybersecurity and autonomous cars, this segment of AI as a whole still has a long way to go. The reason is that ML has not yet been able to overcome a number of challenges. The challenges ML currently faces are:

Quality of data − Having good-quality data for ML algorithms is one of the biggest challenges. Use of low-quality data leads to problems related to data preprocessing and feature extraction.

Time-Consuming task − Another challenge faced by ML models is the consumption of time especially for
data acquisition, feature extraction and retrieval.

Lack of specialist persons − As ML technology is still in its infancy, finding expert resources is difficult.

No clear objective for formulating business problems − Having no clear objective and well-defined goal
for business problems is another key challenge for ML because this technology is not that mature yet.

Issue of overfitting & underfitting − If the model is overfitting or underfitting, it cannot represent the problem well.

Curse of dimensionality − Another challenge ML model faces is too many features of data points. This can
be a real hindrance.

Difficulty in deployment − The complexity of ML models makes them quite difficult to deploy in real life.

Applications of Machine Learning

Machine learning is a buzzword in today's technology, and it is growing very rapidly day by day. It is used to solve many real-world complex problems which cannot be solved with a traditional approach. Following are some real-world applications of ML −

Figure: Applications of Machine Learning

1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, etc. in digital images. A popular use case of image recognition and face detection is automatic friend tagging suggestions:

Facebook provides us with an auto friend tagging suggestion feature. Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with names, and the technology behind this is machine learning's face detection and recognition algorithm.

It is based on the Facebook project named "Deep Face," which is responsible for face recognition and person
identification in the picture.

2. Speech Recognition
While using Google, we get an option to "Search by voice"; this comes under speech recognition, and it's a popular application of machine learning.

Speech recognition is a process of converting voice instructions into text, and it is also known as "Speech to
text", or "Computer speech recognition." At present, machine learning algorithms are widely used by various
applications of speech recognition. Google assistant, Siri, Cortana, and Alexa are using speech recognition
technology to follow the voice instructions.

3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions.

It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, in two ways:

o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day

Everyone who uses Google Maps is helping to make the app better. It takes information from the user and sends it back to its database to improve performance.

4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies such as Amazon,
Netflix, etc., for product recommendation to the user. Whenever we search for some product on Amazon, then
we started getting an advertisement for the same product while internet surfing on the same browser and this
is because of machine learning.

Google understands the user interest using various machine learning algorithms and suggests the product as
per customer interest.

As similar, when we use Netflix, we find some recommendations for entertainment series, movies, etc., and
this is also done with the help of machine learning.

5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars, in which machine learning plays a significant role. Tesla, the most popular car manufacturing company, is working on self-driving cars, using machine learning methods to train car models to detect people and objects while driving.
6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is filtered automatically as important, normal, and spam. We always
receive an important mail in our inbox with the important symbol and spam emails in our spam box, and the
technology behind this is Machine learning. Below are some spam filters used by Gmail:

Content Filter
Header filter
General blacklists filter
Rules-based filters
Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes classifier
are used for email spam filtering and malware detection.
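As a hedged sketch of the Naive Bayes approach named above, assuming scikit-learn, the snippet below trains a spam classifier on a tiny made-up email corpus using bag-of-words features.

```python
# Spam filtering with a Naive Bayes classifier on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "meeting at noon tomorrow",
          "free money click now", "project report attached"]
labels = [1, 0, 1, 0]                      # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(emails)              # bag-of-words features

clf = MultinomialNB()
clf.fit(X, labels)

print(clf.predict(vec.transform(["free prize money"])))   # -> [1] (spam)
```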

7. Virtual Personal Assistant:


We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us find information using our voice instructions. These assistants can help us in various ways just by our voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc.

These virtual assistants use machine learning algorithms as an important part.

These assistants record our voice instructions, send them over the server on the cloud, decode them using ML algorithms, and act accordingly.

8. Online Fraud Detection:


Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, there are various ways a fraudulent transaction can take place, such as fake accounts, fake IDs, and money being stolen in the middle of a transaction. To detect this, a feed-forward neural network helps us by checking whether a transaction is genuine or fraudulent.

For each genuine transaction, the output is converted into some hash values, and these values become the input for the next round. Each genuine transaction follows a specific pattern, which changes for a fraudulent transaction; the network detects this and makes our online transactions more secure.

9. Stock Market trading:


Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in shares, so long short-term memory (LSTM) neural networks are used for the prediction of stock market trends.

10. Medical Diagnosis:


In medical science, machine learning is used for disease diagnosis. With this, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain. It helps in finding brain tumors and other brain-related diseases easily.

11. Automatic Language Translation:


Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all, as machine learning helps us here too by converting text into languages we know. Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine translation system that translates text into our familiar language, and this is called automatic translation.

The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is combined with image recognition to translate text from one language to another.

Machine Learning Tools

Machine learning is one of the most revolutionary technologies making lives simpler. It is a subfield of Artificial Intelligence that analyses data, builds models, and makes predictions. Due to its popularity and great applications, every tech enthusiast wants to learn it and build new machine learning apps. However, to build ML models, it is important to master machine learning tools. Mastering machine learning tools will enable you to play with the data, train your models, discover new methods, and create algorithms.

There are different tools, software, and platforms available for machine learning, and new software and tools are evolving day by day. Although there are many options available, choosing the best tool for your model is a challenging task. If you choose the right tool for your model, you can make it faster and more efficient. In this topic, we will discuss some popular and commonly used machine learning tools and their features.

Figure: Machine Learning Tools

1. TensorFlow

TensorFlow is one of the most popular open-source libraries used to train and build both machine learning and deep learning models. It was developed by the Google Brain Team and also provides a JS library (TensorFlow.js). It is very popular among machine learning enthusiasts, who use it for building different ML applications. It offers a powerful library, tools, and resources for numerical computation, specifically for large-scale machine learning and deep learning projects. It enables data scientists/ML developers to build and deploy machine learning applications efficiently. For training and building ML models, TensorFlow provides the high-level Keras API, which lets users easily start with TensorFlow and machine learning.

Features:
Below are some top features:

o TensorFlow enables us to build and train our ML models easily.
o It also enables you to run existing models using TensorFlow.js.
o It provides multiple abstraction levels that allow the user to select the correct resource as per the requirement.
o It helps in building neural networks.
o It provides support for distributed computing.
o While building a model, for more flexibility, it provides eager execution that enables immediate iteration and intuitive debugging.
o It is open-source software and highly flexible.
o It also enables developers to perform numerical computations using data flow graphs.
o It runs on GPUs and CPUs, and also on various mobile computing platforms.
o It provides the functionality of auto diff (automatically computing gradients is called automatic differentiation, or auto diff).
o It enables easy deployment and training of models in the cloud.
o It can be used in two ways, i.e., by installing through NPM or with script tags.
o It is free to use.
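Below is a minimal sketch of the high-level Keras API mentioned above, assuming TensorFlow 2.x is installed. The random data stands in for a real dataset.

```python
# Build, train, and use a tiny Keras model on stand-in data.
import numpy as np
import tensorflow as tf

x = np.random.rand(100, 4)                 # 100 samples, 4 features
y = np.random.randint(0, 2, size=100)      # binary labels

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=5, verbose=0)       # train the model

print(model.predict(x[:3]))                # probabilities for three samples
```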
2. PyTorch

PyTorch is an open-source machine learning framework based on the Torch library. It is free and open-source and was developed by FAIR (Facebook's AI Research lab). It is one of the popular ML frameworks and can be used for various applications, including computer vision and natural language processing. PyTorch has Python and C++ interfaces; however, the Python interface is more interactive. Different deep learning software is built on top of PyTorch, such as PyTorch Lightning, Hugging Face's Transformers, Tesla Autopilot, etc.

It provides a Tensor class representing an n-dimensional array that can perform tensor computations with GPU support.

Features:
Below are some top features:

o It enables developers to create neural networks using the autograd module.
o It is well suited for deep learning research, with good speed and flexibility.
o It can also be used on cloud platforms.
o It includes tutorial courses, various tools, and libraries.
o It also provides a dynamic computational graph, which makes this library more popular.
o It allows changing the network behaviour on the fly without any lag.
o It is easy to use due to its hybrid front-end.
o It is freely available.
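A small PyTorch sketch illustrating the Tensor class and the autograd module mentioned above, assuming PyTorch is installed; the toy loss is made up.

```python
import torch

# An n-dimensional tensor that tracks gradients
w = torch.tensor([1.0, 2.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])

loss = ((w * x).sum() - 10.0) ** 2   # a toy squared-error loss
loss.backward()                      # autograd computes d(loss)/d(w)

print(loss.item())   # 1.0, since 1*3 + 2*4 = 11 and (11 - 10)^2 = 1
print(w.grad)        # tensor([6., 8.]) -- 2*(11-10)*x
```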
3. Google Cloud ML Engine

While training a classifier with a huge amount of data, a computer system might not perform well. Various machine learning and deep learning projects require millions or billions of training samples, or the algorithm being used takes a long time to execute. In such cases, one should go for the Google Cloud ML Engine. It is a hosted platform where ML developers and data scientists build and run high-quality machine learning models. It provides a managed service that allows developers to easily create ML models with any type of data and of any size.

Features:
Below are the top features:

o Provides machine learning model training, building, deep learning, and predictive modelling.
o The two services, namely prediction and training, can be used independently or combined.
o It can be used by enterprises, e.g., for identifying clouds in a satellite image or responding faster to customer emails.
o It can be widely used to train complex models.

4. Amazon Machine Learning (AML)

Amazon provides a great number of machine learning tools, and one of them is Amazon Machine Learning or
AML. Amazon Machine Learning (AML) is a cloud-based and robust machine learning software application,
which is widely used for building machine learning models and making predictions. Moreover, it integrates
data from multiple sources, including Redshift, Amazon S3, or RDS.

Features
Below are some top features:

o AML offers visualization tools and wizards.
o It enables users to identify patterns, build mathematical models, and make predictions.
o It provides support for three types of models: multi-class classification, binary classification, and regression.
o It permits users to import models into, or export models out of, Amazon Machine Learning.
o It also covers the core concepts of machine learning, including ML models, data sources, evaluations, real-time predictions, and batch predictions.
o It enables the user to retrieve predictions with the help of batch APIs for bulk requests or real-time APIs for individual requests.

5. Accord.NET

Accord.NET is a .NET-based machine learning framework used for scientific computing. It is combined with audio and image processing libraries written in C#. The framework provides different libraries for various ML applications, such as pattern recognition, linear algebra, and statistical data processing. Popular packages of the Accord.NET framework are Accord.Statistics, Accord.Math, and Accord.MachineLearning.

Features
Below are some top features:

o It contains 38+ kernel functions.
o It consists of more than 40 non-parametric and parametric estimators of statistical distributions.
o It is used for creating production-grade computer audition, computer vision, signal processing, and statistics apps.
o It contains more than 35 hypothesis tests, including one-way and two-way ANOVA tests and non-parametric tests such as the Kolmogorov-Smirnov test, and many more.

6. Apache Mahout
Apache Mahout is an open-source project of the Apache Software Foundation, used for developing machine learning applications mainly focused on linear algebra. It is a distributed linear algebra framework with a mathematically expressive Scala DSL, which enables developers to promptly implement their own algorithms. It also provides Java/Scala libraries to perform mathematical operations mainly based on linear algebra and statistics.

Features:
Below are some top features:

o It enables developers to implement machine learning techniques, including recommendation, clustering, and classification.
o It is an efficient framework for implementing scalable algorithms.
o It consists of matrix and vector libraries.
o It provides support for multiple distributed backends (including Apache Spark).
o It runs on top of Apache Hadoop using the MapReduce paradigm.

7. Shogun

Shogun is a free and open-source machine learning software library created by Gunnar Raetsch and Soeren Sonnenburg in 1999. The library is written in C++ and supports interfaces for different languages such as Python, R, Scala, C#, Ruby, etc., using SWIG (Simplified Wrapper and Interface Generator). Shogun focuses on kernel-based algorithms such as Support Vector Machines (SVM) and K-means clustering for regression and classification problems. It also provides a complete implementation of Hidden Markov Models.

Features:
Below are some top features:

o It focuses on kernel-based algorithms such as Support Vector Machines (SVM) and K-means clustering for regression and classification problems.
o It provides support for the use of pre-calculated kernels.
o It also offers combined kernels through its multiple kernel learning functionality.
o It was initially designed for processing huge datasets of up to 10 million samples.
o It enables users to work with interfaces in different programming languages such as Lua, Python, Java, C#, Octave, Ruby, MATLAB, and R.

8. Oryx2

Oryx2 is a realization of the lambda architecture built on Apache Kafka and Apache Spark. It is widely used for real-time large-scale machine learning projects. It is a framework for building apps, including packaged, end-to-end applications for collaborative filtering, regression, classification, and clustering. It is written in Java and builds on Apache Spark, Hadoop, Tomcat, Kafka, etc. The latest version of Oryx2 is 2.8.0.

Features:
Below are some top features:

o It has three tiers: a generic lambda architecture tier, a specialization on top providing ML abstractions, and an end-to-end implementation of standard ML algorithms.
o The original project was Oryx1, and after some upgrades, Oryx2 was launched.
o It is well suited for large-scale real-time machine learning projects.
o It contains three side-by-side layers, named the speed layer, batch layer, and serving layer.
o It also has a data transport layer that transfers data between the different layers and receives input from external sources.

9. Apache Spark MLlib

Apache Spark MLlib is a scalable machine learning library that runs on Apache Mesos, Hadoop, Kubernetes,
standalone, or in the cloud. Moreover, it can access data from different data sources. It is an open-source
cluster-computing framework that offers an interface for complete clusters along with data parallelism and
fault tolerance.
For optimized numerical processing of data, MLlib provides linear algebra packages such as Breeze and netlib-
Java. It uses a query optimizer and physical execution engine for achieving high performance with both batch
and streaming data.
Features
Below are some top features:

o MLlib contains various algorithms, including classification, regression, clustering, recommendations, association rules, etc.
o It runs on different platforms such as Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, against diverse data sources.
o It contains high-quality algorithms that provide great results and performance.
o It is easy to use, as it provides interfaces in Java, Python, Scala, R, and SQL.

10. Google ML kit for Mobile

For mobile app developers, Google brings the ML Kit, which packages machine learning expertise and technology to create more robust, optimized, and personalized apps. This toolkit can be used for face detection, text recognition, landmark detection, image labelling, and barcode scanning applications. It can also be used offline.

Features:
Below are some top features:

o The ML Kit is optimized for mobile.
o It includes the advantages of different machine learning technologies.
o It provides easy-to-use APIs that enable powerful use cases in your mobile apps.
o It includes a Vision API and Natural Language APIs to detect faces, text, and objects, identify different languages, and provide reply suggestions.

Preparing to Model: Introduction

Step 1: Collect Data

Given the problem you want to solve, you will have to investigate and obtain data that you will use to feed
your machine. The quality and quantity of information you get are very important since it will directly impact
how well or badly your model will work. You may have the information in an existing database or you must
create it from scratch. If it is a small project, you can create a spreadsheet that will later be easily exported as
a CSV file. It is also common to use the web scraping technique to automatically collect information from
various sources such as APIs.

Step 2: Prepare the data


This is a good time to visualize your data and check if there are correlations between the different characteristics you obtained. It will be necessary to make a selection of characteristics, since the ones you choose will directly impact the execution times and the results. You can also reduce dimensions by applying PCA if necessary.

Additionally, you must balance the amount of data you have for each result class so that it is significant, as the learning may otherwise be biased towards one type of response, and your model will fail when it tries to generalize its knowledge.

You must also separate the data into two groups: one for training and the other for model evaluation, which can be divided in a ratio of approximately 80/20, though this can vary depending on the case and the volume of data you have.

At this stage, you can also pre-process your data by normalizing, eliminating duplicates, and making error
corrections.
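A hedged sketch of these preparation steps with scikit-learn: an 80/20 train/evaluation split, normalization, and an optional PCA dimensionality reduction. The random data stands in for your real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.rand(200, 10)               # stand-in for your real features
y = np.random.randint(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)   # 80/20, class-balanced

scaler = StandardScaler().fit(X_train)    # fit only on training data
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

X_train_2d = PCA(n_components=2).fit_transform(X_train)  # reduce dimensions
print(X_train.shape, X_test.shape, X_train_2d.shape)
```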

Step 3: Choose the model

There are several models that you can choose from according to your objective: algorithms for classification, prediction, and linear regression; clustering, e.g., k-means or k-nearest neighbors; deep learning, i.e., neural networks; Bayesian methods; etc.

There are various models to be used depending on the data you are going to process such as images, sound,
text, and numerical values. In the following table, we will see some models and their applications that you can
apply in your projects:

Linear Regression: Price prediction
Logistic Regression: Classification
Fully connected networks: Classification
Convolutional Neural Networks: Image processing
Recurrent Neural Networks: Voice recognition
Random Forest: Fraud detection
Reinforcement Learning: Learning by trial and error
Generative Models: Image creation
K-means: Segmentation
K-Nearest Neighbors: Recommendation systems
Bayesian Classifiers: Spam and noise filtering

Step 4: Train your model

You will need to train your model on the datasets so that it runs smoothly, and you should see an incremental improvement in the prediction rate. Remember to initialize the weights of your model randomly (the weights are the values that multiply or affect the relationships between the inputs and outputs); they will be automatically adjusted by the selected algorithm the more you train.

Step 5: Evaluation

You will have to check the model against your evaluation dataset, which contains inputs the model does not know, and verify the accuracy of your already trained model. If the accuracy is less than or equal to 50%, the model will not be useful, since it would be like tossing a coin to make decisions. If you reach 90% or more, you can have good confidence in the results the model gives you.
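A minimal sketch of this evaluation step, assuming scikit-learn; the toy dataset and the learnable rule are made up so the snippet runs on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X = np.random.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)           # a learnable toy rule

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)

acc = accuracy_score(y_te, model.predict(X_te))
print(f"accuracy: {acc:.2%}")   # ~50% would be no better than a coin toss
```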

Step 6: Parameter Tuning

If during the evaluation you did not obtain good predictions and your accuracy is not the minimum desired, you may have overfitting or underfitting problems, and you must return to the training step and try a new configuration of parameters for your model. You can increase the number of times you iterate over your training data, termed epochs. Another important parameter is the "learning rate", which is usually a value that multiplies the gradient to gradually bring it closer to the global (or local) minimum in order to minimize the cost function.

Increasing the learning rate in steps of 0.1 is not the same as in steps of 0.001, as this can significantly affect the model execution time. You can also indicate the maximum error allowed for your model. Training your machine can take from a few minutes to hours, or even days. These parameters are often called hyperparameters. This "tuning" is still more of an art than a science and will improve as you experiment. There are usually many parameters to adjust, and when combined, the number of options can explode. Each algorithm has its own parameters to adjust. To name a few more, in Artificial Neural Networks (ANNs) you must define in the architecture the number of hidden layers, gradually testing with more or fewer, and how many neurons each layer has. This will be a work of great effort and patience to get good results.
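One common way to automate this tuning is a grid search over hyperparameter combinations; below is a hedged sketch assuming scikit-learn, where the learning rate and hidden-layer sizes discussed above are searched (the grid values are illustrative).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X = np.random.rand(150, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

param_grid = {
    "learning_rate_init": [0.001, 0.01, 0.1],      # step size for the gradient
    "hidden_layer_sizes": [(5,), (10,), (10, 5)],  # architecture choices
}
search = GridSearchCV(MLPClassifier(max_iter=500), param_grid, cv=3)
search.fit(X, y)        # tries every combination with cross-validation

print(search.best_params_, f"best CV score: {search.best_score_:.2f}")
```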

Step 7: Prediction or Inference

You are now ready to use your machine learning model to infer results in real-life scenarios.

Machine learning Life cycle

Machine learning has given computer systems the ability to learn automatically without being explicitly programmed. But how does a machine learning system work? It can be described using the machine learning life cycle. The machine learning life cycle is a cyclic process for building an efficient machine learning project. The main purpose of the life cycle is to find a solution to the problem or project.

Machine learning life cycle involves seven major steps, which are given below:

Gathering Data

Data preparation

Data Wrangling

Analyse Data

Train the model

Test the model

Deployment

Figure: Machine Learning Life Cycle

The most important thing in the complete process is to understand the problem and to know its purpose. Therefore, before starting the life cycle, we need to understand the problem, because a good result depends on a good understanding of the problem.

In the complete life cycle process, to solve a problem, we create a machine learning system called a "model", and this model is created by providing "training". But to train a model we need data; hence, the life cycle starts with collecting data.

1. Gathering Data:

Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify and obtain all the data relevant to the problem.

In this step, we need to identify the different data sources, as data can be collected from various sources such as files, databases, the internet, or mobile devices. It is one of the most important steps of the life cycle. The quantity and quality of the collected data will determine the efficiency of the output: the more data there is, the more accurate the prediction will be.

This step includes the below tasks:

o Identify various data sources
o Collect data
o Integrate the data obtained from different sources

By performing the above tasks, we get a coherent set of data, also called a dataset. It will be used in further steps.

2. Data preparation
After collecting the data, we need to prepare it for further steps. Data preparation is a step where we put our
data into a suitable place and prepare it to use in our machine learning training.

In this step, first, we put all data together, and then randomize the ordering of data.

This step can be further divided into two processes:

Data exploration:

It is used to understand the nature of data that we have to work with. We need to understand the characteristics,
format, and quality of data.

A better understanding of data leads to an effective outcome. In this, we find Correlations, general trends, and
outliers.

Data pre-processing:

Now the next step is preprocessing of data for its analysis.

3. Data Wrangling

Data wrangling is the process of cleaning and converting raw data into a usable format. It is the process of cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis in the next step. It is one of the most important steps of the complete process. Cleaning of data is required to address quality issues.

The data we have collected is not always useful as-is, as some of it may not be relevant. In real-world applications, collected data may have various issues, including:

o Missing values
o Duplicate data
o Invalid data
o Noise

So, we use various filtering techniques to clean the data.

It is mandatory to detect and remove the above issues because they can negatively affect the quality of the outcome.
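A minimal data-wrangling sketch with pandas, addressing some of the issues listed above (missing values, duplicates, and an invalid value) on a made-up table.

```python
import pandas as pd

df = pd.DataFrame({
    "age":  [25, None, 35, 35, 120],   # a missing value and an invalid age
    "city": ["NY", "LA", "SF", "SF", "NY"],
})

df = df.drop_duplicates()                          # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())   # fill missing values
df = df[df["age"] < 100]                           # filter out invalid ages

print(df)
```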

4. Data Analysis

Now the cleaned and prepared data is passed on to the analysis step. This step involves:

o Selection of analytical techniques
o Building models
o Reviewing the results

The aim of this step is to build a machine learning model to analyze the data using various analytical techniques and to review the outcome. It starts with determining the type of problem, where we select machine learning techniques such as classification, regression, cluster analysis, association, etc.; then we build the model using the prepared data and evaluate it.

Hence, in this step, we take the data and use machine learning algorithms to build the model.
5. Train Model

The next step is to train the model. In this step, we train our model to improve its performance and obtain a better outcome for the problem.

We use datasets to train the model using various machine learning algorithms. Training a model is required
so that it can understand the various patterns, rules, and, features.

6. Test Model

Once our machine learning model has been trained on a given dataset, then we test the model. In this step, we
check for the accuracy of our model by providing a test dataset to it.

Testing the model determines the percentage accuracy of the model as per the requirement of project or
problem.

7. Deployment

The last step of machine learning life cycle is deployment, where we deploy the model in the real-world
system.

If the above-prepared model is producing an accurate result as per our requirement with acceptable speed,
then we deploy the model in the real system. But before deploying the project, we will check whether it is
improving its performance using available data or not. The deployment phase is similar to making the final
report for a project.

Types of data

DATA: Any unprocessed fact, value, text, sound, or picture that has not been interpreted and analyzed. Data is the most important part of Data Analytics, Machine Learning, and Artificial Intelligence. Without data, we can't train any model, and all modern research and automation would be in vain. Big enterprises spend lots of money just to gather as much data as possible.

INFORMATION: Data that has been interpreted and manipulated and has now some meaningful inference
for the users.

KNOWLEDGE: A combination of inferred information, experiences, learning, and insights. It results in awareness or concept building for an individual or organization.

How do we split data in Machine Learning?

Training Data: The part of the data we use to train our model. This is the data that your model actually sees (both input and output) and learns from.

Validation Data: The part of the data used for frequent evaluation of the model as it fits on the training dataset, and for tuning the model's hyperparameters (parameters set before the model begins learning). This data plays its part while the model is actually training.

Testing Data: Once our model is completely trained, the testing data provides an unbiased evaluation. When we feed in the inputs of the testing data, our model predicts values (without seeing the actual outputs). After prediction, we evaluate the model by comparing its predictions with the actual outputs present in the testing data. This is how we measure how much our model has learned from the experiences fed in as training data.
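A hedged sketch of this three-way split, using two successive scikit-learn splits; 60/20/20 is one common choice, not a fixed rule.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# First split off 40%, then cut that 40% in half for validation and testing.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=1)

print(len(X_train), len(X_val), len(X_test))   # -> 60 20 20
```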

Consider an example:

A shopping mart owner conducted a survey and has a long list of questions and answers collected from customers; this list of questions and answers is DATA. Whenever he wants to infer anything, he can't just go through every question from thousands of customers to find something relevant, as that would be time-consuming and unhelpful. To reduce this overhead and time wastage and to make the work easier, the data is manipulated through software, calculations, graphs, etc.; the inference drawn from this manipulated data is INFORMATION. So, data is a must for information. KNOWLEDGE then plays its role in differentiating between two individuals having the same information. Knowledge is not technical content but is linked to the human thought process.

Different Forms of Data

Numerical data: Such as house price, temperature, etc.

Categorical data: Such as Yes/No, True/False, Blue/Green, etc.

Ordinal data: Similar to categorical data, but the categories can be compared and ordered.

Numeric Data: If a feature represents a characteristic measured in numbers, it is called a numeric feature. Numerical data is any data where the data points are exact numbers. Statisticians also call numerical data quantitative data. This data has meaning as a measurement, such as house prices, or as a count, such as the number of residential properties in Los Angeles or how many houses were sold in the past year.

Numerical data can be characterized as continuous or discrete. Continuous data can assume any value within a range, whereas discrete data has distinct values.

Figure: Numerical Data


For example, the number of students taking a Python class would be a discrete data set. You can only have discrete whole-number values like 10, 25, or 33. A class cannot have 12.75 students enrolled; a student either joins a class or doesn't. On the other hand, continuous data are numbers that can fall anywhere within a range. For example, a student could have an average score of 88.25, which falls between 0 and 100.

The takeaway here is that numerical data is not ordered in time. They are just numbers that we have collected.

Categorical Data: A categorical feature is an attribute that can take on one of a limited, and usually fixed, number of possible values on the basis of some qualitative property. A categorical feature is also called a nominal feature.

Categorical data represents characteristics, such as a hockey player's position, team, or hometown. Categorical data can take numerical values: for example, we might use 1 for the colour red and 2 for blue. But these numbers don't have a mathematical meaning; we can't add them together or take their average.

In the context of supervised classification, categorical data would be the class label. This could also be something like whether a person is a man or woman, or whether a property is residential or commercial.

Figure: Categorical Data

Ordinal Data: This denotes a nominal variable with categories falling in an ordered list. Examples include clothing sizes such as small, medium, and large, or a measurement of customer satisfaction on a scale from "not at all happy" to "very happy".

Ordinal data is, in some sense, a mix of numerical and categorical data: the data still falls into categories, but those categories are ordered or ranked in some particular way. An example would be class difficulty, such as beginner, intermediate, and advanced. Those three labels have a natural order of increasing difficulty.

Another example arises when we take quantitative data and split it into groups, giving us bins or categories of the data.

Figure: Ordinal Data


For plotting purposes, ordinal data is treated much in the same way as categorical data. But groups are usually
ordered from lowest to highest so that we can preserve this ordering.
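A short pandas sketch of the three forms of data above: numeric, categorical, and ordinal (an ordered categorical). The values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "price":  [250000.0, 310000.5, 175000.0],          # numeric (continuous)
    "colour": pd.Categorical(["red", "blue", "red"]),  # categorical (nominal)
    "size":   pd.Categorical(["small", "large", "medium"],
                             categories=["small", "medium", "large"],
                             ordered=True),            # ordinal
})

print(df.dtypes)
print(df["size"] > "small")   # ordinal data supports comparisons
```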

Exploring structure of data

Data Structure for Machine Learning

Machine learning is one of the hottest technologies used by data scientists and ML experts to deploy real-time projects. However, machine learning skills alone are not sufficient for solving real-world problems and designing a better product; you also have to gain good exposure to data structures.

The data structures used for machine learning are quite similar to those used in other software development fields. Machine learning is a subset of artificial intelligence that involves various complex algorithms for solving mathematical problems. Data structures help to build and understand these complex problems. Understanding data structures also helps you to build ML models and algorithms in a much more efficient way than other ML professionals. In this topic, "Data Structure for Machine Learning", we will discuss various concepts of data structure used in machine learning, along with the relationship between data structures and ML. So, let's start with a quick overview of data structures and machine learning.

What is Data Structure?

The data structure is defined as the basic building block of computer programming that helps us to
organize, manage and store data for efficient search and retrieval.

In other words, the data structure is the collection of data type 'values' which are stored and organized in such
a way that it allows for efficient access and modification.

Types of Data Structure

A data structure is an ordered arrangement of data, and it tells the compiler or interpreter how a programmer intends to use the data, such as Integer, String, Boolean, etc.

There are two different types of data structures: Linear and Non-linear data structures.

1. Linear Data structure:

The linear data structure is a special type of data structure that helps to organize and manage data in a specific
order where the elements are attached adjacently.

There are mainly 4 types of linear data structure as follows:

Array:

An array is one of the most basic and common data structures used in Machine Learning. It is also used in
linear algebra to solve complex mathematical problems. You will use arrays constantly in machine learning,
whether it's:

o To convert the column of a data frame into a list format in pre-processing analysis
o To order the frequency of words present in datasets.
o Using a list of tokenized words to begin clustering topics.
o In word embedding, by creating multi-dimensional matrices.

An array contains index numbers to represent an element starting from 0. The lowest index is arr[0] and
corresponds to the first element.

Let's take an example of a Python array used in machine learning. Although the Python array is quite different from arrays in other programming languages, the Python list is more popular, as it includes flexibility in data types and length. If you are using Python for ML algorithms, it's better to start your journey with arrays.

Python list methods:

Method: Description

append(): Adds an element at the end of the list.
clear(): Removes all elements from the list.
copy(): Returns a copy of the list.
count(): Returns the number of elements with the specified value.
extend(): Adds the elements of a list to the end of the current list.
index(): Returns the index of the first element with the specified value.
insert(): Adds an element at a specific position using an index number.
pop(): Removes an element from a specified position using an index number.
remove(): Removes the first element with the specified value.
reverse(): Reverses the order of the list.
sort(): Sorts the list.
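A quick demonstration of some of the list methods above:

```python
nums = [3, 1, 4]

nums.append(1)        # [3, 1, 4, 1]
nums.extend([5, 9])   # [3, 1, 4, 1, 5, 9]
nums.insert(0, 2)     # [2, 3, 1, 4, 1, 5, 9]
nums.remove(1)        # removes the first 1 -> [2, 3, 4, 1, 5, 9]
nums.sort()           # [1, 2, 3, 4, 5, 9]
nums.reverse()        # [9, 5, 4, 3, 2, 1]

print(nums.index(4), nums.count(1), nums.pop())   # -> 2 1 1
```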

Stacks:

Stacks are based on the concept of LIFO (Last In, First Out), also described as FILO (First In, Last Out). Although stacks are easy to learn and implement in ML models, a good grasp of them helps in many computer science areas, such as parsing grammars.

Stacks enable the undo and redo buttons on your computer, as they function like a stack of blog content: there is no sense in adding a blog at the bottom of the stack, and we can only check the most recent one that has been added. Addition and removal occur at the top of the stack.
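A minimal LIFO stack in Python using a list, matching the undo/redo behaviour described above:

```python
stack = []

stack.append("draft 1")    # push onto the top
stack.append("draft 2")
stack.append("draft 3")

print(stack.pop())   # "draft 3" -- the most recent item comes off first
print(stack[-1])     # peek at the new top: "draft 2"
```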

Linked List:

A linked list is a type of collection having several separately allocated nodes. In other words, it is a collection of data elements in which each element consists of a value and a pointer that points to the next node in the list.

In a linked list, insertion and deletion are constant-time operations and are very efficient, but accessing a value is slow and often requires scanning. So, a linked list is very useful where a dynamic array would require shifting of elements. Insertion of an element can be done at the head, middle, or tail position, though inserting in the middle is relatively costly. However, linked lists are easy to splice together and split apart. Also, the list can be converted to a fixed-length array for fast access.
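A minimal singly linked list sketch: each node stores a value and a pointer to the next node, as described above, and traversal requires scanning.

```python
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

# Build 1 -> 2 -> 3 and traverse it.
head = Node(1, Node(2, Node(3)))
node = head
while node is not None:
    print(node.value)
    node = node.next
```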

Queue:

A queue is defined as "FIFO" (first in, first out). It is useful for modelling queuing scenarios in real-time programs, such as people waiting in line to withdraw cash at a bank. Hence, the queue is significant in a program where multiple lists of codes need to be processed.

The queue data structure can be used to record the split time of a car in F1 racing.
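A minimal FIFO queue using collections.deque, modelling people waiting in line as in the bank example above:

```python
from collections import deque

queue = deque()
queue.append("customer 1")   # enqueue at the back
queue.append("customer 2")
queue.append("customer 3")

print(queue.popleft())   # "customer 1" -- first in, first out
print(list(queue))       # ["customer 2", "customer 3"]
```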

2. Non-linear Data Structures

As the name suggests, in non-linear data structures, elements are not arranged in any sequence. All the elements are arranged and linked with each other in a hierarchical manner, where one element can be linked with one or more elements.

1) Trees

Binary Tree:

The concept of a binary tree is very similar to a linked list; the only difference lies in the nodes and their pointers. In a linked list, each node contains a data value with a pointer that points to the next node in the list, whereas in a binary tree, each node has two pointers to subsequent nodes instead of just one.

In a binary search tree, the nodes are kept sorted, so insertion and deletion operations can be done with O(log N) time complexity (in a balanced tree). Similar to the linked list, a binary tree can also be converted to an array on the basis of tree sorting.

In a binary search tree, the value of the left child node is always less than the value of the parent node, while the value of the right child node is always greater than the parent node. Hence, in this structure, data sorting happens automatically, which makes insertion and deletion efficient.
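A hedged sketch of the binary search tree ordering described above: smaller values go to the left child, larger values to the right, so an in-order traversal yields sorted values.

```python
class TreeNode:
    def __init__(self, value):
        self.value, self.left, self.right = value, None, None

def insert(root, value):
    if root is None:
        return TreeNode(value)
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

def in_order(root):  # in-order traversal yields sorted values
    return in_order(root.left) + [root.value] + in_order(root.right) if root else []

root = None
for v in [50, 30, 70, 20, 60]:
    root = insert(root, v)
print(in_order(root))   # -> [20, 30, 50, 60, 70]
```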

2) Graphs

A graph data structure is also very useful in machine learning, for example for link prediction. A graph consists of nodes connected by ordered (directed) or unordered (undirected) pairs, i.e., edges. Hence, you must have good exposure to the graph data structure for machine learning and deep learning.

3) Maps

Maps are a popular data structure in the programming world, mostly useful for minimizing the run time of algorithms and for fast searching of data. A map stores data in the form of (key, value) pairs, where the key must be unique while the value can be duplicated. Each key corresponds to, or maps to, a value; hence it is named a map.

In different programming languages, core libraries have built-in maps or, rather, HashMaps with different
names for each implementation.

o In Java: Maps
o In Python: Dictionaries
o C++: hash_map, unordered_map, etc.

Python Dictionaries are very useful in machine learning and data science as various functions and algorithms
return the dictionary as an output. Dictionaries are also much used for implementing sparse matrices, which
is very common in Machine Learning.
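A short sketch of a dictionary used as a sparse vector, the common machine-learning use mentioned above: only non-zero entries are stored.

```python
sparse = {0: 3.5, 742: 1.0, 9981: 2.25}    # index -> value

print(sparse.get(742, 0.0))    # 1.0
print(sparse.get(5, 0.0))      # 0.0 -- absent keys default to zero

# Dot product of two sparse vectors touches only shared keys.
other = {742: 2.0, 5: 4.0}
dot = sum(v * other.get(k, 0.0) for k, v in sparse.items())
print(dot)                     # 2.0
```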

4) Heap data structure:

A heap is a hierarchically ordered data structure. The heap data structure is very similar to a tree, but it enforces vertical ordering rather than horizontal ordering.

Ordering in a heap is applied along the hierarchy but not across it: in a max-heap, the value of the parent node is always greater than that of its child nodes on either the left or right side.

Here, the insertion and deletion operations are performed on the basis of promotion. It means, firstly, the
element is inserted at the highest available position. After that, it gets compared with its parent and promoted
until it reaches the correct ranking position. Most of the heaps data structures can be stored in an array along
with the relationships between the elements.
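A minimal heap sketch with Python's heapq module, which stores the heap in an array as described above. Note that heapq builds a min-heap (parent smaller than children), the mirror of the max-heap ordering discussed.

```python
import heapq

heap = []
for value in [7, 2, 9, 1, 5]:
    heapq.heappush(heap, value)   # insert, then "promote" into place

print(heapq.heappop(heap))  # 1 -- the root is always the smallest element
print(heap[0])              # 2 -- the new root after re-ordering
```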

Dynamic array data structure:

This is one of the most important types of data structure used in linear algebra, handling 1-D, 2-D, 3-D, and even 4-D arrays for matrix arithmetic. Working with it requires good exposure to Python libraries such as NumPy for programming in deep learning.
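A small NumPy sketch of the multi-dimensional arrays mentioned above, used for matrix arithmetic in machine learning:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])      # a 2-D array (matrix)
B = np.array([[5, 6], [7, 8]])

print(A + B)         # element-wise addition
print(A @ B)         # matrix multiplication
print(A.reshape(4))  # reshape into a 1-D array

C = np.zeros((2, 3, 4))             # a 3-D array
print(C.shape)       # (2, 3, 4)
```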

How is Data Structure used in Machine Learning?

For a Machine learning professional, apart from knowledge of machine learning skills, it is required to have
mastery of data structure and algorithms.

When we use machine learning for solving a problem, we need to evaluate the model performance, i.e., which
model is fastest and requires the smallest amount of space and resources with accuracy. Moreover, if a model
is built using algorithms, comparing and contrasting two algorithms to determine the best for the job is crucial
to the machine learning professional. For such cases, skills in data structures become important for ML
professionals.

Data quality and remediation

Data quality (DQ) is the degree to which a given dataset meets a user's needs. Data quality is an important
criterion for ensuring that data-driven decisions are made as accurately as possible.

High quality data is of sufficient quantity -- and has sufficient detail -- to meet its intended uses. It is
consistent with other sources, presented in appropriate ways and has a high degree of completeness. Other
key data quality components include:

 Accuracy -- The extent to which data represents real-world events accurately.


 Credibility -- The extent to which data is considered trustworthy and true.
 Timeliness -- The extent to which data meets the user's current needs.
 Consistency -- The extent to which the same data occurrences have the same value in different
datasets.
 Integrity -- The extent to which all data references have been joined accurately.

Machine learning algorithms trained on accurate, clean, and well-labelled data can identify the patterns hidden
in the data and produce models that provide predictions with high accuracy. It is for this reason that it is very
important to understand the input, detect and address any issues affecting its quality, before feeding the input
to the machine learning algorithm.

Data quality evaluation

There are many aspects of data quality and various dimensions that one can consider when evaluating the data
at hand. Some of the most common aspects examined in the data quality assessment process are the following:

Number of missing values. Most of the real-world datasets contain missing values, i.e., feature entries with
no data value stored. As many machine learning algorithms do not support missing values, detecting the
missing values and properly handling them can have a significant impact.

Existence of duplicate values. Duplicate values can take various formats, such as multiple entries of the same
data point, multiple instances of an entire column, and repetition of the same value in an I.D. variable. While
duplicate instances might be valid in some datasets, they often arise because of errors in the data extraction
and integration processes. Hence, it is important to detect any duplicate values and decide if they correspond
to invalid values (true duplicates) or form a valid part of the dataset.

Existence of outliers/anomalies. Outliers are data points that differ substantially from the rest of data, and
they may arise due to the diversity of the dataset or because of errors/mistakes. As machine learning algorithms
are sensitive to the range and distribution of attribute values, identifying the outliers and their nature is
important for assessing the quality of the dataset.

Existence of invalid/badly formatted values. Datasets often contain inconsistent values, such as variables with
different units across the data points and variables with incorrect data type. For example, it is often the case
that some special numerical variables, such as percentages and fractions, are mistakenly stored as strings, and
one should detect and transform such cases so that the machine learning algorithm can work with the actual
numbers.

Improving data quality

After exploring the data to assess its quality and gain an in-depth understanding of the dataset, it is important
to resolve any detected issues before proceeding to the next stages of the machine learning pipeline. Below,
we give some of the most common ways for addressing such issues.

Handling missing values. There are different ways of dealing with missing data based on their number and
their data type:

Removing the missing data. If the number of data points containing missing values is small and the size of the dataset is large enough, you may remove such data points. Also, if a variable contains a very large number of missing values, the variable itself may be removed.

Imputation. If the number of missing values is not small enough to be removed and not large enough to be a
substantial proportion of the variable entries, you can replace the missing values in a numerical variable with
the mean/median of the non-missing entries and the missing values in a categorical variable with the mode,
which is the most frequent entry of the variable.
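A minimal pandas sketch of both imputation strategies (the tiny DataFrame and its column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({'Age': [38.0, None, 30.0],
                   'Country': ['India', 'France', None]})

# Numerical variable: replace missing values with the mean.
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Categorical variable: replace missing values with the mode.
df['Country'] = df['Country'].fillna(df['Country'].mode()[0])

print(df)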

Dealing with duplicate values. True duplicates, i.e., instances of the same data point, are usually removed.
In this way, the increase of the sample weight on these points is eliminated, and the risk of any artificial
inflation in the performance metrics is reduced.
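In pandas, detecting and dropping true duplicates is a single call (a sketch on made-up data):

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2], 'value': [10, 10, 20]})

df = df.drop_duplicates()                             # drop fully duplicated rows
df = df.drop_duplicates(subset=['id'], keep='first')  # or: dedupe on the ID column only
print(df)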

Dealing with outliers. As with the case of missing values, common methods of handling the detected outliers
include removing the outliers and imputing new values. However, depending on the context of the dataset and
the number of the outliers, keeping the outliers unchanged might be the most suitable course of action. For
example, keeping the outliers would be suggested in a dataset where the number of outliers is not very small
as they might be necessary to correctly understand the dataset.
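A common way to flag outliers is the interquartile-range (IQR) rule; a minimal sketch on made-up data:

import pandas as pd

s = pd.Series([12, 14, 13, 15, 14, 120])    # 120 looks suspicious

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(s[(s < lower) | (s > upper)])         # flags 120

Whether the flagged points are then removed, imputed, or kept depends on the context, as discussed above.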

Converting badly formatted values. All malformed values are converted and stored with the correct data type. For example, numerical variables stored as strings are converted to the corresponding numbers, and strings that represent dates are stored as date objects. It is also important to ensure that all entries in a variable use the same unit; otherwise, comparisons between the entries will not reflect true comparisons.
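pandas offers converters for exactly these cases; a sketch with made-up column names:

import pandas as pd

df = pd.DataFrame({'pct': ['10%', '25%'],
                   'when': ['2021-01-05', '2021-02-10']})

# Percentages stored as strings -> actual numbers.
df['pct'] = pd.to_numeric(df['pct'].str.rstrip('%')) / 100

# Date strings -> date objects.
df['when'] = pd.to_datetime(df['when'])

print(df.dtypes)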

What is data remediation?

Data remediation is the process of cleansing, organizing and migrating data so that it’s properly protected and
best serves its intended purpose. There is a misconception that data remediation simply means deleting
business data that is no longer needed. It’s important to remember that the key word “remediation” derives
from the word “remedy,” which is to correct a mistake. Since the core initiative is to correct data, the data
remediation process typically involves replacing, modifying, cleansing or deleting any “dirty” data.

Data remediation terminology

Data Migration – The process of moving data between two or more systems, data formats or servers.

Data Discovery – A manual or automated process of searching for patterns in data sets to identify structured
and unstructured data in an organization’s systems.

ROT – An acronym that stands for redundant, obsolete and trivial data. According to the Association for
Intelligent Information Management, ROT data accounts for nearly 80 percent of the unstructured data that is
beyond its recommended retention period and no longer useful to an organization.

Dark Data – Any information that businesses collect, process and store, but do not use for other purposes.
Some examples include customer call records, raw survey data or email correspondences. Often, the storing
and securing of this type of data incurs more expense and sometimes even greater risk than it does value.

Dirty Data – Data that damages the integrity of the organization’s complete dataset. This can include data
that is unnecessarily duplicated, outdated, incomplete or inaccurate.

Data Overload – This is when an organization has acquired too much data, including low-quality or dark
data. Data overload makes the tasks of identifying, classifying and remediating data laborious.

Data Cleansing – Transforming data in its native state to a predefined standardized format.

Data Governance – Management of the availability, usability, integrity and security of the data stored within
an organization.

Data Pre-processing

Data pre-processing is a process of preparing the raw data and making it suitable for a machine learning model.
It is the first and crucial step while creating a machine learning model.

When creating a machine learning project, it is not always the case that we come across clean and formatted data. And while doing any operation with data, it is mandatory to clean it and put it in a usable format. For this, we use the data pre-processing task.

Why do we need Data Preprocessing?

Real-world data generally contains noise and missing values, and may be in an unusable format which cannot be directly used for machine learning models. Data preprocessing is the required task of cleaning the data and making it suitable for a machine learning model, which also increases the accuracy and efficiency of the machine learning model.

It involves the steps below:

o Getting the dataset


o Importing libraries
o Importing datasets
o Finding Missing Data
o Encoding Categorical Data
o Splitting dataset into training and test set
o Feature scaling

1) Get the Dataset

To create a machine learning model, the first thing we require is a dataset, as a machine learning model works entirely on data. The collected data for a particular problem in a proper format is known as the dataset.

Datasets come in different formats for different purposes; for example, the dataset for a business problem will be different from the dataset required for a liver-patient problem, so each dataset is different from another. To use the dataset in our code, we usually put it into a CSV file. However, sometimes we may also need to use an HTML or xlsx file.

What is a CSV File?

CSV stands for "Comma-Separated Values"; it is a file format which allows us to save tabular data, such as spreadsheets. It is useful for huge datasets, which programs can then read and use.

Here we will use a demo dataset for data preprocessing; for practice, it can be downloaded from https://www.superdatascience.com/pages/machine-learning. For real-world problems, we can download datasets online from various sources such as https://www.kaggle.com/uciml/datasets, https://archive.ics.uci.edu/ml/index.php, etc.

We can also create our dataset by gathering data using various API with Python and put that data into a .csv
file.

2) Importing Libraries

In order to perform data preprocessing using Python, we need to import some predefined Python libraries.
These libraries are used to perform some specific jobs. There are three specific libraries that we will use for
data preprocessing, which are:

Numpy: The Numpy Python library is used for including any type of mathematical operation in the code. It is the fundamental package for scientific calculation in Python. It also supports large, multidimensional arrays and matrices. In Python, we can import it as:

1. import numpy as nm

Here we have used nm, which is a short name for Numpy, and it will be used in the whole program.

Matplotlib: The second library is matplotlib, which is a Python 2D plotting library; with this library, we also need to import its sub-library pyplot. This library is used to plot any type of chart in Python. It will be imported as below:

1. import matplotlib.pyplot as mtp

Here we have used mtp as a short name for this library.

Pandas: The last library is the Pandas library, which is one of the most famous Python libraries and is used for importing and managing datasets. It is an open-source data manipulation and analysis library. It will be imported as below:

1. import pandas as pd

Here, we have used pd as a short name for this library.

3) Importing the Datasets

Now we need to import the datasets which we have collected for our machine learning project. But before
importing a dataset, we need to set the current directory as a working directory. To set a working directory in
Spyder IDE, we need to follow the below steps:

1. Save your Python file in the directory which contains dataset.


2. Go to File explorer option in Spyder IDE, and select the required directory.
3. Click on F5 button or run option to execute the file.

After these steps, the Python file sits alongside the required dataset, and the current folder is set as the working directory.
read_csv() function:

Now to import the dataset, we will use the read_csv() function of the pandas library, which is used to read a csv file and perform various operations on it. Using this function, we can read a csv file locally as well as through a URL.

We can use read_csv function as below:

1. data_set= pd.read_csv('Dataset.csv')

Here, data_set is the name of the variable that stores our dataset, and inside the function we have passed the name of our dataset file. Once we execute the above line of code, the dataset is successfully imported into our code. We can also check the imported dataset in the Variable Explorer section by double-clicking on data_set. Indexing starts from 0, which is the default indexing in Python, and we can change the display format of our dataset by clicking on the format option.

Extracting dependent and independent variables:

In machine learning, it is important to distinguish the matrix of features (independent variables) from the dependent variable in the dataset. In our dataset, there are three independent variables, Country, Age, and Salary, and one dependent variable, Purchased.

Extracting independent variable:

To extract an independent variable, we will use iloc[ ] method of Pandas library. It is used to extract the
required rows and columns from the dataset.

1. x= data_set.iloc[:,:-1].values

In the above code, the first colon(:) is used to take all the rows, and the second colon(:) is for all the columns.
Here we have used :-1, because we don't want to take the last column as it contains the dependent variable. So
by doing this, we will get the matrix of features.

By executing the above code, we will get output as:

[['India' 38.0 68000.0]
 ['France' 43.0 45000.0]
 ['Germany' 30.0 54000.0]
 ['France' 48.0 65000.0]
 ['Germany' 40.0 nan]
 ['India' 35.0 58000.0]
 ['Germany' nan 53000.0]
 ['France' 49.0 79000.0]
 ['India' 50.0 88000.0]
 ['France' 37.0 77000.0]]

As we can see in the above output, the matrix contains only the three independent variables.

Extracting dependent variable:

To extract dependent variables, again, we will use Pandas .iloc[] method.

1. y= data_set.iloc[:,3].values

Here we have taken all the rows with the last column only. It will give the array of dependent variables.

By executing the above code, we will get output as:

Output:

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)

4) Handling Missing data:

The next step of data preprocessing is to handle missing data in the datasets. If our dataset contains some
missing data, then it may create a huge problem for our machine learning model. Hence it is necessary to
handle missing values present in the dataset.

Ways to handle missing data:

There are mainly two ways to handle missing data, which are:

By deleting the particular row: The first way is commonly used to deal with null values. Here, we just delete the specific row or column which consists of null values. But this way is not very efficient, and removing data may lead to loss of information, which will not give an accurate output.

By calculating the mean: In this way, we calculate the mean of the column which contains the missing values and put it in place of the missing value. This strategy is useful for features which have numeric data, such as age, salary, year, etc. Here, we will use this approach.

To handle missing values, we will use the Scikit-learn library in our code, which contains various tools for building machine learning models. Here we will use the Imputer class of the sklearn.preprocessing library (available in older scikit-learn versions). Below is the code for it:

1. #handling missing data (Replacing missing data with the mean value)
2. from sklearn.preprocessing import Imputer
3. imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)

4. #Fitting imputer object to the independent variables x.
5. imputer= imputer.fit(x[:, 1:3])
6. #Replacing missing data with the calculated mean value
7. x[:, 1:3]= imputer.transform(x[:, 1:3])

Output:

array([['India', 38.0, 68000.0],
['France', 43.0, 45000.0],
['Germany', 30.0, 54000.0],
['France', 48.0, 65000.0],
['Germany', 40.0, 65222.22222222222],
['India', 35.0, 58000.0],
['Germany', 41.111111111111114, 53000.0],
['France', 49.0, 79000.0],
['India', 50.0, 88000.0],
['France', 37.0, 77000.0]], dtype=object)

As we can see in the above output, the missing values have been replaced with the mean of the remaining values in the respective columns.
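Note that the Imputer class used above belongs to older scikit-learn releases; it was removed in later versions. In recent scikit-learn, the equivalent (to the best of our knowledge) is SimpleImputer from sklearn.impute, reusing the x matrix extracted earlier:

import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])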

5) Encoding Categorical data:

Categorical data is data which has some categories; in our dataset, there are two categorical variables, Country and Purchased.

Since a machine learning model works entirely on mathematics and numbers, a categorical variable in the dataset may create trouble while building the model. So it is necessary to encode these categorical variables into numbers.

For Country variable:

Firstly, we will encode the Country variable into numbers. To do this, we will use the LabelEncoder() class from the preprocessing library.

1. #Categorical data
2. #for Country Variable
3. from sklearn.preprocessing import LabelEncoder
4. label_encoder_x= LabelEncoder()
5. x[:, 0]= label_encoder_x.fit_transform(x[:, 0])

Output:

Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)

Explanation:

In the above code, we have imported the LabelEncoder class of the sklearn library, which has successfully encoded the variable into digits.

But in our case, the Country variable has three categories, and as we can see in the above output, they are encoded into 0, 1, and 2. From these values, the machine learning model may assume that there is some ordering or correlation between the categories, which would produce wrong output. To remove this issue, we will use dummy encoding.

Dummy Variables:

Dummy variables are variables which take only the values 0 or 1. A value of 1 indicates the presence of a category in a particular row, while the remaining dummy columns are 0. With dummy encoding, we get a number of columns equal to the number of categories.

In our dataset, we have 3 categories so it will produce three columns having 0 and 1 values. For Dummy
Encoding, we will use OneHotEncoder class of preprocessing library.

1. #for Country Variable


2. from sklearn.preprocessing import LabelEncoder, OneHotEncoder
3. label_encoder_x= LabelEncoder()
4. x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
5. #Encoding for dummy variables
6. onehot_encoder= OneHotEncoder(categorical_features= [0])
7. x= onehot_encoder.fit_transform(x).toarray()

Output:

array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01, 6.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.30000000e+01, 4.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01, 5.40000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01, 6.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01, 6.52222222e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.50000000e+01, 5.80000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.11111111e+01, 5.30000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.90000000e+01, 7.90000000e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 5.00000000e+01, 8.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01, 7.70000000e+04]])

As we can see in the above output, the Country variable has been encoded into 0/1 dummy values spread across the first three columns, while the remaining columns hold Age and Salary. This can be seen more clearly in the Variable Explorer section by clicking on the x variable.
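Similarly, the categorical_features argument of OneHotEncoder was removed in newer scikit-learn versions; there, the usual route (a hedged sketch, again reusing x) is a ColumnTransformer:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    [('country', OneHotEncoder(), [0])],   # one-hot encode column 0
    remainder='passthrough')               # keep the remaining columns as they are
x = ct.fit_transform(x)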

For Purchased Variable:

1. labelencoder_y= LabelEncoder()
2. y= labelencoder_y.fit_transform(y)

For the second categorical variable, we will only use the labelencoder object of the LabelEncoder class. Here we are not using the OneHotEncoder class because the Purchased variable has only two categories, yes or no, which are automatically encoded into 0 and 1.

Output:

Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

6) Splitting the Dataset into the Training set and Test set

In machine learning data preprocessing, we divide our dataset into a training set and test set. This is one of the
crucial steps of data preprocessing as by doing this, we can enhance the performance of our machine learning
model.

Suppose we train our machine learning model on one dataset and then test it on a completely different dataset. The model will then have difficulty understanding the correlations between the features and the target.

If we train our model very well and its training accuracy is very high, its performance may still decrease when it is given a new dataset. So we always try to make a machine learning model which performs well with the training set and also with the test dataset. Here, we can define these datasets as:
Training Set: A subset of dataset to train the machine learning model, and we already know the output.

Test set: A subset of dataset to test the machine learning model, and by using the test set, model predicts the
output.

For splitting the dataset, we will use the below lines of code:

1. from sklearn.model_selection import train_test_split


2. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

Explanation:

o In the above code, the first line is used for splitting arrays of the dataset into random train and test
subsets.
o In the second line, we have used four variables for our output that are
o x_train: features for the training data
o x_test: features for testing data
o y_train: Dependent variables for training data
o y_test: Dependent variable for testing data
o In the train_test_split() function, we have passed four parameters, of which the first two are the arrays of data, and test_size specifies the size of the test set. The test_size may be .5, .3, or .2, which gives the dividing ratio between the training and testing sets.
o The last parameter random_state sets a seed for the random generator so that you always get the same split; a commonly used value is 42.

Output:

By executing the above code, we will get 4 different variables, which can be seen in the Variable Explorer section: the x and y arrays are divided into 4 variables with the corresponding values.

7) Feature Scaling

Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the independent variables of the dataset so that they lie within a specific range. In feature scaling, we put our variables on the same scale so that no single variable dominates the others.

In our dataset, the Age and Salary columns are not on the same scale. Many machine learning models rely on Euclidean distance, and if we do not scale the variables, this will cause issues in our machine learning model.

The Euclidean distance between two points (x1, y1) and (x2, y2) is given as:

d = sqrt((x2 - x1)^2 + (y2 - y1)^2)

If we compute the distance between any two rows using Age and Salary, the salary values will dominate the age values and produce an incorrect result. So to remove this issue, we need to perform feature scaling for machine learning.

There are two ways to perform feature scaling in machine learning:

Standardization

Normalization

Here, we will use the standardization method for our dataset.
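For reference, standardization rescales each feature as x' = (x - mean) / standard deviation, giving the feature zero mean and unit variance, while normalization (min-max scaling) rescales as x' = (x - min) / (max - min), squeezing the values into the range [0, 1].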

For feature scaling, we will import StandardScaler class of sklearn.preprocessing library as:

1. from sklearn.preprocessing import StandardScaler

Now, we will create the object of StandardScaler class for independent variables or features. And then we
will fit and transform the training dataset.

1. st_x= StandardScaler()
2. x_train= st_x.fit_transform(x_train)

For the test dataset, we directly apply the transform() function instead of fit_transform(), because the scaler has already been fitted on the training set.

1. x_test= st_x.transform(x_test)

Output:

By executing the above lines of code, we will get the scaled values for x_train and x_test, which can be inspected in the Variable Explorer. All the features are now on a comparable scale, centered around 0 (with standardization, most values fall roughly between -2 and 2), so no single variable dominates the others.

Combining all the steps:

Now, in the end, we can combine all the steps together to make our complete code more understandable.

1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
5. #importing datasets
6. data_set= pd.read_csv('Dataset.csv')
7. #Extracting Independent Variable
8. x= data_set.iloc[:, :-1].values
9. #Extracting Dependent variable
10. y= data_set.iloc[:, 3].values
11. #handling missing data(Replacing missing data with the mean value)
12. from sklearn.preprocessing import Imputer
13. imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)
14. #Fitting imputer object to the independent variables x.
15. imputer= imputer.fit(x[:, 1:3])
16. #Replacing missing data with the calculated mean value
17. x[:, 1:3]= imputer.transform(x[:, 1:3])
18. #for Country Variable
19. from sklearn.preprocessing import LabelEncoder, OneHotEncoder
20. label_encoder_x= LabelEncoder()
21. x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
22. #Encoding for dummy variables
23. onehot_encoder= OneHotEncoder(categorical_features= [0])
24. x= onehot_encoder.fit_transform(x).toarray()
25. #encoding for purchased variable
26. labelencoder_y= LabelEncoder()
27. y= labelencoder_y.fit_transform(y)
28. # Splitting the dataset into training and test set.
29. from sklearn.model_selection import train_test_split
30. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
31. #Feature Scaling of datasets
32. from sklearn.preprocessing import StandardScaler
33. st_x= StandardScaler()
34. x_train= st_x.fit_transform(x_train)
35. x_test= st_x.transform(x_test)

In the above code, we have included all the data preprocessing steps together. But there are some steps or lines
of code which are not necessary for all machine learning models. So we can exclude them from our code to
make it reusable for all models.

UNIT-II

1. Model Selection
A machine learning model is defined as a mathematical representation of the output of the training process.
Machine learning is the study of different algorithms that can improve automatically through experience &
old data and build the model. A machine learning model is similar to computer software designed to recognize
patterns or behaviors based on previous experience or data. The learning algorithm discovers patterns within
the training data, and it outputs an ML model which captures these patterns and makes predictions on new
data.
When solving a machine learning problem, we may narrow down to several candidate models for the problem. We may further be interested in the selection of:
1. The best choice among various ML algorithms (e.g., logistic regression, support vector machines, neural networks, etc.)
2. Variables for linear regression
3. Basis terms such as polynomials, splines, or wavelets in function estimation
4. The most appropriate parametric family among several alternatives
While we are at it, what should we keep in mind so that we select the best model? The two primary criteria for model selection are prediction accuracy and model interpretability, which are listed below.

1. Prediction Accuracy – One of the main objectives of model selection in machine learning is to find the model with the highest prediction accuracy. It can be measured in terms of MSE or misclassification error, depending on whether the target variable is quantitative or qualitative, respectively.
2. Model Interpretability – A highly complex model with too many predictors not only introduces the overfitting problem but is also difficult to interpret. An appropriate model eliminates irrelevant variables to make the model both simpler and more accurate.

A good model selection technique will balance between prediction accuracy and simplicity.
Usually, we aim to find the model which works best on the test dataset. But, a designated test set is not
available when we are building a predictive model. To address this problem, two conventional approaches are
used to find the estimate of the test error.
1. Analytic Methods -We can indirectly estimate test error by making an adjustment to the training error to account for
the bias due to overfitting. In these groups of methods, the training error is calculated first and then a penalty is added
to the training error to estimate the testing error.
2. Resampling Methods - We can directly estimate the test error, using Resampling Methods. In resampling methods,
the model is fit on one dataset and is validated on the complementary dataset and the validation error is recorded for
each iteration. This process is repeated multiple times and the mean validation error is taken as an estimate for test error.

The Best Practices for Model Selection


Some general recommendations and best practices that are trendy in the data science community are listed below for
reference.

1. Keep in mind the objectives of model selection


2. Cross-Validation is the most attractive method for model selection.
3. 5 or 10-fold cross-validation fares well for the majority of the cases.

4. In simple linear models with a large number of predictors (p) and a large sample size (n), analytic methods perform as well as resampling methods and are computationally inexpensive.
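Since cross-validation is the recommended default, here is a minimal scikit-learn sketch (the iris data is just a stand-in for your own X, y and estimator):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
print(scores.mean(), scores.std())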

2. What is training model in machine learning?

A training model is a dataset that is used to train an ML algorithm. It consists of the sample output data and
the corresponding sets of input data that have an influence on the output. The training model is used to run
the input data through the algorithm to correlate the processed output against the sample output. The result
from this correlation is used to modify the model.

This iterative process is called “model fitting”. The accuracy of the training dataset or the validation dataset
is critical for the precision of the model.

Model training in machine learning is the process of feeding an ML algorithm with data to help it identify and learn good values for all attributes involved. There are several types of machine learning models, of which the most common ones are supervised and unsupervised learning.

Supervised learning is possible when the training data contains both the input and output values. Each set of
data that has the inputs and the expected output is called a supervisory signal. The training is done based on
the deviation of the processed result from the documented result when the inputs are fed into the model.

Unsupervised learning involves determining patterns in the data. Additional data is then used to fit patterns
or clusters. This is also an iterative process that improves the accuracy based on the correlation to the
expected patterns or clusters. There is no reference output dataset in this method.

Types of ML Models
Amazon ML supports three types of ML models: binary classification, multiclass classification, and regression. The
type of model you should choose depends on the type of target that you want to predict.

Binary Classification Model


ML models for binary classification problems predict a binary outcome (one of two possible classes). To train binary
classification models, Amazon ML uses the industry-standard learning algorithm known as logistic regression.

Examples of Binary Classification Problems


"Is this email spam or not spam?"
"Will the customer buy this product?"
"Is this product a book or a farm animal?"
"Is this review written by a customer or a robot?"

Multiclass Classification Model


ML models for multiclass classification problems allow you to generate predictions for multiple classes (predict one
of more than two outcomes). For training multiclass models, Amazon ML uses the industry-standard learning
algorithm known as multinomial logistic regression.

Examples of Multiclass Problems


"Is this product a book, movie, or clothing?"
"Is this movie a romantic comedy, documentary, or thriller?"
"Which category of products is most interesting to this customer?"

Regression Model
ML models for regression problems predict a numeric value. For training regression models, Amazon ML uses the
industry-standard learning algorithm known as linear regression.

Examples of Regression Problems


"What will the temperature be in Seattle tomorrow?"
"For this product, how many units will sell?"
"What price will this house sell for?"
Training Process
To train an ML model, you need to specify the following:

 Input training datasource


 Name of the data attribute that contains the target to be predicted
 Required data transformation instructions
 Training parameters to control the learning algorithm

During the training process, Amazon ML automatically selects the correct learning algorithm for you, based on the
type of target that you specified in the training data source.

Creating a Model in Machine Learning


There are 7 primary steps involved in creating a machine learning model. Here is a brief summarized
overview of each of these steps:
1. Defining the Problem
Defining the problem statement is the first step towards identifying what an ML model should achieve.
This step also enables recognizing the appropriate inputs and their respective outputs; Questions like
“what is the main objective?”, “what is the input data?” and “what is the model trying to predict?”
must be answered at this stage.
2. Data Collection
After defining the problem statement, it is necessary to investigate and gather data that can be used to
feed the machine. This is an important stage in the process of creating an ML model because the
quantity and quality of the data used will decide how effective the model is going to be. Data can be
gathered from pre-existing databases or built from scratch.
3. Preparing the Data
The data preparation stage is when data is profiled, formatted and structured as needed to make it ready
for training the model. This is the stage where the appropriate characteristics and attributes of data are
selected. This stage is likely to have a direct impact on the execution time and results. This is also the stage where data is categorized into two groups – one for training the ML model and the other for
evaluating the model. Pre-processing of data by normalizing, eliminating duplicates and making error
corrections is also carried out at this stage.
4. Assigning Appropriate Model / Protocols
Picking and assigning a model or protocol has to be done according to the objective that the ML model
aims to achieve. There are several models to pick from, like linear regression, k-means and bayesian.
The choice of models largely depends on the type of data that is being used. For instance, image
processing convolutional neural networks would be the ideal pick and k-means would work best for
segmentation.
5. Training the Machine Model or “The Model Training”
This is the stage where the ML algorithm is trained by feeding datasets. This is the stage where the
learning takes place. Consistent training can significantly improve the prediction rate of the ML model.
The weights of the model must be initialized randomly. This way the algorithm will learn to adjust the
weights accordingly.
6. Evaluating and Defining Measure of Success
The machine model will have to be tested against the “validation dataset”. This helps assess the
accuracy of the model. Identifying the measures of success based on what the model is intended to
achieve is critical for justifying correlation.
7. Parameter Tuning
Selecting the correct parameter that will be modified to influence the ML model is key to attaining
accurate correlation. The set of parameters that are selected based on their influence on the model
architecture are called hyper parameters. The process of identifying the hyper parameters by tuning
the model is called parameter tuning. Hyper parameters should be tuned so that the point of diminishing returns on validation accuracy comes as close to 100% as possible.

3. What is an interpretable model?


When humans easily understand the decisions a machine learning model makes, we have an “interpretable model”. In
short, we want to know what caused a specific decision. If we can tell how a model came to a decision, then that
model is interpretable.

For example, we can train a random forest machine learning model to predict whether a specific passenger survived the sinking of the Titanic in 1912. The model uses all the passenger's attributes – such as their ticket class, gender, and age – to predict whether they survived.

Now let’s say our random forest model predicts a 93% chance of survival for a particular passenger. How did it come
to this conclusion?

Random forest models can easily consist of hundreds or thousands of “trees.” This makes it nearly impossible to grasp
their reasoning.

But, we can make each individual decision interpretable using an approach borrowed from game theory.

SHAP plots show how the model used each passenger attribute to arrive at a prediction of 93% (or 0.93). In the Shapley plot, we can see the most important attributes the model factored in:
 the passenger was not in third class: survival chances increase substantially;
 the passenger was female: survival chances increase even more;
 the passenger was not in first class: survival chances fall slightly.

We can see that the model is performing as expected by combining this interpretation with what we know from
history: passengers with 1st or 2nd class tickets were prioritized for lifeboats, and women and children abandoned ship
before men.
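A minimal sketch of producing such an explanation with the shap package (model is assumed to be a fitted tree-based classifier and X its feature matrix; both names are ours):

import shap

explainer = shap.TreeExplainer(model)     # model: e.g. a fitted random forest
shap_values = explainer.shap_values(X)    # per-feature contribution for each passenger
shap.summary_plot(shap_values, X)         # overview of the most influential attributes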

By contrast, many other machine learning models are not currently possible to interpret. As machine learning is
increasingly used in medicine and law, understanding why a model makes a specific decision is important.

What do we gain from interpretable machine learning?


Interpretable models help us reach lots of the common goals for machine learning projects:

 Fairness: if we ensure our predictions are unbiased, we prevent discrimination against under-represented groups.
 Robustness: we need to be confident the model works in every setting, and that small changes in input don’t cause
large or unexpected changes in output.
 Privacy: if we understand the information a model uses, we can stop it from accessing sensitive information.
 Causality: we need to know the model only considers causal relationships and doesn’t pick up false correlations;
 Trust: if people understand how our model reaches its decisions, it’s easier for them to trust it.

Are some algorithms more interpretable than others?


Simpler algorithms like regression and decision trees are usually more interpretable than complex models like neural
networks. Having said that, lots of factors affect a model’s interpretability, so it’s difficult to generalize.

With very large datasets, more complex algorithms often prove more accurate, so there can be a trade-off between
interpretability and accuracy.

[Chart: interpretability on the y-axis and accuracy on the x-axis. Linear regression sits at the top left (very interpretable, not very accurate), and the negative correlation runs down through decision trees, SVMs, random forests, and neural networks. More accurate models are often more difficult to interpret.]

Scope of interpretability
By looking at scope, we have another way to compare models' interpretability. We can ask if a model is globally or locally interpretable:

 Global interpretability is understanding how the complete model works;
 Local interpretability is understanding how a single decision was reached.

A model is globally interpretable if it’s small and simple enough for a human to understand it entirely. A model is
locally interpretable if a human can trace back a single decision and understand how the model reached that decision.
A model is globally interpretable if we understand each and every rule it factors in. For example, a simple model
helping banks decide on home loan approvals might consider:
 The applicant’s monthly salary,
 The size of the deposit, and
 The applicant’s credit rating.
A human could easily evaluate the same data and reach the same conclusion, but a fully transparent and globally
interpretable model can save time.

In contrast, a far more complicated model could consider thousands of factors, like where the applicant lives and
where they grew up, their family’s debt history, and their daily shopping habits. It might be possible to figure out why
a single home loan was denied, if the model made a questionable decision. But because of the model’s complexity, we
won’t fully understand how it comes to decisions in general. This is a locally interpretable model.

Various ways to evaluate a machine learning model’s performance?


We will look at ways to measure the performance of a machine learning or deep learning model, and why to use one metric in place of another. We will discuss terms like:

1. Confusion matrix
2. Accuracy
3. Precision
4. Recall
5. Specificity
6. F1 score
7. Precision-Recall or PR curve
8. ROC (Receiver Operating Characteristics) curve
9. PR vs ROC curve.

For simplicity, we will mostly discuss things in terms of a binary classification problem where let’s say we’ll
have to find if an image is of a cat or a dog. Or a patient is having cancer (positive) or is found healthy
(negative). Some common terms to be clear with are:
True positives (TP): Predicted positive and are actually positive.
False positives (FP): Predicted positive and are actually negative.
True negatives (TN): Predicted negative and are actually negative.
False negatives (FN): Predicted negative and are actually positive.
So let's get started!

Confusion matrix
It’s just a representation of the above parameters in a matrix format. Better visualization is always good :)

Accuracy
The most commonly used metric to judge a model, yet it is actually not a clear indicator of performance. The worst happens when classes are imbalanced.

Take for example a cancer detection model. The chances of actually having cancer are very low. Let's say that out of 100 patients, 90 don't have cancer and the remaining 10 actually have it. We don't want to miss a patient who has cancer but goes undetected (a false negative). Predicting everyone as cancer-free gives an accuracy of 90% straight away, yet the model did nothing; it just predicted cancer-free for all 100 patients.
We surely need better alternatives.

Precision
Percentage of positive instances out of the total predicted positive instances: Precision = TP / (TP + FP). The denominator is the number of positive predictions the model made over the whole dataset. Read it as 'how often the model is right when it says it is right'.

Recall/Sensitivity/True Positive Rate


Percentage of positive instances out of the total actual positive instances: Recall = TP / (TP + FN). The denominator is the actual number of positive instances present in the dataset. Read it as 'how many of the actual positives the model managed to catch'.

Specificity
Percentage of negative instances out of the total actual negative instances: Specificity = TN / (TN + FP). The denominator is the actual number of negative instances present in the dataset. It is similar to recall, but the focus is on the negative instances, like finding out how many healthy patients were correctly told they don't have cancer. It is a measure of how well the classes are separated.

F1 score
It is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). It takes the contribution of both, so the higher the F1 score, the better. Because of the product in the numerator, if either one goes low, the final F1 score goes down significantly. So a model does well on F1 if the predicted positives are actually positive (precision) and it doesn't miss positives by predicting them negative (recall).

One drawback is that both precision and recall are given equal importance due to which according to our
application we may need one higher than the other and F1 score may not be the exact metric for it. Therefore
either weighted-F1 score or seeing the PR or ROC curve can help.
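All of these metrics are one call away in scikit-learn; a short sketch on illustrative cancer-style labels (1 = has cancer, 0 = healthy):

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]   # TP=2, TN=4, FP=1, FN=1

print(confusion_matrix(y_true, y_pred))   # [[4 1], [1 2]] as [[TN FP], [FN TP]]
print(accuracy_score(y_true, y_pred))     # 0.75
print(precision_score(y_true, y_pred))    # 2/3
print(recall_score(y_true, y_pred))       # 2/3
print(f1_score(y_true, y_pred))           # 2/3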

PR curve
It is the curve between precision and recall for various threshold values. With several predictors, one can plot each predictor's precision-recall curve over a range of thresholds. The top-right part of the graph is the ideal region, where we get both high precision and high recall. Based on our application, we can choose the predictor and the threshold value. PR AUC is just the area under the curve; the higher its numerical value, the better.

ROC curve
ROC stands for receiver operating characteristic, and the curve is plotted as TPR against FPR for various threshold values. As TPR increases, FPR also increases. We want the threshold value that takes us closest to the top-left corner of the plot. Comparing different predictors on a given dataset also becomes easy, and one can choose the threshold according to the application at hand. ROC AUC is just the area under the curve; the higher its numerical value, the better.
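Computing the points of the ROC curve and its AUC is similarly direct (y_scores stands in for the model's predicted probabilities):

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_scores = [0.9, 0.2, 0.3, 0.4, 0.1, 0.8, 0.6, 0.3]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # one (FPR, TPR) pair per threshold
print(roc_auc_score(y_true, y_scores))               # area under the ROC curve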

PR vs ROC curve
Both metrics are widely used to judge a model's performance. Which one should we use, PR or ROC?

The answer lies in TRUE NEGATIVES.

Because TN is absent from the precision-recall equations, PR curves are useful for imbalanced classes, particularly when the negative class is in the majority. The metric does not take into consideration the high number of TRUE NEGATIVES of the majority negative class, which gives it better resistance to the imbalance. This matters when detection of the positive class is very important.
For example, detecting cancer patients involves a high class imbalance, because very few of all the people diagnosed actually have it. We certainly don't want a person with cancer to go undetected (recall), and we want to be sure the detected ones actually have it (precision).
Because the ROC equation does consider TN, i.e., the negative class, it is useful when both classes are important to us, like the detection of cats and dogs. The inclusion of true negatives makes sure that both classes are given importance, as in the output of a CNN model determining whether an image is of a cat or a dog.

Conclusion
The evaluation metric to use depends heavily on the task at hand. For a long time, accuracy was the only measure I used, which is really a vague option. I hope this blog has been useful for you. That's all from my side. Feel free to suggest corrections and improvements.

4. Improving the Performance of ML Models


1. Choosing the Right Algorithms
Algorithms are the key factor in training ML models. The data fed into them helps the model learn and predict with accurate results. Hence, choosing the right algorithm is important to ensure the performance of your machine learning model.

Linear Regression, Logistic Regression, Decision Tree, SVM, Naive Bayes, kNN, K-Means, Random Forest, Dimensionality Reduction Algorithms and Gradient Boosting are the leading ML algorithms you can choose, as per your ML model's compatibility.

2. Use the Right Quantity of Data


The next important factor to consider while developing a machine learning model is choosing the right quantity of data. Multiple factors play a role here, and for deep learning-based ML models, a huge quantity of data is required by the algorithms.

Depending on the complexity of the problem and of the learning algorithm, model skill, data size evaluation and the use of statistical heuristic rules are the leading factors that determine the quantity and types of training data sets, which in turn help in improving the performance of the model.

3. Quality of Training Data Sets


Just like quantity, the quality of the machine learning training data set is another key factor to keep in mind while developing an ML model. If the quality of the training data is not good or accurate, your model will never give accurate results, affecting the overall performance of the model and making it unsuitable for real-life use.

Actually, there are different methods to measure the quality of a training data set. Standard quality-assurance methods and detailed in-depth quality assessment are the two leading methods you can use to ensure the quality of data sets. Quality data is important for getting unbiased decisions from ML models, so make sure to use the right quality of training data to improve the performance of your ML model.

4. Supervised or Unsupervised ML
Beyond the ML algorithms discussed above, the performance of such AI-based models is affected by the method or process of machine learning: supervised, unsupervised and reinforcement learning. In supervised learning, the algorithm works with a target/outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables).

In unsupervised machine learning, a model is not given any target or outcome variable to predict/estimate. It is used for clustering a population into different groups, which is widely applied for segmenting customers into different groups for specific interventions. For supervised ML, labeled or annotated data is required, while for unsupervised ML the approach is different.

Similarly, reinforcement learning is another important approach, used to train the model to make specific decisions. In this training process, the machine learns from previous experiences and tries to store the most suitable knowledge for making the right predictions.

5. Model Validation and Testing


Building a machine learning model is not enough to get the right predictions; you have to check its accuracy and validate it to ensure precise results. Validating the model improves the performance of the ML model.

Actually, there are various validation techniques you can follow, but you need to choose the one best suited to your ML model; it will help you improve the overall performance of the model and predict in an unbiased manner. Similarly, testing the model is also important to ensure its accuracy and performance.

Summing-up
Improving machine learning model performance will not only make the model predict in an unbiased manner but also make it one of the most reliable and acceptable in the AI world. Hence, machine learning engineers and data scientists need to keep all these points in mind while working on such models to improve the overall performance of the AI model.

5. What is Feature Transformation?


Feature transformation is a mathematical transformation in which we apply a mathematical formula to a
particular column (feature) and transform the values, which are useful for our further analysis. It is a
technique by which we can boost our model performance. It is also known as Feature Engineering, which
creates new features from existing features that may help improve the model performance.

It refers to the algorithm family that creates new features using the existing features. These new features
may not have the same interpretation as the original features, but they may have more explanatory power in
a different space rather than in the original space. This can also be used for Feature Reduction. It can be
done in many ways, by linear combinations of original features or using non-linear functions. It helps
machine learning algorithms to converge faster.

Why do we need Feature Transformations?


Like Linear and Logistic regression, some data science models assume that the variables follow a normal
distribution. More likely, variables in real datasets will follow a skewed distribution. By applying some
transformations to these skewed variables, we can map this skewed distribution to a normal distribution to
increase the performance of our models.

As we know, Normal Distribution is a very important distribution in Statistics, which is key to many
statisticians for solving problems in statistics. Usually, the data distribution in Nature follows a Normal
distribution like - age, income, height, weight, etc. But the features in the real-life data are not normally
distributed. However, it is the best approximation when we are unaware of the underlying distribution
pattern.

Feature Transformation Techniques


The following transformation techniques can be applied to data sets:


1. Log Transformation: Generally, these transformations make our data closer to a normal distribution, though they cannot make it follow a normal distribution exactly. This transformation is not applied to features that have negative values, and it is mostly applied to right-skewed data. It converts data from an additive scale to a multiplicative scale.

2. Reciprocal Transformation: This transformation is not defined for zero. It is a powerful transformation with a radical effect. It reverses the order among values of the same sign, so large values become smaller and vice versa.

3. Square Transformation: This transformation mostly applies to left-skewed data.

4. Square Root Transformation: This transformation is defined only for positive numbers. This can be
used for reducing the skewness of right-skewed data. This transformation is weaker than Log
Transformation.

5. Custom Transformation: A FunctionTransformer forwards its X (and optionally y) arguments to a user-defined function or function object and returns this function's result. The resulting transformer will not be picklable if a lambda is used as the function. This is useful for stateless transformations such as taking the log of frequencies, doing custom scaling, etc.

6. Power Transformations: Power transforms are a family of parametric, monotonic transformations that
make data more Gaussian-like. The optimal parameter for stabilizing variance and minimizing skewness is
estimated through maximum likelihood. This is useful for modeling issues related to non-constant variance
or other situations where normality is desired. Currently, Power Transformer supports the Box-Cox
transform and the Yeo-Johnson transform.

Box-cox requires the input data to be strictly positive (not even zero is acceptable), while Yeo-Johnson
supports both positive and negative data.
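Several of these transformations are one line with NumPy or scikit-learn; a sketch on a made-up right-skewed feature:

import numpy as np
from sklearn.preprocessing import PowerTransformer

x = np.array([[1.0], [10.0], [100.0], [1000.0]])   # right-skewed feature

log_x = np.log(x)       # log transformation (strictly positive values only)
recip_x = 1.0 / x       # reciprocal transformation (undefined at zero)
sqrt_x = np.sqrt(x)     # square-root transformation (non-negative values)

pt = PowerTransformer(method='yeo-johnson')        # 'box-cox' needs strictly positive data
gaussian_like = pt.fit_transform(x)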

6. Feature Subset Selection ?


What is Feature Selection?
A feature is an attribute that has an impact on a problem or is useful for the problem, and choosing the
important features for the model is known as feature selection. Each machine learning process depends on
feature engineering, which mainly contains two processes; which are Feature Selection and Feature Extraction.
Although feature selection and extraction processes may have the same objective, both are completely
different from each other. The main difference between them is that feature selection is about selecting the
subset of the original feature set, whereas feature extraction creates new features. Feature selection is a way of reducing the input variables for the model by using only relevant data, in order to reduce overfitting in the model.
So, we can define feature Selection as, "It is a process of automatically or manually selecting the subset of
most appropriate and relevant features to be used in model building." Feature selection is performed by either
including the important features or excluding the irrelevant features in the dataset without changing them.

Need for Feature Selection


Before implementing any technique, it is really important to understand the need for that technique, and the same goes for feature selection. As we know, in machine learning it is necessary to provide a pre-processed and good input dataset in order to get better outcomes. We collect a huge amount of data to train our model and help it learn better. Generally, the dataset consists of noisy data, irrelevant data, and some useful data. Moreover, the huge amount of data also slows down the training process of the model, and with noise and irrelevant data, the model may not predict and perform well. So it is very necessary to remove such noise and less-important data from the dataset, and to do this, feature selection techniques are used.
Selecting the best features helps the model to perform well.
For example, suppose we want to create a model that automatically decides which car should be crushed for
a spare part, and to do this, we have a dataset. This dataset contains a Model of the car, Year, Owner's name,
Miles. So, in this dataset, the name of the owner does not contribute to the model performance as it does not
decide if the car should be crushed or not, so we can remove this column and select the rest of the features
(column) for the model building.
Below are some benefits of using feature selection in machine learning:
o It helps in avoiding the curse of dimensionality.
o It helps in the simplification of the model so that it can be easily interpreted by the researchers.
o It reduces the training time.
o It reduces overfitting and hence enhances generalization.

Feature Selection Techniques

There are mainly two types of Feature Selection techniques, which are:
o Supervised Feature Selection technique
Supervised Feature selection techniques consider the target variable and can be used for the labelled
dataset.
o Unsupervised Feature Selection technique
Unsupervised Feature selection techniques ignore the target variable and can be used for the
unlabelled dataset.

There are mainly three techniques under supervised feature Selection:

1. Wrapper Methods

In the wrapper methodology, the selection of features is treated as a search problem, in which different combinations are made, evaluated, and compared with other combinations. The algorithm is trained iteratively using a subset of features.

On the basis of the output of the model, features are added or subtracted, and with this new feature set, the model is trained again.

Some techniques of wrapper methods are:

o Forward selection - Forward selection is an iterative process,


which begins with an empty set of features. After each iteration, it keeps adding
on a feature and evaluates the performance to check whether it is improving the
performance or not. The process continues until the addition of a new
variable/feature does not improve the performance of the model.
o Backward elimination - Backward elimination is also an iterative approach, but it is the opposite
of forward selection. This technique begins the process by considering all the features and removes
the least significant feature. This elimination process continues until removing the features does not
improve the performance of the model.
o Exhaustive Feature Selection - Exhaustive feature selection is one of the best feature selection methods, which evaluates each feature set by brute force. It means this method tries every possible combination of features and returns the best-performing feature set.

o Recursive Feature Elimination - Recursive feature elimination is a recursive greedy optimization approach, where features are selected by recursively taking a smaller and smaller subset of features. An estimator is trained with each set of features, and the importance of each feature is determined using the coef_ attribute or the feature_importances_ attribute, as in the sketch below.
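A minimal sketch of recursive feature elimination with scikit-learn, assuming a synthetic classification dataset for illustration:

# A minimal sketch of Recursive Feature Elimination (RFE).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Recursively drop the weakest feature (by coef_) until 4 remain.
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # rank 1 marks a selected feature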

2. Filter Methods

In Filter Method, features are selected on the basis of statistics measures. This method does not depend on the
learning algorithm and chooses the features as a pre-processing step.

The filter method filters out the irrelevant feature and redundant columns from the model by using different
metrics through ranking.

The advantage of using filter methods is that they need low computational time and do not overfit the data.

Some common techniques of Filter methods are as follows:

o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio

Information Gain: Information gain determines the reduction in entropy


while transforming the dataset. It can be used as a feature selection technique
by calculating the information gain of each variable with respect to the target variable.

Chi-square Test: Chi-square test is a technique to determine the relationship between the categorical variables.
The chi-square value is calculated between each feature and the target variable, and the desired number of
features with the best chi-square value is selected.

Fisher's Score:

Fisher's score is one of the popular supervised techniques of feature selection. It returns the rank of each variable on Fisher's criterion in descending order; we can then select the variables with a large Fisher's score.

Missing Value Ratio:

The missing value ratio can be used for evaluating a feature set against a threshold value. The missing value ratio is the number of missing values in a column divided by the total number of observations. Variables having a ratio greater than the threshold can be dropped, as in the sketch below.
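A minimal sketch of filter-based selection, assuming a pandas DataFrame with non-negative features (the chi-square test requires non-negative input); the iris dataset is used purely for illustration:

# A minimal sketch of filter methods: missing value ratio, chi-square,
# and mutual information (an estimate of information gain).
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Missing value ratio per column: missing count / total observations.
missing_ratio = df.isnull().sum() / len(df)
df = df.loc[:, missing_ratio <= 0.3]   # drop columns above a 30% threshold

# Keep the 2 features with the best chi-square score against the target.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(df, data.target)

# Information gain can be approximated with mutual information.
print(mutual_info_classif(df, data.target, random_state=0))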

3. Embedded Methods

Embedded methods combine the advantages of both filter and wrapper methods by considering the interaction of features along with low computational cost. They are fast processing methods similar to the filter method, but more accurate than the filter method.

These methods are also iterative: each training iteration is evaluated, and the features that contribute most to that iteration are extracted. Some techniques of embedded methods are:

o Regularization - Regularization adds a penalty term to the parameters of the machine learning model to avoid overfitting. This penalty term is applied to the coefficients, and hence it shrinks some coefficients to zero. Features with zero coefficients can be removed from the dataset. Regularization techniques of this kind include L1 regularization (Lasso) and Elastic Net (combined L1 and L2 regularization).
o Random Forest Importance - Different tree-based methods of feature selection provide feature importances as a way of selecting features. Here, feature importance specifies which feature has more importance in model building or the greatest impact on the target variable. Random Forest is such a tree-based method: a bagging algorithm that aggregates a number of decision trees. It automatically ranks the nodes by their performance, i.e., the decrease in impurity (Gini impurity) over all the trees. Nodes are arranged by their impurity values, which allows pruning the trees below a specific node. The remaining nodes create a subset of the most important features; see the sketch below.
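A minimal sketch of both embedded approaches, assuming a synthetic regression task for illustration:

# A minimal sketch of embedded selection: L1 (Lasso) and tree importances.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# L1 regularization shrinks some coefficients exactly to zero;
# SelectFromModel keeps only the features with non-zero coefficients.
lasso_selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
print(lasso_selector.get_support())

# Tree-based importance: higher values mean a larger impurity decrease.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_)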

How to choose a Feature Selection Method?

For machine learning engineers, it is very important to understand which feature selection method will work properly for their model. The better we know the data types of the variables, the easier it is to choose the appropriate statistical measure for feature selection.

To know this, we need to first identify the type of input and output variables. In machine learning, variables
are of mainly two types:

o Numerical Variables: Variables with continuous values, such as integers and floats.

o Categorical Variables: Variables with categorical values, such as Boolean, ordinal, and nominal variables.

Below are some univariate statistical measures, which can be used for filter-based feature selection:

1. Numerical Input, Numerical Output:

Numerical Input variables are used for predictive regression modelling. The common method to be used for
such a case is the Correlation coefficient.
o Pearson's correlation coefficient (For linear Correlation).
o Spearman's rank coefficient (for non-linear correlation).

2. Numerical Input, Categorical Output:

Numerical Input with categorical output is the case for classification predictive modelling problems. In this
case, also, correlation-based techniques should be used, but with categorical output.
o ANOVA correlation coefficient (linear).
o Kendall's rank coefficient (nonlinear).

3. Categorical Input, Numerical Output:

This is the case of regression predictive modelling with categorical input. It is a different example of a
regression problem. We can use the same measures as discussed in the above case but in reverse order.

4. Categorical Input, Categorical Output:

This is a case of classification predictive modelling with categorical Input variables.


The commonly used technique for such a case is Chi-Squared Test. We can also use Information gain in this
case.

Conclusion

Feature selection is a very complicated and vast field of machine learning, and many studies have already been made to discover the best methods. There is no fixed rule for the best feature selection method. Rather, the choice of method depends on the machine learning engineer, who can combine and innovate approaches to find the best method for a specific problem. One should try a variety of model fits on different subsets of features selected through different statistical measures.

UNIT III
BAYESIAN LEARNING

Bayesian reasoning provides a probabilistic approach to inference. It is based on the


assumption that the quantities of interest are governed by probability distributions and that
optimal decisions can be made by reasoning about these probabilities together with observed
data

INTRODUCTION

Bayesian learning methods are relevant to the study of machine learning for two different reasons.
1. First, Bayesian learning algorithms that calculate explicit probabilities for hypotheses,
such as the naive Bayes classifier, are among the most practical approaches to certain
types of learning problems
2. The second reason is that they provide a useful perspective for understanding many
learning algorithms that do not explicitly manipulate probabilities.

Features of Bayesian Learning Methods

 Each observed training example can incrementally decrease or increase the estimated
probability that a hypothesis is correct. This provides a more flexible approach to
learning than algorithms that completely eliminate a hypothesis if it is found to be
inconsistent with any single example
 Prior knowledge can be combined with observed data to determine the final probability
of a hypothesis. In Bayesian learning, prior knowledge is provided by asserting (1) a
prior probability for each candidate hypothesis, and (2) a probability distribution over
observed data for each possible hypothesis.
 Bayesian methods can accommodate hypotheses that make probabilistic predictions
 New instances can be classified by combining the predictions of multiple hypotheses,
weighted by their probabilities.
 Even in cases where Bayesian methods prove computationally intractable, they can
provide a standard of optimal decision making against which other practical methods
can be measured.


Practical difficulty in applying Bayesian methods

1. One practical difficulty in applying Bayesian methods is that they typically require
initial knowledge of many probabilities. When these probabilities are not known in
advance they are often estimated based on background knowledge, previously available
data, and assumptions about the form of the underlying distributions.
2. A second practical difficulty is the significant computational cost required to determine
the Bayes optimal hypothesis in the general case. In certain specialized situations, this
computational cost can be significantly reduced.

BAYES THEOREM

Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior
probability, the probabilities of observing various data given the hypothesis, and the observed
data itself.
Notations
 P(h) prior probability of h, reflects any background knowledge about the chance that h
is correct
 P(D) prior probability of D, probability that D will be observed
 P(D|h) probability of observing D given a world in which h holds
 P(h|D) posterior probability of h, reflects confidence that h holds after D has been
observed

Bayes theorem is the cornerstone of Bayesian learning methods because it provides a way to calculate the posterior probability P(h|D) from the prior probability P(h), together with P(D) and P(D|h):

P(h|D) = P(D|h) P(h) / P(D)

 P(h|D) increases with P(h) and with P(D|h) according to Bayes theorem.
 P(h|D) decreases as P(D) increases, because the more probable it is that D will be
observed independent of h, the less evidence D provides in support of h.


Maximum a Posteriori (MAP) Hypothesis

 In many learning scenarios, the learner considers some set of candidate hypotheses H
and is interested in finding the most probable hypothesis h ∈ H given the observed data
D. Any such maximally probable hypothesis is called a maximum a posteriori (MAP)
hypothesis.
 Applying Bayes theorem to calculate the posterior probability of each candidate hypothesis, hMAP is a MAP hypothesis provided

hMAP = argmax h∈H P(h|D) = argmax h∈H P(D|h) P(h) / P(D) = argmax h∈H P(D|h) P(h)

 In the last step, P(D) can be dropped, because it is a constant independent of h

Maximum Likelihood (ML) Hypothesis

 In some cases, it is assumed that every hypothesis in H is equally probable a priori


(P(hi) = P(hj) for all hi and hj in H).
 In this case the equation above can be simplified, and we need only consider the term P(D|h) to find the most probable hypothesis:

hML = argmax h∈H P(D|h)

P(D|h) is often called the likelihood of the data D given h, and any hypothesis that maximizes
P(D|h) is called a maximum likelihood (ML) hypothesis

Example
 Consider a medical diagnosis problem in which there are two alternative hypotheses:
(1) that the patient has particular form of cancer, and (2) that the patient does not. The
available data is from a particular laboratory test with two possible outcomes: +
(positive) and - (negative).


 We have prior knowledge that over the entire population of people only .008 have this
disease. Furthermore, the lab test is only an imperfect indicator of the disease.
 The test returns a correct positive result in only 98% of the cases in which the disease is
actually present and a correct negative result in only 97% of the cases in which the
disease is not present. In other cases, the test returns the opposite result.
 The above situation can be summarized by the following probabilities:

P(cancer) = 0.008          P(¬cancer) = 0.992
P(+ | cancer) = 0.98       P(− | cancer) = 0.02
P(+ | ¬cancer) = 0.03      P(− | ¬cancer) = 0.97

Suppose a new patient is observed for whom the lab test returns a positive (+) result. Should we diagnose the patient as having cancer or not? Applying the MAP rule:

P(+ | cancer) P(cancer) = 0.98 × 0.008 = 0.0078
P(+ | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298

Thus hMAP = ¬cancer: even given the positive test result, it is more probable that the patient does not have cancer. The exact posterior probabilities can also be determined by normalizing the above quantities so that they sum to 1, giving P(cancer | +) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21.
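A minimal sketch of this computation in Python, using only the probabilities stated above:

# Posterior probability of cancer given a positive test, by Bayes theorem.
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

# Unnormalized posteriors P(+|h) P(h) for each hypothesis.
score_cancer = p_pos_given_cancer * p_cancer          # 0.0078
score_not = p_pos_given_not * p_not_cancer            # 0.0298

# Normalize so the two posteriors sum to 1.
p_cancer_given_pos = score_cancer / (score_cancer + score_not)
print(round(p_cancer_given_pos, 2))                   # ~0.21 -> hMAP is "no cancer"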

Basic formulas for calculating probabilities are summarized in the table below:

 Product rule: P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)
 Sum rule: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
 Bayes theorem: P(h | D) = P(D | h) P(h) / P(D)
 Theorem of total probability: if events A1, . . . , An are mutually exclusive with Σi P(Ai) = 1, then P(B) = Σi P(B | Ai) P(Ai)


BAYES THEOREM AND CONCEPT LEARNING

What is the relationship between Bayes theorem and the problem of concept learning?

Bayes theorem provides a principled way to calculate the posterior probability of each hypothesis given the training data, so we can use it as the basis for a straightforward learning algorithm that calculates the probability for each possible hypothesis and then outputs the most probable one.

Brute-Force Bayes Concept Learning

Consider the concept learning problem


 Assume the learner considers some finite hypothesis space H defined over the instance
space X, in which the task is to learn some target concept c : X → {0,1}.
 Learner is given some sequence of training examples ((x1, d1) . . . (xm, dm)) where xi is
some instance from X and where di is the target value of xi (i.e., di = c(xi)).
 The sequence of target values are written as D = (d1 . . . dm).

We can design a straightforward concept learning algorithm to output the maximum a posteriori
hypothesis, based on Bayes theorem, as follows:

BRUTE-FORCE MAP LEARNING algorithm:

1. For each hypothesis h in H, calculate the posterior probability

P(h|D) = P(D|h) P(h) / P(D)

2. Output the hypothesis hMAP with the highest posterior probability: hMAP = argmax h∈H P(h|D)

In order to specify a learning problem for the BRUTE-FORCE MAP LEARNING algorithm, we must specify what values are to be used for P(h) and for P(D|h).

Let’s choose P(h) and for P(D|h) to be consistent with the following assumptions:
 The training data D is noise free (i.e., di = c(xi))
 The target concept c is contained in the hypothesis space H
 Do not have a priori reason to believe that any hypothesis is more probable than any
other.


What values should we specify for P(h)?


 Given no prior knowledge that one hypothesis is more likely than another, it is
reasonable to assign the same prior probability to every hypothesis h in H.
 Assume the target concept is contained in H and require that these prior probabilities sum to 1: P(h) = 1 / |H| for all h in H.

What choice shall we make for P(D|h)?


 P(D|h) is the probability of observing the target values D = (d1 . . .dm) for the fixed set
of instances (x1 . . . xm), given a world in which hypothesis h holds
 Since we assume noise-free training data, the probability of observing classification di given h is just 1 if di = h(xi) and 0 if di ≠ h(xi). Therefore,

P(D|h) = 1 if di = h(xi) for all di in D, and 0 otherwise

Given these choices for P(h) and for P(D|h) we now have a fully-defined problem for the above
BRUTE-FORCE MAP LEARNING algorithm.

Recalling Bayes theorem, we have

P(h|D) = P(D|h) P(h) / P(D)

First consider the case where h is inconsistent with the training data D. Since P(D|h) = 0 for such an h, we get P(h|D) = 0 · P(h) / P(D) = 0: the posterior probability of a hypothesis inconsistent with D is zero.

Now consider the case where h is consistent with D. Since P(D|h) = 1 and P(D) = |VSH,D| / |H|,

P(h|D) = 1 · (1/|H|) / (|VSH,D| / |H|) = 1 / |VSH,D|

where VSH,D is the subset of hypotheses from H that are consistent with D (the version space).

To summarize, Bayes theorem implies that the posterior probability P(h|D) under our assumed P(h) and P(D|h) is

P(h|D) = 1 / |VSH,D| if h is consistent with D, and 0 otherwise


The Evolution of Probabilities Associated with Hypotheses

 Figure (a) all hypotheses have the same probability.


 Figures (b) and (c), As training data accumulates, the posterior probability for
inconsistent hypotheses becomes zero while the total probability summing to 1 is
shared equally among the remaining consistent hypotheses.

MAP Hypotheses and Consistent Learners

 A learning algorithm is a consistent learner if it outputs a hypothesis that commits zero


errors over the training examples.
 Every consistent learner outputs a MAP hypothesis, if we assume a uniform prior
probability distribution over H (P(hi) = P(hj) for all i, j), and deterministic, noise free
training data (P(D|h) =1 if D and h are consistent, and 0 otherwise).

Example:
 FIND-S outputs a consistent hypothesis, it will output a MAP hypothesis under the
probability distributions P(h) and P(D|h) defined above.
 Are there other probability distributions for P(h) and P(D|h) under which FIND-S
outputs MAP hypotheses? Yes.
 Because FIND-S outputs a maximally specific hypothesis from the version space, its
output hypothesis will be a MAP hypothesis relative to any prior probability distribution
that favours more specific hypotheses.

Note
 Bayesian framework is a way to characterize the behaviour of learning algorithms
 By identifying probability distributions P(h) and P(D|h) under which the output is an optimal hypothesis, the implicit assumptions of the algorithm can be characterized (its Inductive Bias)
 Inductive inference is modelled by an equivalent probabilistic reasoning system based
on Bayes theorem


MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR HYPOTHESES

Consider the problem of learning a continuous-valued target function such as neural network
learning, linear regression, and polynomial curve fitting

A straightforward Bayesian analysis will show that under certain assumptions any learning
algorithm that minimizes the squared error between the output hypothesis predictions and the
training data will output a maximum likelihood (ML) hypothesis

 Learner L considers an instance space X and a hypothesis space H consisting of some


class of real-valued functions defined over X, i.e., (∀ h ∈ H)[ h : X → R] and training
examples of the form <xi,di>
 The problem faced by L is to learn an unknown target function f : X → R
 A set of m training examples is provided, where the target value of each example is
corrupted by random noise drawn according to a Normal probability distribution with
zero mean (di = f(xi) + ei)
 Each training example is a pair of the form (xi ,di ) where di = f (xi ) + ei .
– Here f(xi) is the noise-free value of the target function and ei is a random variable
representing the noise.
– It is assumed that the values of the ei are drawn independently and that they are
distributed according to a Normal distribution with zero mean.
 The task of the learner is to output a maximum likelihood hypothesis or a MAP
hypothesis assuming all hypotheses are equally probable a priori.

Using the definition of hML we have

hML = argmax h∈H p(D|h)

Assuming the training examples are mutually independent given h, we can write P(D|h) as the product of the various p(di|h):

hML = argmax h∈H ∏ i=1..m p(di|h)

Given that the noise ei obeys a Normal distribution with zero mean and unknown variance σ², each di must also obey a Normal distribution with variance σ² centered around the true target value f(xi). Because we are writing the expression for P(D|h), we assume h is the correct description of f, hence µ = f(xi) = h(xi), and

hML = argmax h∈H ∏ i=1..m (1/√(2πσ²)) e^(−(di − h(xi))² / (2σ²))


We maximize the less complicated logarithm instead, which is justified because of the monotonicity of the logarithm:

hML = argmax h∈H Σ i=1..m [ ln(1/√(2πσ²)) − (di − h(xi))² / (2σ²) ]

The first term in this expression is a constant independent of h, and can therefore be discarded, yielding

hML = argmax h∈H Σ i=1..m − (di − h(xi))² / (2σ²)

Maximizing this negative quantity is equivalent to minimizing the corresponding positive quantity:

hML = argmin h∈H Σ i=1..m (di − h(xi))² / (2σ²)

Finally, we can discard constants that are independent of h:

hML = argmin h∈H Σ i=1..m (di − h(xi))²

Thus, this equation shows that the maximum likelihood hypothesis hML is the one that minimizes the sum of the squared errors between the observed training values di and the hypothesis predictions h(xi).

Note:
Why is it reasonable to choose the Normal distribution to characterize noise?
 Good approximation of many types of noise in physical systems
 Central Limit Theorem shows that the sum of a sufficiently large number of
independent, identically distributed random variables itself obeys a Normal distribution
Only noise in the target value is considered, not in the attributes describing the instances
themselves


MAXIMUM LIKELIHOOD HYPOTHESES FOR PREDICTING PROBABILITIES

 Consider the setting in which we wish to learn a nondeterministic (probabilistic)


function f : X → {0, 1}, which has two discrete output values.
 We want a function approximator whose output is the probability that f(x) = 1. In other
words, learn the target function f ` : X → [0, 1] such that f ` (x) = P(f(x) = 1)

How can we learn f ` using a neural network?


 A brute-force way would be to first collect the observed frequencies of 1's and 0's for each possible value of x and then train the neural network to output the target frequency for each x.

What criterion should we optimize in order to find a maximum likelihood hypothesis for f' in
this setting?
 First obtain an expression for P(D|h)
 Assume the training data D is of the form D = {(x1, d1) . . . (xm, dm)}, where di is the observed 0 or 1 value for f(xi).
 Treating both xi and di as random variables, and assuming that each training example is drawn independently, we can write P(D|h) as

P(D|h) = ∏ i=1..m P(xi, di | h)

Applying the product rule:

P(D|h) = ∏ i=1..m P(di | h, xi) P(xi)

The probability P(di | h, xi) is h(xi) if di = 1 and (1 − h(xi)) if di = 0, since h(xi) is the probability that f(xi) = 1. We can re-express this in a more mathematically manipulable form as

P(di | h, xi) = h(xi)^di (1 − h(xi))^(1−di)

Substituting this expression for P(di | h, xi) into the product above, we obtain

P(D|h) = ∏ i=1..m h(xi)^di (1 − h(xi))^(1−di) P(xi)

We can now write an expression for the maximum likelihood hypothesis:

hML = argmax h∈H ∏ i=1..m h(xi)^di (1 − h(xi))^(1−di) P(xi)


The last term, P(xi), is a constant independent of h, so it can be dropped:

hML = argmax h∈H ∏ i=1..m h(xi)^di (1 − h(xi))^(1−di)

It is easier to work with the log of the likelihood, yielding

hML = argmax h∈H Σ i=1..m [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]

This last expression, which we denote G(h, D), describes the quantity that must be maximized in order to obtain the maximum likelihood hypothesis in our current problem setting.

Gradient Search to Maximize Likelihood in a Neural Net

 Derive a weight-training rule for neural network learning that seeks to maximize G(h,D)
using gradient ascent
 The gradient of G(h,D) is given by the vector of partial derivatives of G(h,D) with
respect to the various network weights that define the hypothesis h represented by the
learned network
 In this case, the partial derivative of G(h, D) with respect to weight wjk from input k to unit j is

∂G(h,D)/∂wjk = Σ i=1..m [ (di − h(xi)) / (h(xi)(1 − h(xi))) ] · ∂h(xi)/∂wjk

 Suppose our neural network is constructed from a single layer of sigmoid units. Then

∂h(xi)/∂wjk = σ′(xi) xijk = h(xi)(1 − h(xi)) xijk

where xijk is the kth input to unit j for the ith training example, and σ′ is the derivative of the sigmoid squashing function.

 Finally, substituting this expression into the first equation, we obtain a simple expression for the derivatives that constitute the gradient:

∂G(h,D)/∂wjk = Σ i=1..m (di − h(xi)) xijk


Because we seek to maximize rather than minimize P(D|h), we perform gradient ascent rather than gradient descent search. On each iteration of the search the weight vector is adjusted in the direction of the gradient, using the weight update rule

wjk ← wjk + Δwjk,  where Δwjk = η Σ i=1..m (di − h(xi)) xijk

and where η is a small positive constant that determines the step size of the gradient ascent search.

MINIMUM DESCRIPTION LENGTH PRINCIPLE

 A Bayesian perspective on Occam’s razor


 Motivated by interpreting the definition of hMAP in the light of basic concepts from information theory:

hMAP = argmax h∈H P(D|h) P(h)

which can be equivalently expressed in terms of maximizing the log2:

hMAP = argmax h∈H [ log2 P(D|h) + log2 P(h) ]

or alternatively, minimizing the negative of this quantity:

hMAP = argmin h∈H [ −log2 P(D|h) − log2 P(h) ]     (1)

This equation (1) can be interpreted as a statement that short hypotheses are preferred,
assuming a particular representation scheme for encoding hypotheses and data

 -log2P(h): the description length of h under the optimal encoding for the hypothesis
space H, LCH (h) = −log2P(h), where CH is the optimal code for hypothesis space H.
 -log2P(D | h): the description length of the training data D given hypothesis h, under the
optimal encoding from the hypothesis space H: LCH (D|h) = −log2P(D| h) , where C D|h
is the optimal code for describing data D assuming that both the sender and receiver
know the hypothesis h.
 Rewriting Equation (1) shows that hMAP is the hypothesis h that minimizes the sum given by the description length of the hypothesis plus the description length of the data given the hypothesis:

hMAP = argmin h∈H [ LCH(h) + LCD|h(D|h) ]

where CH and CD|h are the optimal encodings for H and for D given h, respectively.


The Minimum Description Length (MDL) principle recommends choosing the hypothesis that minimizes the sum of these two description lengths.

Minimum Description Length principle: choose hMDL where

hMDL = argmin h∈H [ LC1(h) + LC2(D|h) ]

and where codes C1 and C2 represent the hypothesis and the data given the hypothesis, respectively.

The above analysis shows that if we choose C1 to be the optimal encoding of hypotheses CH,
and if we choose C2 to be the optimal encoding CD|h, then hMDL = hMAP

Application to Decision Tree Learning

Apply the MDL principle to the problem of learning decision trees from some training data.
What should we choose for the representations C1 and C2 of hypotheses and data?
 For C1: C1 might be some obvious encoding, in which the description length grows with
the number of nodes and with the number of edges
 For C2: Suppose that the sequence of instances (x1 . . .xm) is already known to both the
transmitter and receiver, so that we need only transmit the classifications (f (x1) . . . f
(xm)).
 Now if the training classifications (f(x1) . . . f(xm)) are identical to the predictions of the hypothesis, then there is no need to transmit any information about these examples: the description length of the classifications given the hypothesis is ZERO.
 If examples are misclassified by h, then for each misclassification we need to transmit
a message that identifies which example is misclassified as well as its correct
classification
 The hypothesis hMDL under the encoding C1 and C2 is just the one that minimizes the
sum of these description lengths.


NAIVE BAYES CLASSIFIER

 The naive Bayes classifier applies to learning tasks where each instance x is described
by a conjunction of attribute values and where the target function f (x) can take on any
value from some finite set V.
 A set of training examples of the target function is provided, and a new instance is
presented, described by the tuple of attribute values (al, a2.. .a m).
 The learner is asked to predict the target value, or classification, for this new instance.

The Bayesian approach to classifying the new instance is to assign the most probable target value, vMAP, given the attribute values (a1, a2 . . . am) that describe the instance:

vMAP = argmax vj∈V P(vj | a1, a2 . . . am)

Using Bayes theorem, we can rewrite this expression as

vMAP = argmax vj∈V P(a1, a2 . . . am | vj) P(vj) / P(a1, a2 . . . am) = argmax vj∈V P(a1, a2 . . . am | vj) P(vj)     (1)

 The naive Bayes classifier is based on the assumption that the attribute values are conditionally independent given the target value. This means the assumption is that, given the target value of the instance, the probability of observing the conjunction (a1, a2 . . . am) is just the product of the probabilities of the individual attributes:

P(a1, a2 . . . am | vj) = ∏ i P(ai | vj)

Substituting this into Equation (1) gives the naive Bayes classifier:

vNB = argmax vj∈V P(vj) ∏ i P(ai | vj)     (2)

where vNB denotes the target value output by the naive Bayes classifier.


An Illustrative Example
 Let us apply the naive Bayes classifier to a concept learning problem i.e., classifying
days according to whether someone will play tennis.
 The below table provides a set of 14 training examples of the target concept PlayTennis,
where each day is described by the attributes Outlook, Temperature, Humidity, and
Wind

Day Outlook Temperature Humidity Wind PlayTennis


D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

 Use the naive Bayes classifier and the training data from this table to classify the
following novel instance:
< Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong >

 Our task is to predict the target value (yes or no) of the target concept PlayTennis for
this new instance


The probabilities of the different target values can easily be estimated based on their
frequencies over the 14 training examples
 P(P1ayTennis = yes) = 9/14 = 0.64
 P(P1ayTennis = no) = 5/14 = 0.36

Similarly, estimate the conditional probabilities. For example, those for Wind = strong
 P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
 P(Wind = strong | PlayTennis = no) = 3/5 = 0.60

Calculating vNB according to Equation (2):

P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = (9/14)(2/9)(3/9)(3/9)(3/9) ≈ 0.0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = (5/14)(3/5)(1/5)(4/5)(3/5) ≈ 0.0206

Thus, the naive Bayes classifier assigns the target value PlayTennis = no to this new instance, based on the probability estimates learned from the training data.

By normalizing the above quantities to sum to one, we can calculate the conditional probability that the target value is no, given the observed attribute values: 0.0206 / (0.0206 + 0.0053) ≈ 0.795. A minimal sketch of this calculation in Python follows.
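# A minimal sketch reproducing the PlayTennis prediction above; all
# probabilities are the frequency estimates from the 14 training examples.
priors = {'yes': 9/14, 'no': 5/14}
cond = {
    'yes': {'sunny': 2/9, 'cool': 3/9, 'high': 3/9, 'strong': 3/9},
    'no':  {'sunny': 3/5, 'cool': 1/5, 'high': 4/5, 'strong': 3/5},
}

instance = ['sunny', 'cool', 'high', 'strong']
scores = {}
for v in priors:
    score = priors[v]
    for a in instance:
        score *= cond[v][a]   # naive conditional independence assumption
    scores[v] = score

v_nb = max(scores, key=scores.get)          # 'no'
p_no = scores['no'] / sum(scores.values())  # ~0.795 after normalizing
print(v_nb, round(p_no, 3))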

Estimating Probabilities

 We have estimated probabilities by the fraction of times the event is observed to occur
over the total number of opportunities.
 For example, in the above case we estimated P(Wind = strong | Play Tennis = no) by
the fraction nc /n where, n = 5 is the total number of training examples for which
PlayTennis = no, and nc = 3 is the number of these for which Wind = strong.
 When nc = 0, nc/n will be zero, and this probability term will dominate the quantity calculated in Equation (2), since calculating vNB requires multiplying all the other probability terms by this zero value.
 To avoid this difficulty we can adopt a Bayesian approach to estimating the probability, using the m-estimate defined as follows.

m-estimate of probability:

(nc + m·p) / (n + m)


 p is our prior estimate of the probability we wish to determine, and m is a constant


called the equivalent sample size, which determines how heavily to weight p relative
to the observed data
 Method for choosing p in the absence of other information is to assume uniform
priors; that is, if an attribute has k possible values we set p = 1 /k.
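A minimal sketch of the m-estimate, reusing the Wind = strong numbers from the example above:

# m-estimate: nc observed successes out of n, smoothed toward a prior p
# with "equivalent sample size" m (m virtual examples distributed as p).
def m_estimate(nc, n, p, m):
    return (nc + m * p) / (n + m)

# Wind = strong given PlayTennis = no: nc = 3, n = 5; uniform prior
# p = 1/2 (two possible values) with m = 4 virtual examples.
print(m_estimate(3, 5, 0.5, 4))   # 0.556 instead of the raw 0.600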

BAYESIAN BELIEF NETWORKS

 The naive Bayes classifier makes significant use of the assumption that the values of the
attributes a1 . . .an are conditionally independent given the target value v.
 This assumption dramatically reduces the complexity of learning the target function

A Bayesian belief network describes the probability distribution governing a set of variables
by specifying a set of conditional independence assumptions along with a set of conditional
probabilities
Bayesian belief networks allow stating conditional independence assumptions that apply to
subsets of the variables

Notation
 Consider an arbitrary set of random variables Y1 . . . Yn , where each variable Yi can
take on the set of possible values V(Yi).
 The joint space of the set of variables Y is the cross product V(Y1) × V(Y2) × . . . × V(Yn).
 In other words, each item in the joint space corresponds to one of the possible assignments of values to the tuple of variables (Y1 . . . Yn). The probability distribution over this joint space is called the joint probability distribution.
 The joint probability distribution specifies the probability for each of the possible
variable bindings for the tuple (Y1 . . . Yn).
 A Bayesian belief network describes the joint probability distribution for a set of
variables.

Conditional Independence

Let X, Y, and Z be three discrete-valued random variables. X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value for Z; that is, if

(∀ xi, yj, zk) P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)

where xi ∈ V(X), yj ∈ V(Y), and zk ∈ V(Z).


The above expression is written in abbreviated form as


P(X | Y, Z) = P(X | Z)

Conditional independence can be extended to sets of variables. The set of variables X1 . . . Xl is conditionally independent of the set of variables Y1 . . . Ym given the set of variables Z1 . . . Zn if

P(X1 . . . Xl | Y1 . . . Ym, Z1 . . . Zn) = P(X1 . . . Xl | Z1 . . . Zn)

The naive Bayes classifier assumes that the instance attribute A1 is conditionally independent of instance attribute A2 given the target value V. This allows the naive Bayes classifier to calculate P(A1, A2 | V) as follows:

P(A1, A2 | V) = P(A1 | A2, V) P(A2 | V) = P(A1 | V) P(A2 | V)

Representation

A Bayesian belief network represents the joint probability distribution for a set of variables.
Bayesian networks (BN) are represented by directed acyclic graphs.

The Bayesian network in the above figure represents the joint probability distribution over the boolean variables Storm, Lightning, Thunder, ForestFire, Campfire, and BusTourGroup.

A Bayesian network (BN) represents the joint probability distribution by specifying a set of
conditional independence assumptions
 BN represented by a directed acyclic graph, together with sets of local conditional
probabilities
 Each variable in the joint space is represented by a node in the Bayesian network
 The network arcs represent the assertion that the variable is conditionally independent
of its non-descendants in the network given its immediate predecessors in the network.
 A conditional probability table (CPT) is given for each variable, describing the
probability distribution for that variable given the values of its immediate predecessors


The joint probability for any desired assignment of values (y1, . . . , yn) to the tuple of network variables (Y1 . . . Yn) can be computed by the formula

P(y1, . . . , yn) = ∏ i=1..n P(yi | Parents(Yi))

where Parents(Yi) denotes the set of immediate predecessors of Yi in the network.

Example:
Consider the node Campfire. The network nodes and arcs represent the assertion that Campfire
is conditionally independent of its non-descendants Lightning and Thunder, given its
immediate parents Storm and BusTourGroup.

This means that once we know the value of the variables Storm and BusTourGroup, the
variables Lightning and Thunder provide no additional information about Campfire
The conditional probability table associated with the variable Campfire. The assertion is

P(Campfire = True | Storm = True, BusTourGroup = True) = 0.4
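A minimal sketch of the factored joint probability for a fragment of this network; note that only P(Campfire = True | Storm = True, BusTourGroup = True) = 0.4 comes from the notes, while all other CPT numbers below are hypothetical placeholders:

# Hypothetical CPTs for Storm, BusTourGroup, and Campfire.
p_storm = {True: 0.2, False: 0.8}              # hypothetical
p_bus = {True: 0.5, False: 0.5}                # hypothetical
p_campfire = {(True, True): 0.4,               # value given in the notes
              (True, False): 0.1,              # hypothetical
              (False, True): 0.8,              # hypothetical
              (False, False): 0.2}             # hypothetical

# P(Storm=T, BusTourGroup=T, Campfire=T) = P(S) P(B) P(C | S, B)
joint = p_storm[True] * p_bus[True] * p_campfire[(True, True)]
print(joint)   # 0.2 * 0.5 * 0.4 = 0.04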

Inference

 Use a Bayesian network to infer the value of some target variable (e.g., ForestFire) given
the observed values of the other variables.
 Inference can be straightforward if values for all of the other variables in the network
are known exactly.
 A Bayesian network can be used to compute the probability distribution for any subset
of network variables given the values or distributions for any subset of the remaining
variables.
 Exact inference in an arbitrary Bayesian network is known to be NP-hard


Learning Bayesian Belief Networks

Effective algorithms can be devised for learning Bayesian belief networks from training data by considering several different settings for the learning problem:
 First, the network structure might be given in advance, or it might have to be inferred from
the training data.
 Second, all the network variables might be directly observable in each training example,
or some might be unobservable.
 In the case where the network structure is given in advance and the variables are fully
observable in the training examples, learning the conditional probability tables is
straightforward and estimate the conditional probability table entries
 In the case where the network structure is given but only some of the variable values
are observable in the training data, the learning problem is more difficult. The learning
problem can be compared to learning weights for an ANN.

Gradient Ascent Training of Bayesian Network

The gradient ascent rule which maximizes P(D|h) by following the gradient of ln P(D|h) with
respect to the parameters that define the conditional probability tables of the Bayesian network.

Let wijk denote a single entry in one of the conditional probability tables. In particular, let wijk denote the conditional probability that the network variable Yi will take on the value yij, given that its immediate parents Ui take on the values given by uik.

The gradient of ln P(D|h) is given by the derivatives for each of the wijk. Each of these derivatives can be calculated as

∂ ln P(D|h) / ∂wijk = Σ d∈D Ph(Yi = yij, Ui = uik | d) / wijk

To derive this, consider the gradient defined by the set of derivatives for all i, j, and k. Assuming the training examples d in the data set D are drawn independently, this derivative decomposes into a sum of per-example terms.


We write the abbreviation Ph(D) to represent P(D|h).


THE EM ALGORITHM

The EM algorithm can be used even for variables whose value is never directly observed,
provided the general form of the probability distribution governing these variables is known.

Estimating Means of k Gaussians

 Consider a problem in which the data D is a set of instances generated by a probability


distribution that is a mixture of k distinct Normal distributions.

 This problem setting is illustrated in Figure for the case where k = 2 and where the
instances are the points shown along the x axis.
 Each instance is generated using a two-step process.
 First, one of the k Normal distributions is selected at random.
 Second, a single random instance xi is generated according to this selected
distribution.
 This process is repeated to generate a set of data points as shown in the figure.


 To simplify, consider the special case


 The selection of the single Normal distribution at each step is based on choosing
each with uniform probability
 Each of the k Normal distributions has the same variance σ2, known value.
 The learning task is to output a hypothesis h = (μ1 , . . . ,μk) that describes the means of
each of the k distributions.
 We would like to find a maximum likelihood hypothesis for these means; that is, a
hypothesis h that maximizes p(D |h).

In this case, the sum of squared errors is minimized by the sample mean:

µML = argmin µ Σ i=1..m (xi − µ)² = (1/m) Σ i=1..m xi

 Our problem here, however, involves a mixture of k different Normal distributions, and
we cannot observe which instances were generated by which distribution.
 Consider full description of each instance as the triple (xi, zi1, zi2),
 where xi is the observed value of the ith instance and
 where zi1 and zi2 indicate which of the two Normal distributions was used to
generate the value xi
 In particular, zij has the value 1 if xi was created by the j th Normal distribution and 0
otherwise.
 Here xi is the observed variable in the description of the instance, and zil and zi2 are
hidden variables.
 If the values of zi1 and zi2 were observed, we could use the above equation to solve for the means µ1 and µ2
 Because they are not, we will instead use the EM algorithm

EM algorithm

Applied to the problem above, the EM algorithm searches for a maximum likelihood hypothesis by repeatedly re-estimating the expected values of the hidden variables zij given its current hypothesis, then recalculating the maximum likelihood hypothesis using these expected values. Beginning with an arbitrary initial hypothesis h = (µ1, µ2), it iterates two steps:

Step 1 (Estimation): Calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = (µ1, µ2) holds:

E[zij] = p(x = xi | µ = µj) / Σ n=1..2 p(x = xi | µ = µn) = e^(−(xi − µj)² / (2σ²)) / Σ n=1..2 e^(−(xi − µn)² / (2σ²))

Step 2 (Maximization): Calculate a new maximum likelihood hypothesis h′ = (µ1′, µ2′), assuming the value taken on by each hidden variable zij is the expected value E[zij] calculated in Step 1:

µj ← Σ i=1..m E[zij] xi / Σ i=1..m E[zij]

Then replace h by h′ and iterate until the procedure converges to values for the means that no longer change.
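A minimal sketch of these two steps in numpy, for a mixture of two 1-D Gaussians with known, shared variance (the data and starting point are illustrative):

# EM for a mixture of two 1-D Gaussians, estimating only the means.
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
x = np.concatenate([rng.normal(0, sigma, 100), rng.normal(4, sigma, 100)])

mu = np.array([-1.0, 1.0])                      # arbitrary initial hypothesis
for _ in range(50):
    # E-step: expected value of z_ij under the current means.
    dens = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
    e_z = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted sample means using the expectations.
    mu = (e_z * x[:, None]).sum(axis=0) / e_z.sum(axis=0)

print(mu)   # approaches the true means (0, 4)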


Unit IV – Parametric Machine Learning.

Logistic Regression: Classification and representation

Introduction to Logistic Regression:

Logistic regression is a supervised learning classification algorithm used to predict the probability of a target
variable. The nature of target or dependent variable is dichotomous, which means there would be only two
possible classes.

In simple words, the dependent variable is binary in nature having data coded as either 1 (stands for
success/yes) or 0 (stands for failure/no).

Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest ML
algorithms that can be used for various classification problems such as spam detection, Diabetes prediction,
cancer detection etc.

Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a
categorical or discrete value. It can be either Yes or No, 0 or 1, true or False, etc. but instead of giving the
exact value as 0 and 1, it gives the probabilistic values which lie between 0 and 1.

Logistic Regression is much like Linear Regression except in how it is used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.

In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function, which predicts
two maximum values (0 or 1).
The curve from the logistic function indicates the likelihood of something such as whether the cells are
cancerous or not, a mouse is obese or not based on its weight, etc.

Logistic Regression is a significant machine learning algorithm because it has the ability to provide
probabilities and classify new data using continuous and discrete datasets.

Logistic Regression can be used to classify observations using different types of data and can easily determine the most effective variables for the classification. The logistic (sigmoid) function is described below.

Logistic Function (Sigmoid Function):


1. The sigmoid function is a mathematical function used to map the predicted values to probabilities.
2. It maps any real value into another value within a range of 0 and 1.
3. The value of the logistic regression must be between 0 and 1, which cannot go beyond this limit, so it
forms a curve like the "S" form. The S-form curve is called the sigmoid function or the logistic
function.
4. In logistic regression, we use the concept of the threshold value, which defines the probability of either
0 or 1. Such as values above the threshold value tends to 1, and a value below the threshold values
tends to 0.
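A minimal sketch of the sigmoid function and a 0.5 decision threshold, assuming numpy:

# The sigmoid maps any real value into (0, 1); thresholding yields a class.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
p = sigmoid(z)
print(p)                          # probabilities between 0 and 1
print((p >= 0.5).astype(int))     # class labels after thresholding at 0.5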

Assumptions for Logistic Regression:


1. The dependent variable must be categorical in nature.
2. The independent variable should not have multi-collinearity.

Logistic Regression Equation:


The Logistic regression equation can be obtained from the Linear Regression equation. The mathematical
steps to get Logistic Regression equations are given below:
1. We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + . . . + bnxn

2. In Logistic Regression, y can be between 0 and 1 only, so let's divide the above equation by (1 − y):

y / (1 − y);  0 for y = 0, and infinity for y = 1

3. But we need a range between −[infinity] and +[infinity]; taking the logarithm of the equation, it becomes:

log[y / (1 − y)] = b0 + b1x1 + b2x2 + . . . + bnxn

The above equation is the final equation for Logistic Regression.

Type of Logistic Regression:


On the basis of the categories, Logistic Regression can be classified into three types:

 Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.
 Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of
the dependent variable, such as "cat", "dogs", or "sheep"
 Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent
variables, such as "low", "Medium", or "High".

Steps in Logistic Regression:

To implement the Logistic Regression using Python, we will use the same steps as we have done in previous
topics of Regression. Below are the steps:

1. Data Pre-processing step


2. Fitting Logistic Regression to the Training set
3. Predicting the test result
4. Test accuracy of the result(Creation of Confusion matrix)
5. Visualizing the test set result.

1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we can use it in our
code efficiently. It will be the same as we have done in Data pre-processing topic. The code for this is given
below:

# Data Pre-processing Step


# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

2. Fitting Logistic Regression to the Training set:



We have well prepared our dataset, and now we will train the dataset using the training set. For providing
training or fitting the model to the training set, we will import the LogisticRegression class of
the sklearn library.
After importing the class, we will create a classifier object and use it to fit the model to the logistic regression.
Below is the code for it:

#Fitting Logistic Regression to the training set


from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)
# Output:
# LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
#     intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='warn', n_jobs=None,
#     penalty='l2', random_state=0, solver='warn', tol=0.0001, verbose=0, warm_start=False)

3. Predicting the Test Result


Our model is well trained on the training set, so we will now predict the result by using test set data. Below is
the code for it:
# Predicting the test set result
y_pred= classifier.predict(x_test)

4. Test Accuracy of the result:

Now we will create the confusion matrix to check the accuracy of the classification. To create it, we need to import the confusion_matrix function of the sklearn library. After importing the function, we will call it using a new variable cm. The function takes two parameters, mainly y_true (the actual values) and y_pred (the predicted values returned by the classifier). Below is the code for it:

#Creating the Confusion matrix


from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

We can find the accuracy of the predicted result by interpreting the confusion matrix. From the above output, we can interpret that 65 + 24 = 89 predictions are correct and 8 + 3 = 11 predictions are incorrect.

5. Visualizing the training set result:


Finally, we will visualize the training set result. To visualize the result, we will use ListedColormap class of
matplotlib library. Below is the code for it:

#Visualizing the training set result

from matplotlib.colors import ListedColormap


x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step =0.01),
nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
alpha = 0.75, cmap = ListedColormap(('purple','green' )))

mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

In the above code, we have imported the ListedColormap class of the Matplotlib library to create the colormap for visualizing the result. We have created two new variables x_set and y_set to replace x_train and y_train. After that, we have used the nm.meshgrid command to create a rectangular grid, which ranges from each feature's minimum − 1 to its maximum + 1, with pixel points taken at a resolution of 0.01.
To create a filled contour, we have used mtp.contourf command, it will create regions of provided colors
(purple and green). In this function, we have passed the classifier.predict to show the predicted data points
predicted by the classifier.

Output: By executing the above code, we will get the below output:

The graph can be explained in the below points:

 In the above graph, we can see that there are some Green points within the green region and Purple
points within the purple region.
 All these data points are the observation points from the training set, which shows the result for
purchased variables.
 This graph is made by using two independent variables i.e., Age on the x-axis and Estimated salary
on the y-axis.
 The purple point observations are for which purchased (dependent variable) is probably 0, i.e., users
who did not purchase the SUV car.
 The green point observations are for which purchased (dependent variable) is probably 1 means user
who purchased the SUV car.
 We can also estimate from the graph that the users who are younger with low salary, did not purchase
the car, whereas older users with high estimated salary purchased the car.

 But there are some purple points in the green region (Buying the car) and some green points in the
purple region (Not buying the car). So we can say that younger users with a high estimated salary
purchased the car, whereas an older user with a low estimated salary did not purchase the car.

Cost Function in Machine Learning

A Machine Learning model should have a very high level of accuracy in order to perform well with real-world
applications. But how to calculate the accuracy of the model, i.e., how good or poor our model will perform
in the real world? In such a case, the Cost function comes into existence. It is an important machine learning
parameter to correctly estimate the model.

Cost function also plays a crucial role in understanding that how well your model estimates the relationship
between the input and output parameters.

In this topic, we will explain the cost function in Machine Learning, Gradient descent, and types of cost
functions.

What is Cost Function?

A cost function is an important parameter that determines how well a machine learning model performs for a
given dataset. It calculates the difference between the expected value and predicted value and represents it as
a single real number.

In machine learning, once we train our model, we want to see how well it is performing. Although there are various accuracy functions that tell you how your model is performing, they will not give insights on how to improve it. So, we need a function that can find when the model is most accurate, by finding the spot between the undertrained and overtrained model.

In simple, "Cost function is a measure of how wrong the model is in estimating the relationship between
X(input) and Y(output) Parameter." A cost function is sometimes also referred to as Loss function, and it can
be estimated by iteratively running the model to compare estimated predictions against the known values of
Y.

The main aim of each ML model is to determine parameters or weights that can minimize the cost function.

Types of Cost Function

Cost functions can be of various types depending on the problem. However, mainly it is of three types, which
are as follows:
1. Regression Cost Function
2. Binary Classification cost Functions
3. Multi-class Classification Cost Function.

1. Regression Cost Function


Regression models are used to make a prediction for the continuous variables such as the price of houses,
weather prediction, loan predictions, etc. When a cost function is used with Regression, it is known as the
"Regression Cost Function." In this, the cost function is calculated as the error based on the distance, such as:

Error= Actual Output-Predicted output

There are three commonly used Regression cost functions, which are as follows:

a. Mean Error
In this type of cost function, the error is calculated for each training example, and then the mean of all the error values is taken.
It is one of the simplest approaches possible.

The errors that occurred from the training data can be either negative or positive. While finding mean, they
can cancel out each other and result in the zero-mean error for the model, so it is not recommended cost
function for a model.

However, it provides a base for other cost functions of regression models.

b. Mean Squared Error (MSE)


Means Square error is one of the most commonly used Cost function methods. It improves the drawbacks of
the Mean error cost function, as it calculates the square of the difference between the actual value and predicted
value. Because of the square of the difference, it avoids any possibility of negative error.
The formula for calculating MSE is given below:

MSE = (1/N) Σ i=1..N (Yi − Ŷi)²

Mean squared error is also known as L2 Loss.


In MSE, each error is squared, and it helps in reducing a small deviation in prediction as compared to MAE.
But if the dataset has outliers that generate more prediction errors, then squaring of this error will further
increase the error multiple times. Hence, we can say MSE is less robust to outliers.

c. Mean Absolute Error (MAE)


Mean Absolute error also overcome the issue of the Mean error cost function by taking the absolute difference
between the actual value and predicted value.
The formula for calculating Mean Absolute Error is given below:

MAE = (1/N) Σ i=1..N |Yi − Ŷi|

The Mean Absolute Error cost function is also known as L1 Loss. It is not affected by noise or outliers, and hence gives better results if the dataset has noise or outliers.
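A minimal sketch of both regression cost functions, assuming numpy and toy values:

# MSE (L2) squares each error; MAE (L1) takes absolute errors and is
# therefore more robust to outliers.
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))
print(mse, mae)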

2. Binary Classification Cost Functions


Classification models are used to make predictions of categorical variables, such as predictions for 0 or 1, Cat
or dog, etc. The cost function used in the classification problem is known as the Classification cost function.
However, the classification cost function is different from the Regression cost function.
One of the commonly used loss functions for classification is cross-entropy loss.
The binary cost function is a special case of categorical cross-entropy, where there are only two output classes.
For example, classification between red and blue.
To better understand it, let's suppose there is only a single output variable Y, and let p be the predicted probability that Y = 1:

Cross-entropy(D) = - y*log(p) when y = 1

Cross-entropy(D) = - (1-y)*log(1-p) when y = 0

The error in binary classification is calculated as the mean of the cross-entropy for all N training examples, which
means:
Binary Cross-Entropy = (Sum of Cross-Entropy for N data)/N
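
A minimal NumPy sketch of binary cross-entropy, assuming y_true holds the 0/1 labels and p_pred the predicted probabilities (both hypothetical values, for illustration); probabilities are clipped to avoid log(0):

import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Clip probabilities so log() never receives 0 or 1 exactly
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.7, 0.6])
print(binary_cross_entropy(y_true, p_pred))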

3. Multi-class Classification Cost Function


A multi-class classification cost function is used in classification problems where instances are
allocated to one of more than two classes. Here also, similar to the binary classification cost function, cross-
entropy, or categorical cross-entropy, is the commonly used cost function.
It is designed so that it can be used with multi-class classification, with target classes ranging over 0,
1, 2, …, n.

In a multi-class classification problem, cross-entropy will generate a score that summarizes the mean
difference between the actual and the anticipated probability distributions.
For a perfect model, the cross-entropy value is zero; the score is minimized towards zero.
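
A similar sketch for categorical cross-entropy, assuming one-hot encoded targets and a hypothetical matrix of predicted class probabilities:

import numpy as np

def categorical_cross_entropy(y_true, p_pred, eps=1e-12):
    # y_true: one-hot targets, p_pred: predicted class probabilities (rows sum to 1)
    p = np.clip(p_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(p), axis=1))

y_true = np.array([[1, 0, 0],
                   [0, 0, 1]])
p_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])
print(categorical_cross_entropy(y_true, p_pred))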

Gradient Descent in Machine Learning

Gradient Descent is known as one of the most commonly used optimization algorithms to train machine
learning models by means of minimizing errors between actual and expected results. Further, gradient descent
is also used to train Neural Networks.

In mathematical terminology, an optimization algorithm refers to the task of minimizing/maximizing an
objective function f(x) parameterized by x. Similarly, in machine learning, optimization is the task of
minimizing the cost function parameterized by the model's parameters. The main objective of gradient descent
is to minimize the convex function using iterative parameter updates. Once these machine learning models
are optimized, they can be used as powerful tools for Artificial Intelligence and various computer
science applications.

In this tutorial on Gradient Descent in Machine Learning, we will learn in detail about gradient descent, the
role of cost functions specifically as a barometer within Machine Learning, types of gradient descents, learning
rates, etc.

What is Gradient Descent or Steepest Descent?

Gradient descent was initially proposed by Augustin-Louis Cauchy in 1847, in the mid-19th century. Gradient
Descent is defined as one of the most commonly used iterative optimization algorithms of machine learning,
used to train machine learning and deep learning models. It helps in finding the local minimum of a function.

The best way to define the local minimum or local maximum of a function using gradient descent is as follows:

If we move towards a negative gradient or away from the gradient of the function at the current point, it will
give the local minimum of that function.
Whenever we move towards a positive gradient or towards the gradient of the function at the current point,
we will get the local maximum of that function.

Moving towards a positive gradient is known as Gradient Ascent, while moving towards the negative gradient
(Gradient Descent) is also known as steepest descent. The main
objective of using a gradient descent algorithm is to minimize the cost function using iteration. To achieve
this goal, it performs two steps iteratively:

Calculate the first-order derivative of the function to compute the gradient or slope of that function.
Move in the direction opposite to the gradient, i.e., step from the current point by alpha
times the gradient, where Alpha is defined as the Learning Rate. It is a tuning parameter in the optimization process which
helps to decide the length of the steps.

How does Gradient Descent work?

Before starting the working principle of gradient descent, we should know some basic concepts to find out the
slope of a line from linear regression. The equation for simple linear regression is given as:

Equation: Y = mX + c

Where 'm' represents the slope of the line, and 'c' represents the intercepts on the y-axis.

The starting point is just an arbitrary point used to evaluate the performance. At this starting point, we
derive the first derivative, or slope, and then use a tangent line to calculate the steepness of this slope.
This slope then informs the updates to the parameters (weights and bias).
The slope is steep at the starting point, but as new parameters are generated, the steepness gradually
reduces until it approaches zero at the lowest point, which is called the point of convergence.
The main objective of gradient descent is to minimize the cost function, i.e., the error between the expected and
actual output. To minimize the cost function, two factors are required:

1. Direction & Learning Rate

These two factors determine the partial derivative calculations of future iterations, allowing the algorithm to reach the
point of convergence, i.e., a local or global minimum. Let's discuss the learning rate factor in brief:

Learning Rate:
It is defined as the step size taken to reach the minimum or lowest point. This is typically a small value that is
evaluated and updated based on the behavior of the cost function. If the learning rate is high, it results in larger
steps but also leads to risks of overshooting the minimum. At the same time, a low learning rate shows the
small step sizes, which compromises overall efficiency but gives the advantage of more precision.

Types of Gradient Descent

Based on the error in various training models, the Gradient Descent learning algorithm can be divided
into Batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. Let's
understand these different types of gradient descent:

1. Batch Gradient Descent:

Batch gradient descent (BGD) is used to find the error for each point in the training set and update the model
after evaluating all training examples. This procedure is known as the training epoch. In simple words, it is a
greedy approach where we have to sum over all examples for each update.

Advantages of Batch gradient descent:


o It produces less noise in comparison to the other gradient descent variants.
o It produces stable gradient descent convergence.
o It is computationally efficient, as all resources are used across all training samples.

2. Stochastic gradient descent:

Stochastic gradient descent (SGD) is a type of gradient descent that processes one training example per iteration.
In other words, it updates the parameters for each training example within the dataset, one at a time.
As it requires only one training example at a time, it is easier to fit in allocated
memory. However, it loses some computational efficiency in comparison to batch
gradient descent, because its frequent updates cost more processing. Further, due to the frequent
updates, the result is also treated as a noisy gradient. However, sometimes this noise can be helpful in finding the global
minimum and escaping a local minimum.

Advantages of Stochastic gradient descent:


In Stochastic gradient descent (SGD), learning happens on every example, and it consists of a few advantages
over other gradient descent.
o It is easier to allocate in desired memory.
o It is relatively fast to compute than batch gradient descent.
o It is more efficient for large datasets.

3. Mini-Batch Gradient Descent:

Mini-batch gradient descent is the combination of both batch gradient descent and stochastic gradient descent.
It divides the training dataset into small batches and then performs updates on each of those batches separately.
Splitting the training dataset into smaller batches strikes a balance between the computational efficiency of
batch gradient descent and the speed of stochastic gradient descent. Hence, we can achieve a special type of
gradient descent with higher computational efficiency and a less noisy gradient. A short sketch of this variant follows the list of advantages below.

Advantages of Mini Batch gradient descent:


o It is easier to fit in allocated memory.
o It is computationally efficient.
o It produces stable gradient descent convergence.
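
To make the contrast with the other two variants concrete, here is a minimal sketch of mini-batch gradient descent fitting a straight line y = m*x + b. The batch size, learning rate, and epoch count are illustrative assumptions; the toy data is the same as in the example program later in this unit:

import numpy as np

def minibatch_gd(x, y, lr=0.02, epochs=2000, batch_size=2):
    # Fit y = m*x + b, updating on one small batch at a time
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        idx = np.random.permutation(n)            # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            xb, yb = x[batch], y[batch]
            y_hat = m * xb + b
            # Same gradient formulas as batch GD, but computed over the batch only
            m -= lr * (-2 / len(xb)) * np.sum(xb * (yb - y_hat))
            b -= lr * (-2 / len(xb)) * np.sum(yb - y_hat)
    return m, b

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([5, 7, 9, 11, 13], dtype=float)
print(minibatch_gd(x, y))   # approaches m = 2, b = 3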

Challenges with the Gradient Descent

Although we know Gradient Descent is one of the most popular methods for optimization problems, it still
also has some challenges. There are a few challenges as follows:

1. Local Minima and Saddle Point:


For convex problems, gradient descent can find the global minimum easily, while for non-convex problems
it is sometimes difficult to find the global minimum, where the machine learning model achieves the best
results.

Whenever the slope of the cost function is at or near zero, the model stops learning further. Apart
from the global minimum, the slope can also be zero at saddle points and local
minima. A local minimum has a shape similar to the global minimum: the slope of the cost
function increases on both sides of the current point.

In contrast, at a saddle point the negative gradient exists only on one side of the point: the function reaches a
local maximum on one side and a local minimum on the other. A saddle point takes its name from the shape of
a horse's saddle.
A local minimum is named so because the value of the loss function is minimum at that point within a local region.
In contrast, the global minimum is named so because the value of the loss function is minimum there
globally, across the entire domain of the loss function.

2. Vanishing and Exploding Gradient


In a deep neural network, if the model is trained with gradient descent and backpropagation, two more issues
can occur besides local minima and saddle points.

Vanishing Gradients:
A vanishing gradient occurs when the gradient is smaller than expected. During backpropagation, the gradient
becomes progressively smaller, causing the earlier layers of the network to learn more slowly than the later
layers. When this happens, the weight updates become insignificant and the earlier layers effectively stop learning.

Exploding Gradient:
An exploding gradient is just the opposite of a vanishing gradient: it occurs when the gradient is too large and
creates an unstable model. In this scenario, the model weights grow too large and may eventually be represented as NaN.
This problem can be addressed using dimensionality reduction techniques, which help to minimize complexity
within the model.

Example program:

import numpy as np

def gradient_descent(x, y):
    # Start with slope m and intercept b at zero
    m_curr = b_curr = 0
    iterations = 10000
    n = len(x)
    learning_rate = 0.08

    for i in range(iterations):
        y_predicted = m_curr * x + b_curr
        # Mean squared error cost for the current parameters
        cost = (1 / n) * sum([val ** 2 for val in (y - y_predicted)])
        # Partial derivatives of the cost with respect to m and b
        md = -(2 / n) * sum(x * (y - y_predicted))
        bd = -(2 / n) * sum(y - y_predicted)
        m_curr = m_curr - learning_rate * md
        b_curr = b_curr - learning_rate * bd
        print("m {}, b {}, cost {} iteration {}".format(m_curr, b_curr, cost, i))

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 7, 9, 11, 13])

gradient_descent(x, y)

Out[]:
m 4.96, b 1.44, cost 89.0 iteration 0
m 0.4991999999999983, b 0.26879999999999993, cost 71.10560000000002 iteration 1
m 4.451584000000002, b 1.426176000000001, cost 56.8297702400001 iteration 2
.
.
m 2.000000000000002, b 2.999999999999995, cost 1.0255191767873153e-29 iteration 9997
m 2.000000000000001, b 2.9999999999999947, cost 1.0255191767873153e-29 iteration 9998
m 2.000000000000002, b 2.999999999999995, cost 1.0255191767873153e-29 iteration 9999

Optimization in a Machine Learning


Machine learning optimization is the process of adjusting hyperparameters in order to minimize the cost
function by using one of the optimization techniques. It is important to minimize the cost function because it
describes the discrepancy between the true value of the estimated parameter and what the model has predicted.
Optimization plays an important part in a machine learning project in addition to fitting the learning algorithm
on the training dataset.
The step of preparing the data prior to fitting the model and the step of tuning a chosen model also can be
framed as an optimization problem. In fact, an entire predictive modeling project can be thought of as one
large optimization problem.

Parameters and hyperparameters of the model


Before we go any further, we need to understand the difference between the parameters and
hyperparameters of a model. These two notions are easy to confuse, but we ought not to confuse them.
You need to set hyperparameters before starting to train the model. They include the number of clusters,
the learning rate, etc. Hyperparameters describe the structure of the model.
On the other hand, the parameters of the model are obtained during training. There is no way to
get them in advance. Examples are the weights and biases of neural networks. This data is internal to the
model and changes based on the inputs.

To tune the model, we need hyperparameter optimization. By finding the optimal combination of their
values, we can decrease the error and build the most accurate model.

How hyperparameter tuning works:


As we said, the hyperparameters are set before training. But you can’t know in advance, for instance,
which learning rate (large or small) is best in this or that case. Therefore, to improve the model’s
performance, hyperparameters have to be optimized.

After each iteration, you compare the output with the expected results, assess the accuracy, and adjust the
hyperparameters if necessary. This is a repeated process. You can do it manually or use one of the
many optimization techniques, which come in handy when you work with large amounts of data.

Top optimization techniques in machine learning


Now let us talk about the techniques that you can use to optimize the hyperparameters of your model.

Exhaustive search
Exhaustive search, or brute-force search, is the process of looking for the most optimal
hyperparameters by checking whether each candidate is a good match. You perform the same thing
when you forget the code for your bike’s lock and try out all the possible options. In machine learning,
we do the same thing but the number of options is quite large, usually.

The exhaustive search method is simple. For example, if you are working with a k-means algorithm,
you will manually search for the right number of clusters. However, if there are hundreds and
thousands of options that you have to consider, it becomes unbearably heavy and slow. This makes
brute-force search inefficient in the majority of real-life cases.
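
In practice, exhaustive search over hyperparameters is often done with scikit-learn's GridSearchCV. A minimal sketch, using the Iris dataset and a hypothetical grid of n_neighbors values:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Every candidate value of n_neighbors is tried exhaustively with 5-fold CV
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)

With only five candidates this is cheap, but as the note above says, a grid with hundreds or thousands of combinations quickly becomes impractically slow.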

Gradient descent
Gradient descent is the most common algorithm for model optimization for minimizing the error. In
order to perform gradient descent, you have to iterate over the training dataset while re-adjusting the
model.

Your goal is to minimize the cost function because it means you get the smallest possible error and
improve the accuracy of the model.

Graphically, the gradient descent algorithm travels through the variable space. To get started, you take a
random point on the cost surface and arbitrarily choose a direction. If you see that the error is getting larger,
that means you chose the wrong direction.

When you are not able to improve (decrease the error) anymore, the optimization is over and you have
found a local minimum.

This looks fine so far. However, classical gradient descent will not work well when there are multiple
local minima. On finding your first minimum, you will simply stop searching, because the algorithm
only finds a local minimum; it is not made to find the global one.

Note: In gradient descent, you proceed forward with steps of the same size. If you choose a learning rate that
is too large, the algorithm will jump around without getting closer to the right answer. If it's too small,
the computation will start mimicking an exhaustive search, which is, of course, inefficient.

So you have to choose the learning rate very carefully. If done right, gradient descent becomes a computation-
efficient and rather quick method to optimize models.

Genetic algorithms
Genetic algorithms represent another approach to ML optimization. The principle that lays behind the
logic of these algorithms is an attempt to apply the theory of evolution to machine learning.

In the evolution theory, only those specimens get to survive and reproduce that have the best adaptation
mechanisms. How do you know what specimens are and aren’t the best in the case of machine learning
models?

Imagine you have a bunch of random algorithms at hand. This will be your population. Among multiple
models with some predefined hyperparameters, some are better adjusted than the others. Let’s find
them! First, you calculate the accuracy of each model. Then, you keep only those that worked out best.
Now you can generate some descendants with similar hyperparameters to the best models to get a
second generation of models.

We repeat this process many times and only the best models will survive at the end of the process.
Genetic algorithms help to avoid being stuck at local minima/maxima. They are common in optimizing
neural network models.

Regularization in Machine Learning


What is Regularization?
Regularization is one of the most important concepts of machine learning. It is a technique to prevent the
model from overfitting by adding extra information to it.
Sometimes the machine learning model performs well with the training data but does not perform well with
the test data. This means the model is not able to predict the output when it deals with unseen data, because it
has learned the noise in the training output; such a model is called overfitted. This problem can be dealt with
using a regularization technique.
This technique allows us to maintain all variables or features in the model while
reducing the magnitude of the variables. Hence, it maintains accuracy as well as the generalization of the
model.
It mainly regularizes or reduces the coefficients of features toward zero. In simple words, "in the regularization
technique, we reduce the magnitude of the features by keeping the same number of features."
How does Regularization Work?
Regularization works by adding a penalty or complexity term to the complex model. Let's consider the simple
linear regression equation:
y = β0 + β1x1 + β2x2 + β3x3 + ⋯ + βnxn + b
In the above equation, y represents the value to be predicted,
x1, x2, …, xn are the features for y, and
β0, β1, ….., βn are the weights or magnitudes attached to the features, respectively. Here β0 represents the bias of
the model, and b represents the intercept.
Linear regression models try to optimize the β values and b to minimize the cost function. The equation of the cost
function for the linear model is given below:

Cost = Σ (yi - ŷi)²

Now, we will add a loss function and optimize the parameters to make a model that can predict the accurate value
of Y. The loss function for linear regression is called RSS, or the Residual Sum of Squares.
Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:
o Ridge Regression
o Lasso Regression
Ridge Regression
o Ridge regression is one of the types of linear regression in which a small amount of bias is introduced
so that we can get better long-term predictions.

o Ridge regression is a regularization technique, which is used to reduce the complexity of the model. It
is also called L2 regularization.
o In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added
to the model is called the Ridge Regression penalty. We can calculate it by multiplying lambda
by the squared weight of each individual feature.
o The equation for the cost function in ridge regression will be:

Cost = Σ (yi - ŷi)² + λ Σ βj²

o In the above equation, the penalty term regularizes the coefficients of the model, and hence ridge
regression reduces the amplitudes of the coefficients that decreases the complexity of the model.
o As we can see from the above equation, if the values of λ tend to zero, the equation becomes the
cost function of the linear regression model. Hence, for the minimum value of λ, the model will
resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between the independent
variables, so to solve such problems, Ridge regression can be used.
o It helps to solve the problems if we have more parameters than samples.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model. It stands
for Least Absolute Shrinkage and Selection Operator.
o It is similar to the Ridge Regression except that the penalty term contains only the absolute weights
instead of a square of weights.
o Since it takes absolute values, hence, it can shrink the slope to 0, whereas Ridge Regression can only
shrink it near to 0.
o It is also called L1 regularization. The equation for the cost function of Lasso regression will be:

Cost = Σ (yi - ŷi)² + λ Σ |βj|
o Some of the features in this technique are completely neglected for model evaluation.
o Hence, Lasso regression can help us to reduce overfitting in the model as well as perform feature
selection (a scikit-learn sketch of both techniques follows this list).
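
Both techniques are available in scikit-learn, where the alpha parameter plays the role of λ. A minimal sketch on synthetic data (the dataset and alpha value are illustrative assumptions); note how Lasso drives some coefficients exactly to zero while Ridge only shrinks them:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10, noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients toward 0
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can set coefficients exactly to 0

print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)
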
Key Difference between Ridge Regression and Lasso Regression
o Ridge regression is mostly used to reduce the overfitting in the model, and it includes all the features
present in the model. It reduces the complexity of the model by shrinking the coefficients.
o Lasso regression helps to reduce the overfitting in the model as well as feature selection.

What is Overfitting?

o Overfitting & underfitting are the two main errors/problems in the machine learning model, which
cause poor performance in Machine Learning.
o Overfitting occurs when the model fits more data than required, and it tries to capture each and every
datapoint fed to it. Hence it starts capturing noise and inaccurate data from the dataset, which degrades
the performance of the model.
o An overfitted model doesn't perform accurately with the test/unseen dataset and can’t generalize well.
o An overfitted model is said to have low bias and high variance.
Example to Understand Overfitting
We can understand overfitting with a general example. Suppose there are three students, X, Y, and Z, and all
three are preparing for an exam. X has studied only three sections of the book and left all other sections. Y
has a good memory, hence memorized the whole book. And the third student, Z, has studied and practiced all
the questions. So, in the exam, X will only be able to solve the questions if the exam has questions related to
section 3. Student Y will only be able to solve questions if they appear exactly the same as given in the book.
Student Z will be able to solve all the exam questions in a proper way.
The same happens with machine learning: if the algorithm learns from only a small part of the data, it is unable to
capture the required data points and is hence underfitted.
Suppose the model memorizes the training dataset, like student Y. It performs very well on the seen dataset
but performs badly on unseen data or unknown instances. In such cases, the model is said to be overfitting.
And if the model performs well with the training dataset and also with the test/unseen dataset, similar to
student Z, it is said to be a good fit.

How to detect Overfitting?


Overfitting in the model can only be detected once we test the model on unseen data. To detect the issue, we can
perform a train/test split.
In the train-test split of the dataset, we divide our dataset into random training and test datasets. We train
the model with a training dataset which is about 80% of the total dataset. After training the model, we test it
with the test dataset, which is 20% of the total dataset.

Now, if the model performs well with the training dataset but not with the test dataset, then it is likely to have
an overfitting issue.
For example, if the model shows 85% accuracy with training data and 50% accuracy with the test dataset, it
means the model is not performing well.
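
A minimal sketch of this detection recipe with scikit-learn, using the Iris dataset and a decision tree as illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

# A large gap between the two scores suggests overfitting
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")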

Ways to prevent the Overfitting


Although overfitting is an error in machine learning which reduces the performance of the model, we can prevent
it in several ways. With the use of a linear model, we can often avoid overfitting; however, many
real-world problems are non-linear. It is important to prevent the models from overfitting. Below are
several ways that can be used to prevent overfitting:

1. Early Stopping
2. Train with more data
3. Feature Selection
4. Cross-Validation
5. Data Augmentation
6. Regularization
1.Early Stopping
In this technique, the training is paused before the model starts learning the noise within the data. In this
process, while training the model iteratively, we measure the performance of the model after each iteration and
continue as long as a new iteration improves the performance of the model.
Beyond that point, the model begins to overfit the training data; hence we need to stop the process before the
learner passes that point.
Stopping the training process before the model starts capturing noise from the data is known as early stopping.

However, this technique may lead to the underfitting problem if training is paused too early. So, it is very
important to find that "sweet spot" between underfitting and overfitting.
2.Train with More data
Increasing the training set by including more data can enhance the accuracy of the model, as it provides more
chances to discover the relationship between input and output variables.
It may not always work to prevent overfitting, but this way helps the algorithm to detect the signal better to
minimize the errors.
When a model is fed with more training data, it will be unable to overfit all the samples of data and forced to
generalize well.
But in some cases, the additional data may add more noise to the model; hence we need to be sure that the data is
clean and free from inconsistencies before feeding it to the model.

3.Feature Selection
While building the ML model, we have a number of parameters or features that are used to predict the outcome.
However, sometimes some of these features are redundant or less important for the prediction, and for this
feature selection process is applied. In the feature selection process, we identify the most important features
within training data, and other features are removed. Further, this process helps to simplify the model and
reduces noise from the data. Some algorithms have the auto-feature selection, and if not, then we can manually
perform this process.
4.Cross-Validation
Cross-validation is one of the powerful techniques to prevent overfitting.
In the general k-fold cross-validation technique, we divide the dataset into k equal-sized subsets of data;
these subsets are known as folds. The model is then trained on k-1 folds and validated on the remaining fold, rotating through all k folds, as sketched below.
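
A minimal sketch of k-fold cross-validation with scikit-learn's cross_val_score, using the Iris dataset and k = 5 as illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold serves once as the validation set
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(scores, scores.mean())
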
5.Data Augmentation
Data Augmentation is a data analysis technique which is an alternative to adding more data to prevent
overfitting. In this technique, instead of adding more training data, slightly modified copies of already existing
data are added to the dataset.
The data augmentation technique makes each data sample appear slightly different every time it is
processed by the model. Hence each sample appears unique to the model, which prevents overfitting.
6.Regularization
If overfitting occurs when a model is complex, we can reduce the number of features. However, overfitting
may also occur with a simpler model, more specifically the Linear model, and for such cases, regularization
techniques are much helpful.
Regularization is the most popular technique to prevent overfitting. It is a group of methods that forces the
learning algorithms to make a model simpler. Applying the regularization technique may slightly increase the
bias but slightly reduces the variance. In this technique, we modify the objective function by adding the
penalizing term, which has a higher value with a more complex model.
The two commonly used regularization techniques are L1 Regularization and L2 Regularization.
Ensemble Methods
In ensemble methods, predictions from different machine learning models are combined to identify the most
popular result.
The most commonly used ensemble methods are Bagging and Boosting.
In bagging, individual data points can be selected more than once. After several sample datasets are
collected, these models are trained independently, and, depending on the type of task (i.e., regression or
classification), the average or majority vote of those predictions is used to produce a more accurate result. Moreover, bagging
reduces the chances of overfitting in complex models.

Perceptron in Machine Learning

In Machine Learning and Artificial Intelligence, Perceptron is one of the most commonly used terms. It
is the primary step in learning Machine Learning and Deep Learning technologies, and it consists of a set of
weights, input values or scores, and a threshold. Perceptron is a building block of an Artificial Neural
Network. In 1957, in the mid-20th century, Mr. Frank Rosenblatt invented the Perceptron for performing
certain calculations to detect input data capabilities or business intelligence. Perceptron is a linear Machine
Learning algorithm used for supervised learning of various binary classifiers. This algorithm enables neurons
to learn elements and processes them one by one during preparation. In this tutorial, "Perceptron in Machine
Learning," we will discuss in-depth knowledge of Perceptron and its basic functions in brief. Let's start with
the basic introduction of Perceptron.

What is the Perceptron model in Machine Learning?


Perceptron is a Machine Learning algorithm for the supervised learning of various binary classification tasks.
Further, Perceptron is also understood as an Artificial Neuron or neural network unit that helps to detect
certain input data computations in business intelligence.
Perceptron model is also treated as one of the best and simplest types of Artificial Neural networks. However,
it is a supervised learning algorithm of binary classifiers. Hence, we can consider it as a single-layer neural
network with four main parameters, i.e., input values, weights and Bias, net sum, and an activation
function.

What is Binary classifier in Machine Learning?


In Machine Learning, binary classifiers are defined as the function that helps in deciding whether input data
can be represented as vectors of numbers and belongs to some specific class.
Binary classifiers can be considered as linear classifiers. In simple words, we can understand it as
a classification algorithm that can predict linear predictor function in terms of weight and feature vectors.
Basic Components of Perceptron
Mr. Frank Rosenblatt invented the perceptron model as a binary classifier which contains three main
components. These are as follows:

o Input Nodes or Input Layer:


This is the primary component of Perceptron which accepts the initial data into the system for further
processing. Each input node contains a real numerical value.
o Weight and Bias:

Weight parameter represents the strength of the connection between units. This is another most important
parameter of Perceptron components. Weight is directly proportional to the strength of the associated input
neuron in deciding the output. Further, Bias can be considered as the line of intercept in a linear equation.
o Activation Function:
These are the final and important components that help to determine whether the neuron will fire or not.
Activation Function can be considered primarily as a step function.

Types of Activation functions:


o Sign function
o Step function, and
o Sigmoid function

The data scientist chooses the activation function based on the problem statement and the desired outputs. The
activation function chosen (e.g., Sign, Step, or Sigmoid) may differ across perceptron models, depending on
whether the learning process is slow or suffers from vanishing or exploding gradients.

How does Perceptron work?


In Machine Learning, Perceptron is considered as a single-layer neural network that consists of four main
parameters named input values (Input nodes), weights and Bias, net sum, and an activation function. The
perceptron model begins with the multiplication of all input values and their weights, then adds these values
together to create the weighted sum. Then this weighted sum is applied to the activation function 'f' to obtain
the desired output. This activation function is also known as the step function and is represented by 'f'.

This step function or Activation function plays a vital role in ensuring that output is mapped between required
values (0,1) or (-1,1). It is important to note that the weight of input is indicative of the strength of a node.
Similarly, an input's bias value gives the ability to shift the activation function curve up or down.

Perceptron model works in two important steps as follows:


Step-1
In the first step first, multiply all input values with corresponding weight values and then add them to
determine the weighted sum. Mathematically, we can calculate the weighted sum as follows:
∑wi*xi = x1*w1 + x2*w2 + … + xn*wn

Add a special term called bias 'b' to this weighted sum to improve the model's performance.
∑wi*xi + b

Step-2
In the second step, an activation function is applied with the above-mentioned weighted sum, which gives us
output either in binary form or a continuous value as follows:
Y = f(∑wi*xi + b)
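
These two steps can be sketched in a few lines of NumPy. The weights and bias below are hypothetical values, chosen so that the perceptron implements a logical AND; they are not taken from the text:

import numpy as np

def step(z):
    # Step activation: fire (1) if the weighted sum crosses the threshold
    return 1 if z >= 0 else 0

def perceptron_output(x, w, b):
    z = np.dot(w, x) + b      # Step 1: weighted sum plus bias
    return step(z)            # Step 2: apply the activation function

# Hypothetical weights/bias implementing a logical AND of two binary inputs
w = np.array([1.0, 1.0])
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output(np.array(x, dtype=float), w, b))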

Types of Perceptron Models


Based on the layers, Perceptron models are divided into two types. These are as follows:
1. Single-layer Perceptron Model
2. Multi-layer Perceptron model
Single Layer Perceptron Model:
This is one of the easiest types of Artificial Neural Networks (ANN). A single-layered perceptron model consists
of a feed-forward network and also includes a threshold transfer function inside the model. The main objective of
the single-layer perceptron model is to analyze linearly separable objects with binary outcomes.
In a single-layer perceptron model, the algorithm has no prior recorded data, so it begins with randomly
allocated weight parameters. It then sums up all the weighted inputs. If the
total sum of all inputs is more than a pre-determined value, the model gets activated and shows the output
value as +1.
If the outcome matches the pre-determined or threshold value, the performance of this model is stated as
satisfied, and the weights do not change. However, this model produces some discrepancies
when multiple input values are fed into it. Hence, to find the desired output and minimize errors,
some changes to the input weights are necessary.
"Single-layer perceptron can learn only linearly separable patterns."

Multi-Layered Perceptron Model:


Like a single-layer perceptron model, a multi-layer perceptron model also has the same model structure but
has a greater number of hidden layers.
The multi-layer perceptron model is trained using the Backpropagation algorithm, which executes in two
stages as follows:
o Forward Stage: Activation functions start from the input layer in the forward stage and terminate on
the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified as per the model's
requirement. In this stage, the error between the actual and the desired output is propagated backward,
originating at the output layer and ending at the input layer.
Hence, a multi-layered perceptron model can be considered as multiple artificial neural network layers
in which the activation function does not remain linear, unlike in a single-layer perceptron model. Instead
of linear, the activation function can be sigmoid, TanH, ReLU, etc., for deployment.

A multi-layer perceptron model has greater processing power and can process linear and non-linear patterns.
Further, it can also implement logic gates such as AND, OR, XOR, NAND, NOT, XNOR, NOR.

Advantages of Multi-Layer Perceptron:


o A multi-layered perceptron model can be used to solve complex non-linear problems.
o It works well with both small and large input data.
o It helps us to obtain quick predictions after the training.
o It helps to obtain the same accuracy ratio with large as well as small data.

Disadvantages of Multi-Layer Perceptron:


o In Multi-layer perceptron, computations are difficult and time-consuming.
o In a multi-layer perceptron, it is difficult to predict how much each independent variable affects the
dependent variable.
o The model functioning depends on the quality of the training.

What Is a Neural Network?


A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of data
through a process that mimics the way the human brain operates. In this sense, neural networks refer to systems
of neurons, either organic or artificial in nature.

Neural networks can adapt to changing input; so the network generates the best possible result without needing
to redesign the output criteria. The concept of neural networks, which has its roots in artificial intelligence, is
swiftly gaining popularity in the development of trading systems.

Pros
o Can often work more efficiently and for longer than humans
o Can be programmed to learn from prior outcomes to strive to make smarter future calculations
o Often leverage online services that reduce (but do not eliminate) systematic risk
o Are continually being expanded in new fields with more difficult problems

Cons
o Still rely on hardware that may require labor and expertise to maintain
o May take long periods of time to develop the code and algorithms
o May be difficult to assess errors or adaptations to the assumptions if the system is self-learning but lacks
transparency
o Usually report an estimated range or estimated amount that may not actualize

Multi-Class Classification
Multi-class classification is perhaps the most popular machine learning job, aside from regression.

The science behind it is the same whether it's spelled multiclass or multi-class. An ML classification problem
with more than two outputs or classes is known as multi-class classification. Using a machine learning model to
identify animal species in photographs from an encyclopedia is an example of multi-class classification, because
each image may be classed into one of many distinct animal categories. Multi-class classification also
necessitates that each sample belong to only one class (i.e., an elephant is only an elephant; it is not also a lemur).

We are given a set of training samples separated into K distinct classes, and we create an ML model to forecast
which of those classes some previously unknown data belongs to. The model learns patterns specific to each
class from the training dataset and utilizes those patterns to forecast the classification of future data.
Approach –
1. Load dataset from the source.
2. Split the dataset into “training” and “test” data.
3. Train Decision tree, SVM, and KNN classifiers on the training data.
4. Use the above classifiers to predict labels for the test data.
5. Measure accuracy and visualize classification.
Decision tree classifier – A decision tree classifier is a systematic approach for multiclass classification. It
poses a set of questions to the dataset (related to its attributes/features). The decision tree classification
algorithm can be visualized on a binary tree. On the root and each of the internal nodes, a question is posed
and the data on that node is further split into separate records that have different characteristics. The leaves of
the tree refer to the classes in which the dataset is split. In the following code snippet, we train a decision tree
classifier in scikit-learn.
Example:
# importing necessary libraries
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
# loading the iris dataset
iris = datasets.load_iris()
# X -> features, y -> label
X = iris.data
y = iris.target
# dividing X, y into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
# training a DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
dtree_model = DecisionTreeClassifier(max_depth = 2).fit(X_train, y_train)
dtree_predictions = dtree_model.predict(X_test)
# creating a confusion matrix
cm = confusion_matrix(y_test, dtree_predictions)

What is a backpropagation algorithm?


Backpropagation, or backward propagation of errors, is an algorithm that is designed to test for errors working
back from output nodes to input nodes. It is an important mathematical tool for improving the accuracy of
predictions in data mining and machine learning. Essentially, backpropagation is an algorithm used to
calculate derivatives quickly.

There are two leading types of backpropagation networks:

1. Static backpropagation. Static backpropagation is a network developed to map static inputs for static
outputs. Static backpropagation networks can solve static classification problems, such as optical
character recognition (OCR).
2. Recurrent backpropagation. The recurrent backpropagation network is used for fixed-point learning.
Recurrent backpropagation activation feeds forward until it reaches a fixed value.

What is a backpropagation algorithm in a neural network?


Artificial neural networks use backpropagation as a learning algorithm to compute the gradient of the error with
respect to the weight values for the various inputs. By comparing desired outputs to achieved system outputs, the
systems are tuned by adjusting connection weights to narrow the difference between the two as much as
possible.

The algorithm gets its name because the weights are updated backward, from output to input.
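
A minimal sketch of this backward update for a single sigmoid neuron, trained here on a hypothetical logical-AND dataset: the gradient of the loss with respect to the weighted sum is computed first, then the weights and bias are updated from output back to input:

import numpy as np

# Minimal sketch: one sigmoid neuron trained by backpropagating the error
rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 0., 0., 1.])          # hypothetical targets: logical AND
w, b, lr = rng.normal(size=2), 0.0, 0.5

for _ in range(1000):
    z = X @ w + b
    p = 1 / (1 + np.exp(-z))            # forward pass
    grad_z = (p - y) / len(X)           # backward pass: dLoss/dz for cross-entropy
    w -= lr * X.T @ grad_z              # update weights from output back to input
    b -= lr * grad_z.sum()

print(np.round(p, 2))                   # predictions approach [0, 0, 0, 1]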

The advantages of using a backpropagation algorithm are as follows:

 It does not have any parameters to tune except for the number of inputs.
 It is highly adaptable and efficient and does not require any prior knowledge about the network.
 It is a standard process that usually works well.
 It is user-friendly, fast and easy to program.
 Users do not need to learn any special functions.

The disadvantages of using a backpropagation algorithm are as follows:

 It prefers a matrix-based approach over a mini-batch approach.


 It is sensitive to noise and irregularities in the data.
 Performance is highly dependent on input data.
 Training is time- and resource-intensive.

What is a backpropagation algorithm in machine learning?


Backpropagation requires a known, desired output for each input value in order to calculate the loss
function gradient -- how a prediction differs from actual results -- as a type of supervised machine
learning. Along with classifiers such as Naïve Bayesian filters and decision trees, the backpropagation
training algorithm has emerged as an important part of machine learning applications that involve
predictive analytics.

What is the time complexity of a backpropagation algorithm?


The time complexity of each iteration -- how long it takes to execute each statement in an algorithm -
- depends on the network's structure. For multilayer perceptron, matrix multiplications dominate time.

Non-Linear Activation Functions


Examples of non-linear activation functions include:

1. Sigmoid function: The sigmoid function outputs values between 0 and 1. The use of a sigmoid function
is to convert a real value to a probability. In machine learning, the sigmoid function generally refers
to the logistic function, also called the logistic sigmoid function; it is also the most widely used sigmoid
function (others are the hyperbolic tangent and the arctangent).

A sigmoid function is placed as the last layer of the model to convert the model’s output into a probability
score, which is easier to work with and interpret.

Another reason to use it mostly in the output layer is that it can otherwise cause a neural network to get stuck
in training time.

2. TanH function: It is the hyperbolic tangent function, whose range lies between -1 and 1; hence it is also called
a zero-centred function. Because it is zero-centred, it is much easier to model inputs with strongly negative,
positive or neutral values. The TanH function is used instead of the sigmoid function if the desired output range is
other than 0 to 1. TanH functions usually find applications in RNNs for natural language processing and speech
recognition tasks.

On the downside, in the case of both Sigmoid and TanH, if the weighted sum input is very large or very small,
the function’s gradient becomes very small and closer to zero.

3. ReLU function: Rectified Linear Unit, also called ReLU, is a widely favoured activation function for deep
learning applications. Compared to Sigmoid and TanH activation functions, ReLU offers an upper hand in
terms of performance and generalisation. In terms of computation too, ReLU is faster as it does not compute
exponentials and divisions. The disadvantage is that ReLU overfits more, as compared with Sigmoid.

4. Parametric ReLU (PReLU): ReLU has been one of the keys to the recent successes in deep learning. Its
use has led to better solutions than those of sigmoid. This is partially due to the vanishing gradient problem in the
case of sigmoid activations. But we can still improve upon ReLU. LeakyReLU was introduced, which doesn't
zero out the negative inputs as ReLU does. Instead, it multiplies the negative input by a small value (like 0.02)
and keeps the positive input as is. But this has shown only a negligible increase in the accuracy of our models.
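
For reference, the activation functions discussed above can be sketched in NumPy as follows; the 0.02 leak factor matches the value mentioned for LeakyReLU, and the sample inputs are arbitrary:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))           # output in (0, 1)

def tanh(z):
    return np.tanh(z)                     # output in (-1, 1), zero-centred

def relu(z):
    return np.maximum(0, z)               # zeroes out negative inputs

def leaky_relu(z, alpha=0.02):
    return np.where(z > 0, z, alpha * z)  # small slope for negative inputs

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), tanh(z), relu(z), leaky_relu(z), sep="\n")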

Dropout is a regularization

Dropout is a regularization technique for reducing overfitting in neural networks by preventing complex co-
adaptations on training data. It is a very efficient way of performing model averaging with neural networks.
The term "dropout" refers to dropping out units (both hidden and visible) in a neural network.

A simple and powerful regularization technique for neural networks and deep learning models is dropout. This
section will uncover the dropout regularization technique and how to apply it to deep learning models in
Python with Keras.

Dropout is a technique where randomly selected neurons are ignored during training. They are "dropped out"
randomly. This means that their contribution to the activation of downstream neurons is temporarily removed
on the forward pass, and any weight updates are not applied to the neuron on the backward pass.

As a neural network learns, neuron weights settle into their context within the network. The weights of neurons
are tuned for specific features, providing some specialization. Neighboring neurons come to rely on this
specialization, which, if taken too far, can result in a fragile model that is too specialized to the training data. This
reliance on context for a neuron during training is referred to as complex co-adaptation.
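
A minimal Keras sketch of dropout, assuming a hypothetical 10-feature binary classification task; here 20% of the hidden units are dropped at random during each training update:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout

model = Sequential([
    Input(shape=(10,)),                  # hypothetical 10 input features
    Dense(64, activation="relu"),
    Dropout(0.2),                        # 20% of these units dropped each update
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

At inference time Keras automatically disables dropout, so all units contribute to the prediction.
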
UNIT V NON-PARAMETRIC MACHINE LEARNING
k- Nearest Neighbors- Decision Trees – Branching – Greedy Algorithm - Multiple Branches – Continuous
attributes – Pruning. Random Forests: ensemble learning. Boosting – Adaboost algorithm. Support Vector
Machines – Large Margin Intuition – Loss Function - Hinge Loss – SVM Kernels

1. K-Nearest Neighbor (KNN) Algorithm for Machine Learning


 K-Nearest Neighbor is one of the simplest Machine Learning algorithms based on Supervised Learning
technique.
 K-NN algorithm assumes the similarity between the new case/data and available cases and put the new
case into the category that is most similar to the available categories.
 K-NN algorithm stores all the available data and classifies a new data point based on similarity.
This means that when new data appears, it can be easily classified into a well-suited category by using the
K-NN algorithm.
 K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the
Classification problems.
 K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying
data.
 It is also called a lazy learner algorithm because it does not learn from the training set immediately
instead it stores the dataset and at the time of classification, it performs an action on the dataset.
 KNN algorithm at the training phase just stores the dataset and when it gets new data, then it classifies
that data into a category that is much similar to the new data.
 Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, but we want to
know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on
a similarity measure. Our KNN model will find the features of the new data set similar to the cat and
dog images, and based on the most similar features it will put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?


Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1, so this
data point will lie in which of these categories. To solve this type of problem, we need a K-NN algorithm.
With the help of K-NN, we can easily identify the category or class of a particular dataset. Consider the below
diagram:

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data points to that category for which the number of the neighbor is maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category. Consider the below image:

o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the
distance between two points, which we have already studied in geometry. It can be calculated as:

Euclidean distance between A(x1, y1) and B(x2, y2) = √((x2 - x1)² + (y2 - y1)²)

o By calculating the Euclidean distance, we got the nearest neighbors: three nearest neighbors in
category A and two nearest neighbors in category B. Consider the below image:
o As we can see the 3 nearest neighbors are from category A, hence this new data point must belong to
category A.

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN algorithm:
 There is no particular way to determine the best value for "K", so we need to try some values to find
the best out of them. The most preferred value for K is 5.
 A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of outliers in the
model.
 Large values for K are generally good, but they increase the computation cost and may smooth over
small class boundaries.

Advantages of KNN Algorithm:


 It is simple to implement.
 It is robust to the noisy training data
 It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
 It always needs determination of the value of K, which may sometimes be complex.
 The computation cost is high because of calculating the distance between the new data point and all the
training samples.

Problem for K-NN Algorithm: There is a Car manufacturer company that has manufactured a new SUV car.
The company wants to give the ads to the users who are interested in buying that SUV. So for this problem,
we have a dataset that contains multiple user's information through the social network. The dataset contains
lots of information but the Estimated Salary and Age we will consider for the independent variable and
the Purchased variable is for the dependent variable. Below is the dataset:

Steps to implement the K-NN algorithm:
 Data Pre-processing step
 Fitting the K-NN algorithm to the Training set
 Predicting the test result
 Test accuracy of the result(Creation of Confusion matrix)
 Visualizing the test set result.

Data Pre-Processing Step:


The Data Pre-processing step will remain exactly the same as Logistic Regression. Below is the code for it:
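The original code listing is not reproduced in these notes, so the following is a minimal sketch of the usual pre-processing steps, assuming a hypothetical user_data.csv file in which Age and EstimatedSalary are the feature columns and Purchased is the label:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset file; column positions assumed for illustration
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values    # Age, EstimatedSalary
y = dataset.iloc[:, 4].values         # Purchased

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature scaling, so that Age and EstimatedSalary are on the same scale
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
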
By executing this code, the dataset is imported into our program and well pre-processed, and after feature
scaling the test dataset is successfully standardized.

Fitting K-NN classifier to the Training data:


Now we will fit the K-NN classifier to the training data. To do this we will import
the KNeighborsClassifier class of Sklearn Neighbors library. After importing the class, we will create
the Classifier object of the class. The parameters of this class will be:
o n_neighbors: To define the required neighbors of the algorithm. Usually, it takes 5.
o metric='minkowski': This is the default parameter and it decides the distance between the points.
o p=2: It is equivalent to the standard Euclidean metric.

And then we will fit the classifier to the training data. Below is the code for it:
from sklearn.neighbors import KNeighborsClassifier #Fitting K-NN classifier to the training set
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)

Predicting the Test Result: To predict the test set result, we will create a y_pred vector as we did in Logistic
Regression. Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)

Creating the Confusion Matrix:


Now we will create the Confusion Matrix for our K-NN model to see the accuracy of the classifier. Below is
the code for it:
from sklearn.metrics import confusion_matrix #Creating the Confusion matrix
cm= confusion_matrix(y_test, y_pred)

Visualizing the Training set result:
Now, we will visualize the training set result for K-NN model. The code will remain same as we did in
Logistic Regression, except the name of the graph.

The output graph is different from the graph we obtained with Logistic Regression. It can be
understood through the below points:
 As we can see, the graph shows red points and green points. The green points are for the
Purchased (1) variable and the red points for the not Purchased (0) variable.
 The graph shows an irregular boundary instead of a straight line or a curve,
because the K-NN algorithm works by finding the nearest neighbors.
 The graph has classified users into the correct categories, as most of the users who didn't buy the SUV
are in the red region and users who bought the SUV are in the green region.
 The graph shows a good result, but still, there are some green points in the red region and red
points in the green region. This is no big issue, as it prevents the model from
overfitting.
 Hence our model is well trained.

Visualizing the Test set result:


After the training of the model, we will now test the result by putting a new dataset, i.e., Test dataset. Code
remains the same except some minor changes: such as x_train and y_train will be replaced by x_test and
y_test.

2.Decision Tree Classification Algorithm:


 Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches represent
the decision rules and each leaf node represents the outcome.
 In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision nodes
are used to make any decision and have multiple branches, whereas Leaf nodes are the output of those
decisions and do not contain any further branches.
 The decisions or the test are performed on the basis of features of the given dataset.
 It is a graphical representation for getting all the possible solutions to a problem/decision based on
given conditions.
 It is called a decision tree because, similar to a tree, it starts with the root node, which expands on
further branches and constructs a tree-like structure.
 In order to build a tree, we use the CART algorithm, which stands for Classification and Regression
Tree algorithm.
 A decision tree simply asks a question, and based on the answer (Yes/No), it further splits into
subtrees.
 Below diagram explains the general structure of a decision tree:
A decision tree can contain categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset and
problem is the main point to remember while creating a machine learning model. Below are the two reasons
for using the Decision tree:
 Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
 The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Decision Tree Terminologies
 Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
 Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after getting
a leaf node.
 Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to
the given conditions.
 Branch/Sub Tree: A tree formed by splitting the tree.
 Pruning: Pruning is the process of removing the unwanted branches from the tree.
 Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the
child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of the
tree. This algorithm compares the values of root attribute with the record (real dataset) attribute and, based on
the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further.
It continues the process until it reaches the leaf node of the tree. The complete process can be better understood
using the below algorithm:
 Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
 Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
 Step-3: Divide S into subsets that contain possible values for the best attribute.
 Step-4: Generate the decision tree node, which contains the best attribute.
 Step-5: Recursively make new decision trees using the subsets of the dataset created in step-3.
Continue this process until a stage is reached where you cannot further classify the nodes; call
the final node a leaf node.

Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept the
offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by
ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based
on the corresponding labels. The next decision node further splits into one decision node (Cab facility) and
one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer).
Consider the below diagram:

Attribute Selection Measures
While implementing a Decision tree, the main issue is how to select the best attribute for the root
node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection
Measure, or ASM. With this measurement, we can easily select the best attribute for the nodes of the tree. There
are two popular techniques for ASM, which are:
 Information Gain
 Gini Index

1. Information Gain:
 Information gain is the measurement of changes in entropy after the segmentation of a dataset based
on an attribute.
 It calculates how much information a feature provides us about a class.
 According to the value of information gain, we split the node and build the decision tree.
 A decision tree algorithm always tries to maximize the value of information gain, and a node/attribute
having the highest information gain is split first. It can be calculated using the below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in data.
Entropy can be calculated as:
Entropy(S) = −P(yes)·log2(P(yes)) − P(no)·log2(P(no))
Where,
 S= Total number of samples
 P(yes)= probability of yes
 P(no)= probability of no
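As a quick illustration of these formulas, here is a minimal Python sketch; the class counts (9 "yes" and 5 "no" in the parent node, split into children of sizes 8 and 6) are hypothetical numbers chosen only for the example:
import numpy as np

def entropy(p_yes, p_no):
    # Entropy(S) = -P(yes)log2 P(yes) - P(no)log2 P(no); 0*log2(0) is treated as 0
    return -sum(p * np.log2(p) for p in (p_yes, p_no) if p > 0)

parent = entropy(9/14, 5/14)                  # entropy of the full node
child1 = entropy(6/8, 2/8)                    # entropy of the first subset
child2 = entropy(3/6, 3/6)                    # entropy of the second subset
weighted = (8/14) * child1 + (6/14) * child2  # weighted average of the children
print(parent - weighted)                      # information gain of the split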

2. Gini Index:
 Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
 An attribute with the low Gini index should be preferred as compared to the high Gini index.
 It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
 Gini index can be calculated using the below formula:
Gini Index = 1 − Σj (Pj)²
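The Gini index can be computed in the same spirit; the class probabilities below are again hypothetical:
def gini_index(probs):
    # Gini = 1 - sum over classes of P_j squared
    return 1 - sum(p ** 2 for p in probs)

print(gini_index([9/14, 5/14]))  # impurity of a node with two classes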
Pruning: Getting an Optimal Decision tree
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal decision tree.
A too-large tree increases the risk of overfitting, while a small tree may not capture all the important features of
the dataset. A technique that decreases the size of the learned tree without reducing accuracy is therefore
known as Pruning. There are mainly two types of tree pruning techniques used:
 Cost Complexity Pruning
 Reduced Error Pruning.

Advantages of the Decision Tree


 It is simple to understand as it follows the same process which a human follows while making any
decision in real life.
 It can be very useful for solving decision-related problems.
 It helps to think about all the possible outcomes for a problem.
 There is less requirement of data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
 The decision tree contains lots of layers, which makes it complex.
 It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
 For more class labels, the computational complexity of the decision tree may increase.

Python Implementation of Decision Tree
Now we will implement the Decision tree using Python. For this, we will use the dataset "user_data.csv,"
which we have used in previous classification models. By using the same dataset, we can compare the Decision
tree classifier with other classification models such as KNN, SVM, Logistic Regression, etc.
 Data Pre-processing step
 Fitting a Decision-Tree algorithm to the Training set
 Predicting the test result
 Test accuracy of the result(Creation of Confusion matrix)
 Visualizing the test set result.

1. Data Pre-Processing Step:


Below is the code for the pre-processing step:
import numpy as nm # importing libraries
import matplotlib.pyplot as mtp
import pandas as pd

data_set= pd.read_csv('user_data.csv') #importing datasets


x= data_set.iloc[:, [2,3]].values #Extracting Independent and dependent Variable
y= data_set.iloc[:, 4].values
from sklearn.model_selection import train_test_split # Splitting the dataset into training and test set.
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
from sklearn.preprocessing import StandardScaler #feature Scaling
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

In the above code, we have pre-processed the data and loaded the dataset, which is given as:

2. Fitting a Decision-Tree algorithm to the Training set


Now we will fit the model to the training set. For this, we will import the DecisionTreeClassifier class
from sklearn.tree library. Below is the code for it:
#Fitting Decision Tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier= DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)

In the above code, we have created a classifier object, in which we have passed two main parameters:
 criterion='entropy': Criterion is used to measure the quality of a split, which is calculated by the
information gain given by entropy.
 random_state=0: It fixes the random state so that the results are reproducible.
3. Predicting the test result
Now we will predict the test set result. We will create a new prediction vector y_pred. Below is the code for
it:
#Predicting the test set result
y_pred= classifier.predict(x_test)

Output:
In the below output image, the predicted output and real test output are given. We can clearly see that there
are some values in the prediction vector, which are different from the real vector values. These are prediction
errors.

4. Test accuracy of the result (Creation of Confusion matrix)


In the above output, we have seen that there were some incorrect predictions, so if we want to know the
number of correct and incorrect predictions, we need to use the confusion matrix. Below is the code for it:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)
Output:

In the above output image, we can see the confusion matrix, which has 6+3 = 9 incorrect
predictions and 62+29 = 91 correct predictions. Therefore, we can say that compared to other
classification models, the Decision Tree classifier made a good prediction.

5. Visualizing the training set result:


Here we will visualize the training set result. To visualize the training set result we will plot a graph for the
decision tree classifier. The classifier will predict yes or No for the users who have either Purchased or Not
purchased the SUV car as we did in Logistic Regression. Below is the code for it:
#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
    nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
    alpha = 0.75, cmap = ListedColormap(('purple','green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Decision Tree Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:

The above output is completely different from the other classification models. It has both vertical and horizontal
lines that split the dataset according to the age and estimated salary variables.
As we can see, the tree is trying to capture each data point, which is a sign of overfitting.

6. Visualizing the test set result:

Visualization of the test set result will be similar to the visualization of the training set, except that the training
set will be replaced with the test set.
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
    nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
    alpha = 0.75, cmap = ListedColormap(('purple','green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Decision Tree Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

3. Greedy Algorithm:
The greedy method is one of the strategies, like divide and conquer, used to solve problems. This
method is used for solving optimization problems. An optimization problem is a problem that demands
either a maximum or a minimum result.
The greedy method is the simplest and most straightforward approach. It is not an algorithm, but a
technique. The main idea of this approach is that each decision is taken on the basis of the currently
available information, without worrying about the effect of the current decision in the future.
This technique is basically used to determine a feasible solution that may or may not be optimal. A
feasible solution is one that satisfies the given criteria. The optimal solution is the best and most
favorable solution among the feasible ones. If more than one solution satisfies the given criteria, all
those solutions are considered feasible, whereas the optimal solution is the best solution among them.

Characteristics of Greedy method


The following are the characteristics of a greedy method:
 To construct the solution in an optimal way, this algorithm creates two sets where one set contains all
the chosen items, and another set contains the rejected items.
 A Greedy algorithm makes good local choices in the hope that the solution should be either feasible
or optimal.
Components of Greedy Algorithm
The components that can be used in the greedy algorithm are:
 Candidate set: The set from which a solution is created.
 Selection function: This function is used to choose the candidate or subset which can be added to the
solution.
 Feasibility function: A function that is used to determine whether the candidate or subset can
contribute to the solution or not.
 Objective function: A function used to assign a value to the solution or the partial solution.
 Solution function: This function is used to indicate whether the complete solution has been reached
or not.

Applications of Greedy Algorithm


 It is used in finding the shortest path.
 It is used to find the minimum spanning tree using Prim's algorithm or Kruskal's algorithm.
 It is used in a job sequencing with a deadline.
 This algorithm is also used to solve the fractional knapsack problem.

Pseudo code of Greedy Algorithm
Algorithm Greedy (a, n)
{
    solution := 0;
    for i = 0 to n do
    {
        x := select(a);
        if feasible(solution, x) then
        {
            solution := union(solution, x);
        }
    }
    return solution;
}

The above is the greedy algorithm skeleton. Initially, the solution is assigned the value zero. We pass the array and
the number of elements to the greedy algorithm. Inside the for loop, we select the elements one by one and check
whether each one keeps the solution feasible. If it does, we perform the union.
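The same skeleton can be rendered in Python. In this minimal sketch, the select and feasible steps are filled in for a toy coin-change problem, which is an assumption made purely for illustration; the selection and feasibility functions of your actual problem would take their place:
def greedy_coin_change(amount, denominations):
    # Greedy strategy: at each step take the largest coin that still fits.
    solution = []
    for coin in sorted(denominations, reverse=True):  # select(a): best candidate first
        while coin <= amount:                         # feasible(solution, x)
            solution.append(coin)                     # union(solution, x)
            amount -= coin
    return solution

print(greedy_coin_change(48, [1, 5, 10, 25]))  # [25, 10, 10, 1, 1, 1]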
Let's understand through an example.
Suppose there is a problem 'P'. I want to travel from A to B shown as below:
P:A→B

The problem is that we have to make the journey from A to B. There are various ways to go from A to
B: by walking, car, bike, train, aeroplane, etc. There is a constraint on the journey: we have to complete
it within 12 hrs. Only if I go by train or aeroplane can I cover this distance within 12 hrs. There are
many solutions to this problem, but only two of them satisfy the constraint.
If we say that we have to complete the journey at the minimum cost, this becomes a minimization problem.
So far, we have two feasible solutions, i.e., one by train and another by air. Since travelling by train leads
to the minimum cost, it is the optimal solution. An optimal solution is also a feasible solution, but it is the
one providing the best result, i.e., the minimum cost. There would be only one optimal solution.
The problem that requires either minimum or maximum result then that problem is known as an optimization
problem. Greedy method is one of the strategies used for solving the optimization problems.

Disadvantages of using Greedy algorithm


A greedy algorithm makes decisions based on the information available at each phase without considering the
broader problem. So, there is a possibility that the greedy solution does not give the best solution for
every problem.
It follows the locally optimal choice at each stage with the intent of finding the global optimum. Let's understand
this through an example.
through an example.

Consider the graph which is given below:

We have to travel from the source to the destination at the minimum cost. Suppose we have three feasible
paths with costs 10, 20, and 5. Since 5 is the minimum cost, that path is the optimal solution. This is
the local optimum, and in this way we find the local optimum at each stage in order to reach the global
optimal solution.

Continuous attributes
What are Continuous Variables?
Simply put, if a variable can take any value between its minimum and maximum value, then it is called a
continuous variable. By nature, a lot of things we deal with fall in this category: age, weight, height being
some of them.

Just to make sure the difference is clear, let me ask you to classify whether a variable is continuous or
categorical:
1. Gender of a person
2. Number of siblings of a Person
3. Time on which a laptop runs on battery

Methods to deal with Continuous Variables

Binning The Variable:


Binning refers to dividing the range of a continuous variable into groups (bins). It is done to discover patterns in
continuous variables that are difficult to analyze otherwise. Also, bins are easy to analyze and interpret. However,
binning also leads to loss of information and loss of power: once the bins are created, the information gets
compressed into groups, which later affects the final model. Hence, it is advisable to create small bins initially.
This helps minimize the loss of information and produces better results.
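As a minimal sketch, pandas provides pd.cut for binning; the ages and bin edges below are hypothetical values chosen for illustration:
import pandas as pd

ages = pd.Series([5, 17, 25, 34, 49, 62, 78])           # hypothetical continuous variable
bins = pd.cut(ages, bins=[0, 18, 35, 60, 100],
              labels=['child', 'young', 'middle', 'senior'])
print(bins.value_counts())                               # size of each bin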

Normalization:
In simpler words, it is a process of comparing variables on a 'neutral' or 'standard' scale. It helps to obtain the
same range of values across variables. Normally distributed data is easy to read and interpret: in a normal
distribution, 99.7% of the observations lie within 3 standard deviations from the mean, the mean is
zero, and the standard deviation is one. Normalization is commonly used in algorithms such as k-means
clustering.

A commonly used normalization method is z-scores. The z-score of an observation is the number of standard
deviations it falls above or below the mean. Its formula is:

z = (x − μ) / σ

where x = observation, μ = mean (population), σ = standard deviation (population)


For example:

Randy scored 76 in a maths test. Katie scored 86 in a science test. The maths test has (mean = 70, sd = 2). The
science test has (mean = 80, sd = 3).
z(Randy) = (76 – 70)/2 = 3
z(Katie) = (86 – 80)/3 = 2
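The same computation as a minimal Python sketch:
def z_score(x, mu, sigma):
    # number of standard deviations the observation falls from the mean
    return (x - mu) / sigma

print(z_score(76, 70, 2))  # Randy: 3.0
print(z_score(86, 80, 3))  # Katie: 2.0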

Transformations for Skewed Distribution:


Transformation is required when we encounter highly skewed data. It is suggested not to work on skewed data
in its raw form, because the skew reduces the impact of low-frequency values which could be equally significant.
At times, skewness is influenced by the presence of outliers, so we need to be careful while using this approach.
The technique for dealing with outliers is explained in the next sections.

There are various types of transformation methods, such as log, sqrt, exp, Box-Cox, power, etc. The most
commonly used method is the log transformation.
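A minimal sketch of a log transformation on hypothetical right-skewed values; np.log1p computes log(1 + x), which avoids problems at zero:
import numpy as np

skewed = np.array([1, 2, 3, 10, 200, 5000])  # hypothetical right-skewed data
log_transformed = np.log1p(skewed)           # compresses the long right tail
print(log_transformed.round(2))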

Principal Component Analysis:


Sometimes a data set has too many variables: 100, 200 or even more. In such cases, you can't build a model on
all of them, because 1) it would be time consuming, 2) the data might contain lots of noise, and 3) many
variables will carry similar information.

Hence, to avoid such situations we use PCA, a.k.a. Principal Component Analysis. It amounts to finding a
few 'principal' variables (components) which explain a significant amount of the variation in the data. Using this
technique, a large number of variables is reduced to a few significant ones. This helps to reduce
noise and redundancy and enables quick computation.
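A minimal scikit-learn sketch; the random 100×20 matrix stands in for a real data set, and the choice of 5 components is arbitrary:
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 20)             # hypothetical data: 100 rows, 20 variables
pca = PCA(n_components=5)               # keep the 5 strongest components
X_reduced = pca.fit_transform(X)        # 100 rows, 5 columns
print(pca.explained_variance_ratio_)    # variance explained by each component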

Factor Analysis:
Factor Analysis was invented by Charles Spearman (1904). This is a variable reduction technique. It is used
to determine factor structure or model. It also explains the maximum amount of variance in the model. Let’s
say some variables are highly correlated. These variables can be grouped by their correlations i.e., all variables
in a particular group can be highly correlated among themselves but have low correlation with variables of
other group(s). Here each group represents a single underlying construct or factor. Factor analysis is of two
types:
1. EFA (Exploratory Factor Analysis) – It identifies and summarizes the underlying correlation structure
in a data set
2. CFA (Confirmatory Factor Analysis) – It attempts to confirm a hypothesis using the correlation structure
and rates the 'goodness of fit'.
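A minimal scikit-learn sketch of factor analysis; the random 200×10 matrix and the choice of 3 factors are assumptions made for illustration:
import numpy as np
from sklearn.decomposition import FactorAnalysis

X = np.random.rand(200, 10)            # hypothetical data with 10 variables
fa = FactorAnalysis(n_components=3)    # assume 3 underlying factors
fa.fit(X)
print(fa.components_.shape)            # factor loadings: (3, 10)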

Methods to work with Date & Time Variable


The presence of a Date & Time variable in a data set usually gives lots of confidence. Seriously! It does, because
a date-time variable gives you lots of scope to practice the techniques learnt above: you can create bins, create
new features, convert its type, etc. Date & Time is commonly found in this format:
DD-MM-YYYY HH:MM:SS or MM-DD-YYYY HH:MM:SS
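A minimal pandas sketch of deriving new features from such a variable; the two timestamps are made up for illustration:
import pandas as pd

s = pd.to_datetime(pd.Series(['12-05-2023 10:30:00', '03-11-2023 18:45:00']),
                   dayfirst=True)                     # parse DD-MM-YYYY HH:MM:SS
features = pd.DataFrame({'day': s.dt.day, 'month': s.dt.month, 'year': s.dt.year,
                         'hour': s.dt.hour, 'weekday': s.dt.day_name()})
print(features)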
Pruning
When the size of a tree exceeds a certain limit, decision and regression trees become prone to overfitting.
The decision tree's overfitting problem is caused by other factors as well, such as branches being
impacted by noise and outliers in the data. Pruning is a critical step in constructing tree-based machine
learning models that helps overcome these issues.
1. A snippet about decision trees
2. About pruning
3. Strategies for pruning
4. Pruning methods
A decision tree is a traditional supervised machine learning technique.

About pruning
Pruning is the process of eliminating weight connections from a network to speed up inference and reduce
model storage size. Decision trees and neural networks, in general, are overparameterized. Pruning a network
entails deleting unneeded parameters from an overly parameterized network.
Pruning mostly serves as an architectural search inside the tree or network. In fact, because pruning functions
as a regularizer, a model will often generalise slightly better at low levels of sparsity. At higher levels, the pruned
model will match the baseline. If you push it too far, the model will start to generalise worse than the
baseline, although it will run faster.

Need for pruning


Pruning a classifier simplifies it by combining disjuncts that are adjacent in instance space. By removing error-
prone components, the classifier's performance may be improved. It also permits further analysis of the model
for the purpose of knowledge gain. Pruning should never remove predictive components of a classifier.
As a result, the pruning operation needs a technique for determining whether a group of disjuncts is predictive or
should be merged into a single, bigger disjunct.
The pruned disjunct represents the “null hypothesis” in a significance test, whereas the unpruned disjuncts
represent the “alternative hypothesis.” The test determines if the data offer adequate evidence to support the
alternative. If this is the case, the unpruned disjuncts are left alone; otherwise, pruning continues.
The rationale for significance tests is that they evaluate whether the apparent correlation between a
collection of disjuncts and the data is likely to be attributable to chance alone. They do so by calculating the
likelihood of observing a random relationship at least as strong as the observed association if the null
hypothesis were true. If the observed relationship is unlikely to be attributable to chance, i.e., this likelihood
does not exceed a set threshold, the unpruned disjuncts are deemed to be predictive; otherwise, the model is
simplified. The aggressiveness of the pruning operation is determined by the "significance level" threshold
used in the test.

Strategies for pruning


Pruning is a critical step in developing a decision tree model. Pruning is commonly employed to alleviate the
overfitting issue in decision trees. Pre-pruning and post-pruning are two common model tree generating
procedures.
Pre pruning
Pre-pruning is the process of pruning the model by halting the tree's formation in advance. When construction
stops, each leaf node inherits the label of the most common class in the subset that is connected to the
current node. There are various ways of pre-pruning, including the following:
 When the model reaches a specific height, the decision tree's growth is stopped.
 When the feature vectors of the instances associated with a node are identical, the tree model stops
developing.
 When the number of instances within a node falls below a certain threshold, the tree stops growing.
The downside of this strategy is that it is inapplicable in circumstances where the amount
of data is small.
 An expansion is the process of dividing a node into two child nodes. When the gain value of an expansion
falls below a certain threshold, the tree model stops expanding as well.

The major disadvantage of pre-pruning is its narrow viewing field: the tree's current expansion may not meet
the threshold, while a later expansion might. In this situation, the decision tree's development is halted too early.
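In scikit-learn, several constructor parameters of DecisionTreeClassifier act as pre-pruning rules of exactly this kind; the values below are illustrative, not recommendations:
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(
    max_depth=5,                 # stop once the tree reaches a given height
    min_samples_split=20,        # do not split nodes with too few instances
    min_impurity_decrease=0.01,  # do not expand if the gain is below a threshold
    random_state=0)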

Post-pruning
The decision tree generation is divided into two steps by post-pruning. The first step is the tree-building
process, with the termination condition that the fraction of a certain class in the node reaches 100%, and the
second phase is pruning the tree structure gained in the first phase.
Post-pruning techniques circumvent the problem of the narrow viewing field in this way. As a result, post-
pruning procedures are usually more accurate than pre-pruning methods, and therefore more widely
utilised. The pruning procedure turns a node into a leaf node by assigning it the label of the most common
class in the subset associated with the current node, the same as in pre-pruning.

Pruning methods
The goal of pruning is to remove sections of a classification model that explain random variation in the training
sample rather than actual domain characteristics. This makes the model more understandable to the user and,
perhaps, more accurate on fresh data that was not used to train the classifier. An effective approach for
differentiating sections of a classifier that are attributable to random effects from parts that describe significant
structure is required for pruning. There are different methods for pruning listed in this article used in both
strategies.

Reduced Error Pruning (REP)


The aim is to discover the smallest subtree that is most accurate on the pruning set.
The pruning set is used to evaluate the efficacy of a subtree (branch) of a fully grown tree in this approach,
which is conceptually the simplest. It starts with the entire tree and compares the number of classification
mistakes made on the pruning set when the subtree is retained to the number of classification errors made
when internal nodes are transformed into leaves and assigned to the best class for each internal node of the
tree. The simplified tree can sometimes outperform the original tree. It is best to prune the subtree in this
scenario. This branch trimming procedure is continued on the simplified tree until the misclassification rate
rises. Another restriction limits the pruning condition: the internal node can be pruned only if it includes no
subtree with a lower error rate than the internal node itself. This indicates that trimmed nodes are evaluated
using a bottom-up traversal technique.
The advantage of this strategy is its linear computing complexity, as each node is only visited once to evaluate
the possibility of trimming it. REP, on the other hand, has a proclivity towards over-pruning. This is because
all evidence contained in the training set and used to construct a fully grown tree is ignored during the pruning
step. This issue is most obvious when the pruning set is significantly smaller than the training set, but it
becomes less significant as the percentage of instances in the pruning set grows.

Pessimistic Error Pruning (PEP)


The fact that the same training set is utilised for both growing and trimming a tree distinguishes this pruning
strategy. The apparent error rate, that is, the error rate on the training set, is optimistic and cannot be used to
select the best-pruned tree. As a result, the continuity correction for the binomial distribution was proposed,
which may give “a more realistic error rate.”
The distribution of errors at the node is roughly a binomial distribution. The binomial distribution’s mean and
variance are the likelihood of success and failure; the binomial distribution converges to a normal distribution.
The PEP approach is regarded as one of the most accurate decision tree pruning algorithms available today.
However, because the mechanism for traversing PEP is similar to pre-pruning, PEP suffers from excessive
pruning. Furthermore, due to its top-down nature, each subtree in the tree only has to be consulted once, and
the time complexity is in the worst-case linear with the number of non-leaf nodes in the decision tree.

Minimum Error Pruning (MEP)
This method is a bottom-up strategy that seeks a single tree with the lowest “anticipated error rate on an
independent data set.” This does not indicate the adoption of a pruning set, but rather that the developer wants
to estimate the error rate for unknown scenarios. Indeed, both the original and enhanced versions described
exploiting just information from the training set.
In the presence of noisy data, Laplace probability estimation is employed to improve the performance of ID3.
Later, the Bayesian technique was employed to enhance this procedure, and the approach is known as an m-
probability estimation. There were two modifications:
 Prior probabilities are used in estimate rather than assuming a uniform starting distribution of classes.
 Several trees with differing degrees of pruning may be generated by adjusting the value of the
parameter. The degree of pruning is now decided by parameters rather than the number of classes.
Furthermore, factors like the degree of noise in the training data may be changed based on domain
expertise or the complexity of the problem.
The predicted error rate for each internal node is estimated in the minimal error pruning approach and is
referred to as static error. The anticipated error rate of the branch with the node is then estimated as a weighted
sum of the expected error rates of the node’s children, where each weight represents the chance that
observation in the node would reach the associated child.

Critical Value Pruning (CVP)


This post-pruning approach is quite similar to pre-pruning. Indeed, a crucial value threshold is defined for the
node selection measure. Then, if the value returned by the selection measure for each test connected with
edges flowing out of that node does not exceed the critical value, an internal node of the tree is pruned.
However, a node may meet the pruning criterion but not all of its offspring. The branch is retained in this
scenario because it includes significant nodes. This additional check is typical of a bottom-up strategy and
distinguishes it from pre-pruning methods that prohibit a tree from developing even if future tests prove to be
important.
The degree of pruning obviously changes with the critical value: a greater critical value results in more extreme
pruning. The approach is divided into two major steps:
 Prune the fully grown tree for increasing critical values.
 Choose the best tree from the sequence of pruned trees by weighing the tree's overall relevance and
forecasting abilities.

Cost-Complexity Pruning (CCP)


The CART pruning algorithm is another name for this approach. It is divided into two steps:
1. Using certain techniques, select a parametric family of subtrees from a fully formed tree.
2. The optimal tree is chosen based on an estimation of the real error rates of the trees in the parametric
family.
In terms of the first phase, the primary concept is to prune the branches that exhibit the least increase in
apparent error rate per cut leaf to produce the next best tree from the best tree. When a tree is pruned at a node,
the apparent error rate increases by a certain amount while the number of leaves reduces by a certain number
of units. As a result, the following ratio of the error rate increase to leaf reduction measures the rise in apparent
error rate per trimmed leaf. The next best tree in the parametric family is then created by trimming all nodes
in the subtree with the lowest value of the above-mentioned ratio.
The best tree in the entire grown tree in terms of predicted accuracy is picked in the second phase. The real
error rate of each tree in the family may be estimated in two ways: one using cross-validation sets and the
other using an independent pruning set.
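Scikit-learn implements this CART-style pruning through cost_complexity_pruning_path and the ccp_alpha parameter. A minimal sketch, assuming the x_train/y_train arrays from the earlier examples; choosing an alpha by cross-validation is left out, and the index -2 is an arbitrary pick near maximal pruning:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=0)
path = tree.cost_complexity_pruning_path(x_train, y_train)  # the parametric family
# Each value in path.ccp_alphas corresponds to one subtree; pick one and refit:
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2])
pruned.fit(x_train, y_train)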

Random Forest Algorithm
Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It
can be used for both Classification and Regression problems in ML. It is based on the concept of ensemble
learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the
performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various
subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and
based on the majority votes of predictions, and it predicts the final output.

The greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.

Assumptions for Random Forest


Since the random forest combines multiple trees to predict the class of the dataset, it is possible that some
decision trees may predict the correct output, while others may not. But together, all the trees predict the
correct output. Therefore, below are two assumptions for a better Random forest classifier:
o There should be some actual values in the feature variable of the dataset so that the classifier can
predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.

Why use Random Forest?


Below are some points that explain why we should use the Random Forest algorithm:
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy, and it runs efficiently even for large datasets.
o It can also maintain accuracy when a large proportion of data is missing.

How does Random Forest algorithm work?


Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and
the second is to make predictions by consulting each tree created in the first phase.
The Working process can be explained in the below steps and diagram:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-4: Repeat Step 1 & 2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to the
category that wins the majority votes.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to the Random
forest classifier. The dataset is divided into subsets and given to each decision tree. During the training phase,
each decision tree produces a prediction result, and when a new data point occurs, then based on the majority
of results, the Random Forest classifier predicts the final decision. Consider the below image:
Applications of Random Forest
There are mainly four sectors where Random Forest is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
Advantages of Random Forest
o Random Forest is capable of performing both Classification and Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest
o Although Random Forest can be used for both classification and regression tasks, it is less suitable
for regression tasks.

Python Implementation of Random Forest Algorithm


Now we will implement the Random Forest algorithm using Python. For this, we will use the same
dataset "user_data.csv", which we have used in previous classification models. By using the same dataset, we
can compare the Random Forest classifier with other classification models such as Decision tree
Classifier, KNN, SVM, Logistic Regression, etc.

Implementation Steps are given below:


o Data Pre-processing step
o Fitting the Random forest algorithm to the Training set
o Predicting the test result
o Test accuracy of the result (Creation of Confusion matrix)
o Visualizing the test set result.

1. Data Pre-Processing Step:


Below is the code for the pre-processing step:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('user_data.csv')
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
In the above code, we have pre-processed the data and loaded the dataset, which is given as:

2. Fitting the Random Forest algorithm to the training set:


Now we will fit the Random forest algorithm to the training set. To fit it, we will import
the RandomForestClassifier class from the sklearn.ensemble library. The code is given below:
#Fitting Random Forest classifier to the training set
from sklearn.ensemble import RandomForestClassifier
classifier= RandomForestClassifier(n_estimators= 10, criterion="entropy")
classifier.fit(x_train, y_train)

In the above code, the classifier object takes below parameters:


o n_estimators= The required number of trees in the Random Forest. The default value is 10. We can
choose any number but need to take care of the overfitting issue.
o criterion= A function to measure the quality of a split. Here we have taken "entropy" for the
information gain.

3. Predicting the Test Set result
Since our model is fitted to the training set, so now we can predict the test result. For prediction, we will create
a new prediction vector y_pred. Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
Output:
The prediction vector is given as:
By checking the above prediction vector and test set real vector, we can determine the incorrect predictions
done by the classifier.

4. Creating the Confusion Matrix


Now we will create the confusion matrix to determine the correct and incorrect predictions. Below is the code
for it:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Output: As we can see in the above matrix, there are 4+4= 8 incorrect predictions and 64+28= 92
correct predictions.

5. Visualizing the training Set result


Here we will visualize the training set result. To visualize the training set result we will plot a graph for the
Random forest classifier. The classifier will predict yes or No for the users who have either Purchased or Not
purchased the SUV car as we did in Logistic Regression. Below is the code for it:
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
    nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
    alpha = 0.75, cmap = ListedColormap(('purple','green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Random Forest Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

6. Visualizing the test set result


Now we will visualize the test set result. Below is the code for it:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
    nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
    alpha = 0.75, cmap = ListedColormap(('purple','green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Random Forest Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Ensemble Learning
Ensemble learning helps improve machine learning results by combining several models. This approach
allows better predictive performance compared to a single model. The basic idea is to learn a
set of classifiers (experts) and to allow them to vote.

Advantage: Improvement in predictive accuracy.


Disadvantage: It is difficult to understand an ensemble of classifiers.

Main Challenge for Developing Ensemble Models?


The main challenge is not to obtain highly accurate base models, but rather to obtain base models which
make different kinds of errors. For example, if ensembles are used for classification, high accuracies can be
accomplished if different base models misclassify different training examples, even if the base classifier
accuracy is low.
Methods for Independently Constructing Ensembles –
 Majority Vote
 Bagging and Random Forest
 Randomness Injection
 Feature-Selection Ensembles
 Error-Correcting Output Coding

Methods for Coordinated Construction of Ensembles –


 Boosting
 Stacking

Types of Ensemble Classifier –


Bagging:
Bagging (Bootstrap Aggregation) is used to reduce the variance of a decision tree. Suppose we have a set D of d
tuples. At each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap
sample). A classifier model Mi is then learned from each training set Di, and each classifier Mi returns its class
prediction. The bagged classifier M* counts the votes and assigns the class with the most votes to X
(an unknown sample). A minimal scikit-learn sketch follows the implementation steps below.
Implementation steps of Bagging –
1. Multiple subsets are created from the original data set with equal tuples, selecting observations
with replacement.
2. A base model is created on each of these subsets.
3. Each model is learned in parallel from each training set and independent of each other.
4. The final predictions are determined by combining the predictions from all the models.
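As referenced above, here is a minimal scikit-learn sketch of bagging, assuming the x_train/x_test arrays from the earlier examples; the default base model in BaggingClassifier is a decision tree:
from sklearn.ensemble import BaggingClassifier

bagging = BaggingClassifier(n_estimators=10, random_state=0)  # 10 bootstrap models
bagging.fit(x_train, y_train)     # each model sees a different bootstrap sample
y_pred = bagging.predict(x_test)  # predictions are combined by voting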

Random Forest:
Random Forest is an extension over bagging. Each classifier in the ensemble is a decision tree classifier
and is generated using a random selection of attributes at each node to determine the split. During
classification, each tree votes and the most popular class is returned.

Implementation steps of Random Forest –


1. Multiple subsets are created from the original data set, selecting observations with replacement.
2. A subset of features is selected randomly, and whichever feature gives the best split is used to split
the node iteratively.
3. Each tree is grown to its largest extent.
4. Repeat the above steps; the prediction is given based on the aggregation of predictions from the n
trees.

Boosting in Machine Learning - Boosting and AdaBoost


Boosting is an ensemble modeling technique that attempts to build a strong classifier from the
number of weak classifiers. It is done by building a model by using weak models in series. Firstly,
a model is built from the training data. Then the second model is built which tries to correct the
errors present in the first model. This procedure is continued and models are added until either the
complete training data set is predicted correctly or the maximum number of models are added.

AdaBoost was the first really successful boosting algorithm developed for the purpose of binary
classification. AdaBoost is short for Adaptive Boosting and is a very popular boosting technique that
combines multiple “weak classifiers” into a single “strong classifier”. It was formulated by Yoav Freund
and Robert Schapire. They also won the 2003 Gödel Prize for their work.

Algorithm:
1. Initialise the dataset and assign equal weight to each of the data points.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weight of the wrongly classified data points.
4. if (got required results)
Goto step 5
else
Goto step 2
5. End

Explanation:
The above diagram explains the AdaBoost algorithm in a very simple way. Let’s try to understand it in a
stepwise process:
 B1 consists of 10 data points of two types, namely plus(+) and minus(-); 5 of them are plus(+)
and the other 5 are minus(-), and each one is assigned an equal weight initially. The first model
tries to classify the data points and generates a vertical separator line,
but it wrongly classifies 3 plus(+) as minus(-).
 B2 consists of the 10 data points from the previous model in which the 3 wrongly classified
plus(+) are weighted more so that the current model tries more to classify these pluses(+)
correctly. This model generates a vertical separator line that correctly classifies the previously
wrongly classified pluses(+) but in this attempt, it wrongly classifies three minuses(-).
 B3 consists of the 10 data points from the previous model in which the 3 wrongly classified
minus(-) are weighted more so that the current model tries more to classify these minuses(-)
correctly. This model generates a horizontal separator line that correctly classifies the previously
wrongly classified minuses(-).
 B4 combines together B1, B2, and B3 in order to build a strong prediction model which is much
better than any individual model used.
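A minimal scikit-learn sketch of AdaBoost, again assuming the x_train/x_test arrays used throughout this unit; the default weak learner is a depth-1 decision tree (a "decision stump"):
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=50, random_state=0)  # 50 weak classifiers
ada.fit(x_train, y_train)     # each round reweights the misclassified points
y_pred = ada.predict(x_test)  # weighted vote of the weak classifiers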

Making Predictions with AdaBoost


Predictions are made by calculating the weighted average of the weak classifiers.
For a new input instance, each weak learner calculates a predicted value as either +1.0 or -1.0. The predicted
values are weighted by each weak learner's stage value. The prediction for the ensemble model is taken as
the sum of the weighted predictions. If the sum is positive, the first class is predicted; if negative, the
second class is predicted.
For example, 5 weak classifiers may predict the values 1.0, 1.0, -1.0, 1.0, -1.0. From a majority vote, it looks
like the model will predict a value of 1.0, i.e., the first class. But these same 5 weak classifiers may have the stage
values 0.2, 0.5, 0.8, 0.2 and 0.9 respectively. Calculating the weighted sum of these predictions results in an
output of -0.8, which would be an ensemble prediction of -1.0, i.e., the second class.
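The same arithmetic as a few lines of Python:
predictions = [1.0, 1.0, -1.0, 1.0, -1.0]  # outputs of the 5 weak classifiers
stage_values = [0.2, 0.5, 0.8, 0.2, 0.9]   # weight (stage value) of each classifier

weighted_sum = sum(p * w for p, w in zip(predictions, stage_values))
print(weighted_sum)                        # -0.8
print(1.0 if weighted_sum > 0 else -1.0)   # ensemble prediction: the second class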

Data Preparation for AdaBoost


This section lists some heuristics for best preparing your data for AdaBoost.
 Quality Data: Because the ensemble method continues to attempt to correct misclassifications in the
training data, you need to be careful that the training data is of high quality.
 Outliers: Outliers will force the ensemble down the rabbit hole of working hard to correct for cases
that are unrealistic. These could be removed from the training dataset.
 Noisy Data: Noisy data, specifically noise in the output variable can be problematic. If possible,
attempt to isolate and clean these from your training dataset.

Support Vector Machine Algorithm


Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used
for Classification as well as Regression problems. However, primarily, it is used for Classification problems
in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional
space into classes so that we can easily put the new data point in the correct category in the future. This best
decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called
support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the below diagram, in
which two different categories are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we have used in the KNN classifier. Suppose we
see a strange cat that also has some features of dogs, so if we want a model that can accurately identify whether
it is a cat or dog, so such a model can be created by using the SVM algorithm. We will first train our model
with lots of images of cats and dogs so that it can learn about different features of cats and dogs, and then we
test it with this strange creature. So as support vector creates a decision boundary between these two data (cat
and dog) and choose extreme cases (support vectors), it will see the extreme case of cat and dog. On the basis
of the support vectors, it will classify it as a cat. Consider the below diagram:

SVM algorithm can be used for Face detection, image classification, text categorization, etc.

Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be
classified into two classes using a single straight line, then such data is termed linearly separable
data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data. If a dataset
cannot be classified using a straight line, then such data is termed non-linear data, and the classifier
used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:


Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space,
but we need to find out the best decision boundary that helps to classify the data points. This best boundary is
known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2
features (as shown in the image), then the hyperplane will be a straight line, and if there are 3 features, then the
hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e., the maximum distance between the
data points of the two classes.

Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane
are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.

How does SVM works?


Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that
has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that can classify
the pair(x1, x2) of coordinates in either green or blue. Consider the below image:

So as it is 2-d space so by just using a straight line, we can easily separate these two classes. But there can be
multiple lines that can separate these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is
called a hyperplane. The SVM algorithm finds the points of both classes closest to the line. These points
are called support vectors. The distance between the support vectors and the hyperplane is called the margin, and
the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal
hyperplane.

Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot
draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we have used two
dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:
z = x² + y²

By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below image:

Since we are in 3-d space, the boundary looks like a plane parallel to the x-axis. If we convert it back into 2d space
by setting z = 1, then it will become:

Hence, we get a circumference of radius 1 in the case of non-linear data.
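In practice, this extra dimension does not have to be added by hand: a kernel performs the mapping implicitly. A minimal sketch on synthetic concentric-circle data; make_circles and its parameters are assumptions chosen only for illustration:
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not separable by a straight line in 2d
X, y = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=0)
clf = SVC(kernel='rbf')  # the RBF kernel maps the data implicitly, like z = x² + y²
clf.fit(X, y)
print(clf.score(X, y))   # training accuracy of the non-linear boundary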

Python Implementation of Support Vector Machine


Now we will implement the SVM algorithm using Python. Here we will use the same dataset user_data,
which we have used in Logistic regression and KNN classification.

Data Pre-processing step
Till the Data pre-processing step, the code will remain the same. Below is the code:
#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('user_data.csv')
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
After executing the above code, we will pre-process the data. The code will give the dataset as:

The scaled output for the test set will be:

Fitting the SVM classifier to the training set:
Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we will
import SVC class from Sklearn.svm library. Below is the code for it:
from sklearn.svm import SVC # "Support vector classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)

In the above code, we have used kernel='linear', as here we are creating an SVM for linearly separable data.
However, we can change it for non-linear data. We then fitted the classifier to the training data (x_train,
y_train).

Output:
Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)

The model performance can be altered by changing the value of C (the regularization factor), gamma, and the
kernel.
Predicting the test set result:
Now, we will predict the output for test set. For this, we will create a new vector y_pred. Below is the code
for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)

After getting the y_pred vector, we can compare the result of y_pred and y_test to check the difference
between the actual value and predicted value.

Output: Below is the output for the prediction of the test set:

Creating the confusion matrix:


Now we will see the performance of the SVM classifier, i.e., how many incorrect predictions there are as
compared to the Logistic regression classifier. To create the confusion matrix, we need to import
the confusion_matrix function of the sklearn library. After importing the function, we will call it using a
new variable cm. The function mainly takes two parameters, y_true (the actual values) and y_pred (the
values returned by the classifier). Below is the code for it:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Output:

As we can see in the above output image, there are 66 + 24 = 90 correct predictions and 8 + 2 = 10 incorrect predictions. Therefore, we can say that our SVM model performed better than the Logistic regression model.
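As a quick sanity check, the accuracy implied by these counts can be computed either directly from the matrix cm or with sklearn's accuracy_score helper; this is a small sketch reusing the variables defined above:

from sklearn.metrics import accuracy_score

# correct predictions are on the diagonal of the confusion matrix
accuracy_from_cm = (cm[0, 0] + cm[1, 1]) / cm.sum()
print(accuracy_from_cm)                 # 0.9 for the counts above

# the same figure computed directly from the label vectors
print(accuracy_score(y_test, y_pred))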
Visualizing the training set result:
Now we will visualize the training set result, below is the code for it:
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
By executing the above code, we will get the output as:

As we can see, the above output looks similar to the Logistic regression output. We got a straight line as the hyperplane because we used a linear kernel in the classifier; as discussed above, for 2-D space the hyperplane in SVM is a straight line.
Visualizing the test set result:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
By executing the above code, we will get the output as:

As we can see in the above output image, the SVM classifier has divided the users into two regions (Purchased or Not purchased). Users who purchased the SUV appear as red scatter points in the red region, and users who did not purchase the SUV appear as green scatter points in the green region. The hyperplane has divided the users into the two classes, Purchased and Not purchased.

Large Margin Intuition
SVM Decision Boundary
Consider a case where we set the constant C to a very large value. When minimizing the optimization objective, we are then highly motivated to choose parameters so that the first term equals 0. So what would it take to make this first term equal to 0?

When the first term is equal to 0, we only need to minimize the remaining regularization term, (1/2) Σ_{j=1}^{n} θ_j^2 (ignoring θ0).

Linear separable case


The decision boundary obtained when minimizing this optimization objective will have as large a margin as possible (hence the name Large Margin Intuition).
This means SVM will choose the black decision boundary instead of the pink and green ones:

Mathematics Behind Large Margin Intuition

Vector Inner Product


p = length of projection of v onto u. p can be positive or negative.

SVM Decision Boundary
We can rewrite the optimization objective of SVM as follows,
where p^{(i)} is the projection of x^{(i)} onto the vector θ.
Simplification: θ0 = 0.
According to the illustration below, with a minimal value of the magnitude of θ, the absolute value of p must be as large as possible (hence the large margin).

In logistic regression, we take the output of the linear function and squash it into the range [0, 1] using the sigmoid function. If the squashed value is greater than a threshold (0.5), we assign the label 1; otherwise we assign the label 0. In SVM, we take the output of the linear function directly: if the output is greater than 1, we identify the point with one class, and if it is less than -1, with the other class. Since the threshold values are changed to 1 and -1 in SVM, we obtain this reinforcement range of values ([-1, 1]) which acts as the margin.
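In scikit-learn, the raw output of the linear function is exposed through the classifier's decision_function method, so the ±1 margin can be inspected directly. The sketch below reuses the linear classifier and the scaled x_test from the user_data example above; note that sklearn reports the classes as 0/1 even though the internal formulation uses -1/+1:

# signed outputs of the linear function for the first five test samples
scores = classifier.decision_function(x_test[:5])

# |score| >= 1 means the sample lies outside the margin; the sign gives the class
for s in scores:
    region = 'outside margin' if abs(s) >= 1 else 'inside margin'
    print('score = %.2f, predicted class = %d, %s' % (s, s > 0, region))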

Loss Function

In machine learning, the loss function measures the difference between the actual output and the predicted output of the model for a single training example, while the average of the loss function over all training examples is termed the cost function. The difference computed by the loss function (such as a regression loss, binary classification loss, or multiclass classification loss) is termed the error value; this error value grows with the gap between the actual and predicted values.

How Do Loss Functions Work?


The word ‘Loss’ states the penalty for failing to achieve the expected output. If the deviation of the predicted value from the expected value is large, the loss function outputs a higher number; if the deviation is small and closer to the expected value, it outputs a smaller number.

It is important to note that for some classification losses, the exact amount of deviation matters less than whether the value predicted by the model is right or wrong. Loss functions differ based on the problem statement to which machine learning is being applied. The term cost function is often used interchangeably with loss function, but it holds a slightly different meaning: a loss function is for a single training example, while a cost function is the average loss over the complete training dataset.
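The distinction becomes concrete in a small NumPy sketch; the actual and predicted values below are made-up numbers, and squared error stands in for the per-example loss:

import numpy as nm

# hypothetical actual and predicted values for four training examples
y_actual = nm.array([3.0, -0.5, 2.0, 7.0])
y_predicted = nm.array([2.5, 0.0, 2.0, 8.0])

# loss: computed per training example (squared error used here)
per_example_loss = (y_actual - y_predicted) ** 2
print(per_example_loss)        # [0.25 0.25 0.   1.  ]

# cost: the average of the per-example losses over the whole set
cost = per_example_loss.mean()
print(cost)                    # 0.375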

Types of Loss Functions in Machine Learning


Below are the different types of loss functions in machine learning:

1. Regression loss functions


Regression loss functions are used with models that predict a continuous dependent variable (Y) from independent variables (X); linear regression is the fundamental example, where we try to fit the best line through the data:
Y = b0 + b1X1 + b2X2 + … + bnXn
 X1 … Xn = Independent variables
 Y = Dependent variable

Mean Squared Error Loss


MSE(L2 error) measures the average squared difference between the actual and predicted values by the model.
The output is a single number associated with a set of values. Our aim is to reduce MSE to improve the
accuracy of the model.

Consider the linear equation y = mx + b; we can write MSE as:

MSE = (1/N) Σ_{i=1}^{N} (y^{(i)} − (m·x^{(i)} + b))^2

Here, N is the total number of data points, (1/N) Σ_{i=1}^{N} denotes the mean, y^{(i)} is the actual value, and m·x^{(i)} + b is its predicted value.
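A direct NumPy translation of this formula might look as follows; the line parameters m and b and the data points are illustrative assumptions:

import numpy as nm

def mse(y_actual, y_predicted):
    # mean of the squared differences between actual and predicted values
    return nm.mean((y_actual - y_predicted) ** 2)

# hypothetical line y = m*x + b and a few made-up data points
m, b = 2.0, 1.0
x = nm.array([1.0, 2.0, 3.0])
y_actual = nm.array([3.5, 4.5, 7.5])

print(mse(y_actual, m * x + b))   # errors of 0.5, -0.5, 0.5 give MSE = 0.25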

Mean Squared Logarithmic Error Loss (MSLE)


MSLE measures the ratio between the actual and predicted values via their logarithms. It introduces an asymmetry in the error curve: MSLE only cares about the relative (percentage) difference between actual and predicted values. It can be a good choice as a loss function when we want to predict, for example, house or bakery sales prices, where the data is continuous. The loss is calculated as the mean of the squared differences between the log-transformed actual and predicted values:

L = (1/n) Σ_{i=1}^{n} (log(y^{(i)} + 1) − log(ŷ^{(i)} + 1))^2

Mean Absolute Error (MAE)


MAE calculates the mean of the absolute differences between the actual and predicted values; that is, it measures the average magnitude of the errors in a set of predictions. The mean squared error is easier to optimize, but the absolute error is more robust to outliers (values that deviate extremely from the other observed data points).

MAE can be calculated as:

L = (1/n) Σ_{i=1}^{n} |y^{(i)} − ŷ^{(i)}|
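Both MAE and the MSLE above reduce to one-liners in NumPy; the sample arrays here are made up for illustration:

import numpy as nm

def mae(y_actual, y_predicted):
    # average magnitude of the errors
    return nm.mean(nm.abs(y_actual - y_predicted))

def msle(y_actual, y_predicted):
    # mean squared difference of the log-transformed values
    return nm.mean((nm.log(y_actual + 1) - nm.log(y_predicted + 1)) ** 2)

y_actual = nm.array([100.0, 200.0, 300.0])
y_predicted = nm.array([110.0, 190.0, 330.0])

print(mae(y_actual, y_predicted))    # (10 + 10 + 30) / 3 ≈ 16.67
print(msle(y_actual, y_predicted))   # small, since the relative errors are small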

2. Binary Classification Loss Functions
These loss functions measure the performance of a classification model in which each data point is assigned one of two labels, either 0 or 1. They can be further classified as:
Binary Cross-Entropy
It’s the default loss function for binary classification problems. Cross-entropy loss measures the performance of a classification model whose output is a probability value between 0 and 1. The cross-entropy loss increases as the predicted probability deviates from the actual label.

Hinge loss
Hinge loss can be used as an alternative to cross-entropy; it was initially developed for use with the support vector machine algorithm. Hinge loss works best for classification problems where the target values are in the set {-1, 1}. It assigns more error when there is a difference in sign between the actual and predicted values, which often results in better performance than cross-entropy for maximum-margin classifiers.

Squared Hinge loss


An extension of hinge loss that simply squares the hinge loss score. It smooths the error function and makes it numerically easier to work with, while still finding the classification boundary that gives the maximum margin between the data points of the classes. Squared hinge loss fits well for yes-or-no decision problems where the probability deviation is not the concern.

3. Multi-class Classification Loss Functions


Multi-class classification refers to predictive models in which each data point is assigned to one of more than two classes. Each class is assigned a unique value from 0 to (number_of_classes − 1). These loss functions are widely used for image and text classification problems, for example assigning an article to one of several topics.

Multi-class Cross-Entropy
In this case, the target values are in the set {0, 1, 2, …, n}. The loss computes a score based on the average difference between the actual and predicted probability distributions, and this score is minimized to reach the best possible accuracy. Multi-class cross-entropy is the default loss function for text classification problems.

Sparse Multi-class Cross-Entropy


The one-hot encoding process makes multi-class cross-entropy difficult to handle for a large number of classes or data points. Sparse cross-entropy solves this problem by computing the error directly from integer class labels, without one-hot encoding.
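A minimal sketch of the two variants, assuming a three-class problem with made-up predicted probabilities, shows that they compute the same value from different label formats:

import numpy as nm

# predicted class probabilities for two samples (each row sums to 1)
probs = nm.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])

# one-hot targets (multi-class cross-entropy)
one_hot = nm.array([[1, 0, 0],
                    [0, 1, 0]])
ce = -nm.mean(nm.sum(one_hot * nm.log(probs), axis=1))

# integer targets (sparse multi-class cross-entropy) -- no one-hot needed
labels = nm.array([0, 1])
sparse_ce = -nm.mean(nm.log(probs[nm.arange(len(labels)), labels]))

print(ce, sparse_ce)   # identical values, ≈ 0.2899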

Kullback Leibler Divergence Loss


KL divergence loss calculates the divergence between the predicted probability distribution and a baseline distribution, quantifying how much information (in bits) is lost when one is used to approximate the other. The output is a non-negative value that specifies how close the two probability distributions are. From a probabilistic point of view, KL divergence is expressed through the likelihood ratio.

Hinge Loss
The hinge loss is a specific type of cost function that incorporates a margin, or distance from the classification boundary, into the cost calculation. Even if new observations are classified correctly, they can incur a penalty if their margin from the decision boundary is not large enough. The hinge loss then increases linearly with the size of the violation.
The hinge loss is mostly associated with soft-margin support vector machines.

If you are familiar with the construction of hyperplanes and their margins in support vector machines, you probably know that margins are often defined as having a distance equal to 1 from the data-separating hyperplane. We want data points not only to fall on the correct side of the hyperplane but also to be located beyond the margin.

Support vector machines address a classification problem where observations have an outcome of either +1 or -1. The support vector machine produces a real-valued output that is negative or positive depending on which side of the decision boundary the observation falls. Only if an observation is classified correctly and its distance from the plane is larger than the margin will it incur no penalty. The distance from the hyperplane can be regarded as a measure of confidence: the further an observation lies from the plane, the more confident the classification.

For example, if an observation was associated with an actual outcome of +1, and the SVM produced an output
of 1.5, the loss would equal 0.
Contrary to methods like linear regression, where we try to find a line that minimizes the distance to the data points, an SVM tries to maximize the distance (the margin). Comparing the two approaches nicely illustrates the difference between the nature of regression and classification problems.

An observation that is located directly on the boundary would incur a loss of 1 regardless of whether the real
outcome was +1 or -1.

Observations that fall on the correct side of the decision boundary (hyperplane) but are within the margin incur
a cost between 0 and 1.

All observations that end up on the wrong side of the hyperplane incur a loss greater than 1, which increases linearly with the distance from the boundary. And as stated above, a correctly classified observation inside the margin still incurs a cost: if the actual outcome was 1 and the classifier output 0.5, the corresponding loss is 0.5 even though the classification is correct.
Now that we have a strong intuitive understanding of the hinge loss, understanding the math will be a breeze.

Hinge Loss Formula


The loss is defined according to the following formula, where t is the actual outcome (either 1 or -1) and y is the output of the classifier:

l(y) = max(0, 1 − t·y)

Let’s plug in the values from our last example. The outcome was 1, and the prediction was 0.5:

l(y) = max(0, 1 − 1·0.5) = 0.5

If, on the other hand, the outcome was -1, the loss would be higher since we’ve misclassified our example:

l(y) = max(0, 1 − (−1)·0.5) = 1.5
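These cases can be verified with a one-line Python implementation of the formula:

def hinge_loss(t, y):
    # t is the actual outcome (+1 or -1), y is the classifier output
    return max(0, 1 - t * y)

print(hinge_loss(1, 0.5))    # 0.5 -- correct side, but inside the margin
print(hinge_loss(-1, 0.5))   # 1.5 -- wrong side of the boundary
print(hinge_loss(1, 1.5))    # 0.0 -- correct side, beyond the margin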
Instead of using a labelling convention of -1 and 1, we could also use 0 and 1 and apply the cross-entropy formula to set one of the terms equal to zero, but the math works out more cleanly in the former case.
With the hinge loss defined, we are now in a position to understand the loss function of the support vector machine. Before we do that, we’ll briefly discuss why and when we actually need a cost function.

Hard Margin vs Soft Margin Support Vector Machine


In a hard margin SVM, we want to linearly separate the data without misclassification. This implies that the
data actually has to be linearly separable.

In this case, the blue and red data points are linearly separable, allowing for a hard margin classifier.
If the data is not linearly separable, hard margin classification is not applicable.
Even though support vector machines are linear classifiers, they are still able to separate data points that are
not linearly separable by applying the kernel trick.

The blue and the red data points are not linearly separable.
Furthermore, if the margin of the SVM is very small, the model is more likely to overfit. In these cases, we
can choose to cut the model some slack by allowing for misclassifications. We call this a soft margin support
vector machine. But if the model produces too many misclassifications, its utility declines. Therefore, we need
to penalize the misclassified samples by introducing a cost function.
In summary, the soft margin support vector machine requires a cost function while the hard margin SVM does
not.

SVM Cost
As established earlier for support vectors, the optimization objective of the support vector classifier is to minimize the term w, a vector orthogonal to the data-separating hyperplane onto which we project our data points:

min_w (1/2) Σ_{i=1}^{n} w_i^2

This minimization problem represents the primal form of the hard margin SVM, which doesn’t account for
classification errors.
For the soft-margin SVM, we combine the minimization objective with a loss function such as the hinge loss:

min_w (1/2) Σ_{i=1}^{n} w_i^2 + Σ_{j=1}^{m} max(0, 1 − t_j·y_j)

The first term sums over the number of features (n), while the second term sums over the number of samples in the data (m).
Here t_j is the actual outcome of sample j, and y_j is the output produced by the model as the product of the weight vector w and the data input x:

y_j = w^T x_j
The loss term has a regularizing effect on the model. But how can we control the regularization, that is, how aggressively the model should try to avoid misclassifications? To control the weight of the misclassification penalty during training, we introduce an additional parameter, C, which we multiply with the loss term:

min_w (1/2) Σ_{i=1}^{n} w_i^2 + C Σ_{j=1}^{m} max(0, 1 − t_j·y_j)

The smaller C is, the stronger the regularization. Accordingly, the model will attempt to maximize the margin
and be more tolerant towards misclassifications.

Cost function with a small regularization parameter C


If we set C to a large number, then the SVM will pursue outliers more aggressively, which potentially comes
at the cost of a smaller margin and may lead to overfitting on the training data. The classifier might be less
robust on unseen data.

Cost function with a large regularization parameter C leading to less regularization.
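In scikit-learn, this trade-off is exposed directly through the C argument of SVC; the two values below are illustrative extremes rather than recommended settings:

from sklearn.svm import SVC

# small C: strong regularization, wider margin, tolerant of misclassifications
soft_classifier = SVC(kernel='linear', C=0.01)
soft_classifier.fit(x_train, y_train)

# large C: weak regularization, narrower margin, chases individual outliers
hard_classifier = SVC(kernel='linear', C=100)
hard_classifier.fit(x_train, y_train)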

SVM Kernels
A kernel function is a method for taking data as input and transforming it into the form required for processing. The name “kernel” refers to the set of mathematical functions used in the Support Vector Machine that provide a window through which to manipulate the data. A kernel function transforms the training data so that a non-linear decision surface in the original space becomes a linear decision surface in a higher-dimensional space. Essentially, it returns the inner product between two points in a suitable feature space.
Standard Kernel Function Equation (where φ is the mapping into the feature space):
K(x, x′) = ⟨φ(x), φ(x′)⟩
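This identity can be checked numerically. The sketch below compares a degree-2 polynomial kernel, K(x, x′) = (x·x′ + 1)^2, against its explicit feature map φ for 2-D inputs; the mapping is written out by hand purely for illustration:

import numpy as nm

def phi(v):
    # explicit feature map for the degree-2 polynomial kernel (x.y + 1)^2
    x1, x2 = v
    return nm.array([1, nm.sqrt(2)*x1, nm.sqrt(2)*x2,
                     x1**2, x2**2, nm.sqrt(2)*x1*x2])

a = nm.array([1.0, 2.0])
b = nm.array([3.0, 4.0])

kernel_value = (a @ b + 1) ** 2      # computed in the original 2-D space
inner_product = phi(a) @ phi(b)      # computed in the 6-D feature space

print(kernel_value, inner_product)   # both are 144.0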

Major Kernel Functions :-


To implement kernel functions, we first have to install the scikit-learn library from the command prompt/terminal:

pip install scikit-learn

Gaussian Kernel: used to perform transformation when there is no prior knowledge about the data.

 Gaussian Kernel Radial Basis Function (RBF): same as the above kernel function, with the radial
basis method added to improve the transformation.

Gaussian Kernel Graph


Code:
from sklearn.svm import SVC
classifier = SVC(kernel ='rbf', random_state = 0)
# fit the classifier on the training set (x: features, y: labels)
classifier.fit(x_train, y_train)

Sigmoid Kernel: this function is equivalent to a two-layer perceptron model of a neural network,
which is used as an activation function for artificial neurons.

Sigmoid Kernel Graph

Code:

from sklearn.svm import SVC
classifier = SVC(kernel ='sigmoid')
# fit the classifier on the training set (x: features, y: labels)
classifier.fit(x_train, y_train)

Polynomial Kernel: It represents the similarity of vectors in the training set of data in a feature space
over polynomials of the original variables used in the kernel.

Polynomial Kernel Graph


Code:
from sklearn.svm import SVC
classifier = SVC(kernel ='poly', degree = 4)
# fit the classifier on the training set (x: features, y: labels)
classifier.fit(x_train, y_train)
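As a rough comparison (no tuning, accuracy as the only metric), the three kernels above can be evaluated on the same user_data train/test split from earlier:

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# train and score one classifier per kernel on the held-out test set
for kern in ('rbf', 'sigmoid', 'poly'):
    clf = SVC(kernel=kern, random_state=0)
    clf.fit(x_train, y_train)
    acc = accuracy_score(y_test, clf.predict(x_test))
    print(kern, 'accuracy:', round(acc, 3))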

