ML-Unit 1
Human Learning - Types – Machine Learning - Types - Problems not to be solved - Applications -
Languages/Tools– Issues. Preparing to Model: Introduction - Machine Learning Activities - Types of data -
Exploring structure of data - Data quality and remediation - Data Pre-processing
Machine Learning
Machine learning is a growing technology which enables computers to learn automatically from past data.
Machine learning uses various algorithms for building mathematical models and making predictions using
historical data or information. Currently, it is being used for various tasks such as image
recognition, speech recognition, email filtering, Facebook auto-tagging, recommender system, and many
more.
Machine Learning (ML) is the field of computer science with the help of which computer systems can make sense of data in much the same way as human beings do.
In simple words, ML is a type of artificial intelligence that extracts patterns out of raw data by using an algorithm or method. The main focus of ML is to allow computer systems to learn from experience without being explicitly programmed and without human intervention.
In the real world, we are surrounded by humans who can learn everything from their experiences with their
learning capability, and we have computers or machines which work on our instructions. But can a machine
also learn from experiences or past data like a human does? So here comes the role of Machine Learning.
Machine Learning is a subset of artificial intelligence that is mainly concerned with the development of algorithms which allow a computer to learn from data and past experiences on its own. The term machine learning was first introduced by Arthur Samuel in 1959. We can define it in a summarized way as:
Machine learning enables a machine to automatically learn from data, improve performance from experiences,
and predict things without being explicitly programmed.
With the help of sample historical data, which is known as training data, machine learning algorithms build
a mathematical model that helps in making predictions or decisions without being explicitly programmed.
Machine learning brings computer science and statistics together for creating predictive models. Machine
learning constructs or uses the algorithms that learn from historical data. A machine has the ability to learn
if it can improve its performance by gaining more data.
A machine learning system learns from historical data, builds prediction models, and predicts the output whenever it receives new data. The accuracy of the predicted output depends on the amount of data: a larger amount of data helps build a better model, which predicts the output more accurately.
Suppose we have a complex problem that requires predictions. Instead of writing code for it, we just feed the data to generic algorithms, and the machine builds the logic from the data and predicts the output. Machine learning has changed our way of thinking about such problems: data goes in, the algorithm builds a model, and the model produces predictions for new inputs.
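As a minimal sketch of this data-in, predictions-out flow (the numbers and the use of scikit-learn's LinearRegression here are illustrative assumptions, not part of the original text):

from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]     # historical inputs (e.g., years of experience)
y = [30, 35, 40, 45]         # historical outputs (e.g., salary in $1000s)

model = LinearRegression()
model.fit(X, y)              # the algorithm builds the logic from the data
print(model.predict([[5]]))  # predicted output for an unseen input, about [50.]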
The need for machine learning is increasing day by day, because it is capable of doing tasks that are too complex for a person to implement directly. As humans, we have limitations: we cannot access and process huge amounts of data manually. For this we need computer systems, and here machine learning comes in to make things easy for us.
We can train machine learning algorithms by providing them with huge amounts of data and letting them explore the data, construct models, and predict the required output automatically. The performance of a machine learning algorithm depends on the amount of data, and it can be measured by the cost function. With the help of machine learning, we can save both time and money.
The importance of machine learning can be easily understood by its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions by Facebook, etc. Various top companies such as Netflix and Amazon have built machine learning models that use a vast amount of data to analyze user interest and recommend products accordingly.
Machine learning can be broadly classified into four types:
Supervised learning
Unsupervised learning
Semi-supervised learning
Reinforcement learning
1) Supervised Learning
Supervised learning is commonly used in real world applications, such as face and speech recognition,
products or movie recommendations, and sales forecasting. Supervised learning can be further classified into
two types - Regression and Classification.
Regression trains on and predicts a continuous-valued response, for example predicting real estate prices. Classification attempts to find the appropriate class label, such as analyzing positive/negative sentiment, classifying male and female persons, benign and malignant tumors, secured and unsecured loans, etc.
In supervised learning, learning data comes with description, labels, targets or desired outputs and the
objective is to find a general rule that maps inputs to outputs. This kind of learning data is called labeled data.
The learned rule is then used to label new data with unknown outputs.
Supervised learning involves building a machine learning model that is based on labeled samples. For
example, if we build a system to estimate the price of a plot of land or a house based on various features, such
as size, location, and so on, we first need to create a database and label it. We need to teach the algorithm what
features correspond to what prices. Based on this data, the algorithm will learn how to calculate the price of
real estate using the values of the input features.
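As a minimal sketch of this idea (the feature values, prices, and the choice of scikit-learn's LinearRegression below are illustrative assumptions):

from sklearn.linear_model import LinearRegression

# features: [size in sq. ft., number of bedrooms]; labels: price in $1000s
X = [[1400, 3], [1600, 3], [1700, 4], [1875, 4], [2350, 5]]
y = [245, 312, 279, 308, 419]

model = LinearRegression().fit(X, y)   # learn the features -> price mapping
print(model.predict([[2000, 4]]))      # estimated price for a new house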
Supervised learning deals with learning a function from available training data. Here, a learning algorithm
analyzes the training data and produces a derived function that can be used for mapping new examples. There
are many supervised learning algorithms such as Logistic Regression, Neural networks, Support Vector
Machines (SVMs), and Naive Bayes classifiers.
Common examples of supervised learning include classifying e-mails into spam and not-spam categories,
labeling webpages based on their content, and voice recognition.
2) Unsupervised Learning
Unsupervised learning is used to detect anomalies and outliers, such as fraud or defective equipment, or to group customers with similar behaviors for a sales campaign. It is the opposite of supervised learning: there is no labeled data here.
When learning data contains only some indications without any description or labels, it is up to the coder or
to the algorithm to find the structure of the underlying data, to discover hidden patterns, or to determine how
to describe the data. This kind of learning data is called unlabeled data.
Suppose that we have a number of data points, and we want to classify them into several groups. We may not
exactly know what the criteria of classification would be. So, an unsupervised learning algorithm tries to
classify the given dataset into a certain number of groups in an optimum way.
Unsupervised learning algorithms are extremely powerful tools for analyzing data and for identifying patterns and trends. They are most commonly used for clustering similar input into logical groups. Unsupervised learning algorithms include K-means, hierarchical clustering, DBSCAN, and so on.
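A minimal clustering sketch with scikit-learn's KMeans (the points are invented; the algorithm discovers the two groups without any labels):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])  # unlabeled data points
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the two discovered group centers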
3) Semi-supervised Learning
If some learning samples are labeled but some others are not, then it is semi-supervised learning. It makes use of a small amount of labeled data together with a large amount of unlabeled data during training.
Semi-supervised learning is applied in cases where it is expensive to acquire a fully labeled dataset, while it is more practical to label a small subset. For example, it often requires skilled experts to label certain remote sensing images, and lots of field experiments to locate oil at a particular location, whereas acquiring unlabeled data is relatively easy.
4) Reinforcement Learning
Reinforcement learning is a feedback-based learning method, in which a learning agent gets a reward for each
right action and gets a penalty for each wrong action. The agent learns automatically with these feedbacks and
improves its performance. In reinforcement learning, the agent interacts with the environment and explores it.
The goal of an agent is to get the most reward points, and hence, it improves its performance.
A robotic dog, which automatically learns the movement of its limbs, is an example of reinforcement learning.
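A toy sketch of the reward/penalty feedback idea, using the Q-learning update rule (the states, actions, and rewards here are invented; this is one common formulation, not the only one):

n_states, n_actions = 5, 2
Q = [[0.0] * n_actions for _ in range(n_states)]  # agent's value estimates
alpha, gamma = 0.1, 0.9   # learning rate and discount factor

def update(state, action, reward, next_state):
    # reward > 0 for a right action, reward < 0 (a penalty) for a wrong one
    best_next = max(Q[next_state])
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

update(0, 1, reward=+1, next_state=1)  # rewarded action gains value
update(0, 0, reward=-1, next_state=0)  # penalized action loses value
print(Q[0])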
Machine learning can be seen as a branch of AI or Artificial Intelligence, since the ability to turn experience into expertise or to detect patterns in complex data is a mark of human or animal intelligence.
As a field of science, machine learning shares common concepts with other disciplines such as statistics,
information theory, game theory, and optimization.
As a subfield of information technology, its objective is to program machines so that they will learn.
However, it should be noted that the purpose of machine learning is not to build an automated duplicate of intelligent behavior, but to use the power of computers to complement and supplement human intelligence.
For example, machine learning programs can scan and process huge databases detecting patterns that are
beyond the scope of human perception.
Machine Learning at present:
Machine learning research has now advanced greatly, and it is present everywhere around us, such as in self-driving cars, Amazon Alexa, chatbots, recommender systems, and many more. It includes supervised, unsupervised, and reinforcement learning, with clustering, classification, decision tree, SVM algorithms, etc.
Modern machine learning models can be used for making various predictions, including weather prediction,
disease prediction, stock market analysis, etc.
Prerequisites
Before learning machine learning, you should have basic knowledge of programming and of mathematics (linear algebra, probability, and statistics) so that you can easily understand its concepts.
Issues in Machine Learning
While machine learning is rapidly evolving, making significant strides in cybersecurity and autonomous cars, this segment of AI as a whole still has a long way to go, because ML has not yet been able to overcome a number of challenges. The challenges that ML currently faces are:
Quality of data − Having good-quality data for ML algorithms is one of the biggest challenges. Use of low-quality data leads to problems related to data pre-processing and feature extraction.
Time-consuming task − Another challenge faced by ML models is the consumption of time, especially for data acquisition, feature extraction, and retrieval.
Lack of specialist persons − As ML technology is still in its infancy, finding expert resources is difficult.
No clear objective for formulating business problems − Having no clear objective and well-defined goal for business problems is another key challenge for ML, because this technology is not that mature yet.
Issue of overfitting & underfitting − If the model is overfitting or underfitting, it cannot represent the problem well.
Curse of dimensionality − Another challenge ML model faces is too many features of data points. This can
be a real hindrance.
Difficulty in deployment − Complexity of the ML model makes it quite difficult to be deployed in real life.
Applications of Machine learning
1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, digital images, etc. A popular use case of image recognition and face detection is automatic friend tagging suggestions:
Facebook provides a feature of automatic friend tagging suggestions. Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with names, and the technology behind this is machine learning's face detection and recognition algorithm.
It is based on the Facebook project named "DeepFace", which is responsible for face recognition and person identification in the picture.
2. Speech Recognition
While using Google, we get an option of "Search by voice"; this comes under speech recognition, and it's a popular application of machine learning.
Speech recognition is a process of converting voice instructions into text, and it is also known as "Speech to
text", or "Computer speech recognition." At present, machine learning algorithms are widely used by various
applications of speech recognition. Google assistant, Siri, Cortana, and Alexa are using speech recognition
technology to follow the voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions.
It predicts the traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, with the help of two kinds of information:
Real-time location of the vehicle from the Google Maps app and sensors
Average time taken on past days at the same time of day
Everyone who uses Google Maps is helping to make the app better. It takes information from the user and sends it back to its database to improve performance.
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies, such as Amazon, Netflix, etc., for product recommendations to the user. Whenever we search for some product on Amazon, we start getting advertisements for the same product while surfing the internet on the same browser, and this is because of machine learning.
Google understands the user's interest using various machine learning algorithms and suggests products as per that interest.
Similarly, when we use Netflix, we find recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars, where machine learning plays a significant role. Tesla, the most popular car manufacturing company working on self-driving cars, uses deep-learning methods to train its car models to detect people and objects while driving.
6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We always receive important mail in our inbox, marked with the important symbol, and spam emails in our spam box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:
Content Filter
Header filter
General blacklists filter
Rules-based filters
Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes classifier
are used for email spam filtering and malware detection.
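As a hedged sketch of how a Naive Bayes spam filter can be built (the tiny e-mail corpus below is made up; real filters train on far more data):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win money now", "meeting at noon", "cheap pills win", "project update"]
labels = ["spam", "normal", "spam", "normal"]

vec = CountVectorizer()
X = vec.fit_transform(emails)          # bag-of-words features
clf = MultinomialNB().fit(X, labels)   # learn word patterns per class
print(clf.predict(vec.transform(["win cheap money"])))  # -> ['spam']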
7. Virtual Personal Assistant:
Virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri also rely on machine learning. These assistants record our voice instructions, send them to a server on the cloud, decode them using ML algorithms, and act accordingly.
8. Online Fraud Detection:
Machine learning also helps keep online transactions secure by detecting fraud. For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. Genuine transactions follow a specific pattern, which changes for a fraudulent transaction; the model detects this change and thereby makes our online transactions more secure.
9. Automatic Language Translation:
The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is used together with image recognition to translate text from one language to another.
Machine Learning Tools
Machine learning is one of the most revolutionary technologies making lives simpler. It is a subfield of Artificial Intelligence that analyses data, builds models, and makes predictions. Due to its popularity and great applications, every tech enthusiast wants to learn and build new machine learning apps. However, to build ML models, it is important to master machine learning tools. Mastering machine learning tools will enable you to play with the data, train your models, discover new methods, and create algorithms.
There are different tools, software, and platforms available for machine learning, and new software and tools are evolving day by day. Although there are many options available, choosing the best tool for your model is a challenging task; if you choose the right tool, you can make your model faster and more efficient. In this topic, we will discuss some popular and commonly used machine learning tools and their features.
Figure: Machine Learning Tools
1. TensorFlow
TensorFlow is one of the most popular open-source libraries used to train and build both machine learning and deep learning models. It also provides a JavaScript library and was developed by the Google Brain team. It is very popular among machine learning enthusiasts, who use it for building different ML applications. It offers a powerful library, tools, and resources for numerical computation, specifically for large-scale machine learning and deep learning projects. It enables data scientists and ML developers to build and deploy machine learning applications efficiently. For training and building ML models, TensorFlow provides the high-level Keras API, which lets users easily start with TensorFlow and machine learning.
2. PyTorch
PyTorch is a free and open-source machine learning framework based on the Torch library, developed by FAIR (Facebook's AI Research lab). It is one of the popular ML frameworks and can be used for various applications, including computer vision and natural language processing. PyTorch has Python and C++ interfaces; however, the Python interface is more interactive. Different deep learning software is built on top of PyTorch, such as PyTorch Lightning, Hugging Face's Transformers, Tesla Autopilot, etc.
It specifies a Tensor class containing an n-dimensional array that can perform tensor computations along with
GPU support.
3. Google Cloud ML Engine
Features:
Below are the top features:
Provides machine learning model training, building, deep learning, and predictive modelling.
The two services, namely prediction and training, can be used independently or in combination.
It can be used by enterprises, e.g., for identifying clouds in a satellite image or responding faster to customer emails.
It can be widely used to train complex models.
4. Amazon Machine Learning (AML)
Amazon provides a great number of machine learning tools, and one of them is Amazon Machine Learning, or AML. Amazon Machine Learning (AML) is a cloud-based and robust machine learning software application, widely used for building machine learning models and making predictions. Moreover, it integrates data from multiple sources, including Redshift, Amazon S3, and RDS.
5. Accord.Net
Accord.Net is a .Net-based machine learning framework used for scientific computing. It is combined with audio and image processing libraries written in C#. The framework provides different libraries for various ML applications, such as pattern recognition, linear algebra, and statistical data processing. Popular packages of the Accord.Net framework are Accord.Statistics, Accord.Math, and Accord.MachineLearning.
6. Apache Mahout
Apache Mahout is an open-source project of the Apache Software Foundation, used for developing machine learning applications mainly focused on linear algebra. It is a distributed linear algebra framework with a mathematically expressive Scala DSL, which enables developers to promptly implement their own algorithms. It also provides Java/Scala libraries to perform mathematical operations, mainly based on linear algebra and statistics.
Features:
Below are some top features:
It runs on top of Apache Hadoop using the MapReduce paradigm.
7. Shogun
Shogun is a free and open-source machine learning software library created by Gunnar Raetsch and Soeren Sonnenburg in 1999. The library is written in C++ and supports interfaces for different languages such as Python, R, Scala, C#, Ruby, etc., using SWIG (Simplified Wrapper and Interface Generator). Shogun focuses on kernel-based algorithms such as Support Vector Machines (SVMs) and K-means clustering for regression and classification problems. It also provides a complete implementation of Hidden Markov Models.
Features:
Below are some top features:
It provides support for the use of pre-computed kernels.
It also allows combining kernels via its Multiple Kernel Learning functionality.
It was initially designed to process huge datasets of up to 10 million samples.
It also enables users to work with interfaces in different programming languages such as Lua, Python, Java, C#, Octave, Ruby, MATLAB, and R.
8. Oryx2
Oryx2 is a realization of the lambda architecture built on Apache Kafka and Apache Spark. It is widely used for real-time large-scale machine learning projects. It is a framework for building apps, including packaged, end-to-end applications for filtering, regression, classification, and clustering. It is written in Java and built on technologies including Apache Spark, Hadoop, Tomcat, Kafka, etc. The latest version of Oryx2 is 2.8.0.
Features:
Below are some top features:
It has three tiers: a generic lambda architecture tier, a specialization on top providing ML abstractions, and an end-to-end implementation of standard ML algorithms.
The original project was Oryx1; after some upgrades, Oryx2 was launched.
It is well suited for large-scale real-time machine learning projects.
It contains three layers arranged side-by-side, named the speed layer, the batch layer, and the serving layer.
It also has a data transport layer that transfers data between the layers and receives input from external sources.
9. Apache Spark MLlib
Apache Spark MLlib is a scalable machine learning library that runs on Apache Mesos, Hadoop, Kubernetes, standalone, or in the cloud. Moreover, it can access data from different data sources. It is an open-source cluster-computing framework that offers an interface for entire clusters along with data parallelism and fault tolerance.
For optimized numerical processing of data, MLlib provides linear algebra packages such as Breeze and netlib-java. It uses a query optimizer and physical execution engine to achieve high performance with both batch and streaming data.
10. Google ML Kit for Mobile
For mobile app developers, Google brings ML Kit, which packages machine learning expertise and technology to create more robust, optimized, and personalized apps. This toolkit can be used for face detection, text recognition, landmark detection, image labelling, and barcode scanning applications. It can also work offline.
Steps to Build a Machine Learning Model
Step 1: Gathering Data
Given the problem you want to solve, you will have to investigate and obtain data that you will use to feed your machine. The quality and quantity of information you get are very important, since they will directly impact how well or badly your model will work. You may have the information in an existing database, or you may have to create it from scratch. If it is a small project, you can create a spreadsheet that can later be easily exported as a CSV file. It is also common to use web scraping to automatically collect information from various sources such as APIs.
Step 2: Data Preparation
This is a good time to visualize your data and check whether there are correlations between the different characteristics we obtained. It will be necessary to make a selection of characteristics (features), since the ones you choose will directly impact execution times and results. You can also reduce dimensions by applying PCA if necessary.
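A minimal sketch of PCA-based dimensionality reduction (the random data is purely for illustration):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)   # 100 samples with 10 characteristics
pca = PCA(n_components=3)     # keep the 3 strongest directions
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (100, 3)
print(pca.explained_variance_ratio_)  # variance kept per component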
Additionally, you must balance the amount of data we have for each result (class) so that it is significant, as otherwise the learning may be biased towards one type of response, and the model will fail when it tries to generalize knowledge.
You must also separate the data into two groups: one for training and the other for model evaluation, divided approximately in a ratio of 80/20, although this can vary depending on the case and the volume of data we have.
At this stage, you can also pre-process your data by normalizing, eliminating duplicates, and making error
corrections.
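A sketch of the 80/20 split and basic pre-processing described above, assuming a hypothetical pandas DataFrame df with a label column named "target":

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"feature": [1, 2, 2, 3, 4, 5, 6, 7, 8, 9],
                   "target":  [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]})

df = df.drop_duplicates()             # eliminate duplicate rows
X, y = df[["feature"]], df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80% training / 20% evaluation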
Step 3: Choosing a Model
There are several models that you can choose according to your objective: you might use algorithms of classification, prediction, or linear regression; clustering, e.g. k-means or k-nearest neighbors; deep learning, e.g. neural networks; Bayesian methods; etc.
There are various models to be used depending on the data you are going to process such as images, sound,
text, and numerical values. In the following table, we will see some models and their applications that you can
apply in your projects:
Model - Applications
K-means - Segmentation
Step 4: Training
You will need to train the model on your datasets to see an incremental improvement in the prediction rate. Remember to initialize the weights of your model randomly; the weights are the values that multiply or affect the relationships between the inputs and outputs, and they will be automatically adjusted by the selected algorithm the more you train.
Step 5: Evaluation
You will have to check the model you created against your evaluation dataset, which contains inputs the model has never seen, and verify the precision of your already trained model. If the accuracy is less than or equal to 50%, the model will not be useful, since it would be like tossing a coin to make decisions. If you reach 90% or more, you can have good confidence in the results the model gives you.
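A minimal evaluation sketch comparing predictions against a held-out evaluation set (both label arrays below are invented):

from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]  # model predictions

print(accuracy_score(y_true, y_pred))    # 0.8 here; ~0.5 would be coin-tossing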
Step 6: Parameter Tuning
If during the evaluation you did not obtain good predictions and your precision is not the minimum desired, it is possible that you have overfitting or underfitting problems, and you must return to the training step with a new configuration of parameters for your model. You can increase the number of times you iterate over your training data, termed epochs. Another important parameter is the "learning rate", which is usually a value that multiplies the gradient to gradually bring it closer to the global (or local) minimum and so minimize the cost function.
Increasing a value by 0.1 units instead of 0.001 is not the same thing, and it can significantly affect the model execution time. You can also indicate the maximum error allowed for your model. Training your machine can take anywhere from a few minutes to hours, or even days. These parameters are often called hyperparameters. This "tuning" is still more of an art than a science and will improve as you experiment.
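A toy illustration of how the learning rate scales each gradient step when minimizing a cost function (here f(w) = w**2, chosen only for simplicity):

def gradient(w):
    return 2 * w             # derivative of the toy cost f(w) = w**2

w, learning_rate = 5.0, 0.1
for epoch in range(50):      # each pass is one round of updates
    w -= learning_rate * gradient(w)   # step toward the minimum at w = 0
print(round(w, 6))           # close to 0; a smaller rate needs more epochs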
There are usually many parameters to adjust, and combined they can explode the number of options to try. Each algorithm has its own parameters to adjust. To name a few more, in Artificial Neural Networks (ANNs) you must define in the architecture the number of hidden layers and then gradually test with more or fewer layers and with different numbers of neurons in each layer. Getting good results here takes great effort and patience.
Step 7: Prediction
You are now ready to use your machine learning model to infer results in real-life scenarios.
Machine Learning Life Cycle
Machine learning has given computer systems the ability to learn automatically without being explicitly programmed. But how does a machine learning system work? It can be described using the machine learning life cycle, a cyclic process for building an efficient machine learning project. The main purpose of the life cycle is to find a solution to the problem or project.
Machine learning life cycle involves seven major steps, which are given below:
Gathering Data
Data preparation
Data Wrangling
Analyse Data
Train the model
Test the model
Deployment
The most important thing in the complete process is to understand the problem and to know its purpose. Therefore, before starting the life cycle, we need to understand the problem, because a good result depends on a good understanding of the problem.
In the complete life cycle process, to solve a problem we create a machine learning system called a "model", and this model is created by "training" it. But to train a model we need data; hence the life cycle starts with collecting data.
1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify and obtain all the data relevant to the problem.
In this step, we need to identify the different data sources, as data can be collected from various sources such as files, databases, the internet, or mobile devices. It is one of the most important steps of the life cycle, because the quantity and quality of the collected data will determine the efficiency of the output: the more data there is, the more accurate the prediction will be.
By performing the above task, we get a coherent set of data, also called a dataset, which will be used in further steps.
2. Data preparation
After collecting the data, we need to prepare it for further steps. Data preparation is a step where we put our
data into a suitable place and prepare it to use in our machine learning training.
In this step, first, we put all data together, and then randomize the ordering of data.
Data exploration:
It is used to understand the nature of data that we have to work with. We need to understand the characteristics,
format, and quality of data.
A better understanding of data leads to an effective outcome. In this, we find Correlations, general trends, and
outliers.
Data pre-processing:
After exploration, the data is pre-processed so that it is ready for analysis.
3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a usable format. It is the process of cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis in the next step. It is one of the most important steps of the complete process, as cleaning the data is required to address quality issues.
The data we have collected may not always be of use to us, as some of it may be irrelevant. In real-world applications, collected data may have various issues, including:
Missing Values
Duplicate data
Invalid data
Noise
It is mandatory to detect and remove the above issues, because they can negatively affect the quality of the outcome.
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step.
The aim of this step is to build a machine learning model that analyzes the data using various analytical techniques, and to review the outcome. It starts with determining the type of problem, from which we select suitable machine learning techniques such as classification, regression, cluster analysis, association, etc.; then we build the model using the prepared data and evaluate it.
Hence, in this step, we take the data and use machine learning algorithms to build the model.
5. Train Model
The next step is to train the model. In this step, we train our model to improve its performance for a better outcome of the problem.
We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can learn the various patterns, rules, and features.
6. Test Model
Once our machine learning model has been trained on a given dataset, then we test the model. In this step, we
check for the accuracy of our model by providing a test dataset to it.
Testing the model determines the percentage accuracy of the model as per the requirement of project or
problem.
7. Deployment
The last step of machine learning life cycle is deployment, where we deploy the model in the real-world
system.
If the above-prepared model is producing an accurate result as per our requirement with acceptable speed,
then we deploy the model in the real system. But before deploying the project, we will check whether it is
improving its performance using available data or not. The deployment phase is similar to making the final
report for a project.
Types of data
DATA: Data can be any unprocessed fact, value, text, sound, or picture that has not been interpreted and analyzed. Data is the most important part of Data Analytics, Machine Learning, and Artificial Intelligence. Without data, we cannot train any model, and all modern research and automation would be in vain. Big enterprises spend lots of money just to gather as much data as possible.
INFORMATION: Data that has been interpreted and manipulated and has now some meaningful inference
for the users.
Training Data: The part of data we use to train our model. This is the data that your model actually sees (both input and output) and learns from.
Validation Data: The part of data used for frequent evaluation of the model as it fits on the training dataset, and for tuning the model's hyperparameters (parameters set before the model begins learning). This data plays its part while the model is actually training.
Testing Data: Once our model is completely trained, testing data provides an unbiased evaluation. When we feed in the inputs of the testing data, our model predicts values without seeing the actual output. After prediction, we evaluate the model by comparing its predictions with the actual outputs present in the testing data. This is how we evaluate how much our model has learned from the experiences fed in as training data at training time.
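A sketch of carving a dataset into training, validation, and testing parts; the 60/20/20 ratio and the toy data are assumptions, not a universal rule:

from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# first split off 20% for testing
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# then split the remainder into training and validation (75/25 of the 80%)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 60 20 20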
Consider an example:
There’s a Shopping Mart Owner who conducted a survey for which he has a long list of questions and answers
that he had asked from the customers, this list of questions and answers is DATA. Now every time when he
wants to infer anything and can’t just go through each and every question of thousands of customers to find
something relevant as it would be time-consuming and not helpful. In order to reduce this overhead and time
wastage and to make work easier, data is manipulated through software, calculations, graphs, etc. as per own
convenience, this inference from manipulated data is Information. So, Data is a must for Information. Now
Knowledge has its role in differentiating between two individuals having the same information. Knowledge
is actually not technical content but is linked to the human thought process.
Numeric Data: If a feature represents a characteristic measured in numbers, it is called a numeric feature. Numerical data is any data where the data points are exact numbers. Statisticians also call numerical data quantitative data. This data has meaning as a measurement, such as house prices, or as a count, such as the number of residential properties in Los Angeles or how many houses were sold in the past year.
Numerical data can be characterized by continuous or discrete data. Continuous data can assume any value
within a range whereas discrete data has distinct values.
The takeaway here is that numerical data is not ordered in time. They are just numbers that we have collected.
Categorical Data: A categorical feature is an attribute that can take on one of a limited, and usually fixed, number of possible values on the basis of some qualitative property. A categorical feature is also called a nominal feature.
Categorical data represents characteristics, such as a hockey player's position, team, or hometown. Categorical data can take numerical values: for example, we might use 1 for the colour red and 2 for blue. But these numbers don't have a mathematical meaning; we can't add them together or take their average. In the context of supervised classification, categorical data would be the class label, for example whether a person is a man or a woman, or whether a property is residential or commercial.
Ordinal Data: This denotes a nominal variable with categories falling in an ordered list. Examples include clothing sizes such as small, medium, and large, or a measurement of customer satisfaction on a scale from "not at all happy" to "very happy".
Ordinal data is, in some sense, a mix of numerical and categorical data: the data still falls into categories, but those categories are ordered or ranked in some particular way. An example would be class difficulty, such as beginner, intermediate, and advanced; those three labels have a natural order of increasing difficulty. Another example is taking quantitative data and splitting it into groups, so that we have bins or categories of other types of data.
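A small pandas sketch of ordinal data (the sizes below are invented): because the categories are declared as ordered, order-aware comparisons work:

import pandas as pd

sizes = pd.Series(["small", "large", "medium", "small"])
sizes = sizes.astype(pd.CategoricalDtype(
    categories=["small", "medium", "large"], ordered=True))
print(sizes < "large")   # True for the "small"/"medium" entries, False for "large"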
Data Structures for Machine Learning
Machine learning is one of the hottest technologies used by data scientists and ML experts to deploy real-time projects. However, machine learning skills alone are not sufficient for solving real-world problems and designing a better product; you also need good exposure to data structures.
The data structure used for machine learning is quite similar to other software development fields where it is
often used. Machine Learning is a subset of artificial intelligence that includes various complex algorithms
to solve mathematical problems to a great extent. Data structure helps to build and understand these complex
problems. Understanding the data structure also helps you to build ML models and algorithms in a much more
efficient way than other ML professionals. In this topic, "Data Structure for Machine Learning", we will
discuss various concepts of data structure used in Machine Learning, along with the relationship between data
structure and ML. So, let's start with a quick overview of Data structure and Machine Learning.
The data structure is defined as the basic building block of computer programming that helps us to
organize, manage and store data for efficient search and retrieval.
In other words, the data structure is the collection of data type 'values' which are stored and organized in such
a way that it allows for efficient access and modification.
The data structure is the ordered sequence of data, and it tells the compiler how a programmer is using the
data such as Integer, String, Boolean, etc.
There are two different types of data structures: Linear and Non-linear data structures.
1. Linear Data structure:
The linear data structure is a special type of data structure that helps to organize and manage data in a specific
order where the elements are attached adjacently.
Array:
An array is one of the most basic and common data structures used in Machine Learning. It is also used in
linear algebra to solve complex mathematical problems. You will use arrays constantly in machine learning,
whether it's:
o To convert the column of a data frame into a list format in pre-processing analysis
o To order the frequency of words present in datasets.
o Using a list of tokenized words to begin clustering topics.
o In word embedding, by creating multi-dimensional matrices.
An array contains index numbers to represent an element starting from 0. The lowest index is arr[0] and
corresponds to the first element.
Let's take an example of a Python array used in machine learning. Although the Python array is quite different from arrays in other programming languages, the Python list is more popular, as it offers flexibility in data types and length. If you are using Python for ML algorithms, it's best to start your journey with arrays.
Method - Description
append() - Adds an element at the end of the list.
count() - Returns the number of occurrences of a value as an integer.
extend() - Adds the elements of a list to the end of the current list.
index() - Returns the index of the first element with the specified value.
pop() - Removes an element from a specified position using an index number.
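A quick demo of the methods from the table above:

nums = [3, 1, 3]
nums.append(4)         # [3, 1, 3, 4]
print(nums.count(3))   # 2
nums.extend([5, 6])    # [3, 1, 3, 4, 5, 6]
print(nums.index(4))   # 3  (position of the first occurrence of 4)
nums.pop(0)            # removes the element at index 0
print(nums)            # [1, 3, 4, 5, 6]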
Stacks:
Stacks are based on the concept of LIFO (Last In, First Out), also described as FILO (First In, Last Out). Although stacks are easy to learn and implement in ML models, having a good grasp of them helps in many computer science tasks, such as parsing grammars.
Stacks enable the undo and redo buttons on your computer, functioning like a stack of blog posts: there is no sense in adding a post at the bottom of the stack, and we can only check the most recent one that has been added. Addition and removal occur at the top of the stack.
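In Python, a plain list already behaves as a LIFO stack, as this small sketch shows:

stack = []
stack.append("draft 1")   # push
stack.append("draft 2")   # push
print(stack.pop())        # "draft 2", the most recent item comes off first
print(stack.pop())        # "draft 1"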
Linked List:
A linked list is a collection of several separately allocated nodes. In other words, it is a collection of data elements, where each node consists of a value and a pointer that points to the next node in the list.
In a linked list, insertion and deletion are constant-time operations and very efficient, but accessing a value is slow and often requires scanning the list. A linked list is therefore a good alternative to a dynamic array where shifting of elements would otherwise be required. An element can be inserted at the head, middle, or tail position, though reaching a middle position requires walking the list. Linked lists are easy to splice together and split apart, and a list can be converted to a fixed-length array for fast access.
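A minimal singly linked list sketch, with each node holding a value and a pointer to the next node:

class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

head = Node(1, Node(2, Node(3)))   # 1 -> 2 -> 3
head = Node(0, head)               # O(1) insertion at the head, no shifting

node = head
while node:                        # accessing a value requires scanning
    print(node.value)
    node = node.next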
Queue:
A queue is a "FIFO" (first in, first out) data structure. It is useful for modelling queuing scenarios in real-time programs, such as people waiting in line to withdraw cash at a bank. Hence, the queue is significant in programs where multiple lists of tasks need to be processed in order.
The queue data structure can be used to record the split time of a car in F1 racing.
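A minimal FIFO sketch using Python's collections.deque, which supports O(1) operations at both ends (the lap-split strings are invented):

from collections import deque

queue = deque()
queue.append("car 44 lap split")   # enqueue at the back
queue.append("car 16 lap split")
print(queue.popleft())             # "car 44 lap split": first in, first out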
2. Non-linear Data structure:
As the name suggests, in non-linear data structures, elements are not arranged in a sequence. Instead, the elements are arranged and linked with each other in a hierarchical manner, where one element can be linked with one or more elements.
1) Trees
Binary Tree:
The concept of a binary tree is very similar to that of a linked list; the difference lies in the nodes and their pointers. In a linked list, each node contains a data value and a pointer to the next node, whereas in a binary tree, each node has two pointers to subsequent nodes instead of just one.
In a binary search tree, the nodes are kept sorted, so insertion and deletion can be done with O(log n) time complexity on average. Similar to a linked list, a binary tree can also be converted to an array by tree sorting.
A binary search tree contains parent and child nodes, where the value of a left child node is always less than the value of its parent, while the value of a right child node is always greater than the parent's. Hence, the tree keeps its data sorted automatically, which makes insertion, deletion, and search efficient.
2) Graphs
A graph data structure is very useful in machine learning for link prediction. A graph consists of nodes connected by edges, which may be directed or undirected. Hence, you should have good exposure to the graph data structure for machine learning and deep learning.
3) Maps
Maps are a popular data structure in the programming world, mostly useful for fast data lookup and for keeping run times low. A map stores data as (key, value) pairs, where the key must be unique while the value can be duplicated. Each key corresponds to, or maps to, a value; hence the name Map.
In different programming languages, core libraries have built-in maps or, rather, HashMaps with different
names for each implementation.
o In Java: Maps
o In Python: Dictionaries
o C++: hash_map, unordered_map, etc.
Python dictionaries are very useful in machine learning and data science, as various functions and algorithms return a dictionary as output. Dictionaries are also much used for implementing sparse matrices, which are very common in machine learning.
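A tiny sketch of a dictionary used as a sparse vector: only the non-zero entries are stored (the indices and values are invented):

sparse = {0: 3.5, 7: 1.2, 42: -0.8}   # index -> value for non-zero entries
print(sparse.get(7, 0.0))             # 1.2
print(sparse.get(5, 0.0))             # 0.0 (implicit zero, never stored)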
4) Heap
A heap is a hierarchically ordered data structure. It is similar to a tree, but it enforces a vertical ordering rather than a horizontal one: ordering applies along the hierarchy, not across it. In a max-heap, the value of a parent node is always greater than the values of its children (left or right); in a min-heap, it is always smaller.
Here, the insertion and deletion operations are performed on the basis of promotion: an element is first inserted at the first available position at the bottom level, then compared with its parent and promoted until it reaches its correct rank. Most heap data structures can be stored in an array, with the relationships between the elements implied by their positions.
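A min-heap sketch with Python's built-in heapq module (here the smallest value is promoted to the root and popped first):

import heapq

heap = []
for value in [7, 2, 9, 1]:
    heapq.heappush(heap, value)   # insert, then sift to the correct rank
print(heapq.heappop(heap))        # 1
print(heapq.heappop(heap))        # 2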
5) Multi-dimensional Arrays (Matrices)
This is one of the most important types of data structure in linear algebra, used to handle 1-D, 2-D, 3-D, and even 4-D arrays for matrix arithmetic. Working with them requires good exposure to Python libraries such as NumPy, which is widely used for programming in deep learning.
For a machine learning professional, apart from machine learning skills, mastery of data structures and algorithms is required.
When we use machine learning for solving a problem, we need to evaluate the model performance, i.e., which
model is fastest and requires the smallest amount of space and resources with accuracy. Moreover, if a model
is built using algorithms, comparing and contrasting two algorithms to determine the best for the job is crucial
to the machine learning professional. For such cases, skills in data structures become important for ML
professionals.
Data Quality
Data quality (DQ) is the degree to which a given dataset meets a user's needs. Data quality is an important criterion for ensuring that data-driven decisions are made as accurately as possible.
High-quality data is of sufficient quantity, and has sufficient detail, to meet its intended uses. It is consistent with other sources, presented in appropriate ways, and has a high degree of completeness. Other key data quality components include accuracy, validity, and timeliness.
Machine learning algorithms trained on accurate, clean, and well-labelled data can identify the patterns hidden in the data and produce models that make highly accurate predictions. For this reason, it is very important to understand the input and to detect and address any issues affecting its quality before feeding it to a machine learning algorithm.
There are many aspects of data quality and various dimensions that one can consider when evaluating the data
at hand. Some of the most common aspects examined in the data quality assessment process are the following:
Number of missing values. Most real-world datasets contain missing values, i.e., feature entries with no data value stored. As many machine learning algorithms do not support missing values, detecting them and handling them properly can have a significant impact.
Existence of duplicate values. Duplicate values can take various formats, such as multiple entries of the same
data point, multiple instances of an entire column, and repetition of the same value in an I.D. variable. While
duplicate instances might be valid in some datasets, they often arise because of errors in the data extraction
and integration processes. Hence, it is important to detect any duplicate values and decide if they correspond
to invalid values (true duplicates) or form a valid part of the dataset.
Existence of outliers/anomalies. Outliers are data points that differ substantially from the rest of data, and
they may arise due to the diversity of the dataset or because of errors/mistakes. As machine learning algorithms
are sensitive to the range and distribution of attribute values, identifying the outliers and their nature is
important for assessing the quality of the dataset.
Existence of invalid/badly formatted values. Datasets often contain inconsistent values, such as variables with different units across the data points or variables with an incorrect data type. For example, it is often the case that special numerical variables, such as percentages and fractions, are mistakenly stored as strings, and one should detect and transform such cases so that the machine learning algorithm can work with the actual numbers.
After exploring the data to assess its quality and gain an in-depth understanding of the dataset, it is important
to resolve any detected issues before proceeding to the next stages of the machine learning pipeline. Below,
we give some of the most common ways for addressing such issues.
Handling missing values. There are different ways of dealing with missing data based on their number and
their data type:
Removing the missing data. If the number of data points containing missing values is small and the dataset is large enough, you may simply remove those data points. Also, if a variable contains a very large number of missing values, it may be removed entirely.
Imputation. If the number of missing values is too large to remove but not a substantial proportion of the variable's entries, you can replace the missing values: in a numerical variable with the mean or median of the non-missing entries, and in a categorical variable with the mode, which is the most frequent entry of the variable.
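A hedged sketch of these imputation rules with pandas, on an invented DataFrame:

import pandas as pd

df = pd.DataFrame({"age":  [25, None, 31, 40, None],
                   "city": ["Rome", "Oslo", None, "Oslo", "Oslo"]})

df["age"] = df["age"].fillna(df["age"].mean())        # numeric: mean (or median)
df["city"] = df["city"].fillna(df["city"].mode()[0])  # categorical: mode
print(df)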
Dealing with duplicate values. True duplicates, i.e., instances of the same data point, are usually removed.
In this way, the increase of the sample weight on these points is eliminated, and the risk of any artificial
inflation in the performance metrics is reduced.
Dealing with outliers. As with missing values, common methods of handling detected outliers include removing them or imputing new values. However, depending on the context of the dataset and the number of outliers, keeping them unchanged may be the most suitable course of action. For example, keeping the outliers would be suggested when their number is not very small, as they might be necessary to correctly understand the dataset.
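A small sketch of one common way to flag outliers, the 1.5*IQR rule, on invented data (whether to drop, impute, or keep them is the separate decision described above):

import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])   # 95 looks suspicious
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)   # flags the entry with value 95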
Converting badly formatted values. All malformed values should be converted and stored with the correct datatype. For example, numerical variables stored as strings are converted to the corresponding numbers, and strings that represent dates are stored as date objects. It is also important to ensure that all entries in a variable correspond to the same unit, as otherwise comparisons between the entries will not reflect true comparisons.
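A pandas sketch of such conversions (the column names and values are invented):

import pandas as pd

df = pd.DataFrame({"pct":  ["12%", "7%", "53%"],
                   "date": ["2021-01-05", "2021-02-11", "2021-03-20"]})

df["pct"] = pd.to_numeric(df["pct"].str.rstrip("%")) / 100  # string -> fraction
df["date"] = pd.to_datetime(df["date"])                     # string -> datetime
print(df.dtypes)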
Data Remediation
Data remediation is the process of cleansing, organizing, and migrating data so that it is properly protected and best serves its intended purpose. There is a misconception that data remediation simply means deleting business data that is no longer needed. It is important to remember that the key word "remediation" derives from the word "remedy", which is to correct a mistake. Since the core initiative is to correct data, the data remediation process typically involves replacing, modifying, cleansing, or deleting any "dirty" data.
Data Migration – The process of moving data between two or more systems, data formats or servers.
Data Discovery – A manual or automated process of searching for patterns in data sets to identify structured
and unstructured data in an organization’s systems.
ROT – An acronym that stands for redundant, obsolete and trivial data. According to the Association for
Intelligent Information Management, ROT data accounts for nearly 80 percent of the unstructured data that is
beyond its recommended retention period and no longer useful to an organization.
Dark Data – Any information that businesses collect, process and store, but do not use for other purposes.
Some examples include customer call records, raw survey data or email correspondences. Often, the storing
and securing of this type of data incurs more expense and sometimes even greater risk than it does value.
Dirty Data – Data that damages the integrity of the organization’s complete dataset. This can include data
that is unnecessarily duplicated, outdated, incomplete or inaccurate.
Data Overload – This is when an organization has acquired too much data, including low-quality or dark
data. Data overload makes the tasks of identifying, classifying and remediating data laborious.
Data Cleansing – Transforming data in its native state to a predefined standardized format.
Data Governance – Management of the availability, usability, integrity and security of the data stored within
an organization.
Data Pre-processing
Data pre-processing is a process of preparing the raw data and making it suitable for a machine learning model.
It is the first and crucial step while creating a machine learning model.
When creating a machine learning project, we do not always come across clean and formatted data. And while doing any operation with data, it is mandatory to clean it and put it in a formatted way. For this, we use the data pre-processing task.
Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be directly used by machine learning models. Data pre-processing is the required task for cleaning the data and making it suitable for a machine learning model, which also increases the accuracy and efficiency of the model.
1) Get the Dataset
To create a machine learning model, the first thing we require is a dataset, as a machine learning model works entirely on data. The collected data for a particular problem, in a proper format, is known as the dataset.
Datasets come in different formats for different purposes: a dataset for a business problem will differ from the dataset required for a medical problem such as liver disease, so each dataset is different from another. To use the dataset in our code, we usually put it into a CSV file, although sometimes we may also need to use an HTML or xlsx file.
CSV stands for "Comma-Separated Values"; it is a file format that allows us to save tabular data, such as spreadsheets. It is useful for huge datasets, and we can use these datasets in our programs.
Here we will use a demo dataset for data pre-processing; for practice, it can be downloaded from https://www.superdatascience.com/pages/machine-learning. For real-world problems, we can download datasets online from various sources such as https://www.kaggle.com/uciml/datasets, https://archive.ics.uci.edu/ml/index.php, etc.
We can also create our own dataset by gathering data through various APIs with Python and putting the data into a .csv file.
2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some predefined Python libraries.
These libraries are used to perform some specific jobs. There are three specific libraries that we will use for
data preprocessing, which are:
Numpy: The Numpy Python library is used for including any type of mathematical operation in the code. It is the fundamental package for scientific computation in Python and supports large, multidimensional arrays and matrices. In Python, we can import it as:
import numpy as nm
Here we have used nm as a short name for Numpy; it will be used throughout the program.
Matplotlib: The second library is matplotlib, a Python 2D plotting library, from which we need to import the sub-library pyplot. This library is used to plot any type of chart in Python. It is imported as below (plt is the conventional short name):
import matplotlib.pyplot as plt
Pandas: The last library is the Pandas library, one of the most famous Python libraries, used for importing and managing datasets. It is an open-source data manipulation and analysis library. It is imported as below:
import pandas as pd
Here, we have used pd as a short name for this library.
3) Importing the Datasets
Now we need to import the datasets that we have collected for our machine learning project. Before importing a dataset, we need to set the current directory as the working directory. In Spyder IDE, this is done by saving the Python file in the directory that contains the dataset and selecting that folder as the working directory; the folder containing the Python file and the dataset is then the working directory.
read_csv() function:
Now, to import the dataset, we will use the read_csv() function of the pandas library, which is used to read a CSV file and perform various operations on it. Using this function, we can read a CSV file locally as well as through a URL.
data_set = pd.read_csv('Dataset.csv')
Here, data_set is the name of the variable that stores our dataset, and inside the function we have passed the name of our dataset file. Once we execute the above line of code, the dataset is successfully imported into our code. We can inspect the imported dataset through the Variable Explorer section by double-clicking on data_set.
In the variable explorer view, indexing starts from 0, which is the default indexing in Python. We can also change the display format of our dataset by clicking on the format option.
Extracting dependent and independent variables:
In machine learning, it is important to distinguish the matrix of features (independent variables) from the dependent variable in the dataset. In our dataset, there are three independent variables, Country, Age, and Salary, and one dependent variable, Purchased.
To extract the independent variables, we will use the iloc[ ] method of the Pandas library. It is used to extract the required rows and columns from the dataset.
x= data_set.iloc[:,:-1].values
In the above code, the first colon (:) is used to take all the rows, and the second colon (:) is for all the columns. Here we have used :-1 because we don't want to take the last column, as it contains the dependent variable. By doing this, we will get the matrix of features.
Executing this line gives a matrix containing only the three independent variables.
y= data_set.iloc[:,3].values
Here we have taken all the rows with the last column only. It will give the array of the dependent variable.
Output:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)
4) Handling Missing Data
The next step of data preprocessing is to handle missing data in the dataset. If our dataset contains some missing data, it may create a huge problem for our machine learning model. Hence it is necessary to handle any missing values present in the dataset.
There are mainly two ways to handle missing data, which are:
By deleting the particular row: The first way is commonly used to deal with null values: we simply delete the specific row or column which contains null values. But this way is not very efficient, and removing data may lead to a loss of information, which will not give an accurate output.
By calculating the mean: In this way, we calculate the mean of the column or row which contains the missing value and put it in place of the missing value. This strategy is useful for features that have numeric data, such as age, salary, year, etc. Here, we will use this approach.
To handle missing values, we will use the Scikit-learn library in our code, which contains various tools for building machine learning models. Here we will use the SimpleImputer class of the sklearn.impute module (the Imputer class used in older versions of scikit-learn has been replaced by SimpleImputer). Below is the code for it:
#handling missing data (Replacing missing data with the mean value)
from sklearn.impute import SimpleImputer
imputer= SimpleImputer(missing_values=nm.nan, strategy='mean')
#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])
Output:
In the output, the missing values are replaced with the mean of the rest of the column values.
5) Encoding Categorical Data
Categorical data is data which has some categories; in our dataset, there are two categorical variables, Country and Purchased.
Since a machine learning model works entirely on mathematics and numbers, a categorical variable may create trouble while building the model. So it is necessary to encode these categorical variables into numbers.
Firstly, we will convert the Country variable into numeric data. To do this, we will use the LabelEncoder() class from the preprocessing library.
#Categorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
Output:
Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)
Explanation:
In the above code, we have imported the LabelEncoder class of the sklearn library. This class has successfully encoded the categories into digits.
But the Country variable has three categories, and as we can see in the above output, these categories are encoded into 0, 1, and 2. With these values, the machine learning model may assume that there is some ordering or correlation between the categories, which will produce the wrong output. So to remove this issue, we will use dummy encoding.
Dummy Variables:
Dummy variables are variables which take the values 0 or 1. The value 1 indicates the presence of that category in a particular column, and the rest of the variables become 0. With dummy encoding, we will have a number of columns equal to the number of categories.
In our dataset, we have 3 categories, so it will produce three columns having 0 and 1 values. For dummy encoding, we will use the OneHotEncoder class of the preprocessing library, as shown below.
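The categorical_features argument of OneHotEncoder shown in older tutorials has been removed from scikit-learn; a minimal sketch of the equivalent step with the current API, routing column 0 of x through a ColumnTransformer, is:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# one-hot encode column 0 (Country) and pass the remaining columns through unchanged
onehot_encoder= ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x= onehot_encoder.fit_transform(x)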
Output:
In the output, the Country variable is encoded into 0 and 1 values split across three columns. This can be seen more clearly in the variable explorer section by clicking on the x option.
For the second categorical variable, Purchased, we will only use the labelencoder object of the LabelEncoder class. Here we are not using the OneHotEncoder class because the Purchased variable has only two categories, yes or no, which are automatically encoded into 0 and 1.
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)
Output:
array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set and test set. This is one of the
crucial steps of data preprocessing as by doing this, we can enhance the performance of our machine learning
model.
Suppose we train our machine learning model on one dataset and then test it on a completely different dataset. This will create difficulties for the model, because the relationships it learned may not hold in the new data.
If we train our model very well and its training accuracy is very high, but we then provide a new dataset to it, performance will often decrease. So we always try to make a machine learning model which performs well with the training set and also with the test dataset. Here, we can define these datasets as:
Training Set: A subset of dataset to train the machine learning model, and we already know the output.
Test set: A subset of dataset to test the machine learning model, and by using the test set, model predicts the
output.
For splitting the dataset, we will use the below lines of code:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
Explanation:
o In the above code, the first line imports the function for splitting arrays of the dataset into random train and test subsets.
o In the second line, we have used four variables for our output:
o x_train: features for the training data
o x_test: features for the testing data
o y_train: dependent variable for the training data
o y_test: dependent variable for the testing data
o In the train_test_split() function, we have passed four parameters, of which the first two are the arrays of data, and test_size specifies the size of the test set. The test_size may be .5, .3, or .2, which gives the dividing ratio of training and testing sets.
o The last parameter, random_state, sets a seed for the random generator so that you always get the same result; the most used value for it is 42.
Output:
By executing the above code, we will get 4 different variables, which can be seen under the variable explorer section: the x and y arrays are divided into 4 different variables with corresponding values.
7) Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the independent variables of the dataset within a specific range. In feature scaling, we put our variables in the same range and on the same scale so that no variable dominates the others.
As we can see, the Age and Salary column values are not on the same scale. Many machine learning algorithms are based on Euclidean distance, and if we do not scale the variables, this will cause issues in our machine learning model.
If we compute the distance between any two points using age and salary, the salary values will dominate the age values and produce an incorrect result. To remove this issue, we perform feature scaling. There are two main techniques for it:
Standardization: x' = (x - mean(x)) / standard deviation(x)
Normalization: x' = (x - min(x)) / (max(x) - min(x))
For feature scaling, we will import the StandardScaler class of the sklearn.preprocessing library as:
from sklearn.preprocessing import StandardScaler
Now, we will create an object of the StandardScaler class for the independent variables or features, and then we will fit and transform the training dataset.
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
For the test dataset, we will directly apply the transform() function instead of fit_transform(), because the scaler has already been fitted on the training set.
x_test= st_x.transform(x_test)
Output:
By executing the above lines of code, we will get the scaled values for x_train and x_test. All the variables are scaled to a comparable range, with most values falling roughly between -1 and 1.
Now, in the end, we can combine all the steps together to make our complete code more understandable:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('Dataset.csv')
#Extracting Independent Variable
x= data_set.iloc[:, :-1].values
#Extracting Dependent variable
y= data_set.iloc[:, 3].values
#handling missing data (Replacing missing data with the mean value)
from sklearn.impute import SimpleImputer
imputer= SimpleImputer(missing_values=nm.nan, strategy='mean')
#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])
#for Country Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
#Encoding for dummy variables
onehot_encoder= ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x= onehot_encoder.fit_transform(x)
#encoding for purchased variable
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
#Feature Scaling of datasets
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
In the above code, we have included all the data preprocessing steps together. However, some steps or lines of code are not necessary for every machine learning model, so we can exclude them to make the code reusable across models.
UNIT II
1. Model Selection
A machine learning model is defined as a mathematical representation of the output of the training process.
Machine learning is the study of different algorithms that can improve automatically through experience &
old data and build the model. A machine learning model is similar to computer software designed to recognize
patterns or behaviors based on previous experience or data. The learning algorithm discovers patterns within
the training data, and it outputs an ML model which captures these patterns and makes predictions on new
data.
When solving a machine learning problem, we may narrow down to several candidate models. We may further be interested in the selection of:
1. The best choice among various ML algorithms (e.g., logistic regression, support vector machine, neural networks, etc.)
2. Variables for linear regression
3. Basis terms such as polynomials, splines, or wavelets in function estimation
4. The most appropriate parametric family among several alternatives
While we are at it, what should we keep in mind so that we select the best model? The two primary criteria for model selection are prediction accuracy and model interpretability.
A good model selection technique will balance between prediction accuracy and simplicity.
Usually, we aim to find the model which works best on the test dataset. But a designated test set is not available while we are still building a predictive model. To address this problem, two conventional approaches are used to estimate the test error.
1. Analytic Methods - We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting. In this group of methods, the training error is calculated first, and then a penalty is added to the training error to estimate the testing error.
2. Resampling Methods - We can directly estimate the test error using resampling methods. In resampling methods, the model is fit on one dataset and validated on the complementary dataset, and the validation error is recorded for each iteration. This process is repeated multiple times, and the mean validation error is taken as an estimate of the test error.
In simple linear models with a large number of predictors (p) and sample size (n), analytic methods perform as well as resampling methods and are computationally inexpensive.
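As a concrete instance of a resampling method, k-fold cross-validation takes only a few lines with scikit-learn; this sketch uses the built-in iris data purely for illustration, and any estimator could be substituted:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
# 5-fold cross-validation: fit on 4 folds, validate on the held-out fold,
# and average the validation scores as an estimate of test error
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())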
2. Model Training
A training model is built on a dataset that is used to train an ML algorithm. It consists of sample output data and the corresponding sets of input data that have an influence on the output. The training model is used to run the input data through the algorithm and correlate the processed output against the sample output. The result of this correlation is used to modify the model.
This iterative process is called "model fitting". The accuracy of the training dataset or the validation dataset is critical for the precision of the model.
Model training in machine learning is the process of feeding an ML algorithm with data to help it identify and learn good values for all the attributes involved. There are several types of machine learning models, of which the most common are supervised and unsupervised learning.
Supervised learning is possible when the training data contains both the input and output values. Each set of
data that has the inputs and the expected output is called a supervisory signal. The training is done based on
the deviation of the processed result from the documented result when the inputs are fed into the model.
Unsupervised learning involves determining patterns in the data. Additional data is then used to fit patterns
or clusters. This is also an iterative process that improves the accuracy based on the correlation to the
expected patterns or clusters. There is no reference output dataset in this method.
Types of ML Models
Amazon ML supports three types of ML models: binary classification, multiclass classification, and regression. The
type of model you should choose depends on the type of target that you want to predict.
Regression Model
ML models for regression problems predict a numeric value. For training regression models, Amazon ML uses the
industry-standard learning algorithm known as linear regression.
During the training process, Amazon ML automatically selects the correct learning algorithm for you, based on the
type of target that you specified in the training data source.
Model Interpretability
As an example of why interpretability matters, suppose we train a random forest machine learning model to predict whether a specific passenger survived the sinking of the Titanic in 1912. The model uses all the passenger's attributes – such as their ticket class, gender, and age – to predict whether they survived.
Now let’s say our random forest model predicts a 93% chance of survival for a particular passenger. How did it come
to this conclusion?
Random forest models can easily consist of hundreds or thousands of “trees.” This makes it nearly impossible to grasp
their reasoning.
But, we can make each individual decision interpretable using an approach borrowed from game theory.
SHAP plots show how the model used each passenger attribute and arrived at a prediction of 93% (or 0.93). In the Shapley plot, we can see the most important attributes the model factored in:
o The passenger was not in third class: survival chances increase substantially.
o The passenger was female: survival chances increase even more.
o The passenger was not in first class: survival chances fall slightly.
We can see that the model is performing as expected by combining this interpretation with what we know from
history: passengers with 1st or 2nd class tickets were prioritized for lifeboats, and women and children abandoned ship
before men.
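A minimal sketch of producing such an explanation with the shap library, assuming a fitted scikit-learn random forest called model and a passenger feature matrix X (both hypothetical names here), is:
import shap

# TreeExplainer computes Shapley values efficiently for tree ensembles
explainer = shap.TreeExplainer(model)    # model: a fitted RandomForestClassifier
shap_values = explainer.shap_values(X)   # X: the passenger feature matrix

# summarize how much each attribute pushed predictions up or down
shap.summary_plot(shap_values, X)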
By contrast, many other machine learning models cannot currently be interpreted. As machine learning is increasingly used in medicine and law, understanding why a model makes a specific decision is important:
o Fairness: if we ensure our predictions are unbiased, we prevent discrimination against under-represented groups.
o Robustness: we need to be confident the model works in every setting, and that small changes in input don't cause large or unexpected changes in output.
o Privacy: if we understand the information a model uses, we can stop it from accessing sensitive information.
o Causality: we need to know the model only considers causal relationships and doesn't pick up false correlations.
o Trust: if people understand how our model reaches its decisions, it's easier for them to trust it.
With very large datasets, more complex algorithms often prove more accurate, so there can be a trade-off between
interpretability and accuracy.
Scope of interpretability
By looking at scope, we have another way to compare models' interpretability. We can ask whether a model is globally or locally interpretable: a model is globally interpretable if it is small and simple enough for a human to understand it entirely; a model is locally interpretable if a human can trace back a single decision and understand how the model reached that decision.
A model is globally interpretable if we understand each and every rule it factors in. For example, a simple model
helping banks decide on home loan approvals might consider:
The applicant’s monthly salary,
The size of the deposit, and
The applicant’s credit rating.
A human could easily evaluate the same data and reach the same conclusion, but a fully transparent and globally
interpretable model can save time.
In contrast, a far more complicated model could consider thousands of factors, like where the applicant lives and
where they grew up, their family’s debt history, and their daily shopping habits. It might be possible to figure out why
a single home loan was denied, if the model made a questionable decision. But because of the model’s complexity, we
won’t fully understand how it comes to decisions in general. This is a locally interpretable model.
Evaluation Metrics for Classification
The following metrics are commonly used to judge a classification model:
1. Confusion matrix
2. Accuracy
3. Precision
4. Recall
5. Specificity
6. F1 score
7. Precision-Recall or PR curve
8. ROC (Receiver Operating Characteristics) curve
9. PR vs ROC curve
For simplicity, we will mostly discuss things in terms of a binary classification problem: say we have to find whether an image is of a cat or a dog, or whether a patient has cancer (positive) or is healthy (negative). Some common terms to be clear about are:
True positives (TP): Predicted positive and are actually positive.
False positives (FP): Predicted positive and are actually negative.
True negatives (TN): Predicted negative and are actually negative.
False negatives (FN): Predicted negative and are actually positive.
So let's get started!
Confusion matrix
It is simply a representation of the above four parameters in a matrix format:

                      Predicted Positive    Predicted Negative
Actual Positive              TP                    FN
Actual Negative              FP                    TN
Accuracy
Accuracy is the most commonly used metric to judge a model, yet it is not always a clear indicator of performance: Accuracy = (TP + TN) / (TP + TN + FP + FN). The worst case is when classes are imbalanced.
Take for example a cancer detection model. The chances of actually having cancer are very low. Let's say that out of 100 patients, 90 don't have cancer and the remaining 10 actually have it. We don't want to miss a patient who has cancer but goes undetected (a false negative). Yet detecting everyone as not having cancer gives an accuracy of 90% straight away; the model did nothing here but predict "cancer free" for all 100 patients.
We surely need better alternatives.
Precision
Precision is the percentage of true positive instances out of the total predicted positive instances: Precision = TP / (TP + FP). The denominator here is everything the model predicted as positive in the whole dataset. Take it as finding out "how often the model is right when it says it is right".
Recall
Recall (also called sensitivity or true positive rate) is the percentage of true positive instances out of the total actual positive instances: Recall = TP / (TP + FN). The denominator here is the actual number of positive instances present in the dataset. Take it as finding out "how many of the actual positives the model catches".
Specificity
Specificity is the percentage of true negative instances out of the total actual negative instances: Specificity = TN / (TN + FP). The denominator here is the actual number of negative instances present in the dataset. It is similar to recall, but the focus shifts to the negative instances, like finding out how many healthy patients who did not have cancer were told they don't have cancer. It is a kind of measure of how well separated the classes are.
F1 score
The F1 score is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). It takes the contribution of both, so the higher the F1 score, the better. Note that due to the product in the numerator, if either one goes low, the final F1 score goes down significantly. So a model does well on F1 if the predicted positives are actually positives (precision) and it doesn't miss positives by predicting them negative (recall).
One drawback is that precision and recall are given equal importance, whereas depending on the application we may need one higher than the other, so the F1 score may not be the exact metric to use. In that case, either a weighted F1 score or inspecting the PR or ROC curve can help.
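These metrics can be computed directly with scikit-learn; a minimal sketch with made-up labels (1 = cancer, 0 = healthy) follows:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual labels (illustrative only)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # model predictions (illustrative only)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(accuracy_score(y_true, y_pred))    # (TP + TN) / total
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
# specificity has no dedicated helper; compute it from the matrix entries
print(tn / (tn + fp))                    # TN / (TN + FP)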
PR curve
The PR curve plots precision against recall for various threshold values. The top right part of the graph is the ideal region, where we get both high precision and high recall. Based on our application, we can choose the predictor and the threshold value. PR AUC is the area under the curve; the higher its numerical value, the better.
ROC curve
ROC stands for receiver operating characteristic; the curve plots the true positive rate (TPR = TP / (TP + FN)) against the false positive rate (FPR = FP / (FP + TN)) for various threshold values. As TPR increases, FPR also increases; we want the threshold value that takes us closest to the top-left corner. Comparing different predictors on a given dataset also becomes easy, and one can choose the threshold according to the application at hand. ROC AUC is the area under the curve; the higher its numerical value, the better.
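A minimal sketch of computing both curves and their AUCs with scikit-learn, using made-up scores, follows:
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve, auc

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                    # illustrative labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9])  # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, y_score))              # ROC AUC

precision, recall, _ = precision_recall_curve(y_true, y_score)
print(auc(recall, precision))                      # PR AUC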
PR vs ROC curve
Both metrics are widely used to judge a model's performance. Which one should we use, PR or ROC?
Due to the absence of TN in the precision-recall equations, PR curves are useful for imbalanced classes, particularly when the negative class is in the majority: the metrics do not take into account the high number of true negatives of the majority negative class, giving better resistance to the imbalance. This matters when detection of the positive class is very important.
For example, when detecting cancer patients there is high class imbalance, because very few of all those diagnosed actually have it. We certainly don't want to miss a person who has cancer and goes undetected (recall), and we want to be sure that the detected ones actually have it (precision).
Due to the consideration of TN, or the negative class, in the ROC equations, ROC is useful when both classes are important to us, like the detection of cats and dogs. The importance of true negatives ensures that both classes are given weight, as in the output of a CNN model determining whether an image is of a cat or a dog.
Conclusion
The evaluation metric to use depends heavily on the task at hand. Accuracy alone, though often the first measure reached for, can be a vague indicator of performance, especially with imbalanced classes.
1. Choosing the Right ML Algorithm
Linear regression, logistic regression, decision trees, SVM, naive Bayes, kNN, k-means, random forest, dimensionality reduction algorithms, and gradient boosting are the leading ML algorithms you can choose from, according to your problem and model compatibility.
2. Quantity of Training Data
Depending on the complexity of the problem and of the learning algorithm, model skill, data size evaluation, and the use of statistical heuristic rules are the leading factors that determine the quantity and types of training data sets, which in turn help in improving the performance of the model.
3. Quality of Training Data
There are different methods to measure the quality of a training data set. Standard quality-assurance methods and detailed in-depth quality assessment are the two popular approaches you can use to ensure the quality of data sets. Data quality is important for getting unbiased decisions from ML models, so you need to make sure to use training data sets of the right quality to improve the performance of your ML model.
4. Supervised or Unsupervised ML
Apart from the ML algorithms discussed above, the performance of such AI-based models is also affected by the method or process of machine learning: supervised, unsupervised, or reinforcement learning. In supervised learning, the algorithm learns a target/outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables); in unsupervised learning, there is no target variable and the algorithm looks for patterns in the data.
Similarly, reinforcement learning is another important approach, used to train the model to make specific decisions. In this training process, the machine learns from previous experiences and tries to store the most suitable knowledge for making the right predictions.
5. Model Validation and Testing
There are various validation techniques you can follow, but you need to make sure to choose the one best suited for validating your ML model; it will help you improve the model's overall performance and let it predict in an unbiased manner. Similarly, testing of the model is also important to ensure its accuracy and performance.
Summing up
Improving machine learning model performance will not only make the model predict in an unbiased manner but also make it more reliable and acceptable in the AI world. Hence, machine learning engineers and data scientists need to keep all these points in mind while working on such models, to improve the overall performance of the AI model.
Feature Transformation
Feature transformation refers to the family of algorithms that create new features using the existing features. These new features may not have the same interpretation as the original features, but they may have more explanatory power in a different space than in the original space. Feature transformation can also be used for feature reduction. It can be done in many ways: by linear combinations of the original features or by using non-linear functions. It helps machine learning algorithms converge faster.
As we know, the Normal distribution is a very important distribution in statistics, key to solving many statistical problems. Data distributions in nature often follow a Normal distribution, e.g., age, income, height, and weight. But the features in real-life data are frequently not normally distributed; still, the Normal distribution is the best approximation when we are unaware of the underlying distribution pattern.
4. Square Root Transformation: This transformation is defined only for positive numbers. It can be used to reduce the skewness of right-skewed data. This transformation is weaker than the log transformation.
5. Custom Transformation: A FunctionTransformer forwards its X (and optionally y) arguments to a user-defined function or function object and returns this function's result. The resulting transformer will not be picklable if a lambda is used as the function. This is useful for stateless transformations such as taking the log of frequencies, doing custom scaling, etc. (a short sketch follows after this list).
6. Power Transformations: Power transforms are a family of parametric, monotonic transformations that make data more Gaussian-like. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood. This is useful for modelling issues related to non-constant variance or other situations where normality is desired. Currently, PowerTransformer supports the Box-Cox transform and the Yeo-Johnson transform. Box-Cox requires the input data to be strictly positive (not even zero is acceptable), while Yeo-Johnson supports both positive and negative data.
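A small sketch of both of these transformations with scikit-learn follows; the toy matrix X is made up for illustration:
import numpy as np
from sklearn.preprocessing import FunctionTransformer, PowerTransformer

X = np.array([[1.0, 200.0], [2.0, 800.0], [3.0, 3200.0]])

# Custom transformation: forward X to a user-defined function (here log1p).
# Using a named function instead of a lambda keeps the transformer picklable.
log_transformer = FunctionTransformer(np.log1p)
print(log_transformer.fit_transform(X))

# Power transformation: Yeo-Johnson works for positive and negative data;
# the optimal parameter is estimated by maximum likelihood during fit.
pt = PowerTransformer(method='yeo-johnson')
print(pt.fit_transform(X))
# method='box-cox' could be used instead, but only for strictly positive data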
Feature Selection
There are mainly two types of feature selection techniques:
o Supervised Feature Selection techniques consider the target variable and can be used with labelled datasets.
o Unsupervised Feature Selection techniques ignore the target variable and can be used with unlabelled datasets.
1. Wrapper Methods
In the wrapper methodology, selection of features is treated as a search problem, in which different combinations are made, evaluated, and compared with other combinations. The algorithm is trained iteratively on subsets of features: on the basis of the model's output, features are added or removed, and the model is trained again with the new feature set.
2. Filter Methods
In filter methods, features are selected on the basis of statistical measures. This method does not depend on the learning algorithm and chooses the features as a pre-processing step. The filter method filters out irrelevant features and redundant columns by ranking them with different metrics. The advantage of filter methods is that they need little computational time and do not overfit the data. Some common techniques of filter methods are:
o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio
Chi-square Test: The chi-square test is a technique for determining the relationship between categorical variables. The chi-square value is calculated between each feature and the target variable, and the desired number of features with the best chi-square values is selected (see the sketch after this list).
Fisher's Score:
Fisher's score is one of the popular supervised techniques for feature selection. It returns the ranks of the variables on Fisher's criterion in descending order; we can then select the variables with a large Fisher's score.
Missing Value Ratio:
The missing value ratio can be used for evaluating a feature set against a threshold value. The ratio is obtained as the number of missing values in a column divided by the total number of observations. Variables whose ratio exceeds the threshold can be dropped.
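A short sketch of two of these filters follows, using scikit-learn's SelectKBest with the chi-square score and a pandas computation of the missing value ratio; the toy data is made up for illustration:
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Chi-square filter: keep the k features with the best chi-square score.
# chi2 requires non-negative features (e.g. counts or encoded categories).
X = np.array([[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 2]])
y = np.array([0, 1, 1, 0])
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_new.shape)  # (4, 2): only the two best-scoring columns remain

# Missing value ratio: drop columns whose fraction of missing values
# exceeds a chosen threshold (here 0.5)
df = pd.DataFrame({'a': [1, 2, None, 4], 'b': [None, None, None, 1]})
ratio = df.isnull().sum() / len(df)
df = df.drop(columns=ratio[ratio > 0.5].index)
print(df.columns.tolist())  # ['a']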
3. Embedded Methods
Embedded methods combine the advantages of both filter and wrapper methods by considering the interaction of features along with low computational cost. They are fast processing methods similar to filter methods, but more accurate. These methods are also iterative: each iteration is evaluated, and the features that contribute the most to training in that iteration are found optimally. Common techniques of embedded methods are regularization (e.g., L1/LASSO penalties, which shrink the coefficients of unimportant features towards zero) and tree-based feature importance.
For machine learning engineers, it is very important to understand which feature selection method will work properly for their model. The more we know about the data types of our variables, the easier it is to choose the appropriate statistical measure for feature selection. To know this, we first need to identify the types of input and output variables. In machine learning, variables are of mainly two types:
o Numerical variables: variables with a continuous or integer range of values (e.g., age, salary)
o Categorical variables: variables with a set of discrete labels (e.g., boolean, ordinal, nominal)
Below are some univariate statistical measures which can be used for filter-based feature selection:
1. Numerical Input, Numerical Output: Numerical input variables with a numerical output correspond to predictive regression modelling. The common measure used in this case is the correlation coefficient:
o Pearson's correlation coefficient (for linear correlation)
o Spearman's rank coefficient (for non-linear correlation)
2. Numerical Input, Categorical Output: Numerical input with categorical output corresponds to classification predictive modelling problems. In this case, correlation-based techniques are again used, but with a categorical output:
o ANOVA correlation coefficient (linear)
o Kendall's rank coefficient (non-linear)
3. Categorical Input, Numerical Output: This is the case of regression predictive modelling with categorical input. It is an unusual example of a regression problem; we can use the same measures as in the case above, but in reverse order.
Conclusion
Feature selection is a complicated and vast field of machine learning, and many studies have already been made to discover the best methods. There is no fixed rule for the best feature selection method; choosing a method depends on the machine learning engineer, who can combine and innovate approaches to find the best method for a specific problem. One should try a variety of model fits on different subsets of features selected through different statistical measures.
Machine Learning MC4301
UNIT III
BAYESIAN LEARNING
INTRODUCTION
Bayesian learning methods are relevant to the study of machine learning for two different reasons:
1. First, Bayesian learning algorithms that calculate explicit probabilities for hypotheses, such as the naive Bayes classifier, are among the most practical approaches to certain types of learning problems.
2. Second, they provide a useful perspective for understanding many learning algorithms that do not explicitly manipulate probabilities.
Features of Bayesian learning methods:
o Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct. This provides a more flexible approach to learning than algorithms that completely eliminate a hypothesis if it is found to be inconsistent with any single example.
o Prior knowledge can be combined with observed data to determine the final probability of a hypothesis. In Bayesian learning, prior knowledge is provided by asserting (1) a prior probability for each candidate hypothesis, and (2) a probability distribution over observed data for each possible hypothesis.
o Bayesian methods can accommodate hypotheses that make probabilistic predictions.
o New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.
o Even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other practical methods can be measured.
1. One practical difficulty in applying Bayesian methods is that they typically require
initial knowledge of many probabilities. When these probabilities are not known in
advance they are often estimated based on background knowledge, previously available
data, and assumptions about the form of the underlying distributions.
2. A second practical difficulty is the significant computational cost required to determine
the Bayes optimal hypothesis in the general case. In certain specialized situations, this
computational cost can be significantly reduced.
BAYES THEOREM
Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior
probability, the probabilities of observing various data given the hypothesis, and the observed
data itself.
Notations
o P(h): prior probability of h; reflects any background knowledge about the chance that h is correct
o P(D): prior probability of D; the probability that D will be observed
o P(D|h): probability of observing D given a world in which h holds
o P(h|D): posterior probability of h; reflects our confidence that h holds after D has been observed
Bayes theorem is the cornerstone of Bayesian learning methods because it provides a way to calculate the posterior probability P(h|D) from the prior probability P(h), together with P(D) and P(D|h):
P(h|D) = P(D|h) P(h) / P(D)
o P(h|D) increases with P(h) and with P(D|h), according to Bayes theorem.
o P(h|D) decreases as P(D) increases, because the more probable it is that D will be observed independent of h, the less evidence D provides in support of h.
In many learning scenarios, the learner considers some set of candidate hypotheses H and is interested in finding the most probable hypothesis h ∈ H given the observed data D. Any such maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis. Using Bayes theorem to calculate the posterior probability of each candidate hypothesis, hMAP is a MAP hypothesis provided
hMAP = argmax h∈H P(h|D) = argmax h∈H P(D|h) P(h) / P(D) = argmax h∈H P(D|h) P(h)
(in the last step, P(D) is dropped because it is a constant independent of h). P(D|h) is often called the likelihood of the data D given h, and any hypothesis that maximizes P(D|h) is called a maximum likelihood (ML) hypothesis:
hML = argmax h∈H P(D|h)
Example
Consider a medical diagnosis problem in which there are two alternative hypotheses: (1) that the patient has a particular form of cancer, and (2) that the patient does not. The available data is from a particular laboratory test with two possible outcomes: + (positive) and - (negative).
We have prior knowledge that over the entire population of people only .008 have this disease. Furthermore, the lab test is only an imperfect indicator of the disease: it returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. In other cases, the test returns the opposite result.
The above situation can be summarized by the following probabilities:
P(cancer) = 0.008        P(¬cancer) = 0.992
P(+|cancer) = 0.98       P(-|cancer) = 0.02
P(+|¬cancer) = 0.03      P(-|¬cancer) = 0.97
Suppose a new patient is observed for whom the lab test returns a positive (+) result. Should we diagnose the patient as having cancer or not? Applying the MAP rule:
P(+|cancer) P(cancer) = 0.98 × 0.008 = 0.0078
P(+|¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298
Thus hMAP = ¬cancer: even with a positive test, it is more probable that the patient does not have cancer. The exact posterior probabilities can also be determined by normalizing the above quantities so that they sum to 1; for example, P(cancer|+) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21.
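As a quick check, the posterior above can be recomputed in a few lines of Python:
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

score_cancer = p_pos_given_cancer * p_cancer   # 0.0078
score_not = p_pos_given_not * p_not_cancer     # 0.0298
# normalized posterior P(cancer | +): ~0.21, so hMAP is "no cancer"
print(score_cancer / (score_cancer + score_not))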
What is the relationship between Bayes theorem and the problem of concept learning?
Since Bayes theorem provides a principled way to calculate the posterior probability of each hypothesis given the training data, we can use it as the basis for a straightforward learning algorithm that calculates the probability of each possible hypothesis and then outputs the most probable one. We can design such a brute-force concept learning algorithm to output the maximum a posteriori hypothesis, based on Bayes theorem, as follows:
BRUTE-FORCE MAP LEARNING algorithm:
1. For each hypothesis h in H, calculate the posterior probability P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis hMAP with the highest posterior probability: hMAP = argmax h∈H P(h|D)
In order to specify a learning problem for the BRUTE-FORCE MAP LEARNING algorithm, we must specify what values are to be used for P(h) and for P(D|h). Let's choose P(h) and P(D|h) to be consistent with the following assumptions:
o The training data D is noise free (i.e., di = c(xi))
o The target concept c is contained in the hypothesis space H
o We have no a priori reason to believe that any hypothesis is more probable than any other
Given these assumptions, we choose P(h) = 1/|H| for every h in H, and P(D|h) = 1 if h is consistent with D (i.e., di = h(xi) for each di in D) and 0 otherwise. We now have a fully defined problem for the BRUTE-FORCE MAP LEARNING algorithm.
To summarize, Bayes theorem implies that the posterior probability P(h|D) under our assumed P(h) and P(D|h) is
P(h|D) = 1/|VSH,D| if h is consistent with D, and 0 otherwise
where VSH,D (the version space) is the subset of hypotheses from H that are consistent with D.
Example:
Because FIND-S outputs a consistent hypothesis, it will output a MAP hypothesis under the probability distributions P(h) and P(D|h) defined above.
Are there other probability distributions for P(h) and P(D|h) under which FIND-S outputs MAP hypotheses? Yes. Because FIND-S outputs a maximally specific hypothesis from the version space, its output hypothesis will be a MAP hypothesis relative to any prior probability distribution that favours more specific hypotheses.
Note
o The Bayesian framework is a way to characterize the behaviour of learning algorithms.
o By identifying probability distributions P(h) and P(D|h) under which the output is an optimal hypothesis, the implicit assumptions of the algorithm (its inductive bias) can be characterized.
o Inductive inference is modelled by an equivalent probabilistic reasoning system based on Bayes theorem.
MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR HYPOTHESES
Consider the problem of learning a continuous-valued target function, as in neural network learning, linear regression, and polynomial curve fitting. A straightforward Bayesian analysis shows that, under certain assumptions, any learning algorithm that minimizes the squared error between the output hypothesis predictions and the training data will output a maximum likelihood (ML) hypothesis.
Assuming the training examples are mutually independent given h, we can write P(D|h) as the product of the various p(di|h):
hML = argmax h∈H ∏i p(di|h)
Given that the noise ei obeys a Normal distribution with zero mean and unknown variance σ², each di must also obey a Normal distribution centred around the true target value f(xi). Because we are writing the expression for P(D|h), we assume h is the correct description of f; hence, µ = f(xi) = h(xi).
It is easier to maximize the less complicated logarithm of this expression, which is justified by the monotonicity of the logarithm. The first term of the resulting expression is a constant independent of h and can be discarded, yielding
hML = argmax h∈H Σi -(1/(2σ²)) (di - h(xi))²
Maximizing this negative quantity is equivalent to minimizing the corresponding positive quantity, and the constant factor 1/(2σ²) can also be discarded:
hML = argmin h∈H Σi (di - h(xi))²
Thus, the above equation shows that the maximum likelihood hypothesis hML is the one that minimizes the sum of the squared errors between the observed training values di and the hypothesis predictions h(xi).
Note:
Why is it reasonable to choose the Normal distribution to characterize noise?
o It is a good approximation of many types of noise in physical systems.
o The Central Limit Theorem shows that the sum of a sufficiently large number of independent, identically distributed random variables itself obeys a Normal distribution.
o Only noise in the target value is considered, not noise in the attributes describing the instances themselves.
MAXIMUM LIKELIHOOD HYPOTHESES FOR PREDICTING PROBABILITIES
Suppose we want to learn a nondeterministic target function f': X → [0, 1], whose value is the probability that f(x) = 1. What criterion should we optimize in order to find a maximum likelihood hypothesis for f' in this setting?
First, obtain an expression for P(D|h). Assume the training data D is of the form D = {(x1, d1) . . . (xm, dm)}, where di is the observed 0 or 1 value for f(xi). Treating both xi and di as random variables, and assuming that each training example is drawn independently, we can write P(D|h) as a product over the training examples. Substituting Equation (4) for P(di|h, xi) into this product (Equation (5)) yields Equation (7), which describes the quantity that must be maximized in order to obtain the maximum likelihood hypothesis in this setting: the quantity G(h, D) = Σi di ln h(xi) + (1 - di) ln(1 - h(xi)), the familiar cross-entropy criterion.
Gradient search to maximize likelihood in a neural net
We can derive a weight-training rule for neural network learning that seeks to maximize G(h, D) using gradient ascent. The gradient of G(h, D) is given by the vector of partial derivatives of G(h, D) with respect to the various network weights that define the hypothesis h represented by the learned network. Suppose our neural network is constructed from a single layer of sigmoid units, and let xijk denote the kth input to unit j for the ith training example. After the derivative of the sigmoid squashing function is folded in, the partial derivative of G(h, D) with respect to the weight wjk from input k to unit j simplifies to
∂G(h, D)/∂wjk = Σi (di - h(xi)) xijk
Because we seek to maximize rather than minimize P(D|h), we perform gradient ascent rather than gradient descent search. On each iteration of the search, the weight vector is adjusted in the direction of the gradient, using the weight update rule
wjk ← wjk + Δwjk, with Δwjk = η Σi (di - h(xi)) xijk
where η is a small positive constant that determines the step size of the gradient ascent search.
MINIMUM DESCRIPTION LENGTH PRINCIPLE
Recall that hMAP = argmax h∈H P(D|h) P(h), which can equivalently be written as
hMAP = argmin h∈H (-log2 P(D|h) - log2 P(h)) ... (1)
Equation (1) can be interpreted as a statement that short hypotheses are preferred, assuming a particular representation scheme for encoding hypotheses and data:
o -log2 P(h): the description length of h under the optimal encoding for the hypothesis space H, LCH(h) = -log2 P(h), where CH is the optimal code for hypothesis space H.
o -log2 P(D|h): the description length of the training data D given hypothesis h, under the optimal encoding for the data given the hypothesis, LCD|h(D|h) = -log2 P(D|h), where CD|h is the optimal code for describing data D assuming that both the sender and receiver know the hypothesis h.
Rewriting Equation (1) shows that hMAP is the hypothesis h that minimizes the sum given by the description length of the hypothesis plus the description length of the data given the hypothesis:
hMAP = argmin h∈H LCH(h) + LCD|h(D|h)
where CH and CD|h are the optimal encodings for H and for D given h, respectively.
The Minimum Description Length (MDL) principle recommends choosing the hypothesis that minimizes the sum of these two description lengths for whatever codes C1 and C2 are used to represent the hypothesis and the data given the hypothesis:
hMDL = argmin h∈H LC1(h) + LC2(D|h)
The above analysis shows that if we choose C1 to be the optimal encoding of hypotheses CH, and if we choose C2 to be the optimal encoding CD|h, then hMDL = hMAP.
Apply the MDL principle to the problem of learning decision trees from some training data. What should we choose for the representations C1 and C2 of hypotheses and data?
o For C1: C1 might be some obvious encoding of trees, in which the description length grows with the number of nodes and the number of edges.
o For C2: Suppose that the sequence of instances (x1 . . . xm) is already known to both the transmitter and receiver, so that we need only transmit the classifications (f(x1) . . . f(xm)). If the training classifications are identical to the predictions of the hypothesis, then there is no need to transmit any information about these examples, and the description length of the classifications given the hypothesis is ZERO. If some examples are misclassified by h, then for each misclassification we need to transmit a message that identifies which example is misclassified, as well as its correct classification.
The hypothesis hMDL under the encodings C1 and C2 is just the one that minimizes the sum of these description lengths.
NAIVE BAYES CLASSIFIER
The naive Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V. A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values (a1, a2 . . . am). The learner is asked to predict the target value, or classification, for this new instance.
The Bayesian approach to classifying the new instance is to assign the most probable target value, vMAP, given the attribute values (a1, a2 . . . am) that describe the instance:
vMAP = argmax vj∈V P(vj | a1, a2 . . . am) = argmax vj∈V P(a1, a2 . . . am | vj) P(vj)
The naive Bayes classifier is based on the assumption that the attribute values are conditionally independent given the target value. That is, given the target value of the instance, the probability of observing the conjunction (a1, a2 . . . am) is just the product of the probabilities of the individual attributes:
vNB = argmax vj∈V P(vj) ∏i P(ai | vj)
where vNB denotes the target value output by the naive Bayes classifier.
An Illustrative Example
Let us apply the naive Bayes classifier to a concept learning problem, i.e., classifying days according to whether someone will play tennis. The training data is the standard set of 14 examples of the target concept PlayTennis, where each day is described by the attributes Outlook, Temperature, Humidity, and Wind.
Use the naive Bayes classifier and this training data to classify the following novel instance:
< Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong >
Our task is to predict the target value (yes or no) of the target concept PlayTennis for this new instance. The probabilities of the different target values can easily be estimated based on their frequencies over the 14 training examples:
P(PlayTennis = yes) = 9/14 = 0.64
P(PlayTennis = no) = 5/14 = 0.36
Similarly, we estimate the conditional probabilities; for example, those for Wind = strong:
P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
Multiplying the class priors by the conditional probabilities of the observed attribute values gives
P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.0206
Thus, the naive Bayes classifier assigns the target value PlayTennis = no to this new instance, based on the probability estimates learned from the training data. By normalizing the above quantities to sum to one, we can calculate the conditional probability that the target value is no, given the observed attribute values: 0.0206 / (0.0206 + 0.0053) = 0.795.
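The same computation can be reproduced in a few lines of Python; the conditional probabilities other than those for Wind = strong are taken from the standard PlayTennis table and should be treated as assumptions here:
p_yes, p_no = 9/14, 5/14
# P(sunny|yes), P(cool|yes), P(high|yes), P(strong|yes)
likelihood_yes = (2/9) * (3/9) * (3/9) * (3/9)
# P(sunny|no), P(cool|no), P(high|no), P(strong|no)
likelihood_no = (3/5) * (1/5) * (4/5) * (3/5)

score_yes = p_yes * likelihood_yes   # ~0.0053
score_no = p_no * likelihood_no      # ~0.0206
print('no' if score_no > score_yes else 'yes')   # naive Bayes prediction
print(score_no / (score_yes + score_no))         # ~0.795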
Estimating Probabilities
Up to now we have estimated probabilities by the fraction of times an event is observed to occur over the total number of opportunities. For example, in the above case we estimated P(Wind = strong | PlayTennis = no) by the fraction nc/n, where n = 5 is the total number of training examples for which PlayTennis = no, and nc = 3 is the number of these for which Wind = strong.
When nc = 0, nc/n is zero, and this probability term will dominate the naive Bayes quantity, since calculating it requires multiplying all the other probability terms by this zero value. To avoid this difficulty, we can adopt a Bayesian approach to estimating the probability, using the m-estimate defined as follows:
m-estimate of probability: (nc + m·p) / (n + m)
where p is our prior estimate of the probability we wish to determine, and m is a constant called the equivalent sample size, which determines how heavily to weight p relative to the observed data.
The naive Bayes classifier makes significant use of the assumption that the values of the attributes a1 . . . an are conditionally independent given the target value v. This assumption dramatically reduces the complexity of learning the target function.
BAYESIAN BELIEF NETWORKS
A Bayesian belief network describes the probability distribution governing a set of variables by specifying a set of conditional independence assumptions along with a set of conditional probabilities. Bayesian belief networks allow stating conditional independence assumptions that apply to subsets of the variables.
Notation
o Consider an arbitrary set of random variables Y1 . . . Yn, where each variable Yi can take on the set of possible values V(Yi).
o The joint space of the set of variables Y is the cross product V(Y1) × V(Y2) × . . . × V(Yn). In other words, each item in the joint space corresponds to one of the possible assignments of values to the tuple of variables (Y1 . . . Yn). The probability distribution over this joint space is called the joint probability distribution; it specifies the probability of each of the possible variable bindings for the tuple (Y1 . . . Yn).
o A Bayesian belief network describes the joint probability distribution for a set of variables.
Conditional Independence
Let X, Y, and Z be three discrete-valued random variables. X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value for Z; that is,
P(X | Y, Z) = P(X | Z)
The naive Bayes classifier assumes that the instance attribute A1 is conditionally independent of instance attribute A2 given the target value V. This allows the naive Bayes classifier to calculate P(A1, A2 | V) as follows:
P(A1, A2 | V) = P(A1 | A2, V) P(A2 | V) = P(A1 | V) P(A2 | V)
Representation
A Bayesian belief network represents the joint probability distribution for a set of variables. Bayesian networks (BN) are represented by directed acyclic graphs. As a running example, consider a Bayesian network representing the joint probability distribution over the boolean variables Storm, Lightning, Thunder, ForestFire, Campfire, and BusTourGroup.
A Bayesian network represents the joint probability distribution by specifying a set of conditional independence assumptions, expressed as a directed acyclic graph, together with sets of local conditional probabilities:
o Each variable in the joint space is represented by a node in the Bayesian network.
o The network arcs represent the assertion that each variable is conditionally independent of its non-descendants in the network, given its immediate predecessors in the network.
o A conditional probability table (CPT) is given for each variable, describing the probability distribution for that variable given the values of its immediate predecessors.
The joint probability for any desired assignment of values (y1, . . . , yn) to the tuple of network variables (Y1 . . . Yn) can be computed by the formula
P(y1, . . . , yn) = ∏i P(yi | Parents(Yi))
where Parents(Yi) denotes the set of immediate predecessors of Yi in the network.
Example:
Consider the node Campfire. The network nodes and arcs represent the assertion that Campfire is conditionally independent of its non-descendants Lightning and Thunder, given its immediate parents Storm and BusTourGroup. This means that once we know the values of the variables Storm and BusTourGroup, the variables Lightning and Thunder provide no additional information about Campfire. The conditional probability table associated with the variable Campfire specifies the probability of Campfire for each combination of values of Storm and BusTourGroup.
Inference
We can use a Bayesian network to infer the value of some target variable (e.g., ForestFire) given the observed values of the other variables. Inference is straightforward if the values of all the other variables in the network are known exactly. More generally, a Bayesian network can be used to compute the probability distribution for any subset of network variables given the values or distributions for any subset of the remaining variables. Exact inference in an arbitrary Bayesian network is known to be NP-hard.
Learning Bayesian Belief Networks
Effective algorithms for learning Bayesian belief networks from training data can be considered under several different settings for the learning problem:
o First, the network structure might be given in advance, or it might have to be inferred from the training data.
o Second, all the network variables might be directly observable in each training example, or some might be unobservable.
In the case where the network structure is given in advance and the variables are fully observable in the training examples, learning the conditional probability tables is straightforward: we simply estimate the conditional probability table entries from the data.
In the case where the network structure is given but only some of the variable values are observable in the training data, the learning problem is more difficult. This learning problem can be compared to learning the weights of an ANN.
The gradient ascent rule maximizes P(D|h) by following the gradient of ln P(D|h) with respect to the parameters that define the conditional probability tables of the Bayesian network. Let wijk denote a single entry in one of the conditional probability tables; in particular, let wijk denote the conditional probability that the network variable Yi will take on the value yij, given that its immediate parents Ui take on the values given by uik.
The gradient of ln P(D|h) is given by the derivatives ∂ln P(D|h)/∂wijk for each of the wijk. Assuming the training examples d in the data set D are drawn independently, each of these derivatives can be calculated as
∂ln P(D|h)/∂wijk = Σd∈D P(Yi = yij, Ui = uik | d) / wijk
THE EM ALGORITHM
The EM algorithm can be used even for variables whose values are never directly observed, provided the general form of the probability distribution governing these variables is known.
Consider data generated by a mixture of k Normal distributions; take the case where k = 2 and the instances are points along the x axis. Each instance is generated using a two-step process:
o First, one of the k Normal distributions is selected at random.
o Second, a single random instance xi is generated according to this selected distribution.
This process is repeated to generate a set of data points. If we could observe which distribution generated each point, finding the maximum likelihood mean of each distribution would be easy: in that case, the sum of squared errors is minimized by the sample mean,
µML = (1/m) Σi xi
Our problem here, however, involves a mixture of k different Normal distributions, and we cannot observe which instances were generated by which distribution. Consider the full description of each instance as the triple (xi, zi1, zi2), where xi is the observed value of the ith instance, and zi1 and zi2 indicate which of the two Normal distributions was used to generate the value xi. In particular, zij has the value 1 if xi was created by the jth Normal distribution and 0 otherwise. Here xi is the observed variable in the description of the instance, and zi1 and zi2 are hidden variables. If the values of zi1 and zi2 were observed, we could use the sample-mean equation above to solve for the means µ1 and µ2; because they are not, we use the EM algorithm instead.
EM algorithm (applied to this problem): initialize the hypothesis h = ⟨µ1, µ2⟩ to arbitrary values, then iterate:
Step 1 (Estimation): Calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = ⟨µ1, µ2⟩ holds.
Step 2 (Maximization): Calculate a new maximum likelihood hypothesis h' = ⟨µ1', µ2'⟩, assuming the value taken by each hidden variable zij is its expected value E[zij] calculated in Step 1. Then replace h by h' and repeat.
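A minimal numpy sketch of these two steps for the two-means problem follows; the function name and the fixed variance are assumptions made for illustration:
import numpy as np

def em_two_means(x, iters=50, sigma2=1.0):
    """Estimate the means of a mixture of two Normals with known, equal variance."""
    mu = np.array([x.min(), x.max()], dtype=float)  # crude initialisation
    for _ in range(iters):
        # Step 1 (Estimation): E[z_ij] is proportional to the Normal density
        # of x_i under the current estimate of mean mu_j
        dens = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma2))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # Step 2 (Maximization): new means are responsibility-weighted averages
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu

# usage: points drawn around two hidden means
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
print(em_two_means(x))   # approximately [-2, 3]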
UNIT IV
LOGISTIC REGRESSION
Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable. The nature of the target or dependent variable is dichotomous, which means there are only two possible classes.
In simple words, the dependent variable is binary in nature having data coded as either 1 (stands for
success/yes) or 0 (stands for failure/no).
Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest ML
algorithms that can be used for various classification problems such as spam detection, Diabetes prediction,
cancer detection etc.
Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc. But instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
Logistic Regression is much like Linear Regression except in how it is used: Linear Regression
is used for solving regression problems, whereas Logistic Regression is used for solving classification
problems.
In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, whose
output is bounded between the two extreme values (0 and 1).
The curve from the logistic function indicates the likelihood of something such as whether the cells are
cancerous or not, a mouse is obese or not based on its weight, etc.
Logistic Regression is a significant machine learning algorithm because it has the ability to provide
probabilities and classify new data using continuous and discrete datasets.
Logistic Regression can be used to classify the observations using different types of data and can easily
determine the most effective variables used for the classification. The logistic (sigmoid) function is
f(x) = 1 / (1 + e^-x); its curve is shown below:
The logistic regression equation can be obtained from the linear regression equation, as follows:
1. The equation of a straight line can be written as: y = b0 + b1x1 + b2x2 + ... + bnxn
2. In Logistic Regression y can be between 0 and 1 only, so let's divide y by (1-y): y/(1-y), which is 0 for y = 0 and infinity for y = 1.
3. But we need a range between -[infinity] and +[infinity]; taking the logarithm, the equation becomes: log[y/(1-y)] = b0 + b1x1 + b2x2 + ... + bnxn
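A quick numeric check of this transformation; the function names and test values below are just for illustration:

import numpy as np

def sigmoid(t):
    # maps any real value into the range (0, 1)
    return 1 / (1 + np.exp(-t))

def logit(y):
    # log-odds: maps (0, 1) back onto (-infinity, +infinity)
    return np.log(y / (1 - y))

print(sigmoid(0.0))         # 0.5
print(logit(sigmoid(2.5)))  # recovers 2.5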
On the basis of the categories, Logistic Regression can be classified into three types:
Binomial: In binomial Logistic Regression, there can be only two possible types of the dependent
variable, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic Regression, there can be 3 or more possible unordered types of
the dependent variable, such as "cat", "dog", or "sheep".
Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types of the dependent
variable, such as "low", "medium", or "high".
To implement the Logistic Regression using Python, we will use the same steps as we have done in previous
topics of Regression. Below are the steps:
1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we can use it in our
code efficiently. It will be the same as we have done in Data pre-processing topic. The code for this is given
below:
#importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
Our dataset is now well prepared, and we will train the model using the training set. To fit the model
to the training set, we will import the LogisticRegression class of the sklearn library.
After importing the class, we will create a classifier object and use it to fit the model to the logistic regression.
Below is the code for it:
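The original code block is not preserved here; a minimal sketch of this step, assuming the preprocessed x_train and y_train from above, would be:

#Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)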
Now we will create the confusion matrix to check the accuracy of the classification. To create it, we need
to import the confusion_matrix function of the sklearn library. After importing the function, we will call it
and store the result in a new variable cm. The function mainly takes two parameters: y_true (the actual values)
and y_pred (the predicted values returned by the classifier). Below is the code for it:
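Again, the original snippet is not preserved; a minimal sketch, assuming the fitted classifier from above:

#Predicting the test set result and creating the Confusion matrix
from sklearn.metrics import confusion_matrix
y_pred = classifier.predict(x_test)
cm = confusion_matrix(y_test, y_pred)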
We can find the accuracy of the predicted result by interpreting the confusion matrix. From the output above, we
can see 65+24 = 89 correct predictions and 8+3 = 11 incorrect predictions.
#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
    nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
    alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
In the above code, we have imported the ListedColormap class of the Matplotlib library to create the colormap
for visualizing the result. We have created two new variables, x_set and y_set, to replace x_train and y_train.
After that, we used the nm.meshgrid command to create a rectangular grid that extends from 1 below the minimum
to 1 above the maximum of each feature, with pixel points at a resolution of 0.01.
To create a filled contour, we used the mtp.contourf command; it creates regions of the provided colors
(purple and green). We passed classifier.predict to this function so that it shows the class predicted
by the classifier for every point in the grid.
Output: By executing the above code, we will get the below output:
In the above graph, we can see that there are some Green points within the green region and Purple
points within the purple region.
All these data points are the observation points from the training set, which shows the result for
purchased variables.
This graph is made by using two independent variables i.e., Age on the x-axis and Estimated salary
on the y-axis.
The purple point observations are for which purchased (dependent variable) is probably 0, i.e., users
who did not purchase the SUV car.
The green point observations are for which purchased (dependent variable) is probably 1 means user
who purchased the SUV car.
We can also estimate from the graph that the users who are younger with low salary, did not purchase
the car, whereas older users with high estimated salary purchased the car.
But there are some purple points in the green region (buying the car) and some green points in the
purple region (not buying the car): these are the incorrectly classified observations, e.g., a younger
user with a high estimated salary who purchased the car, or an older user with a low estimated salary
who did not purchase the car.
A Machine Learning model should have a very high level of accuracy in order to perform well with real-world
applications. But how do we calculate the accuracy of the model, i.e., how well or poorly it will perform
in the real world? In such a case, the cost function comes into the picture. It is an important machine
learning concept for correctly estimating the model.
Cost function also plays a crucial role in understanding that how well your model estimates the relationship
between the input and output parameters.
In this topic, we will explain the cost function in Machine Learning, Gradient descent, and types of cost
functions.
A cost function is an important parameter that determines how well a machine learning model performs for a
given dataset. It calculates the difference between the expected value and predicted value and represents it as
a single real number.
In machine learning, once we train our model, then we want to see how well our model is performing.
Although there are various accuracy functions that tell you how your model is performing, they will not give
insights on how to improve it. So, we need a function that can find when the model is most accurate by finding
the sweet spot between an undertrained and an overtrained model.
In simple, "Cost function is a measure of how wrong the model is in estimating the relationship between
X(input) and Y(output) Parameter." A cost function is sometimes also referred to as Loss function, and it can
be estimated by iteratively running the model to compare estimated predictions against the known values of
Y.
The main aim of each ML model is to determine parameters or weights that can minimize the cost function.
Cost functions can be of various types depending on the problem. However, they are mainly of three types,
which are as follows:
1. Regression Cost Function
2. Binary Classification Cost Functions
3. Multi-class Classification Cost Function
1. Regression Cost Function
There are three commonly used Regression cost functions, which are as follows:
a. Mean Error
In this type of cost function, the error is calculated for each training example, and then the mean of all the
error values is taken. It is one of the simplest ways possible. However, the errors from the training data can be
either negative or positive, so while finding the mean they can cancel each other out and result in a zero
mean error for the model; hence it is not a recommended cost function.
b. Mean Squared Error (MSE)
Here each error is squared before the mean is taken, so positive and negative errors can no longer cancel out.
This cost function is also known as L2 Loss. Because the errors are squared, it is sensitive to outliers.
c. Mean Absolute Error (MAE)
Here the mean of the absolute values of the errors is taken instead. The Mean Absolute Error cost function is
also known as L1 Loss. It is not much affected by noise or outliers, hence giving better results if the dataset
has noise or outliers.
2. Binary Classification Cost Functions
The error in binary classification is calculated as the mean of cross-entropy for all N training data. Which
means:
Binary Cross-Entropy = (Sum of Cross-Entropy for N data)/N
3. Multi-class Classification Cost Function
In a multi-class classification problem, cross-entropy generates a score that summarizes the mean
difference between the actual and predicted probability distributions.
Cross-entropy is minimized for a perfect model, where its value is zero.
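A small sketch of the binary cross-entropy computation described above; the labels and predicted probabilities are made-up values:

import numpy as np

def binary_cross_entropy(y_true, y_prob):
    # mean of the per-example cross-entropy over all N data points
    y_prob = np.clip(y_prob, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.1, 0.8, 0.6])
print(binary_cross_entropy(y_true, y_prob))  # small value; 0 only for a perfect model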
Gradient Descent is known as one of the most commonly used optimization algorithms to train machine
learning models by means of minimizing errors between actual and expected results. Further, gradient descent
is also used to train Neural Networks.
In this tutorial on Gradient Descent in Machine Learning, we will learn in detail about gradient descent, the
role of cost functions specifically as a barometer within Machine Learning, types of gradient descents, learning
rates, etc.
Gradient descent was first proposed by Augustin-Louis Cauchy in 1847. Gradient
Descent is defined as one of the most commonly used iterative optimization algorithms of machine learning,
used to train machine learning and deep learning models. It helps in finding the local minimum of a function.
The best way to define the local minimum or local maximum of a function using gradient descent is as follows:
If we move towards a negative gradient, i.e., away from the gradient of the function at the current point, we
will reach the local minimum of that function.
Whenever we move towards a positive gradient, i.e., towards the gradient of the function at the current point,
we will reach the local maximum of that function.
Moving towards the positive gradient is known as Gradient Ascent; moving against it is Gradient Descent, which
is also known as steepest descent. The main objective of using a gradient descent algorithm is to minimize the
cost function by iteration. To achieve this goal, it performs two steps iteratively:
Calculate the first-order derivative of the function to compute the gradient or slope of that function.
Move in the direction opposite to the gradient, i.e., down the slope from the current point, by alpha
times the gradient, where alpha is the Learning Rate: a tuning parameter in the optimization process that
decides the length of the steps.
Before starting the working principle of gradient descent, we should know some basic concepts to find out the
slope of a line from linear regression. The equation for simple linear regression is given as:
Equation : Y=mX+c
Where 'm' represents the slope of the line, and 'c' represents the intercepts on the y-axis.
The starting point (shown in above fig.) is used to evaluate the performance as it is considered just as an
arbitrary point. At this starting point, we will derive the first derivative or slope and then use a tangent line to
calculate the steepness of this slope. Further, this slope will inform the updates to the parameters (weights and
bias).
The slope is steeper at the arbitrary starting point, but whenever new parameters are generated, the
steepness gradually reduces until the algorithm approaches the lowest point, which is called the
point of convergence.
The main objective of gradient descent is to minimize the cost function, i.e., the error between the expected
and actual output. To minimize the cost function, two pieces of information are required: the direction (given
by the gradient) and the learning rate. These two factors determine the partial derivative calculations of
future iterations and allow the algorithm to arrive at the point of convergence (a local or global minimum).
Let's discuss the learning rate in brief:
Learning Rate:
It is defined as the step size taken to reach the minimum or lowest point. It is typically a small value that is
evaluated and updated based on the behavior of the cost function. A high learning rate results in larger steps
but risks overshooting the minimum, while a low learning rate takes small steps, which compromises overall
efficiency but gives the advantage of more precision.
Based on the error in various training models, the Gradient Descent learning algorithm can be divided
into Batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. Let's
understand these different types of gradient descent:
Batch gradient descent (BGD) is used to find the error for each point in the training set and update the model
after evaluating all training examples. This procedure is known as the training epoch. In simple words, it is a
greedy approach where we have to sum over all examples for each update.
Stochastic gradient descent (SGD) is a type of gradient descent that processes one training example per
iteration: it updates the parameters for each training example one at a time. As it requires only one training
example at a time, it is easier to store in allocated memory. However, it loses some computational efficiency
compared to batch gradient descent because of its frequent updates, and those frequent updates also make the
gradient noisy. This noise, however, can sometimes be helpful in escaping local minima and finding the global
minimum.
Mini Batch gradient descent is the combination of both batch gradient descent and stochastic gradient descent.
It divides the training datasets into small batch sizes then performs the updates on those batches separately.
Splitting the training dataset into smaller batches strikes a balance between the computational efficiency of
batch gradient descent and the speed of stochastic gradient descent. Hence, we get a variant of gradient descent
with higher computational efficiency and a less noisy gradient, as in the sketch below.
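A sketch showing how the three variants differ only in how many examples feed each parameter update; the learning rate, batch size, and data below are arbitrary choices for illustration:

import numpy as np

def minibatch_gd(x, y, lr=0.01, batch_size=2, epochs=3000):
    # batch_size = len(x) -> batch GD; batch_size = 1 -> stochastic GD
    m = b = 0.0
    n = len(x)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        idx = rng.permutation(n)               # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            xb, yb = x[batch], y[batch]
            err = yb - (m * xb + b)
            m += lr * (2 / len(xb)) * np.sum(xb * err)   # gradient step for m
            b += lr * (2 / len(xb)) * np.sum(err)        # gradient step for b
    return m, b

x = np.arange(1.0, 6.0)
y = 2 * x + 3
print(minibatch_gd(x, y))   # approaches m = 2, b = 3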
Although we know Gradient Descent is one of the most popular methods for optimization problems, it still
also has some challenges. There are a few challenges as follows:
Whenever the slope of the cost function is at or close to zero, the model stops learning. Apart
from the global minimum, two other scenarios can produce this slope: saddle points and local
minima. A local minimum has a shape similar to the global minimum, with the slope of the cost
function increasing on both sides of the current point.
In contrast, at a saddle point the negative gradient occurs only on one side of the point: the function
reaches a local maximum on one side and a local minimum on the other. Saddle points take their name from
the shape of a horse's saddle.
The name "local minimum" is used because the value of the loss function is minimum at that point only within
a local region. In contrast, the name "global minimum" is given because the value of the loss function is
minimum there globally, across the entire domain of the loss function.
Vanishing Gradients:
A vanishing gradient occurs when the gradient is smaller than expected. During backpropagation, the gradient
becomes progressively smaller, causing the earlier layers of the network to learn more slowly than the later
layers. When this happens, the weight updates become insignificant and the network stops learning.
Exploding Gradient:
An exploding gradient is the opposite of a vanishing gradient: it occurs when the gradient is too large and
creates an unstable model. In this scenario, the model weights grow too large and may end up being represented
as NaN. This problem can be addressed using the dimensionality reduction technique, which helps to minimize
complexity within the model.
Example program:
import numpy as np

def gradient_descent(x, y):
    m_curr = b_curr = 0
    iterations = 10000
    n = len(x)
    learning_rate = 0.08
    for i in range(iterations):
        y_predicted = m_curr * x + b_curr
        cost = (1/n) * sum([val**2 for val in (y - y_predicted)])   # mean squared error
        md = -(2/n) * sum(x * (y - y_predicted))   # partial derivative w.r.t. m
        bd = -(2/n) * sum(y - y_predicted)         # partial derivative w.r.t. b
        m_curr = m_curr - learning_rate * md
        b_curr = b_curr - learning_rate * bd
        print("m {}, b {}, cost {} iteration {}".format(m_curr, b_curr, cost, i))

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 7, 9, 11, 13])
gradient_descent(x, y)
Out[]:
m 4.96, b 1.44, cost 89.0 iteration 0
m 0.4991999999999983, b 0.26879999999999993, cost 71.10560000000002 iteration 1
m 4.451584000000002, b 1.426176000000001, cost 56.8297702400001 iteration 2
.
.
m 2.000000000000002, b 2.999999999999995, cost 1.0255191767873153e-29 iteration 9997
m 2.000000000000001, b 2.9999999999999947, cost 1.0255191767873153e-29 iteration 9998
m 2.000000000000002, b 2.999999999999995, cost 1.0255191767873153e-29 iteration 9999
To tune the model, we need hyperparameter optimization. By finding the optimal combination of their
values, we can decrease the error and build the most accurate model.
After each iteration, you compare the output with the expected results, assess the accuracy, and adjust the
hyperparameters if necessary. This is a repeated process. You can do it manually or use one of the
many optimization techniques, which come in handy when you work with large amounts of data.
Exhaustive search
Exhaustive search, or brute-force search, is the process of looking for the most optimal
hyperparameters by checking whether each candidate is a good match. You perform the same thing
when you forget the code for your bike's lock and try out all the possible options. In machine learning,
we do the same thing, but the number of candidate combinations is usually very large.
The exhaustive search method is simple. For example, if you are working with a k-means algorithm,
you will manually search for the right number of clusters. However, if there are hundreds and
thousands of options that you have to consider, it becomes unbearably heavy and slow. This makes
brute-force search inefficient in the majority of real-life cases.
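As an illustration, scikit-learn's GridSearchCV automates this brute-force scan; the estimator and parameter grid below are arbitrary examples:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# try every combination in the grid and keep the best-scoring one
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': [1, 3, 5, 7, 9]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)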
Gradient descent
Gradient descent is the most common algorithm for model optimization for minimizing the error. In
order to perform gradient descent, you have to iterate over the training dataset while re-adjusting the
model.
Your goal is to minimize the cost function because it means you get the smallest possible error and
improve the accuracy of the model.
On the graph, you can see a graphical representation of how the gradient descent algorithm travels in
the variable space. To get started, you need to take a random point on the graph and arbitrarily choose
a direction. If you see that the error is getting larger, that means you chose the wrong direction.
When you are not able to improve (decrease the error) anymore, the optimization is over and you have
found a local minimum.
Looks fine so far. However, classical gradient descent does not work well when there are several
local minima: having found its first minimum, it simply stops searching, because the algorithm
only finds a local minimum and is not designed to find the global one.
Note: In gradient descent, you proceed forward with steps of the same size. If you choose a learning rate that
is too large, the algorithm will jump around without getting closer to the right answer. If it's too small,
the computation will start mimicking an exhaustive search, which is, of course, inefficient.
So you have to choose the learning rate very carefully. If done right, gradient descent becomes a
computationally efficient and rather quick method to optimize models.
Genetic algorithms
Genetic algorithms represent another approach to ML optimization. The principle behind the
logic of these algorithms is an attempt to apply the theory of evolution to machine learning.
In the evolution theory, only those specimens get to survive and reproduce that have the best adaptation
mechanisms. How do you know what specimens are and aren’t the best in the case of machine learning
models?
Imagine you have a bunch of random algorithms at hand. This will be your population. Among multiple
models with some predefined hyperparameters, some are better adjusted than the others. Let’s find
them! First, you calculate the accuracy of each model. Then, you keep only those that worked out best.
Now you can generate some descendants with similar hyperparameters to the best models to get a
second generation of models.
We repeat this process many times and only the best models will survive at the end of the process.
Genetic algorithms help to avoid being stuck at local minima/maxima. They are common in optimizing
neural network models.
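A toy sketch of this generational loop, tuning a single hyperparameter of a K-NN model; the population size, mutation range, and number of generations are all illustrative assumptions:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

def fitness(k):
    # a model's cross-validated accuracy is its "adaptation"
    return cross_val_score(KNeighborsClassifier(n_neighbors=int(k)), X, y, cv=5).mean()

population = rng.integers(1, 30, size=8)             # random initial models
for generation in range(5):
    scores = np.array([fitness(k) for k in population])
    survivors = population[np.argsort(scores)[-4:]]  # keep only the best half
    # descendants: mutated copies of the survivors (similar hyperparameters)
    children = np.clip(survivors + rng.integers(-2, 3, size=4), 1, 50)
    population = np.concatenate([survivors, children])

best = population[np.argmax([fitness(k) for k in population])]
print("best n_neighbors:", best)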
Now, we will add a loss function and an optimization objective to build a model that can predict accurate
values of Y. The loss function for linear regression is called RSS, or Residual Sum of Squares.
Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:
o Ridge Regression
o Lasso Regression
Ridge Regression
o Ridge regression is one of the types of linear regression in which a small amount of bias is introduced
so that we can get better long-term predictions.
o Ridge regression is a regularization technique used to reduce the complexity of the model. It
is also called L2 regularization.
o In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added
to the model is called the Ridge Regression penalty. We can calculate it by multiplying lambda by
the squared weight of each individual feature.
o The equation for the cost function in ridge regression will be:
Cost function = Σ(yi − ŷi)² + λ Σ wj²
o In the above equation, the penalty term regularizes the coefficients of the model, and hence ridge
regression reduces the magnitudes of the coefficients, which decreases the complexity of the model.
o As we can see from the above equation, if the values of λ tend to zero, the equation becomes the
cost function of the linear regression model. Hence, for the minimum value of λ, the model will
resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between the independent
variables, so to solve such problems, Ridge regression can be used.
o It helps to solve the problems if we have more parameters than samples.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model. It stands
for Least Absolute Shrinkage and Selection Operator.
o It is similar to the Ridge Regression except that the penalty term contains only the absolute weights
instead of a square of weights.
o Since it takes absolute values, hence, it can shrink the slope to 0, whereas Ridge Regression can only
shrink it near to 0.
o It is also called L1 regularization. The equation for the cost function of Lasso regression will be:
Cost function = Σ(yi − ŷi)² + λ Σ |wj|
o Some of the features in this technique are completely neglected for model evaluation.
o Hence, the Lasso regression can help us to reduce the overfitting in the model as well as the feature
selection.
Key Difference between Ridge Regression and Lasso Regression
o Ridge regression is mostly used to reduce the overfitting in the model, and it includes all the features
present in the model. It reduces the complexity of the model by shrinking the coefficients.
o Lasso regression helps to reduce the overfitting in the model as well as feature selection.
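Both penalties are available in scikit-learn; a quick comparison on synthetic data (the alpha values are arbitrary):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks coefficients toward 0
lasso = Lasso(alpha=1.0).fit(X, y)   # L1: can set coefficients exactly to 0
print("ridge:", ridge.coef_.round(2))
print("lasso:", lasso.coef_.round(2))  # note the exact zeros (feature selection)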
What is Overfitting?
o Overfitting & underfitting are the two main errors/problems in the machine learning model, which
cause poor performance in Machine Learning.
o Overfitting occurs when the model fits more data than required, and it tries to capture each and every
datapoint fed to it. Hence it starts capturing noise and inaccurate data from the dataset, which degrades
the performance of the model.
o An overfitted model doesn't perform accurately with the test/unseen dataset and can’t generalize well.
o An overfitted model is said to have low bias and high variance.
Example to Understand Overfitting
We can understand overfitting with a general example. Suppose there are three students, X, Y, and Z, and all
three are preparing for an exam. X has studied only three sections of the book and left all other sections. Y
has a good memory, hence memorized the whole book. And the third student, Z, has studied and practiced all
the questions. So, in the exam, X will only be able to solve the questions if they come from the sections he
has studied. Student Y will only be able to solve questions if they appear exactly the same as given in the book.
Student Z will be able to solve all the exam questions in a proper way.
The same happens with machine learning: if the algorithm learns from only a small part of the data, it is
unable to capture the required patterns and is hence underfitted.
Suppose the model memorizes the training dataset, like student Y. It performs very well on the seen dataset
but performs badly on unseen data or unknown instances. In such cases, the model is said to be overfitting.
And if the model performs well with the training dataset and also with the test/unseen dataset, similar to
student Z, it is said to be a good fit.
Now, if the model performs well with the training dataset but not with the test dataset, then it is likely to have
an overfitting issue.
For example, if the model shows 85% accuracy with training data and 50% accuracy with the test dataset, it
means the model is not performing well.
Ways to prevent overfitting:
1. Early Stopping
2. Train with more data
3. Feature Selection
4. Cross-Validation
5. Data Augmentation
6. Regularization
7. Ensemble Methods
1.Early Stopping
In this technique, the training is paused before the model starts learning the noise within the data. While
training the model iteratively, we measure its performance after each iteration and continue until a new
iteration stops improving the performance of the model. After that point, the model begins to overfit the
training data; hence we need to stop the process before the learner passes that point.
Stopping the training process before the model starts capturing noise from the data is known as early stopping.
However, this technique may lead to the underfitting problem if training is paused too early. So, it is very
important to find that "sweet spot" between underfitting and overfitting.
2.Train with More data
Increasing the training set by including more data can enhance the accuracy of the model, as it provides more
chances to discover the relationship between input and output variables.
It may not always work to prevent overfitting, but this way helps the algorithm to detect the signal better to
minimize the errors.
When a model is fed with more training data, it will be unable to overfit all the samples of data and forced to
generalize well.
But in some cases, the additional data may add more noise to the model; hence we need to make sure the data
are clean and free from inconsistencies before feeding them to the model.
3.Feature Selection
While building the ML model, we have a number of parameters or features that are used to predict the outcome.
However, sometimes some of these features are redundant or less important for the prediction, and for this
feature selection process is applied. In the feature selection process, we identify the most important features
within training data, and other features are removed. Further, this process helps to simplify the model and
reduces noise from the data. Some algorithms have the auto-feature selection, and if not, then we can manually
perform this process.
4.Cross-Validation
Cross-validation is one of the powerful techniques to prevent overfitting.
In the general k-fold cross-validation technique, we divide the dataset into k equal-sized subsets of data;
these subsets are known as folds. Each fold is used once as a validation set while the model is trained on the
remaining k-1 folds, and the scores are then averaged, as in the sketch below.
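A short sketch of k-fold cross-validation with scikit-learn; choosing k = 5 and a decision tree here is arbitrary:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# each of the 5 folds serves once as validation data
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores, scores.mean())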
5.Data Augmentation
Data Augmentation is a data analysis technique, which is an alternative to adding more data to prevent
overfitting. In this technique, instead of adding more training data, slightly modified copies of already existing
data are added to the dataset.
The data augmentation technique makes each data sample appear slightly different every time it is processed
by the model. Hence each dataset appears unique to the model, which helps prevent overfitting.
6.Regularization
If overfitting occurs when a model is complex, we can reduce the number of features. However, overfitting
may also occur with a simpler model, more specifically the Linear model, and for such cases, regularization
techniques are much helpful.
Regularization is the most popular technique to prevent overfitting. It is a group of methods that forces the
learning algorithms to make a model simpler. Applying the regularization technique may slightly increase the
bias but slightly reduces the variance. In this technique, we modify the objective function by adding the
penalizing term, which has a higher value with a more complex model.
The two commonly used regularization techniques are L1 Regularization and L2 Regularization.
Ensemble Methods
In ensemble methods, predictions from different machine learning models are combined to identify the most
popular result.
The most commonly used ensemble methods are Bagging and Boosting.
In bagging, individual data points can be selected more than once. After several sample datasets are
collected, a model is trained independently on each of them, and depending on the type of task (regression or
classification) the average or the majority vote of those predictions is used to produce a more accurate
result. Moreover, bagging reduces the chances of overfitting in complex models, as in the sketch below.
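A minimal bagging sketch with scikit-learn; the base estimator and the number of estimators are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# each tree sees a bootstrap sample (points can be selected more than once)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))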
In Machine Learning and Artificial Intelligence, the Perceptron is one of the most commonly encountered
terms. It is the primary step in learning Machine Learning and Deep Learning technologies, and it consists
of a set of weights, input values or scores, and a threshold. The Perceptron is a building block of an
Artificial Neural Network. Frank Rosenblatt invented the Perceptron in 1957 for performing certain
calculations to detect capabilities in input data. The Perceptron is a linear Machine Learning algorithm
used for supervised learning with various binary classifiers. It enables neurons to learn from and process
training elements one at a time. In this section, "Perceptron in Machine Learning," we will discuss the
Perceptron and its basic functions in brief. Let's start with the basic introduction of the Perceptron.
Weight parameter represents the strength of the connection between units. This is another most important
parameter of Perceptron components. Weight is directly proportional to the strength of the associated input
neuron in deciding the output. Further, Bias can be considered as the line of intercept in a linear equation.
o Activation Function:
These are the final and important components that help to determine whether the neuron will fire or not.
Activation Function can be considered primarily as a step function.
The data scientist uses the activation function to take a subjective decision based on various problem
statements and forms the desired outputs. The activation function chosen (e.g., Sign, Step, or Sigmoid) may
differ between perceptron models, depending on whether the learning process is slow or suffers from vanishing
or exploding gradients.
This step function or Activation function plays a vital role in ensuring that output is mapped between required
values (0,1) or (-1,1). It is important to note that the weight of input is indicative of the strength of a node.
Similarly, an input's bias value gives the ability to shift the activation function curve up or down.
Step-1
In the first step, all input values are multiplied by their corresponding weights and summed. Then a special
term called bias 'b' is added to this weighted sum to improve the model's performance:
∑wi*xi + b
Step-2
In the second step, an activation function is applied with the above-mentioned weighted sum, which gives us
output either in binary form or a continuous value as follows:
Y = f(∑wi*xi + b)
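These two steps in code, with a step activation; the weights, bias, and inputs are made-up values chosen so the unit behaves like an OR gate:

import numpy as np

def step(t):
    # step activation: fires 1 if the weighted sum crosses the threshold
    return np.where(t >= 0, 1, 0)

def perceptron_output(x, w, b):
    weighted_sum = np.dot(w, x) + b   # Step-1: sum(wi*xi) + b
    return step(weighted_sum)         # Step-2: Y = f(sum(wi*xi) + b)

x = np.array([1.0, 0.0])   # inputs
w = np.array([0.6, 0.6])   # weights
b = -0.5                   # bias
print(perceptron_output(x, w, b))  # 1 (an OR gate for these values)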
A multi-layer perceptron model has greater processing power and can process linear and non-linear patterns.
Further, it can also implement logic gates such as AND, OR, XOR, NAND, NOT, XNOR, NOR.
Neural networks can adapt to changing input; so the network generates the best possible result without needing
to redesign the output criteria. The concept of neural networks, which has its roots in artificial intelligence, is
swiftly gaining popularity in the development of trading systems.
Pros
Can often work more efficiently and for longer than humans
Can be programmed to learn from prior outcomes to strive to make smarter future calculations
Often leverage online services that reduce (but do not eliminate) systematic risk
Are continually being expanded in new fields with more difficult problems
Cons
Still rely on hardware that may require labor and expertise to maintain
May take long periods of time to develop the code and algorithms
May be difficult to assess errors or adaptions to the assumptions if the system is self-learning but lacks
transparency
Usually report an estimated range or estimated amount that may not actualize
Multi-Class Classification
Multi-class classification is perhaps the most popular machine learning task, aside from regression.
The science behind it is the same whether it's spelled multiclass or multi-class: an ML classification
problem with more than two outputs or classes is known as multi-class classification. Because each image may
be classified as one of many distinct animal categories, using a machine learning model to identify animal
species in photographs from an encyclopedia is an example of multi-class classification. Multi-class
classification also requires that each sample belongs to exactly one class (i.e., an elephant is only an
elephant; it is not also a lemur).
We are given a set of training samples separated into K distinct classes, and we create an ML model to forecast
which of those classes some previously unknown data belongs to. The model learns patterns specific to each
class from the training dataset and utilizes those patterns to forecast the classification of future data.
Approach –
1. Load dataset from the source.
2. Split the dataset into “training” and “test” data.
3. Train Decision tree, SVM, and KNN classifiers on the training data.
4. Use the above classifiers to predict labels for the test data.
5. Measure accuracy and visualize classification.
Decision tree classifier – A decision tree classifier is a systematic approach for multiclass classification. It
poses a set of questions to the dataset (related to its attributes/features). The decision tree classification
algorithm can be visualized on a binary tree. On the root and each of the internal nodes, a question is posed
and the data on that node is further split into separate records that have different characteristics. The leaves of
the tree refer to the classes in which the dataset is split. In the following code snippet, we train a decision tree
classifier in scikit-learn.
Example:
# importing necessary libraries
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
# loading the iris dataset
iris = datasets.load_iris()
# X -> features, y -> label
X = iris.data
y = iris.target
# dividing X, y into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
# training a DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
dtree_model = DecisionTreeClassifier(max_depth = 2).fit(X_train, y_train)
dtree_predictions = dtree_model.predict(X_test)
# creating a confusion matrix
cm = confusion_matrix(y_test, dtree_predictions)
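To complete step 5 of the approach above, the accuracy can be measured on the same predictions (continuing the snippet above):

# measuring accuracy on the test data
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, dtree_predictions))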
There are two types of backpropagation networks:
1. Static backpropagation. Static backpropagation is a network developed to map static inputs to static
outputs. Static backpropagation networks can solve static classification problems, such as optical
character recognition (OCR).
2. Recurrent backpropagation. The recurrent backpropagation network is used for fixed-point learning.
Recurrent backpropagation activation feeds forward until it reaches a fixed value.
The algorithm gets its name because the weights are updated backward, from output to input.
Advantages of backpropagation:
It does not have any parameters to tune except for the number of inputs.
It is highly adaptable and efficient and does not require any prior knowledge about the network.
It is a standard process that usually works well.
It is user-friendly, fast and easy to program.
Users do not need to learn any special functions.
1. Sigmoid function: The Sigmoid function exists between 0 and 1 or -1 and 1. The use of a sigmoid function
is to convert a real value to a probability. In machine learning, the sigmoid function is generally used to refer
to the logistic function, also called the logistic sigmoid function; it is also the most widely used sigmoid
function (others are the hyperbolic tangent and the arctangent).
A sigmoid function is placed as the last layer of the model to convert the model’s output into a probability
score, which is easier to work with and interpret.
Another reason to use it mostly in the output layer is that it can otherwise cause a neural network to get stuck
in training time.
2. TanH function: It is the hyperbolic tangent function, whose range lies between -1 and 1; hence it is also
called the zero-centred function. Because it is zero-centred, it is much easier to model inputs with strongly
negative, positive or neutral values. The TanH function is used instead of the sigmoid function when the
output is other than 0 and 1.
TanH functions usually find applications in RNN for natural language processing and speech recognition
tasks.
On the downside, in the case of both Sigmoid and TanH, if the weighted sum input is very large or very small,
the function’s gradient becomes very small and closer to zero.
3. ReLU function: Rectified Linear Unit, also called ReLU, is a widely favoured activation function for deep
learning applications. Compared to Sigmoid and TanH activation functions, ReLU offers an upper hand in
terms of performance and generalisation. In terms of computation too, ReLU is faster as it does not compute
exponentials and divisions. The disadvantage is that ReLU overfits more, as compared with Sigmoid.
4. Parametric ReLU (PReLU): ReLU has been one of the keys to the recent successes in deep learning. Its
use has led to better solutions than those of sigmoid, partially due to the vanishing gradient problem with
sigmoid activations. But we can still improve upon ReLU. LeakyReLU was introduced, which doesn't
zero out the negative inputs as ReLU does. Instead, it multiplies the negative input by a small value (like 0.02)
zero out the negative inputs as ReLU does. Instead, it multiplies the negative input by a small value (like 0.02)
and keeps the positive input as is. But this has shown a negligible increase in the accuracy of our models.
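The four activation functions side by side in NumPy; this is a sketch, and the leaky slope of 0.02 matches the value mentioned above:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))          # range (0, 1)

def tanh(x):
    return np.tanh(x)                    # range (-1, 1), zero-centred

def relu(x):
    return np.maximum(0, x)              # zeroes out negative inputs

def leaky_relu(x, a=0.02):
    return np.where(x >= 0, x, a * x)    # small slope for negative inputs

x = np.array([-3.0, -0.5, 0.0, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, f(x))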
Dropout
Dropout is a regularization technique for reducing overfitting in neural networks by preventing complex co-
adaptations on training data. It is a very efficient way of performing model averaging with neural networks.
The term "dropout" refers to dropping out units (both hidden and visible) in a neural network.
A simple and powerful regularization technique for neural networks and deep learning models is dropout.
This section will cover the dropout regularization technique and how to apply it to deep learning models in
Python with Keras.
Dropout is a technique where randomly selected neurons are ignored during training: they are "dropped out"
randomly. This means that their contribution to the activation of downstream neurons is temporarily removed
on the forward pass, and no weight updates are applied to those neurons on the backward pass.
As a neural network learns, neuron weights settle into their context within the network. Weights of neurons
are tuned for specific features, providing some specialization. Neighboring neurons come to rely on this
specialization, which, if taken too far, can result in a fragile model too specialized to the training data.
This reliance on context for a neuron during training is referred to as complex co-adaptation.
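A minimal Keras sketch of dropout between dense layers; the layer sizes, input shape, and dropout rate are illustrative assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(64, activation='relu', input_shape=(20,)),
    Dropout(0.5),   # randomly ignore 50% of these units during training
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()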
UNIT V NON-PARAMETRIC MACHINE LEARNING
k- Nearest Neighbors- Decision Trees – Branching – Greedy Algorithm - Multiple Branches – Continuous
attributes – Pruning. Random Forests: ensemble learning. Boosting – Adaboost algorithm. Support Vector
Machines – Large Margin Intuition – Loss Function - Hinge Loss – SVM Kernels
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data points to that category for which the number of the neighbor is maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category. Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the
distance between two points, which we have already studied in geometry. For points (x1, y1) and (x2, y2)
it can be calculated as: d = √((x2 − x1)² + (y2 − y1)²)
o By calculating the Euclidean distance we got the nearest neighbors, as three nearest neighbors in
category A and two nearest neighbors in category B. Consider the below image:
o As we can see the 3 nearest neighbors are from category A, hence this new data point must belong to
category A.
Problem for K-NN Algorithm: There is a Car manufacturer company that has manufactured a new SUV car.
The company wants to give ads to the users who are interested in buying that SUV. For this problem,
we have a dataset that contains information about multiple users from a social network. The dataset contains
a lot of information, but we will consider the Estimated Salary and Age as the independent variables and
the Purchased variable as the dependent variable. Below is the dataset:
Steps to implement the K-NN algorithm:
Data Pre-processing step
Fitting the K-NN algorithm to the Training set
Predicting the test result
Test accuracy of the result(Creation of Confusion matrix)
Visualizing the test set result.
And then we will fit the classifier to the training data. Below is the code for it:
#Fitting K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)
Predicting the Test Result: To predict the test set result, we will create a y_pred vector as we did in Logistic
Regression. Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
Visualizing the Training set result:
Now, we will visualize the training set result for K-NN model. The code will remain same as we did in
Logistic Regression, except the name of the graph.
The output graph is different from the graph that we obtained in Logistic Regression. This can be
understood from the points below:
As we can see, the graph shows red points and green points: the green points are for the
Purchased (1) class and the red points for the Not Purchased (0) class.
The graph is showing an irregular boundary instead of showing any straight line or any curve
because it is a K-NN algorithm, i.e., finding the nearest neighbor.
The graph has classified users in the correct categories as most of the users who didn't buy the SUV
are in the red region and users who bought the SUV are in the green region.
The graph shows a good result, but still there are some green points in the red region and red
points in the green region. This is not a big issue, as allowing these few errors prevents the
model from overfitting.
Hence our model is well trained.
Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset and
problem is the main point to remember while creating a machine learning model. Below are the two reasons
for using the Decision tree:
Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after getting
a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to
the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the
child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of the
tree. This algorithm compares the values of root attribute with the record (real dataset) attribute and, based on
the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and move further.
It continues the process until it reaches the leaf node of the tree. The complete process can be better understood
using the below algorithm:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain possible values for the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where you cannot classify the nodes further; the final
nodes are called leaf nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept the
offer or Not. So, to solve this problem, the decision tree starts with the root node (Salary attribute by ASM).
The root node splits further into the next decision node (distance from the office) and one leaf node based on
the corresponding labels. The next decision node further gets split into one decision node (Cab facility) and
one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offers and Declined offer).
Consider the below diagram:
Attribute Selection Measures
While implementing a Decision tree, the main issue arises that how to select the best attribute for the root
node and for sub-nodes. So, to solve such problems there is a technique which is called as Attribute selection
measure or ASM. By this measurement, we can easily select the best attribute for the nodes of the tree. There
are two popular techniques for ASM, which are:
Information Gain
Gini Index
1. Information Gain:
Information gain is the measurement of changes in entropy after the segmentation of a dataset based
on an attribute.
It calculates how much information a feature provides us about a class.
According to the value of information gain, we split the node and build the decision tree.
A decision tree algorithm always tries to maximize the value of information gain, and a node/attribute
having the highest information gain is split first. It can be calculated using the below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in data.
Entropy can be calculated as:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
Where,
S= Total number of samples
P(yes)= probability of yes
P(no)= probability of no
2. Gini Index:
Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
An attribute with the low Gini index should be preferred as compared to the high Gini index.
It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
Gini index can be calculated using the below formula:
Gini Index = 1 − ∑j Pj²
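A small sketch computing both measures for a node; the class counts (9 "yes", 5 "no") are made up:

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1 - np.sum(np.asarray(p, dtype=float) ** 2)

# node with 9 "yes" and 5 "no" samples
p = [9 / 14, 5 / 14]
print(entropy(p))  # ~0.940
print(gini(p))     # ~0.459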
Pruning: Getting an Optimal Decision tree
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal decision tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the important features of
the dataset. Therefore, a technique that decreases the size of the learning tree without reducing accuracy is
known as Pruning. There are mainly two types of tree pruning technology used:
Cost Complexity Pruning
Reduced Error Pruning.
Python Implementation of Decision Tree
Now we will implement the Decision tree using Python. For this, we will use the dataset "user_data.csv,"
which we have used in previous classification models. By using the same dataset, we can compare the Decision
tree classifier with other classification models such as KNN SVM, LogisticRegression, etc.
Data Pre-processing step
Fitting a Decision-Tree algorithm to the Training set
Predicting the test result
Test accuracy of the result(Creation of Confusion matrix)
Visualizing the test set result.
First, we pre-process the data in the same way as in the previous classification models: loading the
"user_data.csv" dataset, extracting the independent and dependent variables, splitting into training and test
sets, and applying feature scaling. Next, we fit the Decision Tree algorithm to the training set.
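The original code block is not preserved here; a minimal sketch of the pre-processing and fitting steps, mirroring the earlier logistic regression example (file name and parameters as described below):

#importing libraries and the dataset
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
data_set = pd.read_csv('user_data.csv')
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
#Fitting a Decision Tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)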
In the above code, we have created a classifier object, in which we have passed two main parameters;
"criterion='entropy': Criterion is used to measure the quality of split, which is calculated by
information gain given by entropy.
random_state=0": For generating the random states.
3. Predicting the test result
Now we will predict the test set result. We will create a new prediction vector y_pred. Below is the code for
it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
Output:
In the below output image, the predicted output and real test output are given. We can clearly see that there
are some values in the prediction vector, which are different from the real vector values. These are prediction
errors.
In the above output image, we can see the confusion matrix, which has 6+3 = 9 incorrect
predictions and 62+29 = 91 correct predictions. Therefore, we can say that, compared to other
classification models, the Decision Tree classifier made a good prediction.
The above output is completely different from the other classification models: it has both vertical and
horizontal lines splitting the dataset according to the age and estimated salary variables.
As we can see, the tree is trying to capture each data point, which is a sign of overfitting.
Visualization of test set result will be similar to the visualization of the training set except that the training set
will be replaced with the test set.
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
    nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
    alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Decision Tree Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
3. Greedy Algorithm:
The greedy method is one of the strategies like Divide and conquer used to solve the problems. This
method is used for solving optimization problems. An optimization problem is a problem that demands
either maximum or minimum results.
The Greedy method is the simplest and most straightforward approach. It is not an algorithm, but a
technique. The main idea of this approach is that decisions are taken on the basis of the currently
available information: whatever information is currently present, the decision is made without
worrying about the effect of the current decision in the future.
This technique is basically used to determine a feasible solution that may or may not be optimal. A
feasible solution is one that satisfies the given criteria; the optimal solution is the best and most
favorable solution among the feasible ones. If more than one solution satisfies the given criteria,
all of them are considered feasible, whereas the optimal solution is the single best solution among them.
Pseudo code of Greedy Algorithm
Algorithm Greedy(a, n)
{
    solution := 0;
    for i = 1 to n do
    {
        x := select(a);
        if feasible(solution, x) then
            solution := union(solution, x);
    }
    return solution;
}
The above is the greedy algorithm. Initially, the solution is assigned the value zero. We pass the array and
the number of elements to the greedy algorithm. Inside the for loop, we select the elements one by one and
check whether the solution is feasible or not. If the solution is feasible, we perform the union, as in the
Python rendering below.
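A direct Python rendering of the pseudo code, using a toy feasibility test (a budget of 12) and made-up costs:

def greedy(candidates, feasible):
    solution = []
    for x in sorted(candidates):      # consider the cheapest options first
        if feasible(solution, x):
            solution.append(x)        # union(solution, x)
    return solution

# pick items while the total cost stays within a budget of 12
costs = [10, 20, 5]
print(greedy(costs, lambda s, x: sum(s) + x <= 12))  # [5]; adding 10 would exceed 12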
Let's understand through an example.
Suppose there is a problem 'P'. I want to travel from A to B shown as below:
P:A→B
The problem is that we have to travel this journey from A to B. There are various solutions to go from A to
B. We can go from A to B by walk, car, bike, train, aeroplane, etc. There is a constraint in the journey that
we have to travel this journey within 12 hrs. Only by train or aeroplane can we cover this distance
within 12 hrs. So there are many solutions to this problem, but only two solutions satisfy the
constraint.
If we say that we have to cover the journey at the minimum cost, i.e., travel this distance as cheaply as
possible, then this problem is known as a minimization problem. Till now, we have two feasible solutions:
one by train and one by air. Since travelling by train has the minimum cost, it is the optimal solution. An
optimal solution is also a feasible solution, but one providing the best result; there is only one optimal
solution.
The problem that requires either minimum or maximum result then that problem is known as an optimization
problem. Greedy method is one of the strategies used for solving the optimization problems.
We have to travel from the source to the destination at the minimum cost. Suppose we have three feasible
solutions with path costs 10, 20, and 5; since 5 is the minimum-cost path, it is the optimal solution. This
is the local optimum, and in this way we find the local optimum at each stage in order to calculate the
global optimal solution.
Continuous attributes
What are Continuous Variables?
Simply put, if a variable can take any value between its minimum and maximum value, then it is called a
continuous variable. By nature, a lot of things we deal with fall in this category: age, weight, height being
some of them.
Just to make sure the difference is clear, let me ask you to classify whether a variable is continuous or
categorical:
1. Gender of a person
2. Number of siblings of a Person
3. Time on which a laptop runs on battery
Normalization:
In simpler words, it is a process of comparing variables on a 'neutral' or 'standard' scale. It helps to obtain
the same range of values. Normally distributed data are easy to read and interpret: as shown below, in normally
distributed data, 99.7% of the observations lie within 3 standard deviations of the mean, and after
standardization the mean is zero and the standard deviation is one. The normalization technique is commonly
used in algorithms such as k-means clustering.
A commonly used normalization method is z-scores. Z score of an observation is the number of standard
deviations it falls above or below the mean. It’s formula is shown below.
Randy scored 76 in a maths test; Katie scored 86 in a science test. The maths test has (mean = 70, sd = 2)
and the science test has (mean = 80, sd = 3).
z(Randy) = (76 – 70)/2 = 3
z(Katie) = (86 – 80)/3 = 2
So, relative to their respective classes, Randy's score is the more unusual one.
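The same calculation is easy to reproduce in Python; the scores below are the ones from the example:
import numpy as np

def z_score(x, mean, sd):
    # number of standard deviations x lies above (or below) the mean
    return (x - mean) / sd

print(z_score(76, 70, 2))    # Randy: 3.0
print(z_score(86, 80, 3))    # Katie: 2.0

# normalizing a whole array of observations at once
scores = np.array([76, 70, 68, 74, 72])
z = (scores - scores.mean()) / scores.std()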
There are various types of transformation methods, such as log, square root, exponential, Box-Cox, and power
transformations. The most commonly used is the log transformation.
To avoid such situations (for example, a large number of correlated variables), we use PCA, a.k.a. Principal
Component Analysis. It amounts to finding a few 'principal' variables that explain a significant amount of
the variation in the data. Using this technique, a large number of variables is reduced to a few significant
ones, which helps to reduce noise and redundancy and enables quick computations.
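A minimal sketch of PCA with scikit-learn, on synthetic data; the dataset and the choice of two components are illustrative assumptions:
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 10)                  # 100 samples, 10 original variables

pca = PCA(n_components=2)              # keep 2 'principal' variables
X_reduced = pca.fit_transform(X)       # shape (100, 2)
print(pca.explained_variance_ratio_)   # share of variance each component explains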
Factor Analysis:
Factor Analysis was invented by Charles Spearman (1904). This is a variable reduction technique. It is used
to determine factor structure or model. It also explains the maximum amount of variance in the model. Let’s
say some variables are highly correlated. These variables can be grouped by their correlations i.e., all variables
in a particular group can be highly correlated among themselves but have low correlation with variables of
other group(s). Here each group represents a single underlying construct or factor. Factor analysis is of two
types:
1. EFA (Exploratory Factor Analysis) – identifies and summarizes the underlying correlation structure
in a data set.
2. CFA (Confirmatory Factor Analysis) – attempts to confirm hypotheses using the correlation structure
and rates the 'goodness of fit'.
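As a sketch of exploratory factor analysis in code, on synthetic data with two assumed underlying factors:
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.RandomState(0)
latent = rng.normal(size=(200, 2))               # two underlying factors
loadings = rng.normal(size=(2, 6))               # six observed variables
X = latent @ loadings + 0.1 * rng.normal(size=(200, 6))

fa = FactorAnalysis(n_components=2)
fa.fit(X)
print(fa.components_)                            # estimated factor loadings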
About pruning
Pruning is the process of eliminating weight connections from a network to speed up inference and reduce
model storage size. Decision trees and neural networks, in general, are overparameterized. Pruning a network
entails deleting unneeded parameters from an overly parameterized network.
Pruning mostly serves as an architectural search inside the tree or network. Because pruning acts as a
regularizer, a model will often generalise slightly better at low levels of sparsity, and the pruned model
will match the baseline at somewhat higher levels. If you push the sparsity too far, the model will start to
generalise worse than the baseline, although inference becomes faster.
Pre-pruning
The major disadvantage of pre-pruning is its narrow viewing field: the tree's current expansion may not meet
the standards, but a later expansion might. In this situation, the decision tree's development is halted too
early.
Post-pruning
Post-pruning divides decision tree generation into two steps. The first step is the tree-building process,
with the termination condition that the fraction of a certain class in a node reaches 100%; the second step
prunes the tree structure obtained in the first step.
Post-pruning techniques thereby circumvent the problem of the narrow viewing field. As a result, post-pruning
procedures are often more accurate, and therefore more widely used, than pre-pruning methods. As in
pre-pruning, the pruning procedure turns a node into a leaf by labelling it with the most common class in the
subset associated with that node.
Pruning methods
The goal of pruning is to remove the sections of a classification model that explain random variation in the
training sample rather than genuine characteristics of the domain. This makes the model more understandable
to the user and, often, more accurate on fresh data that was not used to train the classifier. Pruning
therefore requires an effective approach for distinguishing the parts of a classifier that are attributable
to random effects from the parts that describe significant structure. Several pruning methods, used in both
strategies, are listed below.
Minimum Error Pruning (MEP)
This method is a bottom-up strategy that seeks a single tree with the lowest “anticipated error rate on an
independent data set.” This does not indicate the adoption of a pruning set, but rather that the developer wants
to estimate the error rate for unknown scenarios. Indeed, both the original and enhanced versions described
exploiting just information from the training set.
In the presence of noisy data, Laplace probability estimation is employed to improve the performance of ID3.
Later, the Bayesian technique was employed to enhance this procedure; the resulting approach is known as
m-probability estimation. There were two modifications:
1. Prior probabilities are used in the estimation rather than assuming a uniform initial distribution of
classes.
2. Several trees with differing degrees of pruning may be generated by adjusting the value of a parameter.
The degree of pruning is thus decided by the parameter rather than by the number of classes, and factors
such as the degree of noise in the training data may be accounted for based on domain expertise or the
complexity of the problem.
In the minimal error pruning approach, the expected error rate at each internal node is estimated and is
referred to as the static error. The anticipated error rate of the branch rooted at the node is then
estimated as a weighted sum of the expected error rates of the node's children, where each weight represents
the probability that an observation in the node reaches the corresponding child.
Random Forest Algorithm
Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It
can be used for both Classification and Regression problems in ML. It is based on the concept of ensemble
learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the
performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various
subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and
predicts the final output based on the majority vote of those predictions.
A greater number of trees in the forest generally leads to higher accuracy and reduces the risk of
overfitting.
Disadvantages of Random Forest
o Although random forest can be used for both classification and regression tasks, it is less suitable
for regression tasks.
3. Predicting the Test Set result
Since our model is fitted to the training set, so now we can predict the test result. For prediction, we will create
a new prediction vector y_pred. Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
Output:
The prediction vector is given as:
By checking the above prediction vector and test set real vector, we can determine the incorrect predictions
done by the classifier.
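The counts below are read from a confusion matrix. The code creating it is missing at this point in the notes, but it is presumably the usual scikit-learn step, sketched here:
#Creating the Confusion matrix (reconstructed step; assumed to match the original notes)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)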
Output: As we can see in the above matrix, there are 4+4= 8 incorrect predictions and 64+28= 92
correct predictions.
Visualizing the test set result (the opening lines of this block are reconstructed from the parallel SVM
example later in these notes):
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Random Forest Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Ensemble Learning
Ensemble learning helps improve machine learning results by combining several models, which allows better
predictive performance than any single model. The basic idea is to learn a set of classifiers (experts) and
to allow them to vote, as sketched below.
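A minimal sketch of this vote-of-experts idea using scikit-learn's VotingClassifier; the three member models are illustrative choices:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

ensemble = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('dt', DecisionTreeClassifier()),
                ('knn', KNeighborsClassifier())],
    voting='hard')                    # majority vote of the experts
# ensemble.fit(x_train, y_train); y_pred = ensemble.predict(x_test)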
Random Forest:
Random Forest is an extension over bagging. Each classifier in the ensemble is a decision tree classifier
and is generated using a random selection of attributes at each node to determine the split. During
classification, each tree votes and the most popular class is returned.
AdaBoost was the first really successful boosting algorithm developed for the purpose of binary
classification. AdaBoost is short for Adaptive Boosting and is a very popular boosting technique that
combines multiple “weak classifiers” into a single “strong classifier”. It was formulated by Yoav Freund
and Robert Schapire. They also won the 2003 Gödel Prize for their work.
Algorithm:
1. Initialise the dataset and assign equal weight to each of the data points.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weight of the wrongly classified data points.
4. if (got required results)
Goto step 5
else
Goto step 2
5. End
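In practice these steps are implemented by libraries; a minimal scikit-learn sketch follows, where the number of estimators is an illustrative choice:
from sklearn.ensemble import AdaBoostClassifier

# the default weak classifier is a depth-1 decision tree (a 'stump');
# each round reweights the training points the previous round got wrong
booster = AdaBoostClassifier(n_estimators=50)
# booster.fit(x_train, y_train); y_pred = booster.predict(x_test)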
Explanation:
The above diagram explains the AdaBoost algorithm in a very simple way. Let’s try to understand it in a
stepwise process:
B1 consists of 10 data points which consist of two types namely plus(+) and minus(-) and 5 of
which are plus(+) and the other 5 are minus(-) and each one has been assigned equal weight
initially. The first model tries to classify the data points and generates a vertical separator line
but it wrongly classifies 3 plus(+) as minus(-).
B2 consists of the 10 data points from the previous model in which the 3 wrongly classified
plus(+) are weighted more so that the current model tries more to classify these pluses(+)
correctly. This model generates a vertical separator line that correctly classifies the previously
wrongly classified pluses(+) but in this attempt, it wrongly classifies three minuses(-).
B3 consists of the 10 data points from the previous model in which the 3 wrongly classified
minus(-) are weighted more so that the current model tries more to classify these minuses(-)
correctly. This model generates a horizontal separator line that correctly classifies the previously
wrongly classified minuses(-).
B4 combines together B1, B2, and B3 in order to build a strong prediction model which is much
better than any individual model used.
As an example of weighted voting, suppose five ensemble members predict the classes 1, 1, -1, 1 and -1, and
are assigned the weights 0.2, 0.5, 0.8, 0.2 and 0.9 respectively. Calculating the weighted sum of these
predictions results in an output of -0.8, which would be an ensemble prediction of -1.0, the second class.
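This arithmetic can be checked directly in numpy (the member predictions are the values assumed above):
import numpy as np

preds = np.array([1, 1, -1, 1, -1])               # member predictions
weights = np.array([0.2, 0.5, 0.8, 0.2, 0.9])     # member weights
score = np.sum(weights * preds)                   # -0.8
print(np.sign(score))                             # -1.0, the second class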
Example: SVM can be understood with the example that we used for the KNN classifier. Suppose we see a strange
cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat
or a dog, such a model can be created using the SVM algorithm. We first train our model with lots of images
of cats and dogs so that it can learn their different features, and then we test it with this strange
creature. The support vector machine creates a decision boundary between the two classes (cat and dog) and
chooses the extreme cases (the support vectors) of each. On the basis of the support vectors, it will
classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data: if a dataset can be classified into two
classes by a single straight line, the data is termed linearly separable, and the classifier used is
called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data: if a dataset cannot be
classified by a straight line, the data is termed non-linear, and the classifier used is called a
Non-linear SVM classifier.
Support Vectors:
The data points or vectors that are closest to the hyperplane and that affect its position are termed support
vectors. Since these vectors support the hyperplane, they are called support vectors.
Since this is a 2-d space, just a straight line can easily separate the two classes. But there can be
multiple lines that separate them. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary is called a
hyperplane. The SVM algorithm finds the points of each class closest to the line; these points are called
support vectors. The distance between the support vectors and the hyperplane is called the margin, and the
goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal
hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot
draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have used two
dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are in 3-d space, the decision boundary looks like a plane parallel to the x-axis. If we convert it
back to 2-d space by setting z = 1, it becomes a circle of radius 1.
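A small sketch of this lifting step on synthetic data; the two ring-shaped classes are an illustrative assumption:
import numpy as np

rng = np.random.RandomState(0)
inner = rng.normal(0, 0.4, size=(50, 2))              # class 0: clustered near the origin
angles = rng.uniform(0, 2 * np.pi, 50)
outer = np.c_[2 * np.cos(angles), 2 * np.sin(angles)] # class 1: a surrounding ring

X = np.vstack([inner, outer])
z = X[:, 0] ** 2 + X[:, 1] ** 2     # the added third dimension
# most class-0 points have z < 1 while class-1 points have z near 4,
# so the plane z = 1 separates the lifted data.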
Data Pre-processing step
Till the Data pre-processing step, the code will remain the same. Below is the code:
#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('user_data.csv')
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
After executing the above code, we will pre-process the data. The code will give the dataset as:
Fitting the SVM classifier to the training set:
Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we will
import SVC class from Sklearn.svm library. Below is the code for it:
from sklearn.svm import SVC # "Support vector classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)
In the above code, we have used kernel='linear', as here we are creating an SVM for linearly separable data
(for non-linear data we would change it). We then fitted the classifier to the training dataset (x_train,
y_train).
Output:
Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)
The model performance can be altered by changing the value of C (the regularization factor), gamma, and the
kernel.
Predicting the test set result:
Now, we will predict the output for test set. For this, we will create a new vector y_pred. Below is the code
for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
After getting the y_pred vector, we can compare the result of y_pred and y_test to check the difference
between the actual value and predicted value.
Output: Below is the output for the prediction of the test set:
As we can see in the above output image, there are 66+24= 90 correct predictions and 8+2= 10 incorrect
predictions. Therefore, we can say that our SVM model improved compared to the Logistic regression
model.
Visualizing the training set result:
Now we will visualize the training set result, below is the code for it:
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step =0.01),
nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
By executing the above code, we will get the output as:
As we can see, the above output appears similar to the Logistic regression output. In the output, we got a
straight line as the hyperplane because we used a linear kernel in the classifier; as discussed above, in
2-d space the hyperplane of an SVM is a straight line.
Visualizing the test set result:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step =0.01),
nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
alpha = 0.75, cmap = ListedColormap(('red','green' )))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
By executing the above code, we will get the output as:
As we can see in the above output image, the SVM classifier has divided the users into two regions (Purchased
or Not purchased). Users who purchased the SUV are in the red region with the red scatter points, and users
who did not purchase the SUV are in the green region with the green scatter points. The hyperplane has
divided the two classes of the Purchased variable.
Large Margin Intuition
SVM Decision Boundary
Consider a case where we set the constant C to a very large value. When minimizing the optimization
objective, we are then highly motivated to choose parameter values that make the first term equal to 0. So
what would it take to make this first term equal to 0?
SVM Decision Boundary
We can rewrite the optimization objective of SVM as follows:
min over θ of (1/2) Σ_j θ_j², subject to p(i)·||θ|| ≥ 1 if y(i) = 1, and p(i)·||θ|| ≤ −1 if y(i) = 0,
where p(i) is the projection of x(i) onto the vector θ.
Simplification: θ0 = 0.
According to the illustration below, with a minimal magnitude of θ, the absolute value of p(i) must be as
large as possible (hence the large margin).
In logistic regression, we take the output of the linear function and squash it into the range [0,1] using
the sigmoid function; if the squashed value is greater than a threshold (0.5) we assign the label 1, else the
label 0. In SVM, we take the output of the linear function directly: if that output is greater than 1, we
identify the point with one class, and if the output is less than -1, we identify it with the other class.
Since the threshold values are changed to 1 and -1 in SVM, we obtain this reinforcement range of values
([-1,1]) which acts as the margin.
Loss Function
In machine learning, the loss function is the difference between the actual output and the output predicted
by the model for a single training example, while the average of the loss function over all training examples
is termed the cost function. The difference computed by the loss function (such as a regression loss, binary
classification loss, or multiclass classification loss function) is termed the error value; this error value
reflects how far the predicted value lies from the actual value.
Note that for some losses, such as the 0-1 classification loss, the amount of deviation does not matter; what
matters is only whether the value predicted by the model is right or wrong. Loss functions differ based on
the problem statement to which machine learning is applied. The term cost function is often used
interchangeably with loss function, but it holds a slightly different meaning: a loss function is for a
single training example, while a cost function is the average loss over the complete training dataset.
2. Binary Classification Loss Functions
These loss functions are made to measure the performances of the classification model. In this, data points are
assigned one of the labels, i.e. either 0 or 1. Further, they can be classified as:
Binary Cross-Entropy
It is the default loss function for binary classification problems. Cross-entropy loss measures the
performance of a classification model whose output is a probability value between 0 and 1; the loss
increases as the predicted probability deviates from the actual label.
Hinge loss
Hinge loss can be used as an alternative to cross-entropy; it was initially developed for use with the
support vector machine algorithm. Hinge loss works best with classification problems where the target values
are in the set {-1, 1}. It assigns more error when there is a difference in sign between the actual and
predicted values, which can result in better performance than cross-entropy in such settings.
Multi-class Cross-Entropy
In this case, the target values are in the set of 0 to n i.e {0,1,2,3…n}. It calculates a score that takes an average
difference between actual and predicted probability values, and the score is minimized to reach the best
possible accuracy. Multi-class cross-entropy is the default loss function for text classification problems.
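The binary losses described above reduce to short numpy computations; a sketch with illustrative toy labels and outputs:
import numpy as np

def binary_cross_entropy(y_true, p_pred):
    # y_true in {0, 1}; p_pred is the predicted probability of class 1
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

def hinge_loss(y_true, y_out):
    # y_true in {-1, 1}; y_out is the real-valued model output
    return np.mean(np.maximum(0, 1 - y_true * y_out))

print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2])))  # ~0.164
print(hinge_loss(np.array([1, -1]), np.array([1.5, -0.3])))          # 0.35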
Hinge Loss
The hinge loss is a specific type of cost function that incorporates a margin or distance from the
classification boundary into the cost calculation. Even if new observations are classified correctly, they
can incur a penalty if the margin from the decision boundary is not large enough. The hinge loss
increases linearly.
The hinge loss is mostly associated with soft-margin support vector machines.
If you are familiar with the construction of hyperplanes and their margins in support vector machines, you
probably know that margins are often defined as having a distance equal to 1 from the data-separating
hyperplane. We want data points not only to fall on the correct side of the hyperplane but also to be located
beyond the margin.
Support vector machines address a classification problem where observations either have an outcome of +1
or -1. The support vector machine produces a real-valued output that is negative or positive depending on
which side of the decision boundary it falls. Only if an observation is classified correctly and the distance from
the plane is larger than the margin will it incur no penalty. The distance from the hyperplane can be regarded
as a measure of confidence. The further an observation lies from the plane, the more confident it is in the
classification.
For example, if an observation was associated with an actual outcome of +1, and the SVM produced an output
of 1.5, the loss would equal 0.
Contrary to methods like linear regression, where we try to find a line that minimizes the distance from the
data points, an SVM tries to maximize the distance (the margin). Comparing the two approaches nicely
illustrates the difference between the nature of regression and classification problems.
An observation that is located directly on the boundary would incur a loss of 1 regardless of whether the real
outcome was +1 or -1.
Observations that fall on the correct side of the decision boundary (hyperplane) but are within the margin incur
a cost between 0 and 1.
All observations that end up on the wrong side of the hyperplane incur a loss greater than 1, which increases
linearly. For example, if the actual outcome was 1 and the classifier's output was 0.5, the corresponding
loss would be 0.5 even though the classification is correct (the point lies within the margin).
Now that we have a strong intuitive understanding of the hinge loss, understanding the math will be a breeze.
In the case illustrated here, the blue and red data points are linearly separable, allowing for a hard margin
classifier; if the data is not linearly separable, hard margin classification is not applicable.
Even though support vector machines are linear classifiers, they are still able to separate data points that
are not linearly separable by applying the kernel trick.
Furthermore, if the margin of the SVM is very small, the model is more likely to overfit. In these cases, we
can choose to cut the model some slack by allowing for misclassifications. We call this a soft margin support
vector machine. But if the model produces too many misclassifications, its utility declines. Therefore, we need
to penalize the misclassified samples by introducing a cost function.
In summary, the soft margin support vector machine requires a cost function while the hard margin SVM does
not.
SVM Cost
In the discussion of support vectors above, we established that the optimization objective of the support
vector classifier is to minimize the term w, a vector orthogonal to the data-separating hyperplane onto which
we project our data points:
\min_{w} \frac{1}{2} \sum_{i=1}^{n} w_i^2
This minimization problem represents the primal form of the hard margin SVM, which doesn’t account for
classification errors.
For the soft-margin SVM, we combine the minimization objective with a loss function such as the hinge loss.
The first term sums over the number of features (n), while the second term sums over the number of samples
in the data (m).
The t variable is the output produced by the model as a product of the weight parameter w and the data input
x.
t_j = w^T x_j
To understand how the model generates this output, refer back to the discussion of support vectors above.
The loss term has a regularizing effect on the model. But how can we control the regularization, that is, how
aggressively the model should try to avoid misclassifications? To manually control the number of
misclassifications tolerated during training, we introduce an additional parameter, C, which we multiply with
the loss term.
\min_{w} \frac{1}{2} \sum_{i=1}^{n} w_i^2 + C \sum_{j=1}^{m} \max(0, 1 - t_j \cdot y_j)
The smaller C is, the stronger the regularization. Accordingly, the model will attempt to maximize the margin
and be more tolerant towards misclassifications.
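The objective is straightforward to evaluate directly; the weight vector w, the data, and C below are illustrative values:
import numpy as np

def soft_margin_objective(w, X, y, C):
    t = X @ w                                  # model outputs t_j = w^T x_j
    hinge = np.maximum(0, 1 - t * y)           # per-sample hinge loss
    return 0.5 * np.sum(w ** 2) + C * np.sum(hinge)

w = np.array([1.0, -0.5])
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, 1.0]])
y = np.array([1, 1, -1])
print(soft_margin_objective(w, X, y, C=1.0))   # larger C penalizes misclassification more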
SVM Kernels
A kernel function is a method of taking data as input and transforming it into the form required for
processing. The term "kernel" refers to a set of mathematical functions used in the Support Vector Machine
that provide a window to manipulate the data. The kernel function transforms the training data so that a
non-linear decision surface in the original space corresponds to a linear decision surface in a
higher-dimensional space. Essentially, it returns the inner product between two points in a suitable feature
space.
Standard kernel function equation: K(x, y) = ⟨φ(x), φ(y)⟩, where φ maps the inputs into the
higher-dimensional feature space.
Gaussian Kernel: used to perform the transformation when there is no prior knowledge about the data.
Gaussian Kernel Radial Basis Function (RBF): same as the above kernel function, with the radial basis
method added to improve the transformation.
Sigmoid Kernel: equivalent to a two-layer perceptron model of a neural network, of the kind used as an
activation function for artificial neurons.
Polynomial Kernel: represents the similarity of vectors in the training set in a feature space over
polynomials of the original variables used in the kernel.
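For concreteness, here are sketch definitions of the kernels above in numpy; the hyperparameters gamma, r and d are illustrative choices:
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))   # Gaussian / RBF

def sigmoid_kernel(x, y, gamma=1.0, r=0.0):
    return np.tanh(gamma * np.dot(x, y) + r)       # two-layer perceptron analogue

def polynomial_kernel(x, y, gamma=1.0, r=1.0, d=3):
    return (gamma * np.dot(x, y) + r) ** d         # polynomial of degree d

a, b = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(rbf_kernel(a, b), sigmoid_kernel(a, b), polynomial_kernel(a, b))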