UNIT-2 Material


1) Explain Regression and Classification in Supervised Learning?

Regression
Regression analysis is a statistical method to model the relationship between a dependent
(target) variable and one or more independent (predictor) variables. More specifically,
regression analysis helps us understand how the value of the dependent variable changes
with respect to one independent variable when the other independent variables are held
fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.
We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A that spends a certain amount on advertisement
every year and gets sales in return. The list below shows the advertisement spend of the company
in the last 5 years and the corresponding sales:

Now, the company wants to spend $200 on advertisement in the year 2019 and wants to know the
predicted sales for this year. To solve such prediction problems in machine learning, we need
regression analysis.
Regression is a supervised learning technique
which helps in finding the correlation between variables and enables us to predict the
continuous output variable based on the one or more predictor variables. It is mainly used
for prediction, forecasting, time series modeling, and determining the causal-effect
relationship between variables.
In regression, we plot a graph between the variables that best fits the given data points; using
this plot, the machine learning model can make predictions about the data. In simple words,
"Regression shows a line or curve that passes through the data points on the target-predictor
graph in such a way that the vertical distance between the data points and the regression line
is minimum." The distance between the data points and the line tells whether the model has
captured a strong relationship or not.
Some examples of regression can be as:
o Prediction of rain using temperature and other factors
o Determining Market trends
o Prediction of road accidents due to rash driving.
Terminologies Related to the Regression Analysis:
o Dependent Variable: The main factor in Regression analysis which we want to predict or
understand is called the dependent variable. It is also called target variable.
o Independent Variable: The factors which affect the dependent variables or which are
used to predict the values of the dependent variables are called independent variable,
also called as a predictor.
o Outliers: An outlier is an observation which contains either a very low value or a very high
value in comparison to the other observed values. An outlier may distort the result, so it
should be avoided or handled carefully.
o Multicollinearity: If the independent variables are highly correlated with each other, this
condition is called multicollinearity. It should not be present in the dataset, because it
creates problems when ranking the most influential variables.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but
not well with test dataset, then such problem is called Overfitting. And if our algorithm
does not perform well even with training dataset, then such problem is
called underfitting.
Why do we use Regression Analysis?
As mentioned above, regression analysis helps in the prediction of a continuous variable. There
are various scenarios in the real world where we need future predictions, such as weather
conditions, sales prediction, marketing trends, etc. For such cases we need a technique that can
make predictions accurately, and that technique is regression analysis, a statistical method used
in machine learning and data science. Below are some other reasons for using regression analysis:
o Regression estimates the relationship between the target and the independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing the regression, we can confidently determine the most important factor,
the least important factor, and how each factor is affecting the other factors.
Types of Regression
There are various types of regressions which are used in data science and machine learning.
Each type has its own importance on different scenarios, but at the core, all the regression
methods analyze the effect of the independent variable on dependent variables. Here we are
discussing some important types of regression which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Classification
The Classification algorithm is a Supervised Learning technique that is used to identify the
category of new observations on the basis of training data. In Classification, a program learns
from the given dataset or observations and then classifies new observation into a number of
classes or groups. Such as, Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be
called as targets/labels or categories.
Unlike regression, the output variable of Classification is a category, not a value, such as "Green
or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised learning
technique, hence it takes labeled input data, which means it contains input with the
corresponding output.
In a classification algorithm, a discrete output function (y) is mapped to the input variable (x):
y = f(x), where y is the categorical output.
The best example of an ML classification algorithm is Email Spam Detector.
The main goal of the Classification algorithm is to identify the category of a given dataset, and
these algorithms are mainly used to predict the output for the categorical data.
Classification algorithms can be better understood using the below diagram. In the below
diagram, there are two classes, class A and Class B. These classes have features that are similar
to each other and dissimilar to other classes.

The algorithm which implements the classification on a dataset is known as a classifier. There
are two types of Classifications:
o Binary Classifier: If the classification problem has only two possible outcomes, then it is
called as Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, then it is
called as Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.
Learners in Classification Problems:
In the classification problems, there are two types of learners:
1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives
the test dataset. In the lazy learner case, classification is done on the basis of the most
related data stored in the training dataset. It takes less time in training but more time
for predictions.
Example: K-NN algorithm, Case-based reasoning
2. Eager Learners: Eager learners develop a classification model based on the training
dataset before receiving a test dataset. Opposite to lazy learners, an eager learner takes
more time in learning and less time in prediction. Example: Decision Trees, Naïve Bayes,
ANN.
Types of ML Classification Algorithms:
Classification algorithms can be mainly divided into two categories:
o Linear Models
  o Logistic Regression
  o Support Vector Machines
o Non-linear Models
  o K-Nearest Neighbours
  o Kernel SVM
  o Naïve Bayes
  o Decision Tree Classification
  o Random Forest Classification
Use cases of Classification Algorithms
Classification algorithms can be used in different places. Below are some popular use cases of
Classification Algorithms:
o Email Spam Detection
o Speech Recognition
o Identifications of Cancer tumor cells.
o Drugs Classification
o Biometric Identification, etc.
2) Explain Distance Based Methods in supervised Learning?
Distance-based algorithms are machine learning algorithms that classify queries by computing
distances between these queries and a number of internally stored exemplars. Exemplars that
are closest to the query have the largest influence on the classification assigned to the query.
The abbreviation KNN stands for “K-Nearest Neighbour”. It is a supervised machine learning
algorithm. The algorithm can be used to solve both classification and regression problem
statements.
The number of nearest neighbours to a new unknown variable that has to be predicted or
classified is denoted by the symbol ‘K’.
Let’s take a good look at a related real-world scenario before we get started with this awesome
algorithm.
We are often told that we share many characteristics with our nearest peers, whether it
be our thinking process, working etiquette, philosophies, or other factors. As a result, we
build friendships with people we deem similar to us.
The KNN algorithm employs the same principle. Its aim is to locate all of the closest neighbours
around a new unknown data point in order to figure out what class it belongs to. It’s a distance-
based approach.
Consider the diagram below; it is straightforward and easy for humans to identify it as a "Cat"
based on its closest neighbours. This operation, however, cannot be performed directly by the
algorithm.
KNN calculates the distance from all points in the proximity of the unknown data and filters out
the ones with the shortest distances to it. As a result, it’s often referred to as a distance-based
algorithm.
In order to correctly classify the results, we must first determine the value of K (the number of
nearest neighbours).
In the following diagram, the value of K is 5. Since there are four cats and just one dog among
the five closest neighbours inside the red circle's boundary, the algorithm would predict that the
new point is a cat.

Here, 'K' is the hyperparameter for KNN. For proper classification/prediction, the value of K
must be fine-tuned.
But, How do we select the right value of K?
We don’t have a particular method for determining the correct value of K. Here, we’ll try to test
the model’s accuracy for different K values. The value of K that delivers the best accuracy for
both training and testing data is selected.
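A minimal sketch of this tuning loop is shown below. The dataset here is synthetic (generated with scikit-learn's make_classification, an assumption for illustration only), and the loop simply compares test accuracy for a few odd K values.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only to illustrate the K-tuning idea.
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Try several odd values of K and keep the one with the best test accuracy.
scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores[k] = knn.score(X_test, y_test)

best_k = max(scores, key=scores.get)
print(scores)
print("Best K:", best_k)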
Note: it is recommended to always select an odd value of K.
When the value of K is set to an even number, a situation may arise in which the votes from both
groups are equal. In the diagram below, the elements from both groups are equal inside the inner
"Red" circle (K = 4).
In this condition, the model would be unable to do the correct classification for you. Here the
model will randomly assign any of the two classes to this new unknown data.
Choosing an odd value for K is preferred because such a state of equality between the two
classes cannot occur: one of the two groups will always be in the majority.
The impact of selecting a smaller or larger K value on the model
o Larger K value: underfitting can occur when the value of K is too large. In this case, the
model is unable to learn the training data well.
o Smaller K value: overfitting can occur when the value of K is too small. The model then
captures all of the training data, including noise, and performs poorly on the test data.

3) How does KNN work for ‘Classification’ and ‘Regression’ problem statements?

Classification
When the problem statement is of ‘classification’ type, KNN tends to use the concept of
“Majority Voting”. Within the given range of K values, the class with the most votes is chosen.
Consider the following diagram, in which a circle is drawn within the radius of the five closest
neighbours. Four of the five neighbours in this neighbourhood voted for ‘RED,’ while one voted
for ‘WHITE.’ It will be classified as a ‘RED’ wine based on the majority votes.
Real-world example:
Several parties compete in an election in a democratic country like India. Parties compete for
voter support during election campaigns. The public votes for the candidate with whom they
feel more connected.
When the votes for all of the candidates have been recorded, the candidate with the most
votes is declared as the election's winner.
Regression
KNN employs a mean/average method for predicting the value of new data. Based on the value
of K, it considers all of the nearest neighbours.
The algorithm identifies the K nearest neighbours within the chosen value of K and then
calculates the mean of their values.
Consider the diagram below, where the value of k is set to 3. It will now calculate the mean (52)
based on the values of these neighbours (50, 55, and 51) and allocate this value to the unknown
data.
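The mean calculation above can be reproduced with a small from-scratch sketch. The 1-D training data below is hypothetical; it is arranged only so that the three nearest neighbours of the query carry the target values 50, 55 and 51 quoted in the example.

import numpy as np

# Hypothetical training data; the three points nearest the query have targets 50, 55, 51.
X_train = np.array([[1.0], [1.2], [0.9], [5.0], [6.0]])
y_train = np.array([50.0, 55.0, 51.0, 80.0, 90.0])

def knn_regress(x_query, X, y, k=3):
    # Euclidean distance from the query to every stored point.
    distances = np.linalg.norm(X - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    # KNN regression: average the targets of the K nearest neighbours.
    return y[nearest].mean()

print(knn_regress(np.array([1.1]), X_train, y_train, k=3))  # (50 + 55 + 51) / 3 = 52.0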

4) What is the Impact of Imbalanced dataset and Outliers on KNN? Explain?


Imbalanced dataset
When dealing with an imbalanced data set, the model will become biased. Consider the
example shown in the diagram below, where the “Yes” class is more prominent.
As a consequence, the bulk of the closest neighbours to this new point will be from the
dominant class. Because of this, we must balance our data set using either an “Upscaling” or
“Downscaling” strategy.
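A minimal sketch of the "Upscaling" idea is given below, assuming a hypothetical imbalanced dataset and using scikit-learn's resample utility; the class names and sizes are illustrative only.

import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced data: 90 samples of class "Yes", 10 of class "No".
X = np.random.rand(100, 2)
y = np.array(["Yes"] * 90 + ["No"] * 10)

X_majority, y_majority = X[y == "Yes"], y[y == "Yes"]
X_minority, y_minority = X[y == "No"], y[y == "No"]

# "Upscaling": resample the minority class with replacement until the classes match.
X_min_up, y_min_up = resample(X_minority, y_minority, replace=True,
                              n_samples=len(y_majority), random_state=0)

X_balanced = np.vstack([X_majority, X_min_up])
y_balanced = np.concatenate([y_majority, y_min_up])
print(np.unique(y_balanced, return_counts=True))  # both classes now have 90 samples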
Outliers
Outliers are the points that differ significantly from the rest of the data points.
The outliers will impact the classification/prediction of the model. The appropriate class for the
new data point, according to the following diagram, should be “Category B” in green.
The model, however, would be unable to have the appropriate classification due to the
existence of outliers. As a result, removing outliers before using KNN is recommended.

5) What are Decision Trees in Machine Learning? Explain.

o The decision tree algorithm falls under the category of supervised learning. Decision trees
can be used to solve both regression and classification problems.
o A decision tree uses a tree representation to solve the problem, in which each leaf node
corresponds to a class label and attributes are represented on the internal nodes of the
tree.
o We can represent any Boolean function on discrete attributes using a decision tree.
Below are some assumptions that we make while using a decision tree:
o At the beginning, we consider the whole training set as the root.
o Feature values are preferred to be categorical. If the values are continuous, then they are
discretized prior to building the model.
o On the basis of attribute values, records are distributed recursively.
o We use statistical methods for ordering attributes as the root or an internal node.

As you can see from the above image, the decision tree works on the Sum of Products (SOP)
form, which is also known as Disjunctive Normal Form. In the above image, we are predicting
whether a person uses a computer in their daily life.

Root Node – the node present at the beginning of a decision tree; from this node the
population starts dividing according to various features.
Decision Nodes – the nodes we get after splitting the root node are called decision nodes.
Leaf Nodes – the nodes where further splitting is not possible are called leaf nodes or terminal
nodes.
Sub-tree – just as a small portion of a graph is called a sub-graph, a sub-section of this
decision tree is called a sub-tree.
Pruning – cutting down some nodes to stop overfitting.
Example of a decision tree

Let’s understand decision trees with the help of an example

Decision trees are drawn upside down, which means the root is at the top and this root is then
split into several nodes. Decision trees are nothing but a bunch of if-else statements in
layman's terms. The tree checks whether a condition is true, and if it is, it moves to the next
node attached to that decision.

In the below diagram, the tree first asks: what is the weather? Is it sunny, cloudy, or rainy?
It then moves on to the next feature, such as humidity or wind. For example, if it is rainy and
the wind is weak, then the person may go and play.
Did you notice anything in the above flowchart? We see that if the weather is cloudy then we
must go to play. Why didn’t it split more? Why did it stop there?
To answer this question, we need to know a few more concepts like entropy, information
gain, and the Gini index. But in simple terms, we can say here that the output for the training
dataset is always yes for cloudy weather; since there is no disorder here, we don't need to split
the node further.
The goal of machine learning is to decrease uncertainty or disorders from the dataset and for
this, we use decision trees.
6) Define the following: i) Entropy ii) Information Gain iii) Gain Ratio?
(OR)
Explain the process of picking the best splitting Attribute in Decision Trees?
Entropy:
Entropy refers to a common way to measure impurity. In the decision tree, it measures the
randomness or impurity in data sets.

Example: Suppose you have a group of friends who decides which movie they can watch
together on Sunday. There are 2 choices for movies, one is “Lucy” and the second
is “Titanic” and now everyone has to tell their choice.
After everyone gives their answer we see that “Lucy” gets 4 votes and “Titanic” gets 5 votes.
Which movie do we watch now? Isn't it hard to choose one movie, because the votes for both
movies are almost equal?
This is exactly what we call disorder: there is an almost equal number of votes for both movies,
and we can't really decide which movie we should watch.
It would have been much easier if the votes for “Lucy” were 8 and for “Titanic” it was 2. Here
we could easily say that the majority of votes are for “Lucy” hence everyone will be watching
this movie.
In a decision tree, the output is mostly "yes" or "no".
The formula for Entropy is shown below:

E(S) = -P(+) log2 P(+) - P(-) log2 P(-)
Here P(+) is the probability of the positive class,
P(-) is the probability of the negative class, and
S is the subset of the training examples.
How do Decision Trees use Entropy?
Entropy basically measures the impurity of a node. Impurity is the degree of randomness; it
tells how random our data is. A pure sub-split means that either you should be getting “yes”, or
you should be getting “no”.
Suppose a feature has 8 “yes” and 4 “no” initially, after the first split the left node gets 5 ‘yes’
and 2 ‘no’ whereas right node gets 3 ‘yes’ and 2 ‘no’.
We see here the split is not pure, why? Because we can still see some negative classes in both
the nodes. In order to make a decision tree, we need to calculate the impurity of each split, and
when the purity is 100%, we make it as a leaf node.
To check the impurity of feature 2 and feature 3, we will take the help of the entropy formula.
For feature 2 :

For feature 3 :

We can clearly see from the tree itself that the left node has lower entropy (more purity) than
the right node, since the left node has a greater proportion of "yes" and it is easier to decide
there.
Always remember: the higher the entropy, the lower the purity and the higher the impurity.
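As a quick check, the entropy of the two child nodes mentioned above (5 "yes"/2 "no" and 3 "yes"/2 "no") can be computed directly from the formula; the small helper below is only an illustrative sketch.

import math

def entropy(p_pos, p_neg):
    # E(S) = -P(+) log2 P(+) - P(-) log2 P(-); a term is 0 when its probability is 0.
    return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

# Left node after the split: 5 "yes" and 2 "no".
print(round(entropy(5/7, 2/7), 3))   # ~0.863 (lower entropy, purer node)
# Right node after the split: 3 "yes" and 2 "no".
print(round(entropy(3/5, 2/5), 3))   # ~0.971 (higher entropy, less pure)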

Information Gain

Information gain measures the reduction in uncertainty given some feature, and it is also a
deciding factor for which attribute should be selected as a decision node or root node.

Example 1:
Suppose our entire population has a total of 30 instances. The dataset is to predict whether a
person will go to the gym or not. Let's say 16 people go to the gym and 14 people don't.
Now we have two features to predict whether he/she will go to the gym or not.
Feature 1 is “Energy” which takes two values “high” and “low”
Feature 2 is “Motivation” which takes 3 values “No motivation”, “Neutral” and “Highly
motivated”.
Let’s see how our decision tree will be made using these 2 features. We’ll use information gain
to decide which feature should be the root node and which feature should be placed after the
split.
Let’s calculate the entropy:

To see the weighted average of entropy of each node we will do as follows:

Now we have the value of E(Parent) and E(Parent|Energy), information gain will be:

Our parent entropy was near 0.99 and after looking at this value of information gain, we can
say that the entropy of the dataset will decrease by 0.37 if we make “Energy” as our root node.
Example 2: Now take "Motivation" and calculate its information gain.

Let’s calculate the entropy here:

To see the weighted average of entropy of each node we will do as follows:

Now we have the value of E(Parent) and E(Parent|Motivation), information gain will be:

We now see that the "Energy" feature gives a larger reduction (0.37) than the "Motivation"
feature. Hence we select the feature with the highest information gain and then split the node
based on that feature.
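A short sketch of this calculation is given below. The exact yes/no counts for each "Energy" value are not listed above, so the child counts used here (12 "yes"/1 "no" for high energy and 4 "yes"/13 "no" for low energy) are an assumption, chosen only so the numbers come out close to the 0.99 parent entropy and 0.37 gain quoted above.

import math

def entropy(counts):
    # Entropy of a node given its class counts, e.g. [yes_count, no_count].
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    # IG = E(parent) - weighted average of the children's entropies.
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

parent = [16, 14]                     # 16 go to the gym, 14 don't
energy_children = [[12, 1], [4, 13]]  # assumed split by "Energy" = high / low
print(round(entropy(parent), 3))                            # ~0.997, i.e. the ~0.99 above
print(round(information_gain(parent, energy_children), 3))  # ~0.381, close to the 0.37 above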
Gini Impurity
Gini impurity is a measure of how often a randomly chosen element from the set would be
incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset.

Gini impurity is lower bounded by 0, with 0 occurring if the data set contains only one class.

There are many algorithms for building a decision tree, such as:
1. CART (Classification and Regression Trees), which uses Gini impurity as the metric.
2. ID3 (Iterative Dichotomiser 3), which uses entropy and information gain as the metric.
Here we will go through ID3. Once you understand it, it is easy to implement the same using CART.

7) Write the ID3 algorithm for inducing the decision trees?

ID3 (Iterative Dichotomiser 3) uses entropy and information gain as the metric.
Consider a dataset based on which we will determine whether to play football or not.
There are four independent variables to determine the dependent variable. The independent
variables are Outlook, Temperature, Humidity, and Wind. The dependent variable is whether to
play football or not.
As the first step, we have to find the parent node for our decision tree. For that follow the steps:
Find the entropy of the class variable.
E(S) = -[(9/14)log2(9/14) + (5/14)log2(5/14)] = 0.94
Note: here we take log to base 2. In total there are 14 observations, out of which 9 are YES and
5 are NO; the probabilities above are based on these counts.
From the above data for outlook we can arrive at the following table easily.

Now we have to calculate the average weighted entropy, i.e., the entropy of each branch
weighted by the fraction of examples that reach it.
E(S, outlook) = (5/14)*E(3,2) + (4/14)*E(4,0) + (5/14)*E(2,3)
= (5/14)(-(3/5)log2(3/5) - (2/5)log2(2/5)) + (4/14)(0) + (5/14)(-(2/5)log2(2/5) - (3/5)log2(3/5))
= 0.693
The next step is to find the information gain. It is the difference between parent entropy and
average weighted entropy we found above.
IG(S, outlook) = 0.94 - 0.693 = 0.247
Similarly find Information gain for Temperature, Humidity, and Windy.
IG(S, Temperature) = 0.940 - 0.911 = 0.029
IG(S, Humidity) = 0.940 - 0.788 = 0.152
IG(S, Windy) = 0.940 - 0.8932 = 0.048
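These outlook numbers can be verified with a small sketch that uses only the class counts given above (9 "yes"/5 "no" overall; 3/2, 4/0 and 2/3 for sunny, overcast and rainy).

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

parent = [9, 5]
outlook = {"sunny": [3, 2], "overcast": [4, 0], "rainy": [2, 3]}

e_parent = entropy(parent)                                                       # ~0.940
e_outlook = sum(sum(c) / sum(parent) * entropy(c) for c in outlook.values())     # ~0.694
print(round(e_parent, 3), round(e_outlook, 3), round(e_parent - e_outlook, 3))   # gain ~0.247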
Now select the feature having the largest information gain. Here it is Outlook. So it forms the
first node (root node) of our decision tree.
Now our data look as follows :
Since overcast contains only examples of class 'Yes', we can set it as a leaf with value yes. That
means if the outlook is overcast, football will be played. Now our decision tree looks as follows.

The next step is to find the next node in our decision tree. Now we will find one under sunny.
We have to determine which of the following Temperature, Humidity or Wind has higher
information gain.

Calculate parent entropy E(sunny)


E(sunny) = (-(3/5)log(3/5)-(2/5)log(2/5)) = 0.971.
Now Calculate the information gain of Temperature. IG(sunny, Temperature)

E(sunny, Temperature) = (2/5)*E(0,2) + (2/5)*E(1,1) + (1/5)*E(1,0) = (2/5)(0) + (2/5)(1) + (1/5)(0) = 2/5 = 0.4


Now calculate information gain.
IG(sunny, Temperature) = 0.971 - 0.4 = 0.571
Similarly we get
IG(sunny, Humidity) = 0.971
IG(sunny, Windy) = 0.020
Here IG(sunny, Humidity) is the largest value. So Humidity is the node that comes under sunny.
For humidity from the above table, we can say that play will occur if humidity is normal and will
not occur if it is high. Similarly, find the nodes under rainy.
Note: A branch with entropy more than 0 needs further splitting.
Finally, our decision tree will look as below:

8) Write the CART algorithm for inducing the decision trees?


CART Algorithm:
This algorithm can be used for both classification and regression. The CART algorithm uses the
Gini Index criterion to split a node into sub-nodes. It starts with the training set as a root node;
after successfully splitting the root node in two, it splits the subsets using the same logic, and
again splits the sub-subsets, recursively, until further splitting would give no purer sub-nodes
or the maximum number of leaves in the growing tree is reached. (Cutting the tree back
afterwards is termed tree pruning.)

How to calculate the Gini Index?

Gini Index = 1 - Σ (Pi)², summed over i = 1 to c

where Pi is the probability of class i and there are c classes in total.


Considering you have only two predictor/attributes: Humidity & Wind
Class: Rainy & Sunny

GI = 1 - ((number of observations in Class_1 / total observations)² + (number of observations
in Class_2 / total observations)²)
GI = 1 - ((6/10)² + (4/10)²) = 1 - (0.36 + 0.16) = 1 - 0.52 = 0.48
So the Gini index for the first/initial set is 0.48.
Basic idea of how the node split happens:
Based on the attribute "wind" (f) and the threshold value "3.55" (t), the CART algorithm creates
nodes/subsets that give a purer subset on the right side of the above flow.
I wrote a small code snippet to understand it better:
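The original snippet is not reproduced here, so the following is a minimal sketch of the same idea. The wind values are hypothetical; only the class counts (6 of one class, 4 of the other) match the 0.48 Gini computed above, and the printed split is simply the best one for these made-up values.

def gini(labels):
    # Gini impurity = 1 - sum of squared class proportions.
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_split(values, labels):
    # Try each midpoint between sorted feature values as a threshold and keep
    # the one with the lowest weighted Gini of the two child nodes.
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (t, score)
    return best

wind = [1.2, 2.0, 2.5, 3.1, 3.4, 3.7, 4.0, 4.5, 5.0, 5.5]   # hypothetical values
labels = ["Rainy", "Rainy", "Rainy", "Sunny", "Rainy", "Rainy", "Sunny", "Sunny", "Rainy", "Sunny"]
print(round(gini(labels), 2))    # 0.48 for the 6-vs-4 starting set
print(best_split(wind, labels))  # (best threshold, weighted Gini) for these values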

But this is not necessarily the best pair (f, t). The above process continues for all available
attributes, searching for a lower Gini score; whenever a lower score is found, the corresponding
threshold value and attribute are kept, and the node is later split based on the best attribute and
threshold value. According to our dataset, the best Gini score is "0.40" for the "Wind" attribute (f)
with "3.55" as the best threshold value (t). The tree below, generated by DecisionTreeClassifier
using scikit-learn, shows that the node split happened based on the same threshold value and
attribute:
9) What is Naïve Bayes Algorithm in Machine Learning with example? Where is the Naïve
Bayes algorithm used?
It is a classification technique based on Bayes' Theorem with an assumption of independence
among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a
particular feature in a class is unrelated to the presence of any other feature.

o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes


theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simple and most effective Classification algorithms
which helps in building the fast machine learning models that can make quick
predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of
an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental
analysis, and classifying articles.
Why is it called Naïve Bayes?
The Naïve Bayes algorithm is comprised of two words Naïve and Bayes, Which can be described
as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. Such as if the fruit is identified on the
bases of color, shape, and taste, then red, spherical, and sweet fruit is recognized as an
apple. Hence each feature individually contributes to identify that it is an apple without
depending on each other.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Working of Naïve Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether we should play or not on a particular day
according to the weather conditions. To solve this problem, we need to follow the below
steps:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
Problem: If the weather is sunny, then the Player should play or not?
Solution: To solve this, first consider the below dataset:

    Outlook    Play
0   Rainy      Yes
1   Sunny      Yes
2   Overcast   Yes
3   Overcast   Yes
4   Sunny      No
5   Rainy      Yes
6   Sunny      Yes
7   Overcast   Yes
8   Rainy      No
9   Sunny      No
10  Sunny      Yes
11  Rainy      No
12  Overcast   Yes
13  Overcast   Yes
Frequency table for the weather conditions:
Weather    Yes   No
Overcast    5     0
Rainy       2     2
Sunny       3     2
Total      10     4
Likelihood table for the weather conditions:
Weather    No            Yes
Overcast   0             5             5/14 = 0.35
Rainy      2             2             4/14 = 0.29
Sunny      2             3             5/14 = 0.35
All        4/14 = 0.29   10/14 = 0.71


Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day, the player can play the game.
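The calculation above can also be reproduced directly from the 14-row Outlook/Play table, as in the sketch below.

outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play    = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
           "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

n = len(play)
p_yes = play.count("Yes") / n          # P(Yes)   = 10/14
p_no = play.count("No") / n            # P(No)    =  4/14
p_sunny = outlook.count("Sunny") / n   # P(Sunny) =  5/14

# Likelihoods P(Sunny|Yes) and P(Sunny|No)
p_sunny_yes = sum(o == "Sunny" and p == "Yes" for o, p in zip(outlook, play)) / play.count("Yes")
p_sunny_no = sum(o == "Sunny" and p == "No" for o, p in zip(outlook, play)) / play.count("No")

# Bayes' theorem: P(class|Sunny) = P(Sunny|class) * P(class) / P(Sunny)
print(round(p_sunny_yes * p_yes / p_sunny, 2))  # P(Yes|Sunny) ~ 0.60
print(round(p_sunny_no * p_no / p_sunny, 2))    # P(No|Sunny)  ~ 0.40 (0.41 above, from rounded inputs)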
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.
Applications of Naïve Bayes Classifier:
o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager
learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.

10) What is Linear Regression with an example? Write Linear regression Algorithm and its
uses in Machine Learning?
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent (y) variable and
one or more independent (x) variables, hence it is called linear regression. Since linear regression
shows a linear relationship, it finds how the value of the dependent variable changes according
to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:

Mathematically, we can represent a linear regression as:


y = a0 + a1x + ε
Here,
y = Dependent variable (target variable)
x = Independent variable (predictor variable)
a0 = Intercept of the line (gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor applied to each input value)
ε = Random error
The values for x and y variables are training datasets for Linear Regression model
representation.
Types of Linear Regression
Linear regression can be further divided into two types of the algorithm:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent
variable, then such a linear regression algorithm is called simple linear regression.
o Multiple Linear Regression:
If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a linear regression algorithm is called multiple linear
regression.
Linear Regression Line
A linear line showing the relationship between the dependent and independent variables is
called a regression line. A regression line can show two types of relationship:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable increases on
the X-axis, then such a relationship is termed a positive linear relationship.

o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable increases on
the X-axis, then such a relationship is called a negative linear relationship.
Finding the best fit line:
When working with linear regression, our main goal is to find the best fit line that means the
error between predicted values and actual values should be minimized. The best fit line will
have the least error.
The different values of the weights or coefficients of the line (a0, a1) give different regression
lines, so we need to calculate the best values for a0 and a1 to find the best fit line; to
calculate this we use the cost function.
Cost function
o The different values of the weights or coefficients of the line (a0, a1) give different regression
lines, and the cost function is used to estimate the values of the coefficients for the
best fit line.
o The cost function optimizes the regression coefficients or weights. It measures how well a
linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps
the input variable to the output variable. This mapping function is also known
as Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the
average of the squared errors between the predicted values and the actual values. For the above
linear equation, MSE can be calculated as:

MSE = (1/N) * Σ (Yi - (a1xi + a0))²
Where,
N = Total number of observations
Yi = Actual value
(a1xi + a0) = Predicted value
Residuals: The distance between the actual value and the predicted value is called a residual. If
the observed points are far from the regression line, then the residuals will be high, and so the
cost function will be high. If the scatter points are close to the regression line, then the residuals
will be small, and hence so will the cost function.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o This is done by starting from randomly selected coefficient values and then iteratively
updating them to reach the minimum of the cost function, as sketched below.
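The sketch below shows this idea from scratch for simple linear regression: the tiny dataset is an assumption made up for illustration, and the learning rate and number of iterations are arbitrary choices.

import numpy as np

# Tiny made-up dataset roughly following y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

a0, a1 = 0.0, 0.0   # start from initial coefficient values
lr = 0.01           # learning rate

for _ in range(5000):
    y_pred = a1 * x + a0
    error = y_pred - y
    # Gradients of MSE = (1/N) * sum((y_pred - y)^2) with respect to a0 and a1.
    grad_a0 = 2 * error.mean()
    grad_a1 = 2 * (error * x).mean()
    a0 -= lr * grad_a0
    a1 -= lr * grad_a1

mse = ((a1 * x + a0 - y) ** 2).mean()
print(round(a0, 2), round(a1, 2), round(mse, 4))  # fitted intercept, slope and final MSE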

Model Performance:
The Goodness of fit determines how the line of regression fits the set of observations. The
process of finding the best model out of various models is called optimization. It can be
achieved by below method:
1. R-squared method:
o R-squared is a statistical method that determines the goodness of fit.
o It measures the strength of the relationship between the dependent and independent
variables on a scale of 0-100%.
o A high value of R-squared indicates a small difference between the predicted values
and the actual values and hence represents a good model.
o It is also called a coefficient of determination, or coefficient of multiple
determination for multiple regression.
o It can be calculated from the below formula:
R-squared = Explained variation / Total variation
Assumptions of Linear Regression


Below are some important assumptions of Linear Regression. These are some formal checks
while building a Linear Regression model, which ensures to get the best possible result from the
given dataset.
o Linear relationship between the features and target:
Linear regression assumes the linear relationship between the dependent and
independent variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due to
multicollinearity, it may be difficult to find the true relationship between the predictors and
the target variable, or, we can say, it is difficult to determine which predictor variable is
affecting the target variable and which is not. So, the model assumes either little or no
multicollinearity between the features or independent variables.
o Homoscedasticity assumption:
Homoscedasticity is a situation when the error term is the same for all the values of
independent variables. With homoscedasticity, there should be no clear pattern
distribution of data in the scatter plot.
o Normal distribution of error terms:
Linear regression assumes that the error term should follow the normal distribution
pattern. If error terms are not normally distributed, then confidence intervals will
become either too wide or too narrow, which may cause difficulties in finding
coefficients.
This can be checked using a Q-Q plot. If the plot shows a straight line without any
deviation, it means the errors are normally distributed.
o No autocorrelation:
The linear regression model assumes no autocorrelation in the error terms. If there is
any correlation in the error terms, it will drastically reduce the accuracy of the
model. Autocorrelation usually occurs if there is a dependency between residual errors.

11) What is KNN algorithm explain with an example?


K-Nearest Neighbor(KNN) Algorithm for
Machine Learning
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and available cases
and put the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on its
similarity. This means that when new data appears, it can be easily classified into a well-
suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is
used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an
action on the dataset.
o The KNN algorithm, at the training phase, just stores the dataset, and when it gets new data,
it classifies that data into the category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog,
and we want to know whether it is a cat or a dog. For this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will compare the features
of the new image with the cat and dog images and, based on the most similar
features, put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?


Suppose there are two categories, i.e., Category A and Category B, and we have a new data
point x1. In which of these categories will this data point lie? To solve this type of problem,
we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of
a particular data point. Consider the below diagram:
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data points to that category for which the number of the
neighbor is maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category. Consider the
below image:

o Firstly, we will choose the number of neighbours; here we choose K = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in
geometry. It can be calculated as:
Euclidean distance between A(x1, y1) and B(x2, y2) = √((x2 - x1)² + (y2 - y1)²)
o By calculating the Euclidean distance we got the nearest neighbors, as three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:

o As we can see, 3 of the 5 nearest neighbours are from category A, hence this new data point
must belong to category A.
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and leads to the effects of outliers
in the model.
o Large values for K are generally good, but they may cause some difficulties.
Advantages of KNN Algorithm:
o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
o We always need to determine the value of K, which may be complex at times.
o The computation cost is high because of calculating the distances between the new data
point and all the training samples.
Problem for the K-NN Algorithm: There is a car manufacturer company that has manufactured a
new SUV car. The company wants to show ads to users who are interested in buying that
SUV. For this problem, we have a dataset that contains multiple users' information collected
from a social network. The dataset contains lots of information, but we will consider Estimated
Salary and Age as the independent variables and the Purchased variable as the dependent
variable. Below is the dataset:

Steps to implement the K-NN algorithm:


o Data Pre-processing step
o Fitting the K-NN algorithm to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result.
Data Pre-Processing Step:
The Data Pre-processing step will remain exactly the same as Logistic Regression. Below is the
code for it:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set = pd.read_csv('user_data.csv')

# extracting the independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
By executing the above code, our dataset is imported to our program and well pre-processed.
After feature scaling our test dataset will look like:
From the above output image, we can see that our data is successfully scaled.
o Fitting K-NN classifier to the Training data:
Now we will fit the K-NN classifier to the training data. To do this we will import
the KNeighborsClassifier class of Sklearn Neighbors library. After importing the class,
we will create the Classifier object of the class. The Parameter of this class will be
o n_neighbors: To define the required neighbors of the algorithm. Usually, it takes
5.
o metric='minkowski': This is the default parameter and it decides the distance
between the points.
o p=2: It is equivalent to the standard Euclidean metric.
And then we will fit the classifier to the training data. Below is the code for it:
# fitting the K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)
Output: By executing the above code, we will get the output as:
Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
o Predicting the Test Result: To predict the test set result, we will create a y_pred vector
as we did in Logistic Regression. Below is the code for it:
# predicting the test set result
y_pred = classifier.predict(x_test)
Output:
The output for the above code will be:
o Creating the Confusion Matrix:
Now we will create the Confusion Matrix for our K-NN model to see the accuracy of the
classifier. Below is the code for it:
# creating the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
In the above code, we have imported the confusion_matrix function and stored its result in the
variable cm.
Output: By executing the above code, we will get the matrix as below:

In the above image, we can see there are 64+29= 93 correct predictions and 3+4= 7 incorrect
predictions, whereas, in Logistic Regression, there were 11 incorrect predictions. So we can say
that the performance of the model is improved by using the K-NN algorithm.
o Visualizing the Training set result:
Now, we will visualize the training set result for K-NN model. The code will remain same
as we did in Logistic Regression, except the name of the graph. Below is the code for it:
# visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
By executing the above code, we will get the below graph:

The output graph is different from the graph that we obtained in Logistic Regression. It
can be understood from the below points:
o As we can see, the graph shows red points and green points. The green
points are for Purchased (1) and the red points for Not Purchased (0).
o The graph shows an irregular boundary instead of a straight line
or a curve, because it is a K-NN algorithm, i.e., it finds the nearest neighbours.
o The graph has classified users into the correct categories, as most of the users who
didn't buy the SUV are in the red region and users who bought the SUV are in the
green region.
o The graph shows a good result, but there are still some green points in the red
region and red points in the green region. This is not a big issue, as it keeps the
model from overfitting.
o Hence our model is well trained.
o Visualizing the Test set result:
After the training of the model, we will now test the result by putting a new dataset, i.e.,
Test dataset. Code remains the same except some minor changes: such as x_train and
y_train will be replaced by x_test and y_test.
Below is the code for it:
# visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('K-NN Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:

The above graph shows the output for the test dataset. As we can see in the graph, the
predicted output is quite good, as most of the red points are in the red region and most of the
green points are in the green region.
However, there are a few green points in the red region and a few red points in the green region.
These are the incorrect observations that we observed in the confusion matrix (7
incorrect outputs).

12) What is Logistic Regression with an example? Write Logistic regression Algorithm and its
uses in Machine Learning?
o Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore
the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1,
true or False, etc. but instead of giving the exact value as 0 and 1, it gives the
probabilistic values which lie between 0 and 1.
o Logistic Regression is much like Linear Regression except for how it is
used. Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different types of data
and can easily determine the most effective variables used for the classification. The
below image is showing the logistic function:

Note: Logistic regression uses the concept of predictive modeling like regression; therefore,
it is called logistic regression. However, it is used to classify samples, so it falls under the
classification algorithms.
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond
this limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid
function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the boundary
between the probability of 0 and 1. Values above the threshold tend towards 1, and values
below the threshold tend towards 0, as the sketch below illustrates.
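A minimal sketch of the sigmoid function and the 0.5 threshold is given below; the input values are arbitrary examples.

import numpy as np

def sigmoid(z):
    # Maps any real value into the (0, 1) range, giving the "S"-shaped curve.
    return 1 / (1 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)
# Threshold of 0.5: probabilities at or above it are classified as 1, below as 0.
labels = (probs >= 0.5).astype(int)
print(probs.round(3))  # [0.018 0.269 0.5   0.731 0.982]
print(labels)          # [0 0 1 1 1]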

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.

o The independent variables should not have multicollinearity.

Logistic Regression Equation:


The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
o We know the equation of a straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
o In logistic regression, y can only be between 0 and 1, so let's divide the above equation
by (1 - y):
y / (1 - y); this is 0 for y = 0 and infinity for y = 1.
o But we need a range between -[infinity] and +[infinity], so we take the logarithm of the
equation, and it becomes:
log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn
The above equation is the final equation for Logistic Regression.

Type of Logistic Regression:


On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
Example: There is a dataset which contains the information of various users obtained
from a social networking site. A car-making company has recently launched a
new SUV car, and the company wants to check how many users from the dataset want to
purchase the car.
For this problem, we will build a Machine Learning model using the Logistic regression
algorithm. The dataset is shown in the below image. In this problem, we will predict
the purchased variable (Dependent Variable) by using age and salary (Independent variables).
Steps in Logistic Regression: To implement the Logistic Regression using Python, we will use the
same steps as we have done in previous topics of Regression. Below are the steps:
o Data Pre-processing step
o Fitting Logistic Regression to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result.

1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we can
use it in our code efficiently. It will be the same as we have done in Data pre-processing topic.
The code for this is given below:

# Data pre-processing step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set = pd.read_csv('user_data.csv')

By executing the above lines of code, we will get the dataset as the output. Consider the given
image:

Now, we will extract the dependent and independent variables from the given dataset. Below is
the code for it:

# extracting the independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values
In the above code, we have taken [2, 3] for x because our independent variables are age and
salary, which are at index 2, 3. And we have taken 4 for y variable because our dependent
variable is at index 4. The output will be:

Now we will split the dataset into a training set and test set. Below is the code for it:
# splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
The output for this is given below:
For test set:

For training set:

In logistic regression, we will do feature scaling because we want accurate prediction results.
Here we will only scale the independent variables, because the dependent variable has only 0 and 1
values. Below is the code for it:

1. #feature Scaling
2. from sklearn.preprocessing import StandardScaler
3. st_x= StandardScaler()
4. x_train= st_x.fit_transform(x_train)
5. x_test= st_x.transform(x_test)

The scaled output is given below:

2. Fitting Logistic Regression to the Training set:


Our dataset is now well prepared, so we will train the model using the training set. To provide
training, i.e., to fit the model to the training set, we will import
the LogisticRegression class of the sklearn library.
After importing the class, we will create a classifier object and use it to fit the model to the
logistic regression. Below is the code for it:
1. #Fitting Logistic Regression to the training set
2. from sklearn.linear_model import LogisticRegression
3. classifier= LogisticRegression(random_state=0)
4. classifier.fit(x_train, y_train)
Output: By executing the above code, we will get the below output:
Out[5]:
1. LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
2. intercept_scaling=1, l1_ratio=None, max_iter=100,
3. multi_class='warn', n_jobs=None, penalty='l2',
4. random_state=0, solver='warn', tol=0.0001, verbose=0,
5. warm_start=False)
Hence our model is well fitted to the training set.
3. Predicting the Test Result
Our model is well trained on the training set, so we will now predict the result by using test set
data. Below is the code for it:
1. #Predicting the test set result
2. y_pred= classifier.predict(x_test)
In the above code, we have created a y_pred vector to predict the test set result.
Output: By executing the above code, a new vector (y_pred) will be created under the variable
explorer option. It can be seen as:

The above output image shows the corresponding predicted users who want to purchase or not
purchase the car.
4. Test Accuracy of the result
Now we will create the confusion matrix here to check the accuracy of the classification. To
create it, we need to import the confusion_matrix function of the sklearn library. After
importing the function, we will call it using a new variable cm. The function takes two
parameters, mainly y_true (the actual values) and y_pred (the predicted values returned by the
classifier). Below is the code for it:
1. #Creating the Confusion matrix
2. from sklearn.metrics import confusion_matrix
3. cm= confusion_matrix(y_test, y_pred)
Output:
By executing the above code, a new confusion matrix will be created. Consider the below
image:

We can find the accuracy of the predicted result by interpreting the confusion matrix. From the above
output, we can interpret that there are 65+24 = 89 correct predictions and 8+3 = 11 incorrect predictions,
which gives an accuracy of 89/100 = 89%.
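As a small follow-up sketch (assuming the cm, y_test and y_pred variables created in the steps above), the same accuracy can be computed directly in code:

# Sketch: accuracy from the confusion matrix (assumes cm, y_test, y_pred from above)
from sklearn.metrics import accuracy_score

accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()    # correct predictions / all predictions
print(accuracy)                                # 89/100 = 0.89 for the output above
print(accuracy_score(y_test, y_pred))          # should give the same value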
5. Visualizing the training set result
Finally, we will visualize the training set result. To visualize the result, we will
use ListedColormap class of matplotlib library. Below is the code for it:
1. #Visualizing the training set result
2. from matplotlib.colors import ListedColormap
3. x_set, y_set = x_train, y_train
4. x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step =
0.01),
5. nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
6. mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
7. alpha = 0.75, cmap = ListedColormap(('purple','green' )))
8. mtp.xlim(x1.min(), x1.max())
9. mtp.ylim(x2.min(), x2.max())
10. for i, j in enumerate(nm.unique(y_set)):
11. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
12. c = ListedColormap(('purple', 'green'))(i), label = j)
13. mtp.title('Logistic Regression (Training set)')
14. mtp.xlabel('Age')
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()
In the above code, we have imported the ListedColormap class of the Matplotlib library to create
the colormap for visualizing the result. We have created two new variables x_set and y_set to
replace x_train and y_train. After that, we have used the nm.meshgrid command to create a
rectangular grid that extends from the minimum value minus 1 to the maximum value plus 1 of each
feature, with pixel points at a resolution of 0.01.
To create a filled contour, we have used the mtp.contourf command, which creates regions of the
provided colors (purple and green). In this function, we have passed classifier.predict to show
the regions predicted by the classifier.
Output: By executing the above code, we will get the below output:

The graph can be explained in the below points:


o In the above graph, we can see that there are some Green points within the green
region and Purple points within the purple region.
o All these data points are the observation points from the training set, which shows the
result for purchased variables.
o This graph is made by using two independent variables i.e., Age on the x-
axis and Estimated salary on the y-axis.
o The purple point observations are for which purchased (dependent variable) is probably
0, i.e., users who did not purchase the SUV car.
o The green point observations are for which purchased (dependent variable) is probably
1 means user who purchased the SUV car.
o We can also estimate from the graph that the users who are younger with low salary,
did not purchase the car, whereas older users with high estimated salary purchased the
car.
o But there are some purple points in the green region (buying the car) and some green
points in the purple region (not buying the car), i.e., observations that the classifier has
predicted incorrectly. So we can say that some younger users with a high estimated salary
purchased the car, whereas some older users with a low estimated salary did not purchase the car.
The goal of the classifier:
We have successfully visualized the training set result for the logistic regression, and our goal
for this classification is to divide the users who purchased the SUV car and who did not
purchase the car. So from the output graph, we can clearly see the two regions (Purple and
Green) with the observation points. The Purple region is for those users who didn't buy the car,
and Green Region is for those users who purchased the car.
Linear Classifier:
As we can see from the graph, the classifier is a straight line, i.e., linear in nature, because we have
used a linear model for Logistic Regression. In further topics, we will learn about non-linear
classifiers.
Visualizing the test set result:
Our model is well trained using the training dataset. Now, we will visualize the result for new
observations (Test set). The code for the test set will remain same as above except that here we
will use x_test and y_test instead of x_train and y_train. Below is the code for it:

1. #Visulaizing the test set result


2. from matplotlib.colors import ListedColormap
3. x_set, y_set = x_test, y_test
4. x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step =
0.01),
5. nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
6. mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
7. alpha = 0.75, cmap = ListedColormap(('purple','green' )))
8. mtp.xlim(x1.min(), x1.max())
9. mtp.ylim(x2.min(), x2.max())
10. for i, j in enumerate(nm.unique(y_set)):
11. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
12. c = ListedColormap(('purple', 'green'))(i), label = j)
13. mtp.title('Logistic Regression (Test set)')
14. mtp.xlabel('Age')
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()
Output:
The above graph shows the test set result. As we can see, the graph is divided into two regions
(Purple and Green). And Green observations are in the green region, and Purple observations
are in the purple region. So we can say it is a good prediction and model. Some of the green and
purple data points are in different regions, which can be ignored as we have already calculated
this error using the confusion matrix (11 Incorrect output).
Hence our model is pretty good and ready to make new predictions for this classification
problem.

13) What is SVM explains in Machine Learning with an example?

What are support vector machines?

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put the new data point in the correct
category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed the Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified using
a decision boundary or hyperplane:
Example: SVM can be understood with the example that we have used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs. If we want a model that can
accurately identify whether it is a cat or a dog, such a model can be created by using the SVM
algorithm. We will first train our model with lots of images of cats and dogs so that it can learn
about the different features of cats and dogs, and then we test it with this strange creature.
Since SVM creates a decision boundary between these two classes (cat and dog) using the extreme
cases (support vectors), it will consider the extreme cases of cat and dog and, on the basis of the
support vectors, classify the creature as a cat. Consider the below diagram:

SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset can
be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called the Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means
if a dataset cannot be classified by using a straight line, then such data is termed non-
linear data, and the classifier used is called the Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-
dimensional space, but we need to find out the best decision boundary that helps to classify the
data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the features present in the dataset, which means
if there are 2 features (as shown in image), then hyperplane will be a straight line. And if there
are 3 features, then hyperplane will be a 2-dimension plane.
We always create a hyperplane that has a maximum margin, which means the maximum
distance between the hyperplane and the nearest data points of either class.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the position
of the hyperplane are termed as Support Vector. Since these vectors support the hyperplane,
hence called a Support vector.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a
dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We
want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue.
Consider the below image:

Since it is a 2-D space, we can easily separate these two classes just by using a straight line. But
there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary
or region is called as a hyperplane. SVM algorithm finds the closest point of the lines from both
the classes. These points are called support vectors. The distance between the vectors and the
hyperplane is called as margin. And the goal of SVM is to maximize this margin.
The hyperplane with maximum margin is called the optimal hyperplane.

Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have
used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be
calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-D space, the separating boundary looks like a plane parallel to the x-axis. If we
convert it back to 2-D space with z = 1, it will become as:

Hence we get a circumference of radius 1 in the case of non-linear data.
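A minimal sketch of this idea (the sample points are invented, not the dataset used elsewhere in this material): computing the extra dimension z explicitly shows that a linear threshold such as z = 1 corresponds to a circle of radius 1 in the original 2-D space.

# Sketch: explicit feature mapping z = x^2 + y^2 for non-linearly separable points
import numpy as np

points = np.array([[0.2, 0.3], [0.5, -0.4], [1.5, 1.0], [-1.2, 1.1]])   # toy (x, y) points
z = points[:, 0] ** 2 + points[:, 1] ** 2                               # new third dimension

# A linear threshold in z (here z = 1) is a circle of radius 1 in the original space
labels = np.where(z < 1, 'inner class', 'outer class')
print(list(zip(z.round(2), labels)))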


Python Implementation of Support Vector Machine
Now we will implement the SVM algorithm using Python. Here we will use the same
dataset user_data, which we have used in Logistic regression and KNN classification.
o Data Pre-processing step
Till the Data pre-processing step, the code will remain the same. Below is the code:
1. #Data Pre-processing Step
2. # importing libraries
3. import numpy as nm
4. import matplotlib.pyplot as mtp
5. import pandas as pd
6.
7. #importing datasets
8. data_set= pd.read_csv('user_data.csv')
9.
10. #Extracting Independent and dependent Variable
11. x= data_set.iloc[:, [2,3]].values
12. y= data_set.iloc[:, 4].values
13.
14. # Splitting the dataset into training and test set.
15. from sklearn.model_selection import train_test_split
16. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
17. #feature Scaling
18. from sklearn.preprocessing import StandardScaler
19. st_x= StandardScaler()
20. x_train= st_x.fit_transform(x_train)
21. x_test= st_x.transform(x_test)
After executing the above code, we will pre-process the data. The code will give the dataset as:

The scaled output for the test set will be:


Fitting the SVM classifier to the training set:
Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we will
import SVC class from Sklearn.svm library. Below is the code for it:
1. from sklearn.svm import SVC # "Support vector classifier"
2. classifier = SVC(kernel='linear', random_state=0)
3. classifier.fit(x_train, y_train)
In the above code, we have used kernel='linear', as here we are creating SVM for linearly
separable data. However, we can change it for non-linear data. And then we fitted the classifier
to the training dataset(x_train, y_train)
Output:
Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)
The model performance can be altered by changing the value of C(Regularization factor),
gamma, and kernel.
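For example, a hedged sketch of a non-linear variant using the RBF kernel (the hyperparameter values shown are illustrative defaults, not tuned for this dataset):

# Sketch: a non-linear SVM with the RBF kernel (assumes x_train, y_train, x_test, y_test from above)
from sklearn.svm import SVC

rbf_classifier = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=0)
rbf_classifier.fit(x_train, y_train)
print(rbf_classifier.score(x_test, y_test))   # test-set accuracy, for comparison with the linear kernel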
o Predicting the test set result:
Now, we will predict the output for test set. For this, we will create a new vector y_pred.
Below is the code for it:
1. #Predicting the test set result
2. y_pred= classifier.predict(x_test)
After getting the y_pred vector, we can compare the result of y_pred and y_test to check the
difference between the actual value and predicted value.
Output: Below is the output for the prediction of the test set:
o Creating the confusion matrix:
Now we will see the performance of the SVM classifier that how many incorrect
predictions are there as compared to the Logistic regression classifier. To create the
confusion matrix, we need to import the confusion_matrix function of the sklearn
library. After importing the function, we will call it using a new variable cm. The function
takes two parameters, mainly y_true( the actual values) and y_pred (the targeted value
return by the classifier). Below is the code for it:
1. #Creating the Confusion matrix
2. from sklearn.metrics import confusion_matrix
3. cm= confusion_matrix(y_test, y_pred)
Output:
As we can see in the above output image, there are 66+24 = 90 correct predictions and 8+2 = 10
incorrect predictions. Therefore, we can say that our SVM model improved as compared to the
Logistic regression model.
o Visualizing the training set result:
Now we will visualize the training set result, below is the code for it:
1. from matplotlib.colors import ListedColormap
2. x_set, y_set = x_train, y_train
3. x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step =
0.01),
4. nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
5. mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
6. alpha = 0.75, cmap = ListedColormap(('red', 'green')))
7. mtp.xlim(x1.min(), x1.max())
8. mtp.ylim(x2.min(), x2.max())
9. for i, j in enumerate(nm.unique(y_set)):
10. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
11. c = ListedColormap(('red', 'green'))(i), label = j)
12. mtp.title('SVM classifier (Training set)')
13. mtp.xlabel('Age')
14. mtp.ylabel('Estimated Salary')
15. mtp.legend()
16. mtp.show()
Output:
By executing the above code, we will get the output as:

As we can see, the above output is appearing similar to the Logistic regression output. In the
output, we got the straight line as hyperplane because we have used a linear kernel in the
classifier. And we have also discussed above that for the 2d space, the hyperplane in SVM is a
straight line.
o Visualizing the test set result:
1. #Visulaizing the test set result
2. from matplotlib.colors import ListedColormap
3. x_set, y_set = x_test, y_test
4. x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step =
0.01),
5. nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
6. mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
7. alpha = 0.75, cmap = ListedColormap(('red','green' )))
8. mtp.xlim(x1.min(), x1.max())
9. mtp.ylim(x2.min(), x2.max())
10. for i, j in enumerate(nm.unique(y_set)):
11. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
12. c = ListedColormap(('red', 'green'))(i), label = j)
13. mtp.title('SVM classifier (Test set)')
14. mtp.xlabel('Age')
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()
Output:
By executing the above code, we will get the output as:

As we can see in the above output image, the SVM classifier has divided the users into two
regions (Purchased or Not purchased). Users who purchased the SUV are in the red region with
the red scatter points, and users who did not purchase the SUV are in the green region with
green scatter points. The hyperplane has separated the two classes of the purchased variable
(purchased and not purchased).
14) What is Binary Classification? Explain different types of Binary Classification with
examples?
In machine learning, binary classification is a supervised learning algorithm that categorizes
new observations into one of two classes.
The following are a few binary classification applications, where the 0 and 1 columns are
two possible classes for each observation:
Application              Observation        0            1
Medical Diagnosis        Patient            Healthy      Diseased
Email Analysis           Email              Not Spam     Spam
Financial Data Analysis  Transaction        Not Fraud    Fraud
Marketing                Website visitor    Won't Buy    Will Buy
Image Classification     Image              Hotdog       Not Hotdog
Quick example
In a medical diagnosis, a binary classifier for a specific disease could take a patient's
symptoms as input features and predict whether the patient is healthy or has the disease.
The possible outcomes of the diagnosis are positive and negative.
Evaluation of binary classifiers
If the model correctly predicts a diseased patient as positive, this case is called a True Positive
(TP). If the model correctly predicts a healthy patient as negative, this is called a True Negative
(TN). The binary classifier may also misdiagnose some patients. If a diseased patient is
classified as healthy by a negative test result, this error is called a False Negative (FN).
Similarly, if a healthy patient is classified as diseased by a positive test result, this error is
called a False Positive (FP).
We can evaluate a binary classifier based on the following parameters:
 True Positive (TP): The patient is diseased and the model predicts "diseased"
 False Positive (FP): The patient is healthy but the model predicts "diseased"
 True Negative (TN): The patient is healthy and the model predicts "healthy"
 False Negative (FN): The patient is diseased and the model predicts "healthy"
After obtaining these values, we can compute the accuracy score of the binary classifier as
follows: accuracy= (TP+TN)/(TP+FP+TN+FN)
The following is a confusion matrix, which represents the above parameters:
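As a small sketch of these definitions (the two label vectors below are invented for illustration only), the four counts and the accuracy can be computed directly from actual and predicted labels:

# Sketch: computing TP, FP, TN, FN and accuracy for a binary classifier (toy labels)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]    # 1 = diseased, 0 = healthy (invented labels)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (TP + TN) / (TP + FP + TN + FN)   # (3 + 3) / 8 = 0.75 for these toy labels
print(TP, TN, FP, FN, accuracy)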
In machine learning, many methods utilize binary classification. The most common are:
 Support Vector Machines
 Naive Bayes
 Nearest Neighbor
 Decision Trees
 Logistic Regression
 Neural Networks
There are perhaps four main types of classification tasks that you may encounter; they are:

 Binary Classification
 Multi-Class Classification
 Multi-Label Classification
 Imbalanced Classification
Multi Class Classification:
Classification means categorizing data and forming groups based on the similarities. In a
dataset, the independent variables or features play a vital role in classifying our data. When
we talk about multiclass classification, we have more than two classes in our dependent or
target variable, as can be seen in Fig.1:

The above picture is taken from the Iris dataset which depicts that the target variable has
three categories i.e., Virginica, setosa, and Versicolor, which are three species of Iris plant.
We might use this dataset later, as an example of a conceptual understanding of multiclass
classification.
Which classifiers do we use in multiclass classification? When do we use them?
We use many algorithms such as Naïve Bayes, Decision trees, SVM, Random forest classifier,
KNN, and logistic regression for classification. But we might learn about only a few of them
here because our motive is to understand multiclass classification. So, using a few
algorithms we will try to cover almost all the relevant concepts related to multiclass
classification.
Naive Bayes
Naive Bayes is a parametric algorithm which means it requires a fixed set of parameters or
assumptions to simplify the machine’s learning process. In parametric algorithms, the
number of parameters used is independent of the size of training data.
Naïve Bayes Assumption:
 It assumes that the features of a dataset are completely independent of each other.
But this is generally not true, which is why we also call it a 'naïve' algorithm.
It is a classification model based on conditional probability and uses Bayes theorem to
predict the class of unknown datasets. This model is mostly used for large datasets as it is
easy to build and is fast for both training and making predictions. Moreover, without
hyperparameter tuning, it can give you better results as compared to other algorithms.
Naïve Bayes can also be an extremely good text classifier as it performs well, such as in the
spam ham dataset.
Bayes theorem is stated as:

P(A|B) = [P(B|A) × P(A)] / P(B)
 By P (A|B), we are trying to find the probability of event A given that event B is
true. It is also known as the posterior probability.
 Event B is known as the evidence.
 P (A) is called the prior of A, which means it is the probability of the event before the
evidence is seen.
 P (B|A) is known as the conditional probability or likelihood.
Note: Naïve Bayes is a linear classifier, which might not be suitable for classes that are not
linearly separable in a dataset. Let us look at the figure below:
As can be seen in Fig.2b, classifiers such as KNN can be used for non-linear classification
instead of the Naïve Bayes classifier.
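As a brief hedged sketch (the tiny arrays are invented for illustration), a Naive Bayes classifier can be fitted with scikit-learn in a few lines; the GaussianNB variant shown here assumes the features follow a normal distribution:

# Sketch: Gaussian Naive Bayes on a tiny invented dataset
from sklearn.naive_bayes import GaussianNB
import numpy as np

X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.2], [4.1, 3.9]])   # toy features
y = np.array([0, 0, 1, 1])                                       # two classes

model = GaussianNB()
model.fit(X, y)
print(model.predict([[1.1, 2.0]]))         # expected to fall in class 0
print(model.predict_proba([[1.1, 2.0]]))   # posterior probabilities P(class | features)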
KNN (K-nearest neighbours)
KNN is a supervised machine learning algorithm that can be used to solve both classification
and regression problems. It is one of the simplest algorithms, yet a powerful one. It does not
learn a discriminative function from the training data but memorizes the training data
instead. For this very reason, it is also known as a lazy algorithm.
How it works?
The K-nearest neighbor algorithm forms a majority vote between the K most similar
instances, and it uses a distance metric between the two data points for defining them as
similar. The most popular choice is the Euclidean distance, which is written as:

d(x, y) = sqrt((x1 - y1)² + (x2 - y2)² + ... + (xn - yn)²)
K in KNN is the hyperparameter that can be chosen by us to get the best possible fit for the
dataset. If we keep the smallest value for K, i.e. K=1, then the model will show low bias but
high variance, because our model will be overfitted in this case. Whereas a larger value for
K, let's suppose K=10, will smoothen our decision boundary, which means low
variance but high bias. So we always go for a trade-off between bias and variance,
known as the bias-variance trade-off.
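A minimal hedged sketch of this majority-vote idea (the toy points below are invented), comparing a small and a slightly larger K with scikit-learn:

# Sketch: KNN with Euclidean distance, comparing two values of K (toy data)
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])   # toy features
y = np.array([0, 0, 0, 1, 1, 1])                                 # two classes

for k in (1, 3):
    knn = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    knn.fit(X, y)
    print(k, knn.predict([[2, 2], [5, 5]]))   # majority vote among the k nearest points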
Let us understand more about it by looking at its advantages and disadvantages:
Advantages-
 KNN makes no assumptions about the distribution of classes i.e. it is a non-
parametric classifier
 It is one of the methods that can be widely used in multiclass classification
 It does not get impacted by the outliers
 This classifier is easy to use and implement
Disadvantages-
 The K value is difficult to find, as it must work well with test data also, not only with
the training data
 It is a lazy algorithm, as it does not build an explicit model
 It is computationally expensive because it measures the distance to every data point
Decision Trees
As the name suggests, the decision tree is a tree-like structure of decisions made based on
some conditional statements. This is one of the most used supervised learning methods in
classification problems because of their high accuracy, stability, and easy interpretation.
They can map linear as well as non-linear relationships in a good way.
Let us look at the figure below, Fig.3, where we have used adult census income dataset with
two independent variables and one dependent variable. Our target or dependent variable is
income, which has binary classes i.e, <=50K or >50K.
Fig 3: Decision Tree- Binary Classifier
We can see that the algorithm works based on some conditions, such as Age <50 and
Hours>=40, to further split into two buckets for reaching towards homogeneity. Similarly,
we can move ahead for multiclass classification problem datasets, such as Iris data.
Now a question arises in our mind: how should we decide which column to take first, and
what is the threshold for splitting? For splitting a node and deciding the threshold for splitting,
we use entropy or the Gini index as measures of the impurity of a node. We aim to maximize the
purity or homogeneity of each split, as we saw in Fig.3.
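As a short sketch of these impurity measures (the class counts are invented), entropy and the Gini index of a candidate node can be computed as follows; a pure node gives 0 for both:

# Sketch: entropy and Gini impurity for a node with invented class counts
import numpy as np

counts = np.array([40, 10])          # e.g. 40 samples of <=50K and 10 of >50K in a node
p = counts / counts.sum()            # class proportions

entropy = -np.sum(p * np.log2(p))    # about 0.72 bits for this 80/20 split
gini = 1 - np.sum(p ** 2)            # about 0.32 for the same split
print(entropy, gini)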
Confusion Matrix in Multi-class Classification
A confusion matrix is a table used in every classification problem to describe the
performance of a model on test data.
Just as with the confusion matrix in binary classification, in multiclass classification we can
also find precision and recall.
Let’s take an example to have a better idea about confusion matrix in multiclass
classification using Iris dataset which we have already seen above in this article.

Finding precision and recall from above Table.1:


Precision for Virginica class is the number of correctly predicted virginica species out of all
the predicted virginica species, which is 4/7 = 57.1%. This means that only 4/7 of the species
that our predictor classifies as Virginica are actually virginica. Similarly, we can find for
other species i.e. for Setosa and Versicolor, precision is 20% and 62.5% respectively.
Whereas, Recall for Virginica class is the number of correctly predicted virginica species out
of actual virginica species, which is 50%. This means that our classifier classified half of the
virginica species as virginica. Similarly, we can find for other species i.e. for Setosa and
Versicolor, recall is 20% and 71.4% respectively.
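A hedged sketch of computing per-class precision and recall with scikit-learn (the label vectors below are invented stand-ins, not the actual table referred to above):

# Sketch: per-class precision and recall for a 3-class problem (invented labels)
from sklearn.metrics import precision_score, recall_score

y_true = ['setosa', 'versicolor', 'virginica', 'virginica', 'versicolor', 'setosa']
y_pred = ['setosa', 'versicolor', 'versicolor', 'virginica', 'versicolor', 'virginica']

labels = ['setosa', 'versicolor', 'virginica']
print(precision_score(y_true, y_pred, labels=labels, average=None))   # precision per class
print(recall_score(y_true, y_pred, labels=labels, average=None))      # recall per class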

15) Write the differences between Binary and Multi class Classification?

Parameters: No. of classes
o Binary classification: It is a classification into two groups, i.e., it classifies objects into at most two classes.
o Multi-class classification: There can be any number of classes, i.e., it classifies the object into more than two classes.

Parameters: Algorithms used
o Binary classification: The most popular algorithms used for binary classification are Logistic Regression, k-Nearest Neighbors, Decision Trees, Support Vector Machine, and Naive Bayes.
o Multi-class classification: Popular algorithms that can be used for multi-class classification include k-Nearest Neighbors, Decision Trees, Naive Bayes, Random Forest, and Gradient Boosting.

Parameters: Examples
o Binary classification: Email spam detection (spam or not), churn prediction (churn or not), conversion prediction (buy or not).
o Multi-class classification: Face classification, plant species classification, optical character recognition.

16) What is MNIST in Machine Learning?


The MNIST database (Modified National Institute of Standards and Technology database) is a
large database of handwritten digits that is commonly used for training various image
processing systems. The database is also widely used for training and testing in the field of
machine learning.
It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that
since NIST's training dataset was taken from American Census Bureau employees, while the
testing dataset was taken from American high school students, it was not well-suited for
machine learning experiments. Furthermore, the black and white images from NIST were
normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale
levels.
The MNIST database contains 60,000 training images and 10,000 testing images. Half of the
training set and half of the test set were taken from NIST's training dataset, while the other half
of the training set and the other half of the test set were taken from NIST's testing dataset. The
original creators of the database keep a list of some of the methods tested on it.
In their original paper, they use a support-vector machine to get an error rate of 0.8%. An
extended dataset similar to MNIST, called EMNIST, was published in 2017; it contains 240,000
training images and 40,000 testing images of handwritten digits and characters.
What is MNIST used for?
MNIST provides a baseline for testing image processing systems. You could consider it as the
“hello world” of machine learning. Data scientists will train an algorithm on the MNIST
dataset simply to test a new architecture or framework, to ensure that they work.
Because MNIST is a labeled dataset that pairs images of hand-written numerals with the name
of the respective numeral, it can be used in supervised learning to train classifiers. It is a good
example, alongside Fei Fei Li’s ImageNet, of how a good, labeled dataset can advance the cause
of machine learning more broadly. More examples of open datasets are here.

Is MNIST data binary?
The MNIST database was constructed from NIST's Special Database 3 and Special Database 1
which contain binary images of handwritten digits. NIST originally designated SD-3 as their
training set and SD-1 as their test set. However, SD-3 is much cleaner and easier to recognize
than SD-1. The reason for this is that SD-3 was collected among Census
Bureau employees, while SD-1 was collected among high-school students. Drawing sensible
conclusions from learning experiments requires that the result be independent of the choice of
training set and test set among the complete set of samples. Therefore, it was necessary to build a
new database by mixing NIST's datasets.

How MNIST dataset is created?
The MNIST handwritten digit classification problem is a standard dataset used in computer
vision and deep learning.
Although the dataset is effectively solved, it can be used as the basis for learning and practicing
how to develop, evaluate, and use convolutional deep learning neural networks for image
classification from scratch. This includes how to develop a robust test harness for estimating
the performance of the model, how to explore improvements to the model, and how to save
the model and later load it to make predictions on new data.
Instead of reviewing the literature on well-performing models on the dataset, we can develop a
new model from scratch.
The dataset already has a well-defined train and test dataset that we can use.
In order to estimate the performance of a model for a given training run, we can further split
the training set into a train and validation dataset. Performance on the train and validation
dataset over each run can then be plotted to provide learning curves and insight into how well a
model is learning the problem.
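As a hedged sketch (assuming TensorFlow/Keras is installed; the 80/20 split ratio is an arbitrary choice for illustration), the dataset can be loaded and the training set further split into train and validation parts as described above:

# Sketch: loading MNIST and carving a validation set out of the training data
from tensorflow.keras.datasets import mnist
from sklearn.model_selection import train_test_split

(x_train, y_train), (x_test, y_test) = mnist.load_data()   # 60,000 train / 10,000 test images
print(x_train.shape)                                       # (60000, 28, 28)

# Hold out part of the training images for validation, as described above
x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=0)
print(x_tr.shape, x_val.shape)                             # (48000, 28, 28) (12000, 28, 28)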

17) What is RANKING IN Machine Learning? How it works? Why should we care and its uses?
Ranking is a type of machine learning that sorts data in a relevant order. Companies use ranking
to optimize search and recommendations.
Outline
 What is a ranking model?
 How does ranking work?
 Why should I care?
 Use cases
 The fastest way to build a ranking model
What is a ranking model?
Ranking is a type of supervised machine learning (ML) that uses labeled datasets to train models
that sort data in order to predict outcomes. Quite simply, the goal of a ranking
model is to sort data in an optimal and relevant order.
Ranking was first largely deployed within search engines. People search for a topic, the
ranking algorithm orders the results by relevance (for example, using PageRank), and the search
engine displays the most relevant results to its customers.
Until recently, most ranking models, and ML as a whole, were limited in their scope of use, as
most companies didn't have enough data to power these algorithms. Better methods for data
collection and more intuitive ML tools have made it possible for nearly anyone to deploy a
successful ranking model within their business.
How does ranking work?
As we’ll discuss later in this blog, ranking is incredibly versatile and dependent on the data a
company has. Even so, a common framework guides the construction of all ranking models.
Ranking models are made up of two main factors: queries and documents. Queries are any input
value, such as a question on Google or an interaction on an e-commerce site. Documents are
the output values, or results, of the query. Given the query and the associated documents, a
scoring function, using a list of parameters to rank on, scores the documents so that they can be
sorted in order of relevancy.
The learning-to-rank algorithm takes the scores from this model and uses
them to predict outcomes on a new and unseen list of documents.

As an example, a search for “Mage” is done on Google Search (“Mage” is the query). After the
search, a list of associated documents matching the query will be displayed (Mage A.I., Mage
definition, Mage World of Warcraft, etc.). The function will score each of the documents based
on their relevance to the query (Mage A.I. = 1, Mage definition = 2, Mage World of Warcraft =3,
and so on). The documents with higher scores will be ranked higher when there is a search for
Mage.
Data required for a ranking model consists of documents from a query, user profiles, user
behaviors, search history, clicks, etc.
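As a simple illustrative sketch (not any company's actual ranking function; the scoring rule and data are made up), documents can be scored against a query and then sorted by that score:

# Sketch: scoring and sorting documents for a query (made-up heuristic, toy data)
query = "mage"
documents = [
    {"title": "Mage definition", "clicks": 80},
    {"title": "Mage A.I.", "clicks": 120},
    {"title": "Mage World of Warcraft", "clicks": 60},
]

def score(doc):
    # Toy relevance: keyword match plus a small boost from engagement (clicks)
    keyword = 1.0 if query.lower() in doc["title"].lower() else 0.0
    return keyword + doc["clicks"] / 1000.0

ranked = sorted(documents, key=score, reverse=True)
print([d["title"] for d in ranked])   # most relevant documents first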
Why should I care?
Ranking ensures that the most relevant results appear first on a customer’s search, maximizing
the chances they will find something of interest, and minimizing the chances of churn. With so
many options for organic web search, the need to stay competitive has never been greater.
According to a Google study, 61% of users said if they didn’t find what they were looking for
right away, they would quickly move on to another site. Depending on available data,
companies can use ranking within their web pages and apps to serve their customers the most
relevant results as soon as they enter.
Use cases:
The most successful companies are using ranking within their software to improve the user
experience. Ranking has allowed these companies to create customized feeds for each user
based on their past search and buying history. Ranking carries many use cases across industries;
nearly anyone with data can and should be using ranking in some capacity to optimize their
business. A few use cases are:
1. Search results
2. Targeted ads
3. Recommendations
Here are a few companies who have used ranking to maximize user engagement.
 Amazon
With millions of listings or documents, for every product search or query, Amazon
needed to find a way to rank its products in order to maximize the chance of purchase.
Using a combination of individual preferences, gathered from users' search and
purchasing history and a product’s popularity, Amazon created a ranking system that
would display the most relevant products at the top of their feed. Additionally, ranking
was used in Amazon’s recommendation system, which would use users' ranked
preferences in order to predict what products a user is most likely to purchase in the
future.
 Netflix
Similar to Amazon, Netflix uses ranking to fuel their recommendation system. The
recommendation system predicts what content a user is most likely to watch and
displays the most relevant content at the top of the home page. Netflix uses a few
different features to rank and recommend content; such as: watch history, search
history, and general popularity. They also use ranking to fuel their collaborative filtering.
 TikTok
TikTok’s standout feature is the For You page which is built on a ranking system. This
feature has allowed TikTok to customize each home page to be reflective of the
preferences and interests of its user. TikTok uses similar metrics to Netflix to rank its
content: watch history, re-watch rate, and engagement. Similar to Netflix, TikTok’s
ranking system also aids in collaborative filtering.
 Starbucks
Starbucks found great success with their mobile app, which is one of the most downloaded
apps on the App Store. The app allows Starbucks to create a custom user experience for their
customers even when they’re not within a physical coffee shop. The app uses ranking to
recommend the most relevant products to users. Taking into account order history, new
products and general popularity of other products, Starbucks is able to keep customers' favorite
orders at the top of the recommended search while introducing them to new products that
they are most likely to enjoy.
The fastest way to build a ranking model
For the companies listed above, entire teams of data scientists and AI engineers were built to
create and maintain the ranking systems in place. The cost to build these teams is impractical
for most businesses. Recently, there have been great tools emerging which allow for the easy
building and deployment of ranking models–this with little to no programming experience.
Mage allows for the building and deployment of a ranking model with no ML programming
knowledge. To use Mage, a database containing a list of queries and documents is first
uploaded. Queries could contain a list of clothes or menu items, and their documents could be the
engagement (clicks and purchases) each received. The greater the quality and
quantity of the data uploaded, the better Mage is able to produce ranking predictions.
Once the data is uploaded, users will be given the option to transform their datasets by
removing and adding columns, applying transformer actions: split and filter data, group values,
aggregate data, and identifying what columns they would like to rank. Mage will then produce a
ranking model which can be deployed into your data warehouses, downloaded to a CSV file, or
saved directly to a Mage dataset.

Note: In some frameworks, such as Oracle's OML4SQL, ranking falls under the regression
function, and the XGBoost algorithm is supported for ranking.
