UNIT-2 Material
Regression
Regression analysis is a statistical method for modelling the relationship between a dependent
(target) variable and one or more independent (predictor) variables. More specifically,
regression analysis helps us understand how the value of the dependent variable changes when
one independent variable is varied while the other independent variables are held fixed. It
predicts continuous/real values such as temperature, age, salary, price, etc.
We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A that runs various advertisements every year
and gets sales in return. The list below shows the advertisements made by the company in the
last 5 years and the corresponding sales:
Now the company wants to spend $200 on advertisement in the year 2019 and wants to predict
the sales for that year. To solve such prediction problems in machine learning, we need
regression analysis.
Regression is a supervised learning technique
that helps in finding the correlation between variables and enables us to predict a
continuous output variable based on one or more predictor variables. It is mainly used
for prediction, forecasting, time-series modelling, and determining the cause-and-effect
relationship between variables.
In regression, we plot a graph between the variables that best fits the given data points; using
this plot, the machine learning model can make predictions about the data. In simple
words, "Regression shows a line or curve that passes through the data points on the target-
predictor graph in such a way that the vertical distance between the data points and the
regression line is minimum." The distance between the data points and the line tells whether
the model has captured a strong relationship or not.
Some examples of regression can be as:
o Prediction of rain using temperature and other factors
o Determining Market trends
o Prediction of road accidents due to rash driving.
Terminologies Related to the Regression Analysis:
o Dependent Variable: The main factor in Regression analysis which we want to predict or
understand is called the dependent variable. It is also called target variable.
o Independent Variable: The factors which affect the dependent variables or which are
used to predict the values of the dependent variables are called independent variable,
also called as a predictor.
o Outliers: Outlier is an observation which contains either very low value or very high
value in comparison to other observed values. An outlier may hamper the result, so it
should be avoided.
o Multicollinearity: If the independent variables are highly correlated with each other, this
condition is called multicollinearity. It should not be present in the dataset, because it
creates problems when ranking the most important variables.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but
not with the test dataset, the problem is called overfitting. If our algorithm does not
perform well even on the training dataset, the problem is called underfitting.
Why do we use Regression Analysis?
As mentioned above, regression analysis helps in the prediction of a continuous variable. There
are various real-world scenarios where we need future predictions, such as weather conditions,
sales figures, marketing trends, etc., and for these we need a technique that can make
predictions as accurately as possible. Regression analysis is such a technique: a statistical
method used in machine learning and data science. Below are some other reasons for using
regression analysis:
o Regression estimates the relationship between the target and the independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing the regression, we can confidently determine the most important factor,
the least important factor, and how each factor is affecting the other factors.
Types of Regression
There are various types of regression used in data science and machine learning. Each type has
its own importance in different scenarios, but at the core, all regression methods analyze the
effect of the independent variables on the dependent variable. Some important types of
regression are discussed below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Classification
The Classification algorithm is a Supervised Learning technique that is used to identify the
category of new observations on the basis of training data. In Classification, a program learns
from the given dataset or observations and then classifies new observation into a number of
classes or groups. Such as, Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be
called as targets/labels or categories.
Unlike regression, the output variable of Classification is a category, not a value, such as "Green
or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised learning
technique, hence it takes labeled input data, which means it contains input with the
corresponding output.
In a classification algorithm, a discrete output (y) is produced as a function of the input variable (x):
y = f(x), where y is the categorical output
The best example of an ML classification algorithm is Email Spam Detector.
The main goal of the Classification algorithm is to identify the category of a given dataset, and
these algorithms are mainly used to predict the output for the categorical data.
Classification algorithms can be better understood using the below diagram. In the below
diagram, there are two classes, Class A and Class B. The points within each class have features
that are similar to each other and dissimilar to the other class.
The algorithm which implements the classification on a dataset is known as a classifier. There
are two types of Classifications:
o Binary Classifier: If the classification problem has only two possible outcomes, then it is
called as Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, then it is
called as Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.
Learners in Classification Problems:
In the classification problems, there are two types of learners:
1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives
the test data. Classification is then done on the basis of the most related data stored
in the training dataset. It takes less time in training but more time for predictions.
Example: K-NN algorithm, case-based reasoning
2. Eager Learners: Eager learners develop a classification model based on a training
dataset before receiving a test dataset. Opposite to lazy learners, an eager learner takes
more time in learning and less time in prediction. Examples: Decision Trees, Naïve Bayes,
ANN.
Types of ML Classification Algorithms:
Classification algorithms can be divided mainly into two categories:
o Linear Models
o Logistic Regression
o Support Vector Machines
o Non-linear Models
o K-Nearest Neighbours
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification
Use cases of Classification Algorithms
Classification algorithms can be used in different places. Below are some popular use cases of
Classification Algorithms:
o Email Spam Detection
o Speech Recognition
o Identifications of Cancer tumor cells.
o Drugs Classification
o Biometric Identification, etc.
2) Explain Distance Based Methods in supervised Learning?
Distance-based algorithms are machine learning algorithms that classify queries by computing
distances between these queries and a number of internally stored exemplars. Exemplars that
are closest to the query have the largest influence on the classification assigned to the query.
The abbreviation KNN stands for “K-Nearest Neighbour”. It is a supervised machine learning
algorithm. The algorithm can be used to solve both classification and regression problem
statements.
The number of nearest neighbours to a new unknown variable that has to be predicted or
classified is denoted by the symbol ‘K’.
Let’s take a good look at a related real-world scenario before we get started with this awesome
algorithm.
We are often told that we share many characteristics with our nearest peers, whether it
be our thinking process, working etiquette, philosophies, or other factors. As a result, we
build friendships with people we deem similar to us.
The KNN algorithm employs the same principle. Its aim is to locate all of the closest neighbours
around a new unknown data point in order to figure out what class it belongs to. It’s a distance-
based approach.
Consider the diagram below; it is straightforward and easy for humans to identify it as a “Cat”
based on its closest allies. This operation, however, cannot be performed directly by the
algorithm.
KNN calculates the distance from all points in the proximity of the unknown data and filters out
the ones with the shortest distances to it. As a result, it’s often referred to as a distance-based
algorithm.
In order to correctly classify the results, we must first determine the value of K (the number of
nearest neighbours).
In the following diagram, the value of K is 5. Since there are four cats and just one dog among
the five closest neighbours inside the red circle, the algorithm would predict that the new point
is a cat.
Here, 'K' is the hyperparameter for KNN. For proper classification/prediction, the value of K
must be fine-tuned.
But, How do we select the right value of K?
We don’t have a particular method for determining the correct value of K. Here, we’ll try to test
the model’s accuracy for different K values. The value of K that delivers the best accuracy for
both training and testing data is selected.
Note: it is recommended to always select an odd value of K.
When the value of K is set to even, a situation may arise in which the elements from both
groups are equal. In the diagram below, elements from both groups are equal in the internal
“Red” circle (k == 4).
In this condition, the model would be unable to do the correct classification for you. Here the
model will randomly assign any of the two classes to this new unknown data.
Choosing an odd value for K is preferred because such a tie between the two classes can never
occur: one of the two groups will always be in the majority.
The impact of selecting a smaller or larger K value on the model
Larger K value: The case of underfitting occurs when the value of k is increased. In this case, the
model would be unable to correctly learn on the training data.
Smaller k value: The condition of overfitting occurs when the value of k is smaller. The model
will capture all of the training data, including noise. The model will perform poorly for the test
data in this scenario.
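A minimal sketch of this tuning loop, assuming a labelled dataset such as scikit-learn's built-in Iris data and a simple train/test split (the odd K values and the 0.25 test size are arbitrary choices for illustration):

# Trying several odd values of K and keeping the one with the best test accuracy
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

scores = {}
for k in range(1, 20, 2):                      # odd values only, to avoid ties
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    scores[k] = knn.score(x_test, y_test)      # accuracy on the test data

best_k = max(scores, key=scores.get)
print("Accuracy per K:", scores)
print("Best K:", best_k)

The K with the highest accuracy on both training and test data would then be used for the final model.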
3) How does KNN work for ‘Classification’ and ‘Regression’ problem statements?
Classification
When the problem statement is of ‘classification’ type, KNN tends to use the concept of
“Majority Voting”. Within the given range of K values, the class with the most votes is chosen.
Consider the following diagram, in which a circle is drawn within the radius of the five closest
neighbours. Four of the five neighbours in this neighbourhood voted for ‘RED,’ while one voted
for ‘WHITE.’ It will be classified as a ‘RED’ wine based on the majority votes.
Real-world example:
Several parties compete in an election in a democratic country like India. Parties compete for
voter support during election campaigns. The public votes for the candidate with whom they
feel more connected.
When the votes for all of the candidates have been recorded, the candidate with the most
votes is declared as the election’s winner.
Regression
KNN employs a mean/average method for predicting the value of new data. Based on the value
of K, it would consider all of the nearest neighbours.
Once all the nearest neighbours within the chosen value of K have been identified, the
algorithm calculates the mean of their values.
Consider the diagram below, where the value of k is set to 3. It will now calculate the mean (52)
based on the values of these neighbours (50, 55, and 51) and allocate this value to the unknown
data.
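A small sketch of this averaging behaviour, using a hypothetical one-dimensional dataset in which the three nearest neighbours of the query have target values 50, 55, and 51:

# KNN regression: the prediction is the mean of the K nearest neighbours' values
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

x = np.array([[1], [2], [3], [10], [11]])   # feature values (assumed for illustration)
y = np.array([50, 55, 51, 90, 95])          # target values (assumed for illustration)

knn_reg = KNeighborsRegressor(n_neighbors=3)
knn_reg.fit(x, y)

print(knn_reg.predict([[2.5]]))             # -> [52.], the mean of 50, 55 and 51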
The decision tree algorithm falls under the category of supervised learning. It can be used
to solve both regression and classification problems.
Decision tree uses the tree representation to solve the problem in which each leaf node
corresponds to a class label and attributes are represented on the internal node of the
tree.
We can represent any Boolean function on discrete attributes using the decision tree.
Below are some assumptions that we make while using a decision tree:
o At the beginning, we consider the whole training set as the root.
o Feature values are preferred to be categorical. If the values are continuous, they are
discretized prior to building the model.
o Records are distributed recursively on the basis of attribute values.
o We use statistical methods for ordering attributes as the root or internal nodes.
As you can see from the above image, a decision tree works on the Sum of Products form,
which is also known as Disjunctive Normal Form. In the above image, we are predicting the
use of a computer in people's daily life.
Root Node – the node present at the beginning of a decision tree; from this node the
population starts dividing according to various features.
Decision Nodes – the nodes we get after splitting the root node are called decision nodes.
Leaf Nodes – the nodes where further splitting is not possible are called leaf nodes or terminal
nodes.
Sub-tree – just as a small portion of a graph is called a sub-graph, a sub-section of this
decision tree is called a sub-tree.
Pruning – cutting down some nodes to stop overfitting.
Example of a decision tree
Decision trees are drawn upside down, which means the root is at the top, and this root is then
split into several nodes. Decision trees are nothing but a bunch of if-else statements in
layman's terms. The tree checks whether a condition is true and, if it is, moves to the next node
attached to that decision.
In the below diagram the tree will first ask: what is the weather? Is it sunny, cloudy, or rainy?
Depending on the answer, it will go to the next feature, such as humidity or wind. It will then
check whether the wind is strong or weak; if the wind is weak and it is rainy, then the person
may go and play.
Did you notice anything in the above flowchart? We see that if the weather is cloudy then we
must go to play. Why didn't it split further? Why did it stop there?
To answer this question, we need to know about a few more concepts like entropy, information
gain, and the Gini index. But in simple terms, we can say here that the output for the training
dataset is always "yes" for cloudy weather; since there is no disorder here, we don't need to
split the node further.
The goal of machine learning is to decrease uncertainty or disorders from the dataset and for
this, we use decision trees.
6) Define the following: i) Entropy ii) Information Gain iii) Gain Ratio?
(OR)
Explain the process of picking the best splitting Attribute in Decision Trees?
Entropy:
Entropy refers to a common way to measure impurity. In the decision tree, it measures the
randomness or impurity in data sets.
Example: Suppose you have a group of friends who decides which movie they can watch
together on Sunday. There are 2 choices for movies, one is “Lucy” and the second
is “Titanic” and now everyone has to tell their choice.
After everyone gives their answer we see that “Lucy” gets 4 votes and “Titanic” gets 5 votes.
Which movie do we watch now? Isn’t it hard to choose 1 movie now because the votes for both
the movies are somewhat equal?
This is exactly what we call disorder: there is an almost equal number of votes for both movies,
and we can't really decide which movie we should watch.
It would have been much easier if the votes for “Lucy” were 8 and for “Titanic” it was 2. Here
we could easily say that the majority of votes are for “Lucy” hence everyone will be watching
this movie.
In a decision tree, the output is mostly “yes” or “no”
The formula for entropy is shown below:
E(S) = -p(+) log2 p(+) - p(-) log2 p(-)
Here p(+) is the probability of the positive class, p(-) is the probability of the negative class,
and S is the subset of the training examples.
How do Decision Trees use Entropy?
Entropy basically measures the impurity of a node. Impurity is the degree of randomness; it
tells how random our data is. A pure sub-split means that either you should be getting “yes”, or
you should be getting “no”.
Suppose a feature has 8 “yes” and 4 “no” initially, after the first split the left node gets 5 ‘yes’
and 2 ‘no’ whereas right node gets 3 ‘yes’ and 2 ‘no’.
We see here the split is not pure, why? Because we can still see some negative classes in both
the nodes. In order to make a decision tree, we need to calculate the impurity of each split, and
when the purity is 100%, we make it as a leaf node.
To check the impurity of the two child nodes we take the help of the entropy formula.
For the left node (5 "yes", 2 "no"): E = -(5/7) log2(5/7) - (2/7) log2(2/7) ≈ 0.863
For the right node (3 "yes", 2 "no"): E = -(3/5) log2(3/5) - (2/5) log2(2/5) ≈ 0.971
We can clearly see from the tree itself that left node has low entropy or more purity than right
node since left node has a greater number of “yes” and it is easy to decide here.
Always remember that the higher the Entropy, the lower will be the purity and the higher will
be the impurity.
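A small helper (a sketch, not part of any library) that computes this entropy for the two child nodes described above, the left node with 5 "yes" and 2 "no" and the right node with 3 "yes" and 2 "no":

import math

def entropy(n_yes, n_no):
    # Entropy of a node given its class counts; empty classes contribute 0
    total = n_yes + n_no
    result = 0.0
    for count in (n_yes, n_no):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy(5, 2))   # left node  ~ 0.863 (purer)
print(entropy(3, 2))   # right node ~ 0.971 (less pure)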
Information Gain
Information gain measures the reduction of uncertainty given some feature and it is also a
deciding factor for which attribute should be selected as a decision node or root node .
Example: 1
Suppose our entire population has a total of 30 instances. The dataset is to predict whether the
person will go to the gym or not. Let’s say 16 people go to the gym and 14 people don’t
Now we have two features to predict whether he/she will go to the gym or not.
Feature 1 is “Energy” which takes two values “high” and “low”
Feature 2 is “Motivation” which takes 3 values “No motivation”, “Neutral” and “Highly
motivated”.
Let’s see how our decision tree will be made using these 2 features. We’ll use information gain
to decide which feature should be the root node and which feature should be placed after the
split.
Let's calculate the entropy. Once we have the values of E(Parent) and E(Parent|Energy), the
information gain will be:
Information Gain = E(Parent) - E(Parent|Energy)
Our parent entropy was near 0.99, and after looking at this value of information gain, we can
say that the entropy of the dataset will decrease by 0.37 if we make "Energy" our root node.
Example 2: Now take the feature "Motivation" and calculate its information gain in the same
way, using the values of E(Parent) and E(Parent|Motivation).
We now see that the "Energy" feature gives a larger reduction (0.37) than the "Motivation"
feature. Hence we select the feature with the highest information gain and then split the node
based on that feature.
Gini Impurity
Gini impurity is a measure of how often a randomly chosen element from the set would be
incorrectly labeled if it were randomly labeled according to the distribution of labels in the
subset. It is computed as Gini = 1 - Σ (p_i)², where p_i is the probability of class i in the node.
Gini impurity is lower bounded by 0, with 0 occurring if the data set contains only one class.
There are several algorithms for building a decision tree:
1. CART (Classification and Regression Trees) — this uses Gini impurity as the metric.
2. ID3 (Iterative Dichotomiser 3) — this uses entropy and information gain as the metric.
Here we will go through ID3; once it is understood, it is easy to apply the same approach with
CART.
ID3 (Iterative Dichotomiser 3) worked example:
Consider a dataset based on which we will determine whether to play football or not. There are
four independent variables to determine the dependent variable: Outlook, Temperature,
Humidity, and Wind. The dependent variable is whether to play football or not.
As the first step, we have to find the root node of our decision tree. For that, find the entropy
of the class variable:
E(S) = -[(9/14) log2(9/14) + (5/14) log2(5/14)] = 0.94
Note: here the log is typically taken to base 2. In total there are 14 yes/no examples, of which 9
are YES and 5 are NO; the probabilities above are based on these counts.
From the above data, we can easily derive the following table for Outlook.
Now we have to calculate the average weighted entropy, i.e., the sum over the values of the
feature of each weight multiplied by the corresponding entropy:
E(S, Outlook) = (5/14)*E(3,2) + (4/14)*E(4,0) + (5/14)*E(2,3)
= (5/14)(-(3/5)log2(3/5) - (2/5)log2(2/5)) + (4/14)(0) + (5/14)(-(2/5)log2(2/5) - (3/5)log2(3/5))
= 0.693
The next step is to find the information gain. It is the difference between parent entropy and
average weighted entropy we found above.
IG(S, outlook) = 0.94 - 0.693 = 0.247
Similarly find Information gain for Temperature, Humidity, and Windy.
IG(S, Temperature) = 0.940 - 0.911 = 0.029
IG(S, Humidity) = 0.940 - 0.788 = 0.152
IG(S, Windy) = 0.940 - 0.8932 = 0.048
Now select the feature having the largest information gain. Here it is Outlook, so it forms the
first node (root node) of our decision tree.
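The numbers above can be checked with a few lines of Python; the entropy helper is the same idea as before, and the weighted entropy and information gain for Outlook come out to roughly 0.693 and 0.247:

import math

def entropy(n_yes, n_no):
    # Entropy of a node given its class counts
    result = 0.0
    for count in (n_yes, n_no):
        if count:
            p = count / (n_yes + n_no)
            result -= p * math.log2(p)
    return result

e_parent = entropy(9, 5)                                    # ~ 0.940
e_outlook = (5/14) * entropy(3, 2) \
          + (4/14) * entropy(4, 0) \
          + (5/14) * entropy(2, 3)                          # ~ 0.693
print("IG(S, Outlook) =", round(e_parent - e_outlook, 3))   # ~ 0.247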
Now our data look as follows :
Since overcast contains only examples of class ‘Yes’ we can set it as yes. That means If outlook is
overcast football will be played. Now our decision tree looks as follows.
The next step is to find the next node in our decision tree. Now we will find one under sunny.
We have to determine which of the following Temperature, Humidity or Wind has higher
information gain.
This pair (f, t) is not necessarily the best one: the above-mentioned process continues over all
available attributes, always searching for a lower Gini score; whenever a lower score is found,
the threshold value and its attribute are kept, and the node is finally split on the best attribute
and threshold value. According to our dataset, the best Gini score is 0.40 for the "Wind"
attribute (f) with 3.55 as the best threshold value (t). The tree below, generated by
DecisionTreeClassifier in scikit-learn, shows the node split made on the same threshold value
and attribute:
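As a sketch of how such a tree can be produced with scikit-learn's DecisionTreeClassifier (the play-football table is not reproduced in code form here, so the built-in Iris dataset is used as a stand-in; the criterion, depth, and random_state are illustrative choices):

# Fitting a decision tree and drawing its splits with scikit-learn
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.show()

Each internal node in the drawn tree shows the attribute and threshold chosen for the split together with its Gini score.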
9) What is the Naïve Bayes Algorithm in Machine Learning? Give an example. Where is the
Naïve Bayes algorithm used?
It is a classification technique based on Bayes' Theorem with an assumption of independence
among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a
particular feature in a class is unrelated to the presence of any other feature.
     Outlook    Play
0    Rainy      Yes
1    Sunny      Yes
2    Overcast   Yes
3    Overcast   Yes
4    Sunny      No
5    Rainy      Yes
6    Sunny      Yes
7    Overcast   Yes
8    Rainy      No
9    Sunny      No
10   Sunny      Yes
11   Rainy      No
12   Overcast   Yes
13   Overcast   Yes
Frequency table for the weather conditions:
Weather    Yes   No
Overcast   5     0
Rainy      2     2
Sunny      3     2
Total      10    4
Likelihood table for the weather conditions:
Weather    No            Yes
Overcast   0             5             5/14 = 0.35
Rainy      2             2             4/14 = 0.29
Sunny      2             3             5/14 = 0.35
All        4/14 = 0.29   10/14 = 0.71
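Using the tables above, Bayes' theorem can be applied directly; the short sketch below checks whether the game is played when the outlook is Sunny (all numbers are taken from the frequency and likelihood tables):

# Applying Bayes' theorem to the weather example
p_yes, p_no = 10/14, 4/14              # prior probabilities from the data
p_sunny = 5/14                         # evidence
p_sunny_given_yes = 3/10               # likelihoods from the frequency table
p_sunny_given_no = 2/4

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny   # ~ 0.60
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny      # ~ 0.40
print(p_yes_given_sunny, p_no_given_sunny)

Since P(Yes | Sunny) is larger than P(No | Sunny), the Naïve Bayes classifier predicts that the game is played on a Sunny day.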
10) What is Linear Regression with an example? Write Linear regression Algorithm and its
uses in Machine Learning?
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent variable (y)
and one or more independent variables (x), hence it is called linear regression. Since linear
regression shows a linear relationship, it finds how the value of the dependent variable changes
according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable increases
on the X-axis, then such a relationship is called a positive linear relationship.
o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis while the independent variable increases
on the X-axis, then such a relationship is called a negative linear relationship.
Finding the best fit line:
When working with linear regression, our main goal is to find the best-fit line, which means the
error between the predicted values and the actual values should be minimized. The best-fit line
will have the least error.
Different values for the weights or coefficients of the line (a0, a1) give different regression
lines, so we need to calculate the best values of a0 and a1 to find the best-fit line; to do this
we use a cost function.
Cost function-
o Different values for the weights or coefficients of the line (a0, a1) give different regression
lines, and the cost function is used to estimate the values of the coefficients for the
best-fit line.
o The cost function optimizes the regression coefficients or weights. It measures how well a
linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps
the input variable to the output variable. This mapping function is also known
as the Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the
average of the squared errors between the predicted values and the actual values. For the
above linear equation, MSE can be calculated as:
MSE = (1/N) * Σ (Yi - (a1 xi + a0))²
Where,
N = total number of observations
Yi = actual value
(a1 xi + a0) = predicted value
Residuals: The distance between an actual value and the corresponding predicted value is
called a residual. If the observed points are far from the regression line, the residuals will be
high and so the cost function will be high. If the scatter points are close to the regression line,
the residuals will be small and hence the cost function will be small.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o It is done by randomly selecting initial values of the coefficients and then iteratively
updating them to reach the minimum of the cost function.
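A bare-bones sketch of gradient descent for simple linear regression, using made-up data (y roughly equal to 2x + 1); a0 is the intercept and a1 the slope as in the MSE above, and the learning rate and iteration count are arbitrary illustrative values:

# Minimizing the MSE by following the gradient of the cost function
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])   # assumed data, roughly y = 2x + 1

a0, a1 = 0.0, 0.0                          # initial coefficient values
lr = 0.01                                  # learning rate
n = len(x)

for _ in range(5000):
    y_pred = a1 * x + a0
    error = y_pred - y
    grad_a0 = (2 / n) * error.sum()        # gradient of MSE w.r.t. the intercept
    grad_a1 = (2 / n) * (error * x).sum()  # gradient of MSE w.r.t. the slope
    a0 -= lr * grad_a0
    a1 -= lr * grad_a1

print(a0, a1)                              # close to the true intercept and slope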
Model Performance:
The Goodness of fit determines how the line of regression fits the set of observations. The
process of finding the best model out of various models is called optimization. It can be
achieved by below method:
1. R-squared method:
o R-squared is a statistical method that determines the goodness of fit.
o It measures the strength of the relationship between the dependent and independent
variables on a scale of 0-100%.
o A high value of R-squared indicates a small difference between the predicted values and
the actual values and hence represents a good model.
o It is also called a coefficient of determination, or coefficient of multiple
determination for multiple regression.
o It can be calculated with the below formula:
R-squared = Explained variation / Total variation
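As a quick illustration, scikit-learn's LinearRegression exposes the coefficient of determination directly through its score() method (the data below is made up):

# Fitting a line and reporting R-squared
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

model = LinearRegression().fit(x, y)
print("R-squared:", model.score(x, y))   # close to 1 means a good fit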
Worked example: applying the K-NN algorithm step by step
o Firstly, we will choose the number of neighbours, so we will choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in
geometry. It can be calculated as:
d = √((x2 - x1)² + (y2 - y1)²)
o By calculating the Euclidean distance we got the nearest neighbors, as three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:
o As we can see the 3 nearest neighbors are from category A, hence this new data point
must belong to category A.
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some
values to find the best among them. The most preferred value for K is 5.
o A very low value for K, such as K = 1 or K = 2, can be noisy and lead to the effects of
outliers in the model.
o Large values for K smooth out noise, but a value that is too large can lead to underfitting.
Advantages of KNN Algorithm:
o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
o It always needs the value of K to be determined, which may be complex at times.
o The computation cost is high because of calculating the distance between the data
points for all the training samples.
Problem for the K-NN Algorithm: There is a car manufacturer company that has manufactured
a new SUV car. The company wants to show ads to the users who are interested in buying that
SUV. For this problem, we have a dataset that contains multiple users' information from a
social network. The dataset contains lots of information, but we will consider Estimated
Salary and Age as the independent variables and Purchased as the dependent variable. Below is
the dataset:
In the above image, we can see there are 64+29= 93 correct predictions and 3+4= 7 incorrect
predictions, whereas, in Logistic Regression, there were 11 incorrect predictions. So we can say
that the performance of the model is improved by using the K-NN algorithm.
o Visualizing the Training set result:
Now we will visualize the training set result for the K-NN model. The code will remain the
same as in Logistic Regression, except for the title of the graph. Below is the code for it:
#Visualizing the training set result
import numpy as nm
import matplotlib.pyplot as mtp
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
# build a fine grid covering the range of both features (Age, Estimated Salary)
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
# colour each grid point with the class predicted by the trained K-NN classifier
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
# plot the actual training points, coloured by their true class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
By executing the above code, we will get the below graph:
The output graph is different from the graph we obtained in Logistic Regression. It can be
understood from the below points:
o As we can see the graph is showing the red point and green points. The green
points are for Purchased(1) and Red Points for not Purchased(0) variable.
o The graph is showing an irregular boundary instead of showing any straight line
or any curve because it is a K-NN algorithm, i.e., finding the nearest neighbor.
o The graph has classified users in the correct categories as most of the users who
didn't buy the SUV are in the red region and users who bought the SUV are in the
green region.
o The graph shows a good result, but still there are some green points in the red
region and red points in the green region. This is not a big issue, as it prevents the
model from overfitting.
o Hence our model is well trained.
o Visualizing the Test set result:
After the training of the model, we will now test the result by putting a new dataset, i.e.,
Test dataset. Code remains the same except some minor changes: such as x_train and
y_train will be replaced by x_test and y_test.
Below is the code for it:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
The above graph shows the output for the test dataset. As we can see in the graph, the
predicted output is quite good, as most of the red points are in the red region and most of the
green points are in the green region.
However, there are few green points in the red region and a few red points in the green region.
So these are the incorrect observations that we have observed in the confusion matrix(7
Incorrect output).
12) What is Logistic Regression with an example? Write Logistic regression Algorithm and its
uses in Machine Learning?
o Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore
the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1,
true or False, etc. but instead of giving the exact value as 0 and 1, it gives the
probabilistic values which lie between 0 and 1.
o Logistic Regression is quite similar to Linear Regression except in how it is used. Linear
Regression is used for solving regression problems, whereas Logistic Regression is used
for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different types of data
and can easily determine the most effective variables used for the classification. The
below image is showing the logistic function:
Note: Logistic regression uses the concept of predictive modelling like regression, which is why
it is called logistic regression; however, it is used to classify samples, and therefore it falls
under classification algorithms.
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within the range of 0 and 1: f(x) = 1 / (1 + e^(-x)).
o The value of the logistic regression must be between 0 and 1, which cannot go beyond
this limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid
function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the
probability of either 0 or 1. Values above the threshold tend to 1, and values below the
threshold tend to 0.
o We know the equation of a straight line can be written as y = b0 + b1x1 + b2x2 + ... + bnxn.
In Logistic Regression y can only be between 0 and 1, so let's divide the above equation
by (1 - y):
y / (1 - y); 0 for y = 0, and infinity for y = 1
o But we need a range between -[infinity] and +[infinity], so we take the logarithm of the
equation, and it becomes:
log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn
1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we can
use it in our code efficiently. It will be the same as we have done in Data pre-processing topic.
The code for this is given below:
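The pre-processing code itself is not reproduced here; below is a minimal sketch, where the file name user_data.csv is an assumed placeholder and the nm/mtp aliases are assumptions based on the visualization code used later:

# Importing the libraries (aliases match the later visualization code)
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset (the file name is an assumed placeholder)
data_set = pd.read_csv('user_data.csv')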
By executing the above lines of code, we will get the dataset as the output. Consider the given
image:
Now, we will extract the dependent and independent variables from the given dataset. Below is
the code for it:
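A sketch of this extraction step; the column positions (Age and Estimated Salary as the features, Purchased as the target) are assumptions consistent with the axes used in the later plots:

x = data_set.iloc[:, [2, 3]].values   # independent variables: Age, Estimated Salary
y = data_set.iloc[:, 4].values        # dependent variable: Purchased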
Now we will split the dataset into a training set and test set. Below is the code for it:
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
The output for this is given below:
For the test set:
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)   # fit the scaler on the training data and scale it
x_test = st_x.transform(x_test)         # scale the test data with the same parameters
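A minimal sketch of the classifier-fitting step that produces these predictions, assuming the scaled x_train and x_test from the previous step (the solver settings are defaults, chosen for illustration):

# Fitting Logistic Regression to the training set and predicting the test set
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

y_pred = classifier.predict(x_test)   # predicted purchase decision for each test user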
The above output image shows the corresponding predicted users who want to purchase or not
purchase the car.
4. Test Accuracy of the result
Now we will create the confusion matrix here to check the accuracy of the classification. To
create it, we need to import the confusion_matrix function of the sklearn library. After
importing the function, we will call it using a new variable cm. The function takes two
parameters, mainly y_true( the actual values) and y_pred (the targeted value return by the
classifier). Below is the code for it:
1. #Creating the Confusion matrix
2. from sklearn.metrics import confusion_matrix
3. cm= confusion_matrix()
Output:
By executing the above code, a new confusion matrix will be created. Consider the below
image:
We can find the accuracy of the predicted result by interpreting the confusion matrix. By above
output, we can interpret that 65+24= 89 (Correct Output) and 8+3= 11(Incorrect Output).
5. Visualizing the training set result
Finally, we will visualize the training set result. To visualize the result, we will
use ListedColormap class of matplotlib library. Below is the code for it:
#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
In the above code, we have imported the ListedColormap class of the Matplotlib library to
create the colormap for visualizing the result. We have created two new variables x_set and
y_set to replace x_train and y_train. After that, we have used the nm.meshgrid command to
create a rectangular grid that ranges from the minimum value minus 1 to the maximum value
plus 1 of each feature. The pixel points we have taken have a resolution of 0.01.
To create a filled contour, we have used mtp.contourf command, it will create regions of
provided colors (purple and green). In this function, we have passed the classifier.predict to
show the predicted data points predicted by the classifier.
Output: By executing the above code, we will get the below output:
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put the new data point in the correct
category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called as support vectors, and hence algorithm is termed as Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified using
a decision boundary or hyperplane:
Example: SVM can be understood with the example that we have used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs, so if we want a model that
can accurately identify whether it is a cat or dog, so such a model can be created by using the
SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can
learn about different features of cats and dogs, and then we test it with this strange creature.
So as support vector creates a decision boundary between these two data (cat and dog) and
choose extreme cases (support vectors), it will see the extreme case of cat and dog. On the
basis of the support vectors, it will classify it as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset
can be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means
that if a dataset cannot be classified by using a straight line, then such data is termed
non-linear data, and the classifier used is called a Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-
dimensional space, but we need to find out the best decision boundary that helps to classify the
data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset:
if there are 2 features (as shown in the image), then the hyperplane will be a straight line, and
if there are 3 features, then the hyperplane will be a 2-dimensional plane.
We always create a hyperplane that has a maximum margin, which means the maximum
distance between the data points.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the position
of the hyperplane are termed as Support Vector. Since these vectors support the hyperplane,
hence called a Support vector.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a
dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We
want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue.
Consider the below image:
So as it is 2-d space so by just using a straight line, we can easily separate these two classes. But
there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary
or region is called as a hyperplane. SVM algorithm finds the closest point of the lines from both
the classes. These points are called support vectors. The distance between the vectors and the
hyperplane is called as margin. And the goal of SVM is to maximize this margin.
The hyperplane with maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have
used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be
calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-d Space, hence it is looking like a plane parallel to the x-axis. If we convert it in
2d space with z=1, then it will become as:
As we can see, the above output is appearing similar to the Logistic regression output. In the
output, we got the straight line as hyperplane because we have used a linear kernel in the
classifier. And we have also discussed above that for the 2d space, the hyperplane in SVM is a
straight line.
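A minimal sketch of such a linear-kernel classifier, assuming the same scaled x_train, y_train, and x_test as in the earlier examples:

# Fitting an SVM classifier with a linear kernel
from sklearn.svm import SVC

classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)

y_pred = classifier.predict(x_test)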
o Visualizing the test set result:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
By executing the above code, we will get the output as:
As we can see in the above output image, the SVM classifier has divided the users into two
regions (Purchased or Not purchased). Users who purchased the SUV are in the red region with
the red scatter points. And users who did not purchase the SUV are in the green region with
green scatter points. The hyperplane has divided the two classes of the Purchased variable
(purchased and not purchased).
14) What is Binary Classification? Explain different types of Binary Classification with
examples?
In machine learning, binary classification is a supervised learning algorithm that categorizes
new observations into one of two classes.
The following are a few typical binary classification applications, where each observation is
assigned to one of two possible classes, for example email spam detection (spam or not spam)
and identification of cancerous tumour cells (cancerous or not).
More generally, classification tasks fall into the following types:
Binary Classification
Multi-Class Classification
Multi-Label Classification
Imbalanced Classification
Multi Class Classification:
Classification means categorizing data and forming groups based on the similarities. In a
dataset, the independent variables or features play a vital role in classifying our data. When
we talk about multiclass classification, we have more than two classes in our dependent or
target variable, as can be seen in Fig.1:
The above picture is taken from the Iris dataset which depicts that the target variable has
three categories i.e., Virginica, setosa, and Versicolor, which are three species of Iris plant.
We might use this dataset later, as an example of a conceptual understanding of multiclass
classification.
Which classifiers do we use in multiclass classification? When do we use them?
We use many algorithms such as Naïve Bayes, Decision trees, SVM, Random forest classifier,
KNN, and logistic regression for classification. But we might learn about only a few of them
here because our motive is to understand multiclass classification. So, using a few
algorithms we will try to cover almost all the relevant concepts related to multiclass
classification.
Naive Bayes
Naive Bayes is a parametric algorithm which means it requires a fixed set of parameters or
assumptions to simplify the machine’s learning process. In parametric algorithms, the
number of parameters used is independent of the size of training data.
Naïve Bayes Assumption:
It assumes that the features of a dataset are completely independent of each other.
This is generally not true, which is why we also call it a 'naïve' algorithm.
It is a classification model based on conditional probability and uses Bayes theorem to
predict the class of unknown datasets. This model is mostly used for large datasets as it is
easy to build and is fast for both training and making predictions. Moreover, without
hyperparameter tuning, it can give you better results as compared to other algorithms.
Naïve Bayes can also be an extremely good text classifier as it performs well, such as in the
spam ham dataset.
Bayes' theorem is stated as:
P(A|B) = P(B|A) P(A) / P(B)
By P(A|B), we are trying to find the probability of event A given that event B is true. It is
also known as the posterior probability.
Event B is known as the evidence.
P(A) is called the prior of A, which means it is the probability of the event before the
evidence is seen.
P(B|A) is known as the conditional probability or likelihood.
Note: Naïve Bayes is a linear classifier, which might not be suitable for classes that are not
linearly separable in a dataset. Let us look at the figure below:
As can be seen in Fig.2b, Classifiers such as KNN can be used for non-linear classification
instead of Naïve Bayes classifier.
KNN (K-nearest neighbours)
KNN is a supervised machine learning algorithm that can be used to solve both classification
and regression problems. It is one of the simplest algorithms yet powerful one. It does not
learn a discriminative function from the training data but memorizes the training data
instead. Due to the very same reason, it is also known as a lazy algorithm.
How does it work?
The K-nearest neighbour algorithm forms a majority vote between the K most similar
instances, using a distance metric between two data points to define similarity. The most
popular choice is the Euclidean distance, which is written as:
d(x, y) = √( Σ (xi - yi)² )
K in KNN is the hyperparameter that we choose to get the best possible fit for the dataset. If
we keep the smallest value for K, i.e. K = 1, then the model will show low bias but high
variance, because the model will be overfitted in this case. A larger value for K, let's say
K = 10, will smoothen our decision boundary, which means low variance but high bias. So we
always go for a trade-off between bias and variance, known as the bias-variance trade-off.
Let us understand more about it by looking at its advantages and disadvantages:
Advantages-
KNN makes no assumptions about the distribution of classes i.e. it is a non-
parametric classifier
It is one of the methods that can be widely used in multiclass classification
It does not get impacted by the outliers
This classifier is easy to use and implement
Disadvantages-
K value is difficult to find as it must work well with test data also, not only with
the training data
It is a lazy algorithm as it does not make any models
It is computationally extensive because it measures distance with each data point
Decision Trees
As the name suggests, the decision tree is a tree-like structure of decisions made based on
some conditional statements. This is one of the most used supervised learning methods in
classification problems because of their high accuracy, stability, and easy interpretation.
They can map linear as well as non-linear relationships in a good way.
Let us look at the figure below, Fig.3, where we have used adult census income dataset with
two independent variables and one dependent variable. Our target or dependent variable is
income, which has binary classes i.e, <=50K or >50K.
Fig 3: Decision Tree- Binary Classifier
We can see that the algorithm works based on some conditions, such as Age <50 and
Hours>=40, to further split into two buckets for reaching towards homogeneity. Similarly,
we can move ahead for multiclass classification problem datasets, such as Iris data.
Now a question arises in our mind. How should we decide which column to take first and
what is the threshold for splitting? For splitting a node and deciding threshold for splitting,
we use entropy or Gini index as measures of impurity of a node. We aim to maximize the
purity or homogeneity on each split, as we saw in Fig.2.
Confusion Matrix in Multi-class Classification
A confusion matrix is a table used in every classification problem to describe the performance
of a model on test data.
As with the confusion matrix in binary classification, in multiclass classification we can also
compute precision and recall for each class.
Let’s take an example to have a better idea about confusion matrix in multiclass
classification using Iris dataset which we have already seen above in this article.
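A short sketch of producing such a multiclass confusion matrix on the Iris data with scikit-learn (the choice of K-NN as the classifier and the 30% test split are illustrative):

# Multiclass confusion matrix on the Iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(x_train, y_train)
y_pred = clf.predict(x_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))   # per-class precision and recall

The result is a 3x3 matrix with one row and one column per species, from which per-class precision and recall can be read off.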
15) Write the differences between Binary and Multi class Classification?
Number of classes: binary classification has exactly two possible outcomes, whereas
multi-class classification has more than two classes.
Algorithms used: the most popular algorithms used for binary classification are
Logistic Regression
k-Nearest Neighbors
Decision Trees
Support Vector Machine
Naive Bayes
Popular algorithms that can be used for multi-class classification include:
k-Nearest Neighbors
Decision Trees
Naive Bayes
Random Forest
Gradient Boosting
17) What is RANKING IN Machine Learning? How it works? Why should we care and its uses?
Ranking is a type of machine learning that sorts data in a relevant order. Companies use ranking
to optimize search and recommendations.
Outline
What is a ranking model?
How does ranking work?
Why should I care?
Use cases
The fastest way to build a ranking model
What is a ranking model?
Ranking is a type of supervised machine learning (ML) that uses labeled datasets to train
models that predict outcomes for future data. Quite simply, the goal of a ranking model is to
sort data in an optimal and relevant order.
Ranking was first largely deployed within search engines. People search for a topic, while the
ranking algorithm reorders search results based on the PageRank, and the search engine is able
to display the most relevant results to its customers.
Until recently, most ranking models, and ML as a whole, were limited in their scope of use, as
most companies didn’t have enough data to power these algorithms. Better methods for data
collection and more intuitive ML tools have made it possible for nearly anyone to deploy a
successful ranking model within their business.
How does ranking work?
As we’ll discuss later in this blog, ranking is incredibly versatile and dependent on the data a
company has. Even so, a common framework guides the construction of all ranking models.
Ranking models are made up of 2 main factors: queries and documents. Queries are any input
value, such as a question on Google or an interaction on an e-commerce site. Documents are
the output value or results of the query. Given the query, and the associated documents, a
function, given a list of parameters to rank on, will score the documents to be sorted in order of
relevancy.
The learning-to-rank algorithm takes the scores from this model and uses them to predict the
ordering of a new, unseen list of documents.
As an example, a search for “Mage” is done on Google Search (“Mage” is the query). After the
search, a list of associated documents matching the query will be displayed (Mage A.I., Mage
definition, Mage World of Warcraft, etc.). The function will score each of the documents based
on their relevance to the query (Mage A.I. = 1, Mage definition = 2, Mage World of Warcraft =3,
and so on). The documents with higher scores will be ranked higher when there is a search for
Mage.
Data required for a ranking model consists of documents from a query, user profiles, user
behaviors, search history, clicks, etc.
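A toy pointwise-ranking sketch of this idea: each (query, document) pair is described by a few made-up numeric features, a model learns to predict a relevance score, and the documents are then sorted by that score (every feature value and label here is purely illustrative):

# Score documents for a query with a learned model, then sort by predicted relevance
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical (query, document) feature rows and graded relevance labels
X_train = [[0.9, 120, 3], [0.4, 15, 1], [0.7, 300, 2], [0.1, 5, 0]]
y_train = [3, 1, 2, 0]                    # higher = more relevant

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

docs = ["Mage A.I.", "Mage definition", "Mage World of Warcraft"]
X_new = [[0.95, 150, 3], [0.6, 80, 2], [0.3, 400, 1]]
scores = model.predict(X_new)

ranking = sorted(zip(docs, scores), key=lambda d: d[1], reverse=True)
print(ranking)                            # documents ordered by predicted relevance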
Why should I care?
Ranking ensures that the most relevant results appear first on a customer’s search, maximizing
the chances they will find something of interest, and minimizing the chances of churn. With so
many options for organic web search, the need to stay competitive has never been greater.
According to a Google study, 61% of users said if they didn’t find what they were looking for
right away, they would quickly move on to another site. Depending on available data,
companies can use ranking within their web pages and apps to serve their customers the most
relevant results as soon as they enter.
Use cases:
The most successful companies are using ranking within their software to improve the user
experience. Ranking has allowed these companies to create customized feeds for each user
based on their past search and buying history. Ranking carries many use cases across industries,
nearly anyone with data can and should be using ranking in some capacity to optimize their
business. A few use cases are:
1. Search results
2. Targeted ads
3. Recommendations
Here are a few companies who have used ranking to maximize user engagement.
Amazon
With millions of listings or documents, for every product search or query, Amazon
needed to find a way to rank its products in order to maximize the chance of purchase.
Using a combination of individual preferences, gathered from users' search and
purchasing history and a product’s popularity, Amazon created a ranking system that
would display the most relevant products at the top of their feed. Additionally, ranking
was used in Amazon’s recommendation system, which would use users' ranked
preferences in order to predict what products a user is most likely to purchase in the
future.
Netflix
Similar to Amazon, Netflix uses ranking to fuel their recommendation system. The
recommendation system predicts what content a user is most likely to watch and
displays the most relevant content at the top of the home page. Netflix uses a few
different features to rank and recommend content; such as: watch history, search
history, and general popularity. They also use ranking to fuel their collaborative filtering.
TikTok
TikTok’s standout feature is the For You page which is built on a ranking system. This
feature has allowed TikTok to customize each home page to be reflective of the
preferences and interests of its user. TikTok uses similar metrics to Netflix to rank its
content: watch history, re-watch rate, and engagement. Similar to Netflix, TikTok’s
ranking system also aids in collaborative filtering.
Starbucks
Starbucks found great success with their mobile app, which is one of the most downloaded
apps on the App Store. The app allows Starbucks to create a custom user experience for their
customers even when they’re not within a physical coffee shop. The app uses ranking to
recommend the most relevant products to users. Taking into account order history, new
products and general popularity of other products, Starbucks is able to keep customers' favorite
orders at the top of the recommended search while introducing them to new products that
they are most likely to enjoy.
The fastest way to build a ranking model
For the companies listed above, entire teams of data scientists and AI engineers were built to
create and maintain the ranking systems in place. The cost to build these teams is impractical
for most businesses. Recently, great tools have emerged that allow for the easy building and
deployment of ranking models, with little to no programming experience required.
Mage allows for the building and deployment of a ranking model with no ML programming
knowledge. To use Mage, a database containing a list of queries and documents is first
uploaded. Queries could contain a list of clothes or menu items; their documents could be the
amount of engagement (clicks and purchases) each received. The greater the quality and
quantity of the data uploaded, the better Mage is able to produce ranking predictions.
Once the data is uploaded, users will be given the option to transform their datasets by
removing and adding columns, applying transformer actions: split and filter data, group values,
aggregate data, and identifying what columns they would like to rank. Mage will then produce a
ranking model which can be deployed into your data warehouses, downloaded to a CSV file, or
saved directly to a Mage dataset.