JNTUK R20 ML Unit-II

UNIT-II

Supervised Learning(Regression/Classification)

Basic Methods: Distance based Methods, Nearest Neighbours, Decision Trees, Naive Bayes,

Linear Models: Linear Regression, Logistic Regression, Generalized Linear Models, Support Vector Machines

Binary Classification: Multiclass/Structured outputs, MNIST, Ranking.

……………………………………………………………………………………………………………………………..

1.Distance based Methods


Distance-based algorithms are machine learning algorithms that classify queries by computing distances
between these queries and a number of internally stored exemplars. The exemplars that are closest to the
query have the largest influence on the classification assigned to the query.

Euclidean Distance

Euclidean distance is the most commonly used distance measure; when people speak simply of "distance", they
usually mean the Euclidean distance. When the data is dense or continuous, it is the best proximity measure.
The Euclidean distance between two points is the length of the straight-line path connecting them, and it
follows from the Pythagorean theorem:

Euclidean distance = √((x2 − x1)² + (y2 − y1)²)

Manhattan Distance

 If you want to find the Manhattan distance between two points (x1, y1) and (x2, y2), it is computed as:
 Manhattan distance = |x2 − x1| + |y2 − y1|
 Diagrammatically, it corresponds to traversing the path from point A to point B along axis-parallel
(grid-like) segments rather than along the straight line joining the two points.
Minkowski Distance

The Minkowski distance is the generalized form of the Euclidean and Manhattan distances. You can
express the Minkowski distance as:

Minkowski distance = (Σ |xi − yi|^p)^(1/p)

The order of the norm is represented by p. When the order p is 1, the formula gives the Manhattan
distance, and when p is 2, it gives the Euclidean distance.
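The following short sketch (an illustration of my own, using made-up points) shows how these three distances can be computed with NumPy; the point coordinates and the order p are arbitrary choices.

import numpy as np

a = np.array([2.0, 3.0])   # point (x1, y1)
b = np.array([5.0, 7.0])   # point (x2, y2)

# Euclidean distance: straight-line length between the two points
euclidean = np.sqrt(np.sum((a - b) ** 2))      # 5.0

# Manhattan distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(a - b))              # 7.0

# Minkowski distance of order p (p=1 -> Manhattan, p=2 -> Euclidean)
def minkowski(u, v, p):
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

print(euclidean, manhattan, minkowski(a, b, 3))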

2.Nearest Neighbours

The abbreviation KNN stands for “K-Nearest Neighbour”. It is a supervised machine learning algorithm. The
algorithm can be used to solve both classification and regression problem statements.

The number of nearest neighbours to a new unknown variable that has to be predicted or classified is
denoted by the symbol ‘K’.

KNN calculates the distance from all points in the proximity of the unknown data and filters out the ones
with the shortest distances to it. As a result, it’s often referred to as a distance-based algorithm.

In order to correctly classify the results, we must first determine the value of K (Number of Nearest
Neighbours).

It is recommended to always select an odd value of K.

When the value of K is set to even, a situation may arise in which the elements from both groups are
equal. In the diagram below, elements from both groups are equal in the internal “Red” circle (k == 4).

In this condition, the model would be unable to do the correct classification for you. Here the model will
randomly assign any of the two classes to this new unknown data.

Choosing an odd value for K is preferred because such a tie between the two classes can then never occur:
with an odd K, one of the two groups is always in the majority.
The impact of selecting a smaller or larger K value on the model

 Larger K value: The case of underfitting occurs when the value of k is increased. In this case,
the model would be unable to correctly learn on the training data.

 Smaller k value: The condition of overfitting occurs when the value of k is smaller. The model
will capture all of the training data, including noise. The model will perform poorly for the test
data in this scenario.

How does KNN work for ‘Classification’ and ‘Regression’ problem statements?

 Classification

When the problem statement is of ‘classification’ type, KNN tends to use the concept of “Majority Voting”.
Within the given range of K values, the class with the most votes is chosen.

Consider the following diagram, in which a circle is drawn within the radius of the five closest neighbours.
Four of the five neighbours in this neighbourhood voted for ‘RED,’ while one voted for ‘WHITE.’ It will be
classified as a ‘RED’ wine based on the majority votes.
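As an illustrative sketch of this majority-voting behaviour (the feature values and labels below are made up for the example, not taken from a real wine dataset), scikit-learn's KNeighborsClassifier can be used with k = 5:

from sklearn.neighbors import KNeighborsClassifier

# Made-up data: two features per sample, labels 'RED' / 'WHITE'
X_train = [[7.4, 0.70], [7.8, 0.88], [7.6, 0.65], [6.9, 0.58],
           [6.0, 0.30], [5.9, 0.26], [6.2, 0.32], [6.3, 0.28]]
y_train = ['RED', 'RED', 'RED', 'RED', 'WHITE', 'WHITE', 'WHITE', 'WHITE']

# k = 5: the class with the majority of votes among the 5 nearest neighbours wins
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print(knn.predict([[7.2, 0.62]]))   # most of its 5 neighbours are 'RED'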

Real-world example:

Several parties compete in an election in a democratic country like India. Parties compete for voter support
during election campaigns. The public votes for the candidate with whom they feel more connected.

When the votes for all of the candidates have been recorded, the candidate with the most votes is declared
as the election’s winner.
 Regression

KNN employs a mean/average method for predicting the value of new data. Based on the value of K, it
would consider all of the nearest neighbours.

Once the algorithm has identified the K nearest neighbours, it calculates the mean of their values and uses
that mean as the prediction.

Consider the diagram below, where the value of k is set to 3. It will now calculate the mean (52) based on
the values of these neighbours (50, 55, and 51) and allocate this value to the unknown data.
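A minimal sketch of the same idea with scikit-learn's KNeighborsRegressor (the feature values below are hypothetical; only the targets 50, 55 and 51 from the example above are reused):

from sklearn.neighbors import KNeighborsRegressor

# Hypothetical one-feature training data; three of the targets are 50, 55 and 51
X_train = [[1.0], [1.2], [1.1], [3.0], [3.2]]
y_train = [50, 55, 51, 80, 85]

knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X_train, y_train)

# The 3 nearest neighbours of 1.05 have targets 50, 51 and 55 -> mean = 52
print(knn.predict([[1.05]]))   # [52.]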

Impact of Imbalanced dataset and Outliers on KNN

 Imbalanced dataset

When dealing with an imbalanced data set, the model will become biased. Consider the example shown in
the diagram below, where the “Yes” class is more prominent.

As a consequence, the bulk of the closest neighbours to this new point will be from the dominant class.
Because of this, we must balance our data set using either an upsampling (oversampling) or a downsampling
(undersampling) strategy.
 Outliers

Outliers are the points that differ significantly from the rest of the data points.

The outliers will impact the classification/prediction of the model. The appropriate class for the new data
point, according to the following diagram, should be “Category B” in green.

The model, however, would be unable to have the appropriate classification due to the existence of
outliers. As a result, removing outliers before using KNN is recommended.

Importance of scaling down the numeric variables to the same level

Data has 2 parts: –

1) Magnitude

2) Unit

For instance; if we say 20 years then “20” is the magnitude here and “years” is its unit.

Since KNN is a distance-dependent algorithm, it selects the neighbours in the closest vicinity based solely
on the magnitude of the data. Have a look at the diagram below; because the data is not scaled, the closest
neighbours cannot be found correctly. As a consequence, the outcome will be influenced.

The data values in the previous figure have now been scaled down to the same level in the following
example. Based on the scaled distance, all of the closest neighbours would be accurately identified.
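One common way to bring features to the same level (a sketch; StandardScaler is only one of several options, MinMaxScaler being another) is:

from sklearn.preprocessing import StandardScaler
import numpy as np

# Hypothetical unscaled data: age in years and income in rupees.
# The income column dominates any distance computation because of its much larger magnitude.
X = np.array([[25, 300000],
              [32, 450000],
              [47, 900000]])

scaler = StandardScaler()            # rescales each column to mean 0 and standard deviation 1
X_scaled = scaler.fit_transform(X)
print(X_scaled)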
3.Decision Trees

Like SVMs, Decision Trees are versatile Machine Learning algorithms that can perform both classification and
regression tasks, and even multioutput tasks. They are very powerful algorithms, capable of fitting complex
datasets.

Decision Trees are also the fundamental components of Random Forests, which are among the most
powerful Machine Learning algorithms available today.
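A minimal sketch of fitting a Decision Tree for classification with scikit-learn (using the built-in iris dataset purely for illustration; the max_depth value is an arbitrary choice):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

print(tree.predict(iris.data[:1]))          # predicted class of the first sample
print(tree.score(iris.data, iris.target))   # accuracy on the training data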
Chapter-II

Linear Models: Linear Regression, Logistic Regression, Generalized Linear Models, Support Vector Machines

……………………………………………………………………………………………………………………………..

1.Linear Regression
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical method
that is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables
such as sales, salary, age, product price, etc.

The linear regression algorithm models a linear relationship between a dependent variable (y) and one or more
independent variables (x), hence the name linear regression. Because the relationship is linear, the model
describes how the value of the dependent variable changes as the value of the independent variable changes.

The linear regression model provides a sloped straight line representing the relationship between the variables.
Consider the below image:

Mathematically, we can represent a linear regression as:

y= a0+a1x+ ε

Here,

Y= Dependent Variable (Target Variable)

X= Independent Variable (predictor Variable)

a0= intercept of the line (Gives an additional degree of freedom)

a1 = Linear regression coefficient (scale factor to each input value).

ε = random error

The observed values of the x and y variables form the training dataset used to fit the Linear Regression model.

Types of Linear Regression


Linear regression can be further divided into two types of the algorithm:

 Simple Linear Regression:


If a single independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Simple Linear Regression.
 Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Multiple Linear Regression.

a.Simple linear regression


Simple Linear Regression is a type of Regression algorithms that models the relationship between a dependent
variable and a single independent variable. The relationship shown by a Simple Linear Regression model is
linear or a sloped straight line, hence it is called Simple Linear Regression.

The key point in Simple Linear Regression is that the dependent variable must be a continuous/real
value. However, the independent variable can be measured on continuous or categorical values.
Simple Linear regression algorithm has mainly two objectives:

 Model the relationship between the two variables. Such as the relationship between Income and
expenditure, experience and Salary, etc.

 Forecasting new observations. Such as Weather forecasting according to temperature, Revenue of a


company according to the investments in a year, etc.

Simple Linear Regression Model:

The Simple Linear Regression model can be represented using the below equation:
y= a0+a1x+ ε

Where,

a0= It is the intercept of the Regression line (can be obtained putting x=0)
a1= It is the slope of the regression line, which tells whether the line is increasing or decreasing.
ε = The error term. (For a good model it will be negligible)

Implementation of Simple Linear Regression Algorithm using Python


Problem Statement example for Simple Linear Regression:

Here we are taking a dataset that has two variables: salary (dependent variable) and experience (independent
variable). The goals of this problem are:

 We want to find out if there is any correlation between these two variables

 We will find the best fit line for the dataset.

 How the dependent variable is changing by changing the independent variable.

Here, we will create a Simple Linear Regression model to find out the best fitting line for representing the
relationship between these two variables.

To implement the Simple Linear regression model in machine learning using Python, we need to follow the
below steps:

Step-1: Data Pre-processing

The first step for creating the Simple Linear Regression model is data pre-processing. We have already done it
earlier in this tutorial. But there will be some changes, which are given in the below steps:
a) First, we will import the three important libraries, which will help us for loading the dataset, plotting the
graphs, and creating the Simple Linear Regression model.

# Importing the libraries

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

b) Next, we will load the dataset into our code. After that, we need to extract the dependent and
independent variables from the given dataset. The independent variable is years of experience, and the
dependent variable is salary.

# Importing the dataset

dataset = pd.read_csv('Salary_Data.csv')

X = dataset.iloc[:, :-1].values

y = dataset.iloc[:, 1].values

In the above lines of code, we use the slice :-1 for the X variable because we want every column except the
last one, and we use index 1 for the y variable because we want the second column (indexing starts from zero).

c) Next, we will split both variables into the test set and training set. We have 30 observations, so we will
take 20 observations for the training set and 10 observations for the test set. We are splitting our dataset
so that we can train our model using a training dataset and then test the model using a test dataset.

# Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

Step-2: Fitting the Simple Linear Regression to the Training Set:

Now the second step is to fit our model to the training dataset. To do so, we will import the
LinearRegression class of the linear_model library from the scikit learn. After importing the class, we are
going to create an object of the class named as a regressor.

# Fitting Simple Linear Regression to the Training set

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

In the above code, we have used a fit() method to fit our Simple Linear Regression object to the training
set. In the fit() function, we have passed the x_train and y_train, which is our training dataset for the
dependent and an independent variable. We have fitted our regressor object to the training set so that the
model can easily learn the correlations between the predictor and target variables.

Step: 3. Prediction of test set result:

Our model is now fitted on the dependent variable (salary) and the independent variable (experience), so it is
ready to predict the output for new observations. In this step, we will provide the test dataset (the new
observations) to the model to check whether it can predict the correct output or not.

We will create a prediction vector y_pred, which will contain the predictions for the test dataset.
# Predicting the Test set results

y_pred = regressor.predict(X_test)

Step: 4. visualizing the Training set results:


Now in this step, we will visualize the training set result. To do so, we will use the scatter() function of the
pyplot library, which we have already imported in the pre-processing step. The scatter() function creates a
scatter plot of the observations.

On the x-axis we plot the years of experience of the employees, and on the y-axis their salaries. In the
function, we pass the real values of the training set (X_train and y_train) and a colour for the observations.
Here we use red, but any colour can be chosen.

Next, we need to plot the regression line. For this, we use the plot() function of the pyplot library, passing
the years of experience of the training set, the salaries predicted for the training set
(regressor.predict(X_train)), and the colour of the line.

Next, we give the plot a title using the title() function of the pyplot library, passing the name
"Salary vs Experience (Training set)".

After that, we assign labels for the x-axis and y-axis using the xlabel() and ylabel() functions.

# Visualising the Training set results

plt.scatter(X_train, y_train, color = 'red')

plt.plot(X_train, regressor.predict(X_train), color = 'blue')

plt.title('Salary vs Experience (Training set)')

plt.xlabel('Years of Experience')

plt.ylabel('Salary')

plt.show()

In the above plot, we can see the real observations as red dots and the predicted values covered by the blue
regression line. The regression line shows the correlation between the dependent and independent variables.

The goodness of fit can be judged from the difference between the actual and predicted values. As we can see in
the above plot, most of the observations lie close to the regression line, so our model fits the training set well.

Step: 5. visualizing the Test set results:

In the previous step, we visualized the performance of our model on the training set. Now, we will do the same
for the test set. The code remains the same as above, except that we use X_test and y_test instead of X_train
and y_train for the scatter plot.

# Visualising the Test set results

plt.scatter(X_test, y_test, color = 'red')


plt.plot(X_train, regressor.predict(X_train), color = 'blue')

plt.title('Salary vs Experience (Test set)')

plt.xlabel('Years of Experience')

plt.ylabel('Salary')

plt.show()

b.Multiple linear regression


Multiple Linear Regression is one of the important regression algorithms which models the linear
relationship between a single dependent continuous variable and more than one independent variable.

Example:

Prediction of CO2 emission based on engine size and number of cylinders in a car.

Some key points about MLR:

 For MLR, the dependent or target variable (Y) must be continuous/real, but the predictor or
independent variables may be continuous or categorical.

 Each feature variable must model the linear relationship with the dependent variable.

 MLR tries to fit a regression line through a multidimensional space of data-points.

The multiple regression equation explained above takes the following form:

y = b0 + b1x1 + b2x2 + … + bnxn

Where,

y = Output/Response variable

b0, b1, b2, …, bn = Coefficients of the model (b0 is the intercept).

x1, x2, …, xn = Various independent/feature variables

Assumptions for Multiple Linear Regression:

 A linear relationship should exist between the Target and predictor variables.

 The regression residuals must be normally distributed.

 MLR assumes little or no multicollinearity (correlation between the independent variables) in the data.

Implementation of Multiple Linear Regression model using Python:

To implement MLR using Python, we have below problem:

Problem Description:

We have a dataset of 50 start-up companies. This dataset contains five main pieces of information: R&D Spend,
Administration Spend, Marketing Spend, State, and Profit for a financial year. Our goal is to create a model
that can easily determine which company makes the maximum profit, and which factor affects a company's
profit the most.

Since we need to find the Profit, so it is the dependent variable, and the other four variables are
independent variables. Below are the main steps of deploying the MLR model:

1. Data Pre-processing Steps


2. Fitting the MLR model to the training set

3. Predicting the result of the test set

Step-1: Data Pre-processing Step:

The very first step is data pre-processing, which we have already discussed in this tutorial. This process
contains the below steps:

 Importing libraries: Firstly, we will import the library which will help in building the model. Below is
the code for it:

# Importing the libraries

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

 Importing dataset: Now we will import the dataset (50_Startups.csv), which contains all the variables,
and then extract the dependent and independent variables from it.

# Importing the dataset

dataset = pd.read_csv('50_Startups.csv')

X = dataset.iloc[:, :-1]

y = dataset.iloc[:, 4]

 Convert the 'State' column into categorical (dummy) columns:

states = pd.get_dummies(X['State'], drop_first = True)

 Drop the 'State' column:

X = X.drop('State', axis = 1)

 Concatenate the dummy variables:

X = pd.concat([X, states], axis = 1)

 Now we will split the dataset into a training set and a test set.

# Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Step: 2- Fitting our MLR model to the Training set:

Now that we have prepared our dataset, we will fit our regression model to the training set, just as we did
for the Simple Linear Regression model.

# Fitting Multiple Linear Regression to the Training set

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

Step: 3- Prediction of Test set results:

The last step for our model is checking the performance of the model. We will do it by predicting the test set
result. For prediction, we will create a y_pred vector.
# Predicting the Test set results

y_pred = regressor.predict(X_test)

Now, checking the final results
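A minimal sketch of how such scores can be obtained (the tutorial's original score-checking step is not shown here; the score() method of the fitted regressor returns the R² value on the given data):

print(regressor.score(X_train, y_train))   # score on the training data
print(regressor.score(X_test, y_test))     # score on the test data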

These scores tell us that our model is about 95% accurate on the training dataset and about 93% accurate on
the test dataset.

Applications of Multiple Linear Regression:

There are mainly two applications of Multiple Linear Regression:

 Effectiveness of Independent variable on prediction:

 Predicting the impact of changes:

2. Logistic Regression


Logistic regression aims to solve classification problems. It does this by predicting categorical outcomes,
unlike linear regression that predicts a continuous outcome.

In the simplest case there are two outcomes, which is called binomial, an example of which is predicting
if a tumor is malignant or benign. Other cases have more than two outcomes to classify, in this case it is
called multinomial. A common example for multinomial logistic regression would be predicting the class of
an iris flower between 3 different species.

Here we will be using basic logistic regression to predict a binomial variable. This means it has only two
possible outcomes.

Example:

import numpy

from sklearn import linear_model

#Reshaped for Logistic function.

X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)

y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()

logr.fit(X,y)

#predict if tumor is cancerous where the size is 3.46mm:


predicted = logr.predict(numpy.array([3.46]).reshape(-1,1))

print(predicted)
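As a follow-up to the same example, the fitted model can also report the predicted probability of each class rather than just the class label:

#probability of each class (benign, cancerous) for the same tumor size
print(logr.predict_proba(numpy.array([3.46]).reshape(-1,1)))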

3. Generalized Linear Models

Generalized Linear Models (GLMs) are a class of regression models that can be used to model a
wide range of relationships between a response variable and one or more predictor variables. Unlike
traditional linear regression models, which assume a linear relationship between the response and
predictor variables, GLMs allow for more flexible, non-linear relationships by using a different
underlying statistical distribution.

Linear regression revisited

Linear regression is used to predict the value of a continuous variable y from a linear combination of
explanatory variables X.

In the univariate case, linear regression can be expressed as follows:

y_i = b0 + b1·x_i + ε_i,  where ε_i ~ N(0, σ²)

Here, i indicates the index of each sample. Notice that this model assumes a normal distribution for the
noise term ε.

(Figure: Linear regression illustrated.) The three normal PDF (probability density function) plots in the
figure show that the data follow a normal distribution with a fixed variance around the regression line.

Poisson regression

So linear regression is all you need to know? Definitely not. If you’d like to apply statistical modelling in real

problems, you must know more than that.

For example, assume you need to predict the number of defect products (Y) with a sensor value (x) as the

explanatory variable. The scatter plot looks like this.

Do you use linear regression for this data?

There are several problems if you try to apply linear regression for this kind of data.
1. The relationship between X and Y does not look linear. It’s more likely to be exponential.

2. The variance of Y does not look constant with regard to X. Here, the variance of Y seems to

increase when X increases.

3. As Y represents the number of products, it always has to be a positive integer. In other words, Y is

a discrete variable. However, the normal distribution used for linear regression assumes

continuous variables. This also means the prediction by linear regression can be negative. It’s not

appropriate for this kind of count data.

Here, the more proper model you can think of is the Poisson regression model. Poisson regression is an

example of generalized linear models (GLM).

There are three components in generalized linear models.

1. Linear predictor

2. Link function

3. Probability distribution

In the case of Poisson regression, it is formulated like this:

Linear predictor: η_i = b0 + b1·x_i
Link function (log link): log(λ_i) = η_i
Probability distribution: y_i ~ Poisson(λ_i)

The linear predictor is just a linear combination of the parameters (b) and the explanatory variable (x).

The link function literally "links" the linear predictor to the parameter of the probability distribution. In
the case of Poisson regression, the typical link function is the log link function. This is because the
parameter of the Poisson distribution must be positive (explained later).


The last component is the probability distribution which generates the observed variable y. As we use

Poisson distribution here, the model is called Poisson regression.

The Poisson distribution is used to model count data. It has only one parameter, which is both the mean and
the variance of the distribution. This means the larger the mean, the larger the spread of the distribution.
See below.

(Figure: Poisson distributions with mean = 1, 5 and 10.)

Now, let's apply Poisson regression to our data.

(Figure: Poisson regression illustrated.) The magenta curve is the prediction by Poisson regression; the bar
plots of the probability mass function of the Poisson distribution are added to make the difference from
linear regression clear.

The prediction curve is exponential because the inverse of the log link function is an exponential function.
From this, it is also clear that the parameter for Poisson regression calculated from the linear predictor
is guaranteed to be positive.
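A minimal sketch of fitting such a model (assuming a recent scikit-learn version, which provides PoissonRegressor with a log link; the sensor values and defect counts below are made up):

import numpy as np
from sklearn.linear_model import PoissonRegressor

# Hypothetical sensor readings (x) and defect counts (y); the counts grow roughly exponentially with x
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, 3, 6, 13, 28])

# PoissonRegressor uses the log link, so its predictions are always positive
model = PoissonRegressor(alpha=0.0)
model.fit(X, y)

print(model.predict([[7.0]]))   # predicted (positive) count for a new sensor value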
Linear regression and logistic regression are both members of the much broader class of Generalized Linear
Models (GLMs). GLMs can be used to construct models for regression and classification problems by choosing the
type of distribution that best describes the data or labels used for training the model. Below are some types
of output data and the corresponding distributions that help in constructing a model for that type of data
(the term "data" here refers to the output data or the labels of the dataset):
1. Binary classification data – Bernoulli distribution
2. Real-valued data – Gaussian distribution
3. Count data – Poisson distribution
To understand GLMs we begin by defining exponential families. Exponential families are a class of
distributions whose probability density function (PDF) can be molded into the following form:

p(y; η) = b(y) · exp(ηᵀ T(y) − a(η))

where η is the natural parameter, T(y) is the sufficient statistic, and a(η) is the log-partition function.
4. Support Vector Machines

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is
used for Classification as well as Regression problems. However, primarily, it is used for Classification
problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category in
the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are
called as support vectors, and hence algorithm is termed as Support Vector Machine. Consider the
below diagram in which there are two different categories that are classified using a decision boundary
or hyperplane:

Example: SVM can be understood with the example that we have used in the KNN classifier. Suppose
we see a strange cat that also has some features of dogs, so if we want a model that can accurately
identify whether it is a cat or dog, so such a model can be created by using the SVM algorithm. We will
first train our model with lots of images of cats and dogs so that it can learn about different features of
cats and dogs, and then we test it with this strange creature. So as support vector creates a decision
boundary between these two data (cat and dog) and choose extreme cases (support vectors), it will see
the extreme case of cat and dog. On the basis of the support vectors, it will classify it as a cat. Consider
the below diagram:


SVM algorithm can be used for Face detection, image classification, text categorization, etc.

Types of SVM
SVM can be of two types:

a.Linear SVM:

Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two
classes by using a single straight line, then such data is termed linearly separable data, and the classifier
used is called a Linear SVM classifier.

o The working of the SVM algorithm can be understood by using an example. Suppose we have a
dataset that has two tags (green and blue), and the dataset has two features, x1 and x2. We want
a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the
below image:
o Since it is a 2-D space, we can easily separate these two classes by just using a straight line. But
there can be multiple lines that can separate these classes. Consider the below image:
o Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary
or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both
classes. These points are called support vectors. The distance between the vectors and the
hyperplane is called the margin, and the goal of SVM is to maximize this margin.
The hyperplane with the maximum margin is called the optimal hyperplane.
b.Non-linear SVM:
Non-Linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by
using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear
SVM classifier.

If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we
cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have used
two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below image: since we are
in 3-D space, the decision boundary looks like a plane parallel to the x-axis. If we convert it back into 2-D
space with z = 1, it becomes a circle of radius 1 around the inner class of points.
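A small sketch of this idea (my own illustration with synthetic data, not from the text): on circularly separated points, a linear SVM works once the extra feature z = x² + y² is added, and an RBF-kernel SVM achieves the same separation implicitly.

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Synthetic non-linearly separable data: one class inside a circle, one outside
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Option 1: add the third dimension z = x^2 + y^2 and use a linear SVM
z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)
X3 = np.hstack([X, z])
linear_svm = SVC(kernel='linear').fit(X3, y)
print('Linear SVM on (x, y, z):', linear_svm.score(X3, y))

# Option 2: let a kernel do the mapping implicitly (RBF kernel)
rbf_svm = SVC(kernel='rbf').fit(X, y)
print('RBF-kernel SVM on (x, y):', rbf_svm.score(X, y))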
Chapter-III

Binary Classification: Multiclass/Structured outputs, MNIST, Ranking.

……………………………………………………………………………………………………………………………..

1.Binary Classification
Binary classification is the task of classifying the given data into two classes. It is basically a prediction
about which of two groups an item belongs to.

Let us suppose, two emails are sent to you, one is sent by an insurance company that keeps sending their ads,
and the other is from your bank regarding your credit card bill. The email service provider will classify the
two emails, the first one will be sent to the spam folder and the second one will be kept in the primary
one.

This process is known as binary classification, as there are two discrete classes, one is spam and the other is
primary. So, this is a problem of binary classification.
Binary classification can be performed with several algorithms; some of the most commonly used ones are:

 Logistic Regression
 k-Nearest Neighbors
 Decision Trees
 Support Vector Machine
 Naive Bayes

Binary classification is used in a wide range of applications, such as spam email detection, medical diagnosis,
sentiment analysis, fraud detection, and many more.

2.Multiclass Classification

Multi-class classification is the task of classifying elements into several classes. Unlike binary
classification, it is not restricted to two classes.

Examples of multi-class classification are


 classification of news in different categories,

 classifying books according to the subject,

 classifying students according to their streams etc.

In these, there are different classes for the response variable to be classified in and thus according to the
name, it is a Multi-class classification.

Multiclass classification is the process of assigning entities to one of more than two classes. Each entity is
assigned to exactly one class, without any overlap. An example of multiclass classification is classifying
images of vegetables, where each image is either a carrot, a tomato, or a zucchini.
Log Loss or Cross-Entropy Loss:

o It is used for evaluating the performance of a classifier whose output is a probability value
between 0 and 1.
o For a good binary classification model, the value of log loss should be near 0.
o The value of log loss increases as the predicted value deviates from the actual value.
o A lower log loss represents a higher accuracy of the model.
o For binary classification, cross-entropy can be calculated as (a small computation sketch is given below):
o −(y·log(p) + (1 − y)·log(1 − p))
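A minimal sketch of computing the log loss with scikit-learn (the labels and predicted probabilities below are made up for illustration):

from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1, 0]            # actual class labels
y_prob = [0.9, 0.1, 0.8, 0.6, 0.3]  # predicted probabilities for class 1

print(log_loss(y_true, y_prob))     # a value closer to 0 means a better classifier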
Confusion Matrix:

o The confusion matrix provides us a matrix/table as output and describes the performance of the
model.
o It is also known as the error matrix.
o The matrix summarizes the prediction results, giving the total numbers of correct and incorrect
predictions. The matrix looks like the below table:

                        Actual Positive     Actual Negative
Predicted Positive      True Positive       False Positive
Predicted Negative      False Negative      True Negative
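A small sketch of building a confusion matrix with scikit-learn (the labels below are made up; note that scikit-learn's convention puts the actual classes on the rows and the predicted classes on the columns):

from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows correspond to actual classes, columns to predicted classes
print(confusion_matrix(y_actual, y_predicted))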

AUC-ROC curve:

o ROC curve stands for Receiver Operating Characteristics Curve and AUC stands for Area
Under the Curve.
o It is a graph that shows the performance of the classification model at different thresholds.
o To visualize the performance of the multi-class classification model, we use the AUC-ROC Curve.
o The ROC curve is plotted with the TPR (True Positive Rate) on the Y-axis and the FPR (False Positive
Rate) on the X-axis; a small example of computing these values is given below.
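A minimal sketch using scikit-learn (the true labels and predicted scores below are made up for illustration):

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3]   # predicted probabilities for class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)     # points of the ROC curve
print('AUC:', roc_auc_score(y_true, y_score))         # area under that curve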
Use cases of Classification Algorithms
Classification algorithms can be used in different places. Below are some popular use cases of
Classification Algorithms:

o Email Spam Detection


o Speech Recognition
o Identifications of Cancer tumor cells.
o Drugs Classification
o Biometric Identification, etc.

3.MNIST
The MNIST database (Modified National Institute of Standards and Technology database) is a large database
of handwritten digits that is commonly used for training various image processing systems.
The database is also widely used for training and testing in the field of machine

learning. It was created by "re-mixing" the samples from NIST's original datasets.

The creators felt that since NIST's training dataset was taken from American Census Bureau employees,
while the testing dataset was taken from American high school students, it was not well-suited for machine
learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a
28x28 pixel bounding box and anti-aliased, which introduced grayscale levels.

The MNIST database contains 60,000 training images and 10,000 testing images. Half of the training set
and half of the test set were taken from NIST's training dataset, while the other half of the training set and
the other half of the test set were taken from NIST's testing dataset. The original creators of the database
keep a list of some of the methods tested on it. In their original paper, they use a support-vector machine
to get an error rate of 0.8%.
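One common way to load MNIST in Python (a sketch; this assumes scikit-learn and an internet connection, since fetch_openml downloads the 70,000 images on first use):

from sklearn.datasets import fetch_openml

# Each image is 28x28 pixels, stored as a flat vector of 784 grayscale values
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist.data, mnist.target

X_train, X_test = X[:60000], X[60000:]   # the usual 60,000 / 10,000 split
y_train, y_test = y[:60000], y[60000:]

print(X_train.shape, X_test.shape)       # (60000, 784) (10000, 784)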

Extended MNIST (EMNIST) is a newer dataset developed and released by NIST to be the (final) successor
to MNIST. MNIST included images only of handwritten digits. EMNIST includes all the images
from NIST Special Database 19, which is a large database of handwritten uppercase and lower case letters as
well as digits. The images in EMNIST were converted into the same 28x28 pixel format, by the same
process, as were the MNIST images. Accordingly, tools which work with the older, smaller, MNIST dataset
will likely work unmodified with EMNIST.

4.ANOVA
Analysis of variance (ANOVA) is an analysis tool used in statistics that splits an observed
aggregate variability found inside a data set into two parts: systematic factors and random
factors. The systematic factors have a statistical influence on the given data set, while the
random factors do not. Analysts use the ANOVA test to determine the influence that
independent variables have on the dependent variable in a regression study.

The t- and z-test methods developed in the 20th century were used for statistical analysis
until 1918, when Ronald Fisher created the analysis of variance method. ANOVA is
also called the Fisher analysis of variance, and it is the extension of the t- and z-tests. The
term became well-known in 1925, after appearing in Fisher's book, "Statistical Methods for
Research Workers." It was employed in experimental psychology and later expanded to
subjects that were more complex.

The ANOVA test is the initial step in analyzing factors that affect a given data set. Once
the test is finished, an analyst performs additional testing on the methodical factors that
measurably contribute to the data set's inconsistency. The analyst utilizes the ANOVA test
results in an f-test to generate additional data that aligns with the
proposed regression models.

How to Use ANOVA
A researcher might, for example, test students from multiple colleges to see if students
from one of the colleges consistently outperform students from the other colleges. In a
business application, an R&D researcher might test two different processes of creating a
product to see if one process is better than the other in terms of cost efficiency.

The type of ANOVA test used depends on a number of factors. It is applied when the data
are experimental. Analysis of variance can also be employed when there is no access to
statistical software, in which case ANOVA is computed by hand. It is simple to use and best
suited for small samples. With many experimental designs, the sample sizes have to be
the same for the various factor-level combinations.

ANOVA is helpful for testing three or more variables. It is similar to multiple two-sample t-
tests. However, it results in fewer type I errors and is appropriate for a range of issues.
ANOVA groups differences by comparing the means of each group and includes
spreading out the variance into diverse sources. It is employed with subjects, test groups,
between groups and within groups.

One-Way ANOVA Versus Two-Way ANOVA


There are two main types of ANOVA: one-way (or unidirectional) and two-way. There are also
variations of ANOVA. For example, MANOVA (multivariate ANOVA) differs from ANOVA
as the former tests for multiple dependent variables simultaneously while the latter
assesses only one dependent variable at a time. One-way or two-way refers to the
number of independent variables in your analysis of variance test. A one-way ANOVA
evaluates the impact of a sole factor on a sole response variable. It determines whether all
the samples are the same. The one-way ANOVA is used to determine whether there are
any statistically significant differences between the means of three or more independent
(unrelated) groups.
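A minimal sketch of a one-way ANOVA in Python (the exam scores for the three hypothetical colleges below are made up):

from scipy import stats

# Hypothetical exam scores from students of three different colleges
college_a = [85, 90, 78, 92, 88]
college_b = [80, 85, 79, 74, 81]
college_c = [70, 72, 68, 75, 71]

# One-way ANOVA: tests whether the three group means are all equal
f_stat, p_value = stats.f_oneway(college_a, college_b, college_c)
print(f_stat, p_value)   # a small p-value suggests at least one mean differs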

A two-way ANOVA is an extension of the one-way ANOVA. With a one-way ANOVA, you have one
independent variable affecting a dependent variable. With a two-way ANOVA, there are
two independent variables. For example, a two-way ANOVA allows a company to compare worker
productivity based on two independent variables, such as salary and skill set. It is utilized
to observe the interaction between the two factors and tests the effect of two factors at the
same time.
