
Machine Learning (BCS602)

MODULE -03

CHAPTER -01: SIMILARITY BASED LEARNING

INTRODUCTION TO SIMILARITY OR INSTANCE-BASED LEARNING


 Similarity-based classifiers use similarity measures to locate the nearest neighbors and classify a
test instance; this works in contrast with other learning mechanisms such as decision trees or
neural networks.
 Similarity-based learning is also called Instance-based learning or Just-in-time learning since it
does not build an abstract model of the training instances and performs lazy learning when
classifying a new instance.
 This learning mechanism simply stores all the data and uses it only when it needs to classify an unseen
instance.
 The advantage of using this learning is that processing occurs only when a request to classify a
new instance is given.
 This methodology is particularly useful when the whole dataset is not available in the beginning
but collected in an incremental manner.
 The drawback of this learning is that it requires a large memory to store the data since a global
abstract model is not constructed initially with the training data.
 Classification of instances is done based on the measure of similarity in the form of distance
functions over data instances.
 Several distance metrics are used to estimate the similarity or dissimilarity between instances
required for clustering, nearest neighbor classification, anomaly detection, and so on.
 Popular distance metrics used are Hamming distance, Euclidean distance, Manhattan distance,
Minkowski distance, Cosine similarity, Mahalanobis distance, Pearson's correlation or
correlation similarity, Mean squared difference, Jaccard coefficient, Tanimoto coefficient, etc.
 Generally, Similarity-based classification problems formulate the features of test instance and
training instances in Euclidean space to learn the similarity or dissimilarity between instances.

Differences between Instance- and Model-based Learning


 Instance-based Learning also comes under the category of memory-based models which
normally compare the given test instance with the trained instances that are stored in memory.
Memory-based models classify a test instance by checking the similarity with the training
instance.


 Some examples of Instance-based learning algorithms are:


1. k-Nearest Neighbor (k-NN)
2. Variants of Nearest Neighbor learning
3. Locally Weighted Regression
4. Learning Vector Quantization (LVQ)
5. Self-Organizing Map (SOM)
6. Radial Basis Function (RBF) networks.
 These instance-based methods have serious limitations about the range of feature values taken.
Moreover, they are sensitive to irrelevant and correlated features leading to misclassification of
instances.
INSTANCE-BASED LEARNING | MODEL-BASED LEARNING
Lazy learners | Eager learners
Processing of training instances is done only during the testing phase | Processing of training instances is done during the training phase
No model is built with training instances before it receives a test instance | Generalizes a model with training instances before it receives a test instance
Predicts the class of the test instance directly from the training data | Predicts the class of the test instance from the model built
Slow in testing phase | Fast in testing phase
Learns by making many local approximations | Learns by creating global approximations

NEAREST-NEIGHBOR LEARNING
 A natural approach to similarity-based classification is k-Nearest-Neighbors (k-NN), which is a
non-parametric method used for both classification and regression problems.
 It is a simple and powerful non-parametric algorithm that predicts the category of the test
instance according to the 'k' training samples which are closer to the test instance and classifies it
to that category which has the largest probability.
 A visual representation of this learning is shown in Figure 4.1.



 There are two classes of objects, C1 and C2, in the given figure. When given a test instance
T, the category of this test instance is determined by looking at the class of its k = 3 nearest
neighbors.
 Thus, the class of this test instance T is predicted as the majority class among those three neighbors.

K- Nearest Neighbors Algorithm


 k-NN performs instance-based learning: it just stores the training data instances and learns from them
case by case.
 The model is also 'memory-based' as it uses the training data when predictions need to be made.
 It is a lazy learning algorithm since no prediction model is built earlier with training instances
and classification happens only after getting the test instance.
 The algorithm classifies a new instance by determining the 'K' most similar instances (i.e k
nearest neighbors) and summarizing the output of those instances.
 If the target variable is discrete then it is a classification problem, so it selects the most common
class value among the instances by a majority vote.
 However, if the target variable is continuous then it is a regression problem, and hence the mean
output variable of the 'K' instances is the output of the test instance.
 The most popular distance measure such as Euclidean distance is used in k-NN to determine the
'k' instances which are similar to the test instance.
 The value of 'K' is best determined by tuning with different 'K' values and choosing the 'K'
which classifies the test instance most accurately.

 Data normalization/standardization is required when the features have different ranges or a wide
range of possible values; it transforms all features to a specific range before distances are
computed.
 k-NN classifier performance is strictly affected by three factors such as the number of nearest
neighbors (i.e., selection of k), distance metric and decision rule.
 If the k value selected is small then it may result in overfitting or less stable and if it is big
then it may include many irrelevant points from other classes.
 The choice of the distance metric selected also plays a major role and it depends on the type
of the independent attributes in the training dataset
 The k-NN classification algorithm best suits lower dimensional data, as in a high-dimensional
space the nearest neighbors may not be very close at all.

Algorithm: k-NN

Inputs: Training dataset T, distance metric d, test instance t, the number of nearest neighbors k
Output: Predicted class or category.
Prediction: For test instance t,
Step 1: For each instance i in T, compute the distance between the test instance t and every other
instance i in the training dataset using a distance metric (Euclidean distance).
[Continuous attributes - the Euclidean distance between two points in the plane with coordinates (x1,
y1) and (x2, y2) is given as dist((x1, y1), (x2, y2)) = √((x1 − x2)² + (y1 − y2)²) ]

[Categorical attributes (Binary) - Hamming Distance: If the value of the two instances is same, the
distance d will be equal to 0 otherwise d = 1.]

Step 2: Sort the distances in an ascending order and select the first k nearest training data
instances to the test instance.

Step 3: Predict the class of the test instance by majority voting (if target attribute is discrete
valued) or mean (if target attribute is continuous valued) of the k selected nearest instances.

Example:

Brightness Saturation Class


40 20 Red
50 50 Blue
60 90 Blue

Prepared by Ms. Shilpa, Assistant Professor, Dept. of CSE, NCEH 4


Machine Learning (BCS602)

Brightness Saturation Class


10 25 Red
70 70 Blue
60 10 Red
25 80 Blue
20 35 ?
To know the class of the new instance, we have to calculate the distance from the new entry to other
entries in the data set using the Euclidean distance formula.

Here's the formula: √((X1 − X2)² + (Y1 − Y2)²)

Step 1:

Brightness  Saturation  Class  Euclidean distance

40  20  Red   √((40 − 20)² + (20 − 35)²) = 25
50  50  Blue  √((50 − 20)² + (50 − 35)²) = 33.54
60  90  Blue  √((60 − 20)² + (90 − 35)²) = 68.01
10  25  Red   √((10 − 20)² + (25 − 35)²) = 14.14
70  70  Blue  √((70 − 20)² + (70 − 35)²) = 61.03
60  10  Red   √((60 − 20)² + (10 − 35)²) = 47.17
25  80  Blue  √((25 − 20)² + (80 − 35)²) = 45.28

20  35  ?

Step 2: Assuming k = 3, select the 3 instances with the least Euclidean distance

Brightness  Saturation  Class  Euclidean distance

10  25  Red   14.14
40  20  Red   25
50  50  Blue  33.54

Step 3: Since Red has the majority vote, the given new instance/test data (20, 35, ?) will be
classified as RED.

Example 2:


Sl. No CGPA Assessment Project Submitted Result


1 9.2 85 8 Pass
2 8 80 7 Pass
3 8.5 81 8 Pass
4 6 45 5 Fail
5 6.5 50 4 Fail
6 8.2 72 7 Pass
7 5.8 38 5 Fail
8 8.9 91 9 Pass

Consider the new test instance as (6.1, 40, 5)

Step 1:
Sl. No  CGPA  Assessment  Project Submitted  Result  Euclidean distance

1  9.2  85  8  Pass  √((9.2 − 6.1)² + (85 − 40)² + (8 − 5)²) = 45.206
2  8    80  7  Pass  √((8 − 6.1)² + (80 − 40)² + (7 − 5)²) = 40.095
3  8.5  81  8  Pass  √((8.5 − 6.1)² + (81 − 40)² + (8 − 5)²) = 41.179
4  6    45  5  Fail  √((6 − 6.1)² + (45 − 40)² + (5 − 5)²) = 5.001
5  6.5  50  4  Fail  √((6.5 − 6.1)² + (50 − 40)² + (4 − 5)²) = 10.057
6  8.2  72  7  Pass  √((8.2 − 6.1)² + (72 − 40)² + (7 − 5)²) = 32.131
7  5.8  38  5  Fail  √((5.8 − 6.1)² + (38 − 40)² + (5 − 5)²) = 2.022
8  8.9  91  9  Pass  √((8.9 − 6.1)² + (91 − 40)² + (9 − 5)²) = 51.233

Step 2: Assuming k = 3, select the 3 instances with the least Euclidean distance

CGPA  Assessment  Project Submitted  Result  Euclidean distance

5.8  38  5  Fail  2.022
6    45  5  Fail  5.001
6.5  50  4  Fail  10.057

Step 3: Since Fail has the majority vote, the given new instance/test data (6.1, 40, 5) will be
classified as Fail.
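As a hedged illustration (not part of the original notes), the short Python sketch below implements the three steps of the k-NN algorithm; the function name knn_predict and the data layout are our own, and the training data is the Example 2 table above.

```python
import math
from collections import Counter

def knn_predict(train, test, k=3):
    """Classify 'test' by majority vote among its k nearest training rows.
    Each training row is (features, label); distances are Euclidean."""
    # Step 1: distance from the test instance to every training instance
    dists = [(math.dist(x, test), label) for x, label in train]
    # Step 2: sort ascending and keep the first k neighbours
    neighbours = sorted(dists, key=lambda d: d[0])[:k]
    # Step 3: majority vote over the neighbours' labels
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Training data of Example 2: (CGPA, Assessment, Project Submitted) -> Result
train = [((9.2, 85, 8), "Pass"), ((8.0, 80, 7), "Pass"), ((8.5, 81, 8), "Pass"),
         ((6.0, 45, 5), "Fail"), ((6.5, 50, 4), "Fail"), ((8.2, 72, 7), "Pass"),
         ((5.8, 38, 5), "Fail"), ((8.9, 91, 9), "Pass")]

print(knn_predict(train, (6.1, 40, 5), k=3))   # -> Fail, as in Step 3 above
```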

WEIGHTED K-NEAREST-NEIGHBOR ALGORITHM


 The Weighted k-NN is an extension of k-NN. It chooses the neighbors by using the weighted
distance. The k-Nearest Neighbor (k-NN) algorithm has some serious limitations as its
performance is solely dependent on choosing the k nearest neighbors, the distance metric used
and the decision rule.
 However, the principal idea of Weighted k-NN is that k closest neighbors to the test instance are
assigned a higher weight in the decision as compared to neighbors that are farther away from the
test instance.
 The idea is that weights are inversely proportional to distances. The selected k nearest neighbors
can be assigned uniform weights, which means all the instances in each neighborhood are
weighted equally or weights can be assigned by the inverse of their distance.
 In the second case, closer neighbors of a query point will have a greater influence than neighbors
which are further away.

Algorithm: Weighted k-NN

Inputs: Training dataset 'T', Distance metric 'd', Weighting function w(i), Test instance 't', the
number of nearest neighbors ‘k’.
Output: Predicted class or category
Prediction: For test instance t,
Step 1: For each instance 'i’ in Training dataset T, compute the distance between the test instance t
and every other instance 'i' using a distance metric (Euclidean distance).
Step 2: Sort the distances in ascending order and select the first 'k' nearest training data instances
to the test instance.
Step 3: Predict the class of the test instance by weighted voting technique (Weighting function w(i))
for the k selected nearest instances:
 Compute the inverse of each distance of the 'K' selected nearest instances.
 Find the sum of the inverses.
 Compute the weight by dividing each inverse distance by the sum. (Each weight is a vote
for its associated class).


 Add the weights of the same class.


 Predict the class by choosing the class with the maximum vote.

Example:
Consider the same training dataset given in the table below. Use Weighted k-NN and determine the class
of the new test data (7.6, 60, 8).

Sl. No CGPA Assessment Project Submitted Result


1 9.2 85 8 Pass
2 8 80 7 Pass
3 8.5 81 8 Pass
4 6 45 5 Fail
5 6.5 50 4 Fail
6 8.2 72 7 Pass
7 5.8 38 5 Fail
8 8.9 91 9 Pass

Step 1:

Sl. No  CGPA  Assessment  Project Submitted  Result  Euclidean distance

1  9.2  85  8  Pass  √((9.2 − 7.6)² + (85 − 60)² + (8 − 8)²) = 25.051
2  8    80  7  Pass  √((8 − 7.6)² + (80 − 60)² + (7 − 8)²) = 20.028
3  8.5  81  8  Pass  √((8.5 − 7.6)² + (81 − 60)² + (8 − 8)²) = 21.019
4  6    45  5  Fail  √((6 − 7.6)² + (45 − 60)² + (5 − 8)²) = 15.380
5  6.5  50  4  Fail  √((6.5 − 7.6)² + (50 − 60)² + (4 − 8)²) = 10.826
6  8.2  72  7  Pass  √((8.2 − 7.6)² + (72 − 60)² + (7 − 8)²) = 12.056
7  5.8  38  5  Fail  √((5.8 − 7.6)² + (38 − 60)² + (5 − 8)²) = 22.276
8  8.9  91  9  Pass  √((8.9 − 7.6)² + (91 − 60)² + (9 − 8)²) = 31.043


Step 2: Sort the distances in ascending order and select the first 3 nearest training data
instances to the test instance. The selected nearest neighbors are

Sl. No  CGPA  Assessment  Project Submitted  Result  Euclidean distance

5  6.5  50  4  Fail  √((6.5 − 7.6)² + (50 − 60)² + (4 − 8)²) = 10.826
6  8.2  72  7  Pass  √((8.2 − 7.6)² + (72 − 60)² + (7 − 8)²) = 12.056
4  6    45  5  Fail  √((6 − 7.6)² + (45 − 60)² + (5 − 8)²) = 15.380

Step 3: Predict the class of the test instance by the weighted voting technique from the 3 selected nearest
instances.
 Compute the inverse of each distance of the 3 selected nearest instances

CGPA  Assessment  Project Submitted  Result  Euclidean distance (ED)  Inverse distance (1/ED)

6.5  50  4  Fail  10.826  1/10.826 = 0.0923
8.2  72  7  Pass  12.056  1/12.056 = 0.0829
6    45  5  Fail  15.380  1/15.380 = 0.0650

 Find the sum of the inverse


Sum= 0.0923 + 0.0829 + 0.0650 = 0.2403

 Compute the weight by dividing each inverse distance by the sum as

CGPA  Assessment  Project Submitted  Result  Euclidean distance (ED)  Inverse distance (1/ED)  Weight = Inverse distance / Sum

6.5  50  4  Fail  10.826  0.0923  0.0923 / 0.2403 = 0.3843
8.2  72  7  Pass  12.056  0.0829  0.0829 / 0.2403 = 0.3451
6    45  5  Fail  15.380  0.0650  0.0650 / 0.2403 = 0.2705

 Add the weights of the same class


Fail = 0.2705 + 0.3843 = 0.6548


Pass = 0.3451
 Predict the class by choosing the class with the maximum vote.
Since Fail has the larger total weight (0.6548 > 0.3451), the new instance is classified as Fail.
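A sketch of this weighted voting variant, again with illustrative names of our own, is shown below; it reproduces the Fail ≈ 0.655 and Pass ≈ 0.345 weights computed above.

```python
import math

def weighted_knn_predict(train, test, k=3):
    """Weighted k-NN: each of the k nearest neighbours votes with a weight
    proportional to the inverse of its distance (normalised to sum to 1)."""
    dists = sorted(((math.dist(x, test), label) for x, label in train),
                   key=lambda d: d[0])[:k]
    inverses = [1.0 / d for d, _ in dists]            # inverse distances
    total = sum(inverses)
    votes = {}
    for (_, label), inv in zip(dists, inverses):
        votes[label] = votes.get(label, 0.0) + inv / total   # weight = vote
    return max(votes, key=votes.get), votes

# Same training data as the k-NN example; test instance (7.6, 60, 8)
train = [((9.2, 85, 8), "Pass"), ((8.0, 80, 7), "Pass"), ((8.5, 81, 8), "Pass"),
         ((6.0, 45, 5), "Fail"), ((6.5, 50, 4), "Fail"), ((8.2, 72, 7), "Pass"),
         ((5.8, 38, 5), "Fail"), ((8.9, 91, 9), "Pass")]
print(weighted_knn_predict(train, (7.6, 60, 8)))   # Fail ≈ 0.655, Pass ≈ 0.345
```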
NEAREST CENTROID CLASSIFIER
 A simple alternative to k-NN classifiers for similarity-based classification is the Nearest Centroid
Classifier.
 It is a simple classifier and is also called the Mean Difference classifier.
 The idea of this classifier is to classify a test instance to the class whose centroid/mean is closest
to that instance.
Algorithm: Nearest Centroid Classifier
Inputs: Training dataset T, Distance metric d, Test instance t
Output: Predicted class or category
Step 1: Compute the mean/centroid of each class.
Step 2: Compute the distance between the test instance and mean/centroid of each class (Euclidean
Distance).
Step 3: Predict the class by choosing the class with the smaller distance.

Example: Consider the sample data shown in below Table with two features x and y. The target
classes are 'A' or 'B'. Predict the class of the instance (6, 5) using Nearest Centroid Classifier.
X Y Class
3 1 A
5 2 A
4 3 A
7 6 B
6 7 B
8 5 B

Step 1: Compute the mean/centroid of each class. In this example there are 2 classes called ‘A’ &
‘B’
Centroid of class ‘A’ = (3 +5 + 4 , 1 + 2 + 3)/ 3 =( 12, 6)/3 = (4 , 2)
Centroid of class ‘B’ = (7 +6 + 8 , 6 + 7 + 5)/ 3 =( 21, 18)/3 = (7 , 6)

Step 2: Calculate the Euclidean distance between test instance (6 ,5) and each of the centroid.


Euc_distance(test, centroid of A) = √((6 − 4)² + (5 − 2)²) = 3.6

Euc_distance(test, centroid of B) = √((6 − 7)² + (5 − 6)²) = 1.414

Step 3: The test instance is closer to the centroid of class B (1.414 < 3.6), so it is predicted as class B.
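A minimal sketch of the Nearest Centroid Classifier, assuming the small x/y dataset above (function and variable names are our own):

```python
import math
from collections import defaultdict

def nearest_centroid_predict(train, test):
    """Nearest Centroid Classifier: assign 'test' to the class whose
    mean vector (centroid) is closest in Euclidean distance."""
    groups = defaultdict(list)
    for x, label in train:
        groups[label].append(x)
    # Step 1: centroid (feature-wise mean) of each class
    centroids = {label: tuple(sum(col) / len(rows) for col in zip(*rows))
                 for label, rows in groups.items()}
    # Steps 2-3: pick the class with the smallest distance to its centroid
    return min(centroids, key=lambda label: math.dist(centroids[label], test))

data = [((3, 1), "A"), ((5, 2), "A"), ((4, 3), "A"),
        ((7, 6), "B"), ((6, 7), "B"), ((8, 5), "B")]
print(nearest_centroid_predict(data, (6, 5)))   # -> B (centroids (4, 2) and (7, 6))
```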

LOCALLY WEIGHTED REGRESSION (LWR)


 Locally Weighted Regression (LWR) is a non-parametric supervised learning algorithm that
performs local regression by combining a regression model with the nearest neighbors model.
 LWR is also referred to as a memory-based method as it requires the training data at prediction
time but uses only the training data instances locally around the point of interest.
 Using the nearest neighbors algorithm, we find the instances that are closest to a test instance and fit
a linear function to each of those 'k' nearest instances in the local regression model.
 The key idea is that we need to approximate the linear functions of all 'k' neighbors that
minimize the error such that the prediction line is no more linear but rather it is a curve.
 Ordinary linear regression finds out a linear relationship between the input x and the output y.
Given training dataset T,
 Hypothesis function h(x): the predicted target output is a linear function, where β0 is the
intercept and β1 is the coefficient of x.
h(x) = β0 + β1 * x
 The cost function minimizes the error difference between the predicted value h(xi) and the true
value yi, and it is given as

J(β) = Σ (i = 1 to m) (h(xi) − yi)²
where 'm' is the number of instances in the training dataset.
 Now the cost function is modified for locally weighted linear regression including the weights
only for the nearest neighbor points.
 Hence, the cost function is given as

J(β) = Σ (i = 1 to m) wi (h(xi) − yi)²
where wi is the weight associated with each xi.
 The weight function used is a Gaussian kernel that gives a higher value for instances that are
close to the test instance, and for instances far away, it tends to zero but never equals to zero.

Prepared by Ms. Shilpa, Assistant Professor, Dept. of CSE, NCEH 11


Machine Learning (BCS602)

 wi is computed as

wi = exp( −(xi − x)² / (2τ²) )

where τ is called the bandwidth parameter and controls the rate at which the weight wi falls off as the
training instance xi moves away from the query point x.
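A rough sketch of LWR at a single query point is given below; it assumes the Gaussian kernel weight above and solves the weighted least-squares fit with NumPy. The bandwidth value and the small 1-D dataset are illustrative, not from the notes.

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Locally weighted regression at a single query point (a sketch).
    Weights w_i = exp(-(x_i - x_query)^2 / (2 * tau^2)) emphasise nearby
    instances; the weighted least-squares fit is evaluated at x_query."""
    w = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))   # Gaussian kernel weights
    A = np.column_stack([np.ones_like(X), X])          # design matrix [1, x]
    W = np.diag(w)
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)   # weighted normal equations
    return beta[0] + beta[1] * x_query

# Hypothetical 1-D data just to show the call
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.8, 2.6, 3.2, 3.8])
print(lwr_predict(X, y, x_query=2.5, tau=1.0))
```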


CHAPTER -02: REGRESSION ANALYSIS

Regression analysis is a supervised learning method for predicting continuous variables. The
difference between classification and regression analysis is that regression methods are used to
predict quantitative or continuous variables rather than categorical variables or labels. It is used to
predict linear or non-linear relationships among variables of the given dataset.

INTRODUCTION TO REGRESSION
 Regression analysis is the premier method of supervised learning. This is one of the most popular
and oldest supervised learning techniques.
 Given a training dataset D containing N training points (xi, yi), where i = 1...N, regression analysis
is used to model the relationship between one or more independent variables xi and a dependent
variable yi.
 The relationship between the dependent and independent variables can be represented as a
function as follows:
Y = f(x)
 Here, the feature variable x is also known as an explanatory variable, exploratory variable, a
predictor variable, an independent variable, a covariate, or a domain point.
 y is a dependent variable. Dependent variables are also called as labels, target variables, or
response variables.
 Regression analysis determines the change in response variables when one explanatory variable is
varied while keeping all other parameters constant.
 This is used to determine the relationship each of the exploratory variables exhibits. Thus,
regression analysis is used for prediction and forecasting.
 Regression is used to predict continuous variables or quantitative variables such as price and
revenue. Thus, the primary concern of regression analysis is to find answer to questions such as:
 What is the relationship between the variables?
 What is the strength of the relationships?
 What is the nature of the relationship such as linear or non-linear?
 What is the relevance of the attributes?
 What is the contribution of each attribute?
 There are many applications of regression analysis. Some of the applications of regressions
include predicting:
 Sales of goods or services


 Value of bonds in portfolio management


 Premium on insurance companies
 Yield of crops in agriculture
 Prices of real estate.

INTRODUCTION TO LINEARITY, CORRELATION, AND CAUSATION


The quality of the regression analysis is determined by the factors such as correlation and causation.
Regression and Correlation
 Correlation among two variables can be done effectively using a Scatter plot, which is a plot
between explanatory variables and response variables.
 It is a 2D graph showing the relationship between two variables. The x-axis of the scatter plot is
independent, or input or predictor variables and y-axis of the scatter plot is output or dependent or
predicted variables.
 The scatter plot is useful in exploring data. Some of the scatter plots are shown in Figure 5.1.
 The Pearson correlation coefficient is the most common test for determining correlation if there
is an association between two variables.
 The correlation coefficient is denoted by r.
 The positive, negative, and random correlations are given in Figure 5.1.

 In positive correlation, one variable change is associated with the change in another variable.
 In negative correlation, the relationship between the variables is reciprocal while in random
correlation, no relationship exists between variables.
 While correlation is about relationships among variables, say x and y, regression is about
predicting one variable given another variable.

Regression and Causation


 Causation is about causal relationship among variables, say x and y.
 Causation means knowing whether x causes y to happen or vice versa.
 x causes y is often denoted as x implies y.
 Correlation and Regression relationships are not same as causation relationship.
 For example, the correlation between economical background and marks scored does not imply
that economic background causes high marks.
Linearity and Non-linearity Relationships
 The linearity relationship between the variables means the relationship between the dependent and
independent variables can be visualized as a straight line.
 The line of the form, y = ax + b can be fitted to the data points that indicate the relationship
between x and y.
 By linearity, it is meant that as one variable increases, the corresponding variable also increases in
a linear manner.
 A linear relationship is shown in Figure 5.2 (a). A non-linear relationship exists in functions such
as exponential function and power function and it is shown in Figures 5.2 (b) and 5.2 (c).

Here, x-axis is given by x data and y-axis is given by y data.


 The functions like the exponential function (y = axᵇ) and the power function (y = x/(ax + b)) are non-linear
relationships between the dependent and independent variables that cannot be fitted in a line.
 This is shown in Figures 5.2 (b) and (c).

TYPES OF REGRESSION METHODS


The regression methods are classified as follows:

Linear Regression: It is a type of regression where a line is fitted upon given data for finding the
linear relationship between one independent variable and one dependent variable to describe
relationships.

Multiple Regression: It is a type of regression where a line is fitted for finding the linear relationship
between two or more independent variables and one dependent variable to describe relationships
among variables.

Polynomial Regression: It is a type of non-linear regression method of describing relationships


among variables, where an Nth degree polynomial is used to model the relationship between one
independent variable and one dependent variable. Polynomial multiple regression is used to model
two or more independent variables and one dependent variable.

Logistic Regression: It is used for predicting categorical variables that involve one or more
independent variables and one dependent variable. This is also known as a binary classifier.

Lasso and Ridge Regression Methods: These are special variants of regression method where
regularization methods are used to limit the number and size of coefficients of the independent
variables.

LIMITATIONS OF REGRESSION METHOD

1. Outliers - Outliers are abnormal data points. They can bias the outcome of the regression model, as
outliers pull the regression line towards them.

2. Number of cases - The ratio of cases to independent variables should be at least 20:1. For
every explanatory variable, there should be at least 20 samples. At least five samples per variable are required in
extreme cases.

3. Missing data - Missing data in training data can make the model unfit for the sampled data.

4. Multicollinearity - If the explanatory variables are highly correlated, the regression is vulnerable to
bias. Singularity leads to a perfect correlation of 1.

INTRODUCTION TO LINEAR REGRESSION


 In the simplest form, the linear regression model can be created by fitting a line among the
scattered data points. The line is of the form
y = a0 + a1* x + e
 Here, a0 is the intercept which represents the bias and a1, represents the slope of the line. These are
called regression coefficients. e is the error in prediction.
 The assumptions of linear regression are listed as follows:
1. The observations (y) are random and are mutually independent.
2. The difference between the predicted and true values is called an error. The errors are also
mutually independent with the same distribution, such as a normal distribution with zero mean
and constant variance.
3. The distribution of the error term is independent of the joint distribution of explanatory
variables.
4. The unknown parameters of the regression models are constants.
 The idea of linear regression is based on the Ordinary Least Squares (OLS) approach.
 In this method, the data points are modelled using a straight line. Any arbitrarily drawn line is not
an optimal line.

 The vertical distance between each point and the line is called an error. These individual errors are
added to compute the total error of the predicted line. This is called sum of residuals.
 The squares of the individual errors can also be computed and added to give a sum of squared
error. The line with the lowest sum of squared error is called line of best fit.
 In another words, OLS is an optimization technique where the difference between the data points
and the line is optimized.
 Mathematically, the line equations for the points (x1, x2, ..., xn) are:

y1 = (a0 + a1x1) + e1
y2 = (a0 + a1x2) + e2
...
yn = (a0 + a1xn) + en
 In general, the error is given as: ei = yi − (a0 + a1xi)
 Here, the terms (e1, e2, ..., en) are the errors associated with the data points and denote the difference
between the true value of the observation and the point on the line. These are also called residuals.
The residuals can be positive, negative or zero.
 A regression line is the line of best fit for which the sum of the squares of residuals is minimum.
 The minimization can be done as minimization of the individual errors by finding the parameters a0
and a1 such that:

E = Σ (i = 1 to n) ei = Σ (i = 1 to n) ( yi − (a0 + a1xi) )

Or as the minimization of the sum of the absolute values of the individual errors:

E = Σ (i = 1 to n) |ei| = Σ (i = 1 to n) | yi − (a0 + a1xi) |

Or as the minimization of the sum of the squares of the individual errors:

E = Σ (i = 1 to n) ei² = Σ (i = 1 to n) ( yi − (a0 + a1xi) )²

 The sum of the squares of the individual errors is often preferred, since the individual errors do not get
cancelled out and are always positive, and the sum of squares results in a large increase even for a
small change in the error.
 Therefore, this is preferred for linear regression, and linear regression is modelled as a
minimization function as follows:

J(a0, a1) = Σ (i = 1 to n) [ yi − f(xi) ]² = Σ (i = 1 to n) ( yi − (a0 + a1xi) )²
 Here, J(a0, a1) is the criterion function of parameters a0 and a1. This needs to be minimized. This is
done by differentiating and substituting to zero. This yields the coefficient values of a0 and a1. The
values of the estimates of a0 and a1 are given as follows:

a1 = ( mean(x·y) − mean(x)·mean(y) ) / ( mean(x²) − (mean(x))² )

 And the value of a0 is given as follows:

a0 = mean(y) − a1·mean(x)

Problems
Let us consider an example where the five week’s sales data (in Thousands) is given as shown below.
Apply linear regression technique to predict the 7th and 9th week sales.

Xi (Week) Yi (sales in Thousands)


1 1.2
2 1.8
3 2.6
4 3.2


5 3.8
Here there are 5 samples, so i ranges from 1 to 5. The computation table is shown below.
Xi Yi Xi2 Xi * Yi
1 1.2 1 1.2
2 1.8 4 3.6
3 2.6 9 7.8
4 3.2 16 12.8
5 3.8 25 19
Sum =15 Sum =12.6 Sum =55 Sum = 44.4
Average of xi: x̄ = 3    Average of yi: ȳ = 2.52    Average of xi²: 11    Average of (xi·yi): 8.88

Solution
a1 = ( 8.88 − (3)(2.52) ) / ( 11 − (3)² ) = 1.32 / 2 = 0.66

a0 = 2.52 − 0.66 × 3 = 0.54

The predicted 7th week sale would be (when x= 7), y = 0.54 + 0.66 *7 = 5.16 and 9th week sales
would be (when x=9), y = 0.54 + 0.66 *9 = 6.48
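The closed-form estimates above can be checked with a short Python sketch (function and variable names are our own); it reproduces a0 = 0.54, a1 = 0.66 and the 7th and 9th week predictions.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a0 + a1*x using the averaged forms
    a1 = (mean(xy) - mean(x)*mean(y)) / (mean(x^2) - mean(x)^2),
    a0 = mean(y) - a1*mean(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    mxy = sum(xi * yi for xi, yi in zip(x, y)) / n
    mx2 = sum(xi * xi for xi in x) / n
    a1 = (mxy - mx * my) / (mx2 - mx ** 2)
    a0 = my - a1 * mx
    return a0, a1

weeks = [1, 2, 3, 4, 5]
sales = [1.2, 1.8, 2.6, 3.2, 3.8]
a0, a1 = fit_line(weeks, sales)        # a0 = 0.54, a1 = 0.66
print(a0 + a1 * 7, a0 + a1 * 9)        # 5.16 and 6.48 (7th and 9th week sales)
```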

Linear Regression in Matrix Form


Matrix notations can be used for representing the values of the independent and dependent variables.
The equation can be written in the form of matrix as follows:
[y1]   [1  x1]          [e1]
[y2]   [1  x2]   [a0]   [e2]
[y3] = [1  x3] × [a1] + [e3]
[ :]   [:   :]          [ :]
[yn]   [1  xn]          [en]

This can be written as

Y = Xa + e, where X is an n×2 matrix, Y is an n×1 vector, a is a 2×1 column vector and e is an n×1
column vector.

Problem

Find linear regression of the data of week and product sales (in Thousands) given. Use linear
regression in matrix form.
Xi (week) Yi (product Sales in Thousands)
1 1
2 3
3 4
4 8

Solution:
Here, the independent variable X is given as
XT = [1 2 3 4]
And the dependent variable Y is given as follows:
YT = [1 3 4 8]
The data can be given in matrix form as follows:

    [1  1]        [1]
X = [1  2]    Y = [3]
    [1  3]        [4]
    [1  4]        [8]

The regression is given as: a= ((XTX)-1 XT) Y

The computation order of this equation is shown step by step as

1. Computation of (XTX):

   [1 1 1 1]   [1  1]   [ 4  10]
   [1 2 3 4] × [1  2] = [10  30]
               [1  3]
               [1  4]

2. Computation of the matrix inverse (XTX)^-1:

   [ 4  10]^-1   [ 1.5  -0.5]
   [10  30]    = [-0.5   0.2]

3. Computation of ((XTX)^-1 XT):

   [ 1.5  -0.5]   [1 1 1 1]   [ 1.0   0.5   0.0  -0.5]
   [-0.5   0.2] × [1 2 3 4] = [-0.3  -0.1   0.1   0.3]

4. Finally, ((XTX)^-1 XT) Y:

                               [1]
   [ 1.0   0.5   0.0  -0.5]    [3]   [-1.5]   (intercept)
   [-0.3  -0.1   0.1   0.3]  × [4] = [ 2.2]   (slope)
                               [8]
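The same result can be reproduced with a short NumPy sketch of the normal equation a = (XᵀX)⁻¹XᵀY for this problem's data (illustrative only, not part of the original notes):

```python
import numpy as np

# Design matrix with a column of 1s (intercept) and the week values, plus targets
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)
Y = np.array([1, 3, 4, 8], dtype=float)

# Normal equation: a = (X^T X)^{-1} X^T Y
a = np.linalg.inv(X.T @ X) @ X.T @ Y
print(a)   # [-1.5  2.2] -> intercept -1.5, slope 2.2
```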

VALIDATION OF REGRESSION METHODS


The regression model should be evaluated using some metrics for checking the correctness. The
following metrics are used to validate the results of regression.

Standard Error Estimate


Residuals or errors are the differences between the actual values (y) and the predicted values (ŷ). If the residuals
have a normal distribution, then their mean is zero, which is desirable. The standard deviation of the
residuals is called the residual standard error. If it is zero, then it means that the model fits the data
correctly.

Standard error estimate is another useful measure of regression. It is the standard deviation of the
observed values about the predicted values. This is given as:

Se = √( Σ (yi − ŷi)² / (n − 2) )

Here yi is the observed value and ŷi is the predicted value. Here, n is the number of samples.

Mean Absolute Error (MAE)


MAE is the mean of the absolute residuals. It is the difference between the estimated or predicted target values and
the actual target outputs. It can be mathematically defined as follows:

MAE = (1/n) Σ (i = 1 to n) |yi − ŷi|

Here ŷi is the estimated or predicted target output, yi is the actual target output and n is the number
of samples used for regression analysis.

Mean Squared Error (MSE)


It is the mean of the squares of the residuals. This value is always positive, and values closer to 0 indicate a
better fit. This is given mathematically as:

MSE = (1/n) Σ (i = 1 to n) (yi − ŷi)²

Root Mean Square Error (RMSE)


The square root of the MSE is called RMSE. This is given as

RMSE = √MSE = √( (1/n) Σ (i = 1 to n) (yi − ŷi)² )

Relative MSE
Relative MSE is the ratio of the prediction error of ŷ to the error of the trivial prediction that always uses the
average ȳ of the observed values. A value of zero indicates that the model is perfect, and its value ranges between
0 and 1. If the value is more than 1, then the created model is not a good one.

RelMSE = Σ (yi − ŷi)² / Σ (yi − ȳ)²

Coefficient of Variation
Coefficient of variation is unit-less and is given as:

CV = RMSE / ȳ

where ȳ is the mean of the observed target values.

Example:
Consider the following training set for predicting the sales of the items.

Xi (Items) Yi (Actual Sales in


Thousands)
1 80
2 90
3 100
4 110
5 120

Consider two fresh items 6 and 7, whose actual values are 80 and 75 respectively. A regression model
predicts the values of items 6 and 7 as 75 and 85 respectively. Find MAE, MSE, RMSE,
RelMSE and CV.

Solution:

Test Item Actual Value Yi Predicted Value 𝒚𝒊


6 80 75
7 75 85

Mean Absolute Error (MAE)

MAE = (1/2) × ( |80 − 75| + |75 − 85| ) = 7.5

Mean Squared Error (MSE)

MSE = (1/2) × ( (80 − 75)² + (75 − 85)² ) = 62.5

Root Mean Square Error (RMSE)

RMSE = √62.5 = 7.91

Relative MSE

RelMSE = ( (80 − 75)² + (75 − 85)² ) / ( (80 − 100)² + (75 − 100)² ) = 125 / 1025 = 0.1219

Coefficient of Variation

CV = 7.91 / 100 = 0.08
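A small Python sketch that computes all five metrics for this example is given below; note that reading the RelMSE denominator as deviations from ȳ = 100 (the mean of the observed sales) and CV as RMSE/ȳ is our interpretation of the formulas above, and the function name is our own.

```python
import math

def regression_metrics(actual, predicted, y_mean):
    """MAE, MSE, RMSE, RelMSE and CV for a set of test predictions.
    y_mean is the mean of the observed target values (100 in the example)."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    rel_mse = sum(e * e for e in errors) / sum((a - y_mean) ** 2 for a in actual)
    cv = rmse / y_mean
    return mae, mse, rmse, rel_mse, cv

print(regression_metrics([80, 75], [75, 85], y_mean=100))
# (7.5, 62.5, 7.905..., 0.1219..., 0.079...)
```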


CHAPTER 03: DECISION TREE LEARNING


 Decision tree learning model, one of the most popular supervised predictive learning models,
classifies data instances with high accuracy and consistency. The model performs an inductive
inference that reaches a general conclusion from observed examples.
 Decision tree is a concept tree which summarizes the information contained in the training dataset
in the form of a tree structure.
 Once the concept model is built test data can be easily classified.
 This model can be used to classify both categorical target variables and continuous- valued target
variables.
 Given a training dataset X, this model computes a hypothesis function f(X) as decision tree.
 Inputs to the model are data instances or objects with a set of features or attributes which can be
discrete or continuous, and the output of the model is a decision tree which predicts the target
class for the test data object.
 In statistical terms, attributes or features are called independent variables. The target feature or
target class is called the response variable, which indicates the category we need to predict for a test
object.
 The decision tree learning model generates a complete hypothesis space in the form of a tree
structure with the given training dataset and allows us to search through the possible set of hypotheses.
This kind of search bias is called preference bias.

Structure of a Decision Tree

 A decision tree has a structure that consists of a root node, internal nodes/decision nodes, branches
and terminal nodes/leaf nodes.
 The topmost node in the tree is the root node. Internal nodes are the test nodes and are also called
as decision nodes. These nodes represent a choice or test of an input attribute and the outcome or
outputs of the test condition are the branches emanating from this decision node.
 The branches are labelled as per the outcomes or output values of the test condition
 Each branch represents a sub-tree or subsection of the entire tree.
 Every decision node is part of path to a leaf node. The leaf nodes represent the labels or the
outcome of a decision path.
 The decision tree model, in general, represents a collection of logical rules of classification in the
form of a tree structure.


 Decision networks, otherwise called as influence diagrams, have a directed graph structure with
nodes and links. It is an extension of Bayesian belief networks that represents information about
each node's current state, its possible actions, the possible outcome of those actions, and their
utility.
 Figure shows symbols that are used to represent different nodes in the construction of a decision
tree.

 A circle is used to represent a root node, a diamond symbol is used to represent a decision node or
the internal nodes, and all leaf nodes are represented with a rectangle.
 A decision tree consists of two major procedures discussed below.
1. Building the Tree

Goal: Construct a decision tree with the given training dataset. The tree is constructed in a
top-down fashion. It starts from the root node. At every level of tree construction, one needs to
find the best split attribute or best decision node among all attributes. This process is
recursive and continues until we end up in the last level of the tree or find a leaf node
which cannot be split further.

Output: Decision tree representing the complete hypothesis space.

2. Knowledge Inference or Classification

Goal: Given a test instance, infer to the target class it belongs to.

Classification: Inferring the target class for the test instance or object is based on inductive
inference on the constructed decision tree. In order to classify an object, we need to start
traversing the tree from the root. We traverse as we evaluate the test condition on every
decision node with the test object attribute value and walk to the branch corresponding to the
test's outcome. This process is repeated until we end up in a leaf node which contains the
target class of the test object.

Output: Target label of the test instance.



Advantages and Disadvantages of Decision Trees


Advantages

1. Easy to model and interpret

2. Simple to understand

3. The input and output attributes can be discrete or continuous predictor variables.

4. Can model a high degree of nonlinearity in the relationship between the target variables and
the predictor variables

5. Quick to train

Disadvantages

1. It is difficult to determine how deeply a decision tree can be grown or when to stop growing
it.

2. If training data has errors or missing attribute values, then the decision tree constructed may
become unstable or biased.

3. If the training data has continuous valued attributes, handling it is computationally complex
and has to be discretized.

4. A complex decision tree may also be over-fitting with the training data.

5. Decision tree learning is not well suited for classifying multiple output classes.

6. Learning an optimal decision tree is also known to be NP-complete.

Fundamentals of Entropy
 Given the training dataset with a set of attributes or features, the decision tree is constructed by
finding the attribute or feature that best describes the target class for the given test instances.
 The best split feature is the one which contains more information about how to split the dataset
among all features so that the target class is accurately identified for the test instances.
 In other words, the best split attribute is more informative to split the dataset into sub datasets and
this process is continued until the stopping criterion is reached. This splitting should be pure at
every stage of selecting the best feature.
 The best feature is selected based on the amount of information among the features which are
basically calculated on probabilities.
 Quantifying information is closely related to information theory. In the field of information
theory, the features are quantified by a measure called Shannon Entropy which is calculated
based on the probability distribution of the events.


 Entropy is the amount of uncertainty or randomness in the outcome of a random variable or


an event. Moreover, entropy describes about the homogeneity of the data instances.
 The best feature is selected based on the entropy value.
 Higher the entropy → Higher the uncertainty
 Lower the entropy → Lower the uncertainty
 Similarly, if all instances belong to the same class (only positive or only negative), then the
entropy is 0.
 On the other hand, if the instances are equally distributed, which means 50% positive and 50%
negative, then the entropy is 1.
 If there are 10 data instances, out of which some belong to the positive class and some belong to the
negative class, then the entropy is calculated from the two class proportions using the formula given below.
 If the entropy is 0, then the split is pure which means that all samples in the set will partition into
one class or category. But if the entropy is 1, the split is impure and the distribution of the samples
is more random.
 The stopping criterion is based on the entropy value
 Let P be the probability distribution of the data instances over the classes 1 to n as shown.
 So, P = P1, ..., Pn
 The entropy of P is the information measure of this probability distribution, given as
Entropy_Info(P) = Entropy_Info(P1, ..., Pn)
= − (P1 log2 P1 + P2 log2 P2 + ... + Pn log2 Pn)
where P1 is the probability of data instances classified as class 1, P2 is the probability of
data instances classified as class 2, and so on.

 P1 = |No. of data instances belonging to class 1| / |Total no. of data instances in the training
dataset|
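A small sketch of the entropy computation (the function name is our own) may help; note that an exact computation of Entropy_Info(7, 3) gives ≈ 0.8813, which the worked examples below round to 0.8807.

```python
import math

def entropy(counts):
    """Shannon entropy of a class distribution given as instance counts,
    e.g. entropy([7, 3]) for 7 positive and 3 negative instances."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]   # skip zero counts (0*log 0 = 0)
    return -sum(p * math.log2(p) for p in probs)

print(entropy([5, 5]))    # 1.0    -> evenly split, maximum uncertainty
print(entropy([10, 0]))   # 0.0    -> pure split, no uncertainty
print(entropy([7, 3]))    # 0.8813 -> the Job Offer dataset used below
```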

Algorithm: General Algorithm for Decision Trees

1. Find the best attribute from the training dataset using an attribute selection measure and place
it at the root of the tree.
2. Split the training dataset into subsets based on the outcomes of the test attribute and each
subset in a branch contains the data instance or tuples with the same value for the selected test
attribute.
3. Repeat step 1 and step 2 on each subset until we end up in leaf nodes in all the branches of the tree.


4. This splitting process is recursive until the stopping criterion is reached.

Stopping Criteria

The following are some of the common stopping conditions:


1. The data instances are homogenous which means all belong to the same class Ci and hence its
entropy is 0.
2. A node with some defined minimum number of data instances becomes a leaf.
3. The maximum tree depth is reached, so further splitting is not done and the node becomes a
leaf node.

DECISION TREE INDUCTION ALGORITHMS


 There are many decision tree algorithms, such as ID3, C4.5, CART, CHAID, QUEST, GUIDE,
CRUISE, and CTREE, that are used for classification in real-time environment.
 The most commonly used decision tree algorithms are ID3 (Iterative Dichotomizer 3), developed
by J.R Quinlan in 1986, and C4.5 is an advancement of ID3 presented by the same author in 1993.
CART, that stands for Classification and Regression Trees, is another algorithm which was
developed by Breiman et al. in 1984.
 The accuracy of the tree constructed depends upon the selection of the best split attribute.
 Different algorithms are used for building decision trees which use different measures to decide
on the splitting criterion.
 Algorithms such as ID3, C4.5 and CART are popular algorithms used in the construction of
decision trees.
 The algorithm ID3 uses 'Information Gain' as the splitting criterion whereas the algorithm C4.5
uses 'Gain Ratio' as the splitting criterion. The CART algorithm is popularly used for classifying
both categorical and continuous-valued target variables. CART uses GINI Index to construct a
decision tree.
 Decision trees constructed using ID3 and C4.5 are also called as univariate decision trees which
consider only one feature/attribute to split at each decision node whereas decision trees
constructed using CART algorithm are multivariate decision trees which consider a conjunction of
univariate splits.

ID3 Tree Construction


 ID3 is a supervised learning algorithm which uses a training dataset with labels and constructs a
decision tree.


 ID3 is an example of univariate decision trees as it considers only one feature at each decision
node.
 This leads to axis-aligned splits.
 The tree is then used to classify the future test instances. It constructs the tree using a greedy
approach in a top-down fashion by identifying the best attribute at each level of the tree.
 ID3 works well if the attributes or features are discrete/categorical values.
 The algorithm builds the tree using a purity measure called 'Information Gain' with the given
training data instances and then uses the constructed tree to classify the test data.
 ID3 works well for a large dataset. If the dataset is small, over-fitting may occur. Moreover, it is
not accurate if the dataset has missing attribute values.
 No pruning is done during or after construction of the tree and it is prone to outliers.

Algorithm: Procedure to Construct a Decision Tree using ID3


1. Compute Entropy_Info for the whole training dataset based on the target attribute.

Entropy_Info(T) = − Σ (i = 1 to m) Pi log2 Pi

2. Compute Entropy_Info and Information_Gain for each of the attributes in the training dataset.

Entropy_Info(T, A) = Σ (i = 1 to v) ( |Ai| / |T| ) × Entropy_Info(Ai)

Information_Gain(A) = Entropy_Info(T) − Entropy_Info(T, A)

3. Choose the attribute for which entropy is minimum and therefore the gain is maximum as the
best split attribute.
4. The best split attribute is placed as the root node.
5. The root node is branched into subtrees with each subtree as an outcome of the test condition
of the root node attribute. Accordingly, the training dataset is also split into subsets.
6. Recursively apply the same operation for the subset of the training set with the remaining
attributes until a leaf node is derived or no more training instances are available in the subset.

Note: We stop branching a node if entropy is 0. The best split attribute at every iteration is the
attribute with the highest information gain.

Information_Gain is a metric that measures how much information is gained by branching on an


attribute A. In other words, it measures the reduction in impurity in an arbitrary subset of data.
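As a hedged illustration of these steps (not part of the original notes), the sketch below computes Entropy_Info and Information_Gain from class counts; the function names are our own, and it roughly reproduces the Gain of the CGPA attribute in the problem that follows.

```python
import math

def entropy_from_counts(counts):
    """Entropy of a class distribution given as counts, skipping zero counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(parent_counts, partitions):
    """Information_Gain(A) = Entropy_Info(T) - sum(|A_i|/|T| * Entropy_Info(A_i)).
    'partitions' lists the (Yes, No) class counts in each branch of attribute A."""
    total = sum(parent_counts)
    weighted = sum(sum(p) / total * entropy_from_counts(p) for p in partitions)
    return entropy_from_counts(parent_counts) - weighted

# CGPA in the worked problem below: >=9 -> (3 Yes, 1 No), >=8 -> (4, 0), <8 -> (0, 2)
print(information_gain([7, 3], [[3, 1], [4, 0], [0, 2]]))
# ≈ 0.5568 (the notes round intermediate values and report 0.5564)
```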

Problems


Sl. No  CGPA  Interactiveness  Practical Knowledge  Communication Skill  Job Offer
1 ≥9 Yes Very Good Good Yes
2 ≥8 No Good Moderate Yes
3 ≥9 No Average Poor No
4 <8 No Average Good No
5 ≥8 Yes Good Moderate Yes
6 ≥9 Yes Good Moderate Yes
7 <8 Yes Good Poor No
8 ≥9 No Very Good Good Yes
9 ≥8 Yes Good Good Yes
10 ≥8 Yes Average Good Yes

Step 1: Calculate the Entropy for the target class Job Offer
Entropy_Info(Job Offer) = Entropy_Info(7, 3) = −[ (7/10) log2(7/10) + (3/10) log2(3/10) ] = 0.8807

Iteration 1:

Step 2: Calculate the Entropy_Info and Infrmation_Gain for each of the attribute in the training dataset

Entropy of CGPA

CGPA Job Offer = Yes Job Offer = No Total


≥9 3 1 4
≥8 4 0 4
<8 0 2 2

Entropy_Info(Job Offer, CGPA) = (4/10) × [ −(3/4) log2(3/4) − (1/4) log2(1/4) ] + (4/10) × [ −(4/4) log2(4/4) ]
+ (2/10) × [ −(2/2) log2(2/2) ]

= 4/10 × 0.8108 + 0 + 0 ≈ 0.3243
Gain(CGPA) = 0.8807 − 0.3243 ≈ 0.5564

Entropy of Interactiveness

Interactiveness  Job Offer = Yes  Job Offer = No  Total


Yes 5 1 6
No 2 2 4

Entropy_Info(Job Offer, Interactiveness) = (6/10) × [ −(5/6) log2(5/6) − (1/6) log2(1/6) ]
+ (4/10) × [ −(2/4) log2(2/4) − (2/4) log2(2/4) ]
= 6/10 × 0.6497 + 4/10 × 1 ≈ 0.7896
Gain(Interactiveness) = 0.8807 − 0.7896 ≈ 0.0911

Entropy of Practical Knowledge

Practical Knowledge Job Offer = Yes Job Offer = No Total


Very Good 2 0 2
Average 1 2 3
Good 4 1 5

Entropy_Info(Job Offer, Practical Knowledge) = (2/10) × [ −(2/2) log2(2/2) ] + (3/10) × [ −(1/3) log2(1/3) − (2/3) log2(2/3) ]
+ (5/10) × [ −(4/5) log2(4/5) − (1/5) log2(1/5) ]

= 2/10 × 0 + 3/10 × 0.9177 + 5/10 × 0.7215 ≈ 0.6361
Gain(Practical Knowledge) = 0.8807 − 0.6361 ≈ 0.2446

Entropy of Communication Skill

Communication Skill Job Offer = Yes Job Offer = No Total


Good 4 1 5
Moderate 3 0 3
Poor 0 2 2

Entropy_Info(Job Offer, Communication Skill) = (5/10) × [ −(4/5) log2(4/5) − (1/5) log2(1/5) ] + (3/10) × [ −(3/3) log2(3/3) ]
+ (2/10) × [ −(2/2) log2(2/2) ]

= 5/10 × 0.7215 + 3/10 × 0 + 2/10 × 0 ≈ 0.361
Gain(Communication Skill) = 0.8807 − 0.361 ≈ 0.5203

Gain Table


Attribute Gain
CGPA 0.5564
Interactiveness 0.0911
Practical Knowledge 0.2446
Communication Skill 0.5203

Step 3: Since the gain of CGPA is maximum, we choose this attribute as the best split attribute and place it
at the root node. The resulting tree is as shown below

Iteration 2:

In this iteration, the same process of computing the Entropy_Info and Gain is repeated with the subset
of the training set.
The subset consists of 4 data instances as shown below
Interactiveness Practical Knowledge Communication Skill Job Offer
Yes Very Good Good Yes
No Average Poor No
Yes Good Moderate Yes
No Very Good Good Yes

Entropy_Info(Job Offer) = Entropy_Info(3, 1) = −[ (3/4) log2(3/4) + (1/4) log2(1/4) ] = 0.8108



Entropy of Interactiveness

Interactiveness  Job Offer = Yes  Job Offer = No  Total


Yes 2 0 2
No 1 1 2

Entropy_Info(Job Offer, Interactiveness) = (2/4) × [ −(2/2) log2(2/2) ]
+ (2/4) × [ −(1/2) log2(1/2) − (1/2) log2(1/2) ]
= 0 + 0.4997 ≈ 0.4997
Gain(Interactiveness) = 0.8108 − 0.4997 ≈ 0.3111

Entropy of Practical Knowledge

Practical Knowledge Job Offer = Yes Job Offer = No Total


Very Good 2 0 2
Average 0 1 1
Good 1 0 1

Entropy_Info(Job Offer, Practical Knowledge) = (2/4) × [ −(2/2) log2(2/2) ] + (1/4) × [ −(1/1) log2(1/1) ]
+ (1/4) × [ −(1/1) log2(1/1) ]

= 0 + 0 + 0 = 0
Gain(Practical Knowledge) = 0.8108 − 0 = 0.8108

Entropy of Communication Skill

Communication Skill Job Offer = Yes Job Offer = No Total


Good 2 0 2
Moderate 1 0 1
Poor 0 1 1

Entropy_Info(Job Offer, Communication Skill) = (2/4) × [ −(2/2) log2(2/2) ] + (1/4) × [ −(1/1) log2(1/1) ]
+ (1/4) × [ −(1/1) log2(1/1) ]

= 0 + 0 + 0 = 0
Gain(Communication Skill) = 0.8108 − 0 = 0.8108


Gain Table

Attribute Gain
Interactiveness 0.3111
Practical Knowledge 0.8108
Communication Skill 0.8108

Here, both the attributes Practical Knowledge and Communication Skill have the same Gain, so we can
construct the decision tree using either Practical Knowledge or Communication Skill.
The final decision tree is as shown below

C4.5 CONSTRUCTION
 C4.5 is an improvement over ID3. C4.5 works with continuous and discrete attributes and
missing values, and it also supports post-pruning.
 C5.0 is the successor of C4.5 and is more efficient and used for building smaller decision trees.
 C4.5 works with missing values by marking as '?', but these missing attribute values are not
considered in the calculations.
 The algorithm C4.5 is based on Occam's Razor, which says that given two correct solutions, the
simpler solution has to be chosen.
 Moreover, the algorithm requires a larger training set for better accuracy.
 It uses Gain Ratio as a measure during the construction of decision trees.

 ID3 is more biased towards attributes with a larger number of distinct values. To overcome this bias issue,
C4.5 uses a purity measure called Gain Ratio to identify the best split attribute.
 In the C4.5 algorithm, the Information Gain measure used in the ID3 algorithm is normalized by
computing a factor called Split_Info.
 This normalized information gain of an attribute, called the Gain Ratio, is computed as the ratio
of the Information Gain of the attribute to its Split_Info.
 Then the attribute with the highest normalized information gain, that is, the highest gain ratio, is
used as the splitting criterion.
 Given a training dataset T,
The Split_Info of an attribute A is computed as

Split_Info(T, A) = − Σ (i = 1 to v) ( |Ai| / |T| ) × log2( |Ai| / |T| )

where the attribute A has 'v' distinct values {a1, a2, ..., av}, and |Ai| is the number of instances
with the distinct value ai of attribute A.
The Gain Ratio of an attribute A is computed as

Gain_Ratio(A) = Info_Gain(A) / Split_Info(T, A)
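A small sketch of the Split_Info and Gain Ratio computation (names are our own), assuming the same Job Offer dataset used in the ID3 problem, where CGPA splits the 10 rows into branches of sizes 4, 4 and 2 and has Gain ≈ 0.5564:

```python
import math

def split_info(branch_sizes):
    """Split_Info(T, A) = -sum(|A_i|/|T| * log2(|A_i|/|T|)) over the branches of A."""
    total = sum(branch_sizes)
    return -sum(s / total * math.log2(s / total) for s in branch_sizes if s)

si = split_info([4, 4, 2])     # ≈ 1.522 (reported as 1.5211 in the worked problem)
gain_ratio = 0.5564 / si       # ≈ 0.366 (reported as 0.3658)
print(si, gain_ratio)
```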

Algorithm: Procedure to Construct a Decision Tree using C4.5


1. Compute Entropy_Info for the whole training dataset based on the target attribute
2. Compute Entropy_Info, Info_Gain, Split_Info and Gain_Ratio for each of the attribute in the
training dataset.
3. Choose the attribute for which Gain_Ratio is maximum as the best split attribute.
4. The best split attribute is placed as the root node.
5. The root node is branched into subtrees with each subtree as an outcome of the test condition of
the root node attribute. Accordingly, the training dataset is also split into subsets.
6. Recursively apply the same operation for the subset of the training set with the remaining
attributes until a leaf node is derived or no more training instances are available in the subset.

Problems

Sl. No  CGPA  Interactiveness  Practical Knowledge  Communication Skill  Job Offer
1 ≥9 Yes Very Good Good Yes
2 ≥8 No Good Moderate Yes


3 ≥9 No Average Poor No
4 <8 No Average Good No
5 ≥8 Yes Good Moderate Yes
6 ≥9 Yes Good Moderate Yes
7 <8 Yes Good Poor No
8 ≥9 No Very Good Good Yes
9 ≥8 Yes Good Good Yes
10 ≥8 Yes Average Good Yes

Step 1: Calculate the Entropy for the target class Job Offer

Entropy_Info(Job Offer) = Entropy_Info(7, 3) = −[ (7/10) log2(7/10) + (3/10) log2(3/10) ] = 0.8807

Iteration 1:
Step 2: Calculate the Entropy_Info and Infrmation_Gain for each of the attribute in the training dataset

Entropy of CGPA
CGPA Job Offer = Yes Job Offer = No Total
≥9 3 1 4
≥8 4 0 4
<8 0 2 2

Entropy_Info(Job Offer, CGPA) = (4/10) × [ −(3/4) log2(3/4) − (1/4) log2(1/4) ] + (4/10) × [ −(4/4) log2(4/4) ]
+ (2/10) × [ −(2/2) log2(2/2) ]
= 4/10 × 0.8108 + 0 + 0 ≈ 0.3243
Gain(CGPA) = 0.8807 − 0.3243 ≈ 0.5564

Split_Info(T, CGPA) = −(4/10) log2(4/10) − (4/10) log2(4/10) − (2/10) log2(2/10)
= 0.5285 + 0.5285 + 0.4641 ≈ 1.5211

Gain_Ratio(CGPA) = Gain(CGPA) / Split_Info(T, CGPA)
= 0.5564 / 1.5211 ≈ 0.3658

Entropy of Interactiveness

Interactiveness  Job Offer = Yes  Job Offer = No  Total


Yes 5 1 6
No 2 2 4

Entropy_Info(Job Offer, Interactiveness) = (6/10) × [ −(5/6) log2(5/6) − (1/6) log2(1/6) ]
+ (4/10) × [ −(2/4) log2(2/4) − (2/4) log2(2/4) ]
= 6/10 × 0.6497 + 4/10 × 1 ≈ 0.7896
Gain(Interactiveness) = 0.8807 − 0.7896 ≈ 0.0911

Split_Info(T, Interactiveness) = −(6/10) log2(6/10) − (4/10) log2(4/10)
≈ 0.9704

Gain_Ratio(Interactiveness) = Gain(Interactiveness) / Split_Info(T, Interactiveness)
= 0.0911 / 0.9704 ≈ 0.0939

Entropy of Practical Knowledge

Practical Knowledge Job Offer = Yes Job Offer = No Total


Very Good 2 0 2
Average 1 2 3
Good 4 1 5

Entropy_Info(Job Offer, Practical Knowledge) = (2/10) × [ −(2/2) log2(2/2) ] + (3/10) × [ −(1/3) log2(1/3) − (2/3) log2(2/3) ]
+ (5/10) × [ −(4/5) log2(4/5) − (1/5) log2(1/5) ]
= 2/10 × 0 + 3/10 × 0.9177 + 5/10 × 0.7215 ≈ 0.6361
Gain(Practical Knowledge) = 0.8807 − 0.6361 ≈ 0.2446

Split_Info(T, Practical Knowledge) = −(2/10) log2(2/10) − (3/10) log2(3/10) − (5/10) log2(5/10)
≈ 1.4853

Gain_Ratio(Practical Knowledge) = Gain(Practical Knowledge) / Split_Info(T, Practical Knowledge)
= 0.2446 / 1.4853 ≈ 0.1648

Entropy of Communication Skill


Communication Skill Job Offer = Yes Job Offer = No Total


Good 4 1 5
Moderate 3 0 3
Poor 0 2 2

Entropy_Info(Job Offer, Communication Skill) = (5/10) × [ −(4/5) log2(4/5) − (1/5) log2(1/5) ] + (3/10) × [ −(3/3) log2(3/3) ]
+ (2/10) × [ −(2/2) log2(2/2) ]
= 5/10 × 0.7215 + 3/10 × 0 + 2/10 × 0 ≈ 0.361
Gain(Communication Skill) = 0.8807 − 0.361 ≈ 0.5203

Split_Info(T, Communication Skill) = −(5/10) log2(5/10) − (3/10) log2(3/10) − (2/10) log2(2/10)
≈ 1.4853

Gain_Ratio(Communication Skill) = Gain(Communication Skill) / Split_Info(T, Communication Skill)
= 0.5203 / 1.4853 ≈ 0.3502

Gain Table
Attribute Gain_Ratio
CGPA 0.3658
Interactiveness 0.0939
Practical Knowledge 0.1648
Communication Skill 0.3502

Step 3: Since the Gain_Ratio of CGPA is maximum, we choose this attribute as the best split attribute and
place it at the root node. The resulting tree is as shown below


Iteration 2:

In this iteration, the same process of computing the Entropy_Info, Info_Gain, Split_Info and
Gain_Ratio is repeated with the subset of the training set.
The subset consists of 4 data instances as shown below
Interactiveness Practical Knowledge Communication Skill Job Offer
Yes Very Good Good Yes
No Average Poor No
Yes Good Moderate Yes
No Very Good Good Yes

Entropy_Info(Job Offer) = Entropy_Info(3, 1) = −[ (3/4) log2(3/4) + (1/4) log2(1/4) ] = 0.8108


Entropy of Interactiveness

Interactiveness Job Offer = Yes Job Offer = No Total


Yes 2 0 2
No 1 1 2

Entropy_Info(Job Offer, Interactiveness) = (2/4) × [−(2/2) log₂(2/2) − (0/2) log₂(0/2)] + (2/4) × [−(1/2) log₂(1/2) − (1/2) log₂(1/2)]

= 0 + 0.4997 ≈ 0.4997
Gain(Interactiveness) = 0.8108 − 0.4997 ≈ 0.3111

Split_Info(T, Interactiveness) = −(2/4) log₂(2/4) − (2/4) log₂(2/4)

= 1
Gain_Ratio(Interactiveness) = Gain(Interactiveness) / Split_Info(T, Interactiveness)
= 0.3111 / 1 ≈ 0.3111

Entropy of Practical Knowledge

Practical Knowledge Job Offer = Yes Job Offer = No Total


Very Good 2 0 2
Average 0 1 1
Good 1 0 1

Entropy_Info(Job Offer, Practical Knowledge) = (2/4) × [−(2/2) log₂(2/2) − (0/2) log₂(0/2)] + (1/4) × [−(0/1) log₂(0/1) − (1/1) log₂(1/1)] + (1/4) × [−(1/1) log₂(1/1) − (0/1) log₂(0/1)]

= 0 + 0 + 0 = 0
Gain(Practical Knowledge) = 0.8108 − 0 ≈ 0.8108

Split_Info(T, Practical Knowledge) = −(2/4) log₂(2/4) − (1/4) log₂(1/4) − (1/4) log₂(1/4)

= 1.5
Gain_Ratio(Practical Knowledge) = Gain(Practical Knowledge) / Split_Info(T, Practical Knowledge)
= 0.8108 / 1.5 ≈ 0.5408

Entropy of Communication Skill

Communication Skill Job Offer = Yes Job Offer = No Total


Good 2 0 2
Moderate 1 0 1
Poor 0 1 1

Entropy_Info(Job Offer, Communication Skill) = (2/4) × [−(2/2) log₂(2/2) − (0/2) log₂(0/2)] + (1/4) × [−(1/1) log₂(1/1) − (0/1) log₂(0/1)] + (1/4) × [−(0/1) log₂(0/1) − (1/1) log₂(1/1)]

= 0 + 0 + 0 = 0
Gain(Communication Skill) = 0.8108 − 0 ≈ 0.8108


Split_Info(T, Communication Skill) = −(2/4) log₂(2/4) − (1/4) log₂(1/4) − (1/4) log₂(1/4)

= 1.5

Gain_Ratio(Communication Skill) = Gain(Communication Skill) / Split_Info(T, Communication Skill)
= 0.8108 / 1.5 ≈ 0.5408

Gain Table
Attribute Gain_Ratio
Interactiveness 0.3111
Practical Knowledge 0.5408
Communication Skill 0.5408

Since both the attributes Practical Knowledge and Communication Skill have the same Gain_Ratio, the decision tree for this branch can be grown using either Practical Knowledge or Communication Skill.

The final decision tree is as shown below.
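Read as a set of rules, one possible final tree (an illustrative sketch added here, assuming Communication Skill is used for the CGPA ≥ 9 branch) looks as follows:

```python
def predict_job_offer(cgpa, communication_skill):
    """One possible tree from the example: CGPA at the root, Communication
    Skill resolving the CGPA >= 9 branch (Practical Knowledge would work too)."""
    if cgpa == "<8":
        return "No"          # both <8 training instances are No
    if cgpa == "≥8":
        return "Yes"         # all four ≥8 training instances are Yes
    # cgpa == "≥9": split again on Communication Skill
    return "No" if communication_skill == "Poor" else "Yes"

print(predict_job_offer("≥9", "Poor"))   # -> No
```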

Dealing with Continuous Attributes in C4.5



 The C4.5 algorithm is further improved by handling attributes that are continuous: a continuous attribute is discretized by finding a split point or threshold.
 When an attribute 'A' has continuous numerical values, a threshold or best split point 's' is found such that the values are partitioned into two sets, A ≤ s and A > s.
 The best split point is the attribute value that gives the maximum information gain for that attribute.

Example:

Sl. No CGPA Job Offer


1 9.5 Yes
2 8.2 Yes
3 9.1 No
4 6.8 No
5 8.5 Yes
6 9.5 Yes
7 7.9 No
8 9.1 Yes
9 8.8 Yes
10 8.8 Yes

Solution:
First sort the values in ascending order, remove the duplicates and consider only the unique values of the attribute:
6.8  7.9  8.2  8.5  8.8  9.1  9.5

Compute the gain for each distinct value of this continuous attribute, treating it as a candidate split point:

Split point   ≤ (Yes, No)   > (Yes, No)
6.8           0, 1          7, 2
7.9           0, 2          7, 1
8.2           1, 2          6, 1
8.5           2, 2          5, 1
8.8           4, 2          3, 1
9.1           5, 3          2, 0
9.5           7, 3          0, 0


Entropy(0, 1) = −(0/1) log₂(0/1) − (1/1) log₂(1/1) = 0

Entropy(7, 2) = −(7/9) log₂(7/9) − (2/9) log₂(2/9) ≈ 0.763

Entropy_Info(T, CGPA ∈ 6.8) = 1/10 × Entropy(0, 1) + 9/10 × Entropy(7, 2)

= 1/10 × 0 + 9/10 × 0.763 ≈ 0.6873

Gain(CGPA ∈ 6.8) = 0.8808 − 0.6873 ≈ 0.1935

Entropy(0, 2) = −(0/2) log₂(0/2) − (2/2) log₂(2/2) = 0

Entropy(7, 1) = −(7/8) log₂(7/8) − (1/8) log₂(1/8) ≈ 0.543

Entropy_Info(T, CGPA ∈ 7.9) = 2/10 × Entropy(0, 2) + 8/10 × Entropy(7, 1)

= 2/10 × 0 + 8/10 × 0.543 ≈ 0.4346

Gain(CGPA ∈ 7.9) = 0.8808 − 0.4346 ≈ 0.4462

Entropy(1, 2) = −(1/3) log₂(1/3) − (2/3) log₂(2/3) ≈ 0.917

Entropy(6, 1) = −(6/7) log₂(6/7) − (1/7) log₂(1/7) ≈ 0.591

Entropy_Info(T, CGPA ∈ 8.2) = 3/10 × Entropy(1, 2) + 7/10 × Entropy(6, 1)

= 3/10 × 0.9177 + 7/10 × 0.5913 ≈ 0.6892

Gain(CGPA ∈ 8.2) = 0.8808 − 0.6892 ≈ 0.1916

Entropy(2, 2) = −(2/4) log₂(2/4) − (2/4) log₂(2/4) = 1

Entropy(5, 1) = −(5/6) log₂(5/6) − (1/6) log₂(1/6) ≈ 0.649

Entropy_Info(T, CGPA ∈ 8.5) = 4/10 × Entropy(2, 2) + 6/10 × Entropy(5, 1)

= 4/10 × 1 + 6/10 × 0.649 ≈ 0.7898

Gain(CGPA ∈ 8.5) = 0.8808 − 0.7898 ≈ 0.091

Entropy(4, 2) = −(4/6) log₂(4/6) − (2/6) log₂(2/6) ≈ 0.917

Entropy(3, 1) = −(3/4) log₂(3/4) − (1/4) log₂(1/4) ≈ 0.810

Entropy_Info(T, CGPA ∈ 8.8) = 6/10 × Entropy(4, 2) + 4/10 × Entropy(3, 1)

= 6/10 × 0.9177 + 4/10 × 0.8108 ≈ 0.8749

Gain(CGPA ∈ 8.8) = 0.8808 − 0.8749 ≈ 0.0059

Entropy(5, 3) = −(5/8) log₂(5/8) − (3/8) log₂(3/8) ≈ 0.953

Entropy(2, 0) = −(2/2) log₂(2/2) − (0/2) log₂(0/2) = 0

Entropy_Info(T, CGPA ∈ 9.1) = 8/10 × Entropy(5, 3) + 2/10 × Entropy(2, 0)

= 8/10 × 0.953 + 2/10 × 0 ≈ 0.7630

Gain(CGPA ∈ 9.1) = 0.8808 − 0.7630 ≈ 0.1178

Entropy(7, 3) = −(7/10) log₂(7/10) − (3/10) log₂(3/10) ≈ 0.880

Entropy_Info(T, CGPA ∈ 9.5) = 10/10 × Entropy(7, 3) + 0/10 × Entropy(0, 0)

= 10/10 × 0.880 + 0/10 × 0 ≈ 0.880

Gain(CGPA ∈ 9.5) = 0.8808 − 0.880 ≈ 0

Split point   ≤ (Yes, No)   > (Yes, No)   Entropy(≤)   Entropy(>)   Entropy_Info   Gain
6.8           0, 1          7, 2          0            0.763        0.687          0.1935
7.9           0, 2          7, 1          0            0.543        0.434          0.4462
8.2           1, 2          6, 1          0.917        0.591        0.689          0.1916
8.5           2, 2          5, 1          1            0.649        0.789          0.091
8.8           4, 2          3, 1          0.917        0.810        0.874          0.0059
9.1           5, 3          2, 0          0.953        0            0.763          0.1178
9.5           7, 3          0, 0          0.880        0            0.880          0

Since the split point 7.9 gives the maximum gain of 0.4462, CGPA = 7.9 is chosen as the threshold.

We can discretize the continuous values of CGPA into two categories, CGPA ≤ 7.9 and CGPA > 7.9. The resulting discretized instances are shown in the table below.

Sl. No   CGPA (Continuous)   CGPA (Discretized)   Job Offer
1        9.5                 > 7.9                Yes
2        8.2                 > 7.9                Yes
3        9.1                 > 7.9                No
4        6.8                 ≤ 7.9                No
5        8.5                 > 7.9                Yes
6        9.5                 > 7.9                Yes
7        7.9                 ≤ 7.9                No
8        9.1                 > 7.9                Yes
9        8.8                 > 7.9                Yes
10       8.8                 > 7.9                Yes
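The threshold search above can be written compactly. Below is a minimal Python sketch (illustrative only; the names are my own) that evaluates every unique CGPA value as a candidate split point and returns the one with the highest information gain:

```python
import math

# (CGPA, Job Offer) pairs from the example
samples = [(9.5, "Yes"), (8.2, "Yes"), (9.1, "No"), (6.8, "No"), (8.5, "Yes"),
           (9.5, "Yes"), (7.9, "No"), (9.1, "Yes"), (8.8, "Yes"), (8.8, "Yes")]

def entropy(labels):
    total = len(labels)
    counts = {label: labels.count(label) for label in set(labels)}
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_split_point(samples):
    """Evaluate every unique value as a threshold s (A <= s vs A > s)
    and return the threshold with the maximum information gain."""
    all_labels = [y for _, y in samples]
    base_entropy = entropy(all_labels)
    best_s, best_gain = None, -1.0
    for s in sorted({x for x, _ in samples}):
        left = [y for x, y in samples if x <= s]
        right = [y for x, y in samples if x > s]
        weighted = (len(left) / len(samples)) * entropy(left)
        if right:                                  # the > side is empty at the largest value
            weighted += (len(right) / len(samples)) * entropy(right)
        gain = base_entropy - weighted
        if gain > best_gain:
            best_s, best_gain = s, gain
    return best_s, best_gain

print(best_split_point(samples))   # expected: (7.9, ~0.446)
```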

Classification and Regression Trees Construction


 The Classification and Regression Trees (CART) algorithm is a multivariate decision tree learning algorithm used for classifying both categorical and continuous-valued target variables.


 CART algorithm is an example of multivariate decision tree that gives oblique splits.
 It solves both classification and regression problems. If the target feature is categorical, it
constructs a classification tree and if the target feature is continuous, it constructs a regression
tree.
 CART uses the GINI Index to construct a decision tree.
 The GINI Index measures the impurity of a set of data instances in terms of the proportion of instances that belong to each class.
 It constructs the tree as a binary tree by recursively splitting a node into two nodes. Therefore, even if an attribute has more than two possible values, the GINI Index is calculated for all subsets of the attribute values, and the subset pairing that yields the minimum GINI Index (i.e., the maximum reduction in impurity) is selected as the best split subset.
 For example, if an attribute A has three distinct values, say {a1, a2, a3}, the possible subsets are { }, {a1}, {a2}, {a3}, {a1, a2}, {a1, a3}, {a2, a3}, {a1, a2, a3}. So, if an attribute has 3 distinct values, the number of possible subsets is 2³, which means 8.
 Excluding the empty set { } and the full set {a1, a2, a3}, we have 6 subsets. With 6 subsets, we can
form three possible combinations such as:
o {a1} with {a2, a3}
o {a2} with {a1, a3}
o {a3} with {a1, a2}
 Hence, in the CART algorithm, we need to compute both the best splitting attribute and the best split subset within the chosen attribute.
 Lower the GINI value, higher is the homogeneity of the data instances; a GINI value of 0 indicates a perfectly pure (homogeneous) node.
 Gini_Index(T) is computed as

Gini_Index(T) = 1 − Σᵢ Pᵢ²

where Pᵢ is the probability that a data instance or tuple 'd' belongs to class Cᵢ, and is computed as

Pᵢ = |Cᵢ,T| / |T|

i.e., the number of instances of class Cᵢ in T divided by the total number of instances in T.

 GINI Index assumes a binary split on each attribute; therefore, every attribute is considered as a binary attribute that splits the data instances into two subsets S₁ and S₂.

Gini_Index(T, A) = (|S₁| / |T|) × Gini(S₁) + (|S₂| / |T|) × Gini(S₂)


 The splitting subset with the minimum Gini_Index is chosen as the best splitting subset for an attribute. The best splitting attribute is the one with the minimum Gini_Index, or equivalently the maximum ∆Gini, because it reduces the impurity the most.

∆Gini is computed as ∆Gini (A) = Gini(T) - Gini(T, A)

Algorithm: Procedure to Construct a Decision Tree using CART


1. Compute Gini_Index for the whole training dataset based on the target attribute.

2. Compute Gini_Index for each of the attribute and for the subsets of each attribute in the training
dataset.

3. Choose the best splitting subset which has minimum Gini_Index for an attribute.

4. Compute ∆Gini for the best splitting subset of that attribute.

5. Choose the best splitting attribute that has maximum ∆Gini.

6. The best split attribute with the best split subset is placed as the root node.

7. The root node is branched into two subtrees with each subtree an outcome of the test condition of
the root node attribute. Accordingly, the training dataset is also split into two subsets.

8. Recursively apply the same operation for the subset of the training set with the remaining
attributes until a leaf node is derived or no more training instances are available in the subset.
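The Gini computations in steps 1–5 can be sketched in Python as follows (an illustration, not part of the original notes; the function names and the exhaustive subset search are my own):

```python
from itertools import combinations

def gini(labels):
    """Gini_Index of a list of class labels: 1 - sum(P_i^2)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_binary_split(values, labels):
    """Try every grouping of the attribute values into two non-empty subsets
    and return the grouping with the minimum weighted Gini_Index."""
    distinct = sorted(set(values))
    best_subset, best_gini = None, float("inf")
    for r in range(1, len(distinct)):                 # each proper non-empty subset
        for subset in combinations(distinct, r):      # its complement is the other side
            left = [l for v, l in zip(values, labels) if v in subset]
            right = [l for v, l in zip(values, labels) if v not in subset]
            weighted = (len(left) / len(labels)) * gini(left) \
                     + (len(right) / len(labels)) * gini(right)
            if weighted < best_gini:
                best_subset, best_gini = subset, weighted
    return best_subset, best_gini

# Example with the CGPA column of the Job Offer dataset
cgpa  = ["≥9", "≥8", "≥9", "<8", "≥8", "≥9", "<8", "≥9", "≥8", "≥8"]
offer = ["Yes", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"]
subset, weighted_gini = best_binary_split(cgpa, offer)
delta_gini = gini(offer) - weighted_gini
# expected: split {<8} vs {≥8, ≥9}, weighted Gini ≈ 0.175, ∆Gini ≈ 0.245
print(subset, round(weighted_gini, 4), round(delta_gini, 4))
```

Repeating this search for every attribute and picking the attribute with the maximum ∆Gini gives the root node, exactly as in the worked problem below.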

Problem 1:
Construct a decision tree using CART algorithm for the following dataset

Sl. No  CGPA  Interactiveness  Practical Knowledge  Communication Skill  Job Offer
1       ≥9    Yes              Very Good            Good                 Yes
2       ≥8    No               Good                 Moderate             Yes
3       ≥9    No               Average              Poor                 No
4       <8    No               Average              Good                 No
5       ≥8    Yes              Good                 Moderate             Yes
6       ≥9    Yes              Good                 Moderate             Yes
7       <8    Yes              Good                 Poor                 No
8       ≥9    No               Very Good            Good                 Yes
9       ≥8    Yes              Good                 Good                 Yes
10      ≥8    Yes              Average              Good                 Yes

Solution
Step 1: Calculate the Gini_Index for the whole dataset, which consists of 10 data instances. The target attribute 'Job Offer' has 7 instances as Yes and 3 instances as No.

Gini_Index(T) = 1 − (7/10)² − (3/10)²

= 1 − 0.49 − 0.09 = 1 − 0.58

Gini_Index(T) = 0.42

Attribute- CGPA

CGPA Job Offer = Yes Job Offer = No


≥9 3 1
≥8 4 0
<8 0 2

Gini_Index(T, CGPA ∈ {≥ 9, ≥ 8}) = 1 − (7/8)² − (1/8)² = 1 − 0.7806 ≈ 0.2194

Gini_Index(T, CGPA ∈ {< 8}) = 1 − (0/2)² − (2/2)² = 1 − 1 = 0

Gini_Index(T, CGPA ∈ {≥ 9, ≥ 8}, {< 8}) = (8/10) × 0.2194 + (2/10) × 0 ≈ 0.1755

Gini_Index(T, CGPA ∈ {≥ 9, < 8}) = 1 − (3/6)² − (3/6)² = 1 − 0.5 = 0.5

Gini_Index(T, CGPA ∈ {≥ 8}) = 1 − (4/4)² − (0/4)² = 1 − 1 = 0

Gini_Index(T, CGPA ∈ {≥ 9, < 8}, {≥ 8}) = (6/10) × 0.5 + (4/10) × 0 = 0.3

Gini_Index(T, CGPA ∈ {≥ 8, < 8}) = 1 − (4/6)² − (2/6)² = 1 − 0.555 ≈ 0.445

Gini_Index(T, CGPA ∈ {≥ 9}) = 1 − (3/4)² − (1/4)² = 1 − 0.625 = 0.375

Gini_Index(T, CGPA ∈ {≥ 8, < 8}, {≥ 9}) = (6/10) × 0.445 + (4/10) × 0.375 ≈ 0.417

Gini_Index of CGPA

Subset Gini_Index
(≥ 9, ≥ 8) <8 0.1755
(≥ 9, < 8) ≥8 0.3
(≥ 8, < 8) ≥9 0.417

∆Gini(CGPA) = Gini(T) − Gini(T, CGPA)

= 0.42 − 0.1755 ≈ 0.2445


Attribute- Interactiveness

Interactiveness Job Offer = Yes Job Offer = No


Yes 5 1
No 2 2

Gini_Index(T, Interactiveness ∈ {Yes}) = 1 − (5/6)² − (1/6)² = 1 − 0.72 ≈ 0.28

Gini_Index(T, Interactiveness ∈ {No}) = 1 − (2/4)² − (2/4)² = 1 − 0.5 = 0.5

Gini_Index(T, Interactiveness ∈ {Yes}, {No}) = (6/10) × 0.28 + (4/10) × 0.5 = 0.168 + 0.2 ≈ 0.368

Gini_Index of Interactiveness

Subset Gini_Index
{Yes} {No} 0.368

Compute ∆Gini for the best splitting subset of that attribute:

∆Gini(Interactiveness) = Gini(T) − Gini(T, Interactiveness)

= 0.42 − 0.368 ≈ 0.052

Attribute – Practical Knowledge

Practical Knowledge Job Offer = Yes Job Offer = No

Very Good 2 0
Good 4 1
Average 1 2

Gini_Index(T, Practical Knowledge ∈ {Very Good, Good}) = 1 − (6/7)² − (1/7)² = 1 − 0.7544 ≈ 0.2456

Gini_Index(T, Practical Knowledge ∈ {Average}) = 1 − (1/3)² − (2/3)² = 1 − 0.555 ≈ 0.445

Gini_Index(T, Practical Knowledge ∈ {Very Good, Good}, {Average}) = (7/10) × 0.2456 + (3/10) × 0.445 ≈ 0.3054

Gini_Index(T, Practical Knowledge ∈ {Very Good, Average}) = 1 − (3/5)² − (2/5)² = 1 − 0.52 = 0.48

Gini_Index(T, Practical Knowledge ∈ {Good}) = 1 − (4/5)² − (1/5)² = 1 − 0.68 = 0.32


Gini_Index(T, Practical Knowledge ∈ {Very Good, Average}, {Good}) = (5/10) × 0.48 + (5/10) × 0.32 ≈ 0.40

Gini_Index(T, Practical Knowledge ∈ {Good, Average}) = 1 − (5/8)² − (3/8)² = 1 − 0.5312 ≈ 0.4688

Gini_Index(T, Practical Knowledge ∈ {Very Good}) = 1 − (2/2)² − (0/2)² = 1 − 1 = 0

Gini_Index(T, Practical Knowledge ∈ {Good, Average}, {Very Good}) = (8/10) × 0.4688 + (2/10) × 0 ≈ 0.3750

Gini_Index for Practical Knowledge

Subset Gini_Index
{Very Good, Good} Average 0.3054
{Very Good, Average} Good 0.40
{Good, Average} Very Good 0.3750

∆Gini(Practical Knowledge) = Gini(T) − Gini(T, Practical Knowledge)

= 0.42 − 0.3054 ≈ 0.1146

Attribute – Communication Skill

Communication Skill Job Offer = Yes Job Offer = No

Good 4 1
Moderate 3 0
Poor 0 2

Gini_Index(T, Communication Skill ∈ {Good, Moderate}) = 1 − (7/8)² − (1/8)² = 1 − 0.7806 ≈ 0.2194

Gini_Index(T, Communication Skill ∈ {Poor}) = 1 − (0/2)² − (2/2)² = 1 − 1 = 0

Gini_Index(T, Communication Skill ∈ {Good, Moderate}, {Poor}) = (8/10) × 0.2194 + (2/10) × 0 ≈ 0.1755

Gini_Index(T, Communication Skill ∈ {Good, Poor}) = 1 − (4/7)² − (3/7)² = 1 − 0.5101 ≈ 0.489

Gini_Index(T, Communication Skill ∈ {Moderate}) = 1 − (3/3)² − (0/3)² = 1 − 1 = 0

Gini_Index(T, Communication Skill ∈ {Good, Poor}, {Moderate}) = (7/10) × 0.489 + (3/10) × 0 ≈ 0.342


Gini_Index(T, Communication Skill ∈ {Moderate, Poor}) = 1 − (3/5)² − (2/5)² = 1 − 0.52 = 0.48

Gini_Index(T, Communication Skill ∈ {Good}) = 1 − (4/5)² − (1/5)² = 1 − 0.68 = 0.32

Gini_Index(T, Communication Skill ∈ {Moderate, Poor}, {Good}) = (5/10) × 0.48 + (5/10) × 0.32 = 0.40

Gini_Index for Communication Skill

Subset Gini_Index
{Good, Moderate} {Poor} 0.1755
{Good, Poor} {Moderate} 0.3429
{Moderate, Poor} {Good} 0.40

∆Gini(Communication Skill) = Gini(T) − Gini(T, Communication Skill)

= 0.42 − 0.1755 ≈ 0.2445

Gini_Index and ∆Gini values of all the attributes

Attribute Gini_index ∆Gini


CGPA 0.1755 0.2445
Interactiveness 0.368 0.052
Practical Knowledge 0.3054 0.1146
Communication Skills 0.1755 0.2445

Since both CGPA and Communication Skill have the same ∆Gini value, which is the highest among all the attributes, either CGPA or Communication Skill can be chosen as the best split attribute. Here, CGPA is chosen and placed as the root node.


Table: Subset of the Training Dataset after Iteration 1

Sl. No  CGPA  Interactiveness  Practical Knowledge  Communication Skill  Job Offer
1       ≥9    Yes              Very Good            Good                 Yes
2       ≥8    No               Good                 Moderate             Yes
3       ≥9    No               Average              Poor                 No
5       ≥8    Yes              Good                 Moderate             Yes
6       ≥9    Yes              Good                 Moderate             Yes
8       ≥9    No               Very Good            Good                 Yes
9       ≥8    Yes              Good                 Good                 Yes
10      ≥8    Yes              Average              Good                 Yes

Gini_Index(T) = 1 − (7/8)² − (1/8)² = 1 − 0.766 − 0.0156 ≈ 0.2184

Attribute- Interactiveness

Interactiveness Job Offer = Yes Job Offer = No


Yes 5 0
No 2 1

Gini_Index(T, Interactiveness ∈ {Yes}) = 1 − (5/5)² − (0/5)² = 1 − 1 = 0

Gini_Index(T, Interactiveness ∈ {No}) = 1 − (2/3)² − (1/3)² = 1 − 0.44 − 0.111 ≈ 0.449

Gini_Index(T, Interactiveness ∈ {Yes}, {No}) = (5/8) × 0 + (3/8) × 0.449 ≈ 0.168



Gini_Index of Interactiveness

Subset Gini_Index
{Yes} {No} 0.168

Compute ∆Gini for the best splitting subset of that attribute:

∆Gini(Interactiveness) = Gini(T) − Gini(T, Interactiveness)

= 0.2184 − 0.168 ≈ 0.0504

Attribute – Practical Knowledge

Practical Knowledge Job Offer = Yes Job Offer = No

Very Good 2 0
Good 4 0
Average 1 1

Gini_Index(T, Practical Knowledge ∈ {Very Good, Good}) = 1 − (6/6)² − (0/6)² = 1 − 1 = 0

Gini_Index(T, Practical Knowledge ∈ {Average}) = 1 − (1/2)² − (1/2)² = 1 − 0.25 − 0.25 = 0.5

Gini_Index(T, Practical Knowledge ∈ {Very Good, Good}, {Average}) = (6/8) × 0 + (2/8) × 0.5 = 0.125

Gini_Index(T, Practical Knowledge ∈ {Very Good, Average}) = 1 − (3/4)² − (1/4)² = 0.375

Gini_Index(T, Practical Knowledge ∈ {Good}) = 1 − (4/4)² − (0/4)² = 1 − 1 = 0

Gini_Index(T, Practical Knowledge ∈ {Very Good, Average}, {Good}) = (4/8) × 0.375 + (4/8) × 0 ≈ 0.187

Gini_Index(T, Practical Knowledge ∈ {Good, Average}) = 1 − (5/6)² − (1/6)² ≈ 0.278

Gini_Index(T, Practical Knowledge ∈ {Very Good}) = 1 − (2/2)² − (0/2)² = 1 − 1 = 0

Gini_Index(T, Practical Knowledge ∈ {Good, Average}, {Very Good}) = (6/8) × 0.278 + (2/8) × 0 ≈ 0.2085

Gini_Index for Practical Knowledge

Subset Gini_Index
{Very Good, Good} Average 0.125
{Very Good, Average} Good 0.1875
{Good, Average} Very Good 0.2085

∆Gini(Practical Knowledge) = Gini(T) − Gini(T, Practical Knowledge)

= 0.2184 − 0.125 ≈ 0.0934


Attribute – Communication Skill

Communication Skill Job Offer = Yes Job Offer = No

Good 4 0
Moderate 3 0
Poor 0 1

Gini_Index(T, Communication Skill ∈ {Good, Moderate}) = 1 − (7/7)² − (0/7)² = 1 − 1 = 0

Gini_Index(T, Communication Skill ∈ {Poor}) = 1 − (0/1)² − (1/1)² = 1 − 1 = 0

Gini_Index(T, Communication Skill ∈ {Good, Moderate}, {Poor}) = (7/8) × 0 + (1/8) × 0 = 0

Gini_Index(T, Communication Skill ∈ {Good, Poor}) = 1 − (4/5)² − (1/5)² = 1 − 0.64 − 0.04 = 0.32

Gini_Index(T, Communication Skill ∈ {Moderate}) = 1 − (3/3)² − (0/3)² = 1 − 1 = 0

Gini_Index(T, Communication Skill ∈ {Good, Poor}, {Moderate}) = (5/8) × 0.32 + (3/8) × 0 = 0.2

Gini_Index(T, Communication Skill ∈ {Moderate, Poor}) = 1 − (3/4)² − (1/4)² = 0.375

Gini_Index(T, Communication Skill ∈ {Good}) = 1 − (4/4)² − (0/4)² = 1 − 1 = 0

Gini_Index(T, Communication Skill ∈ {Moderate, Poor}, {Good}) = (4/8) × 0.375 + (4/8) × 0 ≈ 0.1875

Gini_Index for Communication Skill

Subset Gini_Index
{ Good, Moderate} Poor 0
{Good, Poor} Moderate 0.2
{Moderate, Poor} Good 0.1875

∆Gini(Communication Skill) = Gini(T) − Gini(T, Communication Skill)

= 0.2184 − 0 ≈ 0.2184

Gini_Index and ∆Gini values of all the attributes

Attribute  Gini_Index  ∆Gini

Interactiveness        0.168   0.0504
Practical Knowledge    0.125   0.0934
Communication Skill    0       0.2184

Since Communication Skill has the maximum ∆Gini, it is chosen as the next best split attribute for this branch.


Problem 2:

Apply CART algorithm on the below dataset and construct a decision Tree.

Assessment Assignment Project Seminar Result


Good Yes Yes Good Pass
Average Yes No Poor Fail
Good No Yes Good Pass
Poor No No Poor Fail
Good Yes Yes Good Pass
Average No Yes Good Pass
Good No No Fair Pass
Poor Yes Yes Good Fail
Average No No Poor Fail
Good Yes Yes Fair Pass

Solution:

Regression Trees
Regression trees are a variant of decision trees where the target feature is a continuous valued
variable. These trees can be constructed using an algorithm called reduction in variance which uses
standard deviation to choose the best splitting attribute.

Algorithm: Procedure for Constructing Regression Trees



1. Compute standard deviation for each attribute with respect to target attribute.
2. Compute standard deviation for the number of data instances of each distinct value of an
attribute.
3. Compute weighted standard deviation for each attribute.
4. Compute standard deviation reduction by subtracting weighted standard deviation for each
attribute from standard deviation of each attribute.
5. Choose the attribute with the highest standard deviation reduction as the best split attribute.
6. The best split attribute is placed as the root node.
7. The root node is branched into subtrees with each subtree as an outcome of the test condition of
the root node attribute. Accordingly, the training dataset is also split into different subsets.
8. Recursively apply the same operation for the subset of the training set with the remaining
attributes until a leaf node is derived or no more training instances are available in the subset
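A minimal Python sketch of the standard deviation reduction computation (illustrative only; the function names and the small weather/hours-played sample are hypothetical, not from the notes):

```python
import math
from collections import defaultdict

def std_dev(values):
    """Population standard deviation of the target values."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)

def sd_reduction(rows, attr_index, target_index=-1):
    """Standard deviation reduction of one categorical attribute:
    SDR = SD(T) - sum(|T_v| / |T| * SD(T_v)) over the attribute's values."""
    targets = [row[target_index] for row in rows]
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr_index]].append(row[target_index])
    weighted_sd = sum(len(g) / len(rows) * std_dev(g) for g in groups.values())
    return std_dev(targets) - weighted_sd

# Hypothetical sample: (Weather, Hours Played)
rows = [("Sunny", 25), ("Sunny", 30), ("Rainy", 46), ("Rainy", 45), ("Overcast", 52)]
print(round(sd_reduction(rows, 0), 3))   # larger reduction -> better split attribute
```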

Problem: refer class notes.

VALIDATING AND PRUNING OF DECISION TREES

 Overfitting is a general problem with decision trees. Once the decision tree is constructed it must
be validated for better accuracy and to avoid over-fitting and under-fitting.
 There is always a tradeoff between accuracy and complexity of the tree. The tree must be simple
and accurate.
 If the tree is more complex, it can classify the data instances accurately for the training set but
when test data is given, the tree constructed may perform poorly which means misclassifications
are higher and accuracy is reduced. This problem is called as over-fitting.
 To avoid overfitting of the tree, one need to prune the trees and construct an optimal decision
tree. Trees can be pre-pruned or post-pruned.
 If tree nodes are pruned during construction or the construction is stopped earlier without
exploring the nodes' branches, then it is called as pre-pruning
 If tree nodes are pruned after the construction is over then it is called as post-pruning.
 Basically, the dataset is split into three sets called training dataset, validation dataset and test
dataset. Generally, 40% of the dataset is used for training the decision tree and the remaining
60% is used for validation and testing.
 Once the decision tree is constructed, it is validated with the validation dataset and the
misclassifications are identified.


 Using the number of instances correctly classified and number of instances wrongly classified,
Average Squared Error (ASE) is computed.
 The tree nodes are pruned based on these computations and the resulting tree is validated until
we get a tree that performs better.
 Cross validation is another way to construct an optimal decision tree.
 Another approach is that after the tree is constructed using the training set, statistical tests like
error estimation and Chi-square test are used to estimate whether pruning or splitting is required
for a particular node to find a better accurate tree.
 The third approach uses a principle called Minimum Description Length, which uses a complexity measure for encoding the training set; the growth of the decision tree is stopped when the encoding size (i.e., size(tree) + size(misclassifications(tree))) is minimized.
 Some of the tree pruning methods are listed below:
1. Reduced Error Pruning
2. Minimum Error Pruning (MEP)
3. Pessimistic Pruning
4. Error-based Pruning (EBP)
5. Optimal Pruning
6. Minimum Description Length (MDL) Pruning
7. Minimum Message Length Pruning
8. Critical Value Pruning
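The validation-based pruning idea above can be sketched with scikit-learn's cost-complexity pruning (an illustration only, not one of the specific methods listed; the dataset, split ratio and random_state are placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Placeholder data and split ratio -- any labelled dataset works the same way
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=42)

# Candidate pruning strengths from the cost-complexity pruning path
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_acc = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X_train, y_train)
    acc = accuracy_score(y_val, tree.predict(X_val))   # validate the pruned tree
    if acc >= best_acc:                                # prefer the more-pruned tree on ties
        best_alpha, best_acc = alpha, acc

print(f"chosen ccp_alpha = {best_alpha:.4f}, validation accuracy = {best_acc:.3f}")
```

Larger values of ccp_alpha prune the tree more aggressively; the validation set is used to decide how much pruning keeps the tree simple without hurting accuracy.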

