MODULE -03
NEAREST-NEIGHBOR LEARNING
A natural approach to similarity-based classification is k-Nearest Neighbors (k-NN), a non-parametric method used for both classification and regression problems.
It is a simple yet powerful algorithm that predicts the category of a test instance from the 'k' training samples closest to it, assigning the class that occurs most frequently (i.e., has the largest probability) among those neighbors.
A visual representation of this learning is shown in Figure 4.1.
There are two classes of objects, C1 and C2, in the given figure. When a test instance T is given, its category is determined by looking at the classes of its k = 3 nearest neighbors.
Thus, the class of the test instance T is predicted to be the class that is in the majority among those three neighbors.
Algorithm: k-NN
Inputs: Training dataset T, distance metric d, Test instance t, the number of nearest neighbors 'k'
Output: Predicted class or category.
Prediction: For test instance t,
Step 1: Compute the distance between the test instance t and every instance i in the training dataset T using a distance metric (e.g., Euclidean distance).
[Continuous attributes - The Euclidean distance between two points with coordinates (x1, y1) and (x2, y2) is given as dist((x1, y1), (x2, y2)) = √((x1 − x2)² + (y1 − y2)²).]
[Categorical (binary) attributes - Hamming distance: if the values of the two instances are the same, the distance d is 0; otherwise d = 1.]
Step 2: Sort the distances in an ascending order and select the first k nearest training data
instances to the test instance.
Step 3: Predict the class of the test instance by majority voting (if target attribute is discrete
valued) or mean (if target attribute is continuous valued) of the k selected nearest instances.
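Below is a minimal Python sketch of these three steps, with illustrative names (train, knn_predict) that are not from the source; it assumes continuous attributes and Euclidean distance.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Distance metric for continuous attributes.
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_predict(train, t, k=3):
    """train: list of (feature_vector, label) pairs; t: test feature vector."""
    # Step 1: distance from t to every training instance.
    distances = [(euclidean(x, t), label) for x, label in train]
    # Step 2: sort in ascending order and keep the first k neighbors.
    neighbors = sorted(distances, key=lambda d: d[0])[:k]
    # Step 3: majority vote over the k nearest labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```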
Example:
Step 1:
Step 3: Since Red has the majority vote, the given new instance/test data (20, 35, ?) is classified as RED.
Example 2:
Step 1:
Sl. No   CGPA   Assessment   Project Submitted   Result   Euclidean distance
Step 3: Since Fail has the majority vote, the given new instance/test data (6.1, 40, 5) is classified as Fail.
Algorithm: Weighted k-NN
Inputs: Training dataset 'T', distance metric 'd', weighting function w(i), test instance 't', the number of nearest neighbors 'k'.
Output: Predicted class or category
Prediction: For test instance t,
Step 1: For each instance 'i’ in Training dataset T, compute the distance between the test instance t
and every other instance 'i' using a distance metric (Euclidean distance).
Step 2: Sort the distances in ascending order and select the first 'k' nearest training data instances to the test instance.
Step 3: Predict the class of the test instance by weighted voting technique (Weighting function w(i))
for the k selected nearest instances:
Compute the inverse of each distance of the 'k' selected nearest instances.
Find the sum of the inverses.
Compute the weight by dividing each inverse distance by the sum. (Each weight is a vote
for its associated class).
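A minimal Python sketch of this weighted voting, under the same assumptions as the k-NN sketch above (illustrative names; a small epsilon guards against zero distances):

```python
import math
from collections import defaultdict

def weighted_knn_predict(train, t, k=3):
    """train: list of (feature_vector, label) pairs; t: test feature vector."""
    dist = lambda a, b: math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    # Steps 1-2: compute all distances and keep the k nearest instances.
    neighbors = sorted(((dist(x, t), label) for x, label in train))[:k]
    # Step 3: weight = inverse distance, normalised by the sum of inverses.
    inverses = [(1.0 / (d + 1e-12), label) for d, label in neighbors]
    total = sum(inv for inv, _ in inverses)
    votes = defaultdict(float)
    for inv, label in inverses:
        votes[label] += inv / total   # each weight acts as a vote for its class
    return max(votes, key=votes.get)
```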
Example:
Consider the same training dataset given in the table below. Use Weighted k-NN and determine the class of the new test data (7.6, 60, 8).
Step 1:
Step 2: Sort the distances in ascending order and select the first 3 nearest training data instances to the test instance. The selected nearest neighbors are:
Step 3: Predict the class of the test instance by weighted voting technique from the 3 selected nearest
instance.
Compute the inverse of each distance of the 3 selected nearest instances.
(Columns: CGPA, Assessment, Project Submitted, Result, Euclidean distance (ED), Inverse distance (1/ED))
Pass = 0.3451
Predict the class by choosing the class with the maximum vote.
Since the total vote for Fail is larger, the new instance is classified as Fail.
NEAREST CENTROID CLASSIFIER
A simple alternative to k-NN classifiers for similarity-based classification is the Nearest Centroid
Classifier.
It is a simple classifier and is also called the Mean Difference classifier.
The idea of this classifier is to classify a test instance to the class whose centroid/mean is closest
to that instance.
Algorithm: Nearest Centroid Classifier
Inputs: Training dataset T, Distance metric d, Test instance t
Output: Predicted class or category
Step 1: Compute the mean/centroid of each class.
Step 2: Compute the distance between the test instance and mean/centroid of each class (Euclidean
Distance).
Step 3: Predict the class by choosing the class with the smallest distance.
Example: Consider the sample data shown in below Table with two features x and y. The target
classes are 'A' or 'B'. Predict the class of the instance (6, 5) using Nearest Centroid Classifier.
X Y Class
3 1 A
5 2 A
4 3 A
7 6 B
6 7 B
8 5 B
Step 1: Compute the mean/centroid of each class. In this example there are 2 classes called ‘A’ &
‘B’
Centroid of class ‘A’ = ((3 + 5 + 4)/3, (1 + 2 + 3)/3) = (12/3, 6/3) = (4, 2)
Centroid of class ‘B’ = ((7 + 6 + 8)/3, (6 + 7 + 5)/3) = (21/3, 18/3) = (7, 6)
Step 2: Calculate the Euclidean distance between the test instance (6, 5) and each centroid.
Euc_distance((6, 5), (4, 2)) = √((6 − 4)² + (5 − 2)²) = √13 ≈ 3.6
Euc_distance((6, 5), (7, 6)) = √((6 − 7)² + (5 − 6)²) = √2 ≈ 1.414
Step 3: Since the distance of the test instance to the centroid of class B (1.414) is smaller than its distance to the centroid of class A (3.6), the test instance (6, 5) is classified as class B.
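A minimal Python sketch of the Nearest Centroid Classifier applied to this example (illustrative names, not from the source):

```python
import math

def nearest_centroid_predict(train, t):
    """train: list of (feature_vector, label) pairs; t: test feature vector."""
    # Step 1: compute the mean/centroid of each class.
    groups = {}
    for x, label in train:
        groups.setdefault(label, []).append(x)
    centroids = {c: [sum(col) / len(xs) for col in zip(*xs)] for c, xs in groups.items()}
    # Step 2: Euclidean distance between t and each centroid.
    dist = lambda a, b: math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    # Step 3: predict the class whose centroid is closest.
    return min(centroids, key=lambda c: dist(centroids[c], t))

# Data from the example above: centroid of A = (4, 2), centroid of B = (7, 6).
data = [((3, 1), 'A'), ((5, 2), 'A'), ((4, 3), 'A'),
        ((7, 6), 'B'), ((6, 7), 'B'), ((8, 5), 'B')]
print(nearest_centroid_predict(data, (6, 5)))  # -> 'B' (distances ~3.61 vs ~1.41)
```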
LOCALLY WEIGHTED REGRESSION (LWR)
In ordinary linear regression, the cost function to be minimized is
J(β) = Σ_{i=1..m} (h_β(x_i) − y_i)²
where 'm' is the number of instances in the training dataset and h_β(x_i) is the predicted value for x_i.
Now the cost function is modified for locally weighted linear regression including the weights
only for the nearest neighbor points.
Hence, the cost function is given as
J(β) = Σ_{i=1..m} w_i (h_β(x_i) − y_i)²
where wi is the weight associated with each xi.
The weight function used is a Gaussian kernel that gives a higher value for instances that are
close to the test instance, and for instances far away, it tends to zero but never equals to zero.
w_i is computed as
w_i = exp(−(x_i − x)² / (2τ²))
where x is the test instance and τ is called the bandwidth parameter, which controls the rate at which w_i falls off with the distance of x_i from the test instance.
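A minimal numpy sketch of locally weighted regression with this Gaussian kernel, assuming a weighted least-squares solve at each query point (names are illustrative, not from the source):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=1.0):
    """X: (m, d) training inputs, y: (m,) targets, x_query: (d,) test point, tau: bandwidth."""
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])      # add an intercept column
    xq = np.hstack([1.0, x_query])
    # Gaussian kernel: weights near 1 for points close to x_query, tending to 0 far away.
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted least squares: beta = (Xb^T W Xb)^(-1) Xb^T W y.
    beta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y
    return float(xq @ beta)
```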
Regression analysis is a supervised learning method for predicting continuous variables. The difference between classification and regression analysis is that regression methods are used to predict quantitative or continuous variables, whereas classification predicts categorical variables or labels. Regression is used to model linear or non-linear relationships among the variables of the given dataset.
INTRODUCTION TO REGRESSION
Regression analysis is the premier method of supervised learning. This is one of the most popular
and oldest supervised learning techniques.
Given a training dataset D containing N training points (x_i, y_i), where i = 1, ..., N, regression analysis is used to model the relationship between one or more independent variables x and a dependent variable y.
The relationship between the dependent and independent variables can be represented as a
function as follows:
Y = f(x)
Here, the feature variable x is also known as an explanatory variable, a predictor variable, an independent variable, a covariate, or a domain point.
y is a dependent variable. Dependent variables are also called as labels, target variables, or
response variables.
Regression analysis determines the change in the response variable when one explanatory variable is varied while all other parameters are kept constant.
This is used to determine the relationship that each of the explanatory variables exhibits with the response variable. Thus, regression analysis is used for prediction and forecasting.
Regression is used to predict continuous variables or quantitative variables such as price and
revenue. Thus, the primary concern of regression analysis is to find answer to questions such as:
What is the relationship between the variables?
What is the strength of the relationships?
What is the nature of the relationship such as linear or non-linear?
What is the relevance of the attributes?
What is the contribution of each attribute?
There are many applications of regression analysis. Some of the applications of regressions
include predicting:
Sales of goods or services
In positive correlation, an increase in one variable is associated with an increase in the other variable.
In negative correlation, the variables move in opposite directions (an inverse relationship), while in random correlation, no relationship exists between the variables.
While correlation is about relationships among variables, say x and y, regression is about
predicting one variable given another variable.
Linear Regression: It is a type of regression in which a line is fitted to the given data to find the linear relationship between one independent variable and one dependent variable.
Multiple Regression: It is a type of regression where a line is fitted for finding the linear relationship
between two or more independent variables and one dependent variable to describe relationships
among variables.
Logistic Regression: It is used for predicting categorical variables that involve one or more
independent variables and one dependent variable. This is also known as a binary classifier.
Lasso and Ridge Regression Methods: These are special variants of regression method where
regularization methods are used to limit the number and size of coefficients of the independent
variables.
1. Outliers - Outliers are abnormal data points. They can bias the outcome of the regression model, as outliers pull the regression line towards themselves.
2. Number of cases - The ratio of cases (samples) to independent variables should be at least 20:1; that is, for every explanatory variable there should be at least 20 samples. At least five samples per explanatory variable are required in extreme cases.
3. Missing data - Missing data in training data can make the model unfit for the sampled data.
The vertical distance between each point and the line is called an error. These individual errors are
added to compute the total error of the predicted line. This is called sum of residuals.
The squares of the individual errors can also be computed and added to give a sum of squared
error. The line with the lowest sum of squared error is called line of best fit.
In other words, OLS is an optimization technique in which the difference between the data points and the fitted line is minimized.
Mathematically, the line equations for the points (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) are:
y_1 = (a_0 + a_1 x_1) + e_1
y_2 = (a_0 + a_1 x_2) + e_2
...
y_n = (a_0 + a_1 x_n) + e_n
In general, the error is given as: ei = yi - (a0 + a1 xi)
Here, the terms (e_1, e_2, ..., e_n) are the errors associated with the data points and denote the difference between the true value of an observation and the corresponding point on the line. These are also called residuals.
The residuals can be positive, negative or zero.
A regression line is the line of best fit for which the sum of the squares of residuals is minimum.
The minimization can be done as the minimization of the sum of the individual errors by finding the parameters a_0 and a_1 such that:
E = Σ_i e_i = Σ_i (y_i − (a_0 + a_1 x_i))
Or as the minimization of the sum of the absolute values of the individual errors:
E = Σ_i |e_i| = Σ_i |y_i − (a_0 + a_1 x_i)|
Or as the minimization of the sum of the squares of the individual errors:
E = Σ_i (e_i)² = Σ_i (y_i − (a_0 + a_1 x_i))²
The sum of the squares of the individual errors is usually preferred because the individual errors do not cancel out and are always positive, and the sum of squares increases sharply even for a small change in an error.
Therefore, this is preferred for linear regression. Therefore, linear regression is modelled as a
minimization function as follows:
J(a_0, a_1) = Σ_i [y_i − f(x_i)]²
            = Σ_i (y_i − (a_0 + a_1 x_i))²
Here, J(a_0, a_1) is the criterion function of the parameters a_0 and a_1, and it needs to be minimized. This is done by differentiating with respect to a_0 and a_1 and setting the derivatives to zero, which yields the coefficient values. The estimates of a_0 and a_1 are given as follows:
a_1 = (mean(x·y) − x̄·ȳ) / (mean(x²) − (x̄)²)
And the value of a_0 is given as follows:
a_0 = ȳ − a_1·x̄
where x̄ and ȳ denote the means of x and y.
Problems
Let us consider an example where five weeks' sales data (in thousands) is given, as shown below. Apply the linear regression technique to predict the 7th and 9th week sales.
Here, there are 5 samples, so i ranges from 1 to 5. The computation table is shown below.
Xi Yi Xi2 Xi * Yi
1 1.2 1 1.2
2 1.8 4 3.6
3 2.6 9 7.8
4 3.2 16 12.8
5 3.8 25 19
Sum =15 Sum =12.6 Sum =55 Sum = 44.4
Average of x_i: x̄ = 15/5 = 3;  Average of y_i: ȳ = 12.6/5 = 2.52;  Average of x_i²: 55/5 = 11;  Average of x_i·y_i: 44.4/5 = 8.88
Solution
a_1 = (8.88 − (3)(2.52)) / (11 − 3²) = (8.88 − 7.56) / (11 − 9) = 1.32 / 2 = 0.66
a_0 = ȳ − a_1·x̄ = 2.52 − 0.66 × 3 = 0.54
The predicted 7th week sale would be (when x= 7), y = 0.54 + 0.66 *7 = 5.16 and 9th week sales
would be (when x=9), y = 0.54 + 0.66 *9 = 6.48
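The same estimates can be checked with a short Python sketch of the closed-form formulas (the data and the results 0.54, 0.66, 5.16 and 6.48 are from the worked example above):

```python
# Data and estimates from the worked example above.
xs = [1, 2, 3, 4, 5]
ys = [1.2, 1.8, 2.6, 3.2, 3.8]
mean = lambda v: sum(v) / len(v)
x_bar, y_bar = mean(xs), mean(ys)                        # 3, 2.52
xy_bar = mean([x * y for x, y in zip(xs, ys)])           # 8.88
x2_bar = mean([x * x for x in xs])                       # 11
a1 = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)    # 0.66
a0 = y_bar - a1 * x_bar                                  # 0.54
print(round(a0 + a1 * 7, 2), round(a0 + a1 * 9, 2))      # 5.16 6.48
```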
Problem
Find the linear regression for the given data of week versus product sales (in thousands), using linear regression in matrix form.
Xi (week) Yi (product Sales in Thousands)
1 1
2 3
3 4
4 8
Solution:
Here, the independent variable X is given as
X^T = [1 2 3 4]
and the dependent variable Y is given as
Y^T = [1 3 4 8]
The data can be given in matrix form (with a column of ones added to X for the intercept term) as follows:
X = [1 1        Y = [1
     1 2             3
     1 3             4
     1 4]            8]
1. Computation of X^T X:
   X^T X = [ 4  10
            10  30]
2. Computation of the matrix inverse (X^T X)^(-1):
   (X^T X)^(-1) = [ 1.5  −0.5
                   −0.5   0.2]
3. Computation of the coefficients β = (X^T X)^(-1) X^T Y:
   (X^T X)^(-1) X^T = [ 1    0.5   0   −0.5
                       −0.3 −0.1  0.1   0.3]
   β = (X^T X)^(-1) X^T Y = [−1.5   (Intercept)
                              2.2]  (Slope)
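A short numpy sketch of the same matrix-form solution, reproducing the intercept −1.5 and slope 2.2 (variable names are illustrative):

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([1, 3, 4, 8])
X = np.column_stack([np.ones_like(x), x])   # design matrix with an intercept column
beta = np.linalg.inv(X.T @ X) @ X.T @ y     # beta = (X^T X)^(-1) X^T Y
print(beta)                                 # approximately [-1.5  2.2]
```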
The regression model should be evaluated using some metrics for checking the correctness. The
following metrics are used to validate the results of regression.
Standard Error Estimate
The standard error estimate is another useful measure of regression. It is the standard deviation of the observed values about the predicted values and is given as:
S_est = √( Σ_i (y_i − ŷ_i)² / (n − 2) )
Here, y_i is the observed value, ŷ_i is the predicted value, and n is the number of samples.
Relative MSE
Relative MSE is the ratio of the prediction error of the model to the error of a trivial predictor that always predicts the mean of the observed values. A value of zero indicates a perfect model; values between 0 and 1 indicate a useful model, and if the value is more than 1, the created model is not a good one.
RelMSE = Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²
Coefficient of Variation
The coefficient of variation is unitless and expresses the prediction error relative to the mean of the observed values ȳ:
CV = RMSE / ȳ
Example:
Consider the following training set for predicting the sales of the items.
Consider two fresh items, 6 and 7, whose actual values are 80 and 90 respectively. A regression model predicts the values of items 6 and 7 as 75 and 82 respectively. Find MAE, MSE, RMSE, RelMSE and CV.
Solution:
The errors for items 6 and 7 are 80 − 75 = 5 and 90 − 82 = 8, and the mean of the actual values is ȳ = (80 + 90)/2 = 85.
MAE = (5 + 8)/2 = 6.5
MSE = (5² + 8²)/2 = 89/2 = 44.5
RMSE = √44.5 ≈ 6.67
RelMSE = (5² + 8²) / ((80 − 85)² + (90 − 85)²) = 89/50 = 1.78
Coefficient of Variation
CV = 6.67 / 85 ≈ 0.08
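These numbers can be reproduced with a short Python sketch (the RelMSE and CV formulas follow the reconstructions above, so treat those exact definitions as assumptions):

```python
import math

actual = [80, 90]        # items 6 and 7
predicted = [75, 82]
n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]          # 5, 8
mae = sum(abs(e) for e in errors) / n                        # 6.5
mse = sum(e ** 2 for e in errors) / n                        # 44.5
rmse = math.sqrt(mse)                                        # ~6.67
y_bar = sum(actual) / n                                      # 85
# RelMSE and CV as reconstructed above (treat these definitions as assumptions).
rel_mse = sum(e ** 2 for e in errors) / sum((a - y_bar) ** 2 for a in actual)  # 1.78
cv = rmse / y_bar                                            # ~0.08
print(mae, mse, round(rmse, 2), rel_mse, round(cv, 2))
```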
DECISION TREE LEARNING
A decision tree has a structure that consists of a root node, internal nodes/decision nodes, branches and terminal nodes/leaf nodes.
The topmost node in the tree is the root node. Internal nodes are the test nodes and are also called decision nodes. These nodes represent a choice or test on an input attribute, and the outcomes of the test condition are the branches emanating from the decision node.
The branches are labelled as per the outcomes or output values of the test condition.
Each branch represents a sub-tree or subsection of the entire tree.
Every decision node is part of a path to a leaf node. The leaf nodes represent the labels or the outcome of a decision path.
The decision tree model, in general, represents a collection of logical rules of classification in the
form of a tree structure.
Decision networks, also called influence diagrams, have a directed graph structure with
nodes and links. It is an extension of Bayesian belief networks that represents information about
each node's current state, its possible actions, the possible outcome of those actions, and their
utility.
Figure shows symbols that are used to represent different nodes in the construction of a decision
tree.
A circle is used to represent a root node, a diamond symbol is used to represent a decision node or
the internal nodes, and all leaf nodes are represented with a rectangle.
A decision tree consists of two major procedures discussed below.
1. Building the Tree
Goal: Construct a decision tree with the given training dataset. The tree is constructed in a top-down fashion, starting from the root node. At every level of tree construction, one needs to find the best split attribute, i.e., the best decision node, among all attributes. This process is recursive and continues until we reach the last level of the tree or a leaf node that cannot be split further.
2. Knowledge Inference or Classification
Goal: Given a test instance, infer the target class it belongs to.
Classification: Inferring the target class for the test instance or object is based on inductive
inference on the constructed decision tree. In order to classify an object, we need to start
traversing the tree from the root. We traverse as we evaluate the test condition on every
decision node with the test object attribute value and walk to the branch corresponding to the
test's outcome. This process is repeated until we end up in a leaf node which contains the
target class of the test object.
Advantages
2. Simple to understand
3. The input and output attributes can be discrete or continuous predictor variables.
4. Can model a high degree of nonlinearity in the relationship between the target variables and
the predictor variables
5. Quick to train
Disadvantages
1. It is difficult to determine how deeply a decision tree can be grown or when to stop growing
it.
2. If training data has errors or missing attribute values, then the decision tree constructed may
become unstable or biased.
3. If the training data has continuous-valued attributes, handling them is computationally complex and they have to be discretized.
4. A complex decision tree may also be over-fitting with the training data.
5. Decision tree learning is not well suited for classifying multiple output classes.
Fundamentals of Entropy
Given the training dataset with a set of attributes or features, the decision tree is constructed by
finding the attribute or feature that best describes the target class for the given test instances.
The best split feature is the one which contains more information about how to split the dataset
among all features so that the target class is accurately identified for the test instances.
In other words, the best split attribute is more informative to split the dataset into sub datasets and
this process is continued until the stopping criterion is reached. This splitting should be pure at
every stage of selecting the best feature.
The best feature is selected based on the amount of information among the features which are
basically calculated on probabilities.
Quantifying information is closely related to information theory. In the field of information
theory, the features are quantified by a measure called Shannon Entropy which is calculated
based on the probability distribution of the events.
The entropy of a training dataset T is computed as Entropy(T) = −Σ_i p_i log₂(p_i), where p_i is the probability that a data instance belongs to class i; for example,
P1 = |Number of data instances belonging to class 1| / |Total number of data instances in the training dataset|
1. Find the best attribute from the training dataset using an attribute selection measure and place
it at the root of the tree.
2. Split the training dataset into subsets based on the outcomes of the test attribute and each
subset in a branch contains the data instance or tuples with the same value for the selected test
attribute.
3. Repeat step 1 and step 2 on each subset until we end up in leaf nodes in all the branches of the tree.
Stopping Criteria
ID3 is an example of a univariate decision tree, as it considers only one feature at each decision node.
This leads to axis-aligned splits.
The tree is then used to classify the future test instances. It constructs the tree using a greedy
approach in a top-down fashion by identifying the best attribute at each level of the tree.
ID3 works well if the attributes or features have discrete/categorical values.
The algorithm builds the tree using a purity measure called 'Information Gain' with the given training data instances and then uses the constructed tree to classify the test data.
ID3 works well for a large dataset; if the dataset is small, over-fitting may occur. Moreover, it is not accurate if the dataset has missing attribute values.
No pruning is done during or after construction of the tree and it is prone to outliers.
2. Compute Entropy_Info and Information_Gain for each of the attributes in the training dataset:
Entropy_Info(T) = −Σ_i p_i log₂(p_i)
Entropy_Info(T, A) = Σ_{i=1..v} (|A_i| / |T|) × Entropy_Info(A_i)
Information_Gain(A) = Entropy_Info(T) − Entropy_Info(T, A)
3. Choose the attribute for which entropy is minimum and therefore the gain is maximum as the
best split attribute.
4. The best split attribute is placed as the root node.
5. The root node is branched into subtrees with each subtree as an outcome of the test condition
of the root node attribute. Accordingly, the training dataset is also split into subsets.
6. Recursively apply the same operation for the subset of the training set with the remaining
attributes until a leaf node is derived or no more training instances are available in the subset.
Note: We stop branching a node if entropy is 0. The best split attribute at every iteration is the
attribute with the highest information gain.
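A minimal Python sketch of these entropy and information-gain computations over a list of row dictionaries (illustrative names, not from the source; for example, for the Job Offer data used below, the dataset entropy comes out as approximately 0.8807):

```python
import math
from collections import Counter

def entropy(rows, target):
    """Entropy_Info(T) = -sum(p_i * log2(p_i)) over the target classes."""
    total = len(rows)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(r[target] for r in rows).values())

def entropy_info(rows, attribute, target):
    """Weighted entropy of the subsets obtained by splitting on `attribute`."""
    total = len(rows)
    result = 0.0
    for v in set(r[attribute] for r in rows):
        subset = [r for r in rows if r[attribute] == v]
        result += (len(subset) / total) * entropy(subset, target)
    return result

def information_gain(rows, attribute, target):
    """Gain(A) = Entropy_Info(T) - Entropy_Info(T, A)."""
    return entropy(rows, target) - entropy_info(rows, attribute, target)
```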
Problems
Step 1: Calculate the Entropy for the target class Job offer
Entropy_Info(Job Offer) = Entropy_Info(7, 3) = −[(7/10) log₂(7/10) + (3/10) log₂(3/10)] = 0.8807
Iteration 1:
Step 2: Calculate the Entropy_Info and Information_Gain for each of the attributes in the training dataset.
Entropy of CGPA
Entropy of Interactiveness
Interactiveness   Job Offer = Yes   Job Offer = No   Total
Yes   5   1   6
No    2   2   4
Entropy_Info(T, Interactiveness) = (6/10) × (−[(5/6) log₂(5/6) + (1/6) log₂(1/6)]) + (4/10) × (−[(2/4) log₂(2/4) + (2/4) log₂(2/4)])
= 6/10 × 0.6497 + 4/10 × 1 = 0.7896
Gain(Interactiveness) = 0.8807 − 0.7896 = 0.0911
Gain Table
Attribute Gain
CGPA 0.5564
Interactiveness 0.0911
Practical Knowledge 0.2446
Communication Skill 0.5203
Step 3: Since the gain of CGPA is maximum, we choose this attribute as the best split attribute and place it at the root node. The resulting tree is as shown below.
Iteration 2:
In this iteration, the same process of computing the Entropy_Info and Gain is repeated with a subset of the training set.
The subset consists of 4 data instances, as shown below.
Interactiveness Practical Knowledge Communication Skill Job Offer
Yes Very Good Good Yes
No Average Poor No
Yes Good Moderate Yes
No Very Good Good Yes
Entropy of Interactiveness
Entropy_Info(T, Interactiveness) = 0 + 0.4997 = 0.4997
Gain(Interactiveness) = 0.8108 − 0.4997 = 0.3111
Gain Table
Attribute Gain
Interactiveness 0.3111
Practical Knowledge 0.8108
Communication Skill 0.8108
Here, both the attributes Practical Knowledge and Communication Skill have the same gain, so we can construct the decision tree using either Practical Knowledge or Communication Skill.
The final Decision Tree is as shown in below
C4.5 CONSTRUCTION
C4.5 is an improvement over ID3. C4.5 works with continuous and discrete attributes and
missing values, and it also supports post-pruning.
C5.0 is the successor of C4.5 and is more efficient and used for building smaller decision trees.
C4.5 works with missing values by marking as '?', but these missing attribute values are not
considered in the calculations.
The C4.5 algorithm is based on Occam's Razor, which says that, given two correct solutions, the simpler one should be chosen.
Moreover, the algorithm requires a larger training set for better accuracy.
It uses Gain Ratio as a measure during the construction of decision trees.
ID3 is more biased towards attributes with larger values. To overcome this bias issue, C4.5 uses
a purity measure Gain ratio to identify the best split attribute.
In the C4.5 algorithm, the Information Gain measure used in the ID3 algorithm is normalized by computing a factor called Split_Info.
This normalized information gain of an attribute, called the Gain Ratio, is computed as the ratio of the Information Gain of the attribute to its Split_Info.
Then the attribute with the highest normalized information gain, that is, the highest gain ratio, is used as the splitting criterion.
Given a Training dataset T,
The Split_Info of an attribute A is computed as
Split_Info(T, A) = −Σ_{i=1..v} (|A_i| / |T|) × log₂(|A_i| / |T|)
where the attribute A has 'v' distinct values {a_1, a_2, ..., a_v}, and |A_i| is the number of instances with the distinct value a_i of attribute A.
The Gain Ratio of an attribute A is computed as
Gain Ratio(A)= Info_Gain(A) / Split_Info(T, A)
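A self-contained Python sketch of the gain-ratio computation for a categorical attribute (rows are dictionaries; names are illustrative, not from the source):

```python
import math
from collections import Counter

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attribute, target):
    """rows: list of dicts; returns Gain_Ratio(A) = Info_Gain(A) / Split_Info(T, A)."""
    n = len(rows)
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    # Information Gain = Entropy_Info(T) - Entropy_Info(T, A)
    info_gain = _entropy([r[target] for r in rows]) - \
                sum((len(g) / n) * _entropy(g) for g in groups.values())
    # Split_Info(T, A) = -sum(|Ai|/|T| * log2(|Ai|/|T|))
    split_info = -sum((len(g) / n) * math.log2(len(g) / n) for g in groups.values())
    return info_gain / split_info if split_info > 0 else 0.0
```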
Problems
Sl. No   CGPA   Interactiveness   Practical Knowledge   Communication Skill   Job Offer
3   ≥9   No   Average   Poor   No
4 <8 No Average Good No
5 ≥8 Yes Good Moderate Yes
6 ≥9 Yes Good Moderate Yes
7 <8 Yes Good Poor No
8 ≥9 No Very Good Good Yes
9 ≥8 Yes Good Good Yes
10 ≥8 Yes Average Good Yes
Step 1: Calculate the Entropy for the target class Job Offer: Entropy_Info(Job Offer) = Entropy_Info(7, 3) = 0.8807
Iteration 1:
Step 2: Calculate the Entropy_Info and Information_Gain for each of the attributes in the training dataset.
Entropy of CGPA
CGPA Job Offer = Yes Job Offer = No Total
≥9 3 1 4
≥8 4 0 4
<8 0 2 2
Entropy of Interactiveness
Interactiveness   Job Offer = Yes   Job Offer = No   Total
Yes   5   1   6
No    2   2   4
Entropy_Info(T, Interactiveness) = 6/10 × 0.6497 + 4/10 × 1 = 0.7896
Gain(Interactiveness) = 0.8807 − 0.7896 = 0.0911
Gain Table
Attribute Gain_Ratio
CGPA 0.3658
Interactiveness 0.0939
Practical Knowledge 0.1648
Communication Skill 0.3502
Step 3: Since the gain ratio of CGPA is maximum, we choose this attribute as the best split attribute and place it at the root node. The resulting tree is as shown below.
Iteration 2:
In this iteration, the same process of computing the Entropy_Info, Gain_Info, Split_Info and Gain_Ratio is repeated with a subset of the training set.
The subset consists of 4 data instances, as shown below.
Interactiveness Practical Knowledge Communication Skill Job Offer
Yes Very Good Good Yes
No Average Poor No
Yes Good Moderate Yes
No Very Good Good Yes
Entropy_Info(T, Interactiveness) = 0 + 0.4997 = 0.4997
Gain(Interactiveness) = 0.8108 − 0.4997 = 0.3111
Gain Table
Attribute Gain_Ratio
Interactiveness 0.3111
Practical Knowledge 0.5408
Communication Skill 0.5408
Here, both the attributes Practical Knowledge and Communication Skill have the same gain ratio, so we can construct the decision tree using either Practical Knowledge or Communication Skill.
The C4.5 algorithm is further improved by considering attributes which are continuous and a
continuous attribute is discretized by finding a split point or threshold.
When an attribute ‘A’ has continuous numerical values, a threshold or best split point ‘s’ is found such that the set of values is partitioned into two sets, A ≤ s and A > s.
The best split point is the attribute value that gives the maximum information gain for that attribute.
Example:
Solution:
First sort the values in an ascending order by removing the duplicates and consider only the unique
values of the attribute.
CGPA split point s              6.8      7.9      8.2          8.5      8.8          9.1      9.5
Job Offer = Yes (≤ s / > s)     0/7      0/7      1/6          2/5      4/3          5/2      7/0
Job Offer = No  (≤ s / > s)     1/2      2/1      2/1          2/1      2/1          3/0      3/0
Entropy (≤ s / > s)             0/0.763  0/0.543  0.917/0.591  1/0.649  0.917/0.810  0.953/0  0.880/0
Entropy_Info(T, CGPA at s)      0.687    0.434    0.689        0.789    0.874        0.763    0.880
Since the split at CGPA = 7.9 has the maximum gain (0.4462), CGPA = 7.9 is chosen as the split point.
We can discretise the continuous values of CGPA as 2 categories with CGPA ≤ 7.9 and CGPA > 7.9.
The resulting discretized instances are shown in below table.
4 6.8 ≤ 7.9 No
7 7.9 ≤ 7.9 No
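A minimal Python sketch of this split-point search by information gain (illustrative names; it evaluates every unique value of the attribute as a candidate threshold):

```python
import math
from collections import Counter

def _entropy(labels):
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values):
    """values: list of (continuous_attribute_value, class_label) pairs."""
    labels = [y for _, y in values]
    base = _entropy(labels)                       # entropy of the whole set
    best_s, best_gain = None, -1.0
    for s in sorted(set(v for v, _ in values)):   # unique candidate thresholds
        left = [y for v, y in values if v <= s]
        right = [y for v, y in values if v > s]
        info = (len(left) / len(values)) * _entropy(left) \
             + (len(right) / len(values)) * _entropy(right)
        if base - info > best_gain:
            best_s, best_gain = s, base - info
    return best_s, best_gain                      # threshold with the maximum gain
```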
CART algorithm is an example of multivariate decision tree that gives oblique splits.
It solves both classification and regression problems. If the target feature is categorical, it
constructs a classification tree and if the target feature is continuous, it constructs a regression
tree.
CART uses GINI Index to construct a decision tree.
The GINI Index measures the impurity of a set of data instances and is computed from the proportion of instances of each class in the set.
It constructs the tree as a binary tree by recursively splitting a node into two nodes. Therefore, even if an attribute has more than two possible values, the GINI Index is calculated for the possible subsets of the attribute's values, and the subset which gives the minimum Gini_Index (maximum ∆Gini) is selected as the best split subset.
For example, if an attribute A has three distinct values, say {a1, a2, a3}, the possible subsets are { }, {a1}, {a2}, {a3}, {a1, a2}, {a1, a3}, {a2, a3} and {a1, a2, a3}. So, if an attribute has 3 distinct values, the number of possible subsets is 2³ = 8.
Excluding the empty set { } and the full set {a1, a2, a3}, we have 6 subsets. With 6 subsets, we can
form three possible combinations such as:
o {a1} with {a2, a3}
o {a2} with {a1, a3}
o {a3} with {a1, a2}
Hence, in the CART algorithm, we need to compute both the best splitting attribute and the best split subset of the chosen attribute.
The lower the GINI value, the higher the homogeneity of the data instances (a Gini value of 0 indicates a pure, homogeneous set).
Gini_Index(T) is computed as
Gini_Index(T) = 1 − Σ_i P_i²
where P_i is the probability that a data instance or tuple belongs to class C_i, computed as
P_i = |Number of data instances of class C_i in T| / |T|
GINI Index assumes a binary split on each attribute, therefore, every attribute is considered as a
binary attribute which splits the data instances into two subsets S1, and S2
Gini_Index(T, A) = (|S_1| / |T|) × Gini(S_1) + (|S_2| / |T|) × Gini(S_2)
The splitting subset with minimum Gini_Index is chosen as the best splitting subset for an
attribute. The best splitting attribute is chosen by the minimum Gini_Index which is otherwise
maximum ∆Gini because it reduces the impurity.
2. Compute Gini_Index for each of the attribute and for the subsets of each attribute in the training
dataset.
3. Choose the best splitting subset which has minimum Gini_Index for an attribute.
6. The best split attribute with the best split subset is placed as the root node.
7. The root node is branched into two subtrees with each subtree an outcome of the test condition of
the root node attribute. Accordingly, the training dataset is also split into two subsets.
8. Recursively apply the same operation for the subset of the training set with the remaining
attributes until a leaf node is derived or no more training instances are available in the subset.
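A minimal Python sketch of the Gini index and the search over binary subset splits for a categorical attribute (rows are dictionaries; names are illustrative, not from the source):

```python
from collections import Counter
from itertools import combinations

def gini(rows, target):
    """Gini_Index(T) = 1 - sum(P_i^2) over the target classes."""
    total = len(rows)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2
                     for c in Counter(r[target] for r in rows).values())

def best_binary_split(rows, attribute, target):
    """Try every proper subset pairing, e.g. {a1} vs {a2, a3}, and keep the minimum."""
    values = sorted(set(r[attribute] for r in rows))
    best = (None, float('inf'))
    for size in range(1, len(values) // 2 + 1):
        for subset in combinations(values, size):
            s1 = [r for r in rows if r[attribute] in subset]
            s2 = [r for r in rows if r[attribute] not in subset]
            weighted = (len(s1) / len(rows)) * gini(s1, target) + \
                       (len(s2) / len(rows)) * gini(s2, target)
            if weighted < best[1]:
                best = (set(subset), weighted)
    return best   # (subset with minimum Gini_Index, its weighted Gini)
```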
Problem 1:
Construct a decision tree using CART algorithm for the following dataset
Solution
Step 1: Calculate the Gini_Index for the dataset, which consists of 10 data instances. The target attribute 'Job Offer' has 7 instances of Yes and 3 instances of No.
Gini_Index(T) = 1 − (7/10)² − (3/10)² = 1 − 0.49 − 0.09 = 0.42
Attribute- CGPA
Gini_Index(T, CGPA ∈ {≥8, <8}, ≥9) = (6/10) × 0.445 + (4/10) × 0.375 = 0.417
Gini_Index of CGPA
Subset Gini_Index
(≥ 9, ≥ 8) <8 0.1755
(≥ 9, < 8) ≥8 0.3
(≥ 8, < 8) ≥9 0.417
Attribute- Interactiveness
Gini_Index(T, Interactiveness ∈ {Yes, No}) = (6/10) × 0.28 + (4/10) × 0.5 = 0.168 + 0.2 = 0.368
Gini_Index of Interactiveness
Subset Gini_Index
Yes No 0.368
Attribute - Practical Knowledge
Compute ∆Gini and the best splitting subset of this attribute.
Practical Knowledge   Job Offer = Yes   Job Offer = No
Very Good   2   0
Good        4   1
Average     1   2
Gini_Index(T, Practical Knowledge ∈ {Very Good, Good}, Average) = (7/10) × 0.245 + (3/10) × 0.445 = 0.3054
Gini_Index(T, Practical Knowledge ∈ {Very Good, Average}) = 1 − (3/5)² − (2/5)² = 1 − 0.36 − 0.16 = 0.48
Gini_Index(T, Practical Knowledge ∈ {Very Good, Average}, Good) = (5/10) × 0.48 + (5/10) × 0.32 = 0.40
Subset Gini_Index
{Very Good, Good} Average 0.3054
{Very Good, Average} Good 0.40
{Good, Average} Very Good 0.3750
Attribute - Communication Skills
Communication Skills   Job Offer = Yes   Job Offer = No
Good       4   1
Moderate   3   0
Poor       0   2
Gini_Index(T, Communication Skills ∈ {Good, Poor}) = 1 − (4/7)² − (3/7)² = 1 − 0.5101 = 0.489
Gini_Index(T, Communication Skills ∈ {Moderate, Poor}, Good) = (5/10) × 0.48 + (5/10) × 0.32 = 0.40
Subset Gini_Index
{ Good, Moderate} Poor 0.1755
{Good, Poor} Moderate 0.3429
{Moderate, Poor} Good 0.40
Since both CGPA and Communication Skills have the same ∆Gini value, which is the highest among all the attributes, we can choose either CGPA or Communication Skills as the best split attribute. Here we have chosen CGPA.
Attribute- Interactiveness
Gini_Index of Interactiveness
Subset Gini_Index
Yes No 0.056
Attribute - Practical Knowledge
Compute ∆Gini and the best splitting subset of this attribute.
Practical Knowledge   Job Offer = Yes   Job Offer = No
Very Good   2   0
Good        4   0
Average     1   1
Gini_Index(T, Practical Knowledge ∈ {Very Good, Good}, Average) = (6/8) × 0 + (2/8) × 0.5 = 0.125
Gini_Index(T, Practical Knowledge ∈ {Very Good, Average}, Good) = (4/8) × 0.375 + (4/8) × 0 = 0.1875
Gini_Index(T, Practical Knowledge ∈ {Good, Average}, Very Good) = (6/8) × 0.278 + (2/8) × 0 = 0.2085
Subset Gini_Index
{Very Good, Good} Average 0.125
{Very Good, Average} Good 0.1875
{Good, Average} Very Good 0.2085
Attribute - Communication Skills
Communication Skills   Job Offer = Yes   Job Offer = No
Good       4   0
Moderate   3   0
Poor       0   1
Gini_Index(T, Communication Skills ∈ {Good, Poor}) = 1 − (4/5)² − (1/5)² = 1 − 0.64 − 0.04 = 0.32
Subset Gini_Index
{ Good, Moderate} Poor 0
{Good, Poor} Moderate 0.2
{Moderate, Poor} Good 0.1875
∆Gini(Communication Skills) = 0.2184 − 0 = 0.2184
Problem 2:
Apply CART algorithm on the below dataset and construct a decision Tree.
Solution:
Regression Trees
Regression trees are a variant of decision trees where the target feature is a continuous valued
variable. These trees can be constructed using an algorithm called reduction in variance which uses
standard deviation to choose the best splitting attribute.
1. Compute standard deviation for each attribute with respect to target attribute.
2. Compute standard deviation for the number of data instances of each distinct value of an
attribute.
3. Compute weighted standard deviation for each attribute.
4. Compute standard deviation reduction by subtracting weighted standard deviation for each
attribute from standard deviation of each attribute.
5. Choose the attribute with the highest standard deviation reduction as the best split attribute.
6. The best split attribute is placed as the root node.
7. The root node is branched into subtrees with each subtree as an outcome of the test condition of
the root node attribute. Accordingly, the training dataset is also split into different subsets.
8. Recursively apply the same operation for the subset of the training set with the remaining
attributes until a leaf node is derived or no more training instances are available in the subset
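A minimal Python sketch of the standard deviation reduction computed in steps 1-4 for one candidate attribute (rows are dictionaries with a numeric target column; names are illustrative, not from the source):

```python
import statistics
from collections import defaultdict

def std_dev_reduction(rows, attribute, target):
    """rows: list of dicts; `target` must be a numeric column."""
    # Standard deviation of the target over the whole set.
    sd_total = statistics.pstdev(r[target] for r in rows)
    # Weighted standard deviation over the subsets for each value of the attribute.
    groups = defaultdict(list)
    for r in rows:
        groups[r[attribute]].append(r[target])
    weighted = sum((len(vals) / len(rows)) * statistics.pstdev(vals)
                   for vals in groups.values())
    # The attribute with the highest reduction is chosen as the best split.
    return sd_total - weighted
```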
Overfitting is a general problem with decision trees. Once the decision tree is constructed it must
be validated for better accuracy and to avoid over-fitting and under-fitting.
There is always a tradeoff between accuracy and complexity of the tree. The tree must be simple
and accurate.
If the tree is more complex, it can classify the data instances accurately for the training set but
when test data is given, the tree constructed may perform poorly which means misclassifications
are higher and accuracy is reduced. This problem is called as over-fitting.
To avoid overfitting of the tree, one needs to prune the tree and construct an optimal decision tree. Trees can be pre-pruned or post-pruned.
If tree nodes are pruned during construction, or the construction is stopped early without exploring the nodes' branches, it is called pre-pruning.
If tree nodes are pruned after the construction is over, it is called post-pruning.
Basically, the dataset is split into three sets called training dataset, validation dataset and test
dataset. Generally, 40% of the dataset is used for training the decision tree and the remaining
60% is used for validation and testing.
Once the decision tree is constructed, it is validated with the validation dataset and the
misclassifications are identified.
Using the number of instances correctly classified and number of instances wrongly classified,
Average Squared Error (ASE) is computed.
The tree nodes are pruned based on these computations and the resulting tree is validated until
we get a tree that performs better.
Cross validation is another way to construct an optimal decision tree.
Another approach is that after the tree is constructed using the training set, statistical tests like
error estimation and Chi-square test are used to estimate whether pruning or splitting is required
for a particular node to find a better accurate tree.
The third approach is using a principle called Minimum Description Length which uses a
complexity measure for encoding the training set and the growth of the decision tree is stopped
when the encoding size (i.e., size(tree) + size(misclassifications(tree))) is minimized.
Some of the tree pruning methods are listed below:
1. Reduced Error Pruning
2. Minimum Error Pruning (MEP)
3. Pessimistic Pruning
4. Error-based Pruning (EBP)
5. Optimal Pruning
6. Minimum Description Length (MDL) Pruning
7. Minimum Message Length Pruning
8. Critical Value Pruning