DT Dr. Rabi Shaw
DT Dr. Rabi Shaw
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
January 4, 2025
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Data
Similarity Measures Between Data Objects
Introduction to Classification
Decision Tree
Data
Collection of data objects and their attributes.
An attribute is a property or characteristic of an object.
A collection of attributes describe an object.
Classification Techniques
Dr. Rabi Shaw
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification
Definition
The task of classification is to assign an object into one of the
predefined categories.
Definition
Given a dataset D of objects and a set of classes C, the classification
problem is to define a mapping f : D → C, where each object is
assigned to only one class. A class Cj ∈ C contains precisely those
objects mapped to it. Cj = {oi | f (oi ) = Cj }
Classification Techniques
Dr. Rabi Shaw
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification
Working Principle
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Process
Training Set
(xi, y i) Classifier
l
Model (xk , y )
(xk , ?)
Test Set
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Decision Tree
Table: Training Set (Source: Tan, Kumar and Steinbach, Introduction to Data Mining)
INCOME No
<80K >80K
No YES
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Refund
s No
Ye
Marital Status
No
Married
Single,Divorced
INCOME No
<80K >80K
No YES
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Refund
s No
Ye
Marital Status
No
Married
Single,Divorced
INCOME No
<80K >80K
No YES
Decision Tree
Model is interpretable
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Table: Training Set (Source: Tan, Kumar and Steinbach, Introduction to Data Mining)
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Splitting Criterion
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
րM1
Using Attribute AցM 2
րM3
Using Attribute BցM 4
(xk , ?)
Test Set
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
(xk , ?)
Test Set
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
2
P
GINI (t) = 1 − j [p(j|t)]
GINI =?
Class #object
C1 0
C2 6
GINI =?
Class #object
C1 1
C2 5
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
2
P
GINI (t) = 1 − j [p(j|t)]
GINI =?
Class #object
C1 0
C2 6
GINI =?
Class #object
C1 1
C2 5
GINI =?
Class #object
C1 3
C2 3
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
2
P
GINI (t) = 1 − j [p(j|t)]
GINI =?
Class #object
C1 0
C2 6
GINI =?
Class #object
C1 1
C2 5
GINI =?
Class #object
C1 3
C2 3
GINI = 0.278
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
2
P
GINI (t) = 1 − j [p(j|t)]
ni = # records/objects at child i .
n = # records at p.
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Attribute AրN
ցN2
1
Class N1 N2
C1 5 1
C2 2 4
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Attribute AրN
ցN2
1
Class N1 N2
C1 5 1
C2 2 4
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Attribute AրN
ցN2
1
Class N1 N2
C1 5 1
C2 2 4
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Entropy
X
Entropy (t) = − p(j/t) log2 p(j/t)
j
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
P
Entropy (t) = − j p(j/t) log2 p(j/t)
Entropy =?
Class #object
C1 0
C2 6
Entropy =?
Class #object
C1 1
C2 5
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
P
Entropy (t) = − j p(j/t) log2 p(j/t)
Entropy =?
Class #object
C1 0
C2 6
Entropy =?
Class #object
C1 1
C2 5
Entropy =?
Class #object
C1 3
C2 3
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
P
Entropy (t) = − j p(j/t) log2 p(j/t)
Entropy =?
Class #object
C1 0
C2 6
Entropy =?
Class #object
C1 1
C2 5
Entropy =?
Class #object
C1 3
C2 3
Entropy =?
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Information Gain
P ni
Information Gain (A) = Entropy (A) − i ( n )Entropy (ni )
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Attribute AրN
ցN2
1
Class N1 N2
C1 5 1
C2 2 4
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Attribute AրN
ցN2
1
Class N1 N2
C1 5 1
C2 2 4
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Attribute AրN
ցN2
1
Class N1 N2
C1 5 1
C2 2 4
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Attribute AրN
ցN2
1
Class N1 N2
C1 5 1
C2 2 4
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Gain Ratio
Information Gain is biased towards the test with many
outcomes.
Gain(A)
Gain Ratio(A) =
SPLITINFO(A)
Attribute with maximum Gain Ratio is selected as the splitting
attribute.
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Error
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Stopping Criteria
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Decision Tree
Inexpensive to construct
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Model Overfitting
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Presence of Noise :
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Presence of Noise :
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
P
Error on training set: e(T ) = i e(t)
P ′
Generalization error : j e (t)
P ′
Estimating j e (t)
For each leaf node e ′ (t) = e(t) × 0.5
Estimated Error=e(T ) + N × 0.5,
N = number of leaf nodes in the DT.
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
P
Error on training set: e(T ) = i e(t)
P ′
Generalization error : j e (t)
P ′
Estimating j e (t)
For each leaf node e ′ (t) = e(t) × 0.5
Estimated Error=e(T ) + N × 0.5,
N = number of leaf nodes in the DT.
For a tree with 40 leaf nodes and 10 errors on training (out of
1000 instances):
Estimated Error=
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Algorithm:
KNN = ∅
For each t ∈ T
1 if |KNN| <= K
KNN = KNN ∪ {t}
2 else
1 Find a x ′ ∈ KNN such that dis(x, x ′ ) > dis(x, t)
2 KNN = KNN − {x ′ }; KNN = KNN ∪ {t}
The pattern x belongs to a Class in which most of the
patterns in KNN belong to.
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Model Evaluation
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
TP + TN
Accuracy=
TP + FP + FN + TN
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Limitation of Accuracy
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Limitation of Accuracy
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Classification Cost
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Accuracy=? Cost=?
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
Cost-Sensitive Measures
Precision(P)=
TP
TP + FP
Recall(R)=
TP
TP + FN
F-measure=
2×P ×R
P +R
Classification Techniques
Outline
Data
Introduction to Classification
Decision Tree
Distance Based Classification Algorithms
THANK YOU