Data Mining
Classification: Decision Tree Induction, ID3, C4.5, CART
            By Eesha Tur Razia Babar
    Classification: Definition
●   Given a collection of records (training set )
     – Each record is characterized by a tuple (x, y),
       where x is the attribute set and y is the class
       label
         ◆ x: attribute, predictor, independent variable, input
         ◆ y: class, response, dependent variable, output
●   Task:
     – Learn a model that maps each attribute set x into
       one of the predefined class labels y
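A minimal sketch of this task in code, assuming scikit-learn is available: a decision tree classifier is fit to a tiny, made-up training set (the attribute values and labels below are invented for illustration) and then applied to a new record.

from sklearn.tree import DecisionTreeClassifier

# x: attribute set (here [home_owner (0/1), annual income in thousands]); y: class label
X_train = [[1, 125], [0, 100], [0, 70], [1, 120], [0, 95], [0, 60]]
y_train = ["No", "No", "No", "No", "Yes", "No"]

model = DecisionTreeClassifier()      # learn a model that maps each attribute set x ...
model.fit(X_train, y_train)           # ... to one of the predefined class labels y

print(model.predict([[0, 85]]))       # predicted class label for an unseen record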
Examples of Classification Task
Task                         | Attribute set, x                                          | Class label, y
Categorizing email messages  | Features extracted from email message header and content  | spam or non-spam
Identifying tumor cells      | Features extracted from x-rays or MRI scans               | malignant or benign cells
Cataloging galaxies          | Features extracted from telescope images                  | elliptical, spiral, or irregular-shaped galaxies
Descriptive Modeling
●    A classification model can serve as an explanatory tool to distinguish between objects of different classes.
Predictive Modeling
●    A classification model can be used to predict the class label of previously unseen (unlabeled) records.
General Approach for Building a Classification Model
●    A learning algorithm is applied to a training set (records with known class labels) to induce a classification model; the model is then applied to a test set of records with unknown class labels.
Performance Metrics
●    Accuracy = (number of correct predictions) / (total number of predictions)
●    Error rate = (number of wrong predictions) / (total number of predictions) = 1 – Accuracy
Classification Techniques
●    Base Classifiers
       –   Decision Tree based Methods
       –   Rule-based Methods
       –   Nearest-neighbor
       –   Naïve Bayes and Bayesian Belief Networks
       –   Support Vector Machines
       –   Neural Networks, Deep Neural Nets
●    Ensemble Classifiers
       – Boosting, Bagging, Random Forests
Example of a Decision Tree
[Figure: a training-data table with two categorical attributes, one continuous attribute, and a class label, shown next to the model induced from it.]

    Model: Decision Tree (splitting attributes shown at the internal nodes)

    Home Owner?
        Yes → NO
        No  → MarSt?
                  Married          → NO
                  Single, Divorced → Income?
                                         < 80K → NO
                                         ≥ 80K → YES
EXAMPLE
Apply Model to Test Data

Test Data: [the test record is shown in the figure]

Start from the root of the tree and follow the branch that matches the record at each test condition:

    Home Owner?
        Yes → NO
        No  → MarSt?
                  Married          → NO
                  Single, Divorced → Income?
                                         < 80K → NO
                                         ≥ 80K → YES
Following the test conditions from the root down to a leaf, the test record reaches a leaf labeled NO: assign Defaulted to “No”.
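The traversal above can be mimicked with a hand-coded version of the example tree; the nested dictionary and the test record below are assumptions made to mirror the figure, not extracted from it.

# Decision tree from the example, encoded as nested dicts (leaves are class labels).
tree = {
    "attr": "HomeOwner",
    "Yes": "No",                                  # leaf: Defaulted = No
    "No": {
        "attr": "MarSt",
        "Married": "No",                          # leaf: Defaulted = No
        "Single": {"attr": "Income", True: "No", False: "Yes"},     # Income < 80K ?
        "Divorced": {"attr": "Income", True: "No", False: "Yes"},
    },
}

def classify(record, node):
    """Start from the root and follow the test conditions until a leaf is reached."""
    while isinstance(node, dict):
        if node["attr"] == "Income":
            node = node[record["Income"] < 80]    # binary test: Income < 80K
        else:
            node = node[record[node["attr"]]]     # follow the branch for the record's value
    return node

test_record = {"HomeOwner": "No", "MarSt": "Married", "Income": 80}
print(classify(test_record, tree))                # -> "No": assign Defaulted to "No"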
Another Example of Decision Tree
[Figure: the same training data shown next to an alternative tree.]

    MarSt?
        Married          → NO
        Single, Divorced → Home Owner?
                               Yes → NO
                               No  → Income?
                                         < 80K → NO
                                         ≥ 80K → YES

    There could be more than one tree that fits the same data!
Decision Tree Classification Task
Decision Tree Induction (How to Build a Decision Tree)
●    Many Algorithms:
      – Hunt’s Algorithm (one of the earliest), which forms the
        basis of the algorithms below
      – CART, ID3, C4.5, SLIQ, SPRINT
●    Non-parametric algorithms do not assume a fixed form
     for the relationship between input and output. Instead,
     they adapt to the data's structure. Examples include:
1.   Decision Trees
2.   k-Nearest Neighbors (k-NN)
Greedy Algorithms
●    In computer science, a greedy algorithm builds a solution
     by making the choice that looks best at each step. It picks
     the locally optimal option at the moment, without regard
     for whether the resulting sequence of choices leads to a
     globally optimal solution
●    One such algorithm is Hunt’s algorithm, which is
     the basis of many existing decision tree induction
     algorithms, including ID3, C4.5, and CART
General Structure of Hunt’s Algorithm
●    Let Dt be the set of training
     records that reach a node t
●    General Procedure:
      – If Dt contains records that belong to the same class yt,
        then t is a leaf node labeled as yt
      – If Dt contains records that belong to more than one class,
        use an attribute test to split the data into smaller
        subsets. Recursively apply the procedure to each subset
        (see the sketch below).
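A schematic sketch of this recursive procedure, under the assumption that records are (attribute-dict, label) pairs and that attributes are simply consumed left to right; a real implementation would choose each test condition with an impurity measure, as discussed later.

from collections import defaultdict

def hunt(records, attributes):
    labels = [y for _, y in records]
    if len(set(labels)) == 1:                      # Dt contains a single class y_t:
        return labels[0]                           # t becomes a leaf labeled y_t
    if not attributes:                             # no attributes left to test:
        return max(set(labels), key=labels.count)  # leaf labeled with the majority class
    attr = attributes[0]                           # placeholder attribute-selection rule
    partitions = defaultdict(list)
    for x, y in records:                           # attribute test splits Dt into subsets
        partitions[x[attr]].append((x, y))
    return {attr: {value: hunt(subset, attributes[1:])   # recurse on each subset
                   for value, subset in partitions.items()}}

# Illustrative (made-up) records in the style of the loan-default example:
data = [({"HomeOwner": "Yes", "MarSt": "Single"}, "No"),
        ({"HomeOwner": "No",  "MarSt": "Married"}, "No"),
        ({"HomeOwner": "No",  "MarSt": "Single"}, "Yes")]
print(hunt(data, ["HomeOwner", "MarSt"]))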
Hunt’s Algorithm

[Figure: Hunt's algorithm grows the tree step by step on the loan-default training data. The root node holds all records with class counts (7, 3); each attribute test splits the records into purer subsets such as (3, 0), until the remaining impure subset (1, 3) is resolved into (1, 0) and (0, 3) leaves.]
Special Case 1:
●    It is possible for some of the child nodes created in Step
     2 to be empty; i.e., there are no records associated with
     these nodes. This can happen if none of the training
     records have a combination of attribute values
     associated with such nodes. In this case, the node is
     declared a leaf node with the same class label as the
     majority class of training records associated with its
     parent node.
Special Case 2:
●    In Step 2, if all the records associated with Dt
     have identical attribute values (except for the
     class label), then it is not possible to split these
     records any further. In this case, the node is
     declared a leaf node with the same class label as
     the majority class of training records associated
     with this node.
Design Issues of Decision Tree Induction
●    How should training records be split?
      – Method for expressing test condition
           ◆   depending on attribute types
      – Measure for evaluating the goodness of a test
        condition
●    How should the splitting procedure stop?
      – Stop splitting if all the records belong to the
        same class or have identical attribute values
      – Early termination
Methods for Expressing Test Conditions
●    Depends on attribute types
      – Binary
      – Nominal
      – Ordinal
      – Continuous
Test Condition for Binary Attribute
●    The test condition for a
     binary attribute
     generates two
     potential outcomes.
Test Condition for Nominal Attributes
●    Multi-way split:
     – Use as many partitions as
       distinct values.
●    Binary split:
     – Divides values into two subsets
Test Condition for Ordinal Attributes
●    Multi-way split:
     – Use as many partitions
       as distinct values
●    Binary split:
     – Divides values into two
        subsets
      – Preserve order property among attribute values
        (a grouping that mixes non-adjacent values violates
        the order property)
Test Condition for Continuous Attributes
Splitting Based on Continuous Attributes
●    Different ways of handling
      – Discretization to form an ordinal categorical
         attribute
            ◆ Ranges can be found by equal-interval bucketing,
              equal-frequency bucketing (percentiles), or clustering
            ◆ Static – discretize once at the beginning
            ◆ Dynamic – repeat at each node
       – Binary Decision: (A < v) or (A ≥ v)
            ◆ considers all possible splits and finds the best cut
              (see the sketch below)
            ◆ can be more compute intensive
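A sketch of the binary-decision approach for a continuous attribute: sort the values, take midpoints between consecutive distinct values as candidate cut points v, and keep the cut (A < v vs. A ≥ v) with the lowest weighted Gini index. The data at the bottom are illustrative values, not taken from the slides.

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_cut(values, labels):
    pairs = sorted(zip(values, labels))
    best_v, best_w = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                   # skip duplicate values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2        # candidate cut point
        left = [y for x, y in pairs if x < v]
        right = [y for x, y in pairs if x >= v]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if w < best_w:                                 # keep the purest split so far
            best_v, best_w = v, w
    return best_v, best_w

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
defaulted = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_cut(income, defaulted))                     # best threshold and its weighted Gini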
How to determine the Best Split
Before Splitting: 10 records of class 0,
        10 records of class 1
             Which test condition is the best?
How to determine the Best Split
●    Greedy approach:
      – Nodes with purer class distribution are
        preferred
●    Need a measure of node impurity:
                (Figure: class distributions ranging from a high degree of impurity to a low degree of impurity)
Measures of Node Impurity
●    Gini Index
●    Entropy
●    Misclassification error
Finding the Best Split
1.       Compute impurity measure (P) before splitting
2.       Compute impurity measure (M) after splitting
            ● Compute impurity measure of each child node
            ● M is the weighted impurity of child nodes
3.       Choose the attribute test condition that
         produces the highest gain
                Gain = P - M
          or equivalently, lowest impurity measure after splitting
          (M)
Finding the Best Split
  Before splitting, the impurity of the parent node is P.
  Splitting on A produces nodes N1 and N2 with impurities M11 and M12; their weighted impurity is M1.
  Splitting on B produces nodes N3 and N4 with impurities M21 and M22; their weighted impurity is M2.
  Choose the split with the larger gain: Gain = P – M1 vs. P – M2.
Compute Impurity Measures
Measure of Impurity: GINI
●    Gini Index for a given node t:
        GINI(t) = 1 – Σ_j [ p( j | t ) ]²
     where p( j | t ) is the relative frequency of class j at node t
      – For a 2-class problem (p, 1 – p):
          ◆   GINI = 1 – p² – (1 – p)² = 2p(1 – p)
Computing Gini Index of a Single Node
             P(C1) = 0/6 = 0              P(C2) = 6/6 = 1
             Gini = 1 – P(C1)² – P(C2)² = 1 – 0 – 1 = 0

             P(C1) = 1/6                  P(C2) = 5/6
             Gini = 1 – (1/6)² – (5/6)² = 0.278

             P(C1) = 2/6                  P(C2) = 4/6
             Gini = 1 – (2/6)² – (4/6)² = 0.444
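These three computations can be reproduced with a small helper (a sketch taking the list of class counts at the node):

def gini_node(counts):
    """Gini index of a node given its class counts, e.g. [count(C1), count(C2)]."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini_node([0, 6]))             # 0.0
print(round(gini_node([1, 5]), 3))   # 0.278
print(round(gini_node([2, 4]), 3))   # 0.444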
Information Gain
●   To determine how well a test condition performs, we
    need to compare the degree of impurity of the parent
    node (before splitting) with the degree of impurity of
    the child nodes (after splitting). The larger their
    difference, the better the test condition.
Binary Attributes: Computing Information Gain
  Split on B into node N1 (Yes) and node N2 (No):

  Gini(N1) = 1 – (5/6)² – (1/6)² = 0.278
  Gini(N2) = 1 – (2/6)² – (4/6)² = 0.444

  Weighted Gini of N1, N2 = 6/12 × 0.278 + 6/12 × 0.444 = 0.361
  Gain = 0.486 – 0.361 = 0.125
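The same numbers in code; the parent class distribution (7, 5) is an assumption chosen only to be consistent with the slide's parent Gini of 0.486.

def gini_node(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = gini_node([7, 5])                                          # ~0.486
children = 6 / 12 * gini_node([5, 1]) + 6 / 12 * gini_node([2, 4])  # 0.361
print(round(parent - children, 3))                                  # Gain = 0.125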
Categorical Attributes: Computing Gini Index
●    For each distinct value, gather counts for each class in
     the dataset
●    Use the count matrix to make decisions
       Multi-way split                                       Two-way split
                                                     (find best partition of values)
                         Which of these is the best?
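A sketch of comparing candidate groupings of a nominal attribute via a count matrix; the attribute values and class counts below are illustrative assumptions, not the slide's figures.

def gini_from_counts(count_matrix):
    """count_matrix: {partition: [count of C1, count of C2, ...]} -> weighted Gini."""
    total = sum(sum(c) for c in count_matrix.values())
    weighted = 0.0
    for counts in count_matrix.values():
        n = sum(counts)
        gini = 1.0 - sum((c / n) ** 2 for c in counts)
        weighted += n / total * gini                 # weight each partition by its size
    return weighted

# Multi-way split vs. one possible two-way grouping (made-up counts):
print(gini_from_counts({"Family": [1, 3], "Sports": [8, 0], "Luxury": [1, 7]}))
print(gini_from_counts({"Sports, Luxury": [9, 7], "Family": [1, 3]}))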
Measure of Impurity: Entropy
●    Entropy for a given node t:
        Entropy(t) = – Σ_j p( j | t ) log₂ p( j | t )
     where p( j | t ) is the relative frequency of class j at node t
Computing Entropy of a Single Node
             P(C1) = 0/6 = 0             P(C2) = 6/6 = 1
             Entropy = – 0 log₂ 0 – 1 log₂ 1 = – 0 – 0 = 0

             P(C1) = 1/6                P(C2) = 5/6
             Entropy = – (1/6) log₂ (1/6) – (5/6) log₂ (5/6) = 0.65

             P(C1) = 2/6                P(C2) = 4/6
             Entropy = – (2/6) log₂ (2/6) – (4/6) log₂ (4/6) = 0.92
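A small helper reproducing these entropy values (log base 2, with the usual convention 0 · log 0 = 0):

from math import log2

def entropy_node(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(round(entropy_node([0, 6]), 2))   # 0.0
print(round(entropy_node([1, 5]), 2))   # 0.65
print(round(entropy_node([2, 4]), 2))   # 0.92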
Computing Information Gain After Splitting
●    Gain_split = Entropy(parent) – Σ_i (n_i / n) × Entropy(child i),
     where n_i is the number of records in child i and n is the
     number of records at the parent node
Problem with large number of partitions
●    Node impurity measures tend to prefer splits that
     result in a large number of partitions, each being
     small but pure
      – Customer ID has highest information gain
        because entropy for all the children is zero
Gain Ratio
●    Gain Ratio = Gain_split / SplitINFO, where
        SplitINFO = – Σ_i (n_i / n) log₂ (n_i / n)
     and n_i is the number of records in partition i. SplitINFO
     penalizes test conditions that produce a large number of
     small partitions.
●    Example SplitINFO values for three candidate test conditions:
        SplitINFO = 1.52              SplitINFO = 0.72                SplitINFO = 0.97
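A sketch of the split-information computation; the partition sizes below are assumptions chosen only to reproduce the three SplitINFO values shown above (one three-way split and two alternative two-way groupings of 20 records).

from math import log2

def split_info(partition_sizes):
    """SplitINFO = - sum_i (n_i / n) * log2(n_i / n)."""
    n = sum(partition_sizes)
    return -sum((k / n) * log2(k / n) for k in partition_sizes if k > 0)

print(round(split_info([4, 8, 8]), 2))   # 1.52  (three partitions)
print(round(split_info([16, 4]), 2))     # 0.72  (two partitions)
print(round(split_info([8, 12]), 2))     # 0.97  (two partitions)
# Gain Ratio = Gain_split / SplitINFO, so splits into many small partitions are penalized.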
Measure of Impurity: Classification Error
●    Classification error at a node t:
        Error(t) = 1 – max_i p( i | t )
     where p( i | t ) is the relative frequency of class i at node t
Computing Error of a Single Node
            P(C1) = 0/6 = 0             P(C2) = 6/6 = 1
            Error = 1 – max (0, 1) = 1 – 1 = 0
            P(C1) = 1/6                P(C2) = 5/6
            Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6
            P(C1) = 2/6                P(C2) = 4/6
            Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3
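The three error values above can be reproduced with a small helper taking the class counts at the node:

def classification_error(counts):
    """Error(t) = 1 - max_i p(i|t), given the class counts at node t."""
    n = sum(counts)
    return 1.0 - max(counts) / n

print(classification_error([0, 6]))             # 0.0
print(round(classification_error([1, 5]), 3))   # 0.167  (= 1/6)
print(round(classification_error([2, 4]), 3))   # 0.333  (= 1/3)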
Misclassification Error vs Gini Index
  Split on A into node N1 (Yes) and node N2 (No), with class counts N1 = (3, 0) and N2 = (4, 3):

  Gini(N1) = 1 – (3/3)² – (0/3)² = 0
  Gini(N2) = 1 – (4/7)² – (3/7)² = 0.489

  Gini(children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

  Gini improves but the misclassification error remains the same!
Why?
• Entropy is a quantity from information theory that
  measures the uncertainty (impurity) of a class distribution
• The misclassification error only computes the fraction of
  misclassified samples
• Therefore, compared with entropy, the misclassification
  error is less sensitive to changes in the class
  probabilities, which is why entropy is more often used
  when building a decision tree for classification.
Misclassification Error vs Gini Index
  [Figure: the same split on A into nodes N1 and N2 as above.]

         Misclassification error for all three cases = 0.3 !
Decision Tree Algorithm
Algorithm
●    The createNode() function extends the decision tree
     by creating a new node. A node in the decision tree
     has either a test condition, denoted as node.test_cond,
     or a class label, denoted as node.label.
●    The find_best_split() function determines which
     attribute should be selected as the test condition for
     splitting the training records. As previously noted, the
     choice of test condition depends on which impurity
     measure is used to determine the goodness of a split.
     Some widely used measures include entropy, the Gini
     index, and the χ² statistic.
Algorithm (continued)
●    The Classify() function determines the class label to be
     assigned to a leaf node. For each leaf node t, let p(i|t)
     denote the fraction of training records from class i
     associated with node t. In most cases, the leaf node is
     assigned to the class that holds the majority of the
     training records.
●    The stopping_cond() function terminates the tree-growing
     process by testing whether all the records have either the
     same class label or the same attribute values. Another way
     to terminate the recursion is to test whether the number of
     records has fallen below some minimum threshold. (A
     schematic sketch of the overall recursion follows below.)
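A schematic Python sketch of the tree-growing recursion described above. The helper names follow the text (createNode corresponds to the Node constructor, plus find_best_split, Classify, and stopping_cond), but their bodies are simplified assumptions rather than the textbook's code; find_best_split here uses the weighted Gini index.

class Node:
    def __init__(self):
        self.test_cond = None      # attribute test condition (internal node)
        self.label = None          # class label (leaf node)
        self.children = {}         # outcome -> child node

def tree_growth(E, F):
    """E: list of (attribute_dict, label) records; F: list of candidate attribute names."""
    node = Node()                                         # createNode()
    if stopping_cond(E, F):
        node.label = classify(E)                          # majority class at the leaf
        return node
    node.test_cond = find_best_split(E, F)                # attribute with lowest weighted Gini
    remaining = [a for a in F if a != node.test_cond]
    for outcome, subset in partition(E, node.test_cond).items():
        node.children[outcome] = tree_growth(subset, remaining)
    return node

def stopping_cond(E, F):
    # stop if all records share a class label, attributes are exhausted,
    # or all records have identical attribute values
    return len({y for _, y in E}) <= 1 or not F or all(x == E[0][0] for x, _ in E)

def classify(E):
    labels = [y for _, y in E]
    return max(set(labels), key=labels.count)             # majority class label

def partition(E, attr):
    groups = {}
    for x, y in E:
        groups.setdefault(x[attr], []).append((x, y))
    return groups

def find_best_split(E, F):
    def weighted_gini(attr):
        total, score = len(E), 0.0
        for subset in partition(E, attr).values():
            labels = [y for _, y in subset]
            probs = [labels.count(c) / len(labels) for c in set(labels)]
            score += len(subset) / total * (1.0 - sum(p * p for p in probs))
        return score
    return min(F, key=weighted_gini)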
Decision Tree Based Classification
●    Advantages:
       –   Relatively inexpensive to construct
       –   Extremely fast at classifying unknown records
       –   Easy to interpret for small-sized trees
       –   Robust to noise (especially when methods to avoid overfitting are
           employed)
●    Disadvantages:
        –   Due to the greedy nature of the splitting criterion, interacting attributes (that
            can distinguish between classes together but not individually) may be
            passed over in favor of other attributes that are less discriminating.
       –   Each decision boundary involves only a single attribute
CART, ID3, C4.5, SLIQ, SPRINT
Algorithms