Decision Trees
Decision Tree Overview
●   The algorithm uses a tree structure to model the relationships among the
    features and the potential outcomes
●   It breaks the dataset down into smaller and smaller subsets as the depth
    of the tree increases
●   It is essentially a flowchart for deciding how to classify a new
    observation (see the sketch below)
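A minimal sketch of this flowchart view (not from the original slides; it assumes scikit-learn and its bundled iris dataset):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented level is one more split; the leaves carry the predicted class.
print(export_text(tree, feature_names=list(iris.feature_names)))
```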
Decision Tree Overview
●   Decision trees are simple and useful for interpretation
●   They have a nice graphical representation
●   They are typically not competitive with the best supervised learning
    approaches
●   Random forests and boosting are ways of modifying the approach that
    sometimes result in dramatic increases in prediction accuracy
●   We will focus on basic Classification and Regression Trees (CART)
Terminology in CART
●   Root node – first decision node starting from the original data
●   Internal nodes – points along the tree where we split the data
●   Branch – segment of the tree that connects the nodes
●   Leaf/Terminal node – holds final result of splitting the data
Decision Tree Overview
[Figure: example tree with the root node at the top, internal nodes joined by branches at each split, and terminal nodes at the bottom]
Classification vs. Regression
●   Classification
    ○   Spam / not spam
    ○   Admit to ICU or not
    ○   Lend money / deny
    ○   Intrusion detection
●   Regression
    ○   Predicting stock returns
    ○   Pricing a house or a car
    ○   Weather prediction (temperature, rainfall, etc.)
    ○   Economic growth predictions
Decision Tree Overview
●   The decision tree algorithm learns (i.e., creates the decision tree from
    the data set) through the optimization of an error function
●   CART is a technique that can be used to solve both regression and
    classification problems (see the sketch below)
●   In this lecture we will focus on how to solve a classification problem;
    the approach for a regression problem is similar
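A rough sketch of both uses (not from the slides; assumes scikit-learn, with made-up toy data):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # one toy feature

# Classification: each leaf predicts the majority class of its region.
clf = DecisionTreeClassifier().fit(X, [0, 0, 1, 1])
print(clf.predict([[2.5]]))   # class of the region containing x = 2.5

# Regression: each leaf predicts the mean response of its region.
reg = DecisionTreeRegressor().fit(X, [1.2, 1.9, 3.1, 3.9])
print(reg.predict([[2.5]]))   # mean response of that region
```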
Visualizing a Classification tree
●   Let’s look at a simple example
●   How do we classify the observations into red or blue?
Steps in a decision tree
[Figure: the 2D feature space is split step by step into RED CLASS and BLUE CLASS regions]
Steps in a decision tree
●   Once the final divisions are made, if a new data point comes in with
    value (0.2, 0.3), the classification tree predicts that the new
    observation is Red, as it falls in a region where the majority is Red
    (see the sketch below)
[Figure: final partition of the feature space into RED CLASS and BLUE CLASS regions]
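A sketch of that prediction (assumes scikit-learn; the coordinates below are made up to mimic the slide's red/blue example):

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0.1, 0.2], [0.3, 0.4], [0.2, 0.1],   # hypothetical red points
     [0.8, 0.9], [0.7, 0.8], [0.9, 0.7]]   # hypothetical blue points
y = ["red", "red", "red", "blue", "blue", "blue"]

tree = DecisionTreeClassifier().fit(X, y)

# The new point (0.2, 0.3) falls in the majority-red region.
print(tree.predict([[0.2, 0.3]]))          # -> ['red']
```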
Representation as a tree
[Figure: the same partition drawn as a tree, with leaves labeled Red, Blue, Blue, and Red]
How do we know where to split?
●   The dataset below was easy: the pattern was simple to visualize. What
    about complicated datasets where the patterns are more complex? How does
    the algorithm decide where to split?
[Figure: the partitioned feature space with its BLUE CLASS and RED CLASS regions]
How do we know where to split?
●   We choose the variable and the split point that result in the smallest
    weighted sum of the Gini indices of the two new regions (weighted by the
    number of observations in each region)
Maximum and minimum of the Gini index (2 classes)
●   With two classes in proportions p and 1 − p, G = p(1 − p) + (1 − p)p = 2p(1 − p)
●   Minimum (pure node, p = 0 or 1): G = 1(1 − 1) + 0(1 − 0) = 0
●   Maximum (50/50 mix, p = 0.5): G = 0.5(1 − 0.5) + 0.5(1 − 0.5) = 0.5
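The same extremes, checked numerically (a tiny sketch, not from the slides):

```python
def gini(p):
    """Two-class Gini index of a node where a fraction p is class 1."""
    return p * (1 - p) + (1 - p) * p   # = 2p(1 - p)

print(gini(0.0), gini(1.0))   # 0.0 0.0 -> pure nodes, minimum impurity
print(gini(0.5))              # 0.5     -> 50/50 mix, maximum impurity
```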
Simple example
[Figure: a small worked example of the Gini calculation]
How do we know where to split?
[Figure: two candidate splits compared via the Gini indices G1 + G2 versus G3 + G4 of the regions they create]
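To make the split search concrete, here is a rough pure-Python sketch (not the slides' code; binary 0/1 labels, and it tries observed values as thresholds where real CART implementations use midpoints):

```python
def gini(labels):
    """Two-class Gini index of a region; empty regions count as pure."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)      # fraction belonging to class 1
    return 2 * p * (1 - p)

def best_split(X, y):
    """Try every feature and threshold; keep the smallest weighted Gini sum."""
    best = (None, None, float("inf"))  # (feature index, threshold, score)
    n = len(y)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left  = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best

X = [[0.1, 0.2], [0.3, 0.4], [0.8, 0.9], [0.7, 0.8]]   # made-up points
y = [0, 0, 1, 1]                                        # 0 = red, 1 = blue
print(best_split(X, y))   # -> (0, 0.3, 0.0): split feature 0 at 0.3
```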
Let’s build a small tree for a real-world problem
[Figure: root node split on Gender into a MALE node (Group 1) and a FEMALE node (Group 2)]
Let’s build a small tree for a real-world problem
●   A computer would calculate the Gini index for splits on the different
    independent variables – Occupation, Age, Gender, etc. – and find the
    optimal variable on which to split the data (as in the split-search
    sketch earlier)
Overfitting in Decision trees
●   Splitting non-stop eventually leads to each point being its own region
●   Such a decision tree model would perform very poorly on unseen test data
●   Our model is overfitting the data!
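A quick sketch of the symptom (assumes scikit-learn; the synthetic data and noise level are made up):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data, so a fully grown tree has noise to memorize.
X, y = make_classification(n_samples=500, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # no limits
print(deep.score(X_tr, y_tr))   # ~1.0: every training point is its own region
print(deep.score(X_te, y_te))   # noticeably lower on unseen test data
```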
How do we know when to stop?
●   One strategy is to stop once the decrease in the Gini index due to each
    split falls below some threshold
●   This is not ideal: a seemingly poor split early on may be followed by a
    very good one, and stopping early would miss it
●   A better strategy is to grow a very large tree and then prune it back to
    obtain a subtree (see the sketch below)
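One way to realize "grow large, then prune" is scikit-learn's cost-complexity pruning; a sketch (the choice of alpha below is arbitrary and would in practice be picked by cross-validation):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

full = DecisionTreeClassifier(random_state=0).fit(X, y)  # very large tree
path = full.cost_complexity_pruning_path(X, y)           # candidate alphas

# Refit with a nonzero alpha to obtain a pruned subtree.
pruned = DecisionTreeClassifier(
    ccp_alpha=path.ccp_alphas[-2], random_state=0).fit(X, y)
print(full.tree_.node_count, "->", pruned.tree_.node_count)
```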
Pruning
●   Pruning is not available in MS Azure
●   How do we decide the optimal size of a tree?
●   Answer – tuning (trial and error) and cross-validation
●   Build different decision trees with different hyperparameter values, use
    cross-validation to observe their performance, and pick the best one (see
    the sketch below)
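A minimal sketch of that tuning loop (assumes scikit-learn; the depths tried are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for depth in [2, 4, 8, None]:           # None = grow the tree fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    score = cross_val_score(tree, X, y, cv=5).mean()
    print(depth, round(score, 3))       # pick the depth with the best score
```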
Hyperparameters for Decision Tree
●   Maximum depth – the longest path from the root to a leaf; we can cap how
    deep the decision tree grows
●   Minimum number of samples per leaf – a minimum for the number of samples
    we allow in each leaf
●   Minimum samples to split – the minimum number of samples required to
    split an internal node
●   Maximum features – the number of features considered when looking for
    each split (all four appear in the sketch below)
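These map directly onto scikit-learn's DecisionTreeClassifier arguments (the values below are arbitrary examples, not recommendations):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,           # maximum depth: longest root-to-leaf path
    min_samples_leaf=10,   # minimum number of samples per leaf
    min_samples_split=20,  # minimum samples needed to split an internal node
    max_features=3,        # features considered at each split
)
```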
Advantages vs. Disadvantages
●   Advantages
    ○   Graphical representation of results: very easy to explain to people
        (probably even easier than linear regression)
    ○   Interpretation – the logic of the model is visible
    ○   Handles missing data
    ○   Handles numeric and categorical variables
    ○   Captures nonlinear effects
●   Disadvantages
    ○   Prediction accuracy is not as good as more complicated approaches
    ○   Computational issues with large categorical variables
    ○   The final tree is not very stable (a small change in the data can
        lead to a very different tree)