Decision Trees
Gonzalo Martínez Muñoz
Universidad Autónoma de Madrid
Outline
•  What is a decision tree?
•  History
•  Decision tree learning algorithm
•  Growing the tree
•  Pruning the tree
•  Capabilities
What is a decision tree?
•  Hierarchical learning model that recursively partitions the space using decision rules
•  Valid for classification and regression
•  Ok, but what is a decision tree?
Example for classification
•  Labor negotiations: predict if a collective agreement is good or bad

[Figure: a decision tree for the labor negotiations data, with a root node and an internal node splitting on wage increase 1st year (≤36% vs. >36%) and dental plan (≤full vs. >full), and leaf nodes labelled good/bad, next to the corresponding partition of the attribute space (dental plan: N/A, none, half, full vs. wage increase 1st year).]
Example for regression
•  Boston housing: predict house values in Boston by neighbourhood

[Figure: a regression tree for the Boston housing data, splitting on average rooms (≤5.5 vs. >5.5) and then on average house year (≤75 vs. >75), with leaf predictions of roughly 200000, 350000 and 500000, next to the corresponding partition of the space (ave. rooms vs. ave. house year).]
History
•  Precursors: Expert Based Systems (EBS)
   EBS = Knowledge database + Inference Engine
•  MYCIN: rule-based medical diagnosis system, about 600 rules
•  XCON: system for configuring VAX computers, about 2500 rules (1982)
•  The rules were created by experts by hand!!
•  Knowledge acquisition has to be automated
•  Substitute the expert with an archive of solved cases
History
•  CHAID (CHi-squared Automatic Interaction Detector), Gordon V. Kass, 1980
•  CART (Classification and Regression Trees), Breiman, Friedman, Olshen and Stone, 1984
•  ID3 (Iterative Dichotomiser 3), Quinlan, 1986
•  C4.5, Quinlan, 1993: based on ID3

Computational
•  Consider two binary variables. How many ways can
we split the space using a decision tree?

•  Two possible splits and two possible assignments to the leaf nodes → at least 8 possible trees
Computational
•  Under what conditions does someone wait in a restaurant?

There are 2 × 2 × 2 × 2 × 3 × 3 × 2 × 2 × 4 × 4 = 9216 possible cases
and two classes → 2^9216 possible hypotheses, and many more possible trees!!!!
Computational
•  It is just not feasible to find the optimal solution
•  A bias should be selected to build the models.
•  This is a general problem in Machine Learning.
Computational
For decision trees a greedy approach is generally selected:
•  The tree is built step by step, instead of all at once
•  At each step the best split with respect to the training data is selected (following a split criterion).
•  The tree is grown until a stopping criterion is met
•  The tree is generally pruned (following a pruning criterion) to avoid over-fitting.
Basic Decision Tree Algorithm
trainTree(Dataset L)
1. T = growTree(L)
2. pruneTree(T, L)
3. return T

(pruneTree removes subtrees whose validity is uncertain.)

growTree(Dataset L)
1. T.s = findBestSplit(L)
2. if T.s == null return null
3. (L1, L2) = splitData(L, T.s)
4. T.left = growTree(L1)
5. T.right = growTree(L2)
6. return T
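As a concrete illustration of this skeleton, here is a minimal pure-Python sketch for binary splits on numeric attributes, using Gini impurity (introduced on the following slides) inside findBestSplit and majority-vote leaves. The data layout (a list of (features, label) pairs) and all names are assumptions made for the example, not the author's implementation, and pruning is omitted:

from collections import Counter

class Node:
    def __init__(self, attr=None, thr=None, left=None, right=None, prediction=None):
        self.attr, self.thr = attr, thr            # split rule: x[attr] <= thr goes left
        self.left, self.right = left, right
        self.prediction = prediction               # set only on leaf nodes

def gini(labels):
    # i(t) = sum_i f_i * (1 - f_i)
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def find_best_split(data, min_gain=1e-12):
    # Exhaustive search over every attribute and every observed value:
    # O(#attributes x #points) candidate splits.
    labels = [y for _, y in data]
    best, best_gain = None, min_gain
    for a in range(len(data[0][0])):
        for thr in sorted({x[a] for x, _ in data}):
            left = [y for x, y in data if x[a] <= thr]
            right = [y for x, y in data if x[a] > thr]
            if not left or not right:
                continue
            p_left = len(left) / len(labels)
            gain = gini(labels) - p_left * gini(left) - (1 - p_left) * gini(right)
            if gain > best_gain:
                best, best_gain = (a, thr), gain
    return best                                    # None when no useful split exists

def grow_tree(data):
    split = find_best_split(data)
    if split is None:                              # stop: pure node or no impurity decrease
        return Node(prediction=Counter(y for _, y in data).most_common(1)[0][0])
    a, thr = split
    left = [(x, y) for x, y in data if x[a] <= thr]
    right = [(x, y) for x, y in data if x[a] > thr]
    return Node(a, thr, grow_tree(left), grow_tree(right))

def predict(node, x):
    while node.prediction is None:
        node = node.left if x[node.attr] <= node.thr else node.right
    return node.prediction

# Tiny usage example with two numeric attributes
data = [((1, 1), "bad"), ((1, 4), "good"), ((4, 1), "bad"), ((4, 4), "good")]
tree = grow_tree(data)
print(predict(tree, (2, 5)))   # -> good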
Finding the best split
findBestSplit(Dataset L)
1. Try all possible splits
2. return best

[Figure: a small 2-D training set with the candidate axis-parallel splits drawn on it.]

There are not too many candidate splits. Feasible computationally: O(# of attributes × # of points).
But which one is best?
Split Criterion
•  It should measure the impurity of a node:

   i(t) = \sum_{i=1}^{m} f_i (1 - f_i)        (Gini impurity, used in CART)

   where f_i is the fraction of instances of class i in node t

•  The improvement of a split is the variation of impurity before and after the split:

   \Delta i(t, s) = i(t) - p_L i(t_L) - p_R i(t_R)

   where p_L (p_R) is the proportion of instances going to the left (right) node
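In code, the two formulas above translate directly into two small functions. This is a sketch; representing a node by its vector of class counts is an assumption made for the example:

def gini(counts):
    # i(t) = sum_i f_i * (1 - f_i), with f_i the fraction of class i in the node
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

def impurity_decrease(parent_counts, left_counts, right_counts):
    # delta_i(t, s) = i(t) - p_L * i(t_L) - p_R * i(t_R)
    n = sum(parent_counts)
    p_left = sum(left_counts) / n
    p_right = sum(right_counts) / n
    return gini(parent_counts) - p_left * gini(left_counts) - p_right * gini(right_counts)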
Split Criterion
[Figure: the same 2-D training set (8 instances of one class, 4 of the other), with the impurity decrease of each candidate split written next to it.]

\Delta i(t, s) = i(t) - p_L i(t_L) - p_R i(t_R)

i(t) = 2 × 8/12 × 4/12 = 0.44
p_L = 5/12 ;  p_R = 7/12
i(t_L) = 2 × 2/5 × 3/5
i(t_R) = 2 × 6/7 × 1/7

\Delta i(t, s) = 0.102

Impurity decreases of the candidate splits: 0.011, 0.102, 0.026, 0.020, 0.044, 0.111, 0.056, 0.011

Which one is best?
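The numbers on this slide can be checked with plain arithmetic:

i_t = 2 * (8/12) * (4/12)        # 0.444..., shown rounded as 0.44
p_L, p_R = 5/12, 7/12
i_L = 2 * (2/5) * (3/5)          # 0.48
i_R = 2 * (6/7) * (1/7)          # 0.2449...
delta_i = i_t - p_L * i_L - p_R * i_R
print(round(delta_i, 3))         # 0.102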
Other splitting criteria
•  Based on entropy:

   H(t) = -\sum_{i=1}^{m} f_i \log_2 f_i

•  Information gain (used in ID3 and C4.5):

   IG(t, s) = H(t) - p_L H(t_L) - p_R H(t_R)

•  Information gain ratio (C4.5):

   IGR(t, s) = IG(t, s) / H(s)
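The same class-count representation used in the Gini sketch works for these criteria as well. A sketch, where H(s) in the gain ratio is taken as the entropy of the split proportions (the split information), matching the formula above:

from math import log2

def entropy(counts):
    # H(t) = -sum_i f_i * log2(f_i)
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, left_counts, right_counts):
    # IG(t, s) = H(t) - p_L * H(t_L) - p_R * H(t_R)
    n = sum(parent_counts)
    p_left, p_right = sum(left_counts) / n, sum(right_counts) / n
    return (entropy(parent_counts)
            - p_left * entropy(left_counts)
            - p_right * entropy(right_counts))

def gain_ratio(parent_counts, left_counts, right_counts):
    # IGR(t, s) = IG(t, s) / H(s); assumes both children are non-empty
    split_info = entropy([sum(left_counts), sum(right_counts)])
    return information_gain(parent_counts, left_counts, right_counts) / split_info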
Splitting criteria
•  Any splitting criterion with this shape is good: zero for a pure node and maximal when the classes are evenly mixed

[Figure: impurity as a function of the class fraction (this is for binary problems).]
Full grown tree

[Figure: the 2-D training set and a fully grown tree that fits it exactly, with splits on X1 and X2 (at thresholds 1.5, 2.5, 3.5 and 4.5) ending in good/bad leaves.]

Are we happy with this tree? How about this one?
Stopping criteria
•  All instances assigned to the node considered for splitting have the same class label
•  No split is found to further partition the data
•  The number of instances in the node is smaller than a predefined threshold
•  The impurity gain of the best split is below a given threshold
•  The tree has reached a maximum depth

The last three criteria are also called pre-pruning
Pruning
•  Another option is post-pruning (or simply pruning). It consists of:
•  Growing the tree as much as possible
•  Pruning it afterwards by substituting a subtree with a single leaf node if the error does not worsen significantly.
•  This process is continued until no more pruning is possible.
•  In effect we go back to smaller trees, but through a different path
•  The idea of pruning is to avoid overfitting
Cost-complexity pruning (CART)
•  Cost-complexity based pruning:

   R_\alpha(t) = R(t) + \alpha \cdot C(t)

•  R(t) is the error of the decision tree rooted at node t
•  C(t) is the number of leaf nodes of the subtree rooted at node t
•  The parameter α specifies the relative weight between the accuracy and the complexity of the tree
Pruning CART

[Figure: the fully grown tree from the previous example and a pruned version, together with the corresponding partitions of the 2-D training set.]

Let's say α = 0.1:

Pruned:    R_{\alpha=0.1}(t) = 1/5 + 0.1 · 1 = 0.3
Unpruned:  R_{\alpha=0.1}(t) = 0 + 0.1 · 5 = 0.5
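The comparison above is just the cost-complexity formula evaluated twice; a quick check in code:

def r_alpha(error, alpha, n_leaves):
    # R_alpha(t) = R(t) + alpha * C(t)
    return error + alpha * n_leaves

alpha = 0.1
print(round(r_alpha(1/5, alpha, 1), 2))   # pruned subtree: one leaf, error 1/5 -> 0.3
print(round(r_alpha(0.0, alpha, 5), 2))   # unpruned subtree: five leaves, error 0 -> 0.5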
Cost-complexity pruning (CART)
•  CART uses 10-fold cross-validation within the training data to estimate alpha: iteratively, nine folds are used for training a tree and one for testing.
•  A tree is trained on nine folds and pruned using all possible alphas (there are finitely many).
•  Each of those pruned trees is then tested on the remaining fold.
•  The process is repeated 10 times and the alpha value that gives the best generalization accuracy is kept.
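scikit-learn exposes this style of pruning through the ccp_alpha parameter of DecisionTreeClassifier and the cost_complexity_pruning_path helper. A simplified sketch of choosing alpha by 10-fold cross-validation (the dataset is a placeholder; CART's original procedure also recomputes the alpha sequence inside each training fold and often applies a 1-SE rule, which is omitted here):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # placeholder dataset

# The finite set of effective alphas for the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas

# Keep the alpha with the best cross-validated accuracy
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                          X, y, cv=10).mean()
          for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
print(best_alpha)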
Statistical pruning (C4.5)
•  C4.5 estimates the error at the leaf nodes using the upper confidence bound (a parameter) of a normal distribution, instead of the observed training error.
•  The error estimate for a subtree is the weighted sum of the error estimates of all its leaves.
•  This estimate is higher when few data instances fall on a leaf.
•  Hence, leaf nodes with few instances tend to be pruned.
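A sketch of this kind of estimate: the upper bound of a normal-approximation confidence interval on a leaf's error rate, where the confidence level (expressed here through z) is the parameter mentioned above. This follows the common textbook description of C4.5-style pessimistic estimates; the exact formula used by C4.5 may differ in its details:

from math import sqrt

def pessimistic_error(errors, n, z=0.67):
    # Upper confidence bound on the error rate of a leaf with n instances,
    # `errors` of which are misclassified. z = 0.67 roughly corresponds to a
    # one-sided 25% confidence level (C4.5's default confidence is 25%).
    f = errors / n
    num = f + z * z / (2 * n) + z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    return num / (1 + z * z / n)

# The same observed error rate (1/3) is penalized much more on a small leaf:
print(round(pessimistic_error(1, 3), 2))     # small leaf  -> larger estimate
print(round(pessimistic_error(10, 30), 2))   # larger leaf -> closer to 1/3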
Pruning (CART vs C4.5)
•  CART pruning is slower, since it has to build 10 extra trees to estimate alpha.
•  C4.5 pruning is faster; however, the algorithm does not propose a way to set the confidence threshold.
•  The statistical grounds for C4.5 pruning are questionable.
•  Using cross-validation is safer.
Missing values
•  What can be done if a value is missing?
•  Suppose the value of “Pat” for one instance is unknown.
•  The instance with the missing value falls through all three branches, but weighted (e.g. by the fraction of training instances that go down each branch)
•  The validity of the split is then computed as before
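A sketch of this weighted "fall through every branch" idea (C4.5-style fractional instances); the attribute name and branch sizes are illustrative only:

def distribute_missing(instance_weight, branch_sizes):
    # An instance whose attribute value is missing is sent down every branch,
    # carrying a fraction of its weight proportional to the size of each branch.
    total = sum(branch_sizes.values())
    return {branch: instance_weight * size / total
            for branch, size in branch_sizes.items()}

# Hypothetical split on "Pat" with three branches observed in the training data:
print(distribute_missing(1.0, {"None": 2, "Some": 4, "Full": 6}))
# -> {'None': 0.166..., 'Some': 0.333..., 'Full': 0.5}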
Oblique splits
•  The CART algorithm allows for oblique splits, i.e. splits that are not orthogonal to the attribute axes
•  The algorithm searches for hyperplanes with good impurity reduction
•  The tree-growing process becomes slower
•  But the trees become more expressive and compact
[Figure: an oblique split testing N1 > N2, with a '+' leaf on one branch and a '-' leaf on the other.]
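An oblique split is simply a linear test on a combination of attributes instead of a test on a single attribute. A minimal sketch (the weights below are illustrative, not the result of CART's search):

def axis_parallel_split(x):
    # Ordinary split: one attribute against one threshold
    return x[0] > 2.5

def oblique_split(x, w=(1.0, -1.0), b=0.0):
    # Oblique split: a linear combination of attributes against a threshold.
    # With w = (1, -1) and b = 0 this is the test N1 > N2 from the figure.
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0

print(oblique_split((3.0, 2.0)))   # N1 > N2 -> True
print(oblique_split((1.0, 2.0)))   # N1 < N2 -> False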
Parameters
•  Minimum number of instances necessary to split a node.
•  Pruning / no pruning
•  Pruning confidence: how much to prune?
•  For computational reasons, the number of nodes or the depth of the tree can be limited.
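In scikit-learn, these knobs correspond to constructor parameters of DecisionTreeClassifier (a sketch; the values are arbitrary examples, not recommendations, and pruning strength is controlled by ccp_alpha rather than a C4.5-style confidence):

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    min_samples_split=10,   # minimum number of instances needed to split a node
    max_depth=8,            # limit the depth of the tree
    max_leaf_nodes=50,      # ... or limit the number of leaf nodes
    ccp_alpha=0.01,         # cost-complexity pruning strength; 0.0 means no pruning
)
# clf.fit(X_train, y_train) on your data as usual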
Algorithm details
CART
  •  Splitting criterion: Gini, Twoing
  •  Pruning criterion: cross-validation post-pruning
  •  Other features: regression/classification, nominal/numeric attributes, missing values, oblique splits, grouping of nominal splits

ID3
  •  Splitting criterion: Information Gain (IG)
  •  Pruning criterion: pre-pruning
  •  Other features: classification, nominal attributes

C4.5
  •  Splitting criterion: Information Gain (IG), Information Gain Ratio (IGR)
  •  Pruning criterion: statistical post-pruning
  •  Other features: classification, nominal/numeric attributes, missing values, rule generator, multi-way node splits
Bad things about DT
•  None! Well, maybe something…
•  They cannot handle complex interactions between attributes very well. Lack of expressive power.
Bad things about DT
•  Replication problem. Can end up with similar subtrees
in mutually exclusive regions.
Good things about DT
•  Self-explanatory. Easy for non-experts to understand. Can be converted to rules.
•  Handle both nominal and numeric attributes.
•  Can handle uninformative and redundant attributes.
•  Can handle missing values.
•  Nonparametric method. In principle, no predefined idea of the concept to learn.
•  Easy to tune. They do not have hundreds of parameters.
Thanks for listening

[Closing slide: a small decision tree. Questions? If NO, go for a coffee. If YES: correctly answered? Either way, go for a coffee.]
