Machine Learning – COMS3007
Decision Trees
Benjamin Rosman
Benjamin.Rosman1@wits.ac.za / benjros@gmail.com
Based heavily on course notes by Chris Williams and Victor Lavrenko, and Clint van Alten
Tennis data
[Table: the PlayTennis training data – the feature variables Outlook, Temperature, Humidity and Wind, their values, and the class/target PlayTennis (Yes/No)]
• What type of ML is this?
Example – play tennis?
• (Rain, Mild, Normal, Weak)?
• (Sunny, Cool, High, Weak)?
• (Overcast, Hot, Normal, Strong)?
Decision trees – intuition
• Ask a series of questions based on the values of
variables
• Natural way to make decisions
• Effectively divide up the space of all data points
into regions to have simple predictions in each
region
Reading off the rules
• What rule does this tree encode?
• “We can play tennis if:
• It is sunny with normal humidity, or
• It is overcast, or
• It is raining with weak wind.”
• Follow the branches ending with the class we want
(“Yes” in this case).
• “Or” across branches (“∨”), “and” down branches (“∧”)
• PlayTennis = (Outlook = sunny ∧ Humidity = normal)
             ∨ (Outlook = overcast)
             ∨ (Outlook = rain ∧ Wind = weak)
Classifying with the tree
• Given a new data point…
• Read down the tree
• Answer questions top-down
• (Sunny, Cool, High, Weak)?
• No
• (Rain, Mild, Normal, Weak)?
• Yes
• (Overcast, Hot, Normal, Strong)?
• Yes
Constructing a tree
• Want to be able to learn a tree
• From training data
• Involves deciding which questions to ask where
• What are the best nodes to have higher up?
• Once it has been learned, we can report its
accuracy on the testing data
• A common algorithm for learning trees is the ID3
algorithm
The ID3 Algorithm
Main loop:
1. F ← “best” feature for next node
2. Assign F as decision feature for node
3. For each value of F, create new branch from node
• One for each value f that F can take
4. Sort/distribute training examples to leaf nodes
• Assign all data points that have F=f to branch f
• Leaf node predicts its most common label
5. If training examples perfectly classified, then
STOP, else iterate over new leaf nodes
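To make the loop concrete, here is a minimal recursive sketch of ID3 in Python. The data layout (a list of dicts with a "label" key) and the `best_feature` helper are our own assumptions; `best_feature` stands in for the gain-based choice defined on the following slides.

```python
from collections import Counter

def majority_label(examples):
    """Most common class label among the examples."""
    return Counter(e["label"] for e in examples).most_common(1)[0][0]

def id3(examples, features, best_feature):
    """Recursively build a decision tree.

    examples: list of dicts, e.g. {"Outlook": "Sunny", ..., "label": "Yes"}
    features: list of feature names still available for splitting
    best_feature: function (examples, features) -> feature name
                  (e.g. the information-gain choice defined on later slides)
    """
    labels = {e["label"] for e in examples}
    if len(labels) == 1:            # perfectly classified: stop
        return labels.pop()
    if not features:                # no features left: predict the majority label
        return majority_label(examples)

    F = best_feature(examples, features)           # 1. pick the "best" feature
    tree = {"feature": F, "branches": {}}          # 2. F becomes this node's test
    for f in {e[F] for e in examples}:             # 3. one branch per value f of F
        subset = [e for e in examples if e[F] == f]        # 4. distribute examples
        remaining = [g for g in features if g != F]
        tree["branches"][f] = id3(subset, remaining, best_feature)  # 5. recurse
    return tree
```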
Choosing the best feature
• Why does it make a difference what features we
choose?
• Don’t ask uninformative questions up front: waste of
computation, and may lead to having to repeat the same
important question later on, on many different branches.
• E.g. Imagine we first split on another feature “day”, which doesn’t affect the predictions at all. Then we would have to build 7 copies of the same subtree, one for each day.
• How to choose a good feature?
• We want one that has strong discriminating power, i.e. ends
up with branches that are more ordered than before the split
Bad features
[Figure: a tree split on “Month”, with one branch per value: Jan, Feb, …]
Entropy
• Need a measure of how “good” a feature is
• Entropy: measure of randomness or uncertainty in
a system
• Let p = {p_1, p_2, …, p_n} be a set of probabilities, with Σ_{i=1}^{n} p_i = 1
• Entropy is H(p) = −Σ_{i=1}^{n} p_i log₂ p_i
The base 2 in the log is convention, and does NOT depend on the number of
outcomes. It is because we measure entropy in “bits”.
Entropy example
• Let p = {p_H, p_T} be the outcomes of a coin flip
• Entropy is H(p) = −(p_H log₂ p_H + p_T log₂ p_T)
• Unbiased coin: p_H = 1/2, p_T = 1/2
  • Maximum uncertainty
  • H(p) = −0.5 log₂ 0.5 − 0.5 log₂ 0.5 = 1
• Maximally biased coin: p_H = 1, p_T = 0
  • Minimum uncertainty
  • H(p) = −1 log₂ 1 − 0 log₂ 0 = 0
[Figure: H(p) plotted against p_H – zero at p_H = 0 and p_H = 1, peaking at 1 bit when p_H = 0.5]
Lower entropy is better! It means we can make more accurate predictions.
Note: we always take 0 log₂ 0 = 0
Note: max = 1 only because there are TWO outcomes to a coin flip.
Otherwise it would be higher. Why?
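As a quick check, here is a small Python helper for the entropy formula (the function name is ours), evaluated on the coin examples above; the last line hints at the answer to the “Why?” question.

```python
import math

def entropy(probs):
    """H(p) = -sum_i p_i * log2(p_i); terms with p_i = 0 or 1 contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if 0 < p < 1)

print(entropy([0.5, 0.5]))   # unbiased coin: 1.0 bit (maximum uncertainty)
print(entropy([1.0, 0.0]))   # maximally biased coin: 0 (no uncertainty)
print(entropy([0.25] * 4))   # four equally likely outcomes: 2.0 bits, so the max exceeds 1 here
```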
Information gain
• In ID3, we want the feature F which gives the
largest reduction in entropy over data D: Gain(D, F)
• Choose feature F with maximum gain
• Gain(D, F) = H(D) − (1/|D|) Σ_{f ∈ values of F} |D_f| H(D_f)
  • H(D): the starting entropy we measure the gain against; it is the same whichever feature we consider, so it is often left out when comparing features
  • The sum runs over all branches, i.e. all values f that F can take
  • |D_f| / |D|: the fraction of the data that falls into branch f (the weight)
  • H(D_f): the entropy of the subset of the data D where F = f
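A sketch of Gain(D, F) in Python, reusing the `entropy` helper above; the list-of-dicts layout and the "label" key are assumptions carried over from the ID3 sketch, and `best_feature` is the gain-maximising choice used in step 1 of ID3.

```python
from collections import Counter

def label_entropy(examples):
    """H(D): entropy of the class labels in a set of examples."""
    counts = Counter(e["label"] for e in examples)
    total = sum(counts.values())
    return entropy([c / total for c in counts.values()])

def gain(examples, F):
    """Gain(D, F) = H(D) - (1/|D|) * sum over f of |D_f| * H(D_f)."""
    weighted = 0.0
    for f in {e[F] for e in examples}:              # all values f that F takes in D
        D_f = [e for e in examples if e[F] == f]    # subset of D where F = f
        weighted += len(D_f) * label_entropy(D_f)   # |D_f| * H(D_f)
    return label_entropy(examples) - weighted / len(examples)

def best_feature(examples, features):
    """Step 1 of ID3: the feature with maximum gain."""
    return max(features, key=lambda F: gain(examples, F))
```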
Example

Row  F1  F2  F3  Class
1    A   S   G   Y
2    B   S   E   N
3    A   T   G   N
4    B   T   G   Y
5    A   S   G   Y
6    B   S   E   N

• What is p_Y and p_N?
  • p_Y = 3/6; p_N = 3/6
• So, entropy of data H(D):
  • H(D) = −(1/2) log₂(1/2) − (1/2) log₂(1/2) = 1
• Also, |D| = 6
• Now compute the gain for each of the three features
F1, F2, F3
• So we can decide which to split on
Example (F1)
• Gain(D, F1)
• f = {A, B}
• For F1 = A (rows 1, 3, 5):
  • |D_A| = 3
  • p_Y = 2/3; p_N = 1/3
  • H(D_A) = −(2/3) log₂(2/3) − (1/3) log₂(1/3) = 0.918
• For F1 = B (rows 2, 4, 6):
  • |D_B| = 3
  • p_Y = 1/3; p_N = 2/3
  • H(D_B) = −(1/3) log₂(1/3) − (2/3) log₂(2/3) = 0.918
• Gain(D, F1) = H(D) − (1/|D|) Σ_{f ∈ {A,B}} |D_f| H(D_f)
  = 1 − (1/6)(3 × 0.918 + 3 × 0.918) = 0.082
We have a slight reduction in entropy: splitting on F1 would give us two branches that each have slightly less uncertainty, i.e. a 2/3 vs 1/3 split.
Example (F2)
• Gain(D, F2)
• f = {S, T}
• For F2 = S (rows 1, 2, 5, 6):
  • |D_S| = 4
  • p_Y = 1/2; p_N = 1/2
  • H(D_S) = −(1/2) log₂(1/2) − (1/2) log₂(1/2) = 1
• For F2 = T (rows 3, 4):
  • |D_T| = 2
  • p_Y = 1/2; p_N = 1/2
  • H(D_T) = −(1/2) log₂(1/2) − (1/2) log₂(1/2) = 1
• Gain(D, F2) = H(D) − (1/|D|) Σ_{f ∈ {S,T}} |D_f| H(D_f)
  = 1 − (1/6)(4 × 1 + 2 × 1) = 0
We have no reduction in entropy: splitting on F2 would give us two branches with the same uncertainty we have now.
Example (F3)
• Gain(D, F3)
• f = {G, E}
• For F3 = G (rows 1, 3, 4, 5):
  • |D_G| = 4
  • p_Y = 3/4; p_N = 1/4
  • H(D_G) = −(3/4) log₂(3/4) − (1/4) log₂(1/4) = 0.811
• For F3 = E (rows 2, 6):
  • |D_E| = 2
  • p_Y = 0/2; p_N = 2/2
  • H(D_E) = −0 log₂ 0 − 1 log₂ 1 = 0
• Gain(D, F3) = H(D) − (1/|D|) Σ_{f ∈ {G,E}} |D_f| H(D_f)
  = 1 − (1/6)(4 × 0.811 + 2 × 0) = 0.459
We have a large reduction in entropy: splitting on F3 would give us one branch that is all “N”, and another that is 75% “Y”.
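The three worked gains can be checked with the `gain` helper sketched earlier, using the same six rows as the table.

```python
data = [
    {"F1": "A", "F2": "S", "F3": "G", "label": "Y"},
    {"F1": "B", "F2": "S", "F3": "E", "label": "N"},
    {"F1": "A", "F2": "T", "F3": "G", "label": "N"},
    {"F1": "B", "F2": "T", "F3": "G", "label": "Y"},
    {"F1": "A", "F2": "S", "F3": "G", "label": "Y"},
    {"F1": "B", "F2": "S", "F3": "E", "label": "N"},
]

for F in ["F1", "F2", "F3"]:
    print(F, round(gain(data, F), 3))   # F1 0.082, F2 0.0, F3 0.459
```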
Example tree
• Gain(D, F1) = 0.082
• Gain(D, F2) = 0
• Gain(D, F3) = 0.459
• Max gain ⇒ split on F3
Allocate the data points to the branches where F3 = G and F3 = E:

F3 = G branch:
F1  F2  F3  Class
A   S   G   Y
A   T   G   N
B   T   G   Y
A   S   G   Y
Prediction = “Y” (75% accuracy)

F3 = E branch:
F1  F2  F3  Class
B   S   E   N
B   S   E   N
Prediction = “N”

Now repeat the process on every leaf node that has imperfectly classified training data. Note the next feature used for splitting may be different on each branch.
Avoiding overfitting – pruning
• Overfitting quite likely (particularly when there are only a few data
points at a leaf)
• Correct with pruning: replace a subtree with a leaf node with the most
common label from the subtree
Prune here
[Figure: a tree rooted at F7 with a leaf C and a subtree F8 (leaves B and A); pruning at F8 replaces that whole subtree with a single leaf carrying its most common label, A, giving a tree F7 with leaves C and A]
What to prune?
Remember: the full set of known data = training data + validation data + testing data
• Use ID3 to build tree T using the training data
• Compute the validation error using the validation data (plug each validation point into the tree for a prediction):
  • error_V(T) = (number of misclassifications) / (size of validation set)
• Consider any subtree of T, and prune it to give T′
• If error_V(T′) < error_V(T):
  • Replace T with T′ and repeat
• Else: keep T and repeat
• Finally, calculate error on test set for reporting
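A rough sketch of the validation-error computation in Python, under the same tree representation as the earlier ID3 sketch (leaves are labels, internal nodes are dicts); the fallback label for unseen feature values is a simplification of ours, and the pruning loop itself is only outlined in the final comment.

```python
def predict(tree, x, default="Y"):
    """Follow the branches of the tree for a data point x (a dict of feature values)."""
    while isinstance(tree, dict):             # internal node: ask its question
        value = x[tree["feature"]]
        if value not in tree["branches"]:     # value never seen in training: fall back
            return default
        tree = tree["branches"][value]
    return tree                               # leaf: the predicted label

def validation_error(tree, validation_set):
    """error_V(T) = number of misclassifications / size of validation set."""
    wrong = sum(predict(tree, x) != x["label"] for x in validation_set)
    return wrong / len(validation_set)

# Pruning loop (outline): for each internal node of T, tentatively replace its subtree
# with a leaf holding that subtree's most common training label, giving T'; keep T'
# whenever validation_error(T', V) < validation_error(T, V), and repeat.
```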
Continuous variables
• What if the attributes take continuous values?
𝑭𝟏 𝑭𝟐 Target
0.83 0.56 A
0.12 0.22 C
0.68 0.19 B
0.79 0.34 A
⋮ ⋮ ⋮
• For each attribute, ask questions like: 𝐹1 < 0.6? 𝐹2 < 0.3?
• Values 0.6 and 0.3 here are called split-points. How to
choose them?
• Choose possible points over an interval, e.g. 0.1, 0.2, …, 0.9
• Calculate the Gain of each of these options
• E.g. Is 𝐹1 < 0.1? Is 𝐹1 < 0.2? Is 𝐹1 < 0.3? …
• Choose the attribute/split-point with maximum Gain
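A sketch of the split-point search for a continuous feature, reusing `label_entropy` from the gain sketch; the candidate grid 0.1, …, 0.9 follows the slide, while the function names and data layout are our assumptions.

```python
def gain_for_split(examples, F, threshold):
    """Gain of the binary question 'is F < threshold?'."""
    left = [e for e in examples if e[F] < threshold]
    right = [e for e in examples if e[F] >= threshold]
    if not left or not right:                 # everything falls on one side: no split
        return 0.0
    weighted = (len(left) * label_entropy(left) +
                len(right) * label_entropy(right)) / len(examples)
    return label_entropy(examples) - weighted

def best_split_point(examples, F):
    """Try thresholds 0.1, 0.2, ..., 0.9 and keep the one with maximum gain."""
    candidates = [i / 10 for i in range(1, 10)]
    return max(candidates, key=lambda t: gain_for_split(examples, F, t))
```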
Continuous variables – example
For two features, we may have:
[Figure: a scatter of points labelled A, B and C in the (F1, F2) plane – roughly, A’s where F1 ≥ 0.4, B’s where F1 < 0.4 and F2 ≥ 0.6, C’s where F1 < 0.4 and F2 < 0.6]
Giving the tree:
  F1 < 0.4?
    No  → A
    Yes → F2 < 0.6?
            No  → B
            Yes → C
Regression trees
• What if the target takes continuous values?
• This is regression
𝑭𝟏 𝑭𝟐 … 𝑭𝒎 𝒕𝒂𝒓𝒈𝒆𝒕
⋮ ⋮ ⋮ ⋮ 2.7
⋮ ⋮ ⋮ ⋮ 2.35
⋮ ⋮ ⋮ ⋮ 1.98
⋮ ⋮ ⋮ ⋮ -0.33
⋮ ⋮ ⋮ ⋮ -1.05
⋮ ⋮ ⋮ ⋮ 0.77
⋮ ⋮ ⋮ ⋮ 0.5
• Entropy doesn’t work here!
Sum of squares error
• Instead of entropy, use the sum of squares error:
SoSe(D)
• First, calculate the mean of the N targets t_i:
  • mean = μ = (1/N) Σ_{i=1}^{N} t_i
• Then:
  • SoSe(D) = (1/N) Σ_{i=1}^{N} (t_i − μ)²
This error is ≥ 0. It is minimised when all the targets are at the mean, and increases as they become spread out.
SoSe example

F1  F2  …  Fm  target
⋮   ⋮       ⋮   2.7
⋮   ⋮       ⋮   2.35
⋮   ⋮       ⋮   1.98
⋮   ⋮       ⋮   -0.33
⋮   ⋮       ⋮   -1.05
⋮   ⋮       ⋮   0.77
⋮   ⋮       ⋮   0.5

• Mean:
  • μ = (1/7)(2.7 + 2.35 + 1.98 − 0.33 − 1.05 + 0.77 + 0.5) = 0.989
• Then, SoSe(D) = (2.7 − μ)² + (2.35 − μ)² + (1.98 − μ)² + (−0.33 − μ)² + (−1.05 − μ)² + (0.77 − μ)² + (0.5 − μ)² = 11.95
• (and then divide by N)
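The same arithmetic in Python, using the SoSe definition from the previous slide.

```python
targets = [2.7, 2.35, 1.98, -0.33, -1.05, 0.77, 0.5]

mu = sum(targets) / len(targets)                  # mean, about 0.989
squared = sum((t - mu) ** 2 for t in targets)     # the sum above, about 11.95

print(round(mu, 3), round(squared, 2))            # 0.989 11.95
print(round(squared / len(targets), 2))           # 1.71 after dividing by N
```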
Learning the regression tree
• Same as the ID3 algorithm
• For each attribute F, calculate:
• SoSe_F(D) = Σ_{f ∈ values of F} |D_f| SoSe(D_f)
• Choose the F which minimises 𝑆𝑜𝑆𝑒𝐹 (𝐷)
• Create a branch for each value f of F
• The label given to a leaf node is the average of all
training values at that node
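A sketch of the regression-tree feature choice, mirroring the classification case; again the "target" key and the function names are our assumptions.

```python
def sose(examples):
    """SoSe(D): average squared deviation of the targets from their mean."""
    targets = [e["target"] for e in examples]
    mu = sum(targets) / len(targets)
    return sum((t - mu) ** 2 for t in targets) / len(targets)

def sose_after_split(examples, F):
    """SoSe_F(D) = sum over values f of |D_f| * SoSe(D_f)."""
    total = 0.0
    for f in {e[F] for e in examples}:
        D_f = [e for e in examples if e[F] == f]
        total += len(D_f) * sose(D_f)
    return total

def best_regression_feature(examples, features):
    """Choose the feature minimising SoSe_F(D); a leaf then predicts the mean target."""
    return min(features, key=lambda F: sose_after_split(examples, F))
```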
Summary
• Decision trees as rules
• Classifying with a decision tree
• Building a tree: ID3
• Choosing the best feature: entropy and gain
• Avoiding overfitting: pruning
• Continuous variables: split points
• Regression trees: sum of squares error