Back to decision trees
• Which attribute do we choose at each level?
• The one with the highest information gain
– The one that reduces the unpredictability the most
Bill Howe, UW
Before: 14 records, 9 are “yes”
outlook temperature humidity windy play
overcast cool normal TRUE yes
overcast hot high FALSE yes If we choose outlook:
overcast hot normal FALSE yes
overcast : 4 records, 4 are “yes”
overcast mild high TRUE yes
rainy cool normal TRUE no
rainy mild high TRUE no
rainy cool normal FALSE yes
rainy mild high FALSE yes rainy : 5 records, 3 are “yes”
rainy mild normal FALSE yes
sunny hot high FALSE no
sunny hot high TRUE no
sunny mild high FALSE no sunny : 5 records, 2 are “yes”
sunny cool normal FALSE yes
sunny mild normal TRUE yes
Expected new entropy:
= 0.69
Before: 14 records, 9 are “yes”
outlook temperature humidity windy play
overcast cool normal TRUE yes
overcast hot high FALSE yes If we choose temperature:
overcast hot normal FALSE yes
cool : 4 records, 3 are “yes”
overcast mild high TRUE yes
rainy cool normal TRUE no 0.81
rainy mild high TRUE no
rainy cool normal FALSE yes
rainy mild high FALSE yes rainy : 4 records, 2 are “yes”
rainy mild normal FALSE yes
sunny hot high FALSE no 1.0
sunny hot high TRUE no
sunny mild high FALSE no sunny : 6 records, 4 are “yes”
sunny cool normal FALSE yes
sunny mild normal TRUE yes 0.92
Expected new entropy:
0.81(4/14) + 1.0(4/14) + 0.92(6/14)
= 0.91
Before: 14 records, 9 are “yes”
outlook temperature humidity windy play
overcast cool normal TRUE yes
overcast hot high FALSE yes If we choose humidity:
overcast hot normal FALSE yes
normal : 7 records, 6 are “yes”
overcast mild high TRUE yes
rainy cool normal TRUE no 0.59
rainy mild high TRUE no
rainy cool normal FALSE yes
rainy mild high FALSE yes high : 7 records, 2 are “yes”
rainy mild normal FALSE yes
sunny hot high FALSE no 0.86
sunny hot high TRUE no
sunny mild high FALSE no
sunny cool normal FALSE yes
sunny mild normal TRUE yes
Expected new entropy:
0.59(7/14) + 0.86(7/14)
= 0.725
Before: 14 records, 9 are “yes”
outlook temperature humidity windy play
overcast cool normal TRUE yes
overcast hot high FALSE yes If we choose windy:
overcast hot normal FALSE yes
TRUE : 8 records, 6 are “yes”
overcast mild high TRUE yes
rainy cool normal TRUE no 0.81
rainy mild high TRUE no
rainy cool normal FALSE yes
rainy mild high FALSE yes FALSE : 5 records, 3 are “yes”
rainy mild normal FALSE yes
sunny hot high FALSE no 0.97
sunny hot high TRUE no
sunny mild high FALSE no
sunny cool normal FALSE yes
sunny mild normal TRUE yes
Expected new entropy:
0.81(8/14) + 0.97(6/14)
= 0.87
Before: 14 records, 9 are “yes”
outlook temperature humidity windy play
overcast cool normal TRUE yes
overcast hot high FALSE yes
overcast hot normal FALSE yes outlook
overcast mild high TRUE yes
rainy cool normal TRUE no 0.94 – 0.69 = 0.25 highest gain
rainy mild high TRUE no
rainy cool normal FALSE yes temperature
rainy mild high FALSE yes
rainy mild normal FALSE yes 0.94 – 0.91 = 0.03
sunny hot high FALSE no
sunny hot high TRUE no
humidity
sunny mild high FALSE no
sunny cool normal FALSE yes
sunny mild normal TRUE yes 0.94 – 0.725 = 0.215
windy
0.94 – 0.87 = 0.07
Document Classification
Includes “Falcons” Includes “Mars”
yes no yes no
50% “Sports” 50% “Sports” 2% “Sports” 50% “Sports”
50% “Science” 50% “Science” 95% “Science” 50% “Science”
Bill Howe, UW
Building a Decision Tree (ID3 Algorithm)
• Assume attributes are discrete
– Discretize continuous attributes
• Choose the attribute with the highest Information Gain
• Create branches for each value of attribute
• Examples partitioned based on selected attributes
• Repeat with remaining attributes
• Stopping conditions
– All examples assigned the same label
– No examples left
Bill Howe, UW
Problems
• Expensive to train
• Prone to overfitting
– Drive to perfection on training data, bad on
test data
– Pruning can help: remove or aggregate
subtrees that provide little discriminatory
power (C45)
Bill Howe, UW
C4.5 Extensions
• Continuous Attributes
outlook temperature humidity windy play
overcast cool 60 TRUE yes
overcast hot 80 FALSE yes
overcast hot 63 FALSE yes
overcast mild 81 TRUE yes
rainy cool 58 TRUE no
rainy mild 90 TRUE no
rainy cool 54 FALSE yes
rainy mild 92 FALSE yes
rainy mild 59 FALSE yes
sunny hot 90 FALSE no
sunny hot 89 TRUE no
sunny mild 90 FALSE no
sunny cool 60 FALSE yes
sunny mild 62 TRUE yes
Bill Howe, UW
Consider every possible binary
partition; choose the partition
with the highest gain
outlook temperature humidity windy play
rainy mild 54 FALSE yes
overcast hot 58 FALSE yes
overcast cool 59 TRUE yes
rainy cool 60 FALSE yes E(6/6)
overcast mild 60 TRUE yes
overcast hot 62 FALSE yes
= 0.0 E(9/10)+ E(1/10)
rainy mild 63 TRUE no
sunny cool 80 FALSE yes = 0.47
rainy mild 81 FALSE yes
sunny mild 89 TRUE yes
sunny hot 90 FALSE no E(3/8) + E(5/8)
rainy cool 90 TRUE no
sunny hot 90 TRUE no = 0.95 E(4/4)
sunny mild 92 FALSE no
= 0.0
Expect = 8/14*0.95 + 6/14*0 Expect = 10/14*0.47 + 4/14*0
= 0.54 = 0.33
Bill Howe, UW