
Machine Learning – COMS3007

Decision Trees
Benjamin Rosman
Benjamin.Rosman1@wits.ac.za / benjros@gmail.com

Based heavily on course notes by Chris Williams and Victor Lavrenko, and Clint van Alten
Tennis data
[Table of training examples: the features (Outlook, Temperature, Humidity, Wind) and the class (play tennis: Yes/No)]
• What type of ML is this?

Tennis data
• (Rain, Mild, Normal, Weak)?
• (Sunny, Cool, High, Weak)?
• (Overcast, Hot, Normal, Strong)?

Example – play tennis?
[The same table, annotated with the feature variables, their values, and the class / targets]
Decision trees – intuition
• Ask a series of questions based on the values of
variables
• Natural way to make decisions
• Effectively, divide up the space of all data points into regions, so that we can make a simple prediction in each region
Reading off the rules
• What rule does this tree encode?
• “We can play tennis if:
• It is sunny with normal humidity, or
• It is overcast, or
• It is raining with weak wind.”
• Follow the branches ending with the class we want
(“Yes” in this case).
• “Or” across branches (“∨”), “and” down branches (“∧”)
• PlayTennis = (Outlook = sunny ∧ Humidity = normal)
               ∨ (Outlook = overcast)
               ∨ (Outlook = rain ∧ Wind = weak)
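As a small illustration (not from the original slides), the same rule can be written directly as a boolean expression; the lower-case feature values follow the tennis example:

```python
def play_tennis(outlook, humidity, wind):
    # "Or" across branches, "and" down branches.
    return ((outlook == "sunny" and humidity == "normal")
            or outlook == "overcast"
            or (outlook == "rain" and wind == "weak"))
```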
Classifying with the tree
• Given a new data point…
• Read down the tree
• Answer questions top-down
• (Sunny, Cool, High, Weak)?
• No
• (Rain, Mild, Normal, Weak)?
• Yes
• (Overcast, Hot, Normal, Strong)?
• Yes
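A minimal sketch of reading a point down the tree. The splits (Outlook, Humidity, Wind) come from the slides; the nested-dictionary layout and function names are illustrative assumptions:

```python
# Internal nodes map a feature name to {value: subtree}; leaves are class labels.
tennis_tree = {"Outlook": {
    "sunny":    {"Humidity": {"normal": "Yes", "high": "No"}},
    "overcast": "Yes",
    "rain":     {"Wind": {"weak": "Yes", "strong": "No"}},
}}

def classify(tree, point):
    """Answer the questions top-down until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        feature, branches = next(iter(tree.items()))
        tree = branches[point[feature]]
    return tree

print(classify(tennis_tree, {"Outlook": "sunny", "Temperature": "cool",
                             "Humidity": "high", "Wind": "weak"}))      # No
print(classify(tennis_tree, {"Outlook": "rain", "Temperature": "mild",
                             "Humidity": "normal", "Wind": "weak"}))    # Yes
print(classify(tennis_tree, {"Outlook": "overcast", "Temperature": "hot",
                             "Humidity": "normal", "Wind": "strong"}))  # Yes
```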
Constructing a tree
• Want to be able to learn a tree
• From training data
• Involves deciding which questions to ask where
• What are the best nodes to have higher up?
• Once it has been learned, we can report its accuracy on the testing data

• A common algorithm for learning trees is the ID3 algorithm
The ID3 Algorithm
Main loop:
1. F ← “best” feature for next node
2. Assign F as decision feature for node
3. For each value of F, create new branch from node
• One for each value f that F can take
4. Sort/distribute training examples to leaf nodes
• Assign all data points that have F=f to branch f
• Leaf node predicts its most common label
5. If training examples perfectly classified, then
STOP, else iterate over new leaf nodes
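This loop can be written recursively. Below is a rough sketch (not from the slides), assuming categorical features, training data given as a list of dicts with a "Class" key, and a best_feature(data, features) helper that implements the selection criterion defined on the following slides. The tree it returns has the same nested-dict shape as the classify sketch above.

```python
from collections import Counter

def most_common_label(data):
    """Leaf prediction: the most common class label among the examples at the node."""
    return Counter(row["Class"] for row in data).most_common(1)[0][0]

def id3(data, features, best_feature):
    labels = {row["Class"] for row in data}
    if len(labels) == 1:                  # training examples perfectly classified: STOP
        return labels.pop()
    if not features:                      # no features left: predict the majority label
        return most_common_label(data)
    F = best_feature(data, features)      # 1. "best" feature for the next node
    node = {F: {}}                        # 2. assign F as the decision feature
    for f in {row[F] for row in data}:    # 3. one branch per value f of F
        subset = [row for row in data if row[F] == f]   # 4. distribute the examples
        node[F][f] = id3(subset, [g for g in features if g != F], best_feature)
    return node                           # 5. recurse over the new leaf nodes
```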
Choosing the best feature
• Why does it make a difference what features we
choose?
• Don’t ask uninformative questions up front: waste of
computation, and may lead to having to repeat the same
important question later on, on many different branches.
• E.g. Imagine we first split on another feature “day”, which
doesn’t affect the predictions at all. Then we would have 7
copies of the tree.
• How to choose a good feature?
• We want one that has strong discriminating power, i.e. ends
up with branches that are more ordered than before the split
Bad features
[Tree that splits first on Month, with one branch for each value (Jan, Feb, …): an uninformative split, so the rest of the tree has to be repeated under every branch]
Entropy
• Need a measure of how “good” a feature is
• Entropy: measure of randomness or uncertainty in
a system
• Let p = {p_1, p_2, …, p_n} be a set of probabilities, with Σ_{i=1}^{n} p_i = 1
• Entropy is H(p) = −Σ_{i=1}^{n} p_i log₂ p_i

The base 2 in the log is convention, and does NOT depend on the number of outcomes. It is because we measure entropy in "bits".
Entropy example
• Let p = {p_H, p_T} be the outcomes of a coin flip
• Entropy is H(p) = −(p_H log₂ p_H + p_T log₂ p_T)
• Unbiased coin: p_H = 1/2, p_T = 1/2
  • Maximum uncertainty
  • H(p) = −0.5 log₂ 0.5 − 0.5 log₂ 0.5 = 1
• Maximally biased coin: p_H = 1, p_T = 0
  • Minimum uncertainty
  • H(p) = −1 log₂ 1 − 0 log₂ 0 = 0

[Plot: H(p) against p_H, peaking at 1 when p_H = 0.5 and dropping to 0 at p_H = 0 and p_H = 1]

Lower entropy is better! It means we can make more accurate predictions.
Note: we always take 0 log₂ 0 = 0
Note: max = 1 only because there are TWO outcomes to a coin flip. Otherwise it would be higher. Why?
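A short sketch of the entropy formula, checking the two coin examples (the function name is illustrative):

```python
import math

def entropy(probs):
    """H(p) = -sum_i p_i * log2(p_i), using the convention 0*log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 -- unbiased coin, maximum uncertainty
print(entropy([1.0, 0.0]))  # 0.0 -- maximally biased coin, minimum uncertainty
```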
Information gain
• In ID3, we want the feature F which gives the
largest reduction in entropy over data D: Gain(D, F)
• Choose feature F with maximum gain
• Gain(D, F) = H(D) − (1/|D|) Σ_{f ∈ values of F} |D_f| H(D_f)
  • H(D): the entropy we start from. Note this term is the same whichever feature we consider, so it is often left out when comparing gains.
  • The sum runs over all branches, i.e. all values f of F.
  • |D_f| / |D|: weight each branch by the fraction of the data it constitutes.
  • H(D_f): the entropy of the subset of the data D where F = f.
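A sketch of this formula in code, reusing the entropy helper above; as before, data is assumed to be a list of dicts with a "Class" key:

```python
from collections import Counter

def class_entropy(data):
    """Entropy of the class labels in a (subset of the) data set."""
    counts = Counter(row["Class"] for row in data)
    total = sum(counts.values())
    return entropy([c / total for c in counts.values()])

def gain(data, F):
    """Gain(D, F) = H(D) - (1/|D|) * sum_f |D_f| * H(D_f)."""
    weighted = 0.0
    for f in {row[F] for row in data}:                   # all branches: all values f of F
        subset = [row for row in data if row[F] == f]    # D_f: the data where F = f
        weighted += len(subset) * class_entropy(subset)  # weight by |D_f|
    return class_entropy(data) - weighted / len(data)
```

With this, the "best" feature in step 1 of ID3 can simply be chosen as max(features, key=lambda F: gain(data, F)).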
Example

F1  F2  F3  Class
A   S   G   Y
B   S   E   N
A   T   G   N
B   T   G   Y
A   S   G   Y
B   S   E   N

• What are p_Y and p_N?
  • p_Y = 3/6; p_N = 3/6
• So, entropy of the data: H(D) = −(1/2) log₂(1/2) − (1/2) log₂(1/2) = 1
• Also, |D| = 6

• Now compute the gain for each of the three features F1, F2, F3
  • So we can decide which to split on
Example (F1)
• Gain(D, F1)
• f = {A, B}
• For F1 = A (rows 1, 3, 5 of the table above):
  • |D_A| = 3
  • p_Y = 2/3; p_N = 1/3
  • H(D_A) = −(2/3) log₂(2/3) − (1/3) log₂(1/3) = 0.918
• For F1 = B (rows 2, 4, 6):
  • |D_B| = 3
  • p_Y = 1/3; p_N = 2/3
  • H(D_B) = −(1/3) log₂(1/3) − (2/3) log₂(2/3) = 0.918
• Gain(D, F1) = H(D) − (1/|D|) Σ_{f ∈ {A,B}} |D_f| H(D_f)
  = 1 − (1/6)(3 × 0.918 + 3 × 0.918) = 0.082

We have a slight reduction in entropy: splitting on F1 would give us two branches that have slightly less uncertainty, i.e. a 2/3 vs 1/3 split in each.
Example (F2)
• Gain(D, F2)
• f = {S, T}
• For F2 = S (rows 1, 2, 5, 6 of the table above):
  • |D_S| = 4
  • p_Y = 1/2; p_N = 1/2
  • H(D_S) = −(1/2) log₂(1/2) − (1/2) log₂(1/2) = 1
• For F2 = T (rows 3, 4):
  • |D_T| = 2
  • p_Y = 1/2; p_N = 1/2
  • H(D_T) = −(1/2) log₂(1/2) − (1/2) log₂(1/2) = 1
• Gain(D, F2) = H(D) − (1/|D|) Σ_{f ∈ {S,T}} |D_f| H(D_f)
  = 1 − (1/6)(4 × 1 + 2 × 1) = 0

We have no reduction in entropy: splitting on F2 would give us two branches with the same uncertainty we have now.
Example (F3)
• Gain(D, F3)
• f = {G, E}
• For F3 = G (rows 1, 3, 4, 5 of the table above):
  • |D_G| = 4
  • p_Y = 3/4; p_N = 1/4
  • H(D_G) = −(3/4) log₂(3/4) − (1/4) log₂(1/4) = 0.811
• For F3 = E (rows 2, 6):
  • |D_E| = 2
  • p_Y = 0/2; p_N = 2/2
  • H(D_E) = −(0/2) log₂(0/2) − (2/2) log₂(2/2) = 0
• Gain(D, F3) = H(D) − (1/|D|) Σ_{f ∈ {G,E}} |D_f| H(D_f)
  = 1 − (1/6)(4 × 0.811 + 2 × 0) = 0.459

We have a large reduction in entropy: splitting on F3 would give us one branch that is all "N", and another that is 75% "Y".
Example tree
• Gain(D, F1) = 0.082
• Gain(D, F2) = 0
• Gain(D, F3) = 0.459
• Max gain => split on F3
• Now repeat the process on every leaf node that has imperfectly classified training data. Note the next feature used for splitting may be different on each branch.
• Allocate the data points to the branches where F3 = G and F3 = E:

F3 = G branch:
F1  F2  F3  Class
A   S   G   Y
A   T   G   N
B   T   G   Y
A   S   G   Y
Prediction = "Y" (75% accuracy)

F3 = E branch:
F1  F2  F3  Class
B   S   E   N
B   S   E   N
Prediction = "N"
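These gains can be checked with the gain sketch above on the six-row toy table:

```python
toy = [dict(zip(["F1", "F2", "F3", "Class"], row)) for row in [
    ("A", "S", "G", "Y"), ("B", "S", "E", "N"), ("A", "T", "G", "N"),
    ("B", "T", "G", "Y"), ("A", "S", "G", "Y"), ("B", "S", "E", "N"),
]]

for F in ["F1", "F2", "F3"]:
    print(F, round(gain(toy, F), 3))   # F1 0.082, F2 0.0, F3 0.459
```

The maximum gain is on F3, matching the split chosen above.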


Avoiding overfitting – pruning
• Overfitting quite likely (particularly when there are only a few data
points at a leaf)
• Correct with pruning: replace a subtree with a leaf node with the most
common label from the subtree

[Figure: pruning example. A tree rooted at F7 has a leaf C and a subtree rooted at F8 with leaves labelled B and A; pruning replaces the F8 subtree with a single leaf A, the most common label from that subtree.]

What to prune?
Remember: full set of known data = training data + validation data + testing data

• Use ID3 to build tree T using training data


• Compute the validation error using the validation data
  (plug each validation point into the tree for a prediction):
  • error_V(T) = (number of misclassifications) / (size of the validation set)
• Consider any subtree of T, and prune it to give T'
• If error_V(T') < error_V(T):
  • Replace T with T' and repeat
• Else: keep T and repeat

• Finally, calculate the error on the test set for reporting
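A rough sketch of this procedure on the nested-dict trees used in the earlier sketches, reusing classify and most_common_label from above; the path bookkeeping and helper names are assumptions, not from the slides:

```python
import copy

def validation_error(tree, val_data):
    """error_V(T) = number of misclassifications / size of the validation set.
    Assumes every feature value in val_data also appears as a branch in the tree."""
    wrong = sum(classify(tree, row) != row["Class"] for row in val_data)
    return wrong / len(val_data)

def internal_paths(tree, path=()):
    """Yield the (feature, value) path to every internal node, i.e. every prunable subtree."""
    if isinstance(tree, dict):
        yield path
        feature, branches = next(iter(tree.items()))
        for value, child in branches.items():
            yield from internal_paths(child, path + ((feature, value),))

def replace_at(tree, path, new_node):
    """Return a copy of the tree with the subtree at `path` replaced by `new_node`."""
    tree = copy.deepcopy(tree)
    if not path:
        return new_node
    node = tree
    for feature, value in path[:-1]:
        node = node[feature][value]
    feature, value = path[-1]
    node[feature][value] = new_node
    return tree

def prune(tree, train_data, val_data):
    """Keep accepting single prunes (subtree -> majority-label leaf) that lower validation error."""
    improved = True
    while improved:
        improved = False
        for path in list(internal_paths(tree)):
            subset = train_data
            for feature, value in path:                 # training examples reaching this subtree
                subset = [row for row in subset if row[feature] == value]
            candidate = replace_at(tree, path, most_common_label(subset))
            if validation_error(candidate, val_data) < validation_error(tree, val_data):
                tree, improved = candidate, True
                break
    return tree
```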


Continuous variables
• What if the attributes take continuous values?
F1    F2    Target
0.83  0.56  A
0.12  0.22  C
0.68  0.19  B
0.79  0.34  A
⋮     ⋮     ⋮

• For each attribute, ask questions like: F1 < 0.6? F2 < 0.3?
• The values 0.6 and 0.3 here are called split-points. How do we choose them?
  • Choose possible points over an interval, e.g. 0.1, 0.2, …, 0.9
  • Calculate the gain of each of these options
    • E.g. Is F1 < 0.1? Is F1 < 0.2? Is F1 < 0.3? …
  • Choose the attribute/split-point with maximum gain (a sketch follows below)
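A sketch of this search for one continuous attribute, reusing class_entropy from the gain sketch. The candidate grid 0.1, 0.2, …, 0.9 follows the slide; the function names and the "Class" key for the target are illustrative assumptions:

```python
def split_gain(data, F, t):
    """Gain of the binary question 'F < t ?' over the data."""
    left = [row for row in data if row[F] < t]
    right = [row for row in data if row[F] >= t]
    if not left or not right:          # a split that separates nothing gains nothing
        return 0.0
    weighted = (len(left) * class_entropy(left)
                + len(right) * class_entropy(right)) / len(data)
    return class_entropy(data) - weighted

def best_split_point(data, F, candidates=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Try 'F < 0.1?', 'F < 0.2?', ... and keep the split-point with maximum gain."""
    return max(candidates, key=lambda t: split_gain(data, F, t))

# Using the four rows shown in the table above (targets stored under "Class"):
cont = [{"F1": 0.83, "F2": 0.56, "Class": "A"},
        {"F1": 0.12, "F2": 0.22, "Class": "C"},
        {"F1": 0.68, "F2": 0.19, "Class": "B"},
        {"F1": 0.79, "F2": 0.34, "Class": "A"}]
print(best_split_point(cont, "F1"))   # 0.7 on this four-row snippet
```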
Continuous variables – example
For two features, we may have:
[Scatter plot in the F1–F2 plane: points labelled B where F1 < 0.4, points labelled A where F1 ≥ 0.4 and F2 ≥ 0.6, and points labelled C where F1 ≥ 0.4 and F2 < 0.6]

Giving the tree:
• F1 < 0.4?
  • Yes → B
  • No → F2 < 0.6?
    • Yes → C
    • No → A
Regression trees
• What if the target takes continuous values?
• This is regression
F1  F2  …  Fm  target
⋮   ⋮       ⋮   2.7
⋮   ⋮       ⋮   2.35
⋮   ⋮       ⋮   1.98
⋮   ⋮       ⋮   -0.33
⋮   ⋮       ⋮   -1.05
⋮   ⋮       ⋮   0.77
⋮   ⋮       ⋮   0.5

• Entropy doesn't work here!
Sum of squares error
• Instead of entropy, use the sum of squares error:
SoSe(D)

• First, calculate the mean of the N targets t_i:
  • mean = μ = (1/N) Σ_{i=1}^{N} t_i

• Then:
  • SoSe(D) = (1/N) Σ_{i=1}^{N} (t_i − μ)²

Note: this error is ≥ 0. It is minimised when all the targets are at the mean, and increases as they become more spread out.
SoSe example
(Using the target column from the table above: 2.7, 2.35, 1.98, -0.33, -1.05, 0.77, 0.5)

• Mean:
  • μ = (1/7)(2.7 + 2.35 + 1.98 − 0.33 − 1.05 + 0.77 + 0.5) = 0.989

• Then, SoSe(D) =
  (2.7 − μ)² + (2.35 − μ)² + (1.98 − μ)² + (−0.33 − μ)²
  + (−1.05 − μ)² + (0.77 − μ)² + (0.5 − μ)² = 11.95
  • (and then divide by N)
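These numbers can be reproduced in a couple of lines:

```python
targets = [2.7, 2.35, 1.98, -0.33, -1.05, 0.77, 0.5]
mu = sum(targets) / len(targets)                  # the mean of the 7 targets
total = sum((t - mu) ** 2 for t in targets)       # sum of squared deviations
print(round(mu, 3), round(total, 2), round(total / len(targets), 3))
# 0.989 11.95 1.707
```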
Learning the regression tree
• Same as the ID3 algorithm
• For each attribute F, calculate:
  • SoSe_F(D) = Σ_{f ∈ values of F} |D_f| SoSe(D_f)
• Choose the F which minimises SoSe_F(D)
• Create a branch for each value f of F
• The label given to a leaf node is the average of all the training values at that node (a short sketch follows below)
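A short sketch of this split criterion, assuming rows are dicts with a "target" key (the key and function names are illustrative):

```python
def sose(targets):
    """SoSe: mean squared deviation of the targets from their mean."""
    mu = sum(targets) / len(targets)
    return sum((t - mu) ** 2 for t in targets) / len(targets)

def best_regression_feature(data, features):
    """Pick the feature F minimising SoSe_F(D) = sum_f |D_f| * SoSe(D_f)."""
    def weighted_sose(F):
        total = 0.0
        for f in {row[F] for row in data}:
            branch_targets = [row["target"] for row in data if row[F] == f]
            total += len(branch_targets) * sose(branch_targets)
        return total
    return min(features, key=weighted_sose)

# A leaf's prediction is then the average of the training targets that reach it.
```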
Summary
• Decision trees as rules
• Classifying with a decision tree
• Building a tree: ID3
• Choosing the best feature: entropy and gain
• Avoiding overfitting: pruning
• Continuous variables: split points
• Regression trees: sum of squares error
