Decision trees
Lecture 11
David Sontag
New York University

Slides adapted from Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore
Hypotheses: decision trees f : X → Y
• Each internal node tests an attribute x_i
• One branch for each possible attribute value x_i = v
• Each leaf assigns a class y
• To classify input x: traverse the tree from root to leaf, output the labeled y
Human interpretable!
[Figure: example tree — the root tests Cylinders (3, 4, 5, 6, 8); subtrees test Maker (america, asia, europe) and Horsepower (low, med, high); leaves are labeled good/bad]
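To make the traversal concrete, here is a minimal sketch (my own illustration, not code from the slides) of a tree like the one in the figure as nested dicts; the branch values and leaf labels are hypothetical stand-ins for the figure's:

```python
# Minimal sketch of a decision tree as nested dicts (illustrative labels only).
tree = {
    "attribute": "cylinders",
    "branches": {
        3: "good",
        4: {"attribute": "maker",
            "branches": {"america": "bad", "asia": "good", "europe": "good"}},
        5: "bad",
        6: "bad",
        8: {"attribute": "horsepower",
            "branches": {"low": "bad", "med": "good", "high": "bad"}},
    },
}

def classify(node, x):
    """Traverse from root to leaf; each leaf is a class label (a string)."""
    while isinstance(node, dict):
        node = node["branches"][x[node["attribute"]]]
    return node

print(classify(tree, {"cylinders": 4, "maker": "asia"}))  # -> "good"
```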
Hypothesis space
• How many possible hypotheses?
• What functions can be represented?
[Figure: the same example tree — Cylinders at the root, with Maker and Horsepower subtrees and good/bad leaves]
Expressiveness: what functions can be represented?

Discrete-input, discrete-output case:
– Decision trees can express any function of the input attributes!
– E.g., for Boolean functions, each path from root to leaf gives one row of the truth table.
[Figure from Stuart Russell: truth table for A xor B and the corresponding tree, which tests A at the root and B at each child]
– Could require exponentially many nodes.

Continuous-input, continuous-output case:
– Can approximate any function arbitrarily closely.

Trivially, there is a consistent decision tree for any training set with one path to leaf for each example (unless f is nondeterministic), but it probably won't generalize to new examples. Need some kind of regularization to ensure more compact decision trees.

[Figure: the example tree again; e.g., its "good" region is cyl=3 ∨ (cyl=4 ∧ (maker=asia ∨ maker=europe)) ∨ …]
Learning the simplest decision tree is NP-hard
• Learning the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest '76]
• Resort to a greedy heuristic:
  – Start from empty decision tree
  – Split on next best attribute (feature)
  – Recurse
Key idea: greedily learn trees using recursion
[Figure: take the original dataset and partition it according to the value of the attribute we split on — e.g., records in which cylinders = 4, records in which cylinders = 5, records in which cylinders = 6, records in which cylinders = 8]

Recursive step
[Figure: build a subtree from each partition (the records with cylinders = 4, 5, 6, and 8)]

Second level of tree
[Figure: recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia (similar recursion in the other cases)]

A full tree
Splitting: choosing a good attribute
Would we prefer to split on X1 or X2?

  X1  X2  Y
  T   T   T
  T   F   T
  T   T   T
  T   F   T
  F   T   T
  F   F   F
  F   T   F
  F   F   F

[Figure: splitting on X1 gives leaves with counts Y=t: 4, Y=f: 0 and Y=t: 1, Y=f: 3; splitting on X2 gives Y=t: 3, Y=f: 1 and Y=t: 2, Y=f: 2]

Idea: use counts at leaves to define probability distributions, so we can measure uncertainty!
Measuring uncertainty
• Good split if we are more certain about classification after the split
  – Deterministic good (all true or all false)
  – Uniform distribution bad
  – What about distributions in between?

  P(Y=A) = 1/2   P(Y=B) = 1/4   P(Y=C) = 1/8   P(Y=D) = 1/8

  P(Y=A) = 1/4   P(Y=B) = 1/4   P(Y=C) = 1/4   P(Y=D) = 1/4
Entropy
Entropy H(Y) of a random variable Y:

  H(Y) = -Σ_i P(Y = y_i) log2 P(Y = y_i)

More uncertainty, more entropy!

Information-theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).

[Figure: entropy of a coin flip as a function of the probability of heads]
High, low entropy
• "High entropy"
  – Y is from a uniform-like distribution
  – Flat histogram
  – Values sampled from it are less predictable
• "Low entropy"
  – Y is from a varied (peaks and valleys) distribution
  – Histogram has many lows and highs
  – Values sampled from it are more predictable
(Slide from Vibhav Gogate)
Entropy example

  X1  X2  Y
  T   T   T
  T   F   T
  T   T   T
  T   F   T
  F   T   T
  F   F   F

P(Y=t) = 5/6, P(Y=f) = 1/6
H(Y) = -5/6 log2(5/6) - 1/6 log2(1/6) = 0.65
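A minimal sketch (my own illustration in Python, not from the slides) of estimating H(Y) from label counts, reproducing the 0.65 above:

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_i P(Y=y_i) log2 P(Y=y_i), with probabilities estimated from counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# The slide's example: five Y=t records and one Y=f record.
print(round(entropy(["t", "t", "t", "t", "t", "f"]), 2))  # -> 0.65
```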
Conditional entropy
Conditional entropy H(Y|X) of a random variable Y conditioned on a random variable X:

  H(Y|X) = -Σ_j P(X = x_j) Σ_i P(Y = y_i | X = x_j) log2 P(Y = y_i | X = x_j)

Example: splitting on X1 (same six records as above)

P(X1=t) = 4/6, with Y=t: 4, Y=f: 0
P(X1=f) = 2/6, with Y=t: 1, Y=f: 1

H(Y|X1) = -4/6 (1 log2 1 + 0 log2 0) - 2/6 (1/2 log2 1/2 + 1/2 log2 1/2)
        = 2/6 ≈ 0.33
Information gain
• Decrease in entropy (uncertainty) after splitting:

  IG(X) = H(Y) - H(Y|X)

In our running example:
  IG(X1) = H(Y) - H(Y|X1) = 0.65 - 0.33 = 0.32

IG(X1) > 0, so we prefer the split!
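Continuing the sketch from the entropy example (again my own illustration, in Python), conditional entropy and information gain for the running X1 example:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    """IG(X) = H(Y) - H(Y|X), where H(Y|X) = sum_v P(X=v) H(Y | X=v)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    h_y_given_x = sum(len(g) / len(ys) * entropy(g) for g in groups.values())
    return entropy(ys) - h_y_given_x

# The six records from the running example.
X1 = ["t", "t", "t", "t", "f", "f"]
Y  = ["t", "t", "t", "t", "t", "f"]
print(round(information_gain(X1, Y), 2))  # -> 0.32, i.e. 0.65 - 0.33
```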
Learning decision trees
• Start from empty decision tree
• Split on next best attribute (feature)
  – Use, for example, information gain to select the attribute:
    arg max_i IG(X_i) = arg max_i H(Y) - H(Y|X_i)
• Recurse

When to stop?
First split looks good! But, when do we stop?
Base case one
Don't split a node if all matching records have the same output value

Base case two
Don't split a node if data points are identical on remaining attributes
Base cases: an idea
• Base Case One: If all records in the current data subset have the same output, then don't recurse
• Base Case Two: If all records have exactly the same set of input attributes, then don't recurse
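Putting the greedy recipe and the two base cases together, a compact ID3-style sketch (my own illustration of the idea, not the lecture's code; assumes categorical attributes stored in dicts):

```python
import math
from collections import Counter, defaultdict

def entropy(ys):
    n = len(ys)
    return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())

def info_gain(rows, ys, attr):
    groups = defaultdict(list)
    for row, y in zip(rows, ys):
        groups[row[attr]].append(y)
    return entropy(ys) - sum(len(g) / len(ys) * entropy(g) for g in groups.values())

def learn_tree(rows, ys, attrs):
    # Base case one: all matching records have the same output value.
    if len(set(ys)) == 1:
        return ys[0]
    # Base case two: records are identical on all remaining attributes.
    if not attrs or all(all(r[a] == rows[0][a] for a in attrs) for r in rows):
        return Counter(ys).most_common(1)[0][0]
    # Greedy step: split on the attribute with the highest information gain, then recurse.
    best = max(attrs, key=lambda a: info_gain(rows, ys, a))
    partitions = defaultdict(lambda: ([], []))
    for row, y in zip(rows, ys):
        partitions[row[best]][0].append(row)
        partitions[row[best]][1].append(y)
    return {"attribute": best,
            "branches": {v: learn_tree(r, t, attrs - {best})
                         for v, (r, t) in partitions.items()}}

# Usage on the running example:
rows = [{"X1": "t", "X2": "t"}, {"X1": "t", "X2": "f"}, {"X1": "t", "X2": "t"},
        {"X1": "t", "X2": "f"}, {"X1": "f", "X2": "t"}, {"X1": "f", "X2": "f"}]
ys = ["t", "t", "t", "t", "t", "f"]
print(learn_tree(rows, ys, {"X1", "X2"}))  # splits on X1 first
```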
Proposed Base Case 3:
If all attributes have small information gain, then don't recurse
• This is not a good idea
The problem with proposed case 3
y = a XOR b
The information gains:
[Figure: at the root, splitting on a or on b each gives zero information gain, so proposed case 3 would stop immediately]

If we omit proposed case 3:
y = a XOR b
The resulting decision tree:
[Figure: the tree splits on a and then on b, classifying XOR perfectly]

Instead, perform pruning after building a tree
Decision trees will overfit
• Standard decision trees have no learning bias
  – Training set error is always zero!
    (if there is no label noise)
  – Lots of variance
  – Must introduce some bias towards simpler trees
• Many strategies for picking simpler trees:
  – Fixed depth
  – Minimum number of samples per leaf
• Random forests
Real-valued inputs
What should we do if some of the inputs are real-valued? Infinite number of possible split values!!!

"One branch for each numeric value" idea:
Hopeless: a hypothesis with such a high branching factor will shatter any dataset and overfit
Threshold splits
• Binary tree: split on attribute X at value t
  – One branch: X < t
  – Other branch: X ≥ t
• Requires a small change:
  – Allow repeated splits on the same variable along a path
[Figure: example tree — the root splits on Year at 78, and a deeper node splits on Year again at 70; leaves are labeled good/bad]
The set of possible thresholds
• Binary tree, split on attribute X
  – One branch: X < t
  – Other branch: X ≥ t
• Search through possible values of t
  – Seems hard!!! Infinitely many possible split points
• But only a finite number of t's are important:
  – Sort the data according to X into {x1, …, xm}
  – Consider split points of the form xi + (xi+1 - xi)/2
  – Moreover, only splits between examples from different classes matter!

Optimal splits for continuous attributes (figures from Stuart Russell):
• Infinitely many possible split points c to define the node test Xj > c? No! Moving the split point along the empty space between two observed values has no effect on information gain or empirical loss, so just use the midpoint.
[Figure: data points along attribute Xj with candidate split points c1 and c2 at the midpoints of empty gaps]
• Moreover, only splits between examples from different classes can be optimal for information gain or empirical loss reduction.
[Figure: the same axis, with candidate points c1 and c2 kept only in gaps between differently-labeled examples]
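A small sketch of the candidate-threshold rule above (my illustration; assumes a single numeric attribute with class labels): sort by the attribute and keep only midpoints between consecutive, differently-labeled examples.

```python
def candidate_thresholds(values, labels):
    """Midpoints x_i + (x_{i+1} - x_i)/2, kept only where the class label changes."""
    pairs = sorted(zip(values, labels))
    return [(x1 + x2) / 2
            for (x1, y1), (x2, y2) in zip(pairs, pairs[1:])
            if y1 != y2 and x1 != x2]

print(candidate_thresholds([72, 69, 78, 80, 75], ["bad", "bad", "good", "good", "bad"]))
# -> [76.5]: the only boundary between a "bad" example and a "good" one
```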
Picking the best threshold
• Suppose X is real-valued with threshold t
• Want IG(Y | X:t), the information gain for Y when testing if X is greater than or less than t
• Define:
  – H(Y | X:t) = p(X < t) H(Y | X < t) + p(X ≥ t) H(Y | X ≥ t)
  – IG(Y | X:t) = H(Y) - H(Y | X:t)
  – IG*(Y | X) = max_t IG(Y | X:t)
• Use IG*(Y | X) for continuous variables
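A sketch of IG*(Y|X) (again my own illustration, in Python): evaluate IG(Y|X:t) at each candidate midpoint and keep the maximum.

```python
import math
from collections import Counter

def entropy(ys):
    n = len(ys)
    return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())

def best_threshold(xs, ys):
    """Return (t*, IG*(Y|X)), where IG(Y|X:t) = H(Y) - [p(X<t)H(Y|X<t) + p(X>=t)H(Y|X>=t)]."""
    pairs = sorted(zip(xs, ys))
    h_y, n = entropy(ys), len(ys)
    best_t, best_ig = None, 0.0
    for (x1, y1), (x2, y2) in zip(pairs, pairs[1:]):
        if y1 == y2 or x1 == x2:
            continue  # only midpoints between differently-labeled examples can be optimal
        t = (x1 + x2) / 2
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        ig = h_y - (len(left) / n * entropy(left) + len(right) / n * entropy(right))
        if ig > best_ig:
            best_t, best_ig = t, ig
    return best_t, best_ig

# Hypothetical Year-like attribute: the best split is at 76.5 and makes both sides pure.
print(best_threshold([69, 72, 75, 78, 80], ["bad", "bad", "bad", "good", "good"]))
```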
What you need to know about decision trees
• Decision trees are one of the most popular ML tools
  – Easy to understand, implement, and use
  – Computationally cheap (to solve heuristically)
• Information gain to select attributes (ID3, C4.5, …)
• Presented for classification, but can be used for regression and density estimation too
• Decision trees will overfit!!!
  – Must use tricks to find "simple trees", e.g.,
    • Fixed depth / early stopping
    • Pruning
  – Or, use ensembles of different trees (random forests)
Ensemble learning
Slides adapted from Navneet Goyal, Tan, Steinbach, Kumar, Vibhav Gogate

Ensemble methods
[Figure: a machine learning competition with a $1 million prize]
Bias/variance tradeoff
[Figure from Hastie, Tibshirani, and Friedman, "The Elements of Statistical Learning," 2001]
Reduce variance without increasing bias
• Averaging reduces variance (when predictions are independent):

  Var( (1/N) Σ_i X_i ) = Var(X) / N

Average models to reduce model variance.
One problem: there is only one training set. Where do multiple models come from?
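A quick numerical check of the Var/N claim with synthetic, independent predictions (a sketch, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_trials = 25, 100_000

# Independent, unit-variance "predictions" from 25 hypothetical models.
preds = rng.normal(size=(n_trials, n_models))

print(preds[:, 0].var())         # ~1.0: variance of a single model
print(preds.mean(axis=1).var())  # ~1/25 = 0.04: variance of the averaged prediction
```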
Bagging: bootstrap aggregation
• Leo Breiman (1994)
• Take repeated bootstrap samples from training set D
• Bootstrap sampling: given a set D containing N training examples, create D' by drawing N examples at random with replacement from D.
• Bagging:
  – Create k bootstrap samples D1 … Dk.
  – Train a distinct classifier on each Di.
  – Classify a new instance by majority vote / average.
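A minimal bagging sketch (my illustration; assumes NumPy arrays, integer class labels, scikit-learn available, and a decision tree as the base classifier):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, seed=0):
    """Train k trees, each on a bootstrap sample (N draws with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    return [DecisionTreeClassifier().fit(X[idx], y[idx])
            for idx in (rng.integers(0, n, size=n) for _ in range(k))]

def bagging_predict(models, X):
    """Classify each new instance by majority vote over the k trees."""
    votes = np.stack([m.predict(X) for m in models])          # shape (k, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Usage sketch: models = bagging_fit(X_train, y_train); y_hat = bagging_predict(models, X_test)
```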
General idea
[Figure: create multiple bootstrapped datasets from the original training data, build a classifier on each, then combine their predictions by voting]

Example of bagging
• Sampling with replacement
[Table: training data IDs selected in each bootstrap round]
• Build a classifier on each bootstrap sample
• Each data point has probability (1 - 1/n)^n of never being selected, so it can serve as test data (as n grows, (1 - 1/n)^n → 1/e ≈ 0.368)
• Training data = 1 - (1 - 1/n)^n ≈ 0.632 of the original data
[Figure: a decision tree learning algorithm, very similar to ID3]
[Figure: shades of blue/red indicate the strength of the vote for a particular classification]
Random forests
• Ensemble method specifically designed for decision tree classifiers
• Introduce two sources of randomness: "bagging" and "random input vectors"
  – Bagging method: each tree is grown using a bootstrap sample of the training data
  – Random vector method: at each node, the best split is chosen from a random sample of m attributes instead of all attributes
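For reference, both sources of randomness correspond to constructor arguments in scikit-learn's RandomForestClassifier (assuming the library is available; the dataset below is synthetic and only exercises the API):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees, each grown on a bootstrap sample ("bagging")
    max_features="sqrt",  # m = sqrt(#attributes) candidates per node ("random input vectors")
    random_state=0,
).fit(X, y)

print(forest.score(X, y))  # training accuracy of the ensemble
```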
Random forests
[Figure]

Random forests algorithm
[Figure]