Decision trees
Lecture 11
David Sontag
New York University

Slides adapted from Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore
Hypotheses: decision trees f : X → Y
• Each internal node tests an attribute x_i
• One branch for each possible attribute value x_i = v
• Each leaf assigns a class y
• To classify input x: traverse the tree from root to leaf, output the labeled y
Human interpretable!
[Figure: example tree — the root tests Cylinders (3, 4, 5, 6, 8); subtrees test Maker (america, asia, europe) and Horsepower (low, med, high); leaves are labeled good/bad]
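To make the traversal concrete, here is a minimal sketch (my own illustration, not code from the slides) of a tree like the one in the figure as nested dicts; the branch values and leaf labels are hypothetical stand-ins for the figure's:

```python
# Minimal sketch of a decision tree as nested dicts (illustrative labels only).
tree = {
    "attribute": "cylinders",
    "branches": {
        3: "good",
        4: {"attribute": "maker",
            "branches": {"america": "bad", "asia": "good", "europe": "good"}},
        5: "bad",
        6: "bad",
        8: {"attribute": "horsepower",
            "branches": {"low": "bad", "med": "good", "high": "bad"}},
    },
}

def classify(node, x):
    """Traverse from root to leaf; each leaf is a class label (a string)."""
    while isinstance(node, dict):
        node = node["branches"][x[node["attribute"]]]
    return node

print(classify(tree, {"cylinders": 4, "maker": "asia"}))  # -> "good"
```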
Hypothesis space
• How many possible hypotheses?
• What functions can be represented?
[Figure: the same example tree — Cylinders at the root, with Maker and Horsepower subtrees and good/bad leaves]
Expressiveness: what functions can be represented?

Discrete-input, discrete-output case:
– Decision trees can express any function of the input attributes!
– E.g., for Boolean functions, each path from root to leaf gives one row of the truth table.
[Figure from Stuart Russell: truth table for A xor B and the corresponding tree, which tests A at the root and B at each child]
– Could require exponentially many nodes.

Continuous-input, continuous-output case:
– Can approximate any function arbitrarily closely.

Trivially, there is a consistent decision tree for any training set with one path to leaf for each example (unless f is nondeterministic), but it probably won't generalize to new examples. Need some kind of regularization to ensure more compact decision trees.

[Figure: the example tree again; e.g., its "good" region is cyl=3 ∨ (cyl=4 ∧ (maker=asia ∨ maker=europe)) ∨ …]
Learning the simplest decision tree is NP-hard
• Learning the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest '76]
• Resort to a greedy heuristic:
  – Start from empty decision tree
  – Split on next best attribute (feature)
  – Recurse
Key idea: greedily learn trees using recursion
[Figure: take the original dataset and partition it according to the value of the attribute we split on — e.g., records in which cylinders = 4, records in which cylinders = 5, records in which cylinders = 6, records in which cylinders = 8]

Recursive step
[Figure: build a subtree from each partition (the records with cylinders = 4, 5, 6, and 8)]

Second level of tree
[Figure: recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia (similar recursion in the other cases)]

A full tree
Splitting: choosing a good attribute
Would we prefer to split on X1 or X2?

  X1  X2  Y
  T   T   T
  T   F   T
  T   T   T
  T   F   T
  F   T   T
  F   F   F
  F   T   F
  F   F   F

[Figure: splitting on X1 gives leaves with counts Y=t: 4, Y=f: 0 and Y=t: 1, Y=f: 3; splitting on X2 gives Y=t: 3, Y=f: 1 and Y=t: 2, Y=f: 2]

Idea: use counts at leaves to define probability distributions, so we can measure uncertainty!
Measuring uncertainty
• Good split if we are more certain about classification after the split
  – Deterministic good (all true or all false)
  – Uniform distribution bad
  – What about distributions in between?

  P(Y=A) = 1/2   P(Y=B) = 1/4   P(Y=C) = 1/8   P(Y=D) = 1/8

  P(Y=A) = 1/4   P(Y=B) = 1/4   P(Y=C) = 1/4   P(Y=D) = 1/4
Entropy
Entropy H(Y) of a random variable Y:

  H(Y) = -Σ_i P(Y = y_i) log2 P(Y = y_i)

More uncertainty, more entropy!

Information-theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).

[Figure: entropy of a coin flip as a function of the probability of heads]
High, low entropy
• "High entropy"
  – Y is from a uniform-like distribution
  – Flat histogram
  – Values sampled from it are less predictable
• "Low entropy"
  – Y is from a varied (peaks and valleys) distribution
  – Histogram has many lows and highs
  – Values sampled from it are more predictable
(Slide from Vibhav Gogate)
Entropy example

  X1  X2  Y
  T   T   T
  T   F   T
  T   T   T
  T   F   T
  F   T   T
  F   F   F

P(Y=t) = 5/6, P(Y=f) = 1/6
H(Y) = -5/6 log2(5/6) - 1/6 log2(1/6) = 0.65
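A minimal sketch (my own illustration in Python, not from the slides) of estimating H(Y) from label counts, reproducing the 0.65 above:

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_i P(Y=y_i) log2 P(Y=y_i), with probabilities estimated from counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# The slide's example: five Y=t records and one Y=f record.
print(round(entropy(["t", "t", "t", "t", "t", "f"]), 2))  # -> 0.65
```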
Conditional entropy
Conditional entropy H(Y|X) of a random variable Y conditioned on a random variable X:

  H(Y|X) = -Σ_j P(X = x_j) Σ_i P(Y = y_i | X = x_j) log2 P(Y = y_i | X = x_j)

Example: splitting on X1 (same six records as above)

P(X1=t) = 4/6, with Y=t: 4, Y=f: 0
P(X1=f) = 2/6, with Y=t: 1, Y=f: 1

H(Y|X1) = -4/6 (1 log2 1 + 0 log2 0) - 2/6 (1/2 log2 1/2 + 1/2 log2 1/2)
        = 2/6 ≈ 0.33
Information gain
• Decrease in entropy (uncertainty) after splitting:

  IG(X) = H(Y) - H(Y|X)

In our running example:
  IG(X1) = H(Y) - H(Y|X1) = 0.65 - 0.33 = 0.32

IG(X1) > 0, so we prefer the split!
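Continuing the sketch from the entropy example (again my own illustration, in Python), conditional entropy and information gain for the running X1 example:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    """IG(X) = H(Y) - H(Y|X), where H(Y|X) = sum_v P(X=v) H(Y | X=v)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    h_y_given_x = sum(len(g) / len(ys) * entropy(g) for g in groups.values())
    return entropy(ys) - h_y_given_x

# The six records from the running example.
X1 = ["t", "t", "t", "t", "f", "f"]
Y  = ["t", "t", "t", "t", "t", "f"]
print(round(information_gain(X1, Y), 2))  # -> 0.32, i.e. 0.65 - 0.33
```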
Learning decision trees
• Start from empty decision tree
• Split on next best attribute (feature)
  – Use, for example, information gain to select the attribute:
    arg max_i IG(X_i) = arg max_i H(Y) - H(Y|X_i)
• Recurse

When to stop?
First split looks good! But, when do we stop?
Base case one
Don't split a node if all matching records have the same output value

Base case two
Don't split a node if data points are identical on remaining attributes
Base cases: an idea
• Base Case One: If all records in the current data subset have the same output, then don't recurse
• Base Case Two: If all records have exactly the same set of input attributes, then don't recurse
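Putting the greedy recipe and the two base cases together, a compact ID3-style sketch (my own illustration of the idea, not the lecture's code; assumes categorical attributes stored in dicts):

```python
import math
from collections import Counter, defaultdict

def entropy(ys):
    n = len(ys)
    return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())

def info_gain(rows, ys, attr):
    groups = defaultdict(list)
    for row, y in zip(rows, ys):
        groups[row[attr]].append(y)
    return entropy(ys) - sum(len(g) / len(ys) * entropy(g) for g in groups.values())

def learn_tree(rows, ys, attrs):
    # Base case one: all matching records have the same output value.
    if len(set(ys)) == 1:
        return ys[0]
    # Base case two: records are identical on all remaining attributes.
    if not attrs or all(all(r[a] == rows[0][a] for a in attrs) for r in rows):
        return Counter(ys).most_common(1)[0][0]
    # Greedy step: split on the attribute with the highest information gain, then recurse.
    best = max(attrs, key=lambda a: info_gain(rows, ys, a))
    partitions = defaultdict(lambda: ([], []))
    for row, y in zip(rows, ys):
        partitions[row[best]][0].append(row)
        partitions[row[best]][1].append(y)
    return {"attribute": best,
            "branches": {v: learn_tree(r, t, attrs - {best})
                         for v, (r, t) in partitions.items()}}

# Usage on the running example:
rows = [{"X1": "t", "X2": "t"}, {"X1": "t", "X2": "f"}, {"X1": "t", "X2": "t"},
        {"X1": "t", "X2": "f"}, {"X1": "f", "X2": "t"}, {"X1": "f", "X2": "f"}]
ys = ["t", "t", "t", "t", "t", "f"]
print(learn_tree(rows, ys, {"X1", "X2"}))  # splits on X1 first
```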
Proposed Base Case 3:
If all attributes have small information gain, then don't recurse
• This is not a good idea
The problem with proposed case 3
y = a XOR b
The information gains:
[Figure: at the root, splitting on a or on b each gives zero information gain, so proposed case 3 would stop immediately]

If we omit proposed case 3:
y = a XOR b
The resulting decision tree:
[Figure: the tree splits on a and then on b, classifying XOR perfectly]

Instead, perform pruning after building a tree
Decision trees will overfit
• Standard decision trees have no learning bias
  – Training set error is always zero!
    (if there is no label noise)
  – Lots of variance
  – Must introduce some bias towards simpler trees
• Many strategies for picking simpler trees:
  – Fixed depth
  – Minimum number of samples per leaf
• Random forests
Real-valued inputs
What should we do if some of the inputs are real-valued? Infinite number of possible split values!!!

"One branch for each numeric value" idea:
Hopeless: a hypothesis with such a high branching factor will shatter any dataset and overfit
Threshold splits
• Binary tree: split on attribute X at value t
  – One branch: X < t
  – Other branch: X ≥ t
• Requires a small change:
  – Allow repeated splits on the same variable along a path
[Figure: example tree — the root splits on Year at 78, and a deeper node splits on Year again at 70; leaves are labeled good/bad]
The set of possible thresholds
• Binary tree, split on attribute X
  – One branch: X < t
  – Other branch: X ≥ t
• Search through possible values of t
  – Seems hard!!! Infinitely many possible split points
• But only a finite number of t's are important:
  – Sort the data according to X into {x1, …, xm}
  – Consider split points of the form xi + (xi+1 - xi)/2
  – Moreover, only splits between examples from different classes matter!

Optimal splits for continuous attributes (figures from Stuart Russell):
• Infinitely many possible split points c to define the node test Xj > c? No! Moving the split point along the empty space between two observed values has no effect on information gain or empirical loss, so just use the midpoint.
[Figure: data points along attribute Xj with candidate split points c1 and c2 at the midpoints of empty gaps]
• Moreover, only splits between examples from different classes can be optimal for information gain or empirical loss reduction.
[Figure: the same axis, with candidate points c1 and c2 kept only in gaps between differently-labeled examples]
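A small sketch of the candidate-threshold rule above (my illustration; assumes a single numeric attribute with class labels): sort by the attribute and keep only midpoints between consecutive, differently-labeled examples.

```python
def candidate_thresholds(values, labels):
    """Midpoints x_i + (x_{i+1} - x_i)/2, kept only where the class label changes."""
    pairs = sorted(zip(values, labels))
    return [(x1 + x2) / 2
            for (x1, y1), (x2, y2) in zip(pairs, pairs[1:])
            if y1 != y2 and x1 != x2]

print(candidate_thresholds([72, 69, 78, 80, 75], ["bad", "bad", "good", "good", "bad"]))
# -> [76.5]: the only boundary between a "bad" example and a "good" one
```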
Picking the best threshold
• Suppose X is real-valued with threshold t
• Want IG(Y | X:t), the information gain for Y when testing if X is greater than or less than t
• Define:
  – H(Y | X:t) = p(X < t) H(Y | X < t) + p(X ≥ t) H(Y | X ≥ t)
  – IG(Y | X:t) = H(Y) - H(Y | X:t)
  – IG*(Y | X) = max_t IG(Y | X:t)
• Use IG*(Y | X) for continuous variables
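A sketch of IG*(Y|X) (again my own illustration, in Python): evaluate IG(Y|X:t) at each candidate midpoint and keep the maximum.

```python
import math
from collections import Counter

def entropy(ys):
    n = len(ys)
    return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())

def best_threshold(xs, ys):
    """Return (t*, IG*(Y|X)), where IG(Y|X:t) = H(Y) - [p(X<t)H(Y|X<t) + p(X>=t)H(Y|X>=t)]."""
    pairs = sorted(zip(xs, ys))
    h_y, n = entropy(ys), len(ys)
    best_t, best_ig = None, 0.0
    for (x1, y1), (x2, y2) in zip(pairs, pairs[1:]):
        if y1 == y2 or x1 == x2:
            continue  # only midpoints between differently-labeled examples can be optimal
        t = (x1 + x2) / 2
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        ig = h_y - (len(left) / n * entropy(left) + len(right) / n * entropy(right))
        if ig > best_ig:
            best_t, best_ig = t, ig
    return best_t, best_ig

# Hypothetical Year-like attribute: the best split is at 76.5 and makes both sides pure.
print(best_threshold([69, 72, 75, 78, 80], ["bad", "bad", "bad", "good", "good"]))
```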
What you need to know about decision trees
• Decision trees are one of the most popular ML tools
  – Easy to understand, implement, and use
  – Computationally cheap (to solve heuristically)
• Information gain to select attributes (ID3, C4.5, …)
• Presented for classification, but can be used for regression and density estimation too
• Decision trees will overfit!!!
  – Must use tricks to find "simple trees", e.g.,
    • Fixed depth / early stopping
    • Pruning
  – Or, use ensembles of different trees (random forests)
Ensemble learning
Slides adapted from Navneet Goyal, Tan, Steinbach, Kumar, Vibhav Gogate

Ensemble methods
[Figure: a machine learning competition with a $1 million prize]
Bias/variance tradeoff
[Figure from Hastie, Tibshirani, and Friedman, "The Elements of Statistical Learning," 2001]
Reduce variance without increasing bias
• Averaging reduces variance (when predictions are independent):

  Var( (1/N) Σ_i X_i ) = Var(X) / N

Average models to reduce model variance.
One problem: there is only one training set. Where do multiple models come from?
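A quick numerical check of the Var/N claim with synthetic, independent predictions (a sketch, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_trials = 25, 100_000

# Independent, unit-variance "predictions" from 25 hypothetical models.
preds = rng.normal(size=(n_trials, n_models))

print(preds[:, 0].var())         # ~1.0: variance of a single model
print(preds.mean(axis=1).var())  # ~1/25 = 0.04: variance of the averaged prediction
```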
Bagging: bootstrap aggregation
• Leo Breiman (1994)
• Take repeated bootstrap samples from training set D
• Bootstrap sampling: given a set D containing N training examples, create D' by drawing N examples at random with replacement from D.
• Bagging:
  – Create k bootstrap samples D1 … Dk.
  – Train a distinct classifier on each Di.
  – Classify a new instance by majority vote / average.
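A minimal bagging sketch (my illustration; assumes NumPy arrays, integer class labels, scikit-learn available, and a decision tree as the base classifier):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, seed=0):
    """Train k trees, each on a bootstrap sample (N draws with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    return [DecisionTreeClassifier().fit(X[idx], y[idx])
            for idx in (rng.integers(0, n, size=n) for _ in range(k))]

def bagging_predict(models, X):
    """Classify each new instance by majority vote over the k trees."""
    votes = np.stack([m.predict(X) for m in models])          # shape (k, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Usage sketch: models = bagging_fit(X_train, y_train); y_hat = bagging_predict(models, X_test)
```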
General idea
[Figure: create multiple bootstrapped datasets from the original training data, build a classifier on each, then combine their predictions by voting]

Example of bagging
• Sampling with replacement
[Table: training data IDs selected in each bootstrap round]
• Build a classifier on each bootstrap sample
• Each data point has probability (1 - 1/n)^n of never being selected, so it can serve as test data (as n grows, (1 - 1/n)^n → 1/e ≈ 0.368)
• Training data = 1 - (1 - 1/n)^n ≈ 0.632 of the original data
[Figure: a decision tree learning algorithm, very similar to ID3]
[Figure: shades of blue/red indicate the strength of the vote for a particular classification]
Random forests
• Ensemble method specifically designed for decision tree classifiers
• Introduce two sources of randomness: "bagging" and "random input vectors"
  – Bagging method: each tree is grown using a bootstrap sample of the training data
  – Random vector method: at each node, the best split is chosen from a random sample of m attributes instead of all attributes
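For reference, both sources of randomness correspond to constructor arguments in scikit-learn's RandomForestClassifier (assuming the library is available; the dataset below is synthetic and only exercises the API):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees, each grown on a bootstrap sample ("bagging")
    max_features="sqrt",  # m = sqrt(#attributes) candidates per node ("random input vectors")
    random_state=0,
).fit(X, y)

print(forest.score(X, y))  # training accuracy of the ensemble
```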
Random forests
[Figure]

Random forests algorithm
[Figure]