
Decision trees
Lecture 11

David Sontag
New York University

Slides adapted from Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore

Hypotheses: decision trees f : X → Y

• Each internal node tests an attribute x_i
• One branch for each possible attribute value x_i = v
• Each leaf assigns a class y
• To classify input x: traverse the tree from root to leaf, output the labeled y (a small sketch of this traversal follows below)

[Example tree: Cylinders? 3 → good; 4 → Maker? (america → bad, asia → good, europe → good); 5 → bad; 6 → bad; 8 → Horsepower? (low → bad, med → good, high → bad)]

Human interpretable!

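To make the traversal concrete, here is a minimal Python sketch (not from the slides) that encodes the example tree above as nested tuples and dicts; the attribute names, branch values, and leaf labels come from the figure, while `tree` and `classify` are our own names:

# Minimal sketch: the example tree above as nested tuples/dicts.
# Internal nodes are ("attribute", {branch_value: subtree}); leaves are class labels.
tree = ("cylinders", {
    3: "good",
    4: ("maker", {"america": "bad", "asia": "good", "europe": "good"}),
    5: "bad",
    6: "bad",
    8: ("horsepower", {"low": "bad", "med": "good", "high": "bad"}),
})

def classify(node, x):
    """Traverse from root to leaf, following the branch matching x's value for the tested attribute."""
    while isinstance(node, tuple):       # internal node: test an attribute
        attribute, branches = node
        node = branches[x[attribute]]    # follow the branch for this attribute value
    return node                          # leaf: the predicted class y

print(classify(tree, {"cylinders": 4, "maker": "asia", "horsepower": "med"}))  # -> good
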
Hypothesis space

• How many possible hypotheses?
• What functions can be represented?

[Same example tree as above: Cylinders? 3 → good; 4 → Maker? (…); 5 → bad; 6 → bad; 8 → Horsepower? (…)]

Expressiveness

What functions can be represented?

• Decision trees can represent any function of the input attributes!
• For Boolean functions, a path to a leaf gives a truth table row
• Could require exponentially many nodes

Discrete-input, discrete-output case:
– Decision trees can express any function of the input attributes.
– E.g., for Boolean functions, truth table row → path to leaf:

  A   B   A xor B
  F   F   F
  F   T   T
  T   F   T
  T   T   F

  [Tree: test A at the root, test B on each branch; leaves F, T, T, F]
  (Figure from Stuart Russell)

Continuous-input, continuous-output case:
– Can approximate any function arbitrarily closely.

Trivially, there is a consistent decision tree for any training set with one path to leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples. Need some kind of regularization to ensure more compact decision trees.

[Example tree again: Cylinders? 3 → good; 4 → Maker? (america → bad, asia → good, europe → good); 5 → bad; 6 → bad; 8 → Horsepower? (low → bad, med → good, high → bad)]

cyl=3 ∨ (cyl=4 ∧ (maker=asia ∨ maker=europe)) ∨ …


Learning the simplest decision tree is NP-hard

• Learning the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest '76]
• Resort to a greedy heuristic:
  – Start from an empty decision tree
  – Split on the next best attribute (feature)
  – Recurse

Key idea: Greedily learn trees using recursion

Take the original dataset and partition it according to the value of the attribute we split on:
  – records in which cylinders = 4
  – records in which cylinders = 5
  – records in which cylinders = 6
  – records in which cylinders = 8

Recursive Step

Build a tree from each partition:
  – records in which cylinders = 4
  – records in which cylinders = 5
  – records in which cylinders = 6
  – records in which cylinders = 8

Second level of the tree

Recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia. (Similar recursion in the other cases.)

A full tree

Splitting: choosing a good attribute

Would we prefer to split on X1 or X2?

  X1  X2  Y
  T   T   T
  T   F   T
  T   T   T
  T   F   T
  F   T   T
  F   F   F
  F   T   F
  F   F   F

Split on X1:  X1=t → Y=t: 4, Y=f: 0    X1=f → Y=t: 1, Y=f: 3
Split on X2:  X2=t → Y=t: 3, Y=f: 1    X2=f → Y=t: 2, Y=f: 2

Idea: use counts at leaves to define probability distributions, so we can measure uncertainty!

Measuring uncertainty

• Good split if we are more certain about the classification after the split
  – Deterministic: good (all true or all false)
  – Uniform distribution: bad
  – What about distributions in between?

  P(Y=A) = 1/2    P(Y=B) = 1/4    P(Y=C) = 1/8    P(Y=D) = 1/8

  P(Y=A) = 1/4    P(Y=B) = 1/4    P(Y=C) = 1/4    P(Y=D) = 1/4

Entropy

Entropy H(Y) of a random variable Y:

  H(Y) = - Σ_i P(Y = y_i) log2 P(Y = y_i)

[Plot: entropy of a coin flip vs. probability of heads; 0 at p = 0 or p = 1, maximum of 1 bit at p = 1/2]

More uncertainty, more entropy!

Information-theoretic interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).

High, Low Entropy

• "High entropy"
  – Y is from a uniform-like distribution
  – Flat histogram
  – Values sampled from it are less predictable
• "Low entropy"
  – Y is from a varied (peaks and valleys) distribution
  – Histogram has many lows and highs
  – Values sampled from it are more predictable

(Slide from Vibhav Gogate)

Entropy Example

  X1  X2  Y
  T   T   T
  T   F   T
  T   T   T
  T   F   T
  F   T   T
  F   F   F

P(Y=t) = 5/6
P(Y=f) = 1/6

H(Y) = - 5/6 log2 5/6 - 1/6 log2 1/6
     = 0.65

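As a quick numerical check of the entropy values on these slides, a minimal Python sketch (the helper name `entropy` is ours, not something from the lecture):

import math

def entropy(probs):
    """H(Y) = -sum_i p_i log2 p_i, with the convention 0 log2 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([5/6, 1/6]))            # ~0.65, the example above
print(entropy([0.5, 0.5]))            # 1.0 bit: fair coin flip, maximum uncertainty
print(entropy([1.0]))                 # 0.0 bits: deterministic outcome
print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75 bits: the skewed distribution shown earlier
print(entropy([1/4, 1/4, 1/4, 1/4]))  # 2.0 bits: the uniform distribution
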
Conditional Entropy

Conditional entropy H(Y|X) of a random variable Y conditioned on a random variable X:

  H(Y|X) = - Σ_j P(X = x_j) Σ_i P(Y = y_i | X = x_j) log2 P(Y = y_i | X = x_j)

Example (splitting on X1, same table as above):

  P(X1=t) = 4/6  →  Y=t: 4, Y=f: 0
  P(X1=f) = 2/6  →  Y=t: 1, Y=f: 1

H(Y|X1) = - 4/6 (1 log2 1 + 0 log2 0)
          - 2/6 (1/2 log2 1/2 + 1/2 log2 1/2)
        = 2/6

Information gain

• Decrease in entropy (uncertainty) after splitting:

  IG(X) = H(Y) - H(Y|X)

In our running example:

  IG(X1) = H(Y) - H(Y|X1)
         = 0.65 - 0.33 = 0.32

IG(X1) > 0 → we prefer the split! (A worked computation follows below.)

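A worked version of the conditional entropy and information gain computation on the six-row running example, as a self-contained Python sketch (the function names are ours):

import math
from collections import Counter

data = [("T", "T", "T"), ("T", "F", "T"), ("T", "T", "T"),
        ("T", "F", "T"), ("F", "T", "T"), ("F", "F", "F")]   # rows of (X1, X2, Y)

def H(labels):
    """Entropy of the empirical distribution of a list of labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def H_given(rows, attr):
    """Conditional entropy H(Y|X_attr) = sum_v P(X=v) * H(Y | X=v)."""
    n = len(rows)
    values = set(r[attr] for r in rows)
    return sum(len(sub) / n * H([r[-1] for r in sub])
               for sub in ([r for r in rows if r[attr] == v] for v in values))

y = [r[-1] for r in data]
print(round(H(y), 2))                        # 0.65
print(round(H_given(data, 0), 2))            # 0.33 (= 2/6)
print(round(H(y) - H_given(data, 0), 2))     # IG(X1) = 0.32 > 0, so we prefer the split
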
Learning decision trees

• Start from an empty decision tree
• Split on the next best attribute (feature)
  – Use, for example, information gain to select the attribute:

    argmax_i IG(X_i) = argmax_i [ H(Y) - H(Y|X_i) ]

• Recurse

When to stop?

The first split looks good! But when do we stop?


Base Case One

Don't split a node if all matching records have the same output value.

Base Case Two

Don't split a node if the data points are identical on the remaining attributes.

Base Cases: An idea

• Base Case One: if all records in the current data subset have the same output, then don't recurse
• Base Case Two: if all records have exactly the same set of input attributes, then don't recurse

(A sketch of the greedy recursion with these base cases follows below.)

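Putting the greedy split selection and these two base cases together, here is a minimal sketch of the recursive learner described on these slides (an ID3-style heuristic; `build_tree` and `info_gain` are our names, and this is an illustration, not the lecture's reference code):

import math
from collections import Counter

def H(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    """IG(X_attr) = H(Y) - H(Y|X_attr) on the given subset of rows."""
    n = len(rows)
    remainder = sum(len(sub) / n * H([r[-1] for r in sub])
                    for sub in ([r for r in rows if r[attr] == v]
                                for v in set(r[attr] for r in rows)))
    return H([r[-1] for r in rows]) - remainder

def build_tree(rows, attrs):
    labels = [r[-1] for r in rows]
    # Base case one: all matching records have the same output value.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case two: records are identical on all remaining attributes (or none are left).
    if all(len(set(r[a] for r in rows)) == 1 for a in attrs):
        return Counter(labels).most_common(1)[0][0]          # predict the majority class
    best = max(attrs, key=lambda a: info_gain(rows, a))      # greedy: next best attribute
    branches = {v: build_tree([r for r in rows if r[best] == v],
                              [a for a in attrs if a != best])
                for v in set(r[best] for r in rows)}
    return (best, branches)

data = [("T", "T", "T"), ("T", "F", "T"), ("T", "T", "T"),
        ("T", "F", "T"), ("F", "T", "T"), ("F", "F", "F")]
print(build_tree(data, [0, 1]))   # splits on X1 (index 0) first: it has the higher gain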

Proposed Base Case 3:

If all attributes have small information gain, then don't recurse.

• This is not a good idea.

The problem with proposed case 3

y = a XOR b

The information gains: [for the XOR dataset, both IG(a) and IG(b) are zero at the root, so proposed case 3 would stop immediately and just predict the majority class]

If we omit proposed case 3:
The resulting decision tree: [splits on a and then on b, and classifies y = a XOR b perfectly]

Instead, perform pruning after building a tree.

Decision trees will overfit

• Standard decision trees have no learning bias
  – Training set error is always zero!
    • (If there is no label noise)
  – Lots of variance
  – Must introduce some bias towards simpler trees
• Many strategies for picking simpler trees
  – Fixed depth
  – Minimum number of samples per leaf
• Random forests

Real-valued inputs

What should we do if some of the inputs are real-valued?
Infinite number of possible split values!!!

"One branch for each numeric value" idea:
Hopeless: a hypothesis with such a high branching factor will shatter any dataset and overfit.

Threshold splits

• Binary tree: split on attribute X at value t
  – One branch: X < t
  – Other branch: X ≥ t

[Example tree splitting twice on Year along one path: first at 78, then at 70, with good/bad leaves]

• Requires a small change
• Allow repeated splits on the same variable along a path

The set of possible thresholds

• Binary tree, split on attribute X
  – One branch: X < t
  – Other branch: X ≥ t
• Search through possible values of t
  – Seems hard!!! Infinitely many possible split points c to define the node test Xj > c.
• But only a finite number of t's are important:
  – Moving the split point within the empty space between two observed values has no effect on information gain or empirical loss, so just use the midpoint.
  – Sort the data according to X into {x1, …, xm}
  – Consider split points of the form xi + (xi+1 - xi)/2
  – Moreover, only splits between examples from different classes can be optimal for information gain or empirical loss reduction!

[Figures: observed values of Xj on a number line, with candidate midpoints c1 and c2 between neighboring values]
(Figures from Stuart Russell)

Picking the best threshold

• Suppose X is real-valued with threshold t
• Want IG(Y|X:t), the information gain for Y when testing if X is greater than or less than t
• Define:
  – H(Y|X:t) = p(X < t) H(Y|X < t) + p(X >= t) H(Y|X >= t)
  – IG(Y|X:t) = H(Y) - H(Y|X:t)
  – IG*(Y|X) = max_t IG(Y|X:t)

• Use IG*(Y|X) for continuous variables (a sketch of the threshold search follows below)

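A minimal sketch of the threshold search implied by the definitions above: sort by X, consider midpoints between consecutive examples with different labels, and keep the t that maximizes IG(Y|X:t). The names and the toy year/label data are ours (hypothetical), not from the slides:

import math
from collections import Counter

def H(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(xs, ys):
    """Return (t, IG*(Y|X)) maximizing IG(Y|X:t) over candidate midpoint thresholds."""
    pairs = sorted(zip(xs, ys))
    base = H(ys)
    best_t, best_ig = None, -1.0
    for (x1, y1), (x2, y2) in zip(pairs, pairs[1:]):
        if y1 == y2 or x1 == x2:
            continue                              # only class boundaries can be optimal
        t = x1 + (x2 - x1) / 2                    # midpoint split point
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        h_cond = len(left) / len(pairs) * H(left) + len(right) / len(pairs) * H(right)
        ig = base - h_cond                        # IG(Y|X:t) = H(Y) - H(Y|X:t)
        if ig > best_ig:
            best_t, best_ig = t, ig
    return best_t, best_ig

# Hypothetical data: "good" mpg for newer model years.
years = [70, 71, 72, 75, 78, 79, 81, 82]
labels = ["bad", "bad", "bad", "bad", "good", "good", "good", "good"]
print(best_threshold(years, labels))              # -> (76.5, 1.0): a perfect split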


What you need to know about decision trees

• Decision trees are one of the most popular ML tools
  – Easy to understand, implement, and use
  – Computationally cheap (to solve heuristically)
• Information gain to select attributes (ID3, C4.5, …)
• Presented for classification; can be used for regression and density estimation too
• Decision trees will overfit!!!
  – Must use tricks to find "simple trees", e.g.,
    • Fixed depth / early stopping
    • Pruning
  – Or, use ensembles of different trees (random forests)

Ensemble learning

Slides adapted from Navneet Goyal, Tan, Steinbach, Kumar, Vibhav Gogate

Ensemble methods

Machine learning competition with a $1 million prize

Bias/Variance Tradeoff

Hastie, Tibshirani, Friedman, "Elements of Statistical Learning", 2001

Reduce Variance Without Increasing Bias

• Averaging reduces variance (when predictions are independent):

  Var(average of N independent predictions) = Var(single prediction) / N

Average models to reduce model variance.

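A quick simulated illustration of that claim, assuming N independent predictors with equal variance (the numbers below are synthetic, not from the slides):

import random

random.seed(0)
N = 10            # number of independent models being averaged
trials = 20000    # simulated prediction draws

# Each model's prediction = true value (0) + independent noise with variance 1.
single = [random.gauss(0, 1) for _ in range(trials)]
averaged = [sum(random.gauss(0, 1) for _ in range(N)) / N for _ in range(trials)]

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(var(single))     # ~1.0
print(var(averaged))   # ~0.1 = 1/N: variance shrinks by a factor of N, the mean is unchanged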


One problem: there is only one training set. Where do multiple models come from?

Bagging: Bootstrap Aggregation

• Leo Breiman (1994)

• Take repeated bootstrap samples from training set D
• Bootstrap sampling: given a set D containing N training examples, create D' by drawing N examples at random with replacement from D.

• Bagging:
  – Create k bootstrap samples D1 … Dk.
  – Train a distinct classifier on each Di.
  – Classify a new instance by majority vote / average. (A sketch follows below.)

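A minimal Python sketch of this procedure; `train_classifier` is an abstract placeholder for any base learner (for example, the decision tree learner sketched earlier), not a real API:

import random
from collections import Counter

def bootstrap_sample(D):
    """Draw N examples at random with replacement from D (N = len(D))."""
    return [random.choice(D) for _ in range(len(D))]

def bagging_train(D, k, train_classifier):
    """Create k bootstrap samples D1 ... Dk and train a distinct classifier on each."""
    return [train_classifier(bootstrap_sample(D)) for _ in range(k)]

def bagging_predict(classifiers, x):
    """Classify a new instance by majority vote over the trained classifiers."""
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Usage sketch (placeholder names): clfs = bagging_train(data, 25, my_tree_learner)
#                                   y_hat = bagging_predict(clfs, x_new)
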
General Idea

Example of Bagging

• Sampling with replacement from the training data (each example indexed by a data ID)
• Build a classifier on each bootstrap sample
• Each data point has probability (1 - 1/n)^n of being left out of a bootstrap sample, so it can be used as test data
• Training data = 1 - (1 - 1/n)^n of the original data (≈ 63.2% of the distinct examples for large n, since (1 - 1/n)^n → 1/e)

[Base learner in the example: a decision tree learning algorithm, very similar to ID3; in the figures, shades of blue/red indicate the strength of the vote for a particular classification]

Random Forests

• Ensemble method specifically designed for decision tree classifiers

• Introduce two sources of randomness: "bagging" and "random input vectors"
  – Bagging method: each tree is grown using a bootstrap sample of the training data
  – Random vector method: at each node, the best split is chosen from a random sample of m attributes instead of all attributes (see the sketch below)

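A sketch of the per-node "random input vectors" change relative to the greedy learner sketched earlier; `choose_split` and the `info_gain` argument are our own names, illustrating the idea rather than reproducing the algorithm on the following slide:

import random

def choose_split(rows, attrs, m, info_gain):
    """At each node, sample m of the remaining attributes and split on the best of those."""
    candidates = random.sample(attrs, min(m, len(attrs)))
    return max(candidates, key=lambda a: info_gain(rows, a))

# A random forest grows each tree on a bootstrap sample of the training data (bagging)
# and uses choose_split at every node; new instances are classified by majority vote.
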
Random Forests

Random Forests Algorithm
