
Advances in

Decision Tree Construction

Johannes Gehrke
Cornell University
johannes@cs.cornell.edu
http://www.cs.cornell.edu/johannes

Wei-Yin Loh
University of Wisconsin-Madison
loh@stat.wisc.edu
http://www.stat.wisc.edu/~loh

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Tutorial Overview

O Part I: Classification Trees


O Introduction
O Classification tree construction schema
O Split selection
O Pruning
O Data access
O Missing values
O Evaluation
O Bias in split selection
(Short Break)
O Part II: Regression Trees
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Tutorial Overview

O Part I: Classification Trees


O Introduction
O Classification tree construction schema
O Split selection
O Pruning
O Data access
O Missing values
O Evaluation
O Bias in split selection
(Short Break)
O Part II: Regression Trees
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-1
Classification

Goal: Learn a function that assigns a record


to one of several predefined classes.

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Classification Example

O Example training database (Age, Car, Class):
  20 M Yes
  30 M Yes
  25 T No
  30 S Yes
  40 S Yes
  20 T No
  30 M Yes
  25 M Yes
  40 M Yes
  20 S No
O Two predictor attributes: Age and Car-type (Sport, Minivan, and Truck)
O Age is ordered, Car-type is a categorical attribute
O Class label indicates whether the person bought the product
O Dependent attribute is categorical
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Types of Variables

O Numerical: Domain is ordered and can be


represented on the real line (e.g., age, income)
O Nominal or categorical: Domain is a finite set
without any natural ordering (e.g., occupation,
marital status, race)
O Ordinal: Domain is ordered, but absolute
differences between values are unknown (e.g.,
preference scale, severity of an injury)

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-2
Definitions

O Random variables X1, …, Xk (predictor variables)


and Y (dependent variable)
O Xi has domain dom(Xi), Y has domain dom(Y)
O P is a probability distribution on
dom(X1) x … x dom(Xk) x dom(Y)
Training database D is a random sample from P
O A predictor d is a function
d: dom(X1) x … x dom(Xk) → dom(Y)

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Classification Problem

O C is called the class label, d is called a classifier.


O Let r be a record randomly drawn from P.
Define the misclassification rate of d:
RT(d,P) = P(d(r.X1, …, r.Xk) != r.C)

Problem definition: Given dataset D that is a random


sample from probability distribution P, find classifier d
such that RT(d,P) is minimized.
(More on regression problems in the second part of the
tutorial.)

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Goals and Requirements

Goals:
O To produce an accurate classifier/regression function
O To understand the structure of the problem
Requirements on the model:
O High accuracy
O Understandable by humans, interpretable
O Fast construction for very large training databases

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-3
What are Decision Trees?

[Figure: a decision tree that first splits on Age (<30 vs. >=30); the >=30 branch is a YES leaf, while the <30 branch splits on Car Type, with Minivan leading to YES and Sports/Truck leading to NO. The accompanying plot shows the corresponding rectangular partition of the (Age, Car Type) space.]
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Decision Trees

O A decision tree T encodes d (a classifier or


regression function) in the form of a tree.
O A node t in T without children is called a leaf
node. Otherwise t is called an internal node.
O Each internal node has an associated splitting
predicate. Most common are binary predicates.
Example splitting predicates:
O Age <= 20
O Profession in {student, teacher}
O 5000*Age + 3*Salary – 10000 > 0

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Internal and Leaf Nodes

Internal nodes:
O Binary Univariate splits:
O Numerical or ordered X: X <= c, c in dom(X)
O Categorical X: X in A, A subset dom(X)
O Binary Multivariate splits:
O Linear combination split on numerical variables:
Σ aiXi <= c
O k-ary (k>2) splits analogous
Leaf nodes:
O Node t is labeled with one class label c in dom(C)

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-4
Example

Encoded classifier:
If (age < 30 and carType = Minivan) Then YES
If (age < 30 and (carType = Sports or carType = Truck)) Then NO
If (age >= 30) Then YES

[Tree: the root splits on Age (<30 / >=30); the >=30 branch is a YES leaf; the <30 branch splits on Car Type into Minivan (YES) and Sports, Truck (NO).]

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Evaluation of Misclassification Error

Problem:
O In order to quantify the quality of a classifier d, we need
to know its misclassification rate RT(d,P).
O But unless we know P, RT(d,P) is unknown.
O Thus we need to estimate RT(d,P) as well as possible.

Approaches:
O Resubstitution estimate
O Test sample estimate
O V-fold Cross Validation

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Resubstitution Estimate

The Resubstitution estimate R(d,D) estimates


RT(d,P) of a classifier d using D:
O Let D be the training database with N records.
O R(d,D) = 1/N Σ I(d(r.X1, …, r.Xk) != r.C)
O Intuition: R(d,D) is the proportion of training
records that are misclassified by d
O Problem with resubstitution estimate:
Overly optimistic; classifiers that overfit the
training dataset will have very low resubstitution
error.
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh
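A minimal sketch of the resubstitution estimate, assuming records are given as (attributes, class) pairs and the classifier as a Python callable; the names resubstitution_error, D, and d are illustrative, not from the tutorial.

def resubstitution_error(classifier, records):
    """R(d, D): fraction of records in D misclassified by d."""
    errors = sum(1 for x, c in records if classifier(x) != c)
    return errors / len(records)

# Toy usage with the Age/Car-type example: predict Yes unless the car type is a Truck.
D = [((20, "M"), "Yes"), ((25, "T"), "No"), ((30, "S"), "Yes"), ((20, "S"), "No")]
d = lambda x: "No" if x[1] == "T" else "Yes"
print(resubstitution_error(d, D))  # 0.25: one of the four records is misclassified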

T3-5
Test Sample Estimate

O Divide D into D1 and D2


O Use D1 to construct the classifier d
O Then use resubstitution estimate R(d,D2)
to calculate the estimated misclassification
error of d
O Unbiased and efficient, but removes D2
from training dataset D

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

V-fold Cross Validation

Procedure:
O Construct classifier d from D
O Partition D into V datasets D1, …, DV
O Construct classifier di using D \ Di
O Calculate the estimated misclassification error
R(di,Di) of di using test sample Di
Final misclassification estimate:
O Weighted combination of individual
misclassification errors:
R(d,D) = 1/V Σ R(di,Di)
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh
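A sketch of the V-fold estimate described above, assuming a user-supplied train function that maps a list of (x, c) records to a classifier and that |D| >= V; cross_validation_error is an illustrative name.

import random

def cross_validation_error(train, records, V=10, seed=0):
    """Estimate R(d, P) by V-fold cross-validation: 1/V * sum_i R(d_i, D_i)."""
    recs = list(records)
    random.Random(seed).shuffle(recs)
    folds = [recs[i::V] for i in range(V)]            # partition D into D_1, ..., D_V
    total = 0.0
    for i in range(V):
        held_out = folds[i]
        training = [r for j, f in enumerate(folds) if j != i for r in f]
        d_i = train(training)                          # classifier built on D \ D_i
        errs = sum(1 for x, c in held_out if d_i(x) != c)
        total += errs / len(held_out)                  # R(d_i, D_i): test sample estimate
    return total / V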

Cross-Validation: Example

[Figure: the training database is partitioned into V folds; classifier d1 is built on all folds except the first and tested on the first fold, d2 on all folds except the second, d3 on all folds except the third, and so on.]
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-6
Cross-Validation

O Misclassification estimate obtained through


cross-validation is usually nearly unbiased
O Costly computation (we need to compute d, and
d1, …, dV); computation of di is nearly as
expensive as computation of d
O Preferred method to estimate quality of learning
algorithms in the machine learning literature

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Tutorial Overview

O Part I: Classification Trees


O Introduction
O Classification tree construction schema
O Split selection
O Pruning
O Data access
O Missing values
O Evaluation
O Bias in split selection
(Short Break)
O Part II: Regression Trees
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Decision Tree Construction


O Top-down tree construction schema:
O Examine training database and find best splitting
predicate for the root node
O Partition training database
O Recurse on each child node
BuildTree(Node t, Training database D, Split Selection Method S)
(1) Apply S to D to find splitting criterion
(2) if (t is not a leaf node)
(3) Create children nodes of t
(4) Partition D into children partitions
(5) Recurse on each partition
(6) endif
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-7
Decision Tree Construction (Contd.)

O Three algorithmic components:


O Split selection (CART, C4.5, QUEST, CHAID,
CRUISE, …)
O Pruning (direct stopping rule, test dataset
pruning, cost-complexity pruning, statistical
tests, bootstrapping)
O Data access (CLOUDS, SLIQ, SPRINT,
RainForest, BOAT, UnPivot operator)

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Split Selection Methods

O Multitude of split selection methods in the


literature
O In this tutorial:
O Impurity-basedsplit selection: CART (most
common in today’s data mining tools)
O Model-based split selection: QUEST

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Split Selection Methods: CART

O Classification And Regression Trees


(Breiman, Friedman, Olshen, Stone, 1984;
considered “the” reference on decision tree
construction)
O Commercial version sold by Salford Systems
(www.salford-systems.com)
O Many other, slightly modified implementations
exist (e.g., IBM Intelligent Miner implements the
CART split selection method)

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-8
CART Split Selection Method

Motivation: We need a way to choose


quantitatively between different splitting
predicates
O Idea: Quantify the impurity of a node
O Method: Select splitting predicate that
generates children nodes with minimum
impurity from a space of possible splitting
predicates

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Intuition: Impurity Function


Training data (X1, X2, Class):
1 1 Yes
1 2 Yes
1 2 Yes
1 2 Yes
1 2 Yes
1 1 No
2 1 No
2 1 No
2 2 No
2 2 No

Split X1 <= 1: root (50%,50%) → left child "Yes" (83%,17%), right child "No" (0%,100%)
Split X2 <= 1: root (50%,50%) → left child "No" (25%,75%), right child "Yes" (66%,33%)
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Impurity Function

Let p(j|t) be the proportion of class j training records at


node t. Then the node impurity measure at node t:
i(t) = phi(p(1|t), …, p(J|t))

Properties:
O phi is symmetric
O Maximum value at the uniform argument (1/J, …, 1/J)
O phi(1,0,…,0) = … = phi(0,…,0,1) = 0

The reduction in impurity through splitting predicate s (t splits into
children nodes tL and tR with impurities phi(tL) and phi(tR), receiving
fractions pL and pR of t's records) is:
∆phi(s,t) = phi(t) – pL phi(tL) – pR phi(tR)
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh
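A small sketch of the impurity-reduction computation, assuming class counts at the parent and its two children are known; impurity_reduction, node_proportions, and phi_err are illustrative names, and the misclassification-rate impurity used in the usage line is just one possible phi.

def node_proportions(counts):
    n = sum(counts)
    return [c / n for c in counts]

def impurity_reduction(phi, parent_counts, left_counts, right_counts):
    """Delta phi(s, t) = phi(t) - pL * phi(tL) - pR * phi(tR)."""
    n = sum(parent_counts)
    p_left = sum(left_counts) / n
    p_right = sum(right_counts) / n
    return (phi(node_proportions(parent_counts))
            - p_left * phi(node_proportions(left_counts))
            - p_right * phi(node_proportions(right_counts)))

# The (50%,50%) -> (83%,17%), (0%,100%) split above, with misclassification rate as phi:
phi_err = lambda p: 1.0 - max(p)
print(impurity_reduction(phi_err, [5, 5], [5, 1], [0, 4]))  # 0.5 - 0.6*(1/6) - 0.4*0 = 0.4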

T3-9
Example

Split X1 <= 1 (from the previous example):
Root node t: p(1|t) = 0.5; p(2|t) = 0.5
Left child node tL: p(1|tL) = 0.83; p(2|tL) = 0.17
Right child node tR: p(1|tR) = 0.0; p(2|tR) = 1.0
Impurity of root node: phi(0.5, 0.5)
Impurity of left child node: phi(0.83, 0.17)
Impurity of right child node: phi(0.0, 1.0)
Impurity of the split tree: 0.6*phi(0.83, 0.17) + 0.4*phi(0, 1)
Impurity reduction:
∆phi = phi(0.5, 0.5) - 0.6*phi(0.83, 0.17) - 0.4*phi(0, 1)

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Error Reduction as Impurity Function


O Possible impurity function: the resubstitution error R(T,D)
O Example (for the two candidate splits above):
R(no tree, D) = 0.5
T1 (X1 <= 1, leaves labeled Yes / No): R(T1,D) = 0.6*0.17 = 0.1
T2 (X2 <= 1, leaves labeled No / Yes): R(T2,D) = 0.4*0.25 + 0.6*0.33 ≈ 0.3
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Problems with Resubstitution Error

O Obvious problem: there are situations where no split can decrease impurity
O Example: a node with class distribution (80%,20%) is split by X3 <= 1 into
children with distributions (83%,17%) (6 records) and (75%,25%) (4 records),
both labeled Yes:
R(no tree, D) = 0.2
R(T1,D) = 0.6*0.17 + 0.4*0.25 = 0.2
O More subtle problems exist

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-10
Remedy: Concavity

Concave Impurity Functions


Use impurity functions that are strictly concave: phi'' < 0
Example concave impurity functions:
O Entropy: phi(t) = - Σ p(j|t) log(p(j|t))
O Gini index: phi(t) = 1 - Σ p(j|t)^2

Nonnegative Decrease in Impurity

Theorem: Let phi(p1, …, pJ) be a strictly concave function on the simplex
{(p1, …, pJ): pj >= 0, Σj pj = 1}.
Then for any split s: ∆phi(s,t) >= 0,
with equality if and only if p(j|tL) = p(j|tR) = p(j|t) for j = 1, …, J.

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh
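The two concave impurity functions above, written out as code; they plug directly into the impurity_reduction sketch shown earlier (illustrative names, not part of the tutorial).

import math

def entropy(p):
    """Entropy impurity: -sum_j p_j log p_j (0 log 0 taken as 0)."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0)

def gini(p):
    """Gini impurity: 1 - sum_j p_j^2."""
    return 1.0 - sum(pj * pj for pj in p)

# Both are strictly concave, so any split yields a nonnegative impurity decrease:
print(gini([0.5, 0.5]), gini([0.83, 0.17]), gini([0.0, 1.0]))  # 0.5, ~0.28, 0.0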

CART Univariate Split Selection

O Use gini-index as impurity function


O For each numerical or ordered attribute X,
consider all binary splits s of the form
X <= x
where x in dom(X)
O For each categorical attribute X, consider all
binary splits s of the form
X in A, where A subset dom(X)
O At a node t, select split s* such that
∆phi(s*,t) is maximal over all s considered
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh
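A sketch of the exhaustive search over "X <= x" split points for one numerical attribute, in the spirit of the CART procedure above; best_numeric_split is an illustrative name, and phi would be the Gini function from the previous sketch.

from collections import Counter

def best_numeric_split(records, phi):
    """Exhaustive search for the best split X <= x on a numerical attribute.

    records: list of (x_value, class_label) pairs at the current node.
    Returns (best_threshold, best_impurity_reduction).
    """
    records = sorted(records)                       # sort once by attribute value
    labels = sorted({c for _, c in records})
    total = Counter(c for _, c in records)
    left = Counter()
    n = len(records)

    def counts(cnt):
        return [cnt[c] for c in labels]

    parent_imp = phi([v / n for v in counts(total)])
    best_x, best_gain = None, -1.0
    for i, (x, c) in enumerate(records[:-1]):       # candidate thresholds between records
        left[c] += 1
        if x == records[i + 1][0]:
            continue                                # can only split between distinct values
        nl = i + 1
        right = total - left
        gain = (parent_imp
                - nl / n * phi([v / nl for v in counts(left)])
                - (n - nl) / n * phi([v / (n - nl) for v in counts(right)]))
        if gain > best_gain:
            best_x, best_gain = x, gain
    return best_x, best_gain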

CART: Shortcut for Categorical Splits

Computational shortcut if |dom(Y)| = 2.

O Theorem: Let X be a categorical attribute with
dom(X) = {b1, …, bk}, |dom(Y)| = 2, phi a concave
function, and order the categories so that
p(Y=1|X=b1) <= … <= p(Y=1|X=bk).
Then the best split is of the form
X in {b1, b2, …, bl} for some l < k.
O Benefit: We only need to check k-1 subsets of
dom(X) instead of 2^(k-1) - 1 subsets

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh
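A sketch of the two-class shortcut: sort the categories by the within-category proportion of class 1 and evaluate only the k-1 prefix subsets. The function name and the assumption that class labels are encoded as 0/1 are illustrative.

from collections import defaultdict

def best_categorical_split_two_classes(records, phi):
    """records: list of (category, class_label) pairs with class_label in {0, 1}."""
    counts = defaultdict(lambda: [0, 0])            # category -> [count class 0, count class 1]
    for b, c in records:
        counts[b][c] += 1
    cats = sorted(counts, key=lambda b: counts[b][1] / sum(counts[b]))  # order by p(class 1 | b)
    n = len(records)
    total = [sum(counts[b][j] for b in cats) for j in (0, 1)]
    parent = phi([t / n for t in total])
    left = [0, 0]
    best_subset, best_gain = None, -1.0
    for l in range(len(cats) - 1):                  # only the k-1 prefix subsets {b1, ..., bl}
        for j in (0, 1):
            left[j] += counts[cats[l]][j]
        right = [total[j] - left[j] for j in (0, 1)]
        nl, nr = sum(left), sum(right)
        gain = (parent - nl / n * phi([v / nl for v in left])
                       - nr / n * phi([v / nr for v in right]))
        if gain > best_gain:
            best_subset, best_gain = set(cats[:l + 1]), gain
    return best_subset, best_gain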

T3-11
Problems with CART Split Selection

O Biased towards variables with more splits
(an M-category variable has 2^(M-1) - 1 possible
splits, an M-valued ordered variable has
M-1 possible splits)
(Explanation and remedy later)
O Computationally expensive for categorical
variables with large domains

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

QUEST: Model-based split selection

“The purpose of models is not to fit the data


but to sharpen the questions.”
Karlin, Samuel (1923 - )
(11th R. A. Fisher Memorial Lecture, Royal Society, 20 April 1983.)

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Split Selection Methods: QUEST

O Quick, Unbiased, Efficient, Statistical Tree


(Loh and Shih, Statistica Sinica, 1997)
Freeware, available at www.stat.wisc.edu/~loh
Also implemented in SPSS.

O Main new ideas:


O Separate splitting predicate selection into variable
selection and split point selection
O Use statistical significance tests instead of impurity
function

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-12
QUEST Variable Selection

Let X1, …, Xl be numerical predictor variables, and


let Xl+1, …, Xk be categorical predictor
variables.
1. Find p-value from ANOVA F-test for each numerical
variable.
2. Find the p-value of a chi-square (χ²) test for each categorical
variable.
3. Choose variable Xk’ with overall smallest p-value pk’
(Actual algorithm is more complicated.)

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh
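A simplified sketch of the variable-selection step, using SciPy's ANOVA F-test and chi-square test of independence to obtain the p-values; quest_select_variable and the dict-based inputs are illustrative, and the real QUEST algorithm additionally applies Bonferroni corrections and a second-stage Levene test.

from scipy.stats import f_oneway, chi2_contingency

def quest_select_variable(X_num, X_cat, y):
    """X_num/X_cat: dicts mapping variable name -> list of values; y: class labels.
    Returns the name of the variable with the smallest p-value."""
    pvals = {}
    classes = sorted(set(y))
    for name, xs in X_num.items():                   # ANOVA F-test per numerical variable
        groups = [[x for x, c in zip(xs, y) if c == cls] for cls in classes]
        _, p = f_oneway(*groups)
        pvals[name] = p
    for name, xs in X_cat.items():                   # chi-square test per categorical variable
        cats = sorted(set(xs))
        table = [[sum(1 for x, c in zip(xs, y) if x == cat and c == cls)
                  for cls in classes] for cat in cats]
        _, p, _, _ = chi2_contingency(table)
        pvals[name] = p
    return min(pvals, key=pvals.get)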

QUEST Split Point Selection

CRIMCOORD transformation of categorical


variables into numerical variables:
1. Take categorical variable X with domain
dom(X)={x1, …, xl}
2. For each record in the training database, create
vector (v1, …, vl) where vi = I(X=xi)
3. Find principal components of set of vectors V
4. Project the dimensionality-reduced data onto the
largest discriminant coordinate dxi
5. Replace X with the numerical value dxi in the rest of the
algorithm

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

CRIMCOORDs: Examples

O Values(X|Y=1) = {4c1, c2, 5c3}, values(X|Y=2) = {2c1, 2c2, 6c3}
  ⇒ dx1 = 1, dx2 = -1, dx3 = -0.3
O Values(X|Y=1) = {5c1, 5c3}, values(X|Y=2) = {5c1, 5c3}
  ⇒ dx1 = 1, dx2 = 0, dx3 = 1
O Values(X|Y=1) = {5c1, 5c3}, values(X|Y=2) = {5c1, c2, 5c3}
  ⇒ dx1 = 1, dx2 = -1, dx3 = 1

Advantages
O Avoid exponential subset search from CART
O Each dxi has the form Σ bi I(X=xi) for some b1, …, bl,
thus there is a 1-1 correspondence between subsets of
X and a dxi

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-13
QUEST Split Point Selection

O Assume X is the selected variable (either numerical, or categorical


transformed to CRIMCOORDS)
O Group J>2 classes into two superclasses
O Now problem is reduced to one-dimensional two-class problem
O Use exhaustive search for the best split point (like in CART)
O Use quadratic discriminant analysis (QDA, next few bullets)
QUEST Split Point Selection: QDA
O Let x̄1, x̄2 and s1^2, s2^2 be the sample means and variances of X for the two
superclasses
O Make a normal distribution assumption, and find the intersections of the
two normal densities N(x̄1, s1^2) and N(x̄2, s2^2)
O QDA splits the X-axis into three intervals
O Select as split point the intersection (root) that is closer to the sample means

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh
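A sketch of the QDA split-point computation: equate the two normal densities and solve the resulting quadratic, then keep the root nearest the two means (one reading of the rule above; qda_split_point is an illustrative name, and the real algorithm handles degenerate cases). Note that N(2, 2.25) denotes variance 2.25, i.e. standard deviation 1.5.

import math

def qda_split_point(m1, s1, m2, s2):
    """Intersection of the class densities N(m1, s1^2) and N(m2, s2^2)
    that lies closest to the midpoint of the two means."""
    a = 1.0 / s1**2 - 1.0 / s2**2
    b = -2.0 * (m1 / s1**2 - m2 / s2**2)
    c = m1**2 / s1**2 - m2**2 / s2**2 + 2.0 * math.log(s1 / s2)
    if abs(a) < 1e-12:                              # equal variances: single intersection
        return -c / b
    disc = math.sqrt(b * b - 4 * a * c)             # two real roots for unequal variances
    roots = [(-b - disc) / (2 * a), (-b + disc) / (2 * a)]
    mid = 0.5 * (m1 + m2)
    return min(roots, key=lambda x: abs(x - mid))   # root closest to the sample means

print(round(qda_split_point(0.0, 1.0, 2.0, 1.5), 2))  # ~1.09 for the N(0,1) vs N(2,2.25) example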

Illustration: QDA Splits


[Figure: the two class densities N(0,1) and N(2, 2.25), plotted for x from -1.5 to 3.5; the QDA split point is the intersection of the two density curves that lies between the two means.]
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

QUEST Linear Combination Splits

O Transform all categorical variables to


CRIMCOORDS
O Apply PCA to the correlation matrix of the data
O Drop the smallest principal components, and
project the remaining components onto the
largest CRIMCOORD
O Group J>2 classes into two superclasses
O Find split on largest CRIMCOORD using ES or
QDA

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-14
Key Differences CART/QUEST

Feature                                     | QUEST             | CART
Variable selection                          | Statistical tests | ES (exhaustive search)
Split point selection                       | QDA or ES         | ES
Categorical variables                       | CRIMCOORDS        | ES
Monotone transformations of numerical vars  | Not invariant     | Invariant
Ordinal variables                           | No                | Yes
Variable selection bias                     | No                | Yes (No)

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Tutorial Overview

O Part I: Classification Trees


O Introduction
O Classification tree construction schema
O Split selection
O Pruning
O Data access
O Missing values
O Evaluation
O Bias in split selection
(Short Break)
O Part II: Regression Trees
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Pruning Methods

O Test dataset pruning


O Direct stopping rule
O Cost-complexity pruning (not covered)
O MDL pruning
O Pruning by randomization testing

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-15
Stopping Policies

A stopping policy indicates when further growth of


the tree at a node t is counterproductive.
O All records are of the same class
O The attribute values of all records are identical
O All records have missing values
O At most one class has a number of records
larger than a user-specified number
O All records go to the same child node if t is split
(only possible with some split selection
methods)
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Test Dataset Pruning

O Use an independent test sample D’ to


estimate the misclassification cost using
the resubstitution estimate R(T,D’) at
each node
O Select the subtree T’ of T with the
smallest expected cost

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Reduced Error Pruning

(Quinlan, C4.5, 1993)


O Assume observed misclassification rate at
a node is p
O Replace p (pessimistically) with the upper
75% confidence bound p’, assuming a
binomial distribution
O Then use p’ to estimate error rate of the
node

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-16
Pruning Using the MDL Principle

(Mehta, Rissanen, Agrawal, KDD 1996)


Also used before by Fayyad, Quinlan, and others.

O MDL: Minimum Description Length Principle


O Idea: Think of the decision tree as encoding the
class labels of the records in the training
database
O MDL Principle: The best tree is the tree that
encodes the records using the fewest bits

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

How To Encode a Node

Given a node t, we need to encode the following:


O Node type: One bit to encode the type of each node
(leaf or internal node)
For an internal node:
O Cost(P(t)): The cost of encoding the splitting
predicate P(t) at node t
For a leaf node:
O n*E(t): The cost of encoding the records in leaf node
t with n records from the training database (E(t) is
the entropy of t)

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

How To Encode a Tree

Recursive definition of the minimal cost of a node:


O Node t is a leaf node:
cost(t)= n*E(t)
O Node t is an internal node with children nodes t1
and t2. Choice: Either make t a leaf node, or
take the best subtrees, whatever is cheaper:

cost(t) =
min( n*E(t), 1+cost(P(t))+cost(t1)+cost(t2))

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-17
How to Prune

1. Construct decision tree to its maximum


size
2. Compute the MDL cost for each node of
the tree bottom-up
3. Prune the tree bottom-up:
If cost(t)=n*E(t), make t a leaf node.
Resulting tree is the final tree output by
the pruning algorithm.

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh
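A sketch of steps 2 and 3 above, assuming a fully grown binary tree whose nodes carry their class counts and the encoding cost of their splitting predicate; Node, leaf_cost, and mdl_prune are illustrative names and follow the cost recursion from the previous slide.

import math

class Node:
    def __init__(self, counts, split_cost=0.0, left=None, right=None):
        self.counts = counts          # class label -> number of training records at this node
        self.split_cost = split_cost  # cost(P(t)): bits to encode the splitting predicate
        self.left, self.right = left, right

def leaf_cost(node):
    """n * E(t): bits to encode the class labels of the records at t."""
    n = sum(node.counts.values())
    ent = -sum(c / n * math.log2(c / n) for c in node.counts.values() if c > 0)
    return n * ent

def mdl_prune(node):
    """Bottom-up MDL pruning: returns the minimal encoding cost of the subtree
    rooted at node, turning node into a leaf whenever that is no more expensive."""
    as_leaf = leaf_cost(node)
    if node.left is None:                            # already a leaf
        return as_leaf
    as_subtree = 1 + node.split_cost + mdl_prune(node.left) + mdl_prune(node.right)
    if as_leaf <= as_subtree:                        # cheaper to encode as a leaf: prune
        node.left = node.right = None
        return as_leaf
    return as_subtree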

Performance Improvements: PUBLIC

(Shim and Rastogi, VLDB 1998)


O MDL bottom-up pruning requires construction of
a complete tree before the bottom-up pruning
can start
O Idea: Prune the tree during (not after) the tree
construction phase
O Why is this possible?
O Calculate a lower bound on cost(t) and compare it
with n*E(t)

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

PUBLIC Lower Bound Theorem

O Theorem: Consider a classification problem with


k predictor attributes and J classes. Let Tt be a
subtree with s internal nodes, rooted at node t,
let ni be the number of records with class label i.
Then
cost(Tt) >= 2*s + 1 + s*log k + Σ ni
O The lower bound on cost(Tt) is thus the minimum of:
O n*E(t) + 1 (t becomes a leaf node)
O 2*s + 1 + s*log k + Σ ni (the subtree at t remains)
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-18
Large Datasets Lead to Large Trees

O Oates and Jensen (KDD 1998)


O Problem: Constant probability distribution P,
datasets D1, D2, …, Dk with
|D1| < |D2| < … < |Dk|
|Dk| = c |Dk-1| = … = c^(k-1) |D1|
O Observation: The trees grow accordingly
|T1| < |T2| < … < |Tk|
|Tk| = c' |Tk-1| = … = c'^(k-1) |T1|
O But: No gain in accuracy due to larger trees
R(T1,D1) ~ R(T2,D2) ~ … ~ R(Tk, Dk)
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Pruning By Randomization Testing

O Reduce pruning decision at each node to a hypothesis test


O Generate empirical distribution of the hypothesis under the null
hypothesis for a node

Node n with subtree T(n) and pruning statistic S(n)


For (i=0; i<K; i++)
1. Randomize class labels of the data at n
2. Build and prune a tree rooted at n
3. Calculate pruning statistic Si(n)
Compare S(n) to empirical distribution of Si(n) to estimate significance
of S(n)
If S(n) is not significant enough compared to a significance level alpha,
then prune T(n) to n

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh
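A compact sketch of the pruning decision above, assuming user-supplied build_and_prune and statistic callables (both illustrative names); it generates the empirical null distribution by shuffling the class labels K times.

import random

def randomization_prune_decision(data, labels, build_and_prune, statistic,
                                 K=100, alpha=0.05, seed=0):
    """Return True if the subtree at this node should be pruned to a leaf."""
    rng = random.Random(seed)
    observed = statistic(build_and_prune(data, labels))     # S(n) on the real labels
    null_stats = []
    for _ in range(K):
        shuffled = labels[:]
        rng.shuffle(shuffled)                                # break label/attribute association
        null_stats.append(statistic(build_and_prune(data, shuffled)))
    # estimated significance: fraction of null statistics at least as extreme as S(n)
    p = sum(1 for s in null_stats if s >= observed) / K
    return p > alpha                                         # not significant -> prune T(n) to n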

Tutorial Overview

O Part I: Classification Trees


O Introduction
O Classification tree construction schema
O Split selection
O Pruning
O Data access
O Missing values
O Evaluation
O Bias in split selection
(Short Break)
O Part II: Regression Trees
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-19
SLIQ

Mehta, Agrawal, Rissanen (EDBT 1996)


O Motivation:
O Scalable data access method for CART
O To find the best split we need to evaluate the impurity function
at all possible split points for each numerical attribute, at each
node of the tree
O Idea: Avoid re-sorting at each node of the tree through pre-
sorting and maintenance of sort orders
O Ideas:
O Uses vertical partitioning to avoid re-sorting
O Main-memory resident data structure with schema (class label,
leaf node index)
Very likely to fit in-memory for nearly all training databases
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

SLIQ: Pre-Sorting
(Left: training data (Age, Car, Class); middle: Age attribute list (Age, record index), pre-sorted on Age; right: class list (record index, Class, Leaf).)
Age Car Class | Age Ind | Ind Class Leaf
20 M Yes | 20 1 | 1 Yes 1
30 M Yes | 20 6 | 2 Yes 1
25 T No | 20 10 | 3 No 1
30 S Yes | 25 3 | 4 Yes 1
40 S Yes | 25 8 | 5 Yes 1
20 T No | 30 2 | 6 No 1
30 M Yes | 30 4 | 7 Yes 1
25 M Yes | 30 7 | 8 Yes 1
40 M Yes | 40 5 | 9 Yes 1
20 S No | 40 9 | 10 No 1
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

SLIQ: Evaluation of Splits


Age Ind | Ind Class Leaf
20 1 | 1 Yes 2
20 6 | 2 Yes 2
20 10 | 3 No 2
25 3 | 4 Yes 3
25 8 | 5 Yes 3
30 2 | 6 No 2
30 4 | 7 Yes 2
30 7 | 8 Yes 2
40 5 | 9 Yes 2
40 9 | 10 No 3

Class histograms maintained while scanning the Age list:
Node 2: Left Yes=2 No=0; Right Yes=3 No=2
Node 3: Left Yes=0 No=1; Right Yes=2 No=0
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-20
SLIQ: Splitting of a Node
Age Ind | Ind Class Leaf
20 1 | 1 Yes 4
20 6 | 2 Yes 5
20 10 | 3 No 5
25 3 | 4 Yes 7
25 8 | 5 Yes 7
30 2 | 6 No 4
30 4 | 7 Yes 7
30 7 | 8 Yes 7
40 5 | 9 Yes 7
40 9 | 10 No 6

[Tree: node 1 splits into nodes 2 and 3, which split into leaves 4, 5 and 6, 7; splitting a node only requires updating the Leaf column of the class list.]
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

SLIQ: Summary

O Uses vertical partitioning to avoid re-


sorting
O Main-memory resident data structure with
schema (class label, leaf node index)
Very likely to fit in-memory for nearly all
training databases

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

SPRINT

Shafer, Agrawal, Mehta (VLDB 1996)


O Motivation:
O Scalable data access method for CART
O Improvement over SLIQ to avoid main-memory data structure
O Ideas:
O Create vertical partitions called attribute lists for each attribute
O Pre-sort the attribute lists

Recursive tree construction:


1. Scan all attribute lists at node t to find the best split
2. Partition current attribute lists over children nodes while
maintaining sort orders
3. Recurse
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-21
SPRINT Attribute Lists
(Left: training data; middle: Age attribute list (Age, Class, record index), pre-sorted on Age; right: Car attribute list (Car, Class, record index).)
Age Car Class | Age Class Ind | Car Class Ind
20 M Yes 20 Yes 1 M Yes 1
30 M Yes 20 No 6 M Yes 2
25 T No 20 No 10 T No 3
30 S Yes 25 No 3 S Yes 4
40 S Yes 25 Yes 8 S Yes 5
20 T No 30 Yes 2 T No 6
30 M Yes 30 Yes 4 M Yes 7
25 M Yes 30 Yes 7 M Yes 8
40 M Yes 40 Yes 5 M Yes 9
20 S No 40 Yes 9 S No 10
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

SPRINT: Evaluation of Splits


Age Class Ind
20 Yes 1
Node1 Yes No
20 No 6
20 No 10 Left 1 2
25 No 3 Right 6 1
25 Yes 8
30 Yes 2
30 Yes 4
30 Yes 7
40 Yes 5
40 Yes 9

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

SPRINT: Splitting of a Node

1. Scan all attribute lists to find the best


split
2. Partition the attribute list of the splitting
attribute X
3. For each attribute Xi != X
Perform the partitioning step of a hash-join
between the attribute list of X and the
attribute list of Xi

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh
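A simplified, in-memory sketch of the partitioning step above; partition_attribute_lists is an illustrative name, and in SPRINT itself the lists live on disk and the hash table is built only over the record indices of the splitting attribute's partition.

def partition_attribute_lists(lists, split_attr, predicate):
    """lists: dict attribute -> list of (value, class_label, record_index),
    each pre-sorted by value. Partition the splitting attribute's list by the
    predicate, then partition every other list by probing the set of record
    indices that go left (the partitioning step of a hash join). Sort orders
    are preserved because each list is scanned in order."""
    left_ids = {rid for v, _, rid in lists[split_attr] if predicate(v)}
    left, right = {}, {}
    for attr, alist in lists.items():
        left[attr] = [e for e in alist if e[2] in left_ids]
        right[attr] = [e for e in alist if e[2] not in left_ids]
    return left, right

# e.g. left, right = partition_attribute_lists(attribute_lists, "Age", lambda v: v <= 25)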

T3-22
SPRINT: Hash-Join Partitioning
[Figure: the Age attribute list is partitioned according to the winning split; the record indices (Ind) of the entries assigned to the right child are inserted into a hash table, and the Car attribute list is then partitioned by probing this hash table with each entry's record index, so that corresponding entries of all attribute lists land in the same child while sort orders are preserved.]
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

SPRINT: Summary

O Scalable data access method for CART


split selection method
O Completely scalable, can be (and has
been) implemented “inside” a database
system
O Hash-join partitioning step expensive
(each attribute, at each node of the tree)

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

RainForest
(Gehrke, Ramakrishnan, Ganti, VLDB 1998)

Training Database AVC-Sets


Age Car Class | Age Yes No
20 M Yes 20 1 2
30 M Yes 25 1 1
25 T No 30 3 0
30 S Yes 40 2 0
40 S Yes
20 T No Car Yes No
30 M Yes Sport 2 1
25 M Yes Truck 0 2
40 M Yes Minivan 5 0
20 S No
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh
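A sketch of AVC-set construction in one scan over a node's data partition; build_avc_group and the dict-per-record representation are illustrative. The point is that the result's size depends on the number of distinct attribute values and classes, not on the number of records.

from collections import defaultdict

def build_avc_group(records, predictors, class_attr="Class"):
    """AVC-group of a node: for every predictor attribute, the class-label
    counts per attribute value (Attribute-Value-Classlabel)."""
    avc = {x: defaultdict(lambda: defaultdict(int)) for x in predictors}
    for r in records:                                # one scan over the data partition
        for x in predictors:
            avc[x][r[x]][r[class_attr]] += 1
    return avc

D = [{"Age": 20, "Car": "M", "Class": "Yes"}, {"Age": 25, "Car": "T", "Class": "No"},
     {"Age": 30, "Car": "S", "Class": "Yes"}, {"Age": 20, "Car": "S", "Class": "No"}]
avc = build_avc_group(D, ["Age", "Car"])
print({v: dict(c) for v, c in avc["Car"].items()})
# {'M': {'Yes': 1}, 'T': {'No': 1}, 'S': {'Yes': 1, 'No': 1}}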

T3-23
Refined RainForest Top-Down Schema
BuildTree(Node n, Training database D,
Split Selection Method S)

[ (1) Apply S to D to find splitting criterion ]


(1a) for each predictor attribute X
(1b) Call S.findSplit(AVC-set of X)
(1c) endfor
(1d) S.chooseBest();
(2) if (n is not a leaf node) ...

S: C4.5, CART, CHAID, FACT, ID3, GID3, QUEST, etc.

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

RainForest Data Access Method

Assume the data partition at a node is D. Then the


following steps are carried out:

1. Construct AVC-group of the node


2. Choose splitting attribute and splitting predicate
3. Partition D across the children

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

RainForest Algorithms: RF-Write

First scan:

Database
AVC-Sets

Main Memory
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-24
RainForest Algorithms: RF-Write

Second Scan:

Database
Age<30?

Main Memory Partition 1 Partition 2


KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

RainForest Algorithms: RF-Write

Analysis:
O Assumes that the AVC-group of the root node fits into main
memory
O Two database scans per level of the tree
O Usually more main memory available than one single AVC-
group needs

[Figure: the two scans per level: the AVC-group is built in main memory from the database, then the chosen split (Age<30?) partitions the database into Partition 1 and Partition 2.]
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

RainForest Algorithms: RF-Read

First scan:

Database
AVC-Sets

Main Memory
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-25
RainForest Algorithms: RF-Read

Second Scan:

Age<30

Database

AVC-Sets

Main Memory
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

RainForest Algorithms: RF-Read

Third Scan:

Age<30

Sal<20k Car==S

Database

Main Memory
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

RainForest Algorithms: RF-Hybrid

First scan:

Database
AVC-Sets

Main Memory
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-26
RainForest Algorithms: RF-Hybrid

Second Scan:

Age<30

Database

AVC-Sets

Main Memory
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

RainForest Algorithms: RF-Hybrid

Third Scan:

Age<30 Database
Sal<20k Car==S

Main Memory Partition 1 Partition 2 Partition 3 Partition 4

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

RainForest Algorithms: RF-Hybrid

Further optimization: While writing partitions,


concurrently build AVC-groups of as many nodes
as possible in-memory

Age<30
Database
Sal<20k Car==S

Main Memory Partition 1 Partition 2 Partition 3 Partition 4


KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-27
BOAT

(Gehrke, Ganti, Ramakrishnan, Loh; SIGMOD 1999)

[Figure: the split on Age (<30 / >=30) partitions the training database into a left and a right partition.]

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

BOAT: Algorithm Overview

[Figure: algorithm overview. Sampling phase: an in-memory sample yields an approximate tree with bounds. Cleanup phase: a pass over all the data turns the approximate tree with bounds into the final tree.]
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Tutorial Overview

O Part I: Classification Trees


O Introduction
O Classification tree construction schema
O Split selection
O Pruning
O Data access
O Missing Values
O Evaluation
O Bias in split selection
(Short Break)
O Part II: Regression Trees
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-28
Missing Values

O What is the problem?


O During computation of the splitting predicate,
we can selectively ignore records with missing
values (note that this has some problems)
O But if a record r is missing the value of the
splitting attribute, r cannot participate
further in tree construction
Algorithms for missing values address this
problem.

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Mean and Mode Imputation

Assume record r has missing value r.X, and


splitting variable is X.
O Simplest algorithm:
O If X is numerical (categorical), impute the
overall mean (mode)
O Improved algorithm:
O If X is numerical (categorical), impute the class-conditional
mean(X | r.C) (respectively mode(X | r.C))

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh
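A sketch of the two imputation variants, assuming missing values are encoded as None; impute is an illustrative name, and the class-conditional variant only applies to training records, where the class label is known.

from collections import Counter

def impute(records, attr, class_attr, numerical=True):
    """Fill missing values (None) of attr, preferring the class-conditional
    mean/mode and falling back to the overall mean/mode."""
    def summarize(vals):
        if not vals:
            return None
        return sum(vals) / len(vals) if numerical else Counter(vals).most_common(1)[0][0]

    overall = summarize([r[attr] for r in records if r[attr] is not None])
    by_class = {}
    for r in records:
        if r[attr] is not None:
            by_class.setdefault(r[class_attr], []).append(r[attr])
    for r in records:
        if r[attr] is None:
            fill = summarize(by_class.get(r[class_attr], []))
            r[attr] = overall if fill is None else fill
    return records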

Surrogate Splits (CART)

Assume record r has missing value r.X, and


splitting predicate is PX.
O Idea: Find splitting predicate QX’ involving
another variable X’ != X that is most similar to
PX.
O Similarity sim(Q,P|D) between splits Q and P:
Sim(Q,P|D) = |{r in D: P(r) = Q(r)}| / |D|
(the fraction of records that P and Q send to the same side)
O 0 <= sim(Q,P|D) <= 1
O Sim(P,P|D) = 1

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-29
Surrogate Splits: Example
Training data (X1, X2, Class):
1 1 Yes
1 1 Yes
1 1 Yes
1 2 Yes
1 2 Yes
1 2 No
2 2 No
2 3 No
2 3 No
2 3 No

Consider the splitting predicate X1 <= 1.
Sim((X1 <= 1), (X2 <= 1) | D) = (3+4)/10
Sim((X1 <= 1), (X2 <= 2) | D) = (6+3)/10
(X2 <= 2) is the preferred surrogate split.
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh
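A tiny sketch that reproduces the similarity numbers from the example above, using the agreement definition of sim; similarity, P, and D are illustrative names.

def similarity(P, Q, records):
    """sim(Q, P | D): fraction of records that P and Q send to the same side."""
    return sum(1 for r in records if P(r) == Q(r)) / len(records)

# (X1, X2) pairs of the training data above
D = [(1, 1), (1, 1), (1, 1), (1, 2), (1, 2), (1, 2), (2, 2), (2, 3), (2, 3), (2, 3)]
P = lambda r: r[0] <= 1                        # primary split X1 <= 1
print(similarity(P, lambda r: r[1] <= 1, D))   # 0.7 = (3 + 4) / 10
print(similarity(P, lambda r: r[1] <= 2, D))   # 0.9 = (6 + 3) / 10 -> preferred surrogate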

Tutorial Overview

O Part I: Classification Trees


O Introduction
O Classification tree construction schema
O Split selection
O Pruning
O Data access
O Missing Values
O Evaluation
O Bias in split selection
(Short Break)
O Part II: Regression Trees
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Choice of Classification Algorithm?

O Example study: (Lim, Loh, and Shih, Machine


Learning 2000)
O 33 classification algorithms
O 16 (small) data sets (UC Irvine ML Repository)
O Each algorithm applied to each data set
O Experimental measurements:
O Classification accuracy
O Computational speed
O Classifier complexity

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-30
Experimental Setup

Algorithms:
O Tree-structure classifiers (IND, S-Plus Trees, C4.5, FACT,
QUEST, CART, OC1, LMDT, CAL5, T1)
O Statistical methods (LDA, QDA, NN, LOG, FDA, PDA, MDA, POL)
O Neural networks (LVQ, RBF)
Setup:
O 16 primary data sets, created 16 more data sets by adding noise
O Converted categorical predictor variables to 0-1 dummy
variables if necessary
O Error rates for 6 data sets estimated from supplied test sets, 10-
fold cross-validation used for the other data sets

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Results

Rank Algorithm Mean Error Time


1 Polyclass 0.195 3 hours
2 Quest Multivariate 0.202 4 min
3 Logistic Regression 0.204 4 min
6 LDA 0.208 10 s
8 IND CART 0.215 47 s
12 C4.5 Rules 0.220 20 s
16 Quest Univariate 0.221 40 s

O Number of leaves for tree-based classifiers varied widely (median
number of leaves between 5 and 32 (removing some outliers))
O Mean misclassification rates for the top 26 algorithms are not
statistically significantly different; the bottom 7 algorithms have
significantly higher error rates

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Tutorial Overview

O Part I: Classification Trees


O Introduction
O Classification tree construction schema
O Split selection
O Pruning
O Data access
O Missing Values
O Evaluation
O Bias in split selection
(Short Break)
O Part II: Regression Trees
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-31
Bias in Split Selection for ES
Assume: No correlation with the class label.

Age: 20 → 15 Yes / 15 No; 25 → 15 Yes / 15 No; 30 → 15 Yes / 15 No; 40 → 15 Yes / 15 No
Car: Sport → 20 Yes / 20 No; Truck → 20 Yes / 20 No; Minivan → 20 Yes / 20 No

O Question: Should we choose Age or Car?
O Answer: We should choose both of them equally likely!
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Formal Definition of the Bias

O Bias: “Odds of choosing X1 and X2 as split


variable when neither X1 nor X2 is correlated
with the class label”

O Formally:
Bias(X1,X2) = Log10(P(X1,X2) / (1 - P(X1,X2))),
P(X1,X2): probability of choosing variable X1 over X2

We would like: Bias(X1,X2) = 0 in the Null Case


KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Formal Definition of the Bias (Contd.)


O Example: Synthetic data with two categorical predictor variables
O X1: 10 categories (Car1, Car2, Car3, …, Car10)
O X2: 2 categories (State: CA, NY)
O For each category: same probability of choosing "Yes" (no correlation)
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-32
Evidence of the Bias
[Figure: empirical bias of the Gini, entropy, and gain ratio criteria under the null case.]

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

One Explanation

Theorem (Expected Value of the Gini Gain)

Assume:
O Two class labels
O n: number of categories
O N: number of records
O p: probability of having class label “Yes”

Then: E(ginigain) = 2p(1-p)*(n-1)/N

The expected Gini gain increases linearly with the number of categories!
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Bias Correction: Intuition

O The value of the splitting criterion is biased under
the Null Hypothesis.
O Idea: Use the p-value of the criterion:
Probability that the value of the criterion under
the Null Case is as extreme as the observed
value

Method:
1. Compute criterion (gini, entropy, etc.)
2. Compute p-value
3. Choose splitting variable
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-33
Correction Through P-Value

O New p-value criterion:


O Maintains “good” properties of your favorite
splitting criterion
O Theorem: The correction through the p-value is
nearly unbiased.
Computation:
1. Exact (randomization statistic; very expensive to compute)
2. Bootstrapping (Monte Carlo simulations; computationally
expensive; works only for small p-values)
3. Asymptotic approximations (G² for entropy, χ² distribution for the
χ² test; don’t work well in boundary conditions)
4. Tight approximations (cheap, often work well in practice)
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh
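A sketch of the Monte Carlo (randomization) route to the p-value: permute the class labels to simulate the null case and see how often the criterion is at least as large as the observed value. split_criterion_p_value and criterion are illustrative names; criterion would compute, e.g., the Gini gain of the best split on that predictor.

import random

def split_criterion_p_value(x_values, labels, criterion, K=1000, seed=0):
    """Monte Carlo p-value of a splitting criterion under the null case of no
    association between the predictor and the class label."""
    rng = random.Random(seed)
    observed = criterion(x_values, labels)
    hits = 0
    for _ in range(K):
        permuted = labels[:]
        rng.shuffle(permuted)                          # simulate the null case
        if criterion(x_values, permuted) >= observed:
            hits += 1
    return (hits + 1) / (K + 1)        # add-one smoothing avoids a p-value of exactly 0

# Variable selection: compute each predictor's p-value this way and split on
# the predictor with the smallest p-value.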

Tight Approximation

O Experimental evidence
shows that Gamma
distribution approximates
gini-gain very well.
O We can calculate:
O Expected gain:
E(gain) = 2p(1-p)*(n-1)/N
O Variance of gain:
Var(gain) = 4p(1-p)/N^2 * [(1 - 6p - 6p^2) * (Σ 1/Ni - (2n-1)/N) + 2(n-1)p(1-p)]

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

Problem: ES and Missing Value

Consider a training database with the following


schema: (X1, …, Xk, C)
O Assume the projection onto (X1, C) is the
following:

{(1, Class1), (2, Class2), (NULL, Class3), …, (NULL, ClassN)}
(X1 has missing values except for the first two
records)

O Exhaustive search will very likely split on X1!


KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-34
Concluding Remarks Part I

O There are many algorithms available for:


O Split selection
O Pruning
O Data access
O Handling missing values
O Challenges: Performance, getting the “right” model,
data streams, new applications

KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh

T3-35
