Advances in
Decision Tree Construction
Johannes Gehrke
Cornell University
johannes@cs.cornell.edu
http://www.cs.cornell.edu/johannes
Wei-Yin Loh
University of Wisconsin-Madison
loh@stat.wisc.edu
http://www.stat.wisc.edu/~loh
KDD 2001 Tutorial: Advances in Decision Trees Gehrke and Loh
Tutorial Overview
O Part I: Classification Trees
O Introduction
O Classification tree construction schema
O Split selection
O Pruning
O Data access
O Missing values
O Evaluation
O Bias in split selection
(Short Break)
O Part II: Regression Trees
Classification
Goal: Learn a function that assigns a record
to one of several predefined classes.
Classification Example
O Example training database:

    Age  Car  Class
    20   M    Yes
    30   M    Yes
    25   T    No
    30   S    Yes
    40   S    Yes
    20   T    No
    30   M    Yes
    25   M    Yes
    40   M    Yes
    20   S    No

O Two predictor attributes: Age and Car-type (Sport, Minivan, and Truck)
O Age is ordered; Car-type is a categorical attribute
O Class label indicates whether the person bought the product
O Dependent attribute is categorical
Types of Variables
O Numerical: Domain is ordered and can be
represented on the real line (e.g., age, income)
O Nominal or categorical: Domain is a finite set
without any natural ordering (e.g., occupation,
marital status, race)
O Ordinal: Domain is ordered, but absolute
differences between values are unknown (e.g.,
preference scale, severity of an injury)
Definitions
O Random variables X1, …, Xk (predictor variables)
and Y (dependent variable)
O Xi has domain dom(Xi), Y has domain dom(Y)
O P is a probability distribution on
dom(X1) × … × dom(Xk) × dom(Y)
O Training database D is a random sample from P
O A predictor d is a function
d: dom(X1) × … × dom(Xk) → dom(Y)
Classification Problem
O The dependent attribute C is called the class label; d is called a classifier.
O Let r be a record randomly drawn from P.
Define the misclassification rate of d:
RT(d,P) = P(d(r.X1, …, r.Xk) != r.C)
Problem definition: Given dataset D that is a random
sample from probability distribution P, find classifier d
such that RT(d,P) is minimized.
(More on regression problems in the second part of the
tutorial.)
Goals and Requirements
Goals:
O To produce an accurate classifier/regression function
O To understand the structure of the problem
Requirements on the model:
O High accuracy
O Understandable by humans, interpretable
O Fast construction for very large training databases
What are Decision Trees?
[Figure: a decision tree and the partition of the predictor space it induces.
Tree: root split on Age; Age >= 30 -> YES; Age < 30 -> split on Car Type: Minivan -> YES, Sports or Truck -> NO.
Plot: the (Age, Car-type) plane, with Age from 0 to 60 on the horizontal axis, divided into the corresponding rectangular regions.]
Decision Trees
O A decision tree T encodes d (a classifier or
regression function) in the form of a tree.
O A node t in T without children is called a leaf
node. Otherwise t is called an internal node.
O Each internal node has an associated splitting
predicate. Most common are binary predicates.
Example splitting predicates:
O Age <= 20
O Profession in {student, teacher}
O 5000*Age + 3*Salary – 10000 > 0
Internal and Leaf Nodes
Internal nodes:
O Binary Univariate splits:
O Numerical or ordered X: X <= c, c in dom(X)
O Categorical X: X in A, A subset dom(X)
O Binary Multivariate splits:
O Linear combination split on numerical variables:
Σ aiXi <= c
O k-ary (k>2) splits analogous
Leaf nodes:
O Node t is labeled with one class label c in dom(C)
Example
Tree: root split on Age (<30 vs. >=30); the Age >= 30 branch is a YES leaf; the Age < 30 branch splits on Car Type into Minivan (YES) and Sports, Truck (NO).
Encoded classifier:
If (age < 30 and carType = Minivan)
  Then YES
If (age < 30 and (carType = Sports or carType = Truck))
  Then NO
If (age >= 30)
  Then YES
Evaluation of Misclassification Error
Problem:
O In order to quantify the quality of a classifier d, we need
to know its misclassification rate RT(d,P).
O But unless we know P, RT(d,P) is unknown.
O Thus we need to estimate RT(d,P) as well as possible.
Approaches:
O Resubstitution estimate
O Test sample estimate
O V-fold Cross Validation
Resubstitution Estimate
The Resubstitution estimate R(d,D) estimates
RT(d,P) of a classifier d using D:
O Let D be the training database with N records.
O R(d,D) = 1/N Σ_{r in D} I(d(r.X1, …, r.Xk) != r.C)
O Intuition: R(d,D) is the fraction of training
records misclassified by d
O Problem with resubstitution estimate:
Overly optimistic; classifiers that overfit the
training dataset will have very low resubstitution
error.
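As a concrete illustration (not code from the tutorial), the resubstitution estimate can be computed in a few lines, assuming the classifier is a plain Python function and the training database is a list of (x, y) pairs:

    def resubstitution_error(classifier, data):
        """Fraction of records in `data` misclassified by `classifier`."""
        errors = sum(1 for x, y in data if classifier(x) != y)
        return errors / len(data)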
Test Sample Estimate
O Divide D into D1 and D2
O Use D1 to construct the classifier d
O Then use resubstitution estimate R(d,D2)
to calculate the estimated misclassification
error of d
O Unbiased and efficient, but removes D2
from training dataset D
V-fold Cross Validation
Procedure:
O Construct classifier d from D
O Partition D into V datasets D1, …, DV
O Construct classifier di using D \ Di
O Calculate the estimated misclassification error
R(di,Di) of di using test sample Di
Final misclassification estimate:
O Weighted combination of individual
misclassification errors:
R(d,D) = 1/V Σ R(di,Di)
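A matching sketch of V-fold cross-validation under the same record representation; the `train` argument (any function that builds a classifier from data) and the `resubstitution_error` helper from the earlier sketch are assumptions for illustration:

    import random

    def cross_validation_error(train, data, V=10, seed=0):
        """Average held-out-fold error over V folds."""
        data = list(data)
        random.Random(seed).shuffle(data)
        folds = [data[i::V] for i in range(V)]           # V disjoint partitions D1, ..., DV
        errors = []
        for i in range(V):
            test = folds[i]
            training = [r for j, f in enumerate(folds) for r in f if j != i]
            d_i = train(training)                         # classifier built on D \ Di
            errors.append(resubstitution_error(d_i, test))
        return sum(errors) / V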
Cross-Validation: Example
[Diagram: D is partitioned into folds; classifiers d1, d2, and d3 are each constructed with one fold held out and evaluated on that held-out fold]
Cross-Validation
O Misclassification estimate obtained through
cross-validation is usually nearly unbiased
O Costly computation (we need to compute d, and
d1, …, dV); computation of di is nearly as
expensive as computation of d
O Preferred method to estimate quality of learning
algorithms in the machine learning literature
Decision Tree Construction
O Top-down tree construction schema:
O Examine training database and find best splitting
predicate for the root node
O Partition training database
O Recurse on each child node
BuildTree(Node t, Training database D, Split Selection Method S)
(1) Apply S to D to find splitting criterion
(2) if (t is not a leaf node)
(3) Create children nodes of t
(4) Partition D into children partitions
(5) Recurse on each partition
(6) endif
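The schema can be written as a short recursive sketch in Python; `find_split` stands in for the split selection method S and is assumed to return either a (predicate, description) pair or None when t should become a leaf (all names here are illustrative, not from any of the systems discussed):

    def build_tree(data, find_split):
        """Top-down construction schema. data: list of (x, y) records;
        find_split(data) returns (predicate, description) or None (-> leaf)."""
        split = find_split(data)                          # step (1): apply S to D
        if split is None:                                 # t is a leaf node
            labels = [y for _, y in data]
            return {'label': max(set(labels), key=labels.count)}
        predicate, description = split
        left  = [r for r in data if predicate(r[0])]      # step (4): partition D
        right = [r for r in data if not predicate(r[0])]
        return {'split': description, 'predicate': predicate,   # step (3): children
                'left':  build_tree(left, find_split),          # step (5): recurse
                'right': build_tree(right, find_split)}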
Decision Tree Construction (Contd.)
O Three algorithmic components:
O Split selection (CART, C4.5, QUEST, CHAID,
CRUISE, …)
O Pruning (direct stopping rule, test dataset
pruning, cost-complexity pruning, statistical
tests, bootstrapping)
O Data access (CLOUDS, SLIQ, SPRINT,
RainForest, BOAT, UnPivot operator)
Split Selection Methods
O Multitude of split selection methods in the
literature
O In this tutorial:
O Impurity-based split selection: CART (most
common in today’s data mining tools)
O Model-based split selection: QUEST
Split Selection Methods: CART
O Classification And Regression Trees
(Breiman, Friedman, Olshen, Stone, 1984;
considered “the” reference on decision tree
construction)
O Commercial version sold by Salford Systems
(www.salford-systems.com)
O Many other, slightly modified implementations
exist (e.g., IBM Intelligent Miner implements the
CART split selection method)
CART Split Selection Method
Motivation: We need a way to choose
quantitatively between different splitting
predicates
O Idea: Quantify the impurity of a node
O Method: Select splitting predicate that
generates children nodes with minimum
impurity from a space of possible splitting
predicates
Intuition: Impurity Function
    X1  X2  Class
    1   1   Yes
    1   2   Yes
    1   2   Yes
    1   2   Yes
    1   2   Yes
    1   1   No
    2   1   No
    2   1   No
    2   2   No
    2   2   No

Candidate split X1 <= 1 on the root (50% Yes, 50% No):
  left child (X1 <= 1): (83% Yes, 17% No);  right child: (0% Yes, 100% No)
Candidate split X2 <= 1 on the root (50% Yes, 50% No):
  left child (X2 <= 1): (25% Yes, 75% No);  right child: (66% Yes, 33% No)
Impurity Function
Let p(j|t) be the proportion of class j training records at
node t. Then the node impurity measure at node t:
i(t) = phi(p(1|t), …, p(J|t))
Properties:
O phi is symmetric in its arguments
O Maximum value at the uniform distribution (1/J, …, 1/J)
O phi(1,0,…,0) = … = phi(0,…,0,1) = 0
The reduction in impurity through splitting predicate s (t splits into a left child tL and a right child tR that receive fractions pL and pR of the records) is:
∆phi(s,t) = phi(t) – pL phi(tL) – pR phi(tR)
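For concreteness, a small sketch of computing this impurity reduction from record lists, with gini and entropy given as example impurity functions (illustrative helper code, assuming two-element (x, y) records as before; the two functions are the ones defined on the concavity slide below):

    import math
    from collections import Counter

    def class_proportions(records):
        counts = Counter(y for _, y in records)
        n = len(records)
        return [c / n for c in counts.values()]

    def gini(p):    return 1.0 - sum(q * q for q in p)
    def entropy(p): return -sum(q * math.log(q) for q in p if q > 0)

    def impurity_reduction(phi, node, left, right):
        """Delta phi(s,t) = phi(t) - pL*phi(tL) - pR*phi(tR)."""
        p_left = len(left) / len(node)
        p_right = len(right) / len(node)
        return (phi(class_proportions(node))
                - p_left * phi(class_proportions(left))
                - p_right * phi(class_proportions(right)))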
Example
Consider the split X1 <= 1 from the previous slide.
Root node t:     p(1|t) = 0.5,  p(2|t) = 0.5
Left child tL:   p(1|tL) = 0.83,  p(2|tL) = 0.17   (6 of the 10 records)
Right child tR:  p(1|tR) = 0.0,   p(2|tR) = 1.0    (4 of the 10 records)
Impurity of the root node:        phi(0.5, 0.5)
Impurity of the left child node:  phi(0.83, 0.17)
Impurity of the right child node: phi(0.0, 1.0)
Impurity of the whole tree:       0.6*phi(0.83, 0.17) + 0.4*phi(0, 1)
Impurity reduction:               phi(0.5, 0.5) - 0.6*phi(0.83, 0.17) - 0.4*phi(0, 1)
Error Reduction as Impurity Function
O Possible impurity function: the resubstitution error R(T, D)
O Example (the two candidate splits from the previous slides):
  T1: split X1 <= 1, children (83%,17%) and (0%,100%)
  T2: split X2 <= 1, children (25%,75%) and (66%,33%)
  R(no tree, D) = 0.5
  R(T1, D) = 0.6*0.17 = 0.1
  R(T2, D) = 0.4*0.25 + 0.6*0.33 = 0.3
Problems with Resubstitution Error
O Obvious problem: there are situations where no split can decrease impurity
O Example: a node with class distribution (80%, 20%); the split X3 <= 1 produces a left child with 6 records and distribution (83%, 17%) and a right child with 4 records and distribution (75%, 25%), both labeled Yes
  R(no tree, D) = 0.2
  R(T1, D) = 0.6*0.17 + 0.4*0.25 = 0.2
O More subtle problems exist
Remedy: Concavity
Concave Impurity Functions
Use impurity functions phi that are strictly concave (phi'' < 0).
Example concave impurity functions:
O Entropy: phi(t) = - Σj p(j|t) log p(j|t)
O Gini index: phi(t) = 1 - Σj p(j|t)^2

Nonnegative Decrease in Impurity
Theorem: Let phi(p1, …, pJ) be a strictly concave function on {(p1, …, pJ): Σj pj = 1}.
Then for any split s: ∆phi(s,t) >= 0,
with equality if and only if p(j|tL) = p(j|tR) = p(j|t) for j = 1, …, J.
CART Univariate Split Selection
O Use gini-index as impurity function
O For each numerical or ordered attribute X,
consider all binary splits s of the form
X <= x
where x in dom(X)
O For each categorical attribute X, consider all
binary splits s of the form
X in A, where A subset dom(X)
O At a node t, select split s* such that
∆phi(s*,t) is maximal over all s considered
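A sketch of the exhaustive search over binary splits X <= x on one numerical attribute, reusing the `impurity_reduction` and `gini` helpers sketched above (illustrative, not CART's actual implementation):

    def best_numerical_split(records, phi=gini):
        """records: list of (x, y) with numerical x.
        Returns (best threshold, best impurity reduction)."""
        records = sorted(records, key=lambda r: r[0])
        best_x, best_gain = None, 0.0
        for i in range(1, len(records)):
            if records[i - 1][0] == records[i][0]:
                continue                                  # split only between distinct values
            left, right = records[:i], records[i:]
            gain = impurity_reduction(phi, records, left, right)
            if gain > best_gain:
                best_x, best_gain = records[i - 1][0], gain
        return best_x, best_gain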
CART: Shortcut for Categorical Splits
Computational shortcut if the class label has two values (|dom(Y)| = 2).
O Theorem: Let X be a categorical attribute with
dom(X) = {b1, …, bk}, |dom(Y)| = 2, let phi be a concave
impurity function, and order the categories so that
p(Y=1 | X=b1) <= … <= p(Y=1 | X=bk).
Then the best split is of the form:
X in {b1, b2, …, bl} for some l < k
O Benefit: We only need to check k-1 subsets of
dom(X) instead of 2^(k-1) - 1 subsets
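A sketch of the shortcut: order the categories by their class proportion and evaluate only the k-1 prefix subsets (again illustrative and built on the helpers above):

    def best_categorical_split(records, phi=gini):
        """records: list of (x, y) with categorical x and two class labels."""
        by_cat = {}
        for x, y in records:
            by_cat.setdefault(x, []).append(y)
        classes = sorted({y for _, y in records})
        # order categories b1, ..., bk by the proportion of one class
        order = sorted(by_cat,
                       key=lambda b: by_cat[b].count(classes[0]) / len(by_cat[b]))
        best_subset, best_gain = None, 0.0
        for l in range(1, len(order)):                    # k-1 prefixes, not 2^(k-1)-1 subsets
            subset = set(order[:l])
            left  = [r for r in records if r[0] in subset]
            right = [r for r in records if r[0] not in subset]
            gain = impurity_reduction(phi, records, left, right)
            if gain > best_gain:
                best_subset, best_gain = subset, gain
        return best_subset, best_gain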
Problems with CART Split Selection
O Biased towards variables with more splits:
an M-category variable has 2^(M-1) - 1 possible
binary splits, while an M-valued ordered variable has
only M-1 possible splits
(Explanation and remedy later)
O Computationally expensive for categorical
variables with large domains
QUEST: Model-based split selection
“The purpose of models is not to fit the data
but to sharpen the questions.”
Karlin, Samuel (1923 - )
(11th R. A. Fisher Memorial Lecture, Royal Society, 20 April 1983)
Split Selection Methods: QUEST
O Quick, Unbiased, Efficient, Statistical Tree
(Loh and Shih, Statistica Sinica, 1997)
Freeware, available at www.stat.wisc.edu/~loh
Also implemented in SPSS.
O Main new ideas:
O Separate splitting predicate selection into variable
selection and split point selection
O Use statistical significance tests instead of impurity
function
QUEST Variable Selection
Let X1, …, Xl be numerical predictor variables, and
let Xl+1, …, Xk be categorical predictor
variables.
1. Find the p-value from an ANOVA F-test for each numerical
variable.
2. Find the p-value from a chi-squared test for each categorical
variable.
3. Choose the variable Xk' with the overall smallest p-value pk'.
(Actual algorithm is more complicated.)
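A much-simplified sketch of this variable-selection step using scipy.stats and pandas (assumed available); it computes the two kinds of p-values and picks the smallest, and it ignores the refinements of the real QUEST algorithm:

    import numpy as np
    import pandas as pd
    from scipy import stats

    def select_variable(numerical, categorical, y):
        """numerical: dict name -> numpy array; categorical: dict name -> label array;
        y: numpy array of class labels. Returns the name with the smallest p-value."""
        p_values = {}
        for name, x in numerical.items():
            groups = [x[y == c] for c in np.unique(y)]
            p_values[name] = stats.f_oneway(*groups).pvalue          # ANOVA F-test
        for name, x in categorical.items():
            table = pd.crosstab(pd.Series(x), pd.Series(y))          # contingency table
            chi2, p, dof, expected = stats.chi2_contingency(table)   # chi-squared test
            p_values[name] = p
        return min(p_values, key=p_values.get)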
QUEST Split Point Selection
CRIMCOORD transformation of categorical
variables into numerical variables:
1. Take categorical variable X with domain
dom(X)={x1, …, xl}
2. For each record in the training database, create
vector (v1, …, vl) where vi = I(X=xi)
3. Find principal components of set of vectors V
4. Project the dimensionality-reduced data onto the
largest discriminant coordinate dxi
5. Replace X with the numerical value dxi in the rest of the
algorithm
CRIMCOORDs: Examples
O Values(X|Y=1) = {4c1, c2, 5c3}, values(X|Y=2) = {2c1, 2c2, 6c3}
  => dx1 = 1, dx2 = -1, dx3 = -0.3
O Values(X|Y=1) = {5c1, 5c3}, values(X|Y=2) = {5c1, 5c3}
  => dx1 = 1, dx2 = 0, dx3 = 1
O Values(X|Y=1) = {5c1, 5c3}, values(X|Y=2) = {5c1, c2, 5c3}
  => dx1 = 1, dx2 = -1, dx3 = 1
Advantages
O Avoid exponential subset search from CART
O Each dxi has the form Σ bi I(X=xi) for some b1, …, bl,
thus there is a 1-1 correspondence between subsets of
X and a dxi
QUEST Split Point Selection
O Assume X is the selected variable (either numerical, or categorical
transformed to CRIMCOORDS)
O Group J>2 classes into two superclasses
O Now problem is reduced to one-dimensional two-class problem
O Use exhaustive search for the best split point (like in CART)
O Use quadratic discriminant analysis (QDA, next few bullets)
QUEST Split Point Selection: QDA
O Let x1, x2 and s1^2, s2^2 be the means and variances of X for the two
superclasses
O Make a normal distribution assumption, and find the intersections of the
two normal densities N(x1, s1^2) and N(x2, s2^2)
O QDA splits the X-axis into three intervals
O Select as split point the root that is closer to the sample means
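A sketch of the QDA split-point computation: equating the two normal log-densities gives a quadratic whose roots are the candidate split points (illustrative code, not QUEST's implementation; s1 and s2 are standard deviations):

    import numpy as np

    def qda_split_point(x1, s1, x2, s2):
        """Intersection of N(x1, s1^2) and N(x2, s2^2) densities;
        returns the root closest to the sample means (assumes they intersect)."""
        # Equate log densities: a*x^2 + b*x + c = 0
        a = 1.0 / s1**2 - 1.0 / s2**2
        b = 2.0 * (x2 / s2**2 - x1 / s1**2)
        c = x1**2 / s1**2 - x2**2 / s2**2 + 2.0 * np.log(s1 / s2)
        if abs(a) < 1e-12:                      # equal variances: single root at the midpoint
            return -c / b
        roots = np.roots([a, b, c])
        roots = roots[np.isreal(roots)].real
        midpoint = 0.5 * (x1 + x2)
        return roots[np.argmin(np.abs(roots - midpoint))]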
Illustration: QDA Splits
[Plot: density curves of N(0,1) and N(2, 2.25); the QDA split point is taken at an intersection of the two curves]
QUEST Linear Combination Splits
O Transform all categorical variables to
CRIMCOORDS
O Apply PCA to the correlation matrix of the data
O Drop the smallest principal components, and
project the remaining components onto the
largest CRIMCOORD
O Group J>2 classes into two superclasses
O Find the split on the largest CRIMCOORD using
exhaustive search (ES) or QDA
Key Differences CART/QUEST
Feature                                       QUEST               CART
Variable selection                            Statistical tests   Exhaustive search (ES)
Split point selection                         QDA or ES           ES
Categorical variables                         CRIMCOORDS          ES
Monotone transformations of numerical vars    Not invariant       Invariant
Ordinal variables                             No                  Yes
Variable selection bias                       No                  Yes (No)
Pruning Methods
O Test dataset pruning
O Direct stopping rule
O Cost-complexity pruning (not covered)
O MDL pruning
O Pruning by randomization testing
Stopping Policies
A stopping policy indicates when further growth of
the tree at a node t is counterproductive.
O All records are of the same class
O The attribute values of all records are identical
O All records have missing values
O At most one class has a number of records
larger than a user-specified number
O All records go to the same child node if t is split
(only possible with some split selection
methods)
Test Dataset Pruning
O Use an independent test sample D’ to
estimate the misclassification cost using
the resubstitution estimate R(T,D’) at
each node
O Select the subtree T’ of T with the
smallest expected cost
Reduced Error Pruning
(Quinlan, C4.5, 1993)
O Assume observed misclassification rate at
a node is p
O Replace p (pessimistically) with the upper
75% confidence bound p’, assuming a
binomial distribution
O Then use p’ to estimate error rate of the
node
Pruning Using the MDL Principle
(Mehta, Rissanen, Agrawal, KDD 1996)
Also used before by Fayyad, Quinlan, and others.
O MDL: Minimum Description Length Principle
O Idea: Think of the decision tree as encoding the
class labels of the records in the training
database
O MDL Principle: The best tree is the tree that
encodes the records using the fewest bits
How To Encode a Node
Given a node t, we need to encode the following:
O Nodetype: One bit to encode the type of each node
(leaf or internal node)
For an internal node:
O Cost(P(t)): The cost of encoding the splitting
predicate P(t) at node t
For a leaf node:
O n*E(t): The cost of encoding the records in leaf node
t with n records from the training database (E(t) is
the entropy of t)
How To Encode a Tree
Recursive definition of the minimal cost of a node:
O Node t is a leaf node:
cost(t)= n*E(t)
O Node t is an internal node with children nodes t1
and t2. Choice: Either make t a leaf node, or
take the best subtrees, whatever is cheaper:
cost(t) =
min( n*E(t), 1+cost(P(t))+cost(t1)+cost(t2))
How to Prune
1. Construct decision tree to its maximum
size
2. Compute the MDL cost for each node of
the tree bottom-up
3. Prune the tree bottom-up:
If cost(t)=n*E(t), make t a leaf node.
Resulting tree is the final tree output by
the pruning algorithm.
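A sketch of this bottom-up MDL computation on a tree built with the `build_tree` sketch from earlier; `split_cost` is an assumed callback for cost(P(t)), since that cost depends on the predicate type:

    import math

    def leaf_cost(records):
        """n * E(t): number of records times the entropy of their class labels."""
        n = len(records)
        counts = {}
        for _, y in records:
            counts[y] = counts.get(y, 0) + 1
        return n * -sum(c / n * math.log2(c / n) for c in counts.values())

    def mdl_prune(node, records, split_cost):
        """Bottom-up MDL pruning; returns (pruned node, MDL cost of the node)."""
        as_leaf = leaf_cost(records)
        if 'split' not in node:                                   # already a leaf
            return node, as_leaf
        pred = node['predicate']
        left,  cl = mdl_prune(node['left'],
                              [r for r in records if pred(r[0])], split_cost)
        right, cr = mdl_prune(node['right'],
                              [r for r in records if not pred(r[0])], split_cost)
        keep = 1 + split_cost(node['split']) + cl + cr
        if as_leaf <= keep:                                       # cost(t) = n*E(t): prune
            labels = [y for _, y in records]
            return {'label': max(set(labels), key=labels.count)}, as_leaf
        return dict(node, left=left, right=right), keep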
Performance Improvements: PUBLIC
(Rastogi and Shim, VLDB 1998)
O MDL bottom-up pruning requires construction of
a complete tree before the bottom-up pruning
can start
O Idea: Prune the tree during (not after) the tree
construction phase
O Why is this possible?
O Calculate a lower bound on cost(t) and compare it
with n*E(t)
PUBLIC Lower Bound Theorem
O Theorem: Consider a classification problem with
k predictor attributes and J classes. Let Tt be a
subtree with s internal nodes, rooted at node t,
let ni be the number of records with class label i.
Then
cost(Tt) >= 2*s+1+s*log k + Σ ni
O Lower bound on cost(Tt) is thus the minimum
of:
O n*E(t) + 1 (t becomes a leaf node)
O 2*s+1+s*log k + Σ ni (subtree at t remains)
Large Datasets Lead to Large Trees
O Oates and Jensen (KDD 1998)
O Problem: Constant probability distribution P,
datasets D1, D2, …, Dk with
|D1| < |D2| < … < |Dk|,  |Dk| = c*|Dk-1| = … = c^(k-1)*|D1|
O Observation: Trees grow
|T1| < |T2| < … < |Tk|,  |Tk| = c'*|Tk-1| = … = c'^(k-1)*|T1|
O But: No gain in accuracy from the larger trees
R(T1,D1) ~ R(T2,D2) ~ … ~ R(Tk,Dk)
Pruning By Randomization Testing
O Reduce the pruning decision at each node to a hypothesis test
O Generate the empirical distribution of the pruning statistic under the null
hypothesis for that node
Node n with subtree T(n) and pruning statistic S(n)
For (i=0; i<K; i++)
1. Randomize class labels of the data at n
2. Build and prune a tree rooted at n
3. Calculate pruning statistic Si(n)
Compare S(n) to empirical distribution of Si(n) to estimate significance
of S(n)
If S(n) is not significant enough compared to a significance level alpha,
then prune T(n) to n
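A sketch of this randomization test at one node; `grow_and_prune` (any routine that builds and prunes a subtree from data) and `statistic` (the pruning statistic S) are placeholders, not names from the paper:

    import random

    def randomization_prune_decision(records, grow_and_prune, statistic,
                                     K=100, alpha=0.05, seed=0):
        """Return True if the subtree rooted at this node should be pruned away."""
        rng = random.Random(seed)
        observed = statistic(grow_and_prune(records))
        labels = [y for _, y in records]
        count_ge = 0
        for _ in range(K):
            rng.shuffle(labels)                            # break any X-Y association
            permuted = [(x, y) for (x, _), y in zip(records, labels)]
            if statistic(grow_and_prune(permuted)) >= observed:
                count_ge += 1
        p_value = (count_ge + 1) / (K + 1)                 # empirical significance of S(n)
        return p_value > alpha                             # not significant -> prune T(n) to n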
SLIQ
Mehta, Agrawal, Rissanen (EDBT 1996)
O Motivation:
O Scalable data access method for CART
O To find the best split we need to evaluate the impurity function
at all possible split points for each numerical attribute, at each
node of the tree
O Idea: Avoid re-sorting at each node of the tree through pre-
sorting and maintenance of sort orders
O Ideas:
O Uses vertical partitioning to avoid re-sorting
O Main-memory resident data structure with schema (class label,
leaf node index)
Very likely to fit in-memory for nearly all training databases
SLIQ: Pre-Sorting
The example training database is pre-sorted into one attribute list per attribute plus a class list:

Training database (Age, Car, Class):
  (20,M,Yes) (30,M,Yes) (25,T,No) (30,S,Yes) (40,S,Yes)
  (20,T,No) (30,M,Yes) (25,M,Yes) (40,M,Yes) (20,S,No)
Age attribute list (Age, Ind), sorted on Age:
  (20,1) (20,6) (20,10) (25,3) (25,8) (30,2) (30,4) (30,7) (40,5) (40,9)
Class list (Ind, Class, Leaf), all records initially in leaf 1:
  (1,Yes,1) (2,Yes,1) (3,No,1) (4,Yes,1) (5,Yes,1)
  (6,No,1) (7,Yes,1) (8,Yes,1) (9,Yes,1) (10,No,1)
SLIQ: Evaluation of Splits
Age attribute list (Age, Ind):
  (20,1) (20,6) (20,10) (25,3) (25,8) (30,2) (30,4) (30,7) (40,5) (40,9)
Class list (Ind, Class, Leaf):
  (1,Yes,2) (2,Yes,2) (3,No,2) (4,Yes,3) (5,Yes,3)
  (6,No,2) (7,Yes,2) (8,Yes,2) (9,Yes,2) (10,No,3)
Class histograms maintained per leaf while scanning the Age list:
  Node 2:  Left  2 Yes, 0 No    Right  3 Yes, 2 No
  Node 3:  Left  0 Yes, 1 No    Right  2 Yes, 0 No
SLIQ: Splitting of a Node
After installing the splits, only the Leaf entries of the class list are updated in place; the pre-sorted Age attribute list is unchanged:

Age attribute list (Age, Ind):
  (20,1) (20,6) (20,10) (25,3) (25,8) (30,2) (30,4) (30,7) (40,5) (40,9)
Class list (Ind, Class, Leaf):
  (1,Yes,4) (2,Yes,5) (3,No,5) (4,Yes,7) (5,Yes,7)
  (6,No,4) (7,Yes,7) (8,Yes,7) (9,Yes,7) (10,No,6)
Tree (node ids):        1
                      /   \
                     2     3
                    / \   / \
                   4   5 6   7
SLIQ: Summary
O Uses vertical partitioning to avoid re-
sorting
O Main-memory resident data structure with
schema (class label, leaf node index)
Very likely to fit in-memory for nearly all
training databases
SPRINT
Shafer, Agrawal, Mehta (VLDB 1996)
O Motivation:
O Scalable data access method for CART
O Improvement over SLIQ to avoid main-memory data structure
O Ideas:
O Create vertical partitions called attribute lists for each attribute
O Pre-sort the attribute lists
Recursive tree construction:
1. Scan all attribute lists at node t to find the best split
2. Partition current attribute lists over children nodes while
maintaining sort orders
3. Recurse
SPRINT Attribute Lists
Training database (Age, Car, Class):
  (20,M,Yes) (30,M,Yes) (25,T,No) (30,S,Yes) (40,S,Yes)
  (20,T,No) (30,M,Yes) (25,M,Yes) (40,M,Yes) (20,S,No)
Age attribute list (Age, Class, Ind), pre-sorted on Age:
  (20,Yes,1) (20,No,6) (20,No,10) (25,No,3) (25,Yes,8)
  (30,Yes,2) (30,Yes,4) (30,Yes,7) (40,Yes,5) (40,Yes,9)
Car attribute list (Car, Class, Ind):
  (M,Yes,1) (M,Yes,2) (T,No,3) (S,Yes,4) (S,Yes,5)
  (T,No,6) (M,Yes,7) (M,Yes,8) (M,Yes,9) (S,No,10)
SPRINT: Evaluation of Splits
Age attribute list (Age, Class, Ind):
  (20,Yes,1) (20,No,6) (20,No,10) (25,No,3) (25,Yes,8)
  (30,Yes,2) (30,Yes,4) (30,Yes,7) (40,Yes,5) (40,Yes,9)
Class histogram at node 1 with the scan cursor after the three Age = 20 entries:
  Left   1 Yes, 2 No
  Right  6 Yes, 1 No
SPRINT: Splitting of a Node
1. Scan all attribute lists to find the best
split
2. Partition the attribute list of the splitting
attribute X
3. For each attribute Xi != X
Perform the partitioning step of a hash-join
between the attribute list of X and the
attribute list of Xi
SPRINT: Hash-Join Partitioning
Here the splitting attribute is Car, and the Minivan records (ids 1, 2, 7, 8, 9) go to one child.
The partitioned Car attribute list for that child is:
  (M,Yes,1) (M,Yes,2) (M,Yes,7) (M,Yes,8) (M,Yes,9)
Probing a hash table built on these record ids while scanning the Age attribute list routes each
entry to the correct child; the remaining entries (ids 3, 4, 5, 6, 10) go to the right child:
  (20,Yes,1)  (20,No,6)->right  (20,No,10)->right  (25,No,3)->right  (25,Yes,8)
  (30,Yes,2)  (30,Yes,4)->right  (30,Yes,7)  (40,Yes,5)->right  (40,Yes,9)
SPRINT: Summary
O Scalable data access method for CART
split selection method
O Completely scalable, can be (and has
been) implemented “inside” a database
system
O Hash-join partitioning step expensive
(each attribute, at each node of the tree)
RainForest
(Gehrke, Ramakrishnan, Ganti, VLDB 1998)
Training Database              AVC-set for Age          AVC-set for Car
 Age  Car  Class                Age   Yes  No            Car      Yes  No
 20   M    Yes                  20    1    2             Sport    2    1
 30   M    Yes                  25    1    1             Truck    0    2
 25   T    No                   30    3    0             Minivan  5    0
 30   S    Yes                  40    2    0
 40   S    Yes
 20   T    No
 30   M    Yes
 25   M    Yes
 40   M    Yes
 20   S    No
Refined RainForest Top-Down Schema
BuildTree(Node n, Training database D,
Split Selection Method S)
[ (1) Apply S to D to find splitting criterion ]
(1a) for each predictor attribute X
(1b) Call S.findSplit(AVC-set of X)
(1c) endfor
(1d) S.chooseBest();
(2) if (n is not a leaf node) ...
S: C4.5, CART, CHAID, FACT, ID3, GID3, QUEST, etc.
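As an illustration of what step (1b) consumes, a sketch of building the AVC-group of a node in a single sequential scan of its partition (not the RainForest implementation; the record layout and names are assumptions):

    from collections import defaultdict

    def build_avc_group(partition, predictor_names):
        """partition: iterable of records given as dicts {attribute: value, 'Class': label}.
        Returns {attribute: {attribute value: {class label: count}}}."""
        avc = {a: defaultdict(lambda: defaultdict(int)) for a in predictor_names}
        for record in partition:                      # one sequential scan of the partition
            label = record['Class']
            for a in predictor_names:
                avc[a][record[a]][label] += 1         # AVC entry: (attribute value, class, count)
        return avc

    # e.g. avc = build_avc_group(rows, ['Age', 'Car']); avc['Car']['Minivan']['Yes'] -> 5
    # (using the example training database above)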
RainForest Data Access Method
Assume the data partition at a node is D. Then the
following steps are carried out:
1. Construct AVC-group of the node
2. Choose splitting attribute and splitting predicate
3. Partition D across the children
RainForest Algorithms: RF-Write
First scan: [Diagram: one scan of the database builds the AVC-sets of the root node in main memory]
RainForest Algorithms: RF-Write
Second scan: [Diagram: the database is scanned again and, using the chosen root split Age<30, written out as Partition 1 and Partition 2]
RainForest Algorithms: RF-Write
Analysis:
O Assumes that the AVC-group of the root node fits into main
memory
O Two database scans per level of the tree
O Usually more main memory is available than a single AVC-group needs
[Diagram: the two database scans per level; a single AVC-group leaves most of main memory unused]
RainForest Algorithms: RF-Read
First scan: [Diagram: one scan of the database builds the AVC-sets of the root node in main memory]
RainForest Algorithms: RF-Read
Second scan: [Diagram: the entire database is scanned again; the AVC-sets of both children of the root (split Age<30) are built in main memory, with no partitions written]
RainForest Algorithms: RF-Read
Third scan: [Diagram: another full scan builds the AVC-sets of the nodes at the next level (below the splits Sal<20k and Car==S), still without partitioning the database]
RainForest Algorithms: RF-Hybrid
First scan: [Diagram: as in RF-Read, one scan builds the AVC-sets of the root node in main memory]
RainForest Algorithms: RF-Hybrid
Second scan: [Diagram: RF-Read style; the AVC-sets for the children of the root (split Age<30) are built in one scan without partitioning]
RainForest Algorithms: RF-Hybrid
Third scan: [Diagram: when the AVC-groups of the new nodes no longer all fit in memory, the algorithm switches to RF-Write behavior and partitions the database into Partitions 1-4, one per node below the splits Sal<20k and Car==S]
RainForest Algorithms: RF-Hybrid
Further optimization: While writing partitions,
concurrently build AVC-groups of as many nodes
as possible in-memory
[Diagram: while Partitions 1-4 are written out, AVC-groups for some of the new nodes are built concurrently in main memory]
BOAT
(Gehrke, Ganti, Ramakrishnan, Loh; SIGMOD 1999)
[Diagram: the training database is split by the root predicate Age (<30 vs. >=30) into a left partition and a right partition]
BOAT: Algorithm Overview
[Diagram: an in-memory sample goes through a sampling phase that produces an approximate tree with bounds; a cleanup phase over all the data then produces the final tree]
Missing Values
O What is the problem?
O During computation of the splitting predicate,
we can selectively ignore records with missing
values (note that this has some problems)
O But if a record r is missing the value of the
splitting attribute, r cannot participate further
in tree construction
Algorithms for missing values address this
problem.
Mean and Mode Imputation
Assume record r has missing value r.X, and
splitting variable is X.
O Simplest algorithm:
O If X is numerical (categorical), impute the
overall mean (mode)
O Improved algorithm:
O If X is numerical (categorical), impute the
class-conditional mean(X | r.C) (the class-conditional mode(X | r.C))
Surrogate Splits (CART)
Assume record r has missing value r.X, and
splitting predicate is PX.
O Idea: Find splitting predicate QX’ involving
another variable X’ != X that is most similar to
PX.
O Similarity sim(Q,P|D) between splits Q and P on a dataset D:
sim(Q,P|D) = |{r in D: Q(r) = P(r)}| / |D|
(the fraction of records that Q and P send to the same side)
O 0 <= sim(Q,P|D) <= 1
O sim(P,P|D) = 1
Surrogate Splits: Example
Consider the splitting predicate X1 <= 1 on the dataset below.
  sim((X1 <= 1), (X2 <= 1) | D) = (3+4)/10 = 0.7
  sim((X1 <= 1), (X2 <= 2) | D) = (6+3)/10 = 0.9
(X2 <= 2) is the preferred surrogate split.

    X1  X2  Class
    1   1   Yes
    1   1   Yes
    1   1   Yes
    1   2   Yes
    1   2   Yes
    1   2   No
    2   2   No
    2   3   No
    2   3   No
    2   3   No
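A sketch of surrogate-split selection using this similarity measure; the record representation (dicts keyed by attribute name) is an assumption for illustration:

    def similarity(P, Q, data):
        """Fraction of records on which predicates P and Q agree."""
        return sum(1 for r in data if P(r) == Q(r)) / len(data)

    def best_surrogate(P, candidates, data):
        """candidates: list of (description, predicate) pairs on other variables."""
        return max(candidates, key=lambda c: similarity(P, c[1], data))

    # Example from the slide above:
    D = ([{'X1': 1, 'X2': 1}] * 3 + [{'X1': 1, 'X2': 2}] * 3 +
         [{'X1': 2, 'X2': 2}] + [{'X1': 2, 'X2': 3}] * 3)
    P = lambda r: r['X1'] <= 1
    print(similarity(P, lambda r: r['X2'] <= 1, D))   # 0.7
    print(similarity(P, lambda r: r['X2'] <= 2, D))   # 0.9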
Choice of Classification Algorithm?
O Example study: (Lim, Loh, and Shih, Machine
Learning 2000)
O 33 classification algorithms
O 16 (small) data sets (UC Irvine ML Repository)
O Each algorithm applied to each data set
O Experimental measurements:
O Classification accuracy
O Computational speed
O Classifier complexity
Experimental Setup
Algorithms:
O Tree-structure classifiers (IND, S-Plus Trees, C4.5, FACT,
QUEST, CART, OC1, LMDT, CAL5, T1)
O Statistical methods (LDA, QDA, NN, LOG, FDA, PDA, MDA, POL)
O Neural networks (LVQ, RBF)
Setup:
O 16 primary data sets, created 16 more data sets by adding noise
O Converted categorical predictor variables to 0-1 dummy
variables if necessary
O Error rates for 6 data sets estimated from supplied test sets, 10-
fold cross-validation used for the other data sets
Results
Rank Algorithm Mean Error Time
1 Polyclass 0.195 3 hours
2 Quest Multivariate 0.202 4 min
3 Logistic Regression 0.204 4 min
6 LDA 0.208 10 s
8 IND CART 0.215 47 s
12 C4.5 Rules 0.220 20 s
16 Quest Univariate 0.221 40 s
…
O Number of leaves for tree-based classifiers varied widely (median
number of leaves between 5 and 32 (removing some outliers))
O Mean misclassification rates of the top 26 algorithms are not
statistically significantly different; the bottom 7 algorithms have
significantly higher error rates
Bias in Split Selection for ES
Assume: neither variable is correlated with the class label.

    Age  Yes  No          Car      Yes  No
    20   15   15          Sport    20   20
    25   15   15          Truck    20   20
    30   15   15          Minivan  20   20
    40   15   15

O Question: Should we choose Age or Car?
O Answer: We should choose both of them equally likely!
Formal Definition of the Bias
O Bias: “odds of choosing X1 over X2 as the split
variable when neither X1 nor X2 is correlated
with the class label”
O Formally:
Bias(X1,X2) = log10( P(X1,X2) / (1 - P(X1,X2)) ),
where P(X1,X2) is the probability of choosing variable X1 over X2
We would like: Bias(X1,X2) = 0 in the Null Case
Formal Definition of the Bias (Contd.)
O Example: synthetic data with two categorical
predictor variables
O X1: 10 categories (Car1, …, Car10)
O X2: 2 categories (State = CA or NY)
O For each category: the same probability of
class “Yes” (no correlation with the class label)
[Tables: Yes/No counts per category of X1 (Car1 … Car10) and of X2 (CA, NY)]
Evidence of the Bias
[Plots: empirical bias of split selection under the null case for the Gini index, entropy, and gain-ratio criteria]
One Explanation
Theorem: (Expected Value of the Gini Gain)
Assume:
O Two class labels
O n: number of categories
O N: number of records
O p: probability of having class label “Yes”
Then: E(Gini gain) = 2p(1-p)*(n-1)/N
The expected Gini gain increases linearly with the number of categories!
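The linear dependence on the number of categories can be checked with a small Monte Carlo simulation of the null case (labels drawn independently of the categories); this illustrative check uses the gain of the full multiway split on all n categories:

    import random

    def null_gini_gain(n_categories, N=120, p=0.5, trials=2000, seed=0):
        """Average Gini gain of splitting on an uninformative n-category variable
        (two classes, labels independent of the categories)."""
        rng = random.Random(seed)

        def gini(labels):
            if not labels:
                return 0.0
            q = sum(labels) / len(labels)
            return 2 * q * (1 - q)

        total = 0.0
        for _ in range(trials):
            cats = [rng.randrange(n_categories) for _ in range(N)]
            ys = [1 if rng.random() < p else 0 for _ in range(N)]
            gain = gini(ys)
            for c in range(n_categories):
                subset = [y for cat, y in zip(cats, ys) if cat == c]
                gain -= len(subset) / N * gini(subset)
            total += gain
        return total / trials

    # null_gini_gain(3) is roughly twice null_gini_gain(2), etc.,
    # in line with E(gain) = 2p(1-p)(n-1)/N.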
Bias Correction: Intuition
O The value of the splitting criterion is biased under
the Null Hypothesis.
O Idea: Use p-value of the criterion:
Probability that the value of the criterion under
the Null Case is as extreme as the observed
value
Method:
1. Compute criterion (gini, entropy, etc.)
2. Compute p-value
3. Choose splitting variable
Correction Through P-Value
O New p-value criterion:
O Maintains “good” properties of your favorite
splitting criterion
O Theorem: The correction through the p-value is
nearly unbiased.
Computation:
1. Exact (randomization statistic; very expensive to compute)
2. Bootstrapping (Monte Carlo simulations; computationally
expensive; works only for small p-values)
3. Asymptotic approximations (G2 for entropy, Chi2 distribution for
Chi2 test; don’t work well in boundary conditions)
4. Tight approximations (cheap, often work well in practice)
Tight Approximation
O Experimental evidence
shows that a Gamma distribution
approximates the distribution of the
Gini gain very well.
O We can calculate:
O Expected gain:
  E(gain) = 2p(1-p)*(n-1)/N
O Variance of gain:
  Var(gain) = 4p(1-p)/N^2 * [ (1 - 6p - 6p^2)*(Σ 1/Ni - (2n-1)/N) + 2(n-1)*p*(1-p) ]
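Given the two moments above, a gamma approximation of the p-value amounts to a method-of-moments fit; a sketch with scipy.stats (assumed available; illustrative only, not the tutorial's procedure):

    from scipy import stats

    def gini_gain_p_value(observed_gain, mean_gain, var_gain):
        """Approximate P(gain >= observed) under the null with a Gamma(shape, scale)
        matched to the first two moments."""
        scale = var_gain / mean_gain            # theta = Var / E
        shape = mean_gain / scale               # k = E^2 / Var
        return stats.gamma.sf(observed_gain, a=shape, scale=scale)

    # Variable selection then compares these p-values across predictors
    # instead of the raw gains.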
Problem: ES and Missing Value
Consider a training database with the following
schema: (X1, …, Xk, C)
O Assume the projection onto (X1, C) is the
following:
{(1, Class1), (2, Class2), (NULL, Class3), …, (NULL, ClassN)}
(X1 has missing values for all records except
the first two)
O Exhaustive search will very likely split on X1!
Concluding Remarks Part I
O There are many algorithms available for:
O Split selection
O Pruning
O Data access
O Handling missing values
O Challenges: Performance, getting the “right” model,
data streams, new applications