Chap4 Classification Lecture 5
Outlines
Classification—A Two-Step Process
◼ Model construction: describing a set of predetermined classes
◼ Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
◼ The model is represented as classification rules, decision trees, or
mathematical formulae
◼ Model usage: for classifying future or unknown objects
◼ Estimate accuracy of the model
◼ Note: If the test set is used to select models, it is called validation (test) set
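To make the two steps concrete, here is a minimal sketch using scikit-learn (an assumption; the lecture does not prescribe a library). The encoded toy rows stand in for the buys_computer table used later in the lecture.

```python
# Minimal sketch of the two-step process:
# (1) model construction on a training set, (2) model usage on held-out/unseen data.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy stand-in for the buys_computer table: each row is
# (age_code, income_code, student, credit_rating); label is buys_computer.
X = [[0, 2, 0, 0], [0, 2, 0, 1], [1, 2, 0, 0], [2, 1, 0, 0],
     [2, 0, 1, 0], [2, 0, 1, 1], [1, 0, 1, 1], [0, 1, 0, 0],
     [0, 0, 1, 0], [2, 1, 1, 0], [0, 1, 1, 1], [1, 1, 0, 1],
     [1, 2, 1, 0], [2, 1, 0, 1]]
y = ["no", "no", "yes", "yes", "yes", "no", "yes",
     "no", "yes", "yes", "yes", "yes", "yes", "no"]

# Step 1: model construction on the training split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on the independent test split,
# then classify an unseen tuple.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("unseen tuple ->", model.predict([[0, 1, 1, 0]])[0])
```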
Process (1): Model Construction
◼ Figure: the Training Data is fed into a Classification Algorithm, which produces
the Classifier (learned model)
◼ Process (2): the Classifier is then applied to Testing Data to estimate accuracy,
and afterwards to unseen data, e.g., (Jeff, Professor, 4) → Tenured?
◼ Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Chapter 4. Classification: Basic Concepts
Algorithm for Decision Tree Induction
◼ Basic algorithm (a greedy algorithm)
◼ Tree is constructed in a top-down recursive divide-and-
conquer manner
◼ At start, all the training examples are at the root
◼ Attributes are categorical (if continuous-valued, they are
discretized in advance)
◼ Examples are partitioned recursively based on selected
attributes
◼ Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
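A compact sketch of this greedy procedure, assuming the information-gain measure introduced on the following slides; the helper names (entropy, best_attribute, build_tree) are illustrative, not from the lecture.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum p_i log2 p_i over the class distribution of D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attrs):
    """Greedy choice: pick the attribute with the highest information gain."""
    base = entropy(labels)
    def gain(a):
        info_a = 0.0
        for v in set(r[a] for r in rows):
            part = [lab for r, lab in zip(rows, labels) if r[a] == v]
            info_a += len(part) / len(labels) * entropy(part)
        return base - info_a
    return max(attrs, key=gain)

def build_tree(rows, labels, attrs):
    """Top-down recursive divide-and-conquer induction.
    rows: list of dicts mapping attribute name -> categorical value."""
    if len(set(labels)) == 1:          # all examples in one class -> leaf
        return labels[0]
    if not attrs:                      # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attrs)
    node = {}
    for v in set(r[a] for r in rows):  # partition examples on the chosen attribute
        idx = [i for i, r in enumerate(rows) if r[a] == v]
        node[(a, v)] = build_tree([rows[i] for i in idx],
                                  [labels[i] for i in idx],
                                  [x for x in attrs if x != a])
    return node
```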
Brief Review of Entropy
◼ Entropy measures the impurity of a set of class labels; for m = 2 classes it is
maximal (1 bit) when the two classes are equally frequent
◼ Example: training data set Buys_computer (14 tuples, 9 "yes" and 5 "no")
◼ Find the entropy of the class distribution, Info(D), and the entropy of
feature age, H(age)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
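A quick numeric check of this value (a sketch; the counts 9 and 5 are the "yes"/"no" tuples above):

```python
import math

def info(counts):
    """Expected information Info(D) = -sum p_i log2 p_i for the class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

print(round(info([9, 5]), 3))   # -> 0.94
```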
Attribute Selection Measure:
Information Gain (ID3/C4.5)
◼ Select the attribute with the highest information gain
◼ Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
◼ Ci, D is the set of tuples of class Ci in D
◼ |Ci,D| denotes the number of tuples in Ci,D
◼ Expected information (entropy) needed to classify a tuple in D:

Info(D) = − Σ_{i=1}^{m} pi log2(pi)

◼ Information needed (after using A to split D into v partitions) to classify D:

InfoA(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)

◼ Information gained by branching on attribute A:

Gain(A) = Info(D) − InfoA(D)
◼ Because age has the highest information gain among the attributes, it is selected
as the splitting attribute (a worked computation of the gains is sketched below).
◼ The tuples are then partitioned accordingly, as shown in the figure on the next
slide.
◼ Notice that the tuples falling into the partition for age = middle aged all belong
to the same class.
◼ Because they all belong to class "yes," a leaf is therefore created at the end of
this branch and labeled "yes."
◼ The final decision tree returned by the algorithm was shown earlier.
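The sketch below recomputes the information gain of each attribute from the 14-tuple table; the tuple encoding and helper names are illustrative.

```python
import math
from collections import Counter

# The 14 training tuples: (age, income, student, credit_rating, class)
data = [
    ("<=30", "high", "no", "fair", "no"),        ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),        (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),       (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]
attrs = ["age", "income", "student", "credit_rating"]

def info(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(col):
    labels = [row[-1] for row in data]
    info_a = 0.0
    for v in set(row[col] for row in data):
        part = [row[-1] for row in data if row[col] == v]
        info_a += len(part) / len(data) * info(part)
    return info(labels) - info_a

for i, a in enumerate(attrs):
    print(a, round(gain(i), 3))   # age has the largest gain (about 0.246)
```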
Attribute Selection: Information Gain
Figure: The attribute age becomes the splitting attribute at the root node of the
decision tree
Attribute Selection: Information Gain
◼ Figure: Each leaf node represents a class (either buys computer = yes or buys
computer = no).
◼ Notice that because the tuples falling into the partition for age = middle aged all
belong to the same class, "yes," a leaf was created at the end of this branch and
labeled "yes."
Computing Information-Gain for
Continuous-Valued Attributes
◼ Let attribute A be a continuous-valued attribute
◼ For example, suppose that instead of the discretized version of age from the
example, we have the raw values for this attribute.
◼ Must determine the best split point for A
◼ Sort the values of A in increasing order
◼ Typically, the midpoint between each pair of adjacent values is
considered as a possible split point
◼ (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
◼ For each possible split-point for A, we evaluate InfoA(D),
◼ where the number of partitions is two, that is, v = 2 (or j = 1, 2)
◼ The point with the minimum expected information requirement for A is
selected as the split-point for A
◼ Split:
◼ D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D
satisfying A > split-point
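A sketch of the midpoint search for a continuous attribute; the raw age values below are made-up stand-ins for the discretized version used in the example.

```python
import math
from collections import Counter

def info(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Try the midpoint of each pair of adjacent sorted values; keep the one
    with the minimum expected information requirement Info_A(D) (v = 2)."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue
        mid = (a + b) / 2
        left = [lab for v, lab in pairs if v <= mid]
        right = [lab for v, lab in pairs if v > mid]
        info_a = (len(left) * info(left) + len(right) * info(right)) / len(pairs)
        best = min(best, (info_a, mid))
    return best[1]

ages = [23, 26, 29, 33, 35, 38, 41, 45, 46, 52]
buys = ["no", "no", "yes", "yes", "yes", "yes", "yes", "no", "no", "no"]
print(best_split_point(ages, buys))
```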
Gain Ratio for Attribute Selection (C4.5)
◼ Information gain measure is biased towards attributes with a large
number of values
◼ C4.5 (a successor of ID3) uses gain ratio to overcome the problem
(normalization to information gain) using a “split information” value
defined analogously with Info(D) as
SplitInfoA(D) = − Σ_{j=1}^{v} (|Dj| / |D|) × log2(|Dj| / |D|)

◼ GainRatio(A) = Gain(A) / SplitInfoA(D)
◼ The attribute with the maximum gain ratio is selected as the splitting
attribute
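A sketch of the gain-ratio computation for income on the same table (the values are derived from the data, not quoted from the slide):

```python
import math
from collections import Counter

# Income values and class labels from the 14-tuple buys_computer table.
income = ["high", "high", "high", "medium", "low", "low", "low",
          "medium", "low", "medium", "medium", "medium", "high", "medium"]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

def info(ls):
    n = len(ls)
    return -sum(c / n * math.log2(c / n) for c in Counter(ls).values())

n = len(labels)
info_d = info(labels)
info_a = sum(income.count(v) / n * info([l for x, l in zip(income, labels) if x == v])
             for v in set(income))
split_info = -sum(income.count(v) / n * math.log2(income.count(v) / n)
                  for v in set(income))
gain = info_d - info_a
print(round(gain, 3), round(split_info, 3), round(gain / split_info, 3))
```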
Gini Index (CART, IBM IntelligentMiner)
◼ If a data set D contains examples from n classes, gini index,
gini(D) is defined as

gini(D) = 1 − Σ_{j=1}^{n} pj²
where pj is the relative frequency of class j in D
◼ If a data set D is split on A into two subsets D1 and D2, the gini
index gini(D) is defined as
giniA(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
◼ Reduction in Impurity:
Δgini(A) = gini(D) − giniA(D)
◼ The attribute that provides the smallest ginisplit(D) (or the
largest reduction in impurity) is chosen to split the node (need
to enumerate all the possible splitting points for each attribute)
Computation of Gini Index
◼ Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no" (the training data is
the buys_computer table shown earlier):

gini(D) = 1 − (9/14)² − (5/14)² = 0.459

◼ Suppose the attribute income partitions D into 10 tuples in D1: {low, medium}
and 4 tuples in D2: {high}:

gini_income∈{low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2) = 0.443

◼ Gini{low,high} is 0.458 and Gini{medium,high} is 0.450; thus, split on
{low, medium} (and {high}) since it has the lowest Gini index
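The sketch below recomputes these Gini values from the table; only the candidate groupings of income are supplied by hand.

```python
# Gini index for the income splits on the buys_computer table (a sketch).
income = ["high", "high", "high", "medium", "low", "low", "low",
          "medium", "low", "medium", "medium", "medium", "high", "medium"]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

def gini(ls):
    n = len(ls)
    return 1 - sum((ls.count(c) / n) ** 2 for c in set(ls))

def gini_split(group):
    """Binary split of income into `group` vs. the rest; weighted Gini index."""
    d1 = [l for x, l in zip(income, labels) if x in group]
    d2 = [l for x, l in zip(income, labels) if x not in group]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

print(round(gini(labels), 3))                       # gini(D)
for g in [{"low", "medium"}, {"low", "high"}, {"medium", "high"}]:
    print(sorted(g), round(gini_split(g), 3))
```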
Comparing Attribute Selection Measures
◼ Information gain: biased towards multivalued attributes
◼ Gain ratio: tends to prefer unbalanced splits in which one partition is much
smaller than the others
◼ Gini index: biased towards multivalued attributes; has difficulty when the
number of classes is large
Scalability Framework for RainForest
Rainforest: Training Set and Its AVC Sets
Prediction Based on Bayes’ Theorem
◼ Given training data X, the posteriori probability of a hypothesis H,
P(H|X), follows Bayes' theorem:

P(H|X) = P(X|H) P(H) / P(X)
Classification Is to Derive the Maximum Posteriori
◼ Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute vector
X = (x1, x2, …, xn)
◼ Suppose there are m classes C1, C2, …, Cm.
◼ Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
◼ This can be derived from Bayes’ theorem
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
◼ Since P(X) is constant for all classes, only
P(Ci|X) ∝ P(X|Ci) P(Ci)
needs to be maximized
Naïve Bayes Classifier
◼ A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):

P(X|Ci) = Π_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
◼ This greatly reduces the computation cost: Only counts the
class distribution
◼ If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
◼ If Ak is continuous-valued, P(xk|Ci) is usually computed based on a
Gaussian distribution with mean μ and standard deviation σ:

g(x, μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))

and P(xk|Ci) is

P(xk|Ci) = g(xk, μCi, σCi)
Naïve Bayes Classifier: Training Dataset
◼ Class: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
◼ Training data: the 14-tuple buys_computer table shown earlier
◼ Data to be classified:
X = (age <= 30, Income = medium, Student = yes, Credit_rating = Fair)
Naïve Bayes Classifier: An Example
◼ Using the training data above, compute the prior P(Ci) of each class and the
conditional probabilities P(xk|Ci) for each attribute value in X
◼ Compare P(X|C1)P(C1) with P(X|C2)P(C2); the larger product gives the prediction,
here buys_computer = 'yes' (a worked sketch follows below)
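A minimal sketch of the computation: it re-derives P(Ci) and P(xk|Ci) from the training table and multiplies them under the independence assumption (no zero-probability smoothing is applied).

```python
from collections import Counter

# buys_computer training tuples: (age, income, student, credit_rating, class)
data = [
    ("<=30", "high", "no", "fair", "no"),        ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),        (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),       (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]
X = ("<=30", "medium", "yes", "fair")

class_counts = Counter(row[-1] for row in data)
scores = {}
for c, n_c in class_counts.items():
    rows_c = [row for row in data if row[-1] == c]
    p = n_c / len(data)                                    # prior P(Ci)
    for k, x_k in enumerate(X):                            # conditional independence:
        p *= sum(1 for row in rows_c if row[k] == x_k) / n_c   # multiply by P(xk|Ci)
    scores[c] = p

print(scores)
print("predicted:", max(scores, key=scores.get))
```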
Naïve Bayes Classifier: Comments
◼ Disadvantages
◼ Assumption: class conditional independence, therefore loss
of accuracy
◼ Practically, dependencies exist among variables; such dependencies
cannot be modeled by a Naïve Bayes classifier
◼ How to deal with these dependencies? Bayesian Belief Networks
Chapter 4. Classification: Basic Concepts
Rule Extraction from a Decision Tree
◼ One rule is created for each path from the root to a leaf
◼ Each attribute-value pair along a path forms a conjunction; the leaf holds the
class prediction
◼ Rules are mutually exclusive and exhaustive
◼ Figure: the buys_computer decision tree, with age at the root and student? and
credit rating? tests on the branches leading to yes/no leaves
◼ Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
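A sketch of the path-to-rule conversion on a nested-dict representation of the tree above (the data structure is illustrative, not the lecture's):

```python
# Each internal node maps (attribute, value) branches to subtrees; leaves hold the class.
tree = {("age", "young"): {("student", "no"): "no", ("student", "yes"): "yes"},
        ("age", "mid-age"): "yes",
        ("age", "old"): {("credit_rating", "excellent"): "no",
                         ("credit_rating", "fair"): "yes"}}

def extract_rules(node, conditions=()):
    if not isinstance(node, dict):                      # leaf: holds the class prediction
        conj = " AND ".join(f"{a} = {v}" for a, v in conditions)
        return [f"IF {conj} THEN buys_computer = {node}"]
    rules = []
    for (attr, value), child in node.items():           # each pair extends the conjunction
        rules += extract_rules(child, conditions + ((attr, value),))
    return rules

for r in extract_rules(tree):
    print(r)
```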
Chapter 4. Classification: Basic Concepts
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
◼ Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive:
Precision = TP / (TP + FP)
◼ Recall: completeness – what % of positive tuples the classifier
labeled as positive:
Recall = TP / (TP + FN)
◼ F-measure (F1): harmonic mean of precision and recall:
F = 2 × Precision × Recall / (Precision + Recall)
Classifier Evaluation Metrics: Example
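A sketch with a hypothetical confusion matrix (the counts below are illustrative, not taken from the lecture):

```python
# Hypothetical confusion matrix for a binary classifier.
TP, FN = 90, 210      # actual positives: predicted yes / predicted no
FP, TN = 140, 9560    # actual negatives: predicted yes / predicted no

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + FN + FP + TN)

print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f} accuracy={accuracy:.3f}")
```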
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
◼ Holdout method
◼ Given data is randomly partitioned into two independent sets: a training
set (e.g., 2/3) for model construction and a test set (e.g., 1/3) for
accuracy estimation
◼ Cross-validation (k-fold, where k = 10 is most popular)
◼ Randomly partition the data into k mutually exclusive subsets, each of
approximately equal size
◼ At the i-th iteration, use Di as the test set and the remaining subsets as
the training set (a sketch of the partitioning follows below)
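A minimal sketch of k-fold partitioning and evaluation (the fold-assignment helper and the trivial majority-class learner are illustrative; it assumes at least k tuples):

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Randomly partition indices 0..n-1 into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(fit, predict, X, y, k=10):
    """At iteration i, fold i is the test set and the rest is the training set;
    returns the per-fold error rates."""
    errors = []
    for test_idx in k_fold_indices(len(X), k):
        train_idx = [i for i in range(len(X)) if i not in set(test_idx)]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        wrong = sum(predict(model, X[i]) != y[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return errors

# Example usage with a trivial majority-class learner (illustrative only).
fit = lambda Xtr, ytr: max(set(ytr), key=ytr.count)
predict = lambda model, x: model
X = [[i] for i in range(20)]
y = ["yes"] * 12 + ["no"] * 8
print(cross_validate(fit, predict, X, y, k=10))
```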
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
◼ Suppose we have 2 classifiers, M1 and M2, which one is better?
◼ These mean error rates are just estimates of error on the true
population of future data cases
Estimating Confidence Intervals:
Null Hypothesis
◼ Perform 10-fold cross-validation
◼ Assume samples follow a t distribution with k–1 degrees of
freedom (here, k=10)
◼ Use t-test (or Student’s t-test)
◼ Null Hypothesis: M1 & M2 are the same
◼ If we can reject null hypothesis, then
◼ we conclude that the difference between M1 & M2 is
statistically significant
◼ Choose the model with the lower error rate
Estimating Confidence Intervals: t-test
◼ If only one test set is available: pairwise comparison. For the i-th round of
10-fold cross-validation, the same partitioning is used to obtain err(M1)i and
err(M2)i; average over the 10 rounds to get mean_err(M1) and mean_err(M2)
◼ The t-statistic (with k−1 degrees of freedom) is

t = [mean_err(M1) − mean_err(M2)] / √(var(M1 − M2) / k), where
var(M1 − M2) = (1/k) Σ_{i=1}^{k} [err(M1)i − err(M2)i − (mean_err(M1) − mean_err(M2))]²

◼ If two test sets are available: use a nonpaired t-test, where k1 & k2 are # of
cross-validation samples used for M1 & M2, resp.
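A sketch of the paired t-statistic computed directly from this formula; the per-fold error rates are made up for illustration.

```python
import math

# Per-fold error rates of M1 and M2 from the same 10-fold cross-validation
# (made-up numbers for illustration).
err1 = [0.12, 0.15, 0.10, 0.14, 0.11, 0.13, 0.16, 0.12, 0.14, 0.13]
err2 = [0.15, 0.16, 0.14, 0.15, 0.14, 0.17, 0.18, 0.15, 0.16, 0.15]

k = len(err1)
d = [a - b for a, b in zip(err1, err2)]          # per-fold differences
mean_d = sum(d) / k
var_d = sum((x - mean_d) ** 2 for x in d) / k    # variance of the differences
t = mean_d / math.sqrt(var_d / k)                # t-statistic with k-1 = 9 d.f.

print(round(t, 3))   # compare |t| against the t-table value for sig/2 and 9 d.f.
```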
Estimating Confidence Intervals:
Table for t-distribution
◼ Symmetric
◼ Significance level,
e.g., sig = 0.05 or
5% means M1 & M2
are significantly
different for 95% of
population
◼ Confidence limit, z
= sig/2
Estimating Confidence Intervals:
Statistical Significance
◼ Are M1 & M2 significantly different?
◼ Compute t. Select significance level (e.g., sig = 5%)
◼ Consult the t-distribution table for the value corresponding to k−1 degrees of
freedom (here, 9)
◼ If t lies in the rejection region, reject the null hypothesis that the mean
error rates of M1 & M2 are the same
◼ Conclude: statistically significant difference between M1
& M2
◼ Otherwise, conclude that any difference is chance
Model Selection: ROC Curves
◼ ROC (Receiver Operating Characteristics) curves: for visual comparison of
classification models
◼ Originated from signal detection theory
◼ Shows the trade-off between the true positive rate and the false positive rate
◼ The area under the ROC curve is a measure of the accuracy of the model
◼ Rank the test tuples in decreasing order: the one that is most likely to belong
to the positive class appears at the top of the list
◼ The closer to the diagonal line (i.e., the closer the area is to 0.5), the less
accurate is the model
◼ Figure: the vertical axis represents the true positive rate and the horizontal
axis the false positive rate; the plot also shows a diagonal line; a model with
perfect accuracy has an area of 1.0
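A sketch that builds the ROC points by ranking tuples on a predicted score and then estimates the area under the curve; the scores and labels are made up for illustration.

```python
# Build ROC points by sweeping the ranked tuples from most to least likely positive.
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.55, 0.50, 0.40, 0.30]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]   # 1 = positive class

P = sum(labels)
N = len(labels) - P
points, tp, fp = [(0.0, 0.0)], 0, 0
for _, y in sorted(zip(scores, labels), reverse=True):   # decreasing score order
    if y == 1:
        tp += 1          # true positive -> the curve moves up
    else:
        fp += 1          # false positive -> the curve moves right
    points.append((fp / N, tp / P))

# Area under the curve via the trapezoidal rule.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(points)
print("AUC =", round(auc, 3))
```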
Issues Affecting Model Selection
◼ Accuracy
◼ classifier accuracy: predicting class label
◼ Speed
◼ time to construct the model (training time)
◼ time to use the model (classification/prediction time)
◼ Robustness: handling noise and missing values
◼ Scalability: efficiency in disk-resident databases
◼ Interpretability
◼ understanding and insight provided by the model
◼ Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Chapter 4. Classification: Basic Concepts
◼ Ensemble methods
◼ Use a combination of models to increase accuracy
◼ Popular ensemble methods:
◼ Bagging: averaging the prediction over a collection of classifiers
◼ Boosting: weighted vote with a collection of classifiers
Bagging: Boostrap Aggregation
◼ Analogy: Diagnosis based on multiple doctors’ majority vote
◼ Training
◼ Given a set D of d tuples, at each iteration i, a training set Di of d tuples
is sampled with replacement from D (i.e., a bootstrap sample), and a classifier
model Mi is learned from Di
◼ The bagged classifier M* counts the votes and assigns the class with the
most votes to X
◼ Prediction: can be applied to the prediction of continuous values by taking
the average value of each prediction for a given test tuple
◼ Accuracy
◼ Often significantly better than a single classifier derived from D
Random Forest
◼ A random forest is a collection of decision trees; during classification, each
tree votes and the most popular class is returned (see the sketch below)
◼ Two Methods to construct Random Forest:
◼ Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology
is used to grow the trees to maximum size
◼ Forest-RC (random linear combinations): Creates new attributes (or
features) that are linear combinations of the existing attributes
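A minimal sketch comparing a single tree, bagging, and a Forest-RI-style random forest with scikit-learn (assumed available); the synthetic data is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)
forest = RandomForestClassifier(n_estimators=25, max_features="sqrt",  # F random attributes per node
                                random_state=0)

for name, clf in [("single tree", single), ("bagging", bagged), ("random forest", forest)]:
    print(name, cross_val_score(clf, X, y, cv=10).mean())
```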
Summary (II)
◼ Significance tests and ROC curves are useful for model selection.
◼ There have been numerous comparisons of the different
classification methods; the matter remains a research topic
◼ No single method has been found to be superior over all others
for all data sets
◼ Issues such as accuracy, training time, robustness, scalability,
and interpretability must be considered and can involve trade-
offs, further complicating the quest for an overall superior
method
References (1)
◼ C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997
◼ C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press,
1995
◼ L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984
◼ C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data
Mining and Knowledge Discovery, 2(2): 121-168, 1998
◼ P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data
for scaling machine learning. KDD'95
◼ H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern Analysis for
Effective Classification, ICDE'07
◼ H. Cheng, X. Yan, J. Han, and P. S. Yu, Direct Discriminative Pattern Mining for
Effective Classification, ICDE'08
◼ W. Cohen. Fast effective rule induction. ICML'95
◼ G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for
gene expression data. SIGMOD'05
References (2)
◼ A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990.
◼ G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and
differences. KDD'99.
◼ R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001
◼ U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI’94.
◼ Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. J. Computer and System Sciences, 1997.
◼ J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. VLDB’98.
◼ J. Gehrke, V. Gant, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree
Construction. SIGMOD'99.
◼ T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer-Verlag, 2001.
◼ D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The
combination of knowledge and statistical data. Machine Learning, 1995.
◼ W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple
Class-Association Rules, ICDM'01.
References (3)
◼ T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity,
and training time of thirty-three old and new classification algorithms. Machine
Learning, 2000.
◼ J. Magidson. The Chaid approach to segmentation modeling: Chi-squared
automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of
Marketing Research, Blackwell Business, 1994.
◼ M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data
mining. EDBT'96.
◼ T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
◼ S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-
Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998
◼ J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
◼ J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93.
◼ J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
◼ J. R. Quinlan. Bagging, boosting, and C4.5. AAAI'96.
References (4)
◼ R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and
pruning. VLDB’98.
◼ J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data
mining. VLDB’96.
◼ J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann,
1990.
◼ P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley,
2005.
◼ S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert
Systems. Morgan Kaufmann, 1991.
◼ S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
◼ I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques, 2ed. Morgan Kaufmann, 2005.
◼ X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03
◼ H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical
clusters. KDD'03.
Issues: Evaluating Classification Methods
◼ Accuracy
◼ classifier accuracy: predicting class label
◼ Speed
◼ time to construct the model (training time)
Predictor Error Measures
◼ Measure predictor accuracy: measure how far off the predicted value is from
the actual known value
◼ Loss function: measures the error between yi and the predicted value yi'
◼ Absolute error: | yi – yi’|
◼ Squared error: (yi – yi’)2
◼ Test error (generalization error): the average loss over the test set
◼ Mean absolute error: (1/d) Σ_{i=1}^{d} |yi − yi'|
◼ Mean squared error: (1/d) Σ_{i=1}^{d} (yi − yi')²
◼ Relative absolute error: Σ_{i=1}^{d} |yi − yi'| / Σ_{i=1}^{d} |yi − ȳ|
◼ Relative squared error: Σ_{i=1}^{d} (yi − yi')² / Σ_{i=1}^{d} (yi − ȳ)²
(where ȳ is the mean of the actual values yi)
◼ The mean squared error exaggerates the presence of outliers
◼ Popularly used: the (square) root mean squared error and, similarly, the root
relative squared error
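A sketch that evaluates these measures on made-up predictions:

```python
import math

# Actual vs. predicted values for a numeric predictor (made-up numbers).
y_true = [3.0, 5.5, 7.2, 4.8, 6.1]
y_pred = [2.6, 5.9, 6.8, 5.4, 6.0]

d = len(y_true)
y_bar = sum(y_true) / d

mae = sum(abs(a - p) for a, p in zip(y_true, y_pred)) / d
mse = sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / d
rae = sum(abs(a - p) for a, p in zip(y_true, y_pred)) / sum(abs(a - y_bar) for a in y_true)
rse = sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / sum((a - y_bar) ** 2 for a in y_true)

print(f"MAE={mae:.3f}  RMSE={math.sqrt(mse):.3f}  RAE={rae:.3f}  RRSE={math.sqrt(rse):.3f}")
```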
Scalable Decision Tree Induction Methods
◼ PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning;
stops growing the tree earlier
◼ RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
◼ Builds an AVC-list (attribute, value, class label)
Data Cube-Based Decision-Tree Induction
◼ Integration of generalization with decision-tree induction
(Kamber et al.’97)
◼ Classification at primitive concept levels
◼ E.g., precise temperature, humidity, outlook, etc.
◼ Low-level concepts, scattered classes, bushy classification-
trees
◼ Semantic interpretation problems
◼ Cube-based multi-level classification
◼ Relevance analysis at multi-levels
◼ Information-gain analysis with dimension + level