Decision Tree Induction
Non-metric Methods
• Numerical attributes
– Nearest neighbour: based on a distance measure
– Neural networks: two similar inputs lead to
similar outputs
– SVMs: based on dot products
Non-metric data
• Nominal attributes
• Color, taste
• Strings: DNA
• Probability-based
• Rule-based
– Decision trees
Decision Tree
• Rules in the form of a hierarchy.
• Why are decision trees so popular?
Definition of Decision Tree
Definition 9.1: Decision Tree
A decision tree is a tree in which each internal (non-leaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each leaf (terminal) node holds a class label. The topmost node is the root.
We need to work with a training set
[Figure: Training Data is fed to the decision tree induction process, which outputs a Classifier (Decision Tree).]
Output: A Decision Tree for “buys_computer”
age?
  <=30   → student?  (no: buys_computer = no;  yes: buys_computer = yes)
  31..40 → buys_computer = yes
  >40    → credit_rating?  (excellent: buys_computer = no;  fair: buys_computer = yes)
• What criteria should be used for choosing a splitting attribute?
• You can achieve 100% accuracy on the training set, but at what cost?
– Overfitting
• When do you stop growing the tree?
• Are there different types of DT induction methods? Yes: ID3, C4.5 and CART.
Decision tree induction
• These algorithms adopt a greedy (i.e., no backtracking),
top-down, recursive divide-and-conquer approach.
• Node 🡪 subset of training patterns
• Root 🡪 training set.
• Leaf 🡪 class label.
Impurity measures
• Entropy impurity (information impurity)
• Gini impurity (variance impurity)
• Misclassification impurity
For a two-category case
Note: there is a mistake in the original slide. For a two-class problem, entropy impurity has its maximum value of 1 (when the two classes are equally likely), whereas the Gini and misclassification impurities have a maximum value of 0.5.
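To make the comparison concrete, here is a minimal Python sketch (my own, not from the slides) that evaluates the three impurity measures for a two-class node with class-1 probability p; it confirms the maxima quoted above (entropy peaks at 1 bit, Gini and misclassification at 0.5, all at p = 0.5).

```python
import math

def entropy_impurity(p):
    """Entropy impurity: -sum p_i log2 p_i for a two-class node."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def gini_impurity(p):
    """Gini (variance) impurity: 1 - sum p_i^2 for a two-class node."""
    return 1.0 - (p ** 2 + (1 - p) ** 2)

def misclassification_impurity(p):
    """Misclassification impurity: 1 - max_i p_i for a two-class node."""
    return 1.0 - max(p, 1 - p)

if __name__ == "__main__":
    for p in (0.0, 0.1, 0.3, 0.5, 0.7, 1.0):
        print(f"p={p:.1f}  entropy={entropy_impurity(p):.3f}  "
              f"gini={gini_impurity(p):.3f}  "
              f"misclass={misclassification_impurity(p):.3f}")
```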
Which test?
• Choose the test that reduces the impurity the most.
– Try to reach pure nodes as quickly as possible.
Information gain
• Info(D) = − Σi pi log2(pi), where pi is the proportion of tuples in D belonging to class i.
• InfoA(D) = Σj (|Dj| / |D|) · Info(Dj), the expected information after splitting D on attribute A into partitions D1, …, Dv.
• Gain(A) = Info(D) − InfoA(D).
Gain(age) = ?
The class distribution in the full training set is (yes, no) = (9, 5).
For the other attributes, the gains turn out to be smaller.
• So we choose age as the splitting attribute.
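As a concrete check, the sketch below (mine, not from the slides) computes Gain(age) for the classic buys_computer data, using the well-known partition counts for age: <=30 has (2 yes, 3 no), 31..40 has (4 yes, 0 no), >40 has (3 yes, 2 no). It yields Info(D) ≈ 0.940, Info_age(D) ≈ 0.694 and Gain(age) ≈ 0.246.

```python
import math

def info(counts):
    """Entropy of a class-count distribution, in bits."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# Class counts (yes, no) in the whole set and in each age partition.
whole = (9, 5)
age_partitions = {"<=30": (2, 3), "31..40": (4, 0), ">40": (3, 2)}

info_d = info(whole)
n = sum(whole)
info_age = sum(sum(part) / n * info(part) for part in age_partitions.values())

print(f"Info(D)     = {info_d:.3f}")              # ~0.940
print(f"Info_age(D) = {info_age:.3f}")            # ~0.694
print(f"Gain(age)   = {info_d - info_age:.3f}")   # ~0.246
```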
• Similarly one can use other impurity measures
Gini Index (IBM IntelligentMiner)
• If a data set T contains examples from n classes, the gini index gini(T) is defined as
  gini(T) = 1 − Σj pj², where pj is the relative frequency of class j in T.
• If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data, ginisplit(T), is defined as
  ginisplit(T) = (N1/N) · gini(T1) + (N2/N) · gini(T2), where N = N1 + N2.
• The attribute that provides the smallest ginisplit(T) is chosen to split the node (all possible splitting points for each attribute need to be enumerated).
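A minimal Python sketch of these two formulas (my own illustration, not the IntelligentMiner implementation):

```python
def gini(labels):
    """gini(T) = 1 - sum_j p_j^2 over the class labels in T."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_split(left_labels, right_labels):
    """Weighted gini of a binary split T -> (T1, T2)."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n * gini(left_labels)
            + len(right_labels) / n * gini(right_labels))

# Example: a node with 9 'yes' and 5 'no' split into (6 yes, 1 no) and (3 yes, 4 no).
parent = ["yes"] * 9 + ["no"] * 5
t1, t2 = ["yes"] * 6 + ["no"], ["yes"] * 3 + ["no"] * 4
print(gini(parent), gini_split(t1, t2))
```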
• But, there is one drawback with this approach!
• A split with a large branching factor is often preferred by information gain.
– For example, an attribute such as telephone number, which takes a distinct value for every tuple, would be chosen.
So, we penalize large branching factors
• This is the idea behind the gain ratio (commonly used together with information gain, as in C4.5).
• The larger the branching factor, the larger the denominator (the split information), and hence the smaller the gain ratio.
Building Decision Tree
• In principle, exponentially many decision trees can be constructed from a given database (also called training data).
– Some of the trees may not be optimal.
– Some of them may give inaccurate results.
• How is a decision tree built?
– Greedy strategy
• A top-down recursive divide-and-conquer
– Modification of greedy strategy
• ID3
• C4.5
• CART, etc.
BuildDT Algorithm
• Algorithm BuildDT
• Input: D : training data set
• Output: T : decision tree
Steps
1. If all tuples in D belong to the same class Cj
      add a leaf node labeled as Cj
      return            // termination condition
2. Select an attribute Ai (one not already selected on the same branch)
3. Partition D = {D1, D2, …, Dp} based on the p different values of Ai in D
4. For each Dk ∈ D
      create a node and add an edge between D and Dk labeled with Ai’s attribute value in Dk
5. For each Dk ∈ D
      BuildDT(Dk)       // recursive call
6. Stop
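The following is a minimal Python sketch of BuildDT as described above (my own illustration, not code from the slides). It assumes the training data is a list of (attribute-dict, class-label) pairs and uses the simple "pick any unused attribute" rule from step 2 rather than an impurity-based choice; the majority-class tie-break for an empty attribute list is also my addition.

```python
def build_dt(data, attributes):
    """data: list of (dict of attribute -> value, class label) pairs."""
    labels = [label for _, label in data]
    # Step 1: all tuples belong to the same class -> leaf node.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # No attribute left on this branch -> majority-class leaf (not in the slides).
    if not attributes:
        return {"leaf": max(set(labels), key=labels.count)}
    # Step 2: select an attribute not already used on this branch.
    attr = attributes[0]
    remaining = attributes[1:]
    # Step 3: partition D on the distinct values of the selected attribute.
    partitions = {}
    for row, label in data:
        partitions.setdefault(row[attr], []).append((row, label))
    # Steps 4-5: create a child per partition and recurse.
    return {"attr": attr,
            "children": {v: build_dt(subset, remaining)
                         for v, subset in partitions.items()}}

# Tiny usage example with hypothetical attribute values.
data = [({"Gender": "M", "Height": "tall"}, "T"),
        ({"Gender": "F", "Height": "short"}, "S"),
        ({"Gender": "F", "Height": "tall"}, "M")]
print(build_dt(data, ["Gender", "Height"]))
```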
Node Splitting in BuildDT Algorithm
• The BuildDT algorithm must provide a method for expressing an attribute test
condition and the corresponding outcomes for different attribute types
• Case: Binary attribute
– This is the simplest case of node splitting
– The test condition for a binary attribute generates only two outcomes
Node Splitting in BuildDT Algorithm
• Case: Nominal attribute
– Since a nominal attribute can have many values, its test condition can be
expressed in two ways:
• A multi-way split
• A binary split
– Multi-way split: the outcome depends on the number of distinct values of the
corresponding attribute
– Binary splitting by grouping attribute values
Node Splitting in BuildDT Algorithm
• Case: Ordinal attribute
– It also can be expressed in two ways:
• A multi-way split
• A binary split
– Multi-way split: same as in the case of a nominal attribute
– Binary split: attribute values should be grouped while maintaining the order
property of the attribute values
Node Splitting in BuildDT Algorithm
• Case: Numerical attribute
– For a numeric attribute (with discrete or continuous values), a test condition can
be expressed as a comparison
• Binary outcome: A > v or A ≤ v
– In this case, decision tree induction must consider all possible split positions
• Range query: vi ≤ A < vi+1 for i = 1, 2, …, q (if q ranges are chosen)
– Here, q should be decided a priori
– For a numeric attribute, choosing the split is thus a combinatorial optimization
problem
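The sketch below (an illustration of mine, not code from the slides) shows one way a BuildDT-style implementation could represent these test conditions: a binary or nominal attribute gets a multi-way split on its distinct values, while an ordinal or numeric attribute gets binary tests A ≤ v versus A > v, with every observed value treated as a candidate threshold. The attribute names and type labels are assumptions for illustration.

```python
def candidate_tests(rows, attr, attr_type):
    """Return a list of (description, branch_function) pairs for one attribute.

    rows: list of dicts mapping attribute name -> value.
    attr_type: 'binary', 'nominal', 'ordinal' or 'numeric' (labels assumed here).
    """
    values = sorted({row[attr] for row in rows})
    tests = []
    if attr_type in ("binary", "nominal"):
        # Multi-way split: one branch per distinct value.
        tests.append((f"{attr} = ?", lambda row: row[attr]))
    if attr_type in ("ordinal", "numeric"):
        # Binary splits A <= v vs A > v, one candidate per observed value.
        for v in values[:-1]:
            tests.append((f"{attr} <= {v}",
                          lambda row, t=v: "yes" if row[attr] <= t else "no"))
    return tests

rows = [{"Height": 1.5}, {"Height": 1.8}, {"Height": 2.1}]
for desc, _ in candidate_tests(rows, "Height", "numeric"):
    print(desc)
```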
Illustration : BuildDT Algorithm
Example 9.4: Illustration of BuildDT Algorithm
– Consider a training data set as shown.
Attributes:
Gender = {Male(M), Female (F)} // Binary attribute
Height = {1.5, …, 2.5} // Continuous attribute
Class = {Short (S), Medium (M), Tall (T)}
Given a person, we want to determine the class to which s/he belongs
Illustration : BuildDT Algorithm
• To build a decision tree, we can select the attributes in two different orderings:
<Gender, Height> or <Height, Gender>
• Further, for each ordering, we can choose different ways of splitting
• Different instances are shown in the following.
• Approach 1 : <Gender, Height>
Illustration : BuildDT Algorithm
• Approach 2 : <Height, Gender>
Illustration : BuildDT Algorithm
Example 9.5: Illustration of BuildDT Algorithm
– Consider an anonymous database as shown.
• Is there any “clue” that enables us to select the “best” attribute first?
• Suppose, following are two
attempts:
• A1🡪A2🡪A3🡪A4 [naïve]
• A3🡪A2🡪A4🡪A1 [Random]
• Draw the decision trees in the
above-mentioned two cases.
• Do the two trees classify test data differently?
• If any other sample data is added into the
database, is that likely to alter the
decision tree already obtained?
Algorithm ID3
ID3: Decision Tree Induction Algorithms
• Quinlan [1986] introduced ID3 (short for Iterative Dichotomizer 3), a popular algorithm for inducing decision trees from a set of training data.
• In ID3, each node corresponds to a splitting attribute and each arc is a
possible value of that attribute.
• At each node, the splitting attribute is selected to be the most informative
among the attributes not yet considered in the path starting from the root.
Algorithm ID3
• In ID3, entropy is used to measure how informative a node is.
– It is observed that splitting on any attribute has the property that the average entropy of the resulting training subsets is less than or equal to that of the parent node’s training subset.
• ID3 algorithm defines a measurement of a splitting called Information
Gain to determine the goodness of a split.
– The attribute with the largest value of information gain is chosen as the
splitting attribute and
– it partitions the data into a number of smaller training sets based on the distinct values
of the attribute under split.
Entropy of a Training Set
Example 9.10: OPTH dataset
Consider the OPTH data shown in the following table, with 24 instances in total. Coded values are used for all attributes to avoid cluttering the table.

Age  Eye-sight  Astigmatic  Use Type  Class
 1      1          1           1        3
 1      1          1           2        2
 1      1          2           1        3
 1      1          2           2        1
 1      2          1           1        3
 1      2          1           2        2
 1      2          2           1        3
 1      2          2           2        1
 2      1          1           1        3
 2      1          1           2        2
 2      1          2           1        3
 2      1          2           2        1
 2      2          1           1        3
 2      2          1           2        2
 2      2          2           1        3
 2      2          2           2        3
 3      1          1           1        3
 3      1          1           2        3
 3      1          2           1        3
 3      1          2           2        1
 3      2          1           1        3
 3      2          1           2        2
 3      2          2           1        3
 3      2          2           2        3
Information Gain Calculation
• Consider first the subset of OPTH with Age = 1, shown below; its class distribution is (class 1, class 2, class 3) = (2, 2, 4), giving an entropy of 1.5 bits.
Age Eye-sight Astigmatism Use type Class
1 1 1 1 3
1 1 1 2 2
1 1 2 1 3
1 1 2 2 1
1 2 1 1 3
1 2 1 2 2
1 2 2 1 3
1 2 2 2 1
Calculating Information Gain
Age Eye-sight Astigmatism Use type Class
2 1 1 1 3
2 1 1 2 2
2 1 2 1 3
2 1 2 2 1
2 2 1 1 3
2 2 1 2 2
2 2 2 1 3
2 2 2 2 3
Calculating Information Gain
Age Eye-sight Astigmatism Use type Class
3 1 1 1 3
3 1 1 2 3
3 1 2 1 3
3 1 2 2 1
3 2 1 1 3
3 2 1 2 2
3 2 2 1 3
3 2 2 2 3
Information Gains for Different Attributes
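The computed gains from the original slide are not reproduced here, but the following Python sketch (my own) computes the information gain of each attribute directly from the OPTH table above; Use Type should come out with by far the largest gain, which is consistent with the split chosen on the next slide.

```python
import math
from collections import Counter

# The 24 OPTH tuples: (Age, Eye-sight, Astigmatic, Use Type, Class)
opth = [
    (1,1,1,1,3),(1,1,1,2,2),(1,1,2,1,3),(1,1,2,2,1),
    (1,2,1,1,3),(1,2,1,2,2),(1,2,2,1,3),(1,2,2,2,1),
    (2,1,1,1,3),(2,1,1,2,2),(2,1,2,1,3),(2,1,2,2,1),
    (2,2,1,1,3),(2,2,1,2,2),(2,2,2,1,3),(2,2,2,2,3),
    (3,1,1,1,3),(3,1,1,2,3),(3,1,2,1,3),(3,1,2,2,1),
    (3,2,1,1,3),(3,2,1,2,2),(3,2,2,1,3),(3,2,2,2,3),
]
attrs = {"Age": 0, "Eye-sight": 1, "Astigmatic": 2, "Use Type": 3}

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(rows, idx):
    n = len(rows)
    expected_info = 0.0
    for v in {r[idx] for r in rows}:
        subset = [r for r in rows if r[idx] == v]
        expected_info += len(subset) / n * entropy(subset)
    return entropy(rows) - expected_info

for name, idx in attrs.items():
    print(f"Gain({name}) = {gain(opth, idx):.3f}")
```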
Decision Tree Induction : ID3 Way
[Figure: among Age, Eye-sight, Astigmatic and Use Type, the ✔ marks Use Type as the attribute selected at the root. The split produces two subsets of 12 tuples each: for Use Type = 1 every tuple has class 3, so that branch becomes a pure leaf, while the Use Type = 2 subset is split further on Age, Eye-sight or Astigmatic.]
Splitting of Continuous Attribute Values
Algorithm CART
CART Algorithm
• CART (Classification and Regression Trees), due to Breiman, Friedman, Olshen and Stone [1984], builds strictly binary decision trees and uses the Gini index (rather than entropy) as its impurity measure.
Gini Index of Diversity
Definition 9.6: Gini Index
Given a data set D containing tuples from k classes, the Gini index of D is
Gini(D) = 1 − Σj pj², where pj is the relative frequency of class j in D.
Gini Index of Diversity
Definition 9.7: Gini Index of Diversity
For an attribute A that splits D into partitions D1, …, Dm, the Gini index of diversity is the reduction in impurity
γ(A, D) = Gini(D) − GiniA(D), where GiniA(D) = Σj (|Dj| / |D|) · Gini(Dj).
Gini Index of Diversity and CART
• CART selects the splitting attribute (and split point) that maximizes γ(A, D), i.e., that minimizes the weighted Gini index GiniA(D).
n-ary Attribute Values to Binary Splitting
• Case 1: Discrete-valued attributes. An attribute A with m distinct values is turned into a binary split by grouping the values into two non-empty subsets; in general 2^(m−1) − 1 distinct groupings have to be examined.
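A small Python sketch of this grouping step (my own illustration; the attribute and value names are hypothetical): it enumerates the candidate binary groupings of a discrete attribute and scores each with the weighted Gini index, so the best yes/no test can be picked.

```python
from itertools import combinations

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_grouping(rows, attr):
    """rows: list of (value_dict, class_label); returns (best subset, its weighted gini)."""
    values = sorted({r[attr] for r, _ in rows})
    first = values[0]
    best = None
    # The 2^(m-1) - 1 distinct groupings: every proper subset containing the first value.
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            if first not in subset:
                continue
            yes = [label for r, label in rows if r[attr] in subset]
            no = [label for r, label in rows if r[attr] not in subset]
            w = (len(yes) * gini(yes) + len(no) * gini(no)) / len(rows)
            if best is None or w < best[1]:
                best = (set(subset), w)
    return best

rows = [({"Age": "Y"}, "N"), ({"Age": "Y"}, "Y"), ({"Age": "M"}, "Y"),
        ({"Age": "O"}, "Y"), ({"Age": "O"}, "N")]
print(best_binary_grouping(rows, "Age"))
```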
n-ary Attribute Values to Binary Splitting
Case 2: Continuous-valued attributes
• For a continuous-valued attribute, each possible split point must be taken into account.
• The strategy is similar to the one followed in ID3 to calculate information gain for continuous-valued attributes.
• According to that strategy, the mid-point between each pair of adjacent values ai and ai+1, say vi = (ai + ai+1)/2, is taken as a candidate split point, giving the binary test A ≤ vi (yes) or A > vi (no).
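A minimal sketch (my own) of this midpoint strategy, choosing the threshold that minimizes the weighted Gini index; the attribute values used in the example are hypothetical.

```python
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_numeric_split(pairs):
    """pairs: list of (numeric value, class label); returns (threshold, weighted_gini)."""
    pairs = sorted(pairs)
    best = None
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue
        v = (a + b) / 2.0                      # candidate mid-point
        left = [lbl for x, lbl in pairs if x <= v]
        right = [lbl for x, lbl in pairs if x > v]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or w < best[1]:
            best = (v, w)
    return best

heights = [(1.5, "S"), (1.6, "S"), (1.7, "M"), (1.8, "M"), (1.9, "T"), (2.0, "T")]
print(best_numeric_split(heights))
```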
CART Algorithm : Illustration
Example 9.15 : CART Algorithm
Suppose we want to build a decision tree for the data set EMP given in the table below.
Tuple#  Age  Salary  Job  Performance  Select
  1      Y     H      P       A          N
  2      Y     H      P       E          N
  3      M     H      P       A          Y
  4      O     M      P       A          Y
  5      O     L      G       A          Y
  6      O     L      G       E          N
  7      M     L      G       E          Y
  8      Y     M      P       A          N
  9      Y     L      G       A          Y
 10      O     M      G       A          Y
 11      Y     M      G       E          Y
 12      M     M      P       E          Y
 13      M     H      G       A          Y
 14      O     M      P       E          N

Coding: Age (Y: young, M: middle-aged, O: old); Salary (L: low, M: medium, H: high); Job (G: government, P: private); Performance (A: average, E: excellent); Class Select (Y: yes, N: no).
CART Algorithm : Illustration
• The class distribution of EMP is (Y, N) = (9, 5), so Gini(EMP) = 1 − (9/14)² − (5/14)² ≈ 0.459.
CART Algorithm : Illustration
• Candidate binary split on Age: the grouping {O} (yes branch) versus {Y, M} (no branch).
CART Algorithm : Illustration
• Candidate binary split on Salary: the grouping {H} (yes branch) versus {L, M} (no branch).
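To illustrate how these candidate splits would be scored, here is a Python sketch (mine, not from the slides) that evaluates the weighted Gini index, and hence γ, of the two groupings shown above on the EMP data.

```python
emp = [  # (Age, Salary, Job, Performance, Select)
    ("Y","H","P","A","N"), ("Y","H","P","E","N"), ("M","H","P","A","Y"),
    ("O","M","P","A","Y"), ("O","L","G","A","Y"), ("O","L","G","E","N"),
    ("M","L","G","E","Y"), ("Y","M","P","A","N"), ("Y","L","G","A","Y"),
    ("O","M","G","A","Y"), ("Y","M","G","E","Y"), ("M","M","P","E","Y"),
    ("M","H","G","A","Y"), ("O","M","P","E","N"),
]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gamma(rows, attr_idx, yes_values):
    """Reduction in Gini for the binary grouping: attr value in yes_values vs the rest."""
    yes = [r[-1] for r in rows if r[attr_idx] in yes_values]
    no = [r[-1] for r in rows if r[attr_idx] not in yes_values]
    weighted = (len(yes) * gini(yes) + len(no) * gini(no)) / len(rows)
    return gini([r[-1] for r in rows]) - weighted

print("gamma(Age, {O} vs {Y,M})    =", round(gamma(emp, 0, {"O"}), 3))
print("gamma(Salary, {H} vs {L,M}) =", round(gamma(emp, 1, {"H"}), 3))
```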
Calculating γ using Frequency Table
Illustration: Calculating γ using Frequency Table
• Suppose a node contains 24 training tuples from three classes, and a candidate attribute takes the three values 1, 2 and 3. The class counts for each attribute value can be arranged in a frequency table:

            Value 1   Value 2   Value 3
Class 1        2         1         1
Class 2        2         2         1
Class 3        4         5         6
Column sum     8         8         8
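The Gini quantities can be computed directly from such a frequency table, without going back to the individual tuples; the sketch below (my own) does this for the table above.

```python
def gini_from_counts(counts):
    """Gini index from a list of class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# freq[i][j] = number of tuples of class i having the j-th attribute value.
freq = [
    [2, 1, 1],   # Class 1
    [2, 2, 1],   # Class 2
    [4, 5, 6],   # Class 3
]

col_sums = [sum(row[j] for row in freq) for j in range(len(freq[0]))]
row_sums = [sum(row) for row in freq]
n = sum(col_sums)

gini_d = gini_from_counts(row_sums)
gini_a = sum(col_sums[j] / n * gini_from_counts([row[j] for row in freq])
             for j in range(len(col_sums)))
print(f"Gini(D) = {gini_d:.3f}, Gini_A(D) = {gini_a:.3f}, gamma = {gini_d - gini_a:.3f}")
```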
Decision Trees with ID3 and CART Algorithms
Example 9.17: Comparing decision trees for the EMP data set
Compare the two decision trees obtained using ID3 and CART for the EMP data set. The decision tree according to ID3 is given below for reference (subject to verification).
Decision tree using ID3:
Age?
  Y → Job?          (P: N,  G: Y)
  M → Y
  O → Performance?  (A: Y,  E: N)

Decision tree using CART: ?  (to be constructed)
Algorithm C4.5
Algorithm C4.5 : Introduction
• Consider an attribute (say, a record identifier) that takes a distinct value for every tuple in the training set D. Splitting on it yields one pure, single-tuple partition per value, so InfoA(D) = 0 and Gain(A) is maximal, yet such a split is useless for classifying unseen records.
Algorithm: C 4.5 : Introduction
• Although the previous situation is an extreme case, we can intuitively infer that ID3 favours splitting attributes having a large number of values
– compared to other attributes, which have fewer distinct values.
• Such a partition appears to be useless for classification.
• This type of problem is called the overfitting problem.
Note:
Decision Tree Induction Algorithm ID3 may suffer from overfitting problem.
Algorithm: C 4.5 : Gain Ratio
Definition 9.8: Gain Ratio
The gain ratio of an attribute A with respect to a data set D is
GainRatio(A) = Gain(A) / SplitInfoA(D), where
SplitInfoA(D) = − Σj (|Dj| / |D|) · log2(|Dj| / |D|)
and D1, …, Dv are the partitions of D induced by the v distinct values of A.
• Example: Suppose an attribute A divides a set of 32 training tuples among its four distinct values. Consider the following possible frequency distributions over the four values:
– Distribution 1: Frequency 32 0 0 0
– Distribution 2: Frequency 16 16 0 0
– Distribution 3: Frequency 16 8 8 0
– Distribution 4: Frequency 16 8 4 4
– Distribution 5 (uniform distribution of attribute values): Frequency 8 8 8 8
• The corresponding split information values are 0, 1, 1.5, 1.75 and 2 bits respectively; the more uniformly the tuples are spread over the attribute values, the larger the split information.
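These values can be checked with a few lines of Python (my own sketch); the same helper also gives the gain ratio once the information gain is known.

```python
import math

def split_info(frequencies):
    """SplitInfo_A(D) = -sum (|Dj|/|D|) log2(|Dj|/|D|) over non-empty partitions."""
    n = sum(frequencies)
    return -sum(f / n * math.log2(f / n) for f in frequencies if f)

def gain_ratio(gain, frequencies):
    """Gain ratio = Gain / SplitInfo (undefined when the split information is 0)."""
    return gain / split_info(frequencies)

distributions = [(32, 0, 0, 0), (16, 16, 0, 0), (16, 8, 8, 0),
                 (16, 8, 4, 4), (8, 8, 8, 8)]
for d in distributions:
    print(d, "->", split_info(d))   # 0.0, 1.0, 1.5, 1.75, 2.0
```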
• Information gain signifies how much information will be gained on
partitioning the values of attribute A
– Higher information gain means splitting of A is more desirable.
• On the other hand, split information forms the denominator in the gain ratio formula.
– This implies that the higher the split information, the lower the gain ratio.
– In turn, this reduces the influence of attributes with many outcomes.
• Further, information gain tends to be large when there are many distinct attribute values.
– But with many distinct values, the split information is also large.
– In this way split information damps the gain, resulting in a more balanced criterion than information gain alone.
• Like information gain (in ID3), the attribute with the maximum gain ratio is
selected as the splitting attribute in C4.5.
Summary of Decision Tree Induction
Algorithms
• We have learned how to build a decision tree from a given training data set.
– The decision tree is then used to classify test data.
• For a given training data D, the important task is to build the decision tree
so that:
– All test data can be classified accurately
– The tree is balanced and has the minimum possible depth, so that classification can be done quickly.
• In order to build a decision tree, several algorithms have been proposed.
These algorithms differ in the splitting criteria they use, chosen so that the above-mentioned objectives are met and the decision tree can be induced with minimum time complexity. We have studied three decision tree induction algorithms, namely ID3, CART and C4.5. A summary of these three algorithms is presented in the following table.
Table 11.6

Algorithm   Splitting Criteria   Remark
ID3         Information gain     Works with categorical attributes; multi-way splits; no pruning in the basic algorithm; biased towards attributes with many values
CART        Gini index           Builds strictly binary trees; handles both categorical and continuous attributes
C4.5        Gain ratio           Successor of ID3; handles continuous attributes and missing values; includes pruning
In addition to this, we also highlight few important characteristics
of decision tree induction algorithms in the following.
Notes on Decision Tree Induction
algorithms
1. Optimal Decision Tree: Finding an optimal decision tree is an NP-complete
problem. Hence, decision tree induction algorithms employ a heuristic based
approach to search for the best in a large search space. Majority of the algorithms
follow a greedy, top-down recursive divide-and-conquer strategy to build
decision trees.
2. Missing data and noise: Decision tree induction algorithms are quite robust to
the data set with missing values and presence of noise. However, proper data
pre-processing can be followed to nullify these discrepancies.
3. Redundant Attributes: The presence of redundant attributes does not adversely
affect the accuracy of decision trees. It is observed that if an attribute is chosen
for splitting, then another attribute which is redundant with it is unlikely to be chosen for
splitting.
4. Computational complexity: Decision tree induction algorithms are
computationally inexpensive, even when the training sets are large. Moreover,
once a decision tree has been built, classifying a test record is extremely fast,
with a worst-case time complexity of O(d), where d is the maximum depth of the tree.
Notes on Decision Tree Induction algorithms
5. Data Fragmentation Problem: Since the decision tree induction
algorithms employ a top-down, recursive partitioning approach, the number
of tuples becomes smaller as we traverse down the tree. At a time, the
number of tuples may be too small to make a decision about the class
representation, such a problem is known as the data fragmentation. To deal
with this problem, further splitting can be stopped when the number of
records falls below a certain threshold.
6. Tree Pruning: A sub-tree may be replicated two or more times in a decision tree
(see the figure below). This makes the decision tree unnecessarily large and harder
to interpret. To avoid such a sub-tree replication problem, all copies of the sub-tree
except one can be pruned from the tree.

[Figure: a decision tree in which the same sub-tree appears under two different branches, illustrating the replication problem.]
Avoid Overfitting in Classification
• The generated tree may overfit the training data
– Too many branches, some may reflect anomalies due to
noise or outliers
– The result is poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure falling
below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
• Use a set of data different from the training data to
decide which is the “best pruned tree”
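A minimal sketch (mine) of the postpruning idea: starting from a fully grown tree in the nested-dict representation used in the BuildDT sketch earlier, each internal node is tentatively replaced by a majority-class leaf, and the replacement is kept if it does not hurt accuracy on a separate pruning (validation) set. For brevity each sub-tree is evaluated on the whole validation set rather than only on the records routed to it.

```python
def classify(tree, row, default="?"):
    while "leaf" not in tree:
        tree = tree["children"].get(row.get(tree["attr"]), {"leaf": default})
    return tree["leaf"]

def accuracy(tree, data):
    return sum(classify(tree, r) == y for r, y in data) / len(data)

def reduced_error_prune(tree, train, validation):
    """Post-prune a nested-dict tree using a validation set (reduced-error pruning sketch)."""
    if "leaf" in tree:
        return tree
    # Prune the children bottom-up first.
    for v, child in list(tree["children"].items()):
        sub_train = [(r, y) for r, y in train if r.get(tree["attr"]) == v]
        # Fall back to the parent's training data if the partition is empty.
        tree["children"][v] = reduced_error_prune(child, sub_train or train, validation)
    # Tentatively replace this node by a majority-class leaf.
    labels = [y for _, y in train]
    leaf = {"leaf": max(set(labels), key=labels.count)}
    if accuracy(leaf, validation) >= accuracy(tree, validation):
        return leaf
    return tree
```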
Reference
⚫ The detailed material related to this lecture can be found in:
Data Mining: Concepts and Techniques, (3rd Edn.), Jiawei Han, Micheline Kamber, Morgan
Kaufmann, 2015.
Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, and Vipin Kumar,
Addison-Wesley, 2014
APPENDIX
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
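A short Python sketch (mine) of this path-walking idea, again using the nested-dict tree representation from the earlier BuildDT sketch; each root-to-leaf path becomes one IF-THEN rule. The tree literal below simply encodes the buys_computer tree shown earlier.

```python
def extract_rules(tree, conditions=()):
    """Yield (conditions, class_label) pairs, one per root-to-leaf path."""
    if "leaf" in tree:
        yield conditions, tree["leaf"]
        return
    for value, child in tree["children"].items():
        yield from extract_rules(child, conditions + ((tree["attr"], value),))

tree = {"attr": "age", "children": {
    "<=30": {"attr": "student",
             "children": {"no": {"leaf": "no"}, "yes": {"leaf": "yes"}}},
    "31..40": {"leaf": "yes"},
    ">40": {"attr": "credit_rating",
            "children": {"excellent": {"leaf": "no"}, "fair": {"leaf": "yes"}}},
}}

for conds, label in extract_rules(tree):
    antecedent = " AND ".join(f'{a} = "{v}"' for a, v in conds) or "TRUE"
    print(f'IF {antecedent} THEN buys_computer = "{label}"')
```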
Classification in Large Databases
• Classification—a classical problem extensively studied by
statisticians and machine learning researchers
• Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
• Why decision tree induction in data mining?
– relatively faster learning speed (than other classification
methods)
– convertible to simple and easy to understand classification
rules
– can use SQL queries for accessing databases
– comparable classification accuracy with other methods
Scalable Decision Tree Induction
Methods in Data Mining Studies
• SLIQ (EDBT’96 — Mehta et al.)
– builds an index for each attribute and only class list and the
current attribute list reside in memory
• SPRINT (VLDB’96 — J. Shafer et al.)
– constructs an attribute list data structure
• PUBLIC (VLDB’98 — Rastogi & Shim)
– integrates tree splitting and tree pruning: stop growing the
tree earlier
• RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
– separates the scalability aspects from the criteria that
determine the quality of the tree
– builds an AVC-list (attribute, value, class label)
Drawbacks
• The trees discussed here use axis-parallel splits (each test involves a single attribute).
• For continuous-valued attributes, cut points can be found,
– or the attribute can be discretized (as CART does).