
Proceedings of the Eighth International Conference on Machine Learning and Cybernetics, Baoding, 12-15 July 2009

A NEW HEURISTIC OF THE DECISION TREE INDUCTION


NING LI, LI ZHAO, AI-XIA CHEN, QING-WU MENG, GUO-FANG ZHANG

Key Lab. of Machine Learning and Computational Intelligence, College of Mathematics and Computer Science, Hebei
University, Baoding, 071002, Hebei, China
E-mail: lining@cmc.hbu.cn

Abstract:

Decision tree induction is one of the useful approaches for extracting classification knowledge from a set of feature-based instances. The most popular heuristic information used in decision tree generation is the minimum entropy. This heuristic information has a serious disadvantage: poor generalization capability [3]. The Support Vector Machine (SVM) is a classification technique of machine learning based on statistical learning theory, and it has good generalization. Considering the relationship between the classification margin of the SVM and generalization capability, the large margin of the SVM can be used as the heuristic information of the decision tree in order to improve its generalization capability. This paper proposes a decision tree induction algorithm based on the large margin heuristic. Compared with the binary decision tree that uses minimum entropy as the heuristic information, the experiments show that the generalization capability is improved by using the new heuristic.

Keywords:
Decision tree; Support vector machines; SMO; Clustering; Large margin; Generalization

1. Introduction

Decision tree induction is one of the most important branches of inductive learning. It is one of the most widely used and practical methods for inductive inference. It creates a decision tree from the instance table according to heuristic information and classifies instances by using a set of rules transformed from the decision tree. Many methods have been developed for constructing decision trees from collections of instances. Given a training set, a general procedure for generating a decision tree can be briefly described as follows. The entire training set is first considered as the root node of the tree. Then the root node is split into several sub-nodes based on some heuristic information. If the instances in a sub-node belong to one class, the sub-node is regarded as a leaf node; otherwise we continue to split the sub-node based on the heuristic information. This process repeats until all leaf nodes are generated. The most popular heuristic information used in decision tree generation is the minimum entropy. This heuristic information has many advantages, such as a small number of leaves and low computational effort. However, it has a serious disadvantage: poor generalization capability [3].

Support vector machines (SVMs) are a classification technique of machine learning based on statistical learning theory. Considering a 2-class classification problem, SVMs construct a hyper-plane from a training dataset as a classifier to classify unseen instances. The SVM hyper-plane has the maximum margin, which is defined as the distance between the support vectors and the separating hyper-plane. According to Vapnik's statistical learning theory, the maximum margin implies that the separating hyper-plane is optimal in the sense of expected risk minimization. The optimal characteristics of separating hyper-planes have resulted in an excellent generalization capability and some other good performances of SVM classifiers [3].

The inverse problem is how to split a given dataset into two clusters such that the margin between the two clusters attains its maximum. Initially, suppose we do not know the class of each individual instance. It is expected that the instances of the dataset are assigned class labels (+1 or -1) such that the margin defined by the separating hyper-plane of the SVM attains its maximum [3].

This paper investigates a new decision tree generation procedure to improve the generalization capability of existing decision tree programs based on the minimum entropy heuristic. Due to the relationship between the margin of SVMs and generalization capability, the split with maximum margin is considered as the new heuristic information for generating decision trees. In the process of induction, the inverse problem of SVMs is required to be solved. Compared with the decision tree programs based on the minimum entropy heuristic, the training accuracy and testing accuracy are improved.


The rest of this paper is organized as follows. In section 2, the basic concept of support vector machines and their solving algorithm are reviewed briefly. Section 3 presents the inverse problem of SVMs and its solving method. In section 4, the algorithm for inducing a decision tree based on the large margin is developed. Section 5 introduces the induction algorithm of the binary decision tree based on minimum entropy. Section 6 shows some experimental results, and finally section 7 draws conclusions.

2. Support vector machines

2.1. The basic problem of SVMs [3]

Let S = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} be a training set, where x_i ∈ R^n and y_i ∈ {−1, 1} for i = 1, 2, ..., N. The optimal hyper-plane of S is defined as f(x) = 0, where

    f(x) = (w_0 · x) + b_0                                                        (1)

    w_0 = Σ_{j=1}^{N} y_j α_j^0 x_j                                               (2)

Here (w_0 · x) = Σ_{i=1}^{n} w_0^i x^i is the inner product of the two vectors, where w_0 = (w_0^1, w_0^2, ..., w_0^n) and x = (x^1, x^2, ..., x^n). The vector w_0 can be determined by the following quadratic programming problem.

Maximize

    W(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i,j=1}^{N} y_i y_j α_i α_j (x_i · x_j)      (3)

subject to

    Σ_{j=1}^{N} y_j α_j = 0;  C ≥ α_i ≥ 0,  i = 1, 2, ..., N

where C is a positive constant. The constant b_0 is given by

    b_0 = y_i − ( x_i · Σ_{j=1}^{N} y_j α_j^0 x_j )                               (4)

Substituting (2) for w_0 in (1), we have

    f(x) = Σ_{i=1}^{N} y_i α_i^0 (x_i · x) + b_0                                  (5)

We can check the separability of the two subsets by testing whether the following inequalities hold:

    y_i ((w_0 · x_i) − b) ≥ 1,  i = 1, 2, ..., N                                  (6)

2.2. Generalization in feature space [3]

Practically, the performance of SVMs based on the previous section may not be very good for nonlinearly separable cases in the original space. To improve the performance and to reduce the computational load for nonlinearly separable datasets, Vapnik extended SVMs from the original space to a feature space. The key idea of the extension is that an SVM first maps the original input space into a high-dimensional feature space through some nonlinear mapping, and then constructs an optimal separating hyper-plane in the feature space. Without any knowledge of the mapping, the SVM can find the optimal hyper-plane by using the dot product function in the feature space. The dot product function is usually called a kernel function. According to the Hilbert-Schmidt theorem, there exists a relationship between the original space and its feature space for the dot product of two points. That is,

    (z_1 · z_2) = K(x_1, x_2)                                                     (7)

where it is assumed that a mapping Φ from the original space to the feature space exists such that Φ(x_1) = z_1 and Φ(x_2) = z_2, and K(x_1, x_2) is conventionally called a kernel function satisfying the Mercer theorem. Usually the following three types of kernel functions are used: the polynomial kernel with degree p, the radial basis function and the sigmoid function. Replacing the inner product (x_1 · x_2) in (5) with the kernel function K(x_1, x_2), the optimal separating hyper-plane takes the following form:

    f(x) = Σ_{i=1}^{N} y_i α_i^0 K(x_i, x) + b_0                                  (8)

It is worth noting that the conclusion of section 2.1 is still valid in the feature space if we substitute K(x_1, x_2) for the inner product (x_1 · x_2).
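
As a rough illustration of the basic problem (3) and its kernel form (8), and not the authors' own implementation, the following Python sketch fits an RBF-kernel soft-margin SVM with scikit-learn's SMO-style solver and recovers the geometric margin 2/||w|| from the dual coefficients, using ||w||^2 = Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j). The helper name svm_margin and the default parameter values (taken loosely from the settings reported later in Table 2) are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def svm_margin(X, y, C=10000.0, two_sigma_squared=2.0):
    """Margin 2/||w|| of an RBF-kernel soft-margin SVM fitted on (X, y).

    Returns 0.0 when the labeling is degenerate (only one class present),
    mirroring the paper's convention of a zero margin for unusable splits.
    """
    if len(np.unique(y)) < 2:
        return 0.0
    gamma = 1.0 / two_sigma_squared            # K(x, z) = exp(-gamma * ||x - z||^2)
    clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X, y)
    sv = clf.support_vectors_                  # support vectors x_i
    coef = clf.dual_coef_.ravel()              # y_i * alpha_i for each support vector
    # ||w||^2 = sum_{i,j} (y_i alpha_i)(y_j alpha_j) K(x_i, x_j), cf. (2) and (8)
    w_norm_sq = float(coef @ rbf_kernel(sv, sv, gamma=gamma) @ coef)
    return 2.0 / np.sqrt(w_norm_sq) if w_norm_sq > 0 else 0.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(4.0, 1.0, (20, 2))])
    y = np.array([-1] * 20 + [+1] * 20)
    print("margin of this labeling:", svm_margin(X, y))
```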


2.3. Solving the basic problem of SVMs with SMO

Because the basic problem of SVMs needs to be solved many times, a fast algorithm, Sequential Minimal Optimization (SMO), is used in our method. The SMO algorithm searches through the feasible region of the dual problem and maximizes the objective function:

    max W(α) = Σ_{i=1}^{l} α_i − (1/2) Σ_{i,j=1}^{l} α_i α_j y_i y_j K(x_i, x_j)   (9)

    s.t.  Σ_{i=1}^{l} α_i y_i = 0,  C ≥ α_i ≥ 0,  i = 1, ..., l.

SMO is a simple algorithm that can quickly solve the SVM QP problem without any extra matrix storage and without using numerical QP optimization steps at all. SMO decomposes the overall QP problem into QP sub-problems, using Osuna's theorem to ensure convergence [8].

Unlike other methods, SMO chooses to solve the smallest possible optimization problem at every step. For the standard SVM QP problem, the smallest possible optimization problem involves two Lagrange multipliers, because the Lagrange multipliers must obey a linear equality constraint. At every step, SMO chooses two Lagrange multipliers to jointly optimize, finds the optimal values for these multipliers, and updates the SVM to reflect the new optimal values. The advantage of SMO lies in the fact that the optimization over two Lagrange multipliers can be done analytically; thus, numerical QP optimization is avoided entirely.

3. An inverse problem of SVMs [3]

3.1. The description of the inverse problem of SVMs

For a given dataset in which no class labels are assigned to the instances, we can randomly split the dataset into two subsets. Supposing that one is the positive instance subset and the other is the negative instance subset, we can calculate the maximum margin between the two subsets according to Procedure 1, where the margin is taken to be 0 for the non-separable case. Obviously, the calculated margin depends on the random split of the dataset. Our problem is how to split the dataset such that the margin calculated according to Procedure 1 attains its maximum.

This is an optimization problem. We formulate it mathematically as follows. Let S = {x_1, x_2, ..., x_N} be a dataset with x_i ∈ R^n for i = 1, 2, ..., N, and let Ω = {f | f is a function from S to {−1, 1}}. Given a function f ∈ Ω, the dataset can be split into two subsets and the margin can then be calculated by Procedure 1. Denoting the calculated margin (a functional) by Margin(f), the inverse problem is formulated as

    max_{f ∈ Ω} Margin(f)                                                         (10)

3.2. Solving the inverse problem of SVMs

In reference [3], the authors use a genetic algorithm to solve the inverse problem of SVMs. Due to its high time complexity, we use clustering to solve this problem instead.

Consider a training sample set with n samples. Solving the inverse problem requires partitioning them into two classes, i.e. a class label (1 or -1) is assigned to each sample, and a partition is obtained from the class label assignment. For each partition, the basic problem of SVMs needs to be solved to get the margin value. If the enumeration method is used, there are 2^n assignments and 2^{n-1} − 1 partitions after ignoring the symmetric partitions and the partitions in which all samples are assigned 1 or -1. That means the basic problem of SVMs needs to be solved 2^{n-1} − 1 times, so the time complexity is very high.

Therefore, clustering is used in our algorithm. All the samples are first clustered into m clusters, and then a class label is assigned to each cluster, i.e. the samples in the same cluster are assigned the same class label. There are then 2^m assignments and the corresponding number of partitions is 2^{m-1} − 1, which makes the number of times the basic problem of SVMs must be solved decrease sharply.

The algorithm is as follows:
1. The training set is divided into m clusters by k-means clustering.
2. Each cluster is assigned a class label; let MaxMargin be the maximum margin and set its initial value to 0.
3. For each partition P_i (i = 1, ..., 2^{m-1} − 1):
   (1) Solve the basic problem of SVMs and record the corresponding margin.
   (2) If margin > MaxMargin, then MaxMargin = margin.
4. Return MaxMargin and the corresponding partition.
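
A minimal sketch of this clustering-based solver is given below, assuming the svm_margin helper from the previous sketch: k-means groups the samples into m clusters, every one of the 2^{m-1} − 1 non-trivial cluster labelings is scored by solving the basic problem of SVMs, and the labeling with the largest margin is returned. The function name and defaults are illustrative assumptions, not the authors' code.

```python
import itertools
import numpy as np
from sklearn.cluster import KMeans

def solve_inverse_problem(X, m=4, **svm_kwargs):
    """Approximate the inverse problem (10): return (best margin, per-sample labels)."""
    cluster_id = KMeans(n_clusters=m, n_init=10, random_state=0).fit_predict(X)
    best_margin, best_labels = 0.0, None
    # Fix cluster 0 to class -1 so each symmetric labeling is enumerated only once;
    # the all -1 labeling is skipped because it contains a single class.
    for bits in itertools.product([-1, +1], repeat=m - 1):
        cluster_labels = np.array((-1,) + bits)
        if np.all(cluster_labels == -1):
            continue
        y = cluster_labels[cluster_id]           # every sample inherits its cluster's label
        margin = svm_margin(X, y, **svm_kwargs)  # solve the basic problem of SVMs
        if margin > best_margin:
            best_margin, best_labels = margin, y
    return best_margin, best_labels
```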


4. The induction of decision trees based on the large margin heuristic

4.1. The main idea

Due to the high generalization capability of SVMs, we can apply the SVM inverse problem to the decision tree induction process. We construct a decision tree by maximizing the margin of each split. For a given training set, we initially do not consider its class labels; the dataset can be split into two subsets by solving the inverse problem of SVMs, and the subsets can be considered as the new branches. The instances labeled -1 form the left branch, and the instances labeled +1 form the right branch. The hyper-plane function with the maximum margin is recorded and used as the decision function. When we classify new instances, the outcome of the decision function decides which branch they belong to: if the outcome for a new instance is less than or equal to 0, it belongs to the left branch; otherwise it belongs to the right one.

4.2. Induction algorithm

Given a training set whose attributes have numerical values and a truth level threshold β, the induction process consists of the following steps:

Step 1: Create a root node that contains all the instances for the tree. Split the instances into two subsets as the new branches according to the procedure for solving the inverse problem of SVMs. The hyper-plane with the maximum margin is recorded at the root node.

Step 2: For each branch of the above node, calculate the proportion of objects within the branch belonging to each class. If the proportion is above the given threshold β, terminate the branch as a leaf. Otherwise, continue to split the branch using the method in Step 1.

Step 3: Repeat Step 2 for all newly generated decision nodes until no further growth is possible; the decision tree is then complete.

Taking the two-class problem as an example, the specific algorithm is as follows:

MaximumMargin(Examples, β)
Examples is the training example set; β is the threshold of the proportion of examples belonging to each class. The decision tree that can classify the given examples correctly is returned.
- Create the Root node of the decision tree.
- If the proportion of Examples with positive class is bigger than β, then return the tree consisting of the Root node with label = +.
- If the proportion of Examples with negative class is bigger than β, then return the tree consisting of the Root node with label = -.
- Otherwise,
  - Solve the inverse problem of SVMs on Examples and obtain the partition with the maximum margin.
  - Let the hyper-plane be f(x); if f(x) ≤ 0, set c(x) = -1; if f(x) > 0, set c(x) = +1.
  - Add two branches under the Root node: f(x) ≤ 0 and f(x) > 0, respectively.
  - Set Examples_{-1} to the subset of Examples that satisfies f(x) ≤ 0. Set Examples_{+1} to the subset of Examples that satisfies f(x) > 0.
  - Add the subtrees MaximumMargin(Examples_{-1}, β) and MaximumMargin(Examples_{+1}, β) under the two branches.
- End.
Return Root.
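
The recursion above can be sketched in Python as follows, again assuming the svm_margin and solve_inverse_problem helpers from the earlier sketches; the dictionary-based tree representation, the max_depth guard and all names are illustrative assumptions rather than the authors' implementation. Each internal node refits an RBF-kernel SVM on the max-margin labeling and stores its decision function in the role of f(x).

```python
import numpy as np
from sklearn.svm import SVC

def maximum_margin_tree(X, y, beta=0.9, m=4, C=10000.0, two_sigma_squared=2.0,
                        depth=0, max_depth=10):
    """Grow a MaximumMargin(Examples, beta) tree as a nested dict."""
    pos_frac = float(np.mean(y == +1))
    if pos_frac >= beta:
        return {"leaf": +1}
    if 1.0 - pos_frac >= beta:
        return {"leaf": -1}
    margin, split_labels = solve_inverse_problem(X, m=m, C=C,
                                                 two_sigma_squared=two_sigma_squared)
    if split_labels is None or depth >= max_depth:    # no usable split found
        return {"leaf": +1 if pos_frac >= 0.5 else -1}
    # Refit an SVM on the max-margin labeling; its decision function plays the role of f(x).
    f = SVC(C=C, kernel="rbf", gamma=1.0 / two_sigma_squared).fit(X, split_labels)
    go_right = f.decision_function(X) > 0             # branch f(x) > 0
    if go_right.all() or (~go_right).all():           # degenerate split: stop growing
        return {"leaf": +1 if pos_frac >= 0.5 else -1}
    return {
        "f": f,
        "left": maximum_margin_tree(X[~go_right], y[~go_right], beta, m, C,
                                    two_sigma_squared, depth + 1, max_depth),
        "right": maximum_margin_tree(X[go_right], y[go_right], beta, m, C,
                                     two_sigma_squared, depth + 1, max_depth),
    }

def classify(tree, x):
    """Route a single instance x down the tree using the stored hyper-planes."""
    while "leaf" not in tree:
        branch = tree["f"].decision_function(x.reshape(1, -1))[0]
        tree = tree["right"] if branch > 0 else tree["left"]
    return tree["leaf"]
```

A test instance x_test is then labeled with classify(tree, x_test), following the left branch when the stored decision function is non-positive and the right branch otherwise.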


5. The induction algorithm of the binary decision tree based on the minimum entropy heuristic

Suppose D is a training set in numerical form. The induction algorithm is described as follows [7]:

Step 1: If all the instances in D belong to one class, or D satisfies another terminating rule (for example, the proportion of instances within the branch belonging to one class is above a given threshold), D is terminated as a leaf node labeled with that class. Otherwise, turn to Step 2.

Step 2: A non-leaf node is produced, and D is split into n (for a binary decision tree, n = 2) example sets, labeled D_i (i = 1, 2, ..., n). For each D_i, execute Step 1.

The heuristic information used in Step 2 is the minimum entropy. Let A_1, A_2, ..., A_m be all the attributes and v_{i1}, v_{i2}, ..., v_{in} be the values of A_i. For each value v_{ij} of each attribute A_i, we can split D into two subsets D_1 and D_2, where D_1 is the subset of instances whose value for A_i is less than or equal to v_{ij}, and D_2 is the subset of instances whose value for A_i is greater than v_{ij}. So we obtain n_i splits for attribute A_i, where n_i is the number of values of A_i. For each split, we calculate the entropy. We take the minimum entropy, labeled E_i, for attribute A_i; the attribute with the minimum entropy, i.e. min(E_i), is then chosen as the expanded attribute.

6. Experiments and discussion

We apply the SVM to decision tree induction, i.e. we use the maximum margin as the heuristic to induce the decision tree. This algorithm and the one that uses minimum entropy as the heuristic both handle real-valued data. We have implemented the two decision tree induction systems.

We choose Iris, Rice, Pima and Image Segment from the UCI database; Table 1 shows the features of each database. Using these data we perform the experiments and compare the test accuracy. For the multi-class databases, we choose only two of the classes. The experimental steps are as follows:
1. Data preprocessing. Set the class attribute of the data to -1 and 1, and divide the data randomly into two parts: 70% as the training set and 30% as the testing set.
2. Set the corresponding parameters; the parameters and their values are listed in Table 2.
3. Train the data using the induction algorithm based on the large margin, and induce the decision tree.
4. Use the testing data to test, and obtain the testing accuracy.

Table 1. The features of the databases

Database          Sample number    Attribute number
Iris              100              5
Rice              98               6
Image Segment     661              19
Pima              768              9

Table 2. The parameters of the algorithm

k = 4                      Number of clusters
two_sigma_squared = 2      RBF kernel function parameter
tolerance = 0.001          KKT tolerance degree
eps = 0.001                Multiplier updating threshold in SMO
β = 0.9                    Threshold of the leaf nodes
C = 10000                  Punishment (penalty) factor

Table 3 shows the experimental results, i.e. the comparison of the average testing accuracy between the two algorithms, where Algorithm 1 represents the binary decision tree induction algorithm introduced in section 5 and Algorithm 2 represents the one introduced in section 4.2; the time is the time spent by Algorithm 2.

Table 3. The comparison of the tested algorithms

Database          Algorithm 1    Algorithm 2    Time
Iris              97.8%          100%           31 ms
Rice              79.4%          88.89%         625 ms
Image Segment     100%           100%           3421 ms
Pima              66.09%         73.0%          600000 ms

The results illustrate that the training accuracy and testing accuracy are improved by using the maximum margin heuristic; the generalization ability is improved to some extent.

The concept of the margin of a classifier is central to the success of a new classification learning algorithm [3]. The generalization capability of SVMs depends strongly on the margin [3]. Increasing the margin has been shown to improve the generalization capability of SVMs through data-dependent structural risk minimization [5]. In the induction process we use the maximum margin as the heuristic information, i.e. the margin of each split during the induction of the decision tree attains its maximum, so the split with the maximum margin has good generalization.
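
For comparison with the baseline, a brief sketch of the minimum-entropy heuristic used by Algorithm 1 (section 5) is given below; it is illustrative rather than the authors' code. Every attribute A_i and every observed value v define a candidate split {A_i ≤ v} / {A_i > v}, and the split with the smallest weighted class entropy is selected.

```python
import numpy as np

def entropy(y):
    """Binary class entropy of a label vector with values in {-1, +1}."""
    if len(y) == 0:
        return 0.0
    p = np.bincount((y == +1).astype(int), minlength=2) / len(y)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def best_entropy_split(X, y):
    """Return (attribute index, threshold) of the minimum weighted-entropy split."""
    n, d = X.shape
    best_attr, best_value, best_e = None, None, np.inf
    for a in range(d):
        for v in np.unique(X[:, a])[:-1]:      # the largest value would leave one side empty
            left = X[:, a] <= v
            e = (left.sum() * entropy(y[left]) + (~left).sum() * entropy(y[~left])) / n
            if e < best_e:
                best_attr, best_value, best_e = a, v, e
    return best_attr, best_value
```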


7. Conclusions

In order to generate decision trees with higher generalization capability, this paper proposes a decision tree induction algorithm based on the large margin. A clustering algorithm for solving the inverse problem of SVMs is employed during the induction. The main disadvantage of this algorithm is still its large time complexity. Practically, it is expected that an improved version or another algorithm with reduced time complexity will be given [3].

Acknowledgements

This research is supported by the Natural Science Foundation of Hebei Province (F2008000635) and is supported in part by the Hebei Provincial Education Department (No. 2008306).

References

[1] Quinlan J. R., "Induction of decision trees", Machine Learning, 1986, 1: 81-106.
[2] T. M. Mitchell, "Machine Learning", The McGraw-Hill Co., 414 p., 1997.
[3] Xi-zhao Wang, Qiang He, De-Gang Chen, Daniel Yeung, "A genetic algorithm for solving the inverse problem of support vector machines", Neurocomputing 68 (2005): 225-238.
[4] V. N. Vapnik, "Statistical Learning Theory", Wiley, New York, ISBN 0-471-03003-1, 1998.
[5] V. N. Vapnik, "An overview of statistical learning theory", IEEE Trans. Neural Networks 10 (5) (1999): 988-999.
[6] V. N. Vapnik, "The Nature of Statistical Learning Theory", Springer, New York, ISBN 0-387-98780-0, 2000.
[7] Jie Yang, Chen-zhou Ye, Yue Zhou, Nian-yi Chen, "On Upper Bound of VC Dimension of Binary Decision Tree Algorithms", Computer Simulation, 2005, 2(22): 74-78.
[8] John C. Platt, "Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines", Microsoft Research Technical Report, 1998.
