A New Heuristic of The Decision Tree Induction

Ning Li, Li Zhao, Ai-Xia Chen, Qing-Wu Meng, Guo-Fang Zhang
Key Lab. of Machine Learning and Computational Intelligence, College of Mathematics and Computer Science, Hebei University, Baoding, 071002, Hebei, China
E-MAIL: lining@cmc.hbu.cn
its solving algorithm is reviewed briefly. Section 3 presents the inverse problem of SVMs and its solving method. In Section 4, the algorithm for inducing a decision tree based on the large margin heuristic is developed. Section 5 introduces the induction algorithm of a binary decision tree based on minimum entropy. Section 6 shows some experimental results and finally Section 7 draws conclusions.

2.2. Generalization in feature space [3]

Substituting (2) for w_0 in (1), we have

f(x) = \sum_{i=1}^{N} y_i \alpha_i^0 (x_i \cdot x) + b_0    (5)

We can determine the separability of two subsets by checking whether the inequalities

y_i (w_0 \cdot x_i - b) \ge 1, \quad i = 1, 2, \ldots, N    (6)

all hold.
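To make Eqs. (5) and (6) concrete, the following is a minimal numerical sketch (not part of the original algorithm): it evaluates the decision function (5) for a linear kernel and tests the separability condition (6). The names X_train, y_train, alpha, w and b are assumed for illustration.

import numpy as np

def decision_function(X_train, y_train, alpha, b, x):
    # Eq. (5) with a linear kernel: f(x) = sum_i y_i * alpha_i * <x_i, x> + b
    return float(np.sum(y_train * alpha * (X_train @ x)) + b)

def is_separable(X_train, y_train, w, b, tol=1e-6):
    # Eq. (6): the split is separable if y_i * (w . x_i - b) >= 1 for every sample
    margins = y_train * (X_train @ w - b)
    return bool(np.all(margins >= 1 - tol))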
2.3. Solving the basic problem of SVMs with SMO

Because the basic problem of SVMs has to be solved many times, a fast algorithm, Sequential Minimal Optimization (SMO), is used in our method.
The SMO algorithm searches through the feasible region of the dual problem and maximizes the objective function

\max W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j)    (9)

s.t. \sum_{i=1}^{l} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, l.

Sequential Minimal Optimization is a simple algorithm that can quickly solve the SVM QP problem without any extra matrix storage and without invoking a numerical QP optimizer at all. SMO decomposes the overall QP problem into QP sub-problems, using Osuna's theorem to ensure convergence [8].

Unlike other methods, SMO chooses to solve the smallest possible optimization problem at every step. For the standard SVM QP problem, the smallest possible optimization problem involves two Lagrange multipliers, because the Lagrange multipliers must obey a linear equality constraint. At every step, SMO chooses two Lagrange multipliers to jointly optimize, finds the optimal values for these multipliers, and updates the SVM to reflect the new optimal values.

The advantage of SMO lies in the fact that solving for two Lagrange multipliers can be done analytically, so numerical QP optimization is avoided entirely.
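As a minimal sketch of how "Procedure 1" (the maximal margin for a fixed labeling) can be computed in practice, one can fall back on an off-the-shelf solver: scikit-learn's SVC wraps LIBSVM, which itself uses an SMO-type algorithm. The function name compute_margin and the use of a large C to approximate the hard-margin problem are assumptions of this sketch, not details given in the paper.

import numpy as np
from sklearn.svm import SVC

def compute_margin(X, y, C=1e6):
    # Margin of the maximal-margin hyper-plane for a fixed labeling y in {-1, +1};
    # returns 0 for the non-separable (or degenerate) case, as in Procedure 1.
    if len(np.unique(y)) < 2:
        return 0.0
    clf = SVC(kernel="linear", C=C).fit(X, y)   # LIBSVM solves the dual with an SMO-type method
    if np.any(clf.predict(X) != y):             # some point misclassified: treat as non-separable
        return 0.0
    w = clf.coef_.ravel()
    return 2.0 / np.linalg.norm(w)              # geometric margin = 2 / ||w||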
3. An inverse problem of SVMs [3]

3.1. The description of the inverse problem of SVMs

For a given dataset in which no class labels are assigned to the instances, we can randomly split the dataset into two subsets. Supposing that one is the positive instance subset and the other is the negative instance subset, we can calculate the maximum margin between the two subsets according to Procedure 1, where the margin is taken to be 0 in the non-separable case. Obviously the calculated margin depends on the random split of the dataset. Our problem is how to split the dataset such that the margin calculated by Procedure 1 attains its maximum.

This is an optimization problem, which we formulate mathematically as follows. Let S = {x_1, x_2, …, x_N} be a dataset with x_i \in R^n for i = 1, 2, …, N, and let \Omega = {f | f is a function from S to {-1, 1}}. Given a function f \in \Omega, the dataset can be split into two subsets and the margin can then be calculated by Procedure 1. Denoting the calculated margin (a functional of f) by Margin(f), the inverse problem is formulated as

\max_{f \in \Omega} \mathrm{Margin}(f).    (10)

3.2. The solving of the inverse problem of SVMs

In reference [3], the authors use a genetic algorithm to solve the inverse problem of SVMs. Because of its high time complexity, we use clustering to solve this problem instead.

Given a training sample set with n samples, solving the inverse problem requires partitioning them into two classes, i.e. a class label (1 or -1) is assigned to each sample, and a partition is obtained from this label assignment. For each partition, the basic problem of SVMs must be solved to obtain the margin value. If exhaustive enumeration is used, there are 2^n assignments and 2^{n-1} - 1 distinct partitions, after discarding the symmetric partitions and the partitions in which all samples receive the same label. That means the basic problem of SVMs must be solved 2^{n-1} - 1 times, so the time complexity is very high.

Therefore clustering is used in our algorithm. All the samples are first clustered into m clusters, and then a class label is assigned to each cluster, i.e. the samples in the same cluster receive the same class label. There are then only 2^m assignments and 2^{m-1} - 1 corresponding partitions, which sharply decreases the number of times the basic problem of SVMs has to be solved.

The algorithm is as follows (a sketch in code is given after these steps):
1. The training set is divided into m clusters by k-means clustering.
2. Each cluster is assigned a class label; let Maxmargin denote the maximum margin found so far and initialize it to 0.
3. For each partition P_i (i = 1, …, 2^{m-1} - 1):
   (1) Solve the basic problem of SVMs and record the corresponding margin.
   (2) If margin > Maxmargin, then set Maxmargin = margin.
4. Return Maxmargin and the corresponding partition.
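A minimal sketch of steps 1-4 is given below. It reuses compute_margin from the previous sketch; fixing the label of the first cluster to +1 removes the symmetric assignments, leaving the 2^(m-1) - 1 partitions described above. The function name and the default m = 6 are illustrative assumptions.

import numpy as np
from itertools import product
from sklearn.cluster import KMeans

def solve_inverse_problem(X, m=6):
    # Step 1: cluster the unlabeled training set into m clusters.
    clusters = KMeans(n_clusters=m, n_init=10).fit_predict(X)
    max_margin, best_labels = 0.0, None          # Step 2: initialize Maxmargin to 0.
    # Step 3: enumerate the 2^(m-1) - 1 partitions (cluster 0 is fixed to +1).
    for tail in product([1, -1], repeat=m - 1):
        cluster_labels = np.array((1,) + tail)
        if np.all(cluster_labels == 1):          # skip the all-(+1) assignment
            continue
        y = cluster_labels[clusters]             # every sample inherits its cluster's label
        margin = compute_margin(X, y)            # Step 3(1): solve the basic problem of SVMs
        if margin > max_margin:                  # Step 3(2): keep the best partition so far
            max_margin, best_labels = margin, y
    return max_margin, best_labels               # Step 4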
4. The induction of decision trees based on the large margin heuristic

4.1. The main idea

Because of the high generalization capability of SVMs, we can apply the SVM inverse problem to the decision tree induction process: we construct a decision tree by maximizing the margin of each split. For a given training set, we initially ignore its class labels; the dataset can then be split into two subsets by solving the inverse problem of SVMs, and the subsets become the new branches. The instances labeled -1 form the left branch and the instances labeled +1 form the right branch. The hyper-plane function with the maximum margin is recorded at the node and used as its decision function. When we classify new instances, the outcome of the decision function decides which branch they follow: if the outcome for a new instance is less than 0, it goes to the left branch; otherwise it goes to the right one.
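The classification rule described above can be sketched as follows; the Node structure (a recorded decision function f plus left/right children or a leaf label) is an assumption of this sketch, not a data structure specified in the paper.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    f: Optional[Callable] = None        # recorded maximum-margin decision function
    left: Optional["Node"] = None       # branch for f(x) <= 0 (label -1)
    right: Optional["Node"] = None      # branch for f(x) > 0  (label +1)
    label: Optional[int] = None         # class label at a leaf, None at internal nodes

def classify(node: Node, x) -> int:
    # Route the instance by the sign of the decision function at every internal node.
    while node.label is None:
        node = node.left if node.f(x) <= 0 else node.right
    return node.label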
4.2. Induction algorithm

Given a training set whose attributes have numerical values and a truth level threshold β, the induction process consists of the following steps:

Step 1: Create a root node that contains all the instances. Split the instances into two subsets, the new branches, by solving the inverse problem of SVMs. The hyper-plane with the maximum margin is recorded at the root node.

Step 2: For each branch of the above node, calculate the proportion of objects within the branch belonging to each class. If the proportion is above the given threshold β, terminate the branch as a leaf. Otherwise, continue to split the branch using the method of Step 1.

Step 3: Repeat Step 2 for all newly generated decision nodes until no further growth is possible; the decision tree is then complete.

Taking the two-class problem as an example, the specific algorithm is as follows (a Python sketch is given after the pseudocode):
MaximumMargin(Examples, β)

Examples is the training example set and β is the threshold on the proportion of examples belonging to each class. The decision tree that classifies the given examples is returned.

• Create the Root node of the decision tree.
• If the proportion of Examples with the positive class is bigger than β, then return the tree consisting of the Root node with label = +.
• If the proportion of Examples with the negative class is bigger than β, then return the tree consisting of the Root node with label = -.
• Otherwise,
    • Solve the inverse problem of SVMs on Examples and obtain the partition with the maximum margin.
    • Let the hyperplane be f(x); if f(x) ≤ 0, set c(x) = -1, and if f(x) > 0, set c(x) = +1.
    • Add two branches under the Root node, namely f(x) ≤ 0 and f(x) > 0.
    • Set Examples_{-1} to the subset of Examples that satisfies f(x) ≤ 0, and set Examples_{+1} to the subset that satisfies f(x) > 0.
    • Add the subtrees MaximumMargin(Examples_{-1}, β) and MaximumMargin(Examples_{+1}, β) under the two branches.
• End.
Return Root.
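A compact Python sketch of MaximumMargin is given below. It reuses Node, solve_inverse_problem and the linear SVC setup from the earlier sketches; the guard for the case where no separable partition is found and the default parameters beta and m are assumptions added so the sketch terminates.

import numpy as np
from sklearn.svm import SVC

def maximum_margin_tree(X, y, beta=0.9, m=6):
    # y in {-1, +1} is used only for the leaf test; the split itself ignores the labels.
    node = Node()
    for label in (+1, -1):
        if np.mean(y == label) > beta:                   # proportion of one class exceeds beta: leaf
            node.label = label
            return node
    _, split = solve_inverse_problem(X, m=min(m, len(X)))
    if split is None:                                    # no separable partition found: fall back to a leaf
        node.label = 1 if np.mean(y == 1) >= 0.5 else -1
        return node
    clf = SVC(kernel="linear", C=1e6).fit(X, split)      # recover the recorded hyper-plane f(x)
    node.f = lambda x: float(clf.decision_function(np.atleast_2d(x))[0])
    left = split <= 0                                    # Examples_-1: the f(x) <= 0 side
    node.left = maximum_margin_tree(X[left], y[left], beta, m)
    node.right = maximum_margin_tree(X[~left], y[~left], beta, m)
    return node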
5. The induction algorithm of binary decision tree based on the minimum entropy heuristic

Suppose D is a training set with numerical attributes. The induction algorithm is described as follows [7]:

Step 1: If all the instances in D belong to one class, or satisfy another terminating rule (for example, the proportion of instances within the branch belonging to one class is above a given threshold), D is terminated as a leaf node
labeled with that class; otherwise, go to Step 2.

Step 2: A non-leaf node is produced and D is split into n (for a binary decision tree, n = 2) example sets D_i (i = 1, 2, …, n). For each D_i, execute Step 1.

The heuristic information used in Step 2 is minimum entropy. Let A_1, A_2, …, A_m be all the attributes and v_{i1}, v_{i2}, …, v_{in} be the values of A_i. For each value v_{ij} of each attribute A_i, we can split D into two subsets D_1 and D_2, where D_1 is the subset of instances whose value for A_i is less than or equal to v_{ij} and D_2 is the subset of instances whose value for A_i is greater than v_{ij}. So we can compute the entropy of each such candidate split and choose the split that minimizes it.
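The minimum-entropy split selection can be sketched as follows. The weighting of the two subsets by their sizes and the function names are assumptions of this sketch; the paper does not spell out the exact entropy expression in this section.

import numpy as np

def entropy(labels):
    # Shannon entropy of a label vector.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_binary_split(X, y):
    # For every attribute A_i and every value v_ij, split D into D1 (A_i <= v_ij)
    # and D2 (A_i > v_ij) and keep the split with the smallest weighted entropy.
    best_attr, best_value, best_h = None, None, np.inf
    n = len(y)
    for i in range(X.shape[1]):
        for v in np.unique(X[:, i]):
            mask = X[:, i] <= v
            if mask.all() or not mask.any():             # skip splits with an empty subset
                continue
            h = (mask.sum() * entropy(y[mask]) + (~mask).sum() * entropy(y[~mask])) / n
            if h < best_h:
                best_attr, best_value, best_h = i, v, h
    return best_attr, best_value, best_h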
Table 1. The features in the databases

Database         Sample number   Attribute number
Iris             100             5
Rice             98              6
Image Segment    661             19
Pima             768             9

Table 2. The parameters of the algorithm