Salzberg (1991) introduced the nested generalized exemplar (NGE) theory, in which hyperrectangles are used to take the place of one or more instances, thus reducing storage requirements.

Wettschereck & Dietterich (1995) introduced a hybrid nearest-neighbor and nearest-hyperrectangle algorithm that uses hyperrectangles to classify input vectors when they fall inside a hyperrectangle, and kNN to classify inputs that are not covered by any hyperrectangle. This algorithm must therefore store the entire training set T, but accelerates classification by using relatively few hyperrectangles whenever possible.

3. INSTANCE REDUCTION ISSUES

From the above learning models, several observations can be made regarding the issues involved in training set reduction. This section covers the issues of instance representation, the order of the search, the choice of distance function, the general intuition of which instances to keep, and how to evaluate the different strategies.

3.1. REPRESENTATION

One choice in designing a training set reduction algorithm is whether to retain a subset of the original instances or to modify the instances using a new representation. For example, NGE (Salzberg, 1991) and its derivatives (Wettschereck & Dietterich, 1995) use hyperrectangles to represent collections of instances; RISE (Domingos, 1995) generalizes instances into rules; and prototypes (Chang, 1974) can be used to represent a cluster of instances, even if no original instance occurred at the point where the prototype is located.

On the other hand, many models seek to retain a subset of the original instances, including the Condensed NN rule (CNN) (Hart, 1968), the Reduced NN rule (RNN) (Gates, 1972), the Selective NN rule (SNN) (Ritter et al., 1975), Wilson’s rule (Wilson, 1972), the “all k-NN” method (Tomek, 1976), Instance-Based Learning (IBL) algorithms (Aha et al., 1991), and the Typical Instance Based Learning (TIBL) algorithm (Zhang, 1992).

Another decision that affects the concept description for many algorithms is the choice of k, the number of neighbors used to decide the output class of an input vector. The value of k is typically a small integer (e.g., 1, 3, or 5) that is odd so as to avoid “ties” in the voting of neighbors, and it is often determined by cross-validation.

3.2. DIRECTION OF SEARCH

When searching for a subset S of instances to keep from the training set T, the search can proceed in different directions. We call these search directions incremental, decremental, and batch.

Incremental. An incremental search begins with an empty subset S and adds each instance in T to S if it fulfills the criteria. With such a search direction, the order of presentation of the instances can be very important. In particular, the first few instances may have a very different probability of being included in S than they would if they were visited later. For example, CNN begins by selecting one instance from each class at random, which gives these instances a 100% chance of being included. The next instances visited are classified only by the few instances that are already in S, while instances visited near the end of the algorithm are classified by a much larger number of instances that have been included in S. Other incremental schemes include IB2 and IB3.

One advantage of an incremental scheme is that if instances are made available later, after training is complete, they can continue to be added to S according to the same criteria.

Decremental. A decremental search begins with S = T and then searches for instances to remove from S. Again, the order of presentation can be important, but unlike the incremental process, all of the training examples are available for examination at any time, so a search can be made to determine which instance would be best to remove during each step of the algorithm. Decremental algorithms include RNN, SNN, and Wilson’s (1972) rule. NGE and RISE can also be viewed as decremental algorithms, except that instead of simply being removed from S, instances are generalized into hyperrectangles or rules. Similarly, Chang’s prototype rule operates in a decremental order, but prototypes are merged into each other instead of simply being removed.

One disadvantage of the decremental approach is that it is often computationally more expensive than incremental algorithms. For example, in order to find the nearest neighbor in T of an instance, n distance calculations must be made. An incremental algorithm, on the other hand, searches only S, which contains fewer than n instances (zero initially, and some fraction of T eventually), so finding the nearest neighbor in S of an instance takes less computation.

However, if a decremental algorithm results in greater storage reduction and/or increased generalization accuracy, then the extra computation during learning (which is done just once) can be well worth the computational savings during execution thereafter.
Batch. A final way to apply a training set reduction rule is in batch mode. This involves deciding if each instance meets the removal criteria before removing any, and then removing all of them at once. For example, the “all k-NN” rule operates this way. This can relieve the algorithm from having to constantly update lists of nearest neighbors and other information when instances are individually removed, but there are also dangers in batch processing.

For example, assume we apply a rule such as “Remove an instance if it has the same output class as its k nearest neighbors” to a training set. This could result in entire clusters disappearing if there are no instances of a different class nearby. If done in decremental mode, however, some instances would remain, because eventually enough neighbors would be removed that one of the k nearest neighbors of an instance would have to be of another class, even if it was originally surrounded by instances of its own class.
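To make the three search directions concrete, the following minimal Python sketch (not from the paper) applies a selection criterion incrementally, decrementally, and in batch mode; the keep_criterion and remove_criterion callables are hypothetical placeholders for whatever rule a particular algorithm uses.

    # Illustrative sketch of the three search directions; the criterion
    # functions are hypothetical placeholders, not the paper's algorithms.

    def incremental_search(T, keep_criterion):
        """Start with an empty S and add each instance that satisfies the criterion."""
        S = []
        for inst in T:                       # order of presentation matters here
            if keep_criterion(inst, S):
                S.append(inst)
        return S

    def decremental_search(T, remove_criterion):
        """Start with S = T and remove instances one at a time."""
        S = list(T)
        for inst in list(S):                 # all of T is available for examination
            if remove_criterion(inst, S):
                S.remove(inst)               # later decisions see the updated S
        return S

    def batch_search(T, remove_criterion):
        """Decide every removal against the full set first, then remove all at once."""
        S = list(T)
        marked = [inst for inst in S if remove_criterion(inst, S)]
        return [inst for inst in S if inst not in marked]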
3.4. DISTANCE FUNCTION

The Euclidean distance function is defined as:

    E(x, y) = \sqrt{ \sum_{i=1}^{m} (x_i - y_i)^2 }          (1)

where x and y are the two input vectors, m is the number of input attributes, and x_i and y_i are the input values for input attribute i. This function is appropriate when all the input attributes are numeric and have ranges of approximately equal width. When the attributes have substantially different ranges, the attributes can be normalized by dividing the individual attribute distances by the range or standard deviation of the attribute.
When nominal (discrete, unordered) attributes are included in an application, a distance metric is needed that can handle them. We use a distance function based upon the Value Difference Metric (VDM) (Stanfill & Waltz, 1986) for nominal attributes. A simplified version of the VDM defines the distance between two values x and y of a single attribute a as:

attribute a in the instances in the training set T. This function handles unknown input values by assigning them a large distance (1.0), and provides appropriate normalization between numeric and nominal attributes, as well as between numeric attributes of different scales.
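The paper's own equations for the simplified VDM and the combined heterogeneous function are not reproduced in this excerpt, so the sketch below only illustrates a common simplified form of the VDM for a single nominal attribute, in which two values are close when they are distributed similarly across the output classes (here using squared differences of conditional class probabilities); the exact HVDM normalization is given in Wilson & Martinez (1997).

    from collections import Counter, defaultdict

    def vdm_tables(values, classes):
        """Count how often each value of one nominal attribute occurs (n_x)
        and how often it occurs with each output class (n_xc)."""
        n_x = Counter(values)
        n_xc = defaultdict(Counter)
        for v, c in zip(values, classes):
            n_xc[v][c] += 1
        return n_x, n_xc

    def simplified_vdm(x, y, n_x, n_xc, all_classes):
        """Distance between attribute values x and y: sum over classes of the
        squared difference between their conditional class probabilities."""
        total = 0.0
        for c in all_classes:
            px = n_xc[x][c] / n_x[x] if n_x[x] else 0.0
            py = n_xc[y][c] / n_x[y] if n_x[y] else 0.0
            total += (px - py) ** 2
        return total

    # Example: one nominal attribute ("color") in a tiny two-class training set.
    values  = ["red", "red", "blue", "blue", "green", "green"]
    classes = [0, 0, 1, 1, 0, 1]
    n_x, n_xc = vdm_tables(values, classes)
    print(simplified_vdm("red", "blue", n_x, n_xc, all_classes=[0, 1]))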
4.1. REDUCTION TECHNIQUE 1 (RT1)

RT1 uses the following basic rule to decide whether it is safe to remove an instance P from the instance set S (where S = T originally):

    Remove P if at least as many of its associates in S would be classified correctly without it.

An instance P in the interior of a cluster can typically be removed, because its associates there are not near their enemies, and thus continue to be classified correctly without P. Near the border, the removal of some instances sometimes causes others to be classified incorrectly because the majority of their neighbors can become enemies. Thus this algorithm tends to keep non-noisy border points. At the limit, there is typically a collection of border instances such that the majority of the k nearest neighbors of each of these instances are of the correct class.
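A naive sketch of this test is shown below. It assumes, as the surrounding text suggests, that the associates of P are the instances that have P among their k + 1 nearest neighbors, and it simply reclassifies the associates with and without P rather than maintaining the neighbor lists the paper describes; dist and label are assumed helper functions.

    def knn_classify(inst, pool, k, dist, label):
        """Majority vote among the k nearest neighbors of `inst` within `pool`."""
        neighbors = sorted((p for p in pool if p is not inst),
                           key=lambda p: dist(inst, p))[:k]
        votes = {}
        for n in neighbors:
            votes[label(n)] = votes.get(label(n), 0) + 1
        return max(votes, key=votes.get) if votes else None

    def safe_to_remove(P, S, associates, k, dist, label):
        """RT1-style check (sketch): remove P if at least as many of its
        associates in S are classified correctly without it as with it."""
        without_P = [s for s in S if s is not P]
        correct_with = sum(knn_classify(a, S, k, dist, label) == label(a)
                           for a in associates)
        correct_without = sum(knn_classify(a, without_P, k, dist, label) == label(a)
                              for a in associates)
        return correct_without >= correct_with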
4.2. REDUCTION TECHNIQUE 2 (RT2)

There is a potential problem that can arise in RT1 with regard to noisy instances. A noisy instance will typically have associates of a different class, and will thus be confined to a somewhat small portion of the input space. However, if its associates are removed by the above criterion, the noisy instance may cover more and more of the input space. It is hoped that the noisy instance itself will eventually be removed. However, if many of its neighbors are removed first, its associates may eventually include instances of the same class from the other side of the original decision boundary, and it is possible that removing the noisy instance at that point could cause some of its distant associates to be classified incorrectly.

RT2 solves this problem by considering the effect of the removal of an instance on all the instances in the original training set T instead of considering only those instances remaining in S. In other words, an instance P is removed from S only if at least as many of its associates (including those that may have already been pruned) are classified correctly without it.

Using this modification, each instance in the original training set T continues to maintain a list of its k + 1 nearest neighbors in S, even after it is removed from S. This in turn means that instances in S have associates that are both in and out of S, while instances that have been removed from S have no associates (because they are no longer a neighbor of any instance). This modification makes use of additional information that is available for estimating generalization accuracy, and also avoids some problems that can occur with RT1, such as removing entire clusters. This change is made by removing line 13 from the pseudo-code for RT1, so that pruned instances will still be associates of their nearest neighbors in S.

RT2 also changes the order of removal of instances. It initially sorts the instances in S by the distance to their nearest enemy. Instances are then checked for removal beginning with the instance furthest from its nearest enemy. This tends to remove instances furthest from the decision boundary first, which in turn increases the chance of retaining border points.
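The sketch below combines RT2's two changes: the associates of an instance are taken from the original training set T (so instances that have already been pruned still count), and instances are visited in order of decreasing distance to their nearest enemy. It is an illustration only; dist, label, classify, and the precomputed associates_in_T mapping are assumed helpers, and the paper's actual implementation maintains and updates lists of nearest neighbors rather than reclassifying from scratch.

    def nearest_enemy_distance(inst, T, dist, label):
        """Distance from `inst` to the nearest instance of a different class."""
        return min((dist(inst, e) for e in T if label(e) != label(inst)),
                   default=float("inf"))

    def rt2_sketch(T, k, dist, label, classify, associates_in_T):
        """Illustrative RT2-style pass (not the paper's pseudo-code)."""
        S = list(T)
        # Visit instances furthest from their nearest enemy first, so points
        # far from the decision boundary are considered for removal early.
        order = sorted(S, key=lambda p: nearest_enemy_distance(p, T, dist, label),
                       reverse=True)
        for P in order:
            assoc = associates_in_T[id(P)]      # associates from T, even if pruned
            S_without = [s for s in S if s is not P]
            correct_with = sum(classify(a, S, k) == label(a) for a in assoc)
            correct_without = sum(classify(a, S_without, k) == label(a) for a in assoc)
            if correct_without >= correct_with:
                S = S_without                   # remove P from S
        return S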
4.3. REDUCTION TECHNIQUE 3 (RT3)

RT2 sorts S in an attempt to remove center points before border points. One problem with this method is that noisy instances are also “border” points, and cause the order of removal to be drastically changed. One noisy point in the center of a cluster causes many points in that cluster to be considered border points, and some of these can remain in S even after the noisy point is removed. Two passes through S can remove the dangling center points, but unfortunately, by that time some border points may have already been removed that should have been kept.

RT3 therefore uses a noise-filtering pass before sorting the instances in S. This is done using a rule similar to Wilson’s Rule (Wilson, 1972): any instance misclassified by its k nearest neighbors is removed. This removes noisy instances, as well as close border points, which can in turn smooth the decision boundary slightly. This helps to avoid “overfitting” the data, i.e., using a decision surface that goes beyond modeling the underlying function and starts to model the data sampling distribution as well.

After removing noisy instances from S in this manner, the instances are sorted by distance to their nearest enemy remaining in S, and thus points far from the real decision boundary are removed first. This allows points internal to clusters to be removed early in the process, even if there were noisy points nearby.
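A sketch of RT3's two preparatory steps is given below, with classify, label, dist, and nearest_enemy_distance assumed as helpers; the RT2-style removal criterion would then be applied to the instances in the resulting order.

    def rt3_preprocess(T, k, dist, label, classify, nearest_enemy_distance):
        """Illustrative sketch: Wilson-style noise filter, then sort survivors
        by decreasing distance to their nearest enemy remaining in S."""
        # 1. Noise-filtering pass: drop every instance misclassified by its
        #    k nearest neighbors (decided against the full set, removed at once).
        S = [inst for inst in T
             if classify(inst, [t for t in T if t is not inst], k) == label(inst)]
        # 2. Points far from their nearest enemy (far from the decision
        #    boundary) will be considered for removal first.
        S.sort(key=lambda p: nearest_enemy_distance(p, S, dist, label), reverse=True)
        return S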
5. EXPERIMENTAL RESULTS

The reduction algorithms RT1, RT2 and RT3 were implemented using k = 3, and using the HVDM distance function described in Section 3.4. These algorithms were tested on 29 data sets from the University of California, Irvine Machine Learning Database Repository (Merz & Murphy, 1996) and compared to a k-nearest neighbor classifier that was identical to RT1 except that it does not remove any instances from the instance set (i.e., S = T).

Each test consisted of ten trials, each using one of ten partitions of the data randomly selected from the data sets, i.e., 10-fold cross-validation. For each trial, 90% of the training instances were used for T, the subset S was determined using each reduction technique (except for the kNN algorithm, which keeps all the instances), and the remaining 10% of the instances were classified using only the instances remaining in S. The average generalization accuracy over the ten trials for each test is given in Table 1. The average percentage of instances retained in S is shown in the table as well.
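The evaluation procedure can be summarized by the sketch below, where reduce_fn stands for any of the reduction techniques (returning S given T) and accuracy for a kNN classifier that uses only the instances in S; both are assumed helpers.

    import random

    def cross_validate(data, reduce_fn, accuracy, folds=10, seed=0):
        """10-fold cross-validation: each trial uses 90% of the data as T,
        selects S from T with the reduction technique, and classifies the
        held-out 10% using only the instances remaining in S."""
        rng = random.Random(seed)
        shuffled = list(data)
        rng.shuffle(shuffled)
        partitions = [shuffled[i::folds] for i in range(folds)]
        accs, sizes = [], []
        for i in range(folds):
            test = partitions[i]
            T = [inst for j, part in enumerate(partitions) if j != i for inst in part]
            S = reduce_fn(T)                          # e.g., RT1, RT2, or RT3
            accs.append(accuracy(test, S))            # generalization accuracy
            sizes.append(100.0 * len(S) / len(T))     # percent of T retained
        return sum(accs) / folds, sum(sizes) / folds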
Table 1: Generalization accuracy (%) and storage requirements of kNN, RT1, RT2, RT3, and H-IB3. Each pair of columns gives the accuracy followed by the percentage of instances retained (size).
Database            kNN (size)   RT1 (size)   RT2 (size)   RT3 (size)   H-IB3 (size)
Anneal 93.11 100 87.85 9.11 95.36 11.42 93.49 8.63 94.98 7.81
Australian 84.78 100 82.61 7.67 84.64 15.41 84.35 5.93 85.99 6.48
Breast Cancer WI 96.28 100 94.00 2.56 96.14 5.79 96.14 3.58 95.71 2.56
Bridges 66.09 100 55.64 20.86 59.18 24.11 58.27 18.66 59.37 38.67
Crx 83.62 100 81.01 6.70 84.93 14.11 85.80 5.46 83.48 6.86
Echocardiogram 94.82 100 93.39 9.01 85.18 7.51 93.39 9.01 93.39 14.85
Flag 61.34 100 58.13 24.51 62.34 32.30 61.29 20.45 51.50 39.18
Glass 73.83 100 60.30 26.11 64.98 31.52 65.02 23.88 67.77 32.92
Heart 81.48 100 79.26 12.96 81.11 21.60 83.33 13.62 76.30 10.33
Heart.Cleveland 81.19 100 77.85 14.26 79.87 20.61 80.84 12.76 74.23 10.78
Heart.Hungarian 79.22 100 78.92 11.38 79.22 15.98 79.95 10.43 74.83 8.88
Heart.Long Beach VA 70.00 100 73.00 11.78 72.00 16.33 73.50 4.22 69.50 11.67
Heart.More 74.17 100 73.20 11.20 74.50 16.98 76.25 9.10 74.75 13.97
Heart.Swiss 92.69 100 91.15 2.08 93.46 2.89 93.46 1.81 84.62 4.79
Hepatitis 80.62 100 76.21 8.67 82.00 13.98 81.87 7.81 72.79 8.03
Horse-Colic 57.84 100 65.09 10.89 66.17 17.98 71.08 7.42 61.82 17.64
Image.Segmentation 93.10 100 84.76 10.21 92.38 13.76 92.62 10.98 90.24 14.79
Ionosphere 84.62 100 84.91 5.67 88.32 12.09 87.75 7.06 88.32 13.61
Iris 94.00 100 89.33 11.70 95.33 16.89 95.33 14.81 92.00 10.96
LED-Creator+17 67.10 100 68.80 17.50 70.50 26.23 70.40 12.66 52.60 42.42
LED-Creator 60.60 100 64.00 7.78 67.00 11.20 68.20 11.57 71.30 24.32
Liver.Bupa 65.57 100 59.66 27.28 65.80 37.55 60.84 24.99 55.64 15.59
Pima Indians Diabetes 73.56 100 70.96 20.12 73.31 28.11 75.01 16.90 67.83 13.22
Promoters 93.45 100 85.00 8.18 87.91 17.82 86.82 16.67 88.59 10.02
Sonar 87.55 100 69.81 24.04 81.86 30.82 78.00 26.87 71.67 15.22
Soybean-Large 88.59 100 79.81 24.03 84.99 27.94 85.62 25.73 90.23 21.90
Vehicle 71.76 100 66.21 24.00 68.91 31.60 65.85 23.00 66.45 29.17
Vowel 96.57 100 88.98 43.14 91.46 46.91 89.56 45.22 94.70 23.42
Wine 94.93 100 91.05 8.55 93.79 15.23 94.93 16.11 94.38 13.42
Average 80.78 100 76.93 14.55 80.09 20.16 80.31 14.32 77.41 16.67
Results were also obtained for several of the methods discussed in Section 2, including CNN (Hart, 1968), SNN (Ritter et al., 1975), Wilson’s Rule (Wilson, 1972), the “All k-NN” method (Tomek, 1976), IB2, IB3 (Aha, Kibler & Albert, 1991), and the Explore method (Cameron-Jones, 1995). Space does not permit the inclusion of all of the results, but the results for IB3 are included in Table 1 as a benchmark for comparison, since it is one of the most popular algorithms. This version of IB3 was modified to use the heterogeneous distance function HVDM and is thus labeled H-IB3.

From the results in this table, several observations can be made. RT1 had very good storage reduction on average, but also dropped in generalization accuracy by an average of almost 4%. This is likely due to the fact that this algorithm ignores those instances that have already been pruned from S when deciding whether to remove additional instances. It also prunes them in a random order, thus causing some border points to be removed too early.

RT2 had better generalization accuracy than RT1 because of its use of the additional information provided by pruned instances in determining whether to remove others. However, it also has higher storage requirements, due at least in part to the fact that noisy instances sometimes cause nearby instances to be retained even after the noisy instance is removed.

RT3 had a higher average generalization accuracy than RT1, RT2, and H-IB3, and also had the lowest storage requirements of the four. Its generalization accuracy was within one-half of a percent of the kNN algorithm that retained 100% of the instances, and yet it retained on average only 14.32% of the instances.

Some datasets seem to be especially well suited for these reduction techniques. For example, RT3 required less than 2% storage for the Heart.Swiss dataset, yet it achieved even higher generalization accuracy than the kNN algorithm. On the other hand, some datasets were not so appropriate. On the Vowel dataset, for example, RT3 required over 45% of the data, and dropped in generalization accuracy by 7%, suggesting that RT3 is inappropriate for this particular dataset.
The Heterogeneous IB3 algorithm was also compared to the original IB3 algorithm (i.e., with k = 1, and using Euclidean distance on linear attributes and the overlap metric on nominal attributes), to make sure that the above comparison was fair. The original IB3 algorithm attained 75% accuracy with 18% storage on average on these 29 datasets, indicating that the heterogeneous distance function was beneficial to IB3.

As mentioned above, the results in Table 1 were also compared with those achieved by CNN, SNN, Wilson’s Rule, and the All k-NN method. All of these resulted in lower generalization accuracy and higher storage requirements than RT3. IB2, CNN and SNN all had about 76% accuracy and required about 25-33% storage. Wilson’s Rule and the All k-NN method both had 80.1% accuracy but retained over 75% of the instances.

The Explore method (Cameron-Jones, 1995) mentioned in Section 2.2 was also tested, and though its average accuracy was lower than RT3 (76.23%), it reduced

[Figure: average generalization accuracy (%Accuracy) as instances are removed, comparing removal in the order chosen by RT3 with random removal.]

As can be seen from the figure, the average generalization accuracy drops off more quickly if instances are randomly removed than if they are removed using RT3, indicating that the order of removal is improved by RT3.

This can be useful if more storage is available than that required by RT3. For example, if an instance set of 10 million instances is reduced to just 100, but there are sufficient storage and computational resources to handle 1000 instances, then by using the sorted list of pruned instances, the best 1000 instances can be used. The user is thus allowed to manually trade off storage for accuracy without the higher loss in accuracy that random removal would cause.
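One natural way to realize this trade-off, assuming the pruning algorithm records the order in which it removed instances (first removed to last removed), is sketched below; the function and argument names are illustrative only.

    def best_m_instances(S, removal_order, m):
        """Grow the retained subset back up to m instances by restoring the
        most recently pruned instances first (the ones removed last are
        presumed to lie closest to the decision boundary)."""
        restored = list(S)
        for inst in reversed(removal_order):      # last-removed instances first
            if len(restored) >= m:
                break
            restored.append(inst)
        return restored

    # Example: keep the "best" 1000 instances when S has been pruned to 100.
    # chosen = best_m_instances(S, removal_order, m=1000)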
RT3 (and to a lesser extent the other algorithms) is designed to be robust in the presence of noise and to use an estimate of generalization accuracy in making decisions, which helps it avoid removing instances that would be helpful in generalization. Since RT3 makes use of all the instances in the training set in making its decision, it is not sensitive to the order of presentation of the instances (as incremental approaches are), and it is able to choose a good order of removal regardless of how the instances were ordered in the original training set.

These reduction algorithms were also among the first to use heterogeneous distance functions that are appropriate for applications with both nominal and continuous attributes (Wilson & Martinez, 1997).

Future research will focus on determining the conditions under which these algorithms are not appropriate, and will seek to overcome weaknesses in such areas. These reduction algorithms will also be integrated with feature selection algorithms and weighting techniques in order to produce comprehensive instance-based learning systems that are robust in the presence of irrelevant attributes and other difficult circumstances, thus providing more accurate learning algorithms for a wide variety of problems.

References

Aha, David W., Dennis Kibler, and Marc K. Albert, (1991). “Instance-Based Learning Algorithms,” Machine Learning, vol. 6, pp. 37-66.

Cameron-Jones, R. M., (1995). “Instance Selection by Encoding Length Heuristic with Random Mutation Hill Climbing,” In Proceedings of the Eighth Australian Joint Conference on Artificial Intelligence, pp. 99-106.

Chang, Chin-Liang, (1974). “Finding Prototypes for Nearest Neighbor Classifiers,” IEEE Transactions on Computers, vol. 23, no. 11, November 1974, pp. 1179-1184.

Cover, T. M., and P. E. Hart, (1967). “Nearest Neighbor Pattern Classification,” IEEE Transactions on Information Theory, vol. 13, no. 1, January 1967, pp. 21-27.

Dasarathy, Belur V., (1991). Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. Los Alamitos, CA: IEEE Computer Society Press.

Domingos, Pedro, (1995). “Rule Induction and Instance-Based Learning: A Unified Approach,” to appear in The 1995 International Joint Conference on Artificial Intelligence (IJCAI-95).

Gates, G. W., (1972). “The Reduced Nearest Neighbor Rule,” IEEE Transactions on Information Theory, vol. IT-18, no. 3, pp. 431-433.

Hart, P. E., (1968). “The Condensed Nearest Neighbor Rule,” IEEE Transactions on Information Theory, vol. 14, pp. 515-516.

Merz, C. J., and P. M. Murphy, (1996). UCI Repository of Machine Learning Databases. Irvine, CA: University of California Irvine, Department of Information and Computer Science. Internet: http://www.ics.uci.edu/~mlearn/MLRepository.html.

Papadimitriou, Christos H., and Jon Louis Bentley, (1980). “A Worst-Case Analysis of Nearest Neighbor Searching by Projection,” Lecture Notes in Computer Science, vol. 85, Automata Languages and Programming, pp. 470-482.

Ritter, G. L., H. B. Woodruff, S. R. Lowry, and T. L. Isenhour, (1975). “An Algorithm for a Selective Nearest Neighbor Decision Rule,” IEEE Transactions on Information Theory, vol. 21, no. 6, November 1975, pp. 665-669.

Salzberg, Steven, (1991). “A Nearest Hyperrectangle Learning Method,” Machine Learning, vol. 6, pp. 277-309.

Sproull, Robert F., (1991). “Refinements to Nearest-Neighbor Searching in k-Dimensional Trees,” Algorithmica, vol. 6, pp. 579-589.

Stanfill, C., and D. Waltz, (1986). “Toward memory-based reasoning,” Communications of the ACM, vol. 29, December 1986, pp. 1213-1228.

Tomek, Ivan, (1976). “An Experiment with the Edited Nearest-Neighbor Rule,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 6, no. 6, June 1976, pp. 448-452.

Wettschereck, Dietrich, (1994). “A Hybrid Nearest-Neighbor and Nearest-Hyperrectangle Algorithm,” to appear in the Proceedings of the 7th European Conference on Machine Learning.

Wettschereck, Dietrich, and Thomas G. Dietterich, (1995). “An Experimental Comparison of Nearest-Neighbor and Nearest-Hyperrectangle Algorithms,” Machine Learning, vol. 19, no. 1, pp. 5-28.

Wilson, D. Randall, and Tony R. Martinez, (1997). “Improved Heterogeneous Distance Functions,” Journal of Artificial Intelligence Research (JAIR), vol. 6, no. 1, pp. 1-34.

Wilson, Dennis L., (1972). “Asymptotic Properties of Nearest Neighbor Rules Using Edited Data,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 2, no. 3, pp. 408-421.

Zhang, Jianping, (1992). “Selecting Typical Instances in Instance-Based Learning,” Proceedings of the Ninth International Conference on Machine Learning.