
In Fisher, D., ed., Machine Learning: Proceedings of the Fourteenth International Conference, Morgan Kaufmann Publishers, San Francisco, CA, pp. 404-411, 1997.

Instance Pruning Techniques

D. Randall Wilson, Tony R. Martinez


Computer Science Department
Brigham Young University
Provo, UT 84058
randy@axon.cs.byu.edu, martinez@cs.byu.edu

Abstract

The nearest neighbor algorithm and its derivatives are often quite successful at learning a concept from a training set and providing good generalization on subsequent input vectors. However, these techniques often retain the entire training set in memory, resulting in large memory requirements and slow execution speed, as well as a sensitivity to noise. This paper provides a discussion of issues related to reducing the number of instances retained in memory while maintaining (and sometimes improving) generalization accuracy, and mentions algorithms other researchers have used to address this problem. It presents three intuitive noise-tolerant algorithms that can be used to prune instances from the training set. In experiments on 29 applications, the algorithm that achieves the highest reduction in storage also results in the highest generalization accuracy of the three methods.

1. INTRODUCTION

The nearest neighbor algorithm (Cover & Hart, 1967; Dasarathy, 1991) has been used successfully for pattern classification on many applications. Each pattern has an input vector with one value for each of several input attributes. An instance has an input vector and an output class. A training set T is a collection of instances with known output classes.

A new pattern x is classified in the basic nearest neighbor algorithm by finding the instance y in T that is closest to x, and using the output class of y as the predicted classification for x. The instance that is closest to x is taken to be the one with minimum distance, using some distance function. The distance function can have a significant impact on the ability of the classifier to generalize correctly.
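As an illustration of the basic algorithm just described (this sketch is ours, not part of the original paper), a minimal nearest neighbor classifier using unnormalized Euclidean distance can be written as follows; the function and variable names are illustrative only:

import numpy as np

def nearest_neighbor_classify(train_X, train_y, x):
    """Return the output class of the stored instance closest to x (1-NN)."""
    distances = np.sqrt(((train_X - x) ** 2).sum(axis=1))  # O(n) distance computations per query
    return train_y[np.argmin(distances)]

# Example: four stored instances from two classes in a 2-D input space.
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
train_y = np.array([0, 0, 1, 1])
print(nearest_neighbor_classify(train_X, train_y, np.array([0.8, 0.9])))  # prints 1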
The nearest neighbor algorithm learns very quickly (O(n)) because it need only read in the training set without further processing. It can also generalize accurately for many applications because it can learn complex concept descriptions, and provides an appropriate bias for many applications.

The nearest neighbor algorithm also has several shortcomings. Since it stores all training instances, it has large (O(n)) memory requirements. Since it must search through all available instances to classify a new input vector, it is slow (O(n)) during classification. Since it stores every instance in the training set, noisy instances (i.e., those with errors in the input vector or output class, or those not representative of typical cases) are stored as well, which can degrade generalization accuracy.

Techniques such as k-d trees (Sproull, 1991) and projection (Papadimitriou & Bentley, 1980) can reduce the time required to find the nearest neighbor(s) of an input vector, but they do not reduce storage requirements, do not address the problem of noise, and often become much less effective as the dimensionality of the problem (i.e., the number of input attributes) grows.

On the other hand, when some of the instances are removed (or pruned) from the training set, the storage requirements and time necessary for generalization are correspondingly reduced. This paper focuses on the problem of reducing the size of the stored set of instances while trying to maintain or even improve generalization accuracy. Much research has been done in this area, and an overview of this research is given in Section 2.

Section 3 discusses several issues related to the problem of instance set reduction, and provides motivation for further algorithms. Section 4 presents three new instance reduction techniques that were hypothesized to provide substantial instance reduction while continuing to generalize accurately, even in the presence of noise. The first is similar to the Reduced Nearest Neighbor (RNN) algorithm (Gates, 1972). The second changes the order in which instances are considered for removal, and the third adds a noise-reduction step similar to that done by Wilson (1972) before proceeding with the main reduction algorithm.

Section 5 describes experiments on 29 datasets that compare the performance of each of these three reduction techniques to a basic k-nearest neighbor algorithm. The results indicate that these algorithms achieve substantial storage reduction, while maintaining good generalization accuracy. Comparisons are also made with two of the best algorithms from the review in Section 2. Section 6 provides conclusions and future research directions.

2. RELATED WORK

Several researchers have addressed the problem of training set size reduction. This section reviews previous work in reduction algorithms.

2.1. NEAREST NEIGHBOR EDITING RULES

Hart (1968) made one of the first attempts to reduce the size of the training set with his Condensed Nearest Neighbor Rule (CNN). This algorithm begins by randomly selecting one instance belonging to each output class from the training set T and putting it in a subset S. Then each instance in T is classified using only the instances in S. If an instance is misclassified, it is added to S, thus ensuring that it will be classified correctly. This algorithm is especially sensitive to noise, because noisy instances will usually be misclassified by their neighbors, and thus will be retained.
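A single growing pass of CNN can be sketched as follows (illustrative Python, not Hart's original formulation; CNN is normally repeated until no further additions occur, and plain Euclidean distance is used here for brevity):

import numpy as np

def cnn_pass(X, y, rng=None):
    """One pass of the Condensed NN rule: add each instance the current subset misclassifies."""
    rng = rng or np.random.default_rng(0)
    in_s = np.zeros(len(X), dtype=bool)
    for c in np.unique(y):                       # seed S with one random instance per class
        in_s[rng.choice(np.where(y == c)[0])] = True
    for i in range(len(X)):
        if in_s[i]:
            continue
        s = np.where(in_s)[0]
        d = np.sqrt(((X[s] - X[i]) ** 2).sum(axis=1))
        if y[s[np.argmin(d)]] != y[i]:           # misclassified by S, so it must be kept
            in_s[i] = True
    return in_s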
Ritter et al. (1975) extended the condensed NN method in their Selective Nearest Neighbor Rule (SNN) such that every member of T must be closer to a member of S of the same class than to a member of T (instead of S) of a different class. Further, the method ensures a minimal subset satisfying these conditions. This algorithm resulted in greater reduction in the training set size as well as slightly higher accuracy than CNN in their experiments, though it is still sensitive to noise.

Gates (1972) modified this algorithm in his Reduced Nearest Neighbor Rule (RNN). The reduced NN algorithm starts with S = T and removes each instance from S if such a removal does not cause any other instances in T to be misclassified. Since the instance being removed is not guaranteed to be classified correctly, this algorithm is able to remove noisy instances and internal instances while retaining border points.

Wilson (1972) developed an algorithm which removes instances that do not agree with the majority of their k nearest neighbors (with k=3, typically). This edits out noisy instances as well as close border cases, leaving smoother decision boundaries. It also retains all internal points, which keeps it from reducing the storage requirements as much as many other algorithms. Tomek (1976) extended Wilson's algorithm with his "all k-NN" method of editing, which calls Wilson's algorithm repeatedly for each value of k from 1 up to k, though this still retains internal points.

2.2. "INSTANCE-BASED" LEARNING ALGORITHMS

Aha et al. (1991) presented a series of instance-based learning algorithms that reduce storage. IB2 is quite similar to the Condensed Nearest Neighbor (CNN) rule (Hart, 1968), and suffers from the same sensitivity to noise.

IB3 (Aha et al., 1991) addresses IB2's problem of keeping noisy instances by using a statistical test to retain only acceptable misclassified instances. In their experiments, IB3 was able to achieve greater reduction in the number of instances stored and also achieved higher accuracy than IB2, due to its reduced sensitivity to noise on the applications on which it was tested.

Zhang (1992) used a different approach called the Typical Instance Based Learning (TIBL) algorithm, which attempted to save instances near the center of clusters rather than on the border.

Cameron-Jones (1995) used an encoding length heuristic to determine how good the subset S is in describing T. After a growing phase similar to IB3, a random mutation hill climbing method called Explore is used to search for a better subset of instances, using the encoding length heuristic as a guide.

2.3. PROTOTYPES AND OTHER MODIFICATIONS OF THE INSTANCES

Some algorithms seek to reduce storage requirements and speed up classification by modifying the instances themselves, instead of just deciding which ones to keep.

Chang (1974) introduced an algorithm in which each instance in T is initially treated as a prototype. The nearest two instances that have the same class are merged into a single prototype (using a weighted averaging scheme) that is located somewhere between the two prototypes. This process is repeated until classification accuracy starts to suffer.

Domingos (1995) introduced the RISE 2.0 system, which treats each instance in T as a rule in R, and then generalizes rules until classification accuracy starts to suffer.
Salzberg (1991) introduced the nested generalized exemplar (NGE) theory, in which hyperrectangles are used to take the place of one or more instances, thus reducing storage requirements.

Wettschereck and Dietterich (1995) introduced a hybrid nearest-neighbor and nearest-hyperrectangle algorithm that used hyperrectangles to classify input vectors if they fell inside the hyperrectangle, and kNN to classify inputs that were not covered by any hyperrectangle. This algorithm must therefore store the entire training set T, but accelerates classification by using relatively few hyperrectangles whenever possible.

3. INSTANCE REDUCTION ISSUES

From the above learning models, several observations can be made regarding the issues involved in training set reduction. This section covers the issues of instance representation, the order of the search, the choice of distance function, the general intuition of which instances to keep, and how to evaluate the different strategies.

3.1. REPRESENTATION

One choice in designing a training set reduction algorithm is to decide whether to retain a subset of the original instances or whether to modify the instances using a new representation. For example, NGE (Salzberg, 1991) and its derivatives (Wettschereck & Dietterich, 1995) use hyperrectangles to represent collections of instances; RISE (Domingos, 1995) generalizes instances into rules; and prototypes (Chang, 1974) can be used to represent a cluster of instances, even if no original instance occurred at the point where the prototype is located.

On the other hand, many models seek to retain a subset of the original instances, including the Condensed NN rule (CNN) (Hart, 1968), the Reduced NN rule (RNN) (Gates, 1972), the Selective NN rule (SNN) (Ritter et al., 1975), Wilson's rule (Wilson, 1972), the "all k-NN" method (Tomek, 1976), Instance-Based Learning (IBL) algorithms (Aha et al., 1991), and the Typical Instance Based Learning (TIBL) algorithm (Zhang, 1992).

Another decision that affects the concept description for many algorithms is the choice of k, which is the number of neighbors used to decide the output class of an input vector. The value of k is typically a small integer (e.g., 1, 3 or 5) that is odd so as to avoid "ties" in the voting of neighbors. The value of k is often determined from cross-validation.

3.2. DIRECTION OF SEARCH

When searching for a subset S of instances to keep from training set T, there are different directions the search can proceed. We call these search directions incremental, decremental, and batch.

Incremental. The incremental search begins with an empty subset S, and adds each instance in T to S if it fulfills the criteria. Using such a search direction, the order of presentation of instances can be very important. In particular, the first few instances may have a very different probability of being included in S than they would if they were visited later. For example, CNN begins by selecting one instance from each class at random, which gives these instances a 100% chance of being included. The next instances visited are classified only by the few instances that are already in S, while instances chosen near the end of the algorithm are classified by a much larger number of instances that have been included in S. Other incremental schemes include IB2 and IB3.

One advantage of an incremental scheme is that if instances are made available later, after training is complete, they can continue to be added to S according to the same criteria.

Decremental. Decremental searches begin with S = T, and then search for instances to remove from S. Again the order of presentation can be important, but unlike the incremental process, all of the training examples are available for examination at any time, so a search can be made to determine which instance would be best to remove during each step of the algorithm. Decremental algorithms include RNN, SNN, and Wilson's (1972) rule. NGE and RISE can also be viewed as decremental algorithms, except that instead of simply removing instances from S, they are instead generalized into hyperrectangles or rules. Similarly, Chang's prototype rule operates in a decremental order, but prototypes are merged into each other instead of being simply removed.

One disadvantage of the decremental approach is that it is often computationally more expensive than incremental algorithms. For example, in order to find the nearest neighbor in T of an instance, n distance calculations must be made. On the other hand, there are fewer than n instances in S (zero initially, and some fraction of T eventually), so finding the nearest neighbor in S of an instance takes less computation.

However, if the application of a decremental algorithm can result in greater storage reduction and/or increased generalization accuracy, then the extra computation during learning (which is done just once) can be well worth the computational savings during execution thereafter.
Batch. A final way to apply a training set reduction rule is in batch mode. This involves deciding if each instance meets the removal criteria before removing any, and then removing all of them at once. For example, the "all k-NN" rule operates this way. This can relieve the algorithm from having to constantly update lists of nearest neighbors and other information when instances are individually removed, but there are also dangers in batch processing.

For example, assume we apply a rule such as "Remove an instance if it has the same output class as its k nearest neighbors" to a training set. This could result in entire clusters disappearing if there are no instances of a different class nearby. If done in decremental mode, however, some instances would remain, because eventually enough neighbors would be removed that one of the k nearest neighbors of an instance would have to be of another class, even if it was originally surrounded by those of its own class.

3.3. BORDER POINTS VS. CENTRAL POINTS

Another factor that distinguishes instance reduction techniques is whether they seek to retain border points, central points, or some other set of points.

The intuition behind retaining border points is that "internal" points do not affect the decision boundaries as much as the border points, and thus can be removed with relatively little effect on classification. Algorithms which tend to retain border points include CNN, RNN, IB2, and IB3.

On the other hand, some algorithms instead seek to remove border points. Wilson's rule and the "all k-NN" rule are examples of this. They remove points that are noisy or do not agree with their neighbors. This removes close border points, leaving smoother decision boundaries behind. This may help generalization in some cases, but typically keeps most of the instances.

Some algorithms retain center points instead of border points. For example, the Typical Instance Based Learning algorithm attempts to retain center points and can achieve very dramatic reduction (ideally one instance per class) when conditions are right. However, when decision boundaries are complex, it may take nearly as many instances to define the boundaries using center points as it would using boundary points. Current research is addressing this question.

3.4. DISTANCE FUNCTION

The nearest neighbor algorithm and its derivatives usually use the Euclidean distance function, which is defined as:

    E(x, y) = \sqrt{ \sum_{i=1}^{m} (x_i - y_i)^2 }                                   (1)

where x and y are the two input vectors, m is the number of input attributes, and x_i and y_i are the input values for input attribute i. This function is appropriate when all the input attributes are numeric and have ranges of approximately equal width. When the attributes have substantially different ranges, the attributes can be normalized by dividing the individual attribute distances by the range or standard deviation of the attribute.

When nominal (discrete, unordered) attributes are included in an application, a distance metric is needed that can handle them. We use a distance function based upon the Value Difference Metric (VDM) (Stanfill & Waltz, 1986) for nominal attributes. A simplified version of the VDM defines the distance between two values x and y of a single attribute a as:

    vdm_a(x, y) = \sum_{c=1}^{C} \left( \frac{N_{a,x,c}}{N_{a,x}} - \frac{N_{a,y,c}}{N_{a,y}} \right)^2          (2)

where N_{a,x} is the number of times attribute a had value x; N_{a,x,c} is the number of times attribute a had value x and the output class was c; and C is the number of output classes. Using this distance measure, two values are considered to be closer if they have more similar classifications, regardless of the order of the values.

In order to handle heterogeneous applications (those with both numeric and nominal attributes), we use the heterogeneous distance function HVDM (Wilson & Martinez, 1997), which is defined as:

    HVDM(x, y) = \sqrt{ \sum_{a=1}^{m} d_a^2(x_a, y_a) }                              (3)

where the function d_a(x, y) is the distance for attribute a and is defined as:

    d_a(x, y) = \begin{cases} vdm_a(x, y), & \text{if } a \text{ is nominal} \\ \dfrac{|x - y|}{4\sigma_a}, & \text{if } a \text{ is numeric} \end{cases}       (4)

where vdm_a(x, y) is the function given in (2), and \sigma_a is the standard deviation of the values occurring for attribute a in the instances in the training set T. This function handles unknown input values by assigning them a large distance (1.0), and provides appropriate normalization between numeric and nominal attributes, as well as between numeric attributes of different scales.
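To make equations (1)-(4) concrete, the following illustrative Python sketch (ours, not the authors' implementation) computes HVDM for a single pair of input vectors, assuming the VDM counts and the numeric standard deviations have already been tabulated from the training set. Unknown values, which the paper assigns a distance of 1.0, are omitted for brevity:

import numpy as np

def vdm(x, y, n_v, n_vc, class_labels):
    """Simplified VDM of equation (2) for one nominal attribute.
    n_v[v] = N_{a,v};  n_vc[(v, c)] = N_{a,v,c}."""
    return sum((n_vc.get((x, c), 0) / n_v[x] - n_vc.get((y, c), 0) / n_v[y]) ** 2
               for c in class_labels)

def hvdm(x, y, attr_info):
    """HVDM of equations (3)-(4). attr_info[a] is ("numeric", sigma_a) for a numeric
    attribute or ("nominal", n_v, n_vc, class_labels) for a nominal one."""
    total = 0.0
    for a, info in enumerate(attr_info):
        if info[0] == "numeric":
            d = abs(x[a] - y[a]) / (4.0 * info[1])        # |x - y| / (4 sigma_a)
        else:
            d = vdm(x[a], y[a], info[1], info[2], info[3])  # vdm_a(x, y)
        total += d * d
    return np.sqrt(total)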

4. NEW INSTANCE SET REDUCTION ALGORITHMS

This section presents a collection of new heuristics or rules used to decide which instances to keep and which instances to remove from a training set. In order to avoid repeating lengthy definitions, some notation is introduced here:

A training set T consists of n instances (or prototypes) P1..n. Each instance P has k nearest neighbors P.N1..k (ordered from nearest to furthest), where k is a small odd integer such as 1, 3 or 5. P also has a nearest enemy, P.E, which is the nearest instance with a different output class. Those instances that have P as one of their k nearest neighbors are called associates of P, and are notated as P.A1..a (sorted from nearest to furthest), where a is the number of associates that P has.

Given the issues in Section 3 to consider, our research is directed towards finding instance reduction techniques that provide noise tolerance, high generalization accuracy, insensitivity to the order of presentation of instances, and significant storage reduction, which in turn improves generalization speed. Sections 4.1-4.3 present three rules, called RT1-RT3, respectively, that seek to meet these goals.

4.1. REDUCTION TECHNIQUE 1 (RT1)

The first reduction technique we present, RT1, uses the following basic rule to decide if it is safe to remove an instance from the instance set S (where S = T originally):

    Remove P if at least as many of its associates in S would be classified correctly without it.

To see if an instance can be removed using this rule, each associate is checked to see what effect the removal of P would have on it.

Removing P causes each associate P.Ai to use its (k+1)st nearest neighbor (P.Ai.Nk+1) in place of P. If P has the same class as P.Ai, and P.Ai.Nk+1 has a different class than P.Ai, this weakens its classification, and could cause P.Ai to be misclassified by its neighbors. On the other hand, if P is a different class than P.Ai and P.Ai.Nk+1 is the same class as P.Ai, the removal of P could cause a previously misclassified instance to be classified correctly.

In essence, this rule tests to see if removing P would degrade leave-one-out cross-validation generalization accuracy, which is an estimate of the true generalization ability of the resulting classifier. An instance is removed when it results in the same level of generalization with lower storage requirements.

The algorithm for RT1 proceeds as shown in Figure 1. This algorithm begins by building a list of nearest neighbors for each instance, as well as a list of associates. Then each instance in S is removed if its removal does not hurt the classification of the instances remaining in S.

This algorithm removes noisy instances, because a noisy instance P usually has associates that are mostly of a different class, and such associates will be at least as likely to be classified correctly without P. RT1 also removes instances in the center of clusters, because associates there are not near their enemies, and thus continue to be classified correctly without P.

 1  RT1(Training set T): Instance set S.
 2    Let S = T.
 3    For each instance P in S:
 4      Find P.N1..k+1, the k+1 nearest neighbors of P in S.
 5      Add P to each of its neighbors' lists of associates.
 6    For each instance P in S:
 7      Let with = # of associates of P classified correctly with P as a neighbor.
 8      Let without = # of associates of P classified correctly without P.
 9      If (without - with) ≥ 0
10        Remove P from S.
11        Remove P from its associates' lists of nearest neighbors, and find
12          the next nearest neighbor for each of these associates.
13        Remove P from its neighbors' lists of associates.
14      Endif
15    Return S.

Figure 1: Pseudo-code for RT1.
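A compact, illustrative Python rendering of the same rule follows (ours, not the authors' code). For brevity it uses plain Euclidean distance rather than HVDM and recomputes neighbor and associate lists on the fly instead of maintaining them incrementally as Figure 1 does:

import numpy as np

def _nearest(dist_row, keep, k, exclude):
    """Indices of the k nearest currently kept instances, excluding `exclude` itself."""
    mask = keep.copy()
    mask[exclude] = False
    idx = np.where(mask)[0]
    return idx[np.argsort(dist_row[idx])[:k]]

def _correct(q, dist, y, keep, k):
    """True if instance q is classified correctly by a vote of its k nearest kept neighbors."""
    neighbors = _nearest(dist[q], keep, k, exclude=q)
    if len(neighbors) == 0:
        return False
    votes, counts = np.unique(y[neighbors], return_counts=True)
    return votes[np.argmax(counts)] == y[q]

def rt1(X, y, k=3):
    """Simplified RT1: remove P when at least as many of its associates are classified
    correctly without P as with it."""
    n = len(X)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    keep = np.ones(n, dtype=bool)
    for p in range(n):
        # Associates of p: kept instances that currently have p among their k nearest neighbors.
        assoc = [q for q in range(n)
                 if keep[q] and q != p and p in _nearest(dist[q], keep, k, exclude=q)]
        with_p = sum(_correct(q, dist, y, keep, k) for q in assoc)
        keep[p] = False
        without_p = sum(_correct(q, dist, y, keep, k) for q in assoc)
        if without_p < with_p:      # removal would hurt the associates, so restore p
            keep[p] = True
    return keep

On real data, the O(n^2) distance matrix and the repeated neighbor searches in this sketch are exactly the bookkeeping that the neighbor and associate lists of Figure 1 are meant to avoid.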

Near the border, the removal of some instances sometimes causes others to be classified incorrectly because the majority of their neighbors can become enemies. Thus this algorithm tends to keep non-noisy border points. At the limit, there is typically a collection of border instances such that the majority of the k nearest neighbors of each of these instances is of the correct class.

4.2. REDUCTION TECHNIQUE 2 (RT2)

There is a potential problem that can arise in RT1 with regard to noisy instances. A noisy instance will typically have associates of a different class, and will thus be confined to a somewhat small portion of the input space. However, if its associates are removed by the above criteria, the noisy instance may cover more and more of the input space. Eventually it is hoped that the noisy instance itself will be removed. However, if many of its neighbors are removed first, its associates may eventually include instances of the same class from the other side of the original decision boundary, and it is possible that removing the noisy instance at that point could cause some of its distant associates to be classified incorrectly.

RT2 solves this problem by considering the effect of the removal of an instance on all the instances in the original training set T instead of considering only those instances remaining in S. In other words, an instance P is removed from S only if at least as many of its associates (including those that may have already been pruned) are classified correctly without it.

Using this modification, each instance in the original training set T continues to maintain a list of its k + 1 nearest neighbors in S, even after it is removed from S. This in turn means that instances in S have associates that are both in and out of S, while instances that have been removed from S have no associates (because they are no longer a neighbor of any instance). This modification makes use of additional information that is available for estimating generalization accuracy, and also avoids some problems that can occur with RT1, such as removing entire clusters. This change is made by removing line 13 from the pseudo-code for RT1 so that pruned instances will still be associates of their nearest neighbors in S.

RT2 also changes the order of removal of instances. It initially sorts the instances in S by the distance to their nearest enemy. Instances are then checked for removal beginning with the instance furthest from its nearest enemy. This tends to remove instances furthest from the decision boundary first, which in turn increases the chance of retaining border points.
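The new removal order can be sketched as follows (illustrative Python, ours; Euclidean distance again stands in for HVDM):

import numpy as np

def nearest_enemy_order(X, y):
    """Visit order used by RT2: instances furthest from their nearest enemy come first."""
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    dist[y[:, None] == y[None, :]] = np.inf     # ignore same-class instances (and self)
    return np.argsort(-dist.min(axis=1))        # descending distance to nearest enemy

RT2 then applies the same associate test as RT1 in this order, except that instances already pruned from S still count among the associates (line 13 of Figure 1 is dropped).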
4.3. REDUCTION TECHNIQUE 3 (RT3)

RT2 sorts S in an attempt to remove center points before border points. One problem with this method is that noisy instances are also "border" points, and cause the order of removal to be drastically changed. One noisy point in the center of a cluster causes many points in that cluster to be considered border points, and some of these can remain in S even after the noisy point is removed. Two passes through S can remove the dangling center points, but unfortunately, by that time some border points may have already been removed that should have been kept.

RT3 therefore uses a noise-filtering pass before sorting the instances in S. This is done using a rule similar to Wilson's Rule (Wilson, 1972): any instance misclassified by its k nearest neighbors is removed. This removes noisy instances, as well as close border points, which can in turn smooth the decision boundary slightly. This helps to avoid "overfitting" the data, i.e., using a decision surface that goes beyond modeling the underlying function and starts to model the data sampling distribution as well.

After removing noisy instances from S in this manner, the instances are sorted by distance to their nearest enemy remaining in S, and thus points far from the real decision boundary are removed first. This allows points internal to clusters to be removed early in the process, even if there were noisy points nearby.
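The noise-filtering pass can be sketched as a straightforward Wilson-style editing step (illustrative Python, ours; Euclidean distance stands in for HVDM):

import numpy as np

def wilson_edit(X, y, k=3):
    """Remove every instance that is misclassified by a vote of its k nearest neighbors."""
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)              # an instance does not vote for itself
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        votes, counts = np.unique(y[np.argsort(dist[i])[:k]], return_counts=True)
        if votes[np.argmax(counts)] != y[i]:
            keep[i] = False                     # noisy instance or close border point
    return keep

RT3 then sorts the surviving instances by distance to their nearest enemy, as in RT2, before applying the associate test.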
5. EXPERIMENTAL RESULTS

The reduction algorithms RT1, RT2 and RT3 were implemented using k = 3, and using the HVDM distance function described in Section 3.4. These algorithms were tested on 29 data sets from the University of California, Irvine Machine Learning Database Repository (Merz & Murphy, 1996) and compared to a k-nearest neighbor classifier that was identical to RT1 except that it does not remove any instances from the instance set (i.e., S = T).

Each test consisted of ten trials, each using one of ten partitions of the data randomly selected from the data sets, i.e., 10-fold cross-validation. For each trial, 90% of the training instances were used for T, the subset S was determined using each reduction technique (except for the kNN algorithm, which keeps all the instances), and the remaining 10% of the instances were classified using only the instances remaining in S. The average generalization accuracy over the ten trials for each test is given in Table 1. The average percentage of instances retained in S is shown in the table as well.
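The evaluation protocol can be sketched as follows (illustrative Python, ours). It assumes a reduction function returning a boolean mask over its training fold and a scoring function knn_accuracy(train_X, train_y, test_X, test_y); both names are placeholders, not code from the paper:

import numpy as np

def cross_validate(X, y, reduce_fn, knn_accuracy, folds=10, seed=0):
    """10-fold cross-validation: reduce on 90% of the data, classify the held-out 10% with S."""
    order = np.random.default_rng(seed).permutation(len(X))
    accuracies, storage = [], []
    for test_idx in np.array_split(order, folds):
        train_idx = np.setdiff1d(order, test_idx)
        keep = reduce_fn(X[train_idx], y[train_idx])            # the retained subset S
        accuracies.append(knn_accuracy(X[train_idx][keep], y[train_idx][keep],
                                       X[test_idx], y[test_idx]))
        storage.append(keep.mean())                             # fraction of T retained
    return np.mean(accuracies), np.mean(storage)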

Table 1: Generalization accuracy and storage requirements of kNN, RT1, RT2, RT3, and H-IB3. For each algorithm, the first column is the average generalization accuracy (%) and the second ("size") is the average percentage of training instances retained in S.

Database kNN (size) RT1 (size) RT2 (size) RT3 (size) H-IB3 (size)
Anneal 93.11 100 87.85 9.11 95.36 11.42 93.49 8.63 94.98 7.81
Australian 84.78 100 82.61 7.67 84.64 15.41 84.35 5.93 85.99 6.48
Breast Cancer WI 96.28 100 94.00 2.56 96.14 5.79 96.14 3.58 95.71 2.56
Bridges 66.09 100 55.64 20.86 59.18 24.11 58.27 18.66 59.37 38.67
Crx 83.62 100 81.01 6.70 84.93 14.11 85.80 5.46 83.48 6.86
Echocardiogram 94.82 100 93.39 9.01 85.18 7.51 93.39 9.01 93.39 14.85
Flag 61.34 100 58.13 24.51 62.34 32.30 61.29 20.45 51.50 39.18
Glass 73.83 100 60.30 26.11 64.98 31.52 65.02 23.88 67.77 32.92
Heart 81.48 100 79.26 12.96 81.11 21.60 83.33 13.62 76.30 10.33
Heart.Cleveland 81.19 100 77.85 14.26 79.87 20.61 80.84 12.76 74.23 10.78
Heart.Hungarian 79.22 100 78.92 11.38 79.22 15.98 79.95 10.43 74.83 8.88
Heart.Long Beach VA 70.00 100 73.00 11.78 72.00 16.33 73.50 4.22 69.50 11.67
Heart.More 74.17 100 73.20 11.20 74.50 16.98 76.25 9.10 74.75 13.97
Heart.Swiss 92.69 100 91.15 2.08 93.46 2.89 93.46 1.81 84.62 4.79
Hepatitis 80.62 100 76.21 8.67 82.00 13.98 81.87 7.81 72.79 8.03
Horse-Colic 57.84 100 65.09 10.89 66.17 17.98 71.08 7.42 61.82 17.64
Image.Segmentation 93.10 100 84.76 10.21 92.38 13.76 92.62 10.98 90.24 14.79
Ionosphere 84.62 100 84.91 5.67 88.32 12.09 87.75 7.06 88.32 13.61
Iris 94.00 100 89.33 11.70 95.33 16.89 95.33 14.81 92.00 10.96
LED-Creator+17 67.10 100 68.80 17.50 70.50 26.23 70.40 12.66 52.60 42.42
LED-Creator 60.60 100 64.00 7.78 67.00 11.20 68.20 11.57 71.30 24.32
Liver.Bupa 65.57 100 59.66 27.28 65.80 37.55 60.84 24.99 55.64 15.59
Pima Indians Diabetes 73.56 100 70.96 20.12 73.31 28.11 75.01 16.90 67.83 13.22
Promoters 93.45 100 85.00 8.18 87.91 17.82 86.82 16.67 88.59 10.02
Sonar 87.55 100 69.81 24.04 81.86 30.82 78.00 26.87 71.67 15.22
Soybean-Large 88.59 100 79.81 24.03 84.99 27.94 85.62 25.73 90.23 21.90
Vehicle 71.76 100 66.21 24.00 68.91 31.60 65.85 23.00 66.45 29.17
Vowel 96.57 100 88.98 43.14 91.46 46.91 89.56 45.22 94.70 23.42
Wine 94.93 100 91.05 8.55 93.79 15.23 94.93 16.11 94.38 13.42
Average 80.78 100 76.93 14.55 80.09 20.16 80.31 14.32 77.41 16.67

Results were also obtained for several of the methods discussed in Section 2, including CNN (Hart, 1968), SNN (Ritter et al., 1975), Wilson's Rule (Wilson, 1972), the "All k-NN" method (Tomek, 1976), IB2, IB3 (Aha, Kibler & Albert, 1991), and the Explore method (Cameron-Jones, 1995). Space does not permit the inclusion of all of the results, but the results for IB3 are included in Table 1 as a benchmark for comparison, since it is one of the most popular algorithms. This version of IB3 was modified to use the heterogeneous distance function HVDM and is thus labeled H-IB3.

From the results in this table, several observations can be made. RT1 had very good storage reduction on average, but also dropped in generalization accuracy by an average of almost 4%. This is likely due to the fact that this algorithm ignores those instances that have already been pruned from S when deciding whether to remove additional instances. It also prunes them in a random order, thus causing some border points to be removed too early.

RT2 had better generalization accuracy than RT1 because of its use of the additional information provided by pruned instances in determining whether to remove others. However, it also has higher storage requirements, due at least in part to the fact that noisy instances sometimes cause nearby instances to be retained even after the noisy instance is removed.

RT3 had a higher average generalization accuracy than RT1, RT2, and H-IB3, and also had the lowest storage requirements of the four. Its generalization accuracy was within one-half of a percent of the kNN algorithm that retained 100% of the instances, and yet it retained on average only 14.32% of the instances.

Some datasets seem to be especially well suited for these reduction techniques. For example, RT3 required less than 2% storage for the Heart.Swiss dataset, yet it achieved even higher generalization accuracy than the kNN algorithm. On the other hand, some datasets were not so appropriate. On the Vowel dataset, for example, RT3 required over 45% of the data, and dropped in generalization accuracy by 7%, suggesting that RT3 is inappropriate for this particular dataset.
The Heterogeneous IB3 algorithm was also compared to the original IB3 algorithm (i.e., with k = 1, and using Euclidean distance on linear attributes and the overlap metric on nominal attributes), to make sure that the above comparison was fair. The original IB3 algorithm attained 75% accuracy with 18% storage on average on these 29 datasets, indicating that the heterogeneous distance function was beneficial to IB3.

As mentioned above, the results in Table 1 were also compared with those achieved by CNN, SNN, Wilson's Rule, and the All k-NN method. All of these resulted in lower generalization accuracy and higher storage requirements than RT3. IB2, CNN and SNN all had about 76% accuracy and required about 25-33% storage. Wilson's Rule and the All k-NN method both had 80.1% accuracy but retained over 75% of the instances.

The Explore method (Cameron-Jones, 1995) mentioned in Section 2.2 was also tested, and though its average accuracy was lower than RT3 (76.23%), it reduced storage more dramatically than the others, retaining only 2.23% of the instances on average. When modified with the HVDM distance function, its storage improved further (2.11%) and accuracy increased to 79.01%. This is still not as accurate as RT3, but the improved reduction may be worth the trade-off in some cases.

One factor that influences the amount of reduction achieved by RT3 is the use of k = 3. This causes some instances to be retained that could be removed with k = 1. Initial experiments suggest that the smaller value of k results in more dramatic storage reduction, but may reduce generalization accuracy slightly in RT3. One area of future research is in the use of a dynamic value of k, where k starts out with a value of 3 or 5 and is reduced as pruning continues until it is eventually reduced to a value of 1.

We were interested to see how fast generalization accuracy drops as instances are removed from S, and to see if it drops off more slowly when using an intelligent reduction technique than it does when removing instances randomly.

In order to test this, the experiments described above were modified as follows. When instances were removed from S, the order of their removal was recorded so that the training set T could be sorted by order of removal (with the instances in S at the beginning in random order, since they were not removed). Then the instances in the test set were classified using only the first 1% of the data, then 2%, and so on up to 100% of the data. Figure 2 shows the average generalization accuracy (over all 10 trials on all 29 datasets) as a function of the percentage of the training set that was used for generalization.

[Figure 2: Generalization accuracy for RT3 vs. randomly removing instances. Average generalization accuracy (%) is plotted against the percentage of the training set retained (0-100%), with separate curves for RT3 and random removal.]

As can be seen from the figure, the average generalization accuracy drops off more quickly if instances are randomly removed than if they are removed using RT3, indicating that the order of removal is improved by RT3.

This can be useful if more storage is available than that required by RT3. For example, if an instance set of 10 million instances is reduced to just 100, but there is sufficient storage and computational resources to handle 1000 instances, then by using the sorted list of pruned instances, the best 1000 instances can be used. The user is thus allowed to manually trade off storage for accuracy without the higher loss in accuracy that random removal would cause.
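The trade-off described here can be sketched as follows (illustrative Python, ours): the training instances are ordered so that those RT3 retains come first and those it pruned earliest come last, and accuracy is then measured using only a growing prefix of that ordering. The knn_accuracy helper is the same placeholder as above:

def accuracy_vs_storage(X_ordered, y_ordered, X_test, y_test, knn_accuracy, steps=100):
    """Accuracy as a function of the percentage of (ordered) training instances used."""
    curve = []
    for pct in range(1, steps + 1):
        m = max(1, (len(X_ordered) * pct) // steps)   # first pct% of the sorted training set
        curve.append((pct, knn_accuracy(X_ordered[:m], y_ordered[:m], X_test, y_test)))
    return curve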
6. CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS

Nearest neighbor algorithms and their derivatives are often appropriate and can provide high generalization accuracy for real-world applications, but the storage and computational requirements can be restrictive when the size of the training set is large.

This paper introduced three new instance reduction techniques which are intuitive and provide good storage reduction. In experiments on 29 datasets, the third technique, RT3, provided higher generalization accuracy and lower storage requirements than the other two methods, and its accuracy was within 0.5% of that of a nearest neighbor classifier that retained all of the instances. On average it retained under 15% of the original training set.
RT3 (and to a lesser extent the other algorithms) is designed to be robust in the presence of noise, and it uses an estimate of generalization accuracy in making decisions, which helps it avoid removing instances that would be helpful in generalization. Since RT3 makes use of all the instances in the training set in making its decision, it is not sensitive to the order of presentation of the instances (as are incremental approaches), and is able to choose a good order of removal regardless of how the instances were ordered in the original training set.

These reduction algorithms were also among the first to use heterogeneous distance functions that are appropriate for applications with both nominal and continuous attributes (Wilson & Martinez, 1997).

Future research will focus on determining the conditions under which these algorithms are not appropriate, and will seek to overcome weaknesses in such areas. These reduction algorithms will also be integrated with feature selection algorithms and weighting techniques in order to produce comprehensive instance-based learning systems that are robust in the presence of irrelevant attributes and other difficult circumstances, thus providing more accurate learning algorithms for a wide variety of problems.

References

Aha, David W., Dennis Kibler, and Marc K. Albert, (1991). "Instance-Based Learning Algorithms," Machine Learning, vol. 6, pp. 37-66.

Cameron-Jones, R. M., (1995). "Instance Selection by Encoding Length Heuristic with Random Mutation Hill Climbing," in Proceedings of the Eighth Australian Joint Conference on Artificial Intelligence, pp. 99-106.

Chang, Chin-Liang, (1974). "Finding Prototypes for Nearest Neighbor Classifiers," IEEE Transactions on Computers, vol. 23, no. 11, November 1974, pp. 1179-1184.

Cover, T. M., and P. E. Hart, (1967). "Nearest Neighbor Pattern Classification," IEEE Transactions on Information Theory, vol. 13, no. 1, January 1967, pp. 21-27.

Dasarathy, Belur V., (1991). Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. Los Alamitos, CA: IEEE Computer Society Press.

Domingos, Pedro, (1995). "Rule Induction and Instance-Based Learning: A Unified Approach," to appear in the 1995 International Joint Conference on Artificial Intelligence (IJCAI-95).

Gates, G. W., (1972). "The Reduced Nearest Neighbor Rule," IEEE Transactions on Information Theory, vol. IT-18, no. 3, pp. 431-433.

Hart, P. E., (1968). "The Condensed Nearest Neighbor Rule," IEEE Transactions on Information Theory, vol. 14, pp. 515-516.

Merz, C. J., and P. M. Murphy, (1996). UCI Repository of Machine Learning Databases. Irvine, CA: University of California Irvine, Department of Information and Computer Science. Internet: http://www.ics.uci.edu/~mlearn/MLRepository.html.

Papadimitriou, Christos H., and Jon Louis Bentley, (1980). "A Worst-Case Analysis of Nearest Neighbor Searching by Projection," Lecture Notes in Computer Science, vol. 85, Automata, Languages and Programming, pp. 470-482.

Ritter, G. L., H. B. Woodruff, S. R. Lowry, and T. L. Isenhour, (1975). "An Algorithm for a Selective Nearest Neighbor Decision Rule," IEEE Transactions on Information Theory, vol. 21, no. 6, November 1975, pp. 665-669.

Salzberg, Steven, (1991). "A Nearest Hyperrectangle Learning Method," Machine Learning, vol. 6, pp. 277-309.

Sproull, Robert F., (1991). "Refinements to Nearest-Neighbor Searching in k-Dimensional Trees," Algorithmica, vol. 6, pp. 579-589.

Stanfill, C., and D. Waltz, (1986). "Toward Memory-Based Reasoning," Communications of the ACM, vol. 29, December 1986, pp. 1213-1228.

Tomek, Ivan, (1976). "An Experiment with the Edited Nearest-Neighbor Rule," IEEE Transactions on Systems, Man, and Cybernetics, vol. 6, no. 6, June 1976, pp. 448-452.

Wettschereck, Dietrich, (1994). "A Hybrid Nearest-Neighbor and Nearest-Hyperrectangle Algorithm," to appear in the Proceedings of the 7th European Conference on Machine Learning.

Wettschereck, Dietrich, and Thomas G. Dietterich, (1995). "An Experimental Comparison of Nearest-Neighbor and Nearest-Hyperrectangle Algorithms," Machine Learning, vol. 19, no. 1, pp. 5-28.

Wilson, D. Randall, and Tony R. Martinez, (1997). "Improved Heterogeneous Distance Functions," Journal of Artificial Intelligence Research (JAIR), vol. 6, no. 1, pp. 1-34.

Wilson, Dennis L., (1972). "Asymptotic Properties of Nearest Neighbor Rules Using Edited Data," IEEE Transactions on Systems, Man, and Cybernetics, vol. 2, no. 3, pp. 408-421.

Zhang, Jianping, (1992). "Selecting Typical Instances in Instance-Based Learning," in Proceedings of the Ninth International Conference on Machine Learning.
