Novelty Detection A Review
Novelty Detection A Review
Abstract
Novelty detection is the identi-cation of new or unknown data or signal that a machine learning system is not aware of
during training. Novelty detection is one of the fundamental requirements of a good classi-cation or identi-cation system
since sometimes the test data contains information about objects that were not known at the time of training the model. In
this paper we provide state-of-the-art review in the area of novelty detection based on statistical approaches. The second
part paper details novelty detection using neural networks. As discussed, there are a multitude of applications where novelty
detection is extremely important including signal processing, computer vision, pattern recognition, data mining, and robotics.
? 2003 Elsevier B.V. All rights reserved.
Keywords: Novelty detection review; Statistical approaches; Gaussian mixture models; Hidden Markov models; KNN; Parzen density
estimation; String matching; Clustering
0165-1684/$ - see front matter ? 2003 Elsevier B.V. All rights reserved.
doi:10.1016/j.sigpro.2003.07.018
2482 M. Markou, S. Singh / Signal Processing 83 (2003) 2481 – 2497
data and using a distance measure and a threshold studies address the above principles. Each approach
for determining abnormality. In recent years novelty has a number of di?erent methods and we detail of the
detection has been used in a number of other important studies in these areas.
applications especially signal processing and image
analysis (e.g. biometrics). In these applications the
problem becomes more complicated with multiple 2. Statistical approaches
classes, high dimensionality, noisy features and quite
often not enough samples. As such, novelty detection Statistical approaches are mostly based on mod-
methods have tried to keep up with these problems to elling data based on its statistical properties and using
o?er solutions that can be used in the real world. In this information to estimate whether a test samples
this paper we review some of the currently used meth- comes from the same distribution or not. The tech-
ods on novelty detection using statistical approaches. niques used vary in terms of their complexity [40].
There are several important issues related to novelty The simplest approach can be based on constructing a
detection. We can summarise them in terms of the density function for data of a known class, and then as-
following principles. suming that data is normal computing the probability
of a test sample of belonging to that class. The prob-
(a) Principle of robustness and trade-o5: a nov-
ability estimate can be thresholded to signal novelty.
elty detection method must be capable of robust
Another simple model can simply -nd the distance of
performance on test data that maximises the
the sample from a class mean and threshold on the
exclusion of novel samples while minimising
basis of how many standard deviations away the sam-
the exclusion of known samples. This trade-o?
ple is [36,37]. The distance measure itself can be Ma-
should be, to a limited extent, predictable and
halanobis or some other probabilistic distance [57].
under experimental control.
Manson et al. [35] also discuss the choice of features
(b) Principle of uniform data scaling: in order to as-
based on their ability for novelty detection. In such a
sist novelty detection, it should be possible that
scheme a novelty index is used to rank features on their
all test data and training data after normalisation
ability for detecting novelty. Another simple statisti-
lie within the same range [49].
cal scheme for outlier rejection is based on the use of
(c) Principle of parameter minimisation: a novelty
box-plots [33]. The box plot is a well-known display
detection method should aim to minimise the
of the -ve-number summary (lower extreme, lower
number of parameters that are user set.
quartile, median, upper quartile, upper extreme). Box
(d) Principle of generalisation: the system should be
plots are most suitable for exploring both symmetric
able to generalise without confusing generalised
and skewed quantitative univariate data, but they can
information as novel [55].
also identify infrequent values from categorical data.
(e) Principle of independence: the novelty detection
The data (univariate) is sorted in ascending order and
method should be independent of the number of
the outliers are ranked according to the frequencies
features, and classes available and it should show
of univariate outlier values. Samples with the highest
reasonable performance in the context of imbal-
frequencies are discarded. A predetermined percent-
anced data set, low number of samples, and noise.
age of the worst examples can be discarded. Knorr et
(f) Principle of adaptability: a system that recog-
al. [31] suggest that an object O in a data set T is a
nises novel samples during test should be able to
distance-based (DB) outlier if at least fraction p of
use this information for retraining [47].
the objects in T lie at a distance greater than D from
(g) Principle of computational complexity: a num-
O. More speci-cally, based on a standard multidimen-
ber of novelty detection applications are online
sional indexing structure, they execute a range search
and therefore the computational complexity of a
with radius D for each object O. Once there are at
novelty detection mechanism should be as less as
least M neighbours in the D-neighbourhood they stop
possible.
the search and declare O a non-outlier otherwise O is
In this survey, we study a number of approaches rejected as an outlier. The technique is based in other
to novelty detection and remark on how well these words on a nearest neighbour scheme such as [26]
M. Markou, S. Singh / Signal Processing 83 (2003) 2481 – 2497 2483
and [21]. The main contribution of this study is that assumption on the form of data distribution and are
the authors present a solution for fast indexing and therefore more Iexible (though more computationally
search in large multidimensional databases. Several expensive). Density estimation in these cases can be
advanced statistical modelling techniques also exist performed using either nearest neighbour methods or
for novelty detection. For example, one can use mix- Parzen window method. Once more, a probability es-
ture models for modelling complex data distributions timate of the test sample belonging to the distribution
or using a hidden markov model for novelty detection can be obtained which can be thresholded. These ap-
as discussed later. proaches have limitations with regards to the choice of
Two main approaches exist to the estimation parameters (neighbours used, smoothing parameters),
of the probability density function, parametric and and diJculty in tackling noise in data. We discuss the
non-parametric methods [15]. The parametric ap- parametric and non-parametric approaches in detail in
proach assumes that the data comes from a family Sections 2.1 and 2.2, respectively.
of known distributions, such as the normal distri-
bution and certain parameters are calculated to -t 2.1. Parametric approaches
this distribution. However, in most real world situ-
ations the underlying distribution of the data is not Parametric approaches make an assumption that
known therefore such techniques have little practical data distributions are Gaussian in nature and they
importance. In non-parametric methods the overall can be modelled statistically based on data means
form of the density function is derived from the data and covariance. A number of studies have theoret-
as well as the parameters of the model. As a result ically investigated novelty detection for such data.
non-parametric methods give greater Iexibility in Of particular importance is the trade-o? between the
general systems. One of the most intuitive and widely recognition rate and the proportion of data rejected
used non-parametric technique is histogram analysis. [24]. The error rate and the reject rate are commonly
However, the shape of the density estimate can vary used to describe the performance level of a pattern
signi-cantly, depending on the choice of the origin, recognition system. Because of uncertainty and noise
which is often purely arbitrary. Things get even worse inherent in any pattern recognition task, errors are
when the density estimation of multivariate data is generally unavoidable. The option to reject is intro-
required. A better way of estimating density functions duced to safeguard against excessive misclassi-cation.
is kernel methods. They are similar to histograms However, the trade-o? between the errors and rejects
in that they are built up of a number of individual is seldom one to one. Whenever the reject option is
kernels centred on the sampled data points. A param- exercised, some outliers of known classes are also re-
eter h determines the width of the kernel and how jected. Chow [8] studied the trade-o? in detail to -nd
smooth the estimation becomes. The kernel function an optimal threshold for rejection. It is obvious that a
is usually a symmetric probability density function, recognition rule is optimum if for a given error rate
being non-negative over its domain and should inte- (error probability) it minimizes the reject rate (reject
grate to unity over the de-ned range. The value of probability). The author showed that the optimum
the density at some arbitrary value x is dependent on rule is to reject a pattern if the maximum of the a pos-
the distance between the observed data and the shape teriori probabilities is less than some threshold. The
of the kernel. Clearly, the smoothing parameter h error and reject trade-o? were derived for the Bayes
plays a central role on the estimation of the density. optimum recognition system with an option to reject.
Choosing a global value h might result in a density Hansen et al. [24] extended the work of Chow by
function that does not adequately describe the data, introducing the role of classi-er con-dence in its de-
particularly in regions of low data density (various cisions. Intuitively a rejection rule is based on the
techniques may be employed to locally adjust h). amount of con-dence a classi-er has on a given clas-
Parametric methods for estimating the probability si-cation. The novelty rejection scheme used in this
density function have sometimes limited use since study was based on an ensemble of neural network
they require extensive a priori knowledge of the prob- classi-ers employing a consensus scheme. The pattern
lem. Non-parametric statistical approaches make no is classi-ed in the category with the majority of votes.
2484 M. Markou, S. Singh / Signal Processing 83 (2003) 2481 – 2497
If the number of votes the winning class receives is , novelty detection performance of a system. Wei et al.
then a threshold is placed on =N where N is the num- [58] suggest how arti-cial anomalies (novelties) can
ber of classi-ers. If =N is below the threshold then be injected into the training data to help the learner
the pattern is rejected. discover a boundary around the original data. To gen-
In most recognition tasks, the underlying proba- erate arti-cial anomalies close to the known data, a
bility distributions of the patterns are not completely useful heuristic is to randomly change the value of one
known. Hence, the optimal threshold as given by feature of an existing example while leaving the other
Chow is no longer very useful. Fumera et al. [21] features unaltered. Sparse regions are characterized by
tackle this problem. Their method provides an al- infrequent values of individual features. To amplify
ternative, which works well even if the a posteriori sparse regions, they proportionally create more arti--
probabilities are a?ected by errors. The authors sug- cial anomalies around them. The technique only works
gest using multiple thresholds, one for each class. if the novel class (anomalies) do not overlap with the
The threshold is placed on the maximum a posteri- known classes. After the random anomalies are gener-
ori probability just as in Chow [8] but in this case ated, a classi-cation takes place and random anomalies
each class has a di?erent threshold. Their results us- that are misclassi-ed as known are removed from the
ing nearest neighbour and neural network classi-ers data. The process is repeated until a satisfactory size
show that the scheme outperforms the one based on of stable random anomalies data is created. This sys-
parametric assumption. tem was tested on a network intrusion detection task.
Another improvement to Chow’s original study was It was found that better performance is obtained if the
made by Foggia et al. [19]. In pattern recognition prob- training data is augmented with data of the anomalous
lems the idea of combining various experts with the classes. It was also found that the number of random
aim of compensating the weakness of each single ex- anomalies generated is not critical for the system per-
pert, has been widely investigated. The main problem formance.
is that the combination rule should be able to solve the We now discuss some of the advanced methods
conIicts i.e. when the experts disagree. In these cases of parametric statistical novelty detection including
the experts’ -nal decision may be unreliable and it probabilistic/GMM approaches (Section 2.1.1), Hid-
might be desirable to reject such patterns for more so- den Markov models (Section 2.1.2) and hypothesis
phisticated processing. Foggia et al. extend the work of testing (Section 2.1.3). These approaches are more
Cordella et al. [9] and De Stefano et al. [51], applying complex than the use of linear schemes of novelty de-
their technique to multi expert systems. The technique tection [17] and they are primarily aimed at data den-
is also similar to that of Hansen [24]. In this paper, a re- sity estimation using robust statistics.
ject option for the MES architecture is de-ned, which
drops the assumption of knowing the exact values of
a posteriori probabilities. The reject option is de-ned 2.1.1. Probabilistic/GMM approaches
for an MES architecture with the Bayesian combin- Gaussian mixture modelling (GMM) models gen-
ing (BC) rule. The BC rule estimates the a posteriori eral distributions estimating the density using fewer
probability that the input sample belongs to each class kernels than the number of patterns in the train-
and selects the class with the highest probability. A ing set. The parameters of the model are chosen by
value between 0 and 1 can be calculated for these two maximising the log likelihood of the training data
reasons with values near 1 characterizing reliable clas- with respect to the model. This is done using opti-
si-cation and values near 0 indicating unreliable clas- misation algorithms such as conjugate gradients or
si-cations. These two parameters can be combined in reestimation techniques such as the EM algorithm.
several ways and then thresholded for rejection. The However, GMM su?er from the curse of dimen-
threshold is computed by maximizing a function that sionality in the sense that if the dimensionality of
measures the MES classi-cation e?ectiveness in the the data is high, a very large number of samples
considered application domain. are needed to train the model. A number of studies
Some approaches have suggested the use of ar- have used GMM for novelty detection as described
ti-cially generated anomalies that can improve the below.
M. Markou, S. Singh / Signal Processing 83 (2003) 2481 – 2497 2485
Roberts and Tarassenko [45] developed a robust all novel cases but will reject a lot of normal cases.
method for novelty detection, which aims to mini- The solution presented here is the implementation of
mize the number of heuristically chosen thresholds in a local novelty threshold that depends on the density
the novelty decision process. The novelty detection of the data in that region of input space. The space
method used is similar to those of Barnett and Lewis is partitioned using the k-means algorithm to several
[2], Bishop [4], Tarassenko [53], Parra et al. [41], Tax parts according to their input space and the density
and Duin [55], Desforges et al. [15], Brotherton et function is calculated independently within each par-
al. [5], Tarassenko et al. [54], Yeung and Chow [63], tition. The precise number of partitions is not critical.
and others, based on the density function of the train- The authors considered two ways of density estima-
ing data estimated with a GMM. The parameters of tion, Gaussian mixture models and Parzen windows
the GMM are estimated with the EM algorithm. The and found that Parzen windows work much better due
major contribution of this paper is that the algorithm to the unavailability of a large number of training sam-
decides to add a Gaussian unit based on some auto- ples. The technique was tested on 120 images from the
matic criterion that determines the number of Gaus- MIAS database. In all of the 40 cases the mass-like
sians. The growth decision is based on monitoring the structures were correctly identi-ed as novel except in
smallest Mahalanobis distance between a training vec- two cases. Unfortunately a large number of false pos-
tor and each Gaussian within the network. The growth itives were also discovered.
threshold is initially set to 0 and it is progressively A similar approach is taken by Tarssenko et al. [54]
increased. The algorithm ensures that every Gaussian for jet engine fault detection. The input data is initially
has seen each sample in the training data at least once. pre-processed and a simpler model of the distribution
The maximum growth threshold found by the algo- of the input space is chosen in a transformed space.
rithm is also used as the novelty threshold. If the max- The transform is such that the Euclidean distance in
imum posterior probability of a test vector is below the transformed space is equal to the Mahalanobis dis-
this threshold then it is rejected as a novelty. The tech- tance in the original space. The transformation is done
nique was tested on a medical signal-processing task using two di?erent methods, component-wise normal-
to detect epileptic seizures. A total of 195 Gaussian isation and the whitening transform. The distribution
kernels were grown by the system exhibiting very high of normal vectors in the transformed space is approxi-
performance rates. mated by a small number of spherical clusters selected
Tarassenko [53] applies his novelty detection tech- using the k-means clustering algorithm (k ¡ 5). Each
nique to the detection of masses in mammograms. cluster radius is calculated as the average distance be-
Mammography is a well-suited problem for novelty tween the feature vectors belonging to the cluster and
detection as often the question is whether a mass ex- the cluster centre. The novelty test is based on thresh-
ists in the mammogram or not, and examples of ab- olding the shortest normalised distance of the test
normal tissue are often scarce compared to examples vector to a cluster centre. The distance to the nearest
of normal tissue. The idea is to build a model of nor- cluster mean is normalised by the radius of the clus-
mality of the training data using only normal examples ter in order to account for varying data densities in
and then compare the test patterns against this model. di?erent regions of the input space. If the test vector
An assumption is made that the abnormalities are uni- is suJciently far from all cluster centres then it is in
formly distributed outside the boundaries of normality. a region of space with very few known training sam-
The description of normality is made using the uncon- ples and hence it is deemed to be novel. The novelty
ditional probability density estimation of the training threshold is set so as to accept all training samples.
data. If a test vector falls in a region of input space This method was tested on jet engine fault detection
with a density under a pre-determined threshold then and compared with Parzen windows method in previ-
the test vector is considered to be novel. This tech- ous study [53]. The method described slightly outper-
nique is very similar to Bishop [4], except that it tack- forms Parzen windows for the component-wise nor-
les regions in the training space with low density but malisation transformation.
with legitimate training objects. If such regions exist, Parra et al. [41] present a new scheme of den-
setting a global novelty threshold will fail to reject sity estimation. Novelty detection is performed by
2486 M. Markou, S. Singh / Signal Processing 83 (2003) 2481 – 2497
-nding the underlying density of the training data us- set as the most ‘novel’ point in the training set. The
ing this novel technique. A hyper-sphere is drawn to method was tested on a fault detection task in a jet en-
separate known regions from unknown regions. Novel gine. The component-wise normalisation gives much
objects should ideally fall outside this hyper-sphere. better results in recognising previously unseen nor-
An appropriate threshold separates known from new mal engine data although the whitening transform is
test objects. The strength of this study is its density slightly better in recognising abnormal data. The au-
estimation technique that can be used with non-linear thors also suggest using a Monte Carlo simulation to
distributions or distributions for which no a priori in- generate synthetic data to facilitate estimating the data
formation is available. The factorization of a joint density of the training data when not enough samples
probability distribution was formulated as a minimal are present.
mutual information criterion under the constraint of Tax and Duin [55] describe three methods of re-
volume conservation. Volume conservation is imple- jecting outliers based on the data density distribution.
mented by symplectic maps. A Gaussian upper bound Data objects in low probability areas are rejected us-
leads to a computationally eJcient optimisation tech- ing these methods [2,4,53]. The techniques are based
nique, which in turn facilitates density estimation. The on di?erent methods of computing data density: mix-
method was tested for motor fault detection. The data tures of Gaussians, Parzen windows and a nearest
was very high dimensional and a combination of linear neighbour-based estimator. These methods for nov-
PCA and symplectic maps was used to reduce the di- elty detection are then compared with a method based
mensionality. The novelty detection method then gave on classi-er instability introduced in this study. Using
as good or slightly better results when compared to the mixtures of Gaussians method and Parzen win-
various other methods including MLP, RBF, nearest dows, the density distribution of the training data is
neighbour and others. found. When the di?erence between the new object
Nairac et al. [38,39] present a method for novelty and the mean of the training objects is larger than
detection based on a probabilistic framework applied three standard deviations in the training distribution,
to those tasks when only a limited amount of train- the new object is rejected. With the nearest neighbour
ing data is available. It is important that the dimen- method, a slightly di?erent approach is followed to
sionality of the data is kept relatively small because -nd the probability density. The distance of the new
in order for a good probability model to be estimated, object and its nearest neighbour in the training set is
the number of inputs must be signi-cantly larger than found and the distance of this nearest neighbour and
the number of dimensions. Because of the small num- its nearest neighbour in the training set is also found.
ber of samples in the training set, a GMM cannot be The quotient between the -rst and the second distance
used. Instead, a normalising transformation is applied is taken as indication of the novelty of the object. The
to the whole of the training data and a small num- new method of novelty detection introduced here is
ber of spherical basis functions are then -tted to the based on classi-er instability. They use a linear clas-
transformed data. The authors have used two normal- si-er based on maximizing Fisher’s criterion and ex-
ising methods; component-wise normalisation and a tending the two-class classi-er for multiple classes.
whitening transformation. The placement of the basis They take bootstrap samples of the same size as the
functions is done by a k-means algorithm with Eu- original training set and train several classi-ers. The
clidean distance in the transformed space. For nov- outputs of these classi-ers di?er. The variation in the
elty detection, the average distance between the data outputs indicates the diversity between di?erent train-
vectors and their corresponding basis function in the ing sets. A large variation indicates that the object is
training data is computed. The novelty of a test vec- hard to classify. This variation is then thresholded.
tor is assessed by computing the shortest normalised The results show that this new method only outper-
distance to a kernel centre. This is a locally de-ned forms the distribution probability methods when only
measure of novelty, since it is based on the distance a small training set is available. In all other cases,
of the nearest kernel and it is normalised by the aver- and especially when there is an abundance of training
age distance of the data within that kernel. A valida- data, Parzen windows method is the best. However,
tion set is used to set the novelty threshold which is the drawback of using Parzen windows is that a large
M. Markou, S. Singh / Signal Processing 83 (2003) 2481 – 2497 2487
number of samples are required for a proper estima- on. However, only the component closest to the data
tion and the width parameter needs to be user-set. point concerned (in the Mahalanobis sense) is used
For the Gaussian case, the problem is selecting the to calculate the EVT probability as this dominates
correct number of kernels required which is often in- the EVT statistics. The EM algorithm is used to esti-
tuitive. The system was tested on hand written digits mate the parameters of the GMM using the minimum
with good success. length coding to penalize models with higher com-
Roberts [43,44] proposes extreme value theory plexity. The technique was tested on a tremor and an
(EVT) for novelty detection that concerns abnormally epilepsy data set with the objective of discriminating
low or high values in the tails of data distributions. between normal and abnormal behaviour as well as
As more data is observed, the value of this extreme on a noise removal task in an image (salt and pepper
data changes. Knowledge of such statistics is useful noise). The technique exhibited very good perfor-
in such tasks as novelty detection, outlier removal or mance on all three tasks without the need of setting
for rejecting classi-cation or regression of patterns novelty thresholds that plague other techniques based
that lie away from the expected statistics of some on statistical novelty detection.
training data set. A lot of work has been done previ- Yamanishi et al. [59] present SmartSifter (SS), an
ously in this area for data sets that are known to be outlier detection system from the viewpoint of statis-
pure i.e. data sets that contain patterns from only one tical learning theory, which works in data mining to
class, e.g. [4,5,41,53]. These methods can be used in detect fraud, network intrusion, network monitoring,
cases where data for the normal case is abundant and etc. The work is focused on outlier detection based on
easily obtainable whereas examples for the abnormal unsupervised learning of the information source. Ev-
case are rare and very expensive to obtain such as ery time a datum is input, it is required to evaluate how
in medical and fault detection domains. However, large the datum has deviated compared to a normal
these methods become very sensitive in those cases pattern. SS uses a probabilistic model as representa-
where the normal case contains very few examples of tion of an underlying mechanism for data generation.
the abnormal class. These examples are not enough The probability density over the domain of categori-
for classi-cation but enough to inIuence the density cal variables is found using a histogram and a -nite
distribution estimation of the normal class by -tting mixture model is employed for each histogram cell to
abnormal data thus making the detection of abnormal represent the probability density over the domain of
cases very problematic. EVT forms representations continues variables. Every time a datum is input, an
for the tails of distributions. Fisher and Tippett [18] on-line learning algorithm is employed to update the
showed that if the distribution function over a data model. The authors have developed the sequentially
point is to be stable as the number of samples tends to discounting Laplace estimation for learning the his-
in-nity then it must weakly converge under a positive togram density over the categorical domain and the
aJne transform. The second key theorem of Fisher sequentially discounting expectation and maximizing
and Tippett states that if the distribution is a non- for learning the -nite mixture for the continues do-
degenerate limit distribution for normalised maxima main. SS gives a score to each datum on the basis
then it can take only one of three forms. The -rst of of the learned model indicating how much the model
these forms is referred to as Gumbel distribution, and has changed after learning. A high score means that
used by Roberts [43,44]. His research is ultimately the datum is an outlier. According to the authors, the
concerned with samples drawn from a distribution novel features of SS include the fact that SS is adap-
whose maxima distribution converges to the Gumbel tive to non-stationary data. This is useful in the cases
form. This distribution gives a probability of observ- when drifting sources of time-series data are tackled.
ing some extreme value and the parameters can be Added to that, a score calculated by SS has a clear
determined using a Monte-Carlo simulation. A Gaus- statistical and information theoretic meaning. While
sian mixture model (GMM) is used to estimate the in other works heuristics such as cost or distances
distribution of the data, hence the EVT probability such as Mahalanobis, etc. are used to describe outliers,
can be used directly as the EVT distribution is derived SS de-nes a score for a datum based on how much
from the same range of data that the GMM is -tted the model has shifted after learning it. Finally, SS is
2488 M. Markou, S. Singh / Signal Processing 83 (2003) 2481 – 2497
computationally inexpensive and it can deal with both contains a small number of outliers. They model the
categorical and continues variables. The system was pattern distribution as the composition of a big per-
successfully tested on the network intrusion database, centage of normal patterns and a small proportion of
KDD Cup, 1999. corrupted patterns. If is the proportion of anoma-
Spence et al. [50] independently developed a class lous patterns, PN is the distribution of normal patterns
of models for probability distributions of images called and PO is the distribution of anomalous patterns then
hierarchical image probability (HIP) models. HIP con- the distribution of the whole training set is given as:
structs a tree-structured graph of the dependencies be- P(x) = (1 − )PN (x) + PO (x). The value can be in-
tween hidden variables at di?erent scales, and uses terpreted as the prior probability for outlying patterns
mixtures of multivariate Gaussians to model the local and the modelling is called the “mixture alternative”.
distributions of vectors of features. The hidden vari- The learning task can now be decomposed into three
ables are similar to Markov random -elds (MRF). The steps: (a) estimate P(x) from the given training data;
method is applied to mammographic images to reject (b) decompose P(x) into PN (x) and PO (x), and (c)
those that contain cancerous regions given a set of decide whether x is an outlier given the probabilities
normal images as training data. Due to the tree struc- PN (x), PO (x) and the prior . The second step is very
ture, the belief network for the hidden variables is rel- diJcult since there is no knowledge of anomalous
atively straightforward to train with an EM algorithm. patterns or their number. In addition, the distribution
The expectation step can be performed directly. The PO (x) cannot be estimated reliably due to the small
expectation is weighted by the probability of a label number of outliers in the training data. Therefore, the
or a parent–child pair of labels given the image. Once second step cannot be performed directly, and instead,
the expectations are computed, the normal distribu- an assumption on the distribution PO (x) is made and
tion makes the M step tractable. Detecting novel ex- the proportion , and PN (x) is estimated on the basis
amples can be useful in a CAD system for generating of that. The estimation of PN (x) is equivalent to the
con-dence measures on the CAD output and identify- determination of the distribution’s parameters. The us-
ing data that could be useful for future training of the age of GMM is considered here and the EM algorithm
model. The HIP model’s generative structure enables is employed to estimate the parameters. The approach
novel examples to be identi-ed by thresholding the is run with di?erent values of . The parameter can
log-likelihood of the models. The system was tested be optimised to resample the expected proportion of
on mammography images but no extensive results are outliers. In this respect, is interpreted as a parameter
shown nor a comparison is made with competing sys- of the algorithm that controls the sensitivity against
tems. In addition, the authors do not explain how the anomalous patterns. The system was tested on arti-cial
novelty threshold is selected or how critical it is to the datasets of various complexities and a real database of
system performance. medical data with patients carrying rare diseases. The
Almost all statistical approaches dealing with nov- results were compared with the traditional approach to
elty detection are based on modelling the density of novelty detection using data densities but without pro-
the training data and rejecting test patterns that fall viding for outliers in the training set. This approach
in regions of low density. However, in order for such outperforms the traditional approach.
techniques to work the training data itself needs to be Hickinbotham and Austin [27] describe a novelty
either free of outliers or the outliers to be known. The detection method based on Gaussian mixture mod-
most common data descriptor is the Gaussian mix- els (GMM) for sensor fault detection. The structural
ture model (GMM). Lauer [32] detail a method that health of airframes is often monitored by analysing the
works when the training data is corrupted with an un- frequency of occurrence matrix (FOOM) produced af-
known number of outliers. This approach is more ro- ter each Iight. Unfortunately, these FOOMs also get
bust against outliers so it can tolerate a small number corrupted through time and corrupted FOOMs need
of them in the training data. To calibrate the algorithm, to be -ltered out. There are two classes of sensor
classi-ed validation patterns are needed or a rough es- fault. The -rst is a random addition of counts and
timate of the proportion of outliers in the training data. the second is a shift in the response of a sensor to
They start with the presumption that the training data the load it monitors. Noise distortion is independent
M. Markou, S. Singh / Signal Processing 83 (2003) 2481 – 2497 2489
of the nature of the Iight. They search for noise using that can be used to build a hierarchy and avoids many
the mean and variance of each cell in the FOOMs. By poor locally optimal solutions that can trap the EM
setting a threshold it is possible to count the number of algorithm. By constraining the entropy, they implic-
cells whose value was “unlikely” using the Gaussian itly smooth the surface on which EM is hill climbing,
measure of probability. A suitable threshold can be thus reducing the number of local optima in which EM
set experimentally. For the second problem it is nec- might get trapped. To build the hierarchy, the algo-
essary to reduce the dimensionality of the input data. rithm starts with in-nite temperature so that no mat-
The authors used the eigenface algorithm to achieve ter how many classes EM might assume, all parame-
this reduction. After the dimensionality of the train- ter estimates converge to the same value. Thus, there
ing data is reduced, the authors model the distribution is e?ectively only one class and it becomes the root
of FOOM data using a Gaussian basis function neu- of the hierarchy. As the temperature is lowered, the
ral network. The EM algorithm is used to estimate the system undergoes phase transitions at which the e?ec-
parameters of the GMM. The GMM provides a prob- tive number of clusters grows by splitting one node
ability density function which forms the basis of the in the hierarchy into two children. The approach is
novelty measure N . It is useful to scale N so that its extended to on-line novelty detection by assuming a
value lies between 0 and 1. A threshold is placed on hierarchy obtained either by existing data or a priori
N reject patterns with low probability belonging to knowledge and determining for each subsequent doc-
the mixture model. The experimental results showed ument whether it is the -rst to report on a given topic
good performance rejecting faulty FOOMs but a large or not. If the document is most likely in one of the
number of healthy FOOMs were also rejected. The new nodes then the document is labelled as new. Sim-
authors state that this is probably due to the high vari- ilarly, Hansen et al. [25] discuss the role of Gaussian
ability of strain events between Iights. It is therefore mixture modelling for textual data and novelty detec-
anticipated that the addition of more data will correct tion. The main purpose of this study is to demonstrate
this problem. how well known techniques for signal/data analysis
Novelty detection in textual data based applications can also be used for textual information.
has also generated considerable interested in the past
[1]. Topic detection and tracking (TDT), or novelty
detection, is a variant of the traditional classi-cation 2.1.2. Hidden Markov models (HMM)
problem to allow the classi-cation of new classes. HMMs are stochastic models for sequential data
In TDT new topics emerge often which means that [16]. A given HMM contains a -nite number of unob-
the classi-er has to expand its seto? classes continu- servable (hidden) states. State transitions are governed
ously. The authors cluster the unlabelled documents by a stochastic process to form Markov chains. At each
using a multinomial naTUve Bayes classi-er estimat- state, some state-dependent events can be observed.
ing the parameters using the EM algorithm. The esti- The emission probabilities of these observable events
mated probabilities are used to classify the unlabelled are determined by a probability distribution, one for
documents to one of the known classes or a novel each state. To estimate the parameters of an HMM for
class. In a novelty detection task, the parameter esti- modelling normal system behaviour, sequences of nor-
mation of newly identi-ed classes is unreliable due to mal events collected from normal system operation are
a complete lack of data. The authors propose an ag- used as training data. An expectation-maximization
glomeration of the classes into hierarchy, with each (EM) algorithm is used to estimate the parameters.
node formed recursively by pooling the data by its Once an HMM has been trained, when confronted with
children. Then they use shrinkage to improve these test data, probability measures can be thresholded for
estimates. Shrinkage linearly interpolates the param- novelty detection.
eter estimates of each leaf in the hierarchy with the Yeung et al. [64] describe the use of HMMs for
parameter estimates of the leaf’s ancestors. As an al- novelty detection for the application of intrusion
ternative to the EM algorithm, the authors use deter- detection based on pro-ling system call sequences
ministic annealing (DA) for clustering. DA is based and shell command sequences (computer security).
on probabilistic and information-theoretic principles They present two main approaches. The -rst one is
2490 M. Markou, S. Singh / Signal Processing 83 (2003) 2481 – 2497
rejected. Hellman showed that going from the single and the min–max parameters in the hyper-box meth-
NN to the new rule, the increase in reject rate is twice ods increase the method’s sensitivity to novelty. The
the decrease in probability of error. Thus, if errors are experimental results showed that the elliptical surface
at least twice as costly as rejects, the new rule always is the most suitable for novelty detection.
has lower total cost than the single NN rule. This rule A number of techniques model the density of the
can be extended to examine k NNs and classify only training data and use it for novelty detection, e.g.
when all the neighbours agree. This approach is very [2,4,5,15,41,53–55,64]. This requires a large number
conservative and hence the author proposed an alter- of samples to overcome the curse of dimensionality.
native: examine the k NNs and if all neighbours agree In fact most other novelty detection techniques re-
classify the test pattern otherwise reject it. It has to be quire a large amount of data to form the ‘normal’ class
said here that the author does not claim this to be a especially in high dimensions. The simplest model
novelty detection approach, but merely a method for when just a little amount of data is available is the uni-
rejecting patterns with high misclassi-cation risk. It modal normal distribution. Tax and Duin [56] suggest
can however work for novelty detection if the novel modelling the probability density of the data with a
classes induce high confusion in the classi-er; other- unimodal normal distribution and threshold the prob-
wise the novel classes will be erroneously classi-ed ability density accepting 95% of the data placing a
to one of the training classes. threshold on the Mahalanobis distance. Instead of
Guttormsson et al. [23] present a novelty detection modelling the complete probability density, an in-
method applied to rotor fault detection. The training dication can be obtained by comparing distances.
data is -rst subjected to simple outlier removal to en- This method is based on the local density of the test
sure pureness. Any pattern that lies more than three object and the nearest neighbour in the training set.
standard deviations from the average is removed from The distance between the test pattern and its nearest
the set. For novelty detection, a surface is imposed neighbour in the training data is found along with the
around the set of healthy signals. If a new signal falls distance of the neighbour and it’s own nearest neigh-
outside this surface it is deemed to be novel. Surfaces bour. The quotient between the two distances is an
that can be placed around the healthy points include indication of novelty. The Euclidian distance is used
a spherical boundary, an elliptical boundary, a rectan- for this purpose. This method is very useful for dis-
gular boundary formed by the extrema of the data, or tributions with relatively fast decaying probabilities.
min-max surface and nearest neighbour boundaries. In The techniques were tested on both real and arti-cial
the -rst two methods, a new pattern is compared to the data and found to be very useful when very little
centre of either the hyper-sphere using the Euclidean amount of training data exist (less than -ve samples
distance or the ellipse using the Mahalanobis distance. per feature).
For the min–max technique, the smallest possible box Jiang et al. [29] propose a two-phase clustering
containing all the healthy data is used. The dimensions algorithm for outlier detection based on a modi-ed
of the box are determined by the minima and max- k-means algorithm and a minimum spanning tree
ima of the signature signals. The nearest neighbour (MST). The k-means algorithm is modi-ed as to
novelty detector allows for more general data topol- calculate the minimum distance between any pair of
ogy. Here, minimum Euclidean distances are found cluster centres. If the distance of a pattern and its
between each point and its closest neighbour. The dis- closest cluster is larger than this distance, then the
tance proportional to the maximum of these distances pattern is assigned to a new cluster. In the extreme
is then used as a decision parameter. Every incoming case, each pattern will form its own cluster and there-
point is compared to every point in the healthy set. If fore so an upper limit kmax is de-ned. When kmax is
the new point is at a greater distance from each of the reached, two nearest clusters are merged. The cluster
healthy points than the decision parameter then a nov- centres de-ned with the modi-ed k-means algorithm
elty is declared. The radius of the hyper-sphere and the are regarded as nodes, which are used to form an
ellipse controls the trade-o? between novelty detec- MST based upon the distance between every two
tion and false alarms. Similarly the ball drawn around nodes. After the MST is constructed, the longest edge
each training point in the nearest neighbour method of a tree is removed from the forest and it is replaced
2492 M. Markou, S. Singh / Signal Processing 83 (2003) 2481 – 2497
with two newly obtained sub trees. The small clusters that document. The algorithm is very simple. When a
(the tree with less number of nodes) are selected and new document arrives, it is compared with all the doc-
regarded as outliers. Three di?erent datasets were uments available. If its nearest neighbour in its past
used to compare this technique with the traditional has a cosine similarity score below a threshold, then
k-means clustering algorithm. In all of their exper- the document is labelled as novel meaning that it is the
iments, the new technique outperformed baseline -rst story of a novel event otherwise it is labelled as
techniques tested. The datasets used include the iris old and added to the history. The threshold is set using
data, sugar cane breeding data and e-mail log data. cross-validation. The algorithm works at two-levels.
Yang and Liu [60] describe two approaches to nov- At the -rst level, a classi-er determines the broad topic
elty detection in document classi-cation: linear least of the arrived document and at the second level the
squares -t (LLSF) and k-nearest neighbour (k-NN) novelty detector decides if the document describes a
classi-er approach. The k-NN algorithm is very sim- new event or an old event. The method was applied
ple. The k nearest neighbours of a test pattern are to a benchmark document database and showed very
found in the training set. The categories of the k near- good performance. The technique is very simplistic
est neighbours are used to weight the category candi- and will probably fail in most other domains that ex-
dates. The similarity score of each neighbour is used hibit high complexity between the various objects. It is
as a weight. If several neighbours share the same cat- however an interesting approach to novelty detection.
egory, then the per-neighbour weights of that cate-
gory are added together, and the resulting weighted
2.2.2. Parzen density estimation
sum is used as the likelihood score of that category
Parzen windows method [16] can be used for
with respect to the test pattern. A threshold set using
non-parametric data density estimation. Yeung and
cross-validation can be applied to determine whether
Chow [63] follow a well-established novelty detec-
a test pattern is suJciently ‘novel’ to be rejected.
tion approach, based on estimating the density of
In LLSF, a multivariate regression model is auto-
the training data and rejecting patterns (similar to
matically learnt from the training data and its cat-
[4,5,15,41,53–55]). The authors apply their technique
egories. By solving a linear least-squares -t on the
on an intrusion detection problem. The authors have
training pairs (input–output) of vectors, one can ob-
chosen Gaussian kernel functions for two reasons.
tain a matrix of regression coeJcients. The solution
First, the Gaussian function is smooth and hence the
matrix de-nes a mapping from an arbitrary pattern
density estimation also varies smoothly and second,
to a vector of weighted categories. By sorting these
if a radially symmetrical Gaussian is assumed, the
category weights, a ranked list of categories is ob-
function can be completely speci-ed by a variance
tained for the pattern. By thresholding on these cate-
parameter only. The novelty threshold is set using a
gory weights, category assignments can be obtained.
separate training set called ‘threshold determination
Each category has a speci-c threshold determined us-
set’ and it is applied on the unconditional probability
ing cross-validation. A minimum on each threshold
p(x) of a test pattern x based on the modelled distri-
can be imposed for novelty detection and pattern re-
bution. The technique was tested using the data set
jection.
from KDD Cup, 1999.
Yang et al. [62] describe a very simple novelty
detection method applied to document classi-cation.
Training data from old events is used to learn use- 2.2.3. String matching approaches
ful statistics for the prediction of new (novel) events. String matching approaches are based on treating
Their approach consists of the following steps: clas- training data as templates represented by a string
sifying documents into broad topics each of which (vector of features) and then computing some mea-
consists of multiple events, identifying named enti- sure of dissimilarity between training and test data.
ties, optimising their weight relative to normal words Forrest et al. [20] present a method for solving the
for each topic, and computing a stop-word list per problem of distinguishing self from Non-Self using
topic. Finally, they measure the novelty of a new doc- a change-detection algorithm based on the way the
ument conditioned on the system-predicted topic for natural immune system achieves the same task. The
M. Markou, S. Singh / Signal Processing 83 (2003) 2481 – 2497 2493
self-data is converted to binary format forming a previous work of Dasgupta and Forrest [11] and Das-
collection S. Then a large number of random strings gupta and Nino [14] to multi-class approach. Specif-
are generated forming a set R0 . Strings from R0 are ically the non-self (unknown) space will be further
matched against the strings in S and those that match classi-ed into multiple subclasses to determine the
are eliminated. Since perfect matching is extremely level of abnormality. The technique is implemented
rare, the matching criterion is relaxed so as to con- on a computer intrusion task. In an anomaly detec-
sider only r contiguous matches in the strings. Once tion system it is an acceptable assumption that the
R0 is created, new patterns are converted to binary normal operation of the system can be characterized
and matched against R0 . If a match is found, then the by a series of observations over time. Also, normal
new pattern belongs to Non-Self and is rejected. A system behaviour generally exhibits stable patterns
number of experiments were performed designed to when observed over a period of time. This is the com-
test various aspects of the detection system. The sys- mon ground in many fault detection systems includ-
tem was tested on network intrusion tasks and found ing those proposed by Tarassenko [53], Japcowicz
to perform extremely well. [28], Dasgupta and Forrest [11], Tarassenko [54], and
A further extension to this work is provided by Campbell and Bennett [6]. According to Dasgupta and
Dasgupta and Forrest [11] and Dasgupta and Nino Gonzalez, a naTUve approach to the task is to determine
[14] whose work is biologically inspired (based on the minimum and maximum values of the monitored
immuno-computing concepts [52]). In our bodies nov- parameters and measure the abnormality as a devia-
elty detection is carried out by T-cells that have recep- tion from these values. However, such an approach
tors in their surface that can detect foreign proteins. will not consider the fact that normality is dependent
The body releases a large number of T-cells but only on time and values that might be acceptable at a given
allows those that do not match any of the body’s own time might not be acceptable at a di?erent time. Added
cells to circulate. If a T-cell matches any cell in the to that, the notion of normality depends on the correla-
body then that cell is deemed as foreign and it is dealt tion and interaction of various parameters (features).
with. The novelty detection technique presented here In this paper, a sliding window is used for pattern char-
works in a similar way. A suJcient subset of the train- acterization and the normal behaviour of the system is
ing data is taken and the analogue values are made represented by a subspace called Self and its compli-
discrete by sampling. Each point is assigned an integer ment non-self. Two approaches are described here for
value and then converted to binary form. Thereafter, a fault detection, positive characterization and negative
large set of strings, called detectors, is generated that characterization. Positive characterization is a nearest
do not match the strings obtained by the training data. neighbour technique that records the Euclidean dis-
If a new data point, after being encoded, matches any tance between a test vector and its nearest neighbour in
of the detectors then a deviation from the normal sys- the Self subspace. A user-set value determines the al-
tem behaviour is evaluated and it is treated as novel lowable variability in the Self subspace. If the distance
data. The matching is performed based on a matching exceeds this value then the vector is deemed to be ab-
threshold r that sets the number of bits that have to normal. The technique is implemented using KD-Tree
match before two strings are deemed similar. One ma- for faster querying. The negative characterization ap-
jor concern, as identi-ed by the authors, of this system proach is more in tune with Dasgupta and Forrest [11]
is the matching threshold. This has to be data speci-c and Dasgupta and Nino [14], but implemented using
and its correct setting is essential for satisfactory sys- genetic algorithms (GAs) to build a representation of
tem performance. With larger r the detectors become the non-self subspace using the Self subspace as input.
too sensitive to any novelty in the data whereas small GAs are used to evolve rules to cover the non-self sub-
r might not result on a reasonable size of detector set. space, although the shape of neither Self or non-self
This technique was tested on tool breakage detection subspaces is known a priori (the patterns that belong
and the results on an average of 50 runs showed very to self can be used to approximate those subspaces).
good performance in detecting abnormal behaviour. The genetic algorithm attempts to evolve ‘good’ rules
Further work in this area extends the previous two that cover the non-self space. The goodness of rule is
studies [12]. This paper extends the idea presented in determined by various factors: the number of normal
2494 M. Markou, S. Singh / Signal Processing 83 (2003) 2481 – 2497
samples that cover the space, its area and the overlap Conventional data analysis methods of fMRIs assume
with other rules. This is a multi-objective, multi-modal that a model of the requisite function is available (for
optimisation problem. The objective is to -nd not only example, a brain’s response to a designed cognitive or
a single solution but a number of solutions that co- motor stimulus) and that the validity of this model can
operatively solve the problem. Since the covering of be tested by statistical methods of inference.
the non-self space is accomplished by a set of rules, However, in the case of neuroscience, researchers are
it is necessary to evolve multiple rules. A sequential probing brain function with increasingly complex cog-
niching algorithm is employed to evolve these dif- nitive experiments. This complexity demands that any
ferent rules. During system operation, if a test vec- model validation approach should be versatile, adap-
tor falls in the non-self space then it is deemed to be tive, data-driven and model-free. Therefore, novelty
abnormal. The two techniques were tested and com- detection is a prime candidate in solving this problem.
pared on a computer intrusion detection system. The EvIdent clusters time courses within the volumetric
data was obtained from the MIT-Lincoln Lab. The at- data, using a much-enhanced variant of Bezdek’s
tack free data was used for training and the system fuzzy C-means algorithm [3]. The time courses are
was tested using both systems. The negative charac- separated in such a way that the intra-cluster distance
terization approach was clearly more eJcient (in time is minimised, while simultaneously maximising the
and space) compared to the positive characterization inter-cluster distances. The fuzzy index m controls
although the positive characterization exhibited more the fuzziness of the cluster portions: as m approaches
precise results. 1, the fuzzy C-means algorithm converges to a clas-
Finally, Dasgupta and Majumdar [13] extend the sical hard means algorithm. As m approaches ∞, all
above studies for multi-dimensional data. The tech- cluster centroids tend towards the centroid of all time
nique proposed is identical to the one described in courses. The use of the fuzzy C-means as opposed
those previous papers except that the multidimensional to the classical k-means algorithm is dictated by the
data is -rst passed through PCA for dimensionality ability of the former to avoid getting stuck to local
reduction discarding features that accounted for less minima of the objective function it tries to optimise.
than 20% variability selecting only two dimensions. Fuzzy cluster analysis produces for each cluster a
The two dimensions were binary encoded as before but membership map for each anatomical slice of the
unlike those studied the binary strings were also gray volumetric data. The membership map is an image
coded. The system was tested on a network anomaly representing the degree of membership of each active
detection task and although good overall results were voxel to each cluster. By thresholding a membership
obtained, some anomalies went undetected. map, those voxels that belong to the cluster with at
least the user-speci-ed level of membership can be
identi-ed.
2.2.4. Clustering approaches Yang et al. [61] attempt to automatically detect
Clustering based approaches are aimed at partition- novel events from a temporally ordered stream of news
ing data into a number of clusters, where each data stories, either retrospectively or as the stories arrive.
point can be assigned a degree of membership to each The objective is to identify stories in several continu-
of the clusters. If the degree of membership is thresh- ous news streams that belong to previously unidenti-
olded to suggest if a data point belongs or not to a -ed events. This can be done in an on-line fashion, i.e.
cluster, novelty can be detected when a sample be- as the events occur or an accumulated collection. In
longs to none of the available classes. retrospective event detection, stories are grouped to-
Pizzi et al. [42] describe EvIdent, a data analy- gether where each cluster uniquely identi-es an event.
sis software for quickly detecting, investigating and In on-line event detection each document is labelled
visualizing novel events in a set of images as they as it arrives in sequence with a new or old Iag indi-
evolve in time and/or frequency. This work follows cating whether or not the document is the -rst story
the earlier research by Scarth et al. [48]. For instance, discussing a novel event. Two clustering approaches
in magnetic resonance neuro-images, novelty may are investigated: an agglomerative (hierarchical) al-
manifest itself as neural activations in a time course. gorithm based on group-average clustering (GAC),
M. Markou, S. Singh / Signal Processing 83 (2003) 2481 – 2497 2495
[18] R.A. Fisher, L.H.C. Tippett, Limiting forms of the frequency [36] G. Manson, G. Pierce, K. Worden, On the long-term stability
distribution of the largest and smallest member of a sample, of normal condition for damage detection in a composite
Proc. Camb. Philos. Soc. 24 (1928) 180–190. panel, Proceedings of the 4th International Conference on
[19] P. Foggia, C. Sansone, F. Tortorella, M. Vento, Multi- Damage Assessment of Structures, Cardi?, UK, June 2001.
classi-cation: reject criteria for the Bayesian combiner, [37] G. Manson, G. Pierce, K. Worden, T. Monnier, P. Guy,
Pattern Recognition 32 (1999) 1435–1447. K. Atherton, Long term stability of normal condition data
[20] S. Forrest, A.S. Perelson, L. Allen, R. Cherukuri, Self-non-self for novelty detection, Proceedings of the 7th International
discrimination in a computer, Proceedings of the IEEE Symposium on Smart Structures and Materials, California,
Symposium on Research in Security and Privacy, Oakland, USA, 2000.
CA, USA, May 1994, pp. 202–212. [38] A. Nairac, T. Corbett-Clark, R. Ripley, N. Townsend, L.
[21] G. Fumera, F. Roli, G. Giacinto, Reject option with multiple Tarassenko, Choosing an appropriate model for novelty
thresholds, Pattern Recognition 33 (2000) 2099–2101. detection, Proceedings of the 5th IEEE International
[22] R.S. Guh, F. Zorriassatine, J.D.T. Tannock, On-line control Conference on Arti-cial Neural Networks, Cambridge, 1997,
chart pattern detection and discrimination—a neural network pp. 227–232.
approach, Arti-cial Intell. Eng. 13 (1999) 413–425. [39] A. Nairac, N. Townsend, R. Carr, S. King, P. Cowley, L.
[23] S.E. Guttormsson, R.J. Marks II, M.A. El-Sharkawi, Elliptical Tarassenko, A system for the analysis of jet engine vibration
novelty grouping for on-line short-turn detection of excited data, Integrated Comput. Aided Eng. 6 (1999) 53–65.
running rotors, IEEE Trans. Energy Conversion 14(1) (March [40] T. Odin, D. Addison, Novelty detection using neural network
1999). technology, Proceedings of the COMADEN Conference,
[24] L.K. Hansen, C. Liisberg, P. Salamon, The error-reject Houston, TX, 2000.
tradeo?, Open Systems Inform. Dynamics 4 (1997) 159–184. [41] L. Parra, G. Deco, S. Miesbach, Statistical independence
[25] L.K. Hansen, S. Sigurdsson, T. Kolenda, F.A. Nielson, U. and novelty detection with information preserving non-linear
Kjems, J. Larsen, Modelling text with generalizable Gaussian maps, Neural Comput. 8 (2) (1995) 260–269.
mixtures, Proceedings of IEEE ICASSP ’2000, Vol. 6, [42] N.J. Pizzi, R.A. Vivanco, R.L. Somorjai, EvIdent: a functional
Istanbul, Turkey, 2000, pp. 3494 –3497. magnetic resonance image analysis system, Artif. Intell. Med.
[26] M.E. Hellman, The nearest neighbour classi-cation with a 21 (2001) 263–269.
reject option, IEEE Trans. Systems Sci. Cybernet. 6 (3) (July [43] S.J. Roberts, Novelty detection using extreme value statistics,
1970) 179–185. IEE Proc. Vision, Image Signal Process. 146 (3) (1999)
[27] S.J. Hickinbotham, J. Austin, Neural networks for novelty 124–129.
detection in airframe strain data, Proceedings of IEEE IJCNN, [44] S.J. Roberts, Extreme value statistics for novelty detection
Como, Italy, 2000. in biomedical signal processing, Proceedings of the 1st
[28] N. Japkowicz, C. Myers, M. Gluck, A novelty detection International Conference on Advances in Medical Signal and
approach to classi-cation, Proceedings of the 14th IJCAI Information Processing, 2002, pp. 166 –172.
Conference, Montreal, 1995, pp. 518–523. [45] S. Roberts, L. Tarassenko, A probabilistic resource allocating
[29] M.F. Jiang, S.S. Tseng, C.M. Su, Two-phase clustering network for novelty detection, Neural Comput. 6 (1994)
algorithm for outliers detection, Pattern Recognition Lett. 22 270–284.
(2001) 691–700. [46] R. Ruotolo, C. Surace, A statistical approach to damage
[30] S.P. King, D.M. King, P. Anuzis, K. Astley, L. Tarassenko, detection through vibration monitoring, Proceedings of the
P. Hayton, S. Utete, The use of novelty detection techniques 5th Pan American Congress of Applied Mechanics, Puerto,
for monitoring high-integrity plant, Proceedings of the 2002 Rico, 1997.
International Conference on Control Applications, Vol. 1, [47] R. Saunders, J.S. Gero, The importance of being emergent,
Cancun, Mexico, 2002, pp. 221–226. Proceedings of the Arti-cial Intelligence in Design, 2000.
[31] E.M. Knorr, R.T. Ng, V. Tucakov, Distance-based outliers: [48] G. Scarth, M. McIntyre, B. Wowk, R. Somorjai, Detection
algorithms and applications, VLDB J. 8 (3– 4) (2000) of novelty in functional images using fuzzy clustering,
237–253. Proceedings of the 3rd Meeting ISMRM, Nice, France, 1995,
[32] M. Lauer, A mixture approach to novelty detection using p. 238.
training data with outliers, in: L. De Raedt, P. Flach (Eds.), [49] S. Singh, M. Markou, An approach to novelty detection
Proceedings of the 12th European Conference on Machine applied to the classi-cation of image regions, IEEE Trans.
Learning, Springer, Freiburg, Germany, 2001, pp. 300 –311. Knowledge Data Eng., 2003, in press.
[33] J. Laurikkala, M. Juhola, E. Kentala, Informal identi-cation of [50] C. Spence, L. Parra, P. Sajda, Detection, synthesis and
outliers in medical data, Intelligent Data Analysis in Medicine compression in mammographic image analysis with a
and Pharmacology (IDAMAP-2000), Berlin, August 2000. hierarchical image probability model, IEEE Workshop
[34] C. Manikopoulos, S. Papavassiliou, Network intrusion and on Mathematical Methods in Biomedical Image Analysis,
fault detection: a statistical anomaly approach, IEEE Comm. MMBIA 2001, Kauai, HI, 2001, pp. 3–10.
Mag. 40 (October 2002). [51] C.D. Stefano, C. Sansone, M. Vento, To reject or not to reject:
[35] G. Manson, Identifying damage sensitive, environment that is the question—an answer in case of neural classi-ers,
insensitive features for damage detection, Proceedings of the IEEE Trans. Systems, Man Cybernet. Part C 30 (1) (2000)
IES Conference, Swansea, UK, 2002. 84–94.
M. Markou, S. Singh / Signal Processing 83 (2003) 2481 – 2497 2497
[52] A.O. Tarakanov, V.A. Skormin, Pattern recognition by [59] K. Yamanishi, J. Takeuchi, G. Williams, On-line
immunocomputing, Proceedings of the 2002 Congress on unsupervised outlier detection using -nite mixtures with
Evolutionary Computation, CEC ’02., Vol. 1, Honolulu, HI, discounting learning algorithms, Proceedings of the 6th
2002, pp. 938–943. ACM SIGKDD International Conference on Knowledge
[53] L. Tarassenko, Novelty detection for the identi-cation Discovery and Data Mining, Boston, MA, USA, August 2000,
of masses in mammograms, Proceedings of the 4th IEE pp. 320 –324.
International Conference on Arti-cial Neural Networks, Vol. [60] Y. Yang, X. Liu, A re-examination of text categorization
4, Cambridge, UK, 1995, pp. 442– 447. methods, Proceedings of the ACM SIGIR Conference on
[54] L. Tarassenko, A. Nairac, N. Townsend, P. Cowley, Research and Development in Information Retrieval, 1999,
Novelty detection in jet engines, IEE Colloquium on pp. 42– 49.
Condition Monitoring, Imagery, External Structures and [61] Y. Yang, T. Pierce, J. Carbonell, A study on retrospective
Health, Birmingham, UK, 1999, pp. 41– 45. and on-line event detection, Proceedings of the ACM SIGIR
[55] D.M.J. Tax, R.P.W. Duin, Outlier detection using classi-er Conference on Research and Development in Information
instability, in: Advances in Pattern Recognition, the Joint Retrieval, Melbourne, Australia, 1998, pp. 28–36.
IAPR International Workshops, Sydney, Australia, 1998, pp. [62] Y. Yang, J. Zhang, J. Carbonell, C. Jin, Topic-conditioned
593– 601. novelty detection International Conference on Knowledge
[56] D.M.J. Tax, R.P.W. Duin, Data description in subspaces, Discovery and Data Mining, July 2002.
International Conference on Pattern Recognition, Vol. 2, [63] D.Y. Yeung, C. Chow, Parzen window network intrusion
Barcelona, 2000. detectors, Proceedings of the International Conference on
[57] A. Webb, Statistical Pattern Recognition, Arnold, Paris, 1999. Pattern Recognition, Quebec, Canada, 2002.
[58] F. Wei, M. Miller, S.J. Stolfo, L. Wenke, P.K. Chan, [64] D.Y. Yeung, Y. Ding, Host-based intrusion detection using
Using arti-cial anomalies to detect unknown and known dynamic and static behavioral models, Pattern Recognition
network intrusions, Proceedings of the IEEE International 36 (2002) 229–243.
Conference on Data Mining, ICDM 2001, San Jose, CA, 2001,
pp. 123–130.