
Signal Processing 99 (2014) 215–249

Contents lists available at ScienceDirect

Signal Processing
journal homepage: www.elsevier.com/locate/sigpro

Review

A review of novelty detection


Marco A.F. Pimentel*, David A. Clifton, Lei Clifton, Lionel Tarassenko
Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford OX3 7DQ, UK

Article history:
Received 17 October 2012
Received in revised form 16 December 2013
Accepted 23 December 2013
Available online 2 January 2014

Keywords:
Novelty detection
One-class classification
Machine learning

Abstract

Novelty detection is the task of classifying test data that differ in some respect from the data that are available during training. This may be seen as "one-class classification", in which a model is constructed to describe "normal" training data. The novelty detection approach is typically used when the quantity of available "abnormal" data is insufficient to construct explicit models for non-normal classes. Application includes inference in datasets from critical systems, where the quantity of available normal data is very large, such that "normality" may be accurately modelled. In this review we aim to provide an updated and structured investigation of novelty detection research papers that have appeared in the machine learning literature during the last decade.

© 2014 Published by Elsevier B.V.

Contents

1. Introduction
 1.1. Novelty detection as one-class classification
 1.2. Overview of reviews on novelty detection
 1.3. Methods of novelty detection
 1.4. Organisation of the survey
2. Probabilistic novelty detection
 2.1. Parametric approaches
  2.1.1. Mixture models
  2.1.2. State-space models
 2.2. Non-parametric approaches
  2.2.1. Kernel density estimators
  2.2.2. Negative selection
 2.3. Method evaluation
3. Distance-based novelty detection
 3.1. Nearest neighbour-based approaches
 3.2. Clustering-based approaches
 3.3. Method evaluation
4. Reconstruction-based novelty detection
 4.1. Neural network-based approaches
 4.2. Subspace-based approaches
 4.3. Method evaluation
5. Domain-based novelty detection
 5.1. Support vector data description approaches
 5.2. One-class support vector machine approaches
 5.3. Method evaluation
6. Information-theoretic novelty detection
 6.1. Method evaluation
7. Application domains
 7.1. Electronic IT security
 7.2. Healthcare informatics/medical diagnostics and monitoring
 7.3. Industrial monitoring and damage detection
 7.4. Image processing/video surveillance
 7.5. Text mining
 7.6. Sensor networks
8. Conclusion
Acknowledgements
References

* Corresponding author.
E-mail address: marco.pimentel@eng.ox.ac.uk (M.A.F. Pimentel).

0165-1684/$ - see front matter © 2014 Published by Elsevier B.V.
http://dx.doi.org/10.1016/j.sigpro.2013.12.026

1. Introduction

Novelty detection can be defined as the task of recognising that test data differ in some respect from the data that are available during training. Its practical importance and challenging nature have led to many approaches being proposed. These methods are typically applied to datasets in which a very large number of examples of the "normal" condition (also known as positive examples) is available and where there are insufficient data to describe "abnormalities" (also known as negative examples).

Novelty detection has gained much research attention in application domains involving large datasets acquired from critical systems. These include the detection of mass-like structures in mammograms [1] and other medical diagnostic problems [2,3], faults and failure detection in complex industrial systems [4], structural damage [5], intrusions in electronic security systems, such as credit card or mobile phone fraud detection [6,7], video surveillance [8,9], mobile robotics [10,11], sensor networks [12], astronomy catalogues [13,14] and text mining [15]. The complexity of modern high-integrity systems is such that only a limited understanding of the relationships between the various system components can be obtained. An inevitable consequence of this is the existence of a large number of possible "abnormal" modes, some of which may not be known a priori, which makes conventional multi-class classification schemes unsuitable for these applications. A solution to this problem is offered by novelty detection, in which a description of normality is learnt by constructing a model with numerous examples representing positive instances (i.e., data indicative of normal system behaviour). Previously unseen patterns are then tested by comparing them with the model of normality, often resulting in some form of novelty score. The score, which may or may not be probabilistic, is typically compared to a decision threshold, and the test data are then deemed to be "abnormal" if the threshold is exceeded.

This survey aims to provide an updated and structured overview of recent studies and approaches to novelty detection that have appeared in the machine learning and signal processing literature. The complexity and main application domains of each method are also discussed. This review is motivated in Section 1.2, in which we examine previous reviews of the literature, concluding that a new review is necessary in light of recent research results.

1.1. Novelty detection as one-class classification

Conventional pattern recognition typically focuses on the classification of two or more classes. General multi-class classification problems are often decomposed into multiple two-class classification problems, where the two-class problem is considered the basic classification task [16,17]. In a two-class classification problem we are given a set of training examples X = {(x_i, ω_i) | x_i ∈ R^D, i = 1…N}, where each example consists of a D-dimensional vector x_i and its label ω_i ∈ {−1, 1}. From the labelled dataset, a function h(x) is constructed such that for a given input vector x′ an estimate of one of the two labels is obtained, ω = h(x′|X):

    h(x′|X): R^D → [−1, 1]

The problem of novelty detection, however, is approached within the framework of one-class classification [18], in which one class (the specified normal, positive class) has to be distinguished from all other possibilities. It is usually assumed that the positive class is very well sampled, while the other class(es) is/are severely under-sampled. The scarcity of negative examples can be due to high measurement costs, or the low frequency at which abnormal events occur. For example, in a machine monitoring system, we require an alarm to be triggered whenever the machine exhibits "abnormal" behaviour. Measurements of the machine during its normal operational state are inexpensive and easy to obtain. Conversely, measurements of failure of the machine would require the destruction of similar machines in all possible ways. Therefore, it is difficult, if not impossible, to obtain a very well-sampled negative class [19]. This problem is often compounded for the analysis of critical systems such as human patients or jet engines, in which there is significant variability between individual entities, thereby limiting the use of "abnormal" data acquired from other examples [20,19].

In the novelty detection approach to classification, "normal" patterns X are available for training, while "abnormal" ones are relatively few. A model of normality M(θ), where θ represents the free parameters of the model, is inferred and used to assign novelty scores z(x) to previously unseen test data x.

Larger novelty scores z(x) correspond to increased "abnormality" with respect to the model of normality. A novelty threshold z(x) = k is defined such that x is classified "normal" if z(x) ≤ k, or "abnormal" otherwise. Thus, z(x) = k defines a decision boundary. Different types of models M, methods for setting their parameters θ, and methods for determining novelty thresholds k have been proposed in the literature and will be considered in this review.

Two interchangeable synonyms of novelty detection [21,1] often used in the literature are anomaly detection and outlier detection [22]. The different terms originate from different domains of application to which one-class classification can be applied, and there is no universally accepted definition. Merriam-Webster [23] defines "novelty" to mean "new and not resembling something formerly known or used". Anomalies and outliers are two terms used most commonly in the context of anomaly detection; sometimes interchangeably [24]. Barnett and Lewis [25] define an outlier as a data point that "appears to be inconsistent with the remainder of that set of [training] data". However, it is also used to describe a small fraction of "normal" data which lie far away from the majority of "normal" data in the feature space [9]. Therefore, outlier detection aims to handle these "rogue" observations in a set of data, which can have a large effect on the analysis of the data. In other words, outliers are assumed to contaminate the dataset under consideration and the goal is to cope with their presence during the model-construction stage. A different goal is to learn a model of normality M(θ) from a set of data that is considered "normal", in which the assumption is that the data used to train the learning system constitute the basis to build a model of normality and the decision process on test data is based on the use of this model. Furthermore, the notion of normal data as expressed in anomaly detection is often not the same as that used in novelty detection. Anomalies are often taken to refer to irregularities or transient events in otherwise "normal" data. These transient events are typically noisy events, which give rise to artefacts that act as obstacles to data analysis, to be removed before analysis can be performed. From this definition, novel data are not necessarily anomalies; this distinction has also been drawn by recent reviews in anomaly detection [24]. Nevertheless, the term "anomaly detection" is typically used synonymously with "novelty detection", and because the solutions and methods used in novelty detection, anomaly detection, and outlier detection are often common, this review aims to consider all such detection schemes and variants.

1.2. Overview of reviews on novelty detection

This review is timely because there has not been a comprehensive review of novelty detection since the papers by Markou and Singh [26,27] in this journal ten years ago. A number of surveys have been published since then [26–32,24], but none of these attempts to be as wide-ranging as we are in our review. We cover not only the topic of novelty detection but also the related topics of outlier, anomaly and, briefly, change-point detection, using a taxonomy which is appropriate for the state of the art in the research literature today.

Markou and Singh distinguished between two main categories of novelty detection techniques: statistical approaches [26] and neural network based approaches [27]. While appropriate in 2003, these classifications are now problematic, due to the convergence of statistics and machine learning. The former are mostly based on using the statistical properties of data to estimate whether a new test point comes from the same distribution as the training data or not, using either parametric or non-parametric techniques, while the latter come from a wide range of flexible non-linear regression and classification models, data reduction models, and non-linear dynamical models that have been extensively used for novelty detection [33,34]. Another review of the literature of novelty detection using machine learning techniques is provided by Marsland [28]. The latter offers brief descriptions of the related topics of statistical outlier detection and novelty detection in biological organisms. The author emphasises some fundamental issues of novelty detection, such as the lack of a definition of how different a novel biological stimulus can be before it is classified as "abnormal", and how often a stimulus must be observed before it is classified as "normal". This issue is also acknowledged by Modenesi and Braga [35], who describe novelty detection strategies applied to the domain of time-series modelling.

Hodge and Austin [29], Agyemang et al. [30], and Chandola et al. [36] provide comprehensive surveys of outlier detection methodologies developed in machine learning and statistical domains. Three fundamental approaches to the problem of outlier detection are addressed in [29]. In the first approach, outliers are determined with no prior knowledge of the data; this is a learning approach analogous to unsupervised clustering. The second approach is analogous to supervised classification and requires labelled data ("normal" or "abnormal"). In this latter type, both normality and abnormality are modelled explicitly. Lastly, the third approach models only normality. According to Hodge and Austin [29], this approach is analogous to a semi-supervised recognition approach, which they term novelty detection or novelty recognition. As with Markou and Singh [26,27], outlier detection methods are grouped into "statistical models" and "neural networks" in [29,30]. Additionally, the authors suggest another two categories: machine learning and hybrid methods. According to Hodge and Austin [29], most "statistical" and "neural network" approaches require cardinal or ordinal data to allow distances to be computed between data points. For this reason, the machine learning category was suggested to include multi-type vectors and symbolic attributes, such as rule-based systems and tree-structure based methods. Their "hybrid" category covers systems that incorporate algorithms from at least two of the other three categories. Again, research since 2004 makes the use of these two categories problematical.

The most recent comprehensive survey of methods related to anomaly detection was compiled by Chandola et al. [24]. Their focus is on the detection of anomalies; i.e., "patterns in data that do not conform to expected behaviour" [24, p. 15:1]. This survey builds upon the three previous methods discussed in [29,30,36] by expanding

the discussion of each method considered and adding two more categories of anomaly detection techniques: information theoretic techniques, which analyse the information content of the dataset using information-theoretic measures such as entropy; and spectral techniques, which attempt to find an approximation of the data using a combination of attributes that capture the bulk of the variability in the data. The surveys [29,30,24] agree that approaches to anomaly detection can be supervised, unsupervised, or semi-supervised. More recently, Kittler et al. [37] addressed the problem of anomaly detection in machine perception (where the key objective is to instantiate models to explain observations), and introduced the concept of domain anomaly, which refers to the situation when none of the models characterising a domain are able to explain the data. The authors argued that the conventional notions of anomalies in data (such as being an outlier or distribution drift) alone cannot detect all anomalous events of interest in machine perception, and proposed a taxonomy of domain anomalies, which distinguishes between component, configuration, and joint component and configuration domain anomaly events.

Some novelty detection methods have been the topic of a number of other very brief overviews that have recently been published [31,38–41,32]. Other surveys have focused on novelty detection methods used in specific applications such as cyber-intrusion detection [6,42,7] and wireless sensor networks [12].

Only a few of the recent surveys attempt to provide a comprehensive review of the different methods used in different application domains. Since the review paper by Markou and Singh [26,27], we believe that there has been no rigorous review of all the major topics in novelty detection. In fact, many reviews recently published contain fewer than 30 references (e.g., the reviews [35,40,41]), and do not include significant papers from the literature. The most recent comprehensive survey of a related topic (anomaly detection) was published by Chandola et al. [24]. However, as discussed in the previous subsection, although they can be seen as related topics, there are some fundamental differences between anomaly detection and novelty detection. Also, Chandola et al. [24] do not attempt to review the novelty detection literature, which itself has attracted significant attention within the research community as shown by the increasing number of publications in this field in the last decade. In this review, we therefore aim to provide a comprehensive overview of novelty detection research, but also include anomaly detection, outlier detection, and related approaches. To the best of our knowledge, this is the first attempt (since 2003) to provide such a structured and detailed review.

1.3. Methods of novelty detection

Approaches to novelty detection include both Frequentist and Bayesian approaches, information theory, extreme value statistics, support vector methods, other kernel methods, and neural networks. In general, all of these methods build some model of a training set that is selected to contain no examples (or very few) of the important (i.e., novel) class. Novelty scores z(x) are then assigned to data x, and deviations from normality are detected according to a decision boundary that is usually referred to as the novelty threshold z(x) = k.

Different metrics are used to evaluate the effectiveness and efficiency of novelty detection methods. The effectiveness of novelty detection techniques can be evaluated according to how many novel data points are correctly identified and also according to how many normal data are incorrectly classified as novel data. The latter is also known as the false alarm rate. Receiver operating characteristic (ROC) curves are usually used to represent the trade-off between the detection rate and the false alarm rate. Novelty detection techniques should aim to have a high detection rate while keeping the false alarm rate low. The efficiency of novelty detection approaches is evaluated according to computational cost, and both time and space complexity. Efficient novelty detection techniques should be scalable to large and high-dimensional data sets. In addition, depending on the specific novelty detection task, the amount of memory required to implement the technique is typically considered to be an important performance evaluation metric.
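This trade-off can be computed directly from labelled test scores. The following sketch, using made-up scores and labels, sweeps the novelty threshold k and records the resulting (false alarm rate, detection rate) pairs that form an empirical ROC curve.

```python
import numpy as np

# Hypothetical novelty scores z(x) for a labelled test set:
# y = 1 marks truly novel points, y = 0 marks truly normal points.
scores = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90, 0.95])
y      = np.array([0,    0,    0,    0,    1,    0,    1,    1,    1,    1])

def roc_points(scores, y):
    """Sweep the novelty threshold k over the observed scores and
    record, for each k, the false alarm rate (normal data incorrectly
    flagged as novel) and the detection rate (novel data correctly
    flagged)."""
    points = []
    for k in np.unique(scores):
        flagged = scores >= k
        detection_rate = float(np.mean(flagged[y == 1]))
        false_alarm_rate = float(np.mean(flagged[y == 0]))
        points.append((false_alarm_rate, detection_rate))
    return points

for far, dr in roc_points(scores, y):
    print(f"false alarm rate = {far:.2f}, detection rate = {dr:.2f}")
```

Lowering k raises the detection rate at the cost of more false alarms; the area under the resulting curve is a common threshold-independent summary of effectiveness.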
We classify novelty detection techniques according to the following five general categories: (i) probabilistic, (ii) distance-based, (iii) reconstruction-based, (iv) domain-based, and (v) information-theoretic techniques. Approach (i) uses probabilistic methods that often involve a density estimation of the "normal" class. These methods assume that low density areas in the training set indicate that these areas have a low probability of containing "normal" objects. Approach (ii) includes the concepts of nearest-neighbour and clustering analysis that have also been used in classification problems. The assumption here is that "normal" data are tightly clustered, while novel data occur far from their nearest neighbours. Approach (iii) involves training a regression model using the training set. When "abnormal" data are mapped using the trained model, the reconstruction error between the regression target and the actual observed value gives rise to a high novelty score. Neural networks, for example, can be used in this way and can offer many of the same advantages for novelty detection as they do for regular classification problems. Approach (iv) uses domain-based methods to characterise the training data. These methods typically try to describe a domain containing "normal" data by defining a boundary around the "normal" class such that it follows the distribution of the data, but does not explicitly provide a distribution in high-density regions. Approach (v) computes the information content in the training data using information-theoretic measures, such as entropy or Kolmogorov complexity. The main concept here is that novel data significantly alter the information content in a dataset.

1.4. Organisation of the survey

The rest of the survey is organised as follows (see Fig. 1). We provide a state-of-the-art review of novelty detection research based on approaches from the different categories. Probabilistic novelty detection approaches are described in Section 2, distance-based novelty detection approaches are

Fig. 1. Schematic representation of the organisation of the survey (the numbers within brackets correspond to the section where the topic is discussed).

presented in Section 3, reconstruction-based novelty detection in medical laboratory reference data. Box-plots graphically
approaches are described in Section 4. Sections 5 and 6 focus depict groups of numerical data using five quantities: the
on domain-based and information-theoretic techniques, smallest observation (sample minimum), lower quartile (Q1),
respectively. The application domains for all five categories median (Q2), upper quartile (Q3), and largest observation
of novelty detection methods discussed in this review are (sample maximum). The method used in [45] starts by
summarised in Section 7. In Section 8 we provide an overall transforming the original data so as to achieve a distribution
conclusion for this review. that is close to a Gaussian distribution (by applying the Box-
Cox transformation). Then, the lower and upper quartiles (Q1
and Q3, respectively) are estimated for the transformed
2. Probabilistic novelty detection data, and the interquartile range (IQR), which is defined
by IQR ¼ Q 3  Q 1, is used to define two detection limits:
Probabilistic approaches to novelty detection are based Q 1  ð1:5  IQRÞ and Q 3 þ ð1:5  IQRÞ. All values located out-
on estimating the generative probability density function side the two limits are identified as outliers. Although experi-
(pdf) of the data. The resultant distribution may then be ments have shown that the algorithm has potential for outlier
thresholded to define the boundaries of normality in the detection, they also suggest that the normalisation of dis-
data space and test whether a test sample comes from the tributions achieved by use of the transformation functions is
same distribution or not. Training data are assumed to be generated from some underlying probability distribution D, which can be estimated using the training data. This estimate D̂ usually represents a model of normality. A novelty threshold can then be set using D̂ in some manner, such that it has a probabilistic interpretation.

The techniques proposed vary in terms of their complexity. The simplest statistical techniques for novelty detection can be based on statistical hypothesis tests, which are equivalent to discordancy tests in the statistical outlier detection literature [25]. These techniques determine whether a test sample was generated from the same distribution as the "normal" data or not, and are usually employed to detect outliers. Many of these statistical tests, such as the frequently used Grubbs' test [43], assume a Gaussian distribution for the training data and work only with univariate continuous data, although variants of these tests have been proposed to handle multivariate data sets; e.g., Aggarwal and Yu [44] recently proposed a variant of the Grubbs' test for multivariate data. The Grubbs' test computes the distance of the test data points from the estimated sample mean and declares any point with a distance above a certain threshold to be an outlier [43]. This requires a threshold parameter to determine the length of the tail that includes the outliers (and which is often associated with a distance of three standard deviations from the estimated mean). Another simple statistical scheme for outlier detection is based on the use of the box-plot rule. Solberg and Lahti [45] have applied this technique to eliminate outliers

not sufficient to allow the algorithm to work as expected. Many other sophisticated statistical tests have been used to detect anomalies and outliers, as discussed in [25]. A description of these statistical tests is beyond the scope of this review. Instead, we will concentrate on more advanced statistical modelling methods that are used for novelty detection involving complex, multivariate data distributions.

The estimation of some underlying data density D from multivariate training data is a well-established field [46,47]. Broadly, these techniques fall into parametric and non-parametric approaches, in which the former impose a restrictive model on the data, which results in a large bias when the model does not fit the data, while the latter set up a very flexible model by making fewer assumptions. The model grows in size to accommodate the complexity of the data, but this requires a large sample size for a reliable fit of all free parameters. Opinion in the literature is divided as to whether various techniques should be classified as parametric or non-parametric. For the purposes of providing a probabilistic estimate D̂, Gaussian mixture models (GMMs) and kernel density estimators have proven popular. GMMs are typically classified as a parametric technique [26,24,41], because of the assumption that the data are generated from a weighted mixture of Gaussian distributions. Kernel density estimators are typically classified as a non-parametric technique [33,26,24] as they are closely related to histogram methods, one of the earliest forms of non-parametric density estimation. Parametric and non-parametric approaches are discussed in the next two sub-sections (see Table 1).
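The Grubbs-style rule discussed above, in which a point is declared an outlier when its distance from the estimated sample mean exceeds roughly three standard deviations, can be sketched in a few lines of NumPy. The threshold of three and the synthetic data are illustrative choices for this sketch, not part of the original test:

```python
import numpy as np

def three_sigma_outliers(train, test, n_std=3.0):
    """Flag test points whose distance from the training-set mean
    exceeds n_std training-set standard deviations."""
    mu, sigma = np.mean(train), np.std(train)
    return np.abs(test - mu) > n_std * sigma

rng = np.random.default_rng(0)
normal_data = rng.normal(loc=0.0, scale=1.0, size=1000)  # "normal" training data
test_points = np.array([0.5, -2.0, 10.0])
flags = three_sigma_outliers(normal_data, test_points)   # only 10.0 is flagged
```

As the text notes, such a rule is univariate and leans on a Gaussian assumption for the training data; multivariate variants replace the scaled distance with, for example, the Mahalanobis distance.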
220 M.A.F. Pimentel et al. / Signal Processing 99 (2014) 215–249
Table 1
Examples of novelty detection methods using both parametric and non-parametric probabilistic approaches.

Probabilistic approach | Section | References
Parametric | 2.1 |
  Mixture models | 2.1.1 | Filev and Tseng [48,49], Flexer et al. [50], Ilonen et al. [51], Larsen [52], Paalanen et al. [53], Pontoppidan and Larsen [54], Song et al. [55] and Zorriassatine et al. [56]
  Extreme value theory | 2.1.1 | Clifton et al. [57-61], Hazan et al. [62], Hugueny et al. [63], Roberts [64,65], Sohn et al. [66] and Sundaram et al. [67]
  State-space models | 2.1.2 | Gwadera et al. [68,69], Ihler et al. [70], Janakiram et al. [71], Lee and Roberts [72], McSharry et al. [73,74], Ntalampiras et al. [75], Pinto et al. [76], Qiau et al. [77], Quinn et al. [2,78], Siaterlis and Maglaris [79], Williams et al. [80], Wong et al. [81,82], Yeung and Ding [83] and Zhang et al. [84]
Non-parametric | 2.2 |
  Kernel density estimators | 2.2.1 | Angelov [85], Bengio et al. [86,87], Erdogmus et al. [88], Kapoor et al. [89], Kemmler et al. [90], Kim and Lee [91], Ramezani et al. [92], Subramaniam et al. [93], Tarassenko et al. [94,95], Vincent et al. [96] and Yeung and Chow [97]
  Negative selection | 2.2.2 | Dasgupta and Majumbar [98], Esponda et al. [99], Gómez et al. [100], González and Dasgupta [101], Surace and Worden [5] and Taylor and Corne [102]
2.1. Parametric approaches

Parametric approaches assume that normal data are generated from an underlying parametric distribution with parameters θ ∈ Θ, where θ is finite, and probability density function p(x; θ), where x corresponds to an observation. The parameters θ are estimated from the given training data. The most commonly used form of distribution for continuous variables is the Gaussian, the parameters of which are estimated from the given training data using maximum likelihood estimates (MLE), for which there is a closed-form analytical solution for a Gaussian distribution. More complex forms of data distribution may be modelled by mixture models such as GMMs [34,103], or other mixtures of different types of distributions such as the gamma [104,105], the Poisson [106], the Student's t [107], and the Weibull [108] distributions. In the absence of prior information regarding the form of the underlying distribution of the data, the Gaussian distribution is often used because of its convenient analytical properties when determining the location of a novelty threshold (discussed later in this section).

2.1.1. Mixture models

GMMs estimate the probability density of the target class (here the normal class), given by a training set, typically using fewer kernels than the number of patterns in the training set [46]. The parameters of the model may be estimated using maximum likelihood methods (via optimisation algorithms such as conjugate gradients or expectation-maximisation, EM) or via Bayesian methods (such as variational Bayes) [34]. Mixture models, however, can suffer from the requirement of large numbers of training examples to estimate model parameters. A further limitation of parametric techniques is that the chosen functional form for the data distribution may not be a good model of the distribution that generates the data. However, GMMs have been used and explored in many studies for novelty detection, as described below.

One of the major issues in novelty detection is the selection of a suitable novelty threshold. Within a probabilistic approach, novelty scores can be defined using the unconditional probability distribution z(x) = p(x), and a typical approach to setting a novelty threshold k is to threshold this value; i.e., p(x) = k. This method has been used for novelty detection in [25,1,109] among others. However, because p(x) is a probability density function, a threshold on p(x) has no direct probabilistic interpretation. Some authors [1,110] have interpreted the model output p(x) probabilistically, by considering the cumulative probability P associated with p(x); i.e., determining the probability mass obtained by numerically estimating the integral of p(x) over the region R for which the value of p(x) is above the novelty threshold k. For unimodal distributions, one can integrate from the mode of the probability density function to the probability contour defined by the novelty threshold p(x) = k, which can be achieved in closed form for most regular distributions. For multi-modal distributions, however, this may need to be performed using Monte-Carlo techniques, as suggested by Nairac et al. [110]. An approximation in closed form for this was proposed by Larsen [52]. The sampling approach can then be used to set thresholds in relation to the actual probability of observing novel data.

Ilonen et al. [51] introduce a confidence measure for GMM pdfs that can be used in one-class classification problems in order to select a novelty threshold. In this case, confidence is used to estimate the reliability of a classification result where a class label is assigned to an unknown observation. If the confidence is low, it is probable that a wrong decision has been made. The method is based on a branch of functional theory dealing with high-density regions (HDR), also termed the minimal volume region, which was originally proposed by Hyndman [111]. To determine the confidence region, Ilonen et al. [51] use an approximation to find a pdf value τ which is at the border of the confidence region. It is assumed that the gradient of the pdf is never zero in the neighbourhood of any point for which the pdf value is non-zero. The proposed measure is based on the pdf density quantile; specifically, τ is computed by rank-order statistics using the density quantile F(τ) and by generating data according to the pdf. In [53], the use of confidence information was demonstrated experimentally for face detection.

Another principled method of setting thresholds for novelty detection uses extreme value theory (EVT) [112].
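The sampling route to a probabilistic interpretation of a density threshold described above (estimating the probability mass P of the region R where p(x) lies above the novelty threshold k, in the spirit of the Monte-Carlo approach attributed to Nairac et al.) can be illustrated for a hand-specified univariate mixture. The mixture parameters and the value of k below are invented for the example:

```python
import numpy as np

# Hand-specified two-component univariate Gaussian mixture (illustrative values).
weights = np.array([0.6, 0.4])
means = np.array([0.0, 5.0])
stds = np.array([1.0, 0.5])

def pdf(x):
    """Mixture density p(x), evaluated for an array of points."""
    x = np.atleast_1d(x)[:, None]
    comp = weights * np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return comp.sum(axis=1)

# Monte-Carlo estimate of the probability mass P of the region
# R = {x : p(x) >= k}; a test point x is then flagged as novel if p(x) < k.
rng = np.random.default_rng(1)
n = 100_000
component = rng.choice(2, size=n, p=weights)    # sample from the model itself
samples = rng.normal(means[component], stds[component])

k = 0.05                                        # candidate novelty threshold on p(x)
P = np.mean(pdf(samples) >= k)                  # mass enclosed by the contour p(x) = k
```

Here P relates the density contour p(x) = k to the actual probability of observing data inside it, which is what gives the threshold the probabilistic interpretation discussed above.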
EVT is a branch of statistics which deals with extreme deviations of a probability distribution; i.e., it considers extremely large or small values in the tails of the distribution that is assumed to generate the data. Existing work on the use of EVT for novelty detection is described in [59]. Multivariate extrema defined in the EVT literature [113] are those n-dimensional data x_n that are maxima or minima in one or more dimensions of n. Rather than considering extrema in each dimension independently (yielding the case of univariate distributions), extremes with regard to the multivariate model of normality are required. EVT was first used for novelty detection in multivariate data by Roberts [64,65], with models of normality represented by GMMs. According to the Fisher–Tippett theorem [114], upon which classical EVT is based, the distribution of the maximum of the training set must belong to one of three families of extreme value distributions: the Gumbel, Fréchet, or Weibull distributions.

The method proposed by Roberts [64,65] is concerned with samples drawn from a distribution whose maxima distribution converges to the Gumbel distribution. Although the multi-modal distribution was represented by a mixture of Gaussian component distributions, the problem was reduced to a single-component problem by assuming that the component closest to the test data point (determined using the Mahalanobis distance) dominates the EVT statistics. In that case, the EVT probability for a test point is based on the Gumbel distribution corresponding to the closest component in the GMM. The contribution of the other components is assumed to be negligible. This method was applied to different biomedical datasets, such as EEG (electroencephalography) records, in order to detect regions of (abnormal or novel) epileptic activity, and MRI (Magnetic Resonance Imaging) data, in order to find the location of tumours in an image. However, Clifton et al. [59] demonstrate that this single-component approximation is unsuitable for novelty detection when the generative data distribution is multivariate or multimodal. In general, the assumption that only the closest component distribution to the test data point needs to be considered when determining the extreme value distribution is not valid. The main reason is that the effect of other components on the extreme value distribution may be significant due to the relative differences in variances between kernels. Clifton et al. [59] propose a numerical method of accurately determining the extreme value distribution of a multivariate, multimodal distribution, which is a transformation of the probability density contours of the generative distribution. This allows the extreme value distributions for mixture models of arbitrary complexity to be estimated by finding the MLE Gumbel distribution in the transformed space. A novelty threshold may then be set on the corresponding univariate cumulative density function in the transformed space, which describes where the most extreme samples generated from the normal distribution will lie.

Clifton et al. [59] have also proposed a closed-form, analytical solution for multivariate, unimodal models of normality. Classifying the extrema of unimodal distributions for novelty detection has been the focus of a large body of work in the field of EVT [115,57,58,63,66,67,116]. In [59], the use of an alternate definition of extrema and the derivation of closed-form solutions for the distribution function over pdf values allow accurate estimates of the extreme value distributions of multivariate Gaussian kernels to be obtained. The approach was applied to patient vital-sign data (comprising heart rate and breathing rate), to identify physiological deterioration in the patients being monitored.

EVT was also applied by Hazan et al. [62] in the context of vibration log-periodograms. They proposed the use of excess-value statistics instead of maximum statistics. Under mild conditions for the pdf of the training set, the probability of the excess value can be approximated by the Generalised Pareto distribution if the novelty threshold is sufficiently large. The fault detection algorithm begins by selecting a subset of the learning dataset, comprising N log-periodograms. For each frequency in the periodogram, the maximum of the log-periodograms across the subset is computed, and a "mask" is obtained. Excesses over the mask in the remainder of the training set are identified and used for estimation of the parameters of the Generalised Pareto distribution. A detection threshold is then determined: any excess over the threshold is considered to be a fault. The algorithm was evaluated on a publicly available set of vibration data, and analysis of receiver operating characteristic (ROC) curves corresponding to the results of classification showed that bearing deterioration could be identified even when little wear was present.

Yamanishi et al. [117] provide a theoretical basis and demonstrate the effectiveness of "SmartSifter", an outlier detection algorithm based on statistical learning theory. This approach was first proposed in [118] for data mining, specifically in fraud detection, network intrusion detection, or network monitoring. It uses a probabilistic model which has a hierarchical structure. While the probability density over the domain of categorical data is represented by a histogram, a finite mixture is used to represent the probability density over the domain of continuous variables. When new data are presented as input, the algorithm employs an online learning scheme to update the model, using either a sequentially discounting Laplace estimation algorithm for learning the histogram or a sequentially discounting expectation and maximising algorithm for learning the finite mixture model. The two algorithms developed by the authors gradually "discount" the effect of past examples in the online process; i.e., recent examples take a greater weight in the update algorithm than older examples. A score is then assigned to each new vector based on the normal model, measuring how much the model has changed after learning the new vector, with a high score indicating significant model change, suggesting that the new vector is a statistical outlier. The authors claim that one of the advantages of their method is its ability to adapt to non-stationary sources of data. The method has been successfully tested using network intrusion and health insurance databases from Australia's Health Insurance Commission.

Disease outbreak detection has been proposed by detecting anomalies in the event logs of emergency hospital visits [119,120]. The authors propose a hierarchical Bayesian procedure for data that arise as multidimensional arrays
with each dimension corresponding to the levels of a categorical variable. Anomalies are detected by comparing information at the current time to historical data. The distribution of data (deviations at current time) is modelled with a two-component mixture of Gaussians, one for the normal data and another for the outliers. Using Bayesian inference, the probability of an observation being generated by the outlier model is estimated. The authors assume that the priors of an observation being normal or an outlier are known a priori. The algorithm uses EM to set the parameters of a mixture of models for the two classes, assuming that each data point is an anomaly with a priori probability λ, and normal with a priori probability 1 − λ.

Zorriassatine et al. [56] use GMMs for pattern recognition in condition monitoring of multivariate processes. The methodology for applying novelty detection to bivariate processes, which was described in previous work [121], was adapted to monitor the condition of a milling process by analysing signals for 10 process variates. Signals collected from the normal states of the machining process were used to create various models of the underlying pdf for the machining process, in which each model represents the healthy states of the cutter at given combinations of machining parameters, such as the depth of cut, spindle speed, and feed rate. The centre of each Gaussian was initialised using a k-means clustering algorithm, with all model parameters then determined using EM. The GMM with the lowest training error was used as the novelty detector, with a threshold set to be the minimum log likelihood of the training data (as in [21]). Previously unseen data were used to adjust the novelty threshold using a heuristic approach.

Pontoppidan and Larsen [54] describe a probabilistic change detection framework based on Independent Component Analysis (ICA), and compare it to a GMM-based approach. Bayesian ICA using mean field training [122] is used to train the ICA model. The noise is assumed to be Gaussian and independent of the sources, with a diagonal covariance matrix. This ICA-based method was successfully applied to the detection of changes in the condition of large diesel engines using acoustic emission signals.

Flexer et al. [50] apply novelty detection to retrieve songs from a library based on similarity to test songs and genre label information. Experiments considered 2522 songs from 22 genres. Two minutes of each song were evaluated and novel data detected using two algorithms. The first, named ratio-reject, uses a GMM in order to estimate the pdf of the training data. The second algorithm, named Knn-reject, defines a neighbourhood which is used to classify unseen data. Both methods were shown to perform equally well in terms of sensitivity, specificity, and accuracy within a genre-classification context.

Filev and Tseng [48,49] describe a real-time algorithm for modelling and predicting machine health status. It utilises the concepts of fuzzy k-nearest neighbour clustering and the GMM to model the data acquired from the machine as a collection of clusters that represent the dynamics of the machine's main operating modes. The Greedy EM clustering algorithm [123] is used to identify and initialise clusters corresponding to different operating modes. An unsupervised learning algorithm then continuously reads new feature data and recursively updates the structure and the parameters of the operating mode clusters. The algorithm was validated using a set of experimental vibration data collected in an accelerated testing facility.

Song et al. [55] propose a conditional anomaly detection (CAD) technique by assuming that attributes are already partitioned into contextual and behavioural attributes; that is, the context of the measurements is considered before identifying a data point as "anomalous". To detect anomalies, the CAD technique learns two models: the statistical behaviour of the monitored system, and the statistical behaviour of the environment. The probabilistic links between the models are also learnt, giving a combined model of likely data that co-occurs in the environment and the system. True anomalies can then be defined as being statistically unlikely events in the system parameters that occur when the environment is normal. Within the CAD literature, the parameters from the system under study are described as the indicator parameters, while those from the surrounding conditions are called the environment parameters. The technique used in [55] for learning the indicator and environment models is Gaussian mixture modelling.

Zhang et al. [124] propose a hierarchical probabilistic model for online document clustering. The generation of new clusters is modelled with a Dirichlet process mixture model, for which the base distribution can be treated as the prior of the general English model and the precision parameter is related to the generation rate of new clusters. The empirical Bayes method is used to estimate model hyperparameters based on a historical dataset. This probabilistic model was compared with existing approaches in the literature, such as logistic regression, and applied to the novelty detection task of topic detection and tracking. The objective of this task was to detect the earliest report for each news event as soon as the report arrives in a temporal sequence of new stories. Results showed that the performance of the proposed method is comparable to other methods in terms of topic-weighted measure (i.e., in terms of the topic detection and tracking evaluation measure in which the cost of the method is computed for every event, and then the average is taken).

Perner [125] outlines a new approach to novelty detection, which is based on a case-based reasoning (CBR) process. The author combines statistical and similarity inference methods. This view of CBR takes into account the properties of data such as uncertainty, and underlying concepts such as adaptation, storage, and learning. Known classes are described by statistical models. The performance of the models is improved by incremental updating based on newly available events, once a sufficiently large number of cases is collected. Information about cases is now implicitly represented by the model, and storage capacity is preserved. New events, not belonging to one of the known case classes, are classified as being novel events. These events are stored by a similarity-based registration procedure in a second case base. The similarity-based learning procedure ensures that similar cases are grouped into case classes, a representative for the case class is learnt, and generalisation over case classes can be performed. Novel events can be collected and grouped so that retrieval is
efficient. When a sufficiently large number of cases is available in the second case base, the case class is passed to the statistical learning unit to learn a new statistical model. The statistical learning strategy for updating a model and learning new models is based on the minimum message length learning principle.

Hempstalk et al. [126] introduce a technique that combines density estimation and class probability estimation for converting one-class classification problems into binary classification tasks. The initial phase of the method involves an examination of the training data for the normal class in order to determine its distribution (such as fitting a GMM to the data). This knowledge is subsequently used in the generation of an artificial dataset that represents an abnormal class. In the second phase, a standard binary classifier is trained based on the normal class and the generated abnormal class. Using Bayes' rule, the authors demonstrate how the class density function can be combined with the class probability estimate to yield a description of the normal class. The authors conclude that the combination of the density function with a class probability estimate (i.e., a classification model) produces an improvement in accuracy beyond that which results from one-class classification with the density function alone, but this will depend on how well the artificial dataset represents the abnormal class.

Chen and Meng [127] propose a framework for patient-specific physiological monitoring, in which the ratio of the densities of training and test data [128,129] are used to define a health status index. The density ratio is estimated by a linear model whose parameter values are found using a least-squares algorithm, without involving density estimation. The training and testing data were selected from the MIT-BIH Arrhythmia Database, which contains sequences of ECG (electrocardiogram) signals acquired from 10 patients. The authors claim that the method is advantageous because density ratio parameters are estimated without involving actual density estimation (a comprehensive review of density ratio estimation methods can be found in [130]).

Another strategy for novelty detection is to use time-series methods such as the well-known stochastic process, the Autoregressive Integrated Moving Average (ARIMA). This may be used to predict the next data point, and hence determine if it is artefactual or not; see, for example, [131]. In the latter, an online novelty detector, the Automatic Dynamic Data Mapper (ADDaM), is proposed, which is based on the construction of a pdf using Gaussian kernels. Two forms of pdf are possible: a static pdf for which the prior probability of each kernel is determined by the number of observations it represents, and a "temporal" pdf for which more recent observations have a higher prior probability. Testing against the current pdf assesses the novelty of the next test point. The performance of this method for artefact detection in heart rate data (from an automatic anaesthesia monitoring database) was compared with Kalman filtering, ARIMA, and moving average filters using ROC curves. The authors reported that both ADDaM-based methods outperformed all other methods tested. Their proposed method is similar to that proposed earlier by Roberts and Tarassenko [132], in which the authors describe an offline novelty detector similarly based on a time-varying GMM, which was tested using sleep EEG data. Their novelty detector was based on "growing" the mixture model over time using a process similar to reinforcement learning. During learning, training data were presented at random and were either added to regions that formed the basis of the Gaussian kernels in the model or used to estimate the parameters of new kernels. A variant of these regression-based techniques, which detects anomalies in multivariate time-series data generated by an Autoregressive Moving Average (ARMA) model, was proposed in [133]. Here, a multivariate time-series is transformed into a univariate time-series by linearly combining the components that are obtained using a projection pursuit technique, which maximises the kurtosis coefficient of the time-series data. Univariate statistical tests are then used for anomaly detection in each resulting projection. Similar approaches have been applied in ARIMA models [134], or have been used to detect anomalies by analysing the Akaike Information Criterion during model-fitting [135].

2.1.2. State-space models

State-space models are often used for novelty detection in time-series data. These models assume that there is some underlying hidden state that generates the observations, and that this hidden state evolves through time, possibly as a function of the inputs. The two most common state-space models used for novelty detection are the Hidden Markov Model (HMM) and the Kalman filter.

HMMs include temporal dependence through the use of a state-based representation updated at every time step. While the features are directly observable, the underlying system states are not, and hence they are called unobservable, hidden, or latent states. The transitions between the hidden states of the model are governed by a stochastic process [33,4]. The "emission" probabilities of the observable events are determined by a set of probability distributions associated with each state, which can be thresholded to perform novelty detection [83]. HMM parameters are typically learnt using a variant of the EM algorithm. Novelty detection with HMMs may also be achieved by constructing an "abnormal" state, a transition into which implies abnormal system behaviour [136].

Yeung and Ding [83] use HMMs for detection of intrusions based on shell command sequences within the network security domain. Two different types of behavioural models are presented: a dynamic modelling approach based on HMMs and a static modelling approach which is based on the frequency distributions of event occurrences. In the former, the parameters of an HMM for modelling normal system behaviour are estimated using an EM algorithm for mixture density estimation. The likelihood of an observation sequence, with respect to a given trained HMM, is computed using either the "forward" or "backward" algorithm. By comparing the likelihood of an observation sequence against a threshold (chosen to be the minimum likelihood among all training sequences), one can decide whether that sequence is abnormal or not. In the second approach, the probability distribution of normal behaviour of the system observed
over a certain period of time is modelled using a simple occurrence frequency distribution. The behaviour of the test system being monitored is modelled in the same way. An information-theoretic measure, cross-entropy, which is related to the Kullback–Leibler measure, is used to quantify separation between the two distributions (corresponding to "training" and test sequences). The Kullback–Leibler divergence is a statistical tool for estimating the difference in information content between two distributions. By checking whether the cross-entropy between the two distributions is larger than a certain threshold, chosen to be the maximum cross-entropy value computed between the entire training set and each time-series in the training set, one can decide whether the observed sequence should be considered an intrusion with respect to the model. Although the HMM is better suited for intrusion detection based on Unix system calls, the static modelling approach based on the information-theoretic technique outperformed the dynamic modelling approach across all experiments. Other similar intrusion detection methods based on HMMs have been proposed [77,84].

Ntalampiras et al. [75] explore HMMs for novelty detection applied to acoustic surveillance of abnormal situations, the goal being to help an authorised person take the appropriate action for preventing life or property loss. The HMM is a model commonly used in sound recognition, as it takes into account the temporal evolution of the audio signal. The framework was tested using a dataset that contains recordings from a smart-home environment, an open public space, and an office corridor.

A related state-based approach to novelty detection in time-series relies on Factorial Switching Kalman Filters [2]. A Kalman Filter can be seen as a generalisation of an autoregressive process, describing an observed process in terms of an evolving hidden state process. This may be generalised to the Switched Kalman Filter (SKF) [137], in which the evolving hidden process is dependent on a switching variable (which also evolves through time). The Factorial SKF (FSKF) is a dynamic extension of the SKF, in which a cross-product of multiple factors is used rather than a single variable. The FSKF allows the modelling of multiple time-series by assuming that a continuous, hidden state is responsible for data generation, the effects of which are observed through a noise process. An explicit abnormal mode of behaviour is included within the model which is used to identify departures from normality. This method was applied to the monitoring of premature infants in intensive care, and is described in [2,80]. The method was later extended in [78], which used a dataset of continuously observed physiological variables such as heart rate and blood pressure. Lee and Roberts [72] propose an online novelty detection framework using the Kalman filter and EVT. A multivariate Gaussian probability density over the target variables is obtained via a Kalman filter, with an auto-regression state model. This was used to model the dynamics of the state space and thereby to detect changes in the underlying system, as well as identify outliers in the observation sequence. EVT is then […] dataset, in addition to the Well-Log¹ and Dow Jones² datasets.

Also based on a dynamical model of time-series normal data, the Multidimensional Probability Evolution method [73,74] characterises normal data by using a non-linear state-space model; i.e., the pdf within a multidimensional state space is computed for each window of the time-varying signal. The regions of state space visited during normal behaviour are modelled, and departures from these, that can correspond to both linear and non-linear dynamical changes, are deemed abnormal. The performance of this technique was illustrated using a synthetic signal, in addition to electroencephalography (EEG) recordings to identify epileptic seizures.

A task related to that of time-series novelty detection is to determine whether a pattern discovered in the data is significant. Assuming an underlying statistical model for the data, one can estimate the expected number of occurrences of a particular pattern in the data. If the number of times a pattern actually occurs is significantly different from this expected value, then it could be indicative of unusual activity (and thus the pattern discovered may be regarded as being significant). Furthermore, since the statistics governing the data generation process are assumed to be known, it is possible to quantify the extent of deviation from the expected value that corresponds to a test pattern being classified as "significant". An application to the "frequent episode discovery problem" in temporal data mining is presented in [69]. It is shown that the number of sliding windows over the data in which a given episode occurs at least once converges to a Gaussian distribution with mean and variance that can be determined from the parameters of an underlying Bernoulli distribution (which are in turn estimated from some training data). For a pre-defined confidence level, upper and lower thresholds for the observed frequency of an episode can be determined, which can be used to decide whether an episode is over- or under-represented in the data. These ideas are extended in [138] to the case of determining significance for a set of frequent episodes, and in [68] to the case of a Markov model assumption for the data sequence.

Ihler et al. [70] consider the modelling of web click data. The proposed method is based on a time-varying Poisson process model that can account for anomalous events. The normal behaviour in a time-series is assumed to be generated by a non-stationary Poisson process while the outliers are assumed to be generated by a homogeneous Poisson process. The transition between normal and outlying behaviours is modelled using a Markov process. Markov Chain Monte Carlo (MCMC) is used to estimate the parameters of these processes. A test time-series is modelled using this process and the time points for which the outlying model is selected are considered as outliers.

¹ The Well-Log data set contains measurements of nuclear magnetic response while drilling a well.
² The Dow Jones data set contains daily stock market indexes
used to define a threshold and obtain a novelty measure (Industrial Average), that show how 30 large publicly owned companies
on the univariate predictive distribution. Experiments based in the United States have traded during standard trading sessions
were conducted on three univariate data sets: an artificial on the stock market.
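Ihler et al.'s two-state Poisson idea can be illustrated with a deliberately simplified sketch. This is not their method: the Markov transition between regimes and the MCMC estimation of the rates are omitted, and the rates and counts below are invented for illustration. Each count is scored by the log-likelihood ratio of a homogeneous "outlier" Poisson rate against the expected "normal" rate, and flagged when the outlier model explains it better:

```python
import math

def poisson_logpmf(k: int, lam: float) -> float:
    """Log-probability of observing count k under a Poisson(lam) model."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def flag_outliers(counts, normal_rates, outlier_rate):
    """Flag time points whose count is better explained by the outlier
    process than by the (possibly time-varying) normal process."""
    flags = []
    for k, lam_normal in zip(counts, normal_rates):
        llr = poisson_logpmf(k, outlier_rate) - poisson_logpmf(k, lam_normal)
        flags.append(llr > 0.0)  # positive log-likelihood ratio => outlier
    return flags

# Illustrative data: hourly counts with an expected normal rate of 5,
# and a burst at t=3 that the elevated outlier rate (20) explains better.
counts = [4, 6, 5, 22, 5]
normal_rates = [5.0] * len(counts)
print(flag_outliers(counts, normal_rates, outlier_rate=20.0))
# → [False, False, False, True, False]
```

Recovering the full model would additionally require inferring the time-varying normal rate and the hidden Markov state sequence, e.g. via the MCMC scheme described in [70].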
M.A.F. Pimentel et al. / Signal Processing 99 (2014) 215–249 225
Both methods described above can be generalised to Dynamic Bayesian Networks (DBNs), which are more general state-space models [34]. DBNs generalise HMMs and Kalman filter models by representing the hidden and observed states in terms of state variables, which can have complex interdependencies. A DBN is a directed probabilistic graphical model of a stochastic process, which provides an easy way to specify these conditional independencies. They can also be seen as an extension of Bayesian networks to handle temporal models. A Bayesian network estimates the probabilistic relationship among variables of a dataset, in the form of a probabilistic graphical model. In addition to DBNs, Bayesian networks are sometimes termed naïve Bayesian networks or Bayesian belief networks. Janakiram et al. [71] propose a classification system based on a Bayesian belief network (BBN) to detect any missing or anomalous data in wireless sensor networks. Each node in the graph corresponds to a sensor, and models sensor measurements from neighbouring nodes in addition to its own. The authors then estimate the probability of each observed attribute using the BBN model. This model is suitable if some dependency exists between sensor variables and between nodes. Its accuracy depends on the number of neighbours that each node has. The technique requires offline training, and regeneration of a conditional probability table for each node if the network topology changes. BBNs are also used to incorporate prior probabilities into a novelty detection framework. Several variants based on naïve Bayes, which assumes independence between the variables, have been proposed for network intrusion detection [139,79] and for disease outbreak detection [81,82].

More recently, Pinto et al. [76] have proposed novelty threshold functions that operate on top of probabilistic graphical models instantiated dynamically from sensed semantic data in the context of room categorisation. By using thresholds on the distributions defined by the graph, based solely on the conditional probability, as seen in [34], a novelty system can be implemented. However, it may not be suitable to perform novelty detection using graphs that are dynamically generated. Pinto et al. [76] show that the ratio between a conditional and an unconditional probability is a suitable detector for implementing a threshold when samples are taken from dynamic distributions, under the assumption that the probability of a sample being generated by a (novel) unknown class is constant across all graph structures. This assumption may not be appropriate for some graph structures; e.g., a graph where there is only access to room-size information versus a graph where more information concerning the properties of the room is available. The authors also show that correct estimation of the unconditional probability plays an important role, and that unlabelled data can be used to construct an unconditional pdf that can then be used to optimise the novelty threshold. However, only synthetic data distributions were used to evaluate the effectiveness of the approach.

2.2. Non-parametric approaches

Non-parametric approaches do not assume that the structure of a model is fixed; i.e., the model grows in size as necessary to fit the data and accommodate the complexity of the data. The simplest non-parametric statistical technique is the use of histograms, which graphically display tabulated frequencies. The algorithm typically defines a distance measure between a new test data point and the histogram-based model of normality to determine if it is an outlier or not [36]. For multivariate data, attribute-wise histograms are constructed and an overall novelty score for a test data point is obtained by aggregating the novelty scores from each attribute. This has been applied to network intrusion and web-based attack detection [117,140–142].

2.2.1. Kernel density estimators

A non-parametric approach to probability density estimation is the kernel density estimator [34]. In this approach, the probability density function is estimated using large numbers of kernels distributed over the data space. The estimate of the probability density at each location in data space relies on the data points that lie within a localised neighbourhood of the kernel. The kernel density estimator places a (typically Gaussian) kernel on each data point and then sums the local contributions from each kernel. This kernel density estimator is often termed the Parzen windows estimator [143]. This method has been used for novelty detection in applications such as network intrusion detection [97], oil flow data [21], and mammographic image analysis [1]. In the Parzen windows estimator, an isotropic Gaussian kernel is centred on each training point, with a single shared variance hyperparameter. Training the Parzen density estimator consists of determining the variance of the kernels, which controls the smoothness of the overall distribution. The fixed width in each feature direction also means that the Parzen density estimator is sensitive to the scaling of the data. This problem is addressed in [21], in which the variance is determined using a nearest-neighbour method.

Vincent and Bengio [96] propose an approach to improve on this estimator, by using general covariance matrices for individual components, set according to neighbourhoods local to each kernel. Not only are the localisation of the data point and its neighbours used, but also their geometry, in order to infer the principal characteristics of the local shape of the manifold (where the density is concentrated), which can be summarised by the covariance matrix of the Gaussian. Bengio et al. [87] describe a non-local non-parametric density estimator which builds upon previously proposed GMMs with regularised covariance matrices to take into account the local shape of the manifold. The proposed approach builds upon the Manifold Parzen density estimator [96], which associates a regularised Gaussian with each training point, and upon previous work on non-local estimators of the tangent plane of a manifold [86]. The local covariance matrix characterising the density in the immediate neighbourhood of a data point is learnt as a function of that data point, with global parameters. Generalisation may then be possible in regions with little or no training data, unlike traditional, local, non-parametric models. The implicit assumption is that there is some kind of regularity in the shape of the density, such that learning about its shape in
one region could be informative of the shape in another region that is not adjacent. The proposed method was tested in three types of experiments involving artificial datasets and the USPS dataset,3 which showed that the non-local estimator yielded improved density estimation and reduced classification errors when compared to local algorithms.

3 The USPS (United States Postal Service) dataset contains handwritten digit images, and comes from the UCI repository [144].

Erdogmus et al. [88] describe a multivariate density estimation method that uses Parzen windows to estimate marginal distributions from samples. The kernel size is optimised to minimise the Kullback–Leibler divergence of the true marginal distribution from the estimated marginal density. The estimated marginal densities are used to transform the random variables to be Gaussian-distributed, whereby joint statistics can be simply determined by sample covariance estimation. The proposed method was shown to be more data efficient than Parzen windows with a structured multidimensional kernel.

Subramaniam et al. [93] use kernel density estimators in a framework that computes an approximation to multi-dimensional data distributions in order to enable complex applications in resource-constrained sensor networks. The authors propose an online approximation of the data distribution by considering the values in a sliding window. The variance of the kernel for the values in the sliding window is computed using a histogram along the time axis. A network of nodes is considered, where the estimator updates are propagated around the network such that child nodes transmit updates to the parent nodes. Experiments showed that this method can achieve high precision for identifying outliers, but that it consumes a large amount of memory space and may not find all outliers.

Tarassenko et al. [94,95] propose an approach to patient monitoring based on novelty detection, in which a multivariate, multimodal model of the distribution of vital-sign data from "normal" high-risk patients is constructed using Parzen windows. Multivariate test data are then compared with this model to give a novelty score, and an alert is generated when the novelty score exceeds the novelty threshold. This system was used for monitoring patients in a clinical trial involving 336 patients [145], and it was able to provide early warning of adverse physiological events.

Ramezani et al. [92] consider the problem of novelty detection in video streams, and use a method derived from a kernel density estimator and an evolving clustering approach, "e-Clustering". In this method, the pdf of the colour intensity of the image frames is approximated by a Cauchy kernel. A recursive expression derived in [85,146] is then used to update this estimation online. The recursive density estimation clusters pixel colour intensities into "background" and "foreground" (i.e., pixels for which significant novelty is detected). The proposed approach gradually updates the background model, and it was found to be faster than the traditional kernel density estimate for background subtraction. The approach can also be extended to automatic object tracking when combined with a Kalman filter or evolving Takagi–Sugeno fuzzy models [92,85].

More recently, one-class classification using Gaussian Processes (GPs) has been proposed [147,89–91], in which a point-wise approach to novelty detection is also taken, dividing the data space into regions with high support and low support, depending on whether or not those regions are close to those occupied by normal training data. It is often assumed that the desired mapping from inputs x to labels y can be modelled by y = f(x) + ε, where f is an unknown latent function and ε is a noise term. By choosing a proper GP prior, it is possible to derive useful membership scores for one-class classification. In [90], the authors use a mean of the prior with a smaller value than the positive class labels (e.g., y = 1), such as a zero mean. This restricts the space of probable latent functions to functions with values gradually decreasing when the inputs are far from training points. Because the predictive probability is solely described by its first and second order moments, the authors also investigate the power of the predictive mean and variance as alternative membership scores: the mean decreases for inputs distant from the training data and can be directly utilised as a novelty detection measure, while the predictive variance increases, which suggests that the negative variance value can serve as an alternative criterion for novelty detection. This latter concept is used in the context of clustering in [91]. Kemmler et al. [90] explore a heuristic measure, the predictive mean divided by the standard deviation, which was proposed in [89] as a combined measure for describing the uncertainty of the estimation.

A related approach involves a class of methods which are part of the well-established field of changepoint detection [148,149]. Here the problem setting is more specific, the aim being (typically) to detect whether the generative distribution of a sequence of observations has remained stable or has undergone some abrupt change. This may include not only detecting that a change has occurred, but also, if it has, estimating the time at which it occurred. Methods vary according to which restrictions are placed on the pre- and post-change distributions and the degree of knowledge of potential changes. Changepoint detection can also take place in a batch or online setting. The basic approach to the retrospective problem is to find a test statistic appropriate for testing the hypothesis that a change has occurred against the hypothesis that no change has occurred. This statistic is usually based on a likelihood ratio, but other approaches exist [149,150]. The online setting has the goal of detecting a change as quickly as possible once it has occurred. The most basic approach to this is also based on likelihood ratios. The Cumulative Summation (CUSUM) algorithm is generally used in statistical process control to detect abrupt changes in the mean value of a stochastic process [151]. Non-parametric CUSUM algorithms sequentially accumulate data values that are higher than the mean value observed under normal conditions. An anomaly is detected by comparing this CUSUM value to a threshold, where the latter determines the sensitivity of the detector and the detection delay. This approach has been used in wireless sensor networks [152] and intrusion
detection systems [151]. Bayesian approaches have also been proposed for both offline and online changepoint detection schemes [153]. The Bayesian framework incorporates prior knowledge about the distribution of the change time. The decision function is then based on the a posteriori probability of a change. Because it is often hard in practice to elicit specific information about the distribution of the changepoint, it is common to assume that the change time follows a geometric distribution, but finding the parameter of this distribution requires a preliminary estimation problem to be solved. There is a large body of literature on the topic of changepoint detection, and the interested reader is referred to a book on this topic which has recently been published [154].

2.2.2. Negative selection

Negative selection approaches have been widely used for change detection and novelty detection [155,98,101,102]. The nature of the negative selection algorithm [155] is inspired by the properties of the immune system. The human immune system has the ability to detect antigens; i.e., anything which is not part of the human body (such as viruses, bacteria, etc.). The task of the immune system is to differentiate between antigens and the body itself, a process known as self/non-self discrimination, which is achieved when the antigen is "recognised" by a specific antibody called a T-cell receptor. These T-cell receptors are created by a random process of genetic rearrangements. Those cells that successfully bind with self-cells (normal cells), and thus incorrectly mark them for destruction, are destroyed in the thymus gland. Only those cells that fail to bind to self-cells are allowed to leave the thymus gland and become antibodies in the immune system. This process is called negative selection. In a similar way, novelty detection has the fundamental objective of distinguishing between self (which corresponds to the normal operation of the monitored system) and non-self (which corresponds to novel data).

The negative selection algorithm was first used by Forrest et al. [155] as a means of detecting unknown or illegal strings for virus detection in computer systems. A population of detectors is created to perform the function of T-cells, which are simple, fixed-length binary strings. A simple rule based on a distance measure is then used to compare bits in two such strings and decide whether a match has occurred. One major concern identified by the authors is the matching threshold, which has to be data-specific for satisfactory system performance. Dasgupta and Majumdar [98] extended the approach for use with multi-dimensional data, where the dimensionality of the original data is first reduced to two dimensions using principal component analysis, and then binary encoded. The authors conclude that the encoding of self and non-self sets using binary strings results in a risk of destroying the semantic value of relationships between data items. Later, the authors propose a real-valued negative selection algorithm [101]. The main feature of the latter is that the self/non-self space corresponds to a subset of the original n-dimensional data space. A detector (antibody) is defined by a hypersphere; i.e., an n-dimensional vector that corresponds to the centre of the sphere and a scalar value that represents its radius. The matching rule is expressed by a membership function, which is a function of the detector-antigen Euclidean distance and the radius of the detector. The algorithm tries to evolve a set of points (antibodies or detectors) that covers the non-self space using an iterative process that updates the position of the detector. In order to detect if a detector matches a self point, the algorithm uses a nearest-neighbour approach to calculate a distance measure. A different approach is also taken in which a multi-layer perceptron trained with back-propagation is used for the detection problem. The system was tested on network traffic data sets. Gómez et al. [100] extended the negative characterisation approach to generate more flexible boundaries between self and non-self space using fuzzy rules. Taylor and Corne [102] demonstrate the feasibility of the above approaches in fault detection in refrigeration systems. Esponda et al. [99] describe a framework for outlier detection using this general approach.

Surace and Worden [5] apply the negative selection algorithm to more general feature sets, rather than the windowed time-series data used in previous studies. The authors consider Structural Health Monitoring (SHM) for situations where the normal condition of a structure may change due to time-varying environmental or operational conditions. Data were simulated for an offshore platform model with changing mass as a result of changing oil storage requirements. The method was also applied to the analysis of the structure of a transport aircraft, for which the effective mass decreases due to a reduction in fuel, simulated using a finite-element model. The negative selection algorithm proved capable of distinguishing various damage conditions in structures induced by time-varying oil storage and fuel use.

2.3. Method evaluation

Probabilistic approaches are mathematically well-grounded and can effectively identify novel data if an accurate estimate of the pdf is obtained. Also, after the model has been constructed, only a minimal amount of information is required to represent it, rather than requiring the storage of the entire set of data used for training. Probabilistic methods are also known to be "transparent" methods; i.e., their outputs can be analysed using standard numerical techniques. However, the performance of these approaches is limited when the size of the training set is very small, particularly in moderately high-dimensional spaces. As the dimensionality increases, the data points are spread through a larger volume in the data space. The problem encountered when applying density methods to sparsely populated training sets is that there is little control over the inherent variability introduced by the sparsity of the training data; i.e., the estimated quantiles can differ substantially from the true quantiles of the distribution. Different approaches have been proposed to overcome the problems associated with increasing dimensionality, which both increases the processing time and distorts the data distribution. In many real-life scenarios, no a priori knowledge of the data distributions is available, and so parametric approaches may be problematic if the
Table 2
Examples of novelty detection methods using distance-based approaches.

Nearest neighbour (Section 3.1): Angiulli and Pizzuti [156], Bay and Schwabacher [157], Boriah et al. [158], Breunig et al. [159], Chandola et al. [160], Chawla and Sun [161], Ghoting et al. [162,163], Hautamaki et al. [164], Jiang et al. [165], Kou et al. [166], Otey et al. [167], Palshikar [168], Pokrajac et al. [169], Wu and Jermaine [170] and Zhang and Wang [171].

Clustering (Section 3.2): Barbará et al. [172,173], Budalakoti et al. [174], Clifton et al. [175,176], Filippone et al. [177], He et al. [178], Kim et al. [179], Srivastava and Zane-Ulman [180,181], Sun et al. [182], Syed et al. [183], Wang [184], Yang and Wang [185], Yong et al. [186,187], Yu et al. [188] and Zhang et al. [189].
data do not follow the assumed distribution. Thus, non-parametric techniques are appealing since they make fewer assumptions about the distribution characteristics. Kernel functions, for example, generally scale reasonably well for multivariate data and are not computationally expensive.

3. Distance-based novelty detection

Distance-based methods, including clustering or nearest-neighbour methods (Table 2), are another type of technique that can be used for performing a task equivalent to that of estimating the pdf of data. These methods rely on well-defined distance metrics to compute the distance (similarity measure) between two data points.

3.1. Nearest neighbour-based approaches

Nearest neighbour-based approaches are among the most commonly used methods for novelty detection. The k-nearest neighbour (k-NN) approach is based on the assumption that normal data points have close neighbours in the "normal" training set, while novel points are located far from those points [164]. A point is declared an outlier if it is located far from its neighbours. Euclidean distance is a popular choice for univariate and multivariate continuous attributes, but other measures, such as the Mahalanobis distance, can be used. For categorical attributes, a simple matching coefficient is often used, although other more complex measures have been proposed [158,160]. Several well-defined distance metrics to compute the distance (or similarity measure) between two data points can be used [33], which can broadly be divided into distance-based methods, such as the distance to the kth nearest neighbour [171], and local density-based methods, in which the distance to the average of the k nearest neighbours is considered [164]. Many of these algorithms are unable to deal with high-dimensional data sets efficiently. A recent trend in high-dimensional outlier detection is to use an evolutionary search method in which outliers are detected by searching for sparse subspaces.

The approach proposed in [156] considers a weighted sum of the distances from the k nearest neighbours to each data point, and classifies as outliers those points which have the largest weighted sums. The k nearest neighbours of each point are found by linearising the search space using a Hilbert space-filling curve. This work is built upon previous techniques that prune the search space for nearest neighbours [190]. The latter partition the data space into a grid of hypercubes of fixed sizes. If a hypercube contains many data points, such points are likely to be normal. Conversely, if a test point lies in a hypercube that contains very few examples, the test point is likely to be an outlier. In [156], a high-dimensional data set is mapped onto the interval [0,1]^n using Hilbert space-filling curves. Each successive mapping improves the estimate of the example's outlier score in the original high-dimensional space. In related works, [168] adapts the technique proposed in [190] to continuous sequences, and [166] incorporates spatial correlation between data. These analyses are related to the very well established application domain of Rough Sets, and indeed a formalisation of a similar approach within the framework of Rough Sets has been proposed in [165].

Zhang and Wang [171] describe "HighDOD", the High-Dimension Outlying Subspace Detection method, for efficiently characterising the outlying subspaces of high-dimensional data spaces. The novelty score of a point is measured using the sum of distances between it and its k nearest neighbours, as in [156]. Two heuristic pruning strategies are proposed to perform fast pruning in the search, and an efficient dynamic method with a sample-based learning process is described. The dynamic subspace search method begins the search in those subspaces that have the highest total saving factor. The total saving factor of a subspace is defined to be the combined savings obtained by applying the two pruning strategies during the search process. As the search proceeds, the total saving factor of subspaces with different dimensions is updated, and the set of subspaces with the highest values is selected for exploration in each subsequent step. The search process terminates when all the subspaces have been evaluated or pruned. Experiments were performed using both synthetic and real high-dimensional data sets, ranging from 8 to 160 dimensions.

Different approaches were taken in [157,163], which prune the search by ignoring data that cannot be considered to be outliers. Bay and Schwabacher [157] introduced the distance-based outlier detection algorithm called "ORCA". The authors showed that for sufficiently randomised data, a simple pruning step could result in the average complexity of the nearest neighbour search being approximately linear. After finding the nearest neighbours for a point, a threshold based on the score of the weakest outlier (i.e., the outlier that is closest to the point) found so far is set for any new data point. Therefore points that are close are discarded by the algorithm. To improve the performance of ORCA, Ghoting et al. [163] proposed the
Recursive Binning and Re-projection algorithm. In the first of two phases, a divisive hierarchical clustering algorithm is used to partition objects into bins or clusters. Objects in the same bin are reorganised according to a projection along their principal component. In the second phase, the strategy of ORCA [157] is employed on the clustered data. This pre-processing step allowed faster determination of the closest nearest neighbours compared with ORCA. A similar cluster-based pruning has been proposed in [191]. Wu and Jermaine [170] used a sampling algorithm to improve the efficiency of detection of the nearest neighbour-based technique. Rather than using the entire dataset, the nearest neighbour of every data point is computed only within a smaller sample of the dataset. This reduces the complexity of the proposed method according to the sample size chosen.

While most techniques discussed so far in this category have been proposed to handle continuous attributes, variants have been proposed to handle other data types. Wei et al. [192] propose a hypergraph-based technique called HOT, an efficient approach for detecting local outliers in categorical data. A hypergraph can be defined as a generalised graph, consisting of a set of vertices and hyper-edges. The authors use hyper-edges, which simply store "frequent itemsets" (commonly used terms in association rule mining) along with the data points (vertices) that contain these frequent itemsets. First, all the frequent itemsets are mined by using the Apriori algorithm [193]. Then, they are arranged in a hierarchy according to the containment relationship. The hierarchy is visited using a bottom-up strategy. Frequent itemsets I represent common attributes, while each attribute A not in I represents a potential exceptional attribute. For each itemset I, the histogram of frequencies associated with each attribute A not in I is stored, and used to compute the deviation of each value taken by A in the database. The objects assuming a value for the attribute A whose deviation is smaller than a defined threshold are returned as outliers. Two advantages of this method are that (i) it alleviates the problem of the curse of dimensionality in very large databases, and (ii) it uses the connectivity property of points to deal efficiently with missing values.

Otey et al. [167] and Ghoting et al. [162] propose a distance measure for data containing a mix of categorical and continuous attributes. The authors define links between two points by taking into account the dependencies between continuous and categorical attributes, where the distances for categorical and continuous attributes are considered separately. For categorical attributes, the distance between two points is defined to be the number … dependencies between categorical and continuous data are violated by it. We note that the construction of the covariance matrix implies an assumption that the data share the same distribution, which may not hold for real-life applications.

A density-based scheme for outlier detection has been proposed in [159], in which a Local Outlier Factor (LOF) is computed for each point. The LOF of a point is based on the ratios of the local density of the area around the point and the local densities of its neighbours. The size of the neighbourhood of a point is determined by the area containing a user-supplied minimum number of points. The LOF takes high values for outliers, because it quantifies how isolated the point is with regard to the density of its neighbourhood. Note that LOF ranks points by only considering the neighbourhood density of the points, and so it may miss potential outliers whose densities are close to those of their neighbours.

A similar technique called LOCI (Local Correlation Integral) is presented in [194]. LOCI addresses the difficulty of choosing values for the number of neighbours in the LOF technique by using data-driven methods. The local neighbourhood is defined such that each point has the same radius of neighbourhood, instead of having a fixed number of neighbours. It uses the concept of a multi-granularity deviation factor to measure the relative deviation of a point's local neighbourhood density from the average local neighbourhood density in its neighbourhood. A point can then be declared as an outlier by comparing its factor with a data-derived threshold value. The choice of an appropriate radius for the local neighbourhood becomes critical for high-dimensional datasets.

A variant of LOF was proposed in [195]. The GridLOF uses a simple grid-based technique to prune away some non-outliers and then only computes the LOF values for the remaining data. This avoids the computation of LOF for all points. Tang et al. [196] present a variation of the LOF that considers both the density of a test point in its neighbourhood and the degree to which the point is connected to other points; it uses a connectivity-based outlier factor to identify outliers. This factor is calculated using the ratio of the average distance from the test point to its k-nearest neighbours and the average distance from its k-nearest neighbours to their own k-nearest neighbours. Points that have the largest factors are declared as outliers. This approach was found to be more effective for outlier detection, especially for sparse data sets, where "non-outlier" patterns may have low densities.

Ren et al. [197] develop an efficient density-based outlier detection approach based on a relative density
of attributes which take the same values; two points are factor (RDF). This is another local density measurement
considered linked if they have at least one common for determining the degree of being an outlier by con-
attribute-value pair. The number of attribute-value pairs trasting the density between a point and that of its
in common indicates the strength of the associated link neighbours. A P-Trees approach is used to efficiently prune
between these two points. For continuous attributes, a some non-outliers, and the remaining subset of the data is
covariance matrix is maintained to capture the dependen- then used to compute the RDF. Points with an RDF greater
cies between the continuous values. In a mixed attribute than a pre-defined threshold are considered outliers.
space, the dependence between the values with mixed Yu et al. [198] propose an outlier detection approach for
continuous and categorical attributes is captured by incre- detecting “outliers” that may occur within the loci of
mental maintenance of the covariance matrix. Thus, a data “normal” data, rather than being far from the “normal”
point can be considered to be an outlier if the expected points in data space, for both categorical and numerical
230 M.A.F. Pimentel et al. / Signal Processing 99 (2014) 215–249

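The density-ratio idea shared by LOF and the connectivity-based outlier factor of Tang et al. [196] can be illustrated with a short sketch. This is illustrative only, not the authors' exact formulation; the function name and toy data are ours.

```python
import numpy as np

def knn_outlier_factor(X, k=3):
    """Illustrative density-ratio outlier factor.

    For each point, compare the average distance to its k nearest
    neighbours with the average of the same quantity over those
    neighbours. Ratios well above 1 suggest the point is more
    isolated than its neighbourhood (an LOF-style score).
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances, excluding self-distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]     # indices of k nearest neighbours
    avg_knn = np.take_along_axis(d, nn, axis=1).mean(axis=1)
    # Average of the neighbours' own average k-NN distances.
    neigh_avg = avg_knn[nn].mean(axis=1)
    return avg_knn / neigh_avg

# A dense cluster plus one isolated point: the last point scores highest.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, size=(20, 2)), [[3.0, 3.0]]])
scores = knn_outlier_factor(X, k=3)
assert np.argmax(scores) == 20
```

Scores near 1 indicate a point whose isolation matches that of its neighbourhood; substantially larger values flag candidate outliers.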
data. Similarity between points is measured by a similarity graph, which is a weighted, connected, undirected graph. A weight for a pair of points specifies the similarity between them. A point can be considered to be an outlier if its similarity relationship with its neighbours is lower than the similarity relationships among its neighbours' neighbourhood. This use of similarity graphs overcomes the disadvantage of the traditional similarity measure (which assumes that outliers are far away from the “normal” points in data space) and can easily be applicable for categorical as well as numerical data. Several other variants of the LOF method have been proposed to handle different data types [199] and applied for detecting spatial outliers or anomalies in climate data [161,200], protein sequences [201], network intrusion [202,164], and video sensor data [169].

3.2. Clustering-based approaches

Clustering-based approaches to distance-based novelty detection include methods such as k-means clustering. In this general type of method, the “normal” class is characterised by a small number of prototype points in the data space. The minimum distance from a test point to the nearest prototype is often used to quantify abnormality. The methods use different approaches to obtain the prototype locations. The k-means clustering algorithm is perhaps the most popular method of clustering structured data due to its simplicity of implementation [180,181]. Kim et al. [179] use novelty detection to identify faulty wafers in semiconductor manufacturing. Among other methods, the authors employed the k-means clustering technique, GMMs, Parzen windows, and other non-probabilistic methods, such as one-class support vector machines and reconstruction methods based on principal component analysis. The k-means algorithm works by choosing k random initial cluster centres, computing the distances between these cluster centres and each point in the training set, and then identifying those points that are closest to each cluster centre. The corresponding cluster centres are moved to the centroid of those nearest points and the procedure is repeated. The algorithm converges when the cluster centres do not move from one iteration to the next. This procedure is often used as a pre-processing procedure [95].

Among the prototype-based clustering algorithms, we can identify many modifications of the k-means algorithm. Popular fuzzy-clustering algorithms are the fuzzy versions of the k-means algorithm with probabilistic and possibilistic descriptions of memberships: fuzzy c-means [203] and possibilistic c-means [204], respectively. The latter technique has recently been extended by Filippone et al. [177]. In the proposed extension, positive semidefinite kernels are used to map input patterns implicitly into a high-dimensional space, in which the mapped data are modelled by means of the possibilistic clustering algorithm. Wang [184] presents a hybrid approach that incorporates two kernel-based clustering methods (using the concepts of fuzzy and possibilistic c-means) for outlier identification and market segmentation.

Different clustering-based techniques have been proposed [188,178,205,182]. Yu et al. [188] propose an outlier detection approach based on a wavelet transform, which can be extended to detect outliers in datasets with different densities. This approach uses wavelets to transform the data and then find dense clusters in the transformed space. He et al. [178] present a new definition of a cluster-based local outlier, which takes into account both the size of a point's cluster and the distance between the point and its closest cluster. Each point is associated with a cluster-based local outlier factor, which is used to determine the likelihood of the point being an outlier. This approach partitions the data into clusters using a squeezer algorithm, which makes a single pass over the dataset and produces initial clustering results. The outlier factor is then computed for each point, and those points which have the largest factors are considered outliers. This approach is linearly scalable with respect to the number of data points and was found to work well with large datasets. Another technique that addressed computational efficiency was proposed by Sun et al. [182], in which an efficient indexing technique called CD-trees was used to partition data into clusters. Those points belonging to sparse clusters are declared anomalies.

Basu et al. [15] introduce a method for semisupervised clustering that employs Hidden Markov Random Fields (HMRFs) to use both labelled and unlabelled data in the clustering process. The method can be used with a number of distortion measures, including Bregman divergences (such as the Kullback–Leibler divergence) and directional measures. The authors propose an EM-based clustering algorithm, HMRF-KMEANS, that incorporates supervision in the form of pairwise constraints at all stages of the algorithm: initialisation, cluster assignment, and parameter estimation. The HMRF method led to improved results when applied to realistic textual datasets, in comparison to unsupervised clustering methods. The algorithm discussed in [206] seeks to minimise class differences between nearby points and maximise class differences between distant points. It was applied to the classification of representations of handwritten digits and a synthetic control time-series. A similar approach was applied to network intrusion detection tasks in [207], combining factor analysis and the Mahalanobis distance metric.

Clustering data that are represented by a sequence of individual symbols was considered by Yang and Wang [185], who describe a clustering algorithm called CLUSEQ that produces a set of overlapped clusters, and which is able to adjust automatically the number of clusters and the boundary used to separate “normal” sequences from outliers. The similarity measure uses the conditional probability distribution derived from sequences, where a variation of the suffix tree (the probabilistic suffix tree) is used to estimate the distribution. The CLUSEQ algorithm takes a training set of sequences and a set of tree parameters as input, from which it produces a set of clusters. An iterative process is used in which, for each iteration, a set of new clusters is generated from the set of unclustered sequences to augment the current set of clusters, which is followed by a sequential examination of every sequence to
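Prototype-based novelty scoring with k-means, as described above, can be sketched in a few lines. This is a simplified illustration under our own assumptions (toy data, fixed seed, no restarts); production implementations add better initialisation and multiple restarts.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: returns the cluster centres (prototypes)."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centre.
        labels = np.argmin(
            np.linalg.norm(X[:, None] - centres[None, :], axis=-1), axis=1)
        # Move each centre to the centroid of its assigned points
        # (keep the old centre if a cluster becomes empty).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centres[j] for j in range(k)])
        if np.allclose(new, centres):   # converged: centres stopped moving
            break
        centres = new
    return centres

def novelty_score(x, centres):
    """Minimum distance from a test point to the nearest prototype."""
    return np.linalg.norm(centres - x, axis=1).min()

# Two "normal" clusters; a point far from both receives a high score.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(5, 0.2, (50, 2))])
centres = kmeans(X, k=2)
assert novelty_score(np.array([0.1, 0.0]), centres) < 1.0
assert novelty_score(np.array([2.5, 8.0]), centres) > 1.0
```

Thresholding this minimum distance gives a simple one-class decision rule.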
evaluate its similarity to each cluster and to update its cluster membership. At the end of each iteration, a consolidation procedure is invoked to merge heavily overlapped clusters. Budalakoti et al. [174] propose a different outlier detection approach that efficiently clusters sequence data into groups and finds anomalous sequences that deviate from normal behaviour. A fast “normalised longest common subsequence” (nLCS) is used as the similarity measure for comparing sequences.

Syed et al. [183] explore k-NN and clustering-based novelty detection approaches to identify high-risk patients. For many clinical conditions, patients experiencing adverse outcomes comprise a small minority of the population. When evaluated with demographic and comorbidity data acquired from over 100,000 patients, the methods considered were able to identify patients at an elevated risk of mortality and morbidity following surgical procedures.

Barbará et al. [173] suggest the use of a bootstrapping technique that first separates normal data from outliers using frequent itemset mining. Data are windowed in time, and frequent itemsets then generated for each window. All itemsets which exist in more than one window are considered normal. Clusters were obtained using COOLCAT, a clustering tool for categorical data developed in [172]. This method was applied to intrusion detection tasks.

Ertöz et al. [208] explore a clustering algorithm based on a shared nearest-neighbour approach. This technique first finds the nearest neighbours of each point and then redefines the similarity between pairs of points in terms of how many nearest neighbours the two points share. Using this definition of similarity, the algorithm identifies “core points” and then builds clusters around the core points. The use of a shared nearest neighbour definition of similarity alleviates problems with varying densities and high dimensionality, and the use of core points handles problems with shape and size of the distribution. The number of clusters is automatically determined by the location and distribution of core points. Another novel aspect of the shared nearest neighbour clustering algorithm is that the resulting clusters do not contain all points, but contain only those points lying in regions of relatively uniform density. The authors apply this algorithm to the task of finding topics in collections of documents, for which it out-performed the k-means clustering method.

Zhang et al. [189] describe an unsupervised distance-based technique to identify global outliers in query-processing applications of sensor networks. The proposed technique reduces the communication overhead of sensor nodes using an aggregation tree. Each node in the tree transmits data to its parent after collecting all data sent from its children. The sink (also called the base station) uses the information from its children to identify the most likely outliers and “floods” these outliers for verification. If any node finds that it has two kinds of data which may modify the global result, it will send them to its parent in an appropriate time interval. This procedure is repeated until all nodes in the network agree on the results produced by the sink node. This technique considers only univariate data, and the aggregation tree used may not be stable due to the dynamic changes of the network topology [12].

Clifton et al. [175,176] apply the k-means clustering algorithm to condition monitoring of aerospace gas-turbine engines. Novelty scores z(x) are defined to be the number of standard deviations that a test point x lies from its closest cluster centre, relative to the distribution of all clusters.

Yong et al. [187,186] consider novelty detection in multiple-scene image sets. The framework starts with wildlife video frame image segmentation, followed by feature extraction and classification using the k-NN method. The labelled image blocks are then used to generate a co-occurrence matrix of object labels (called the block label co-occurrence matrix, BLCM), which represents the semantic context within the scene. Principal component analysis is used to perform dimensionality reduction, resulting in models for scene categories. The classification model is used to classify an image into scene classes; if it does not belong to any scene class, the image is classified as being “abnormal”. Yong et al. [186] assume that in the BLCM feature space, for each scene type, there is a dense cluster related to normal images, while novel images are sparsely distributed around these clusters. During training, the centroid of the BLCM for each image group is calculated. The distances to the centroid for all points in the same scene are computed, and a threshold based on the mean and standard deviation of the distances is determined. Test images are then assessed by calculating their BLCM feature distance to the trained one-class centres: if it is smaller than a defined threshold for that class, the images are accepted by that one-class classifier; otherwise they are rejected from that class. If the test image is rejected by all classes, the image is classified as being novel. This multiple one-class classification with a distance thresholding algorithm was compared with a pdf-based one-class classifier [126] and a one-class support vector machine [209]. The proposed algorithm was shown to perform the most successfully at the task of detecting novel wildlife scenes.

Zhou et al. [210] present a method for distributed novelty detection on simulation mesh data. Large-scale simulation datasets are typically located on multiple computers and cannot be merged due to communication overhead and computational inefficiency. The proposed method consists of three steps. In the first step, local models from all distributed data sources are built using clustering-based methods. Novelty scores for test points are based on the distance to the nearest cluster centre. In the second step, all local outliers are collected from distributed sites and shared with each site, and all local models are rebuilt. Finally, in the third step, the local outliers' novelty scores are computed using the ensemble of the local models' results from the previous step. The ensemble methods consider both quality criteria of local models acting on local points and diversity criteria (mutual information is used as a measure to indicate diversity) of local models acting on all local outliers to detect novelty in a global view.

There are other variants of the methods described above. Spinosa et al. [211] describe an online novelty and
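The cluster-distance scores of Clifton et al. [175,176] and the distance thresholds of Yong et al. [186] share a simple recipe: normalise the distance to the nearest prototype by the spread of training distances around that prototype. A hedged sketch of this idea (not the authors' exact formulations; the helper names and toy data are ours):

```python
import numpy as np

def fit_cluster_stats(X, centres):
    """Per cluster: standard deviation of member distances to its centre."""
    d = np.linalg.norm(X[:, None] - centres[None, :], axis=-1)
    labels = d.argmin(axis=1)
    return np.array([d[labels == j, j].std() for j in range(len(centres))])

def z_score(x, centres, stds):
    """Number of standard deviations from the closest cluster centre."""
    d = np.linalg.norm(centres - x, axis=1)
    j = d.argmin()
    return d[j] / stds[j]

# Two "normal" clusters; prototypes assumed known for this toy example.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(6, 0.5, (100, 2))])
centres = np.array([[0.0, 0.0], [6.0, 6.0]])
stds = fit_cluster_stats(X, centres)
assert z_score(np.array([0.2, -0.1]), centres, stds) < 3.0
assert z_score(np.array([3.0, 3.0]), centres, stds) > 3.0
```

A fixed cut-off such as three standard deviations then plays the role of the mean-plus-spread thresholds described in the text.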
drift detection algorithm that uses standard clustering methods to generate candidate clusters among examples that are not explained by the currently known concepts. Hassan et al. [212] propose a heuristic method based on a clustering approach in wireless sensor networks. Idé et al. [213] address the task of change-point analysis in correlated multi-sensor systems. Their approach is based on a neighbourhood preservation principle: if the system is working normally, the neighbourhood graph for each sensor is almost invariant with respect to fluctuations arising from experimental conditions. With this notion of a stochastic neighbourhood, the proposed method was able to compute novelty scores for each sensor. Onuma et al. [214] use clustering-based novelty detection for recommender systems. The authors apply a graph-based method to recommend items that are not in the user's current set of interests, but which lie in neighbouring areas of interest. This ensures novelty, and provides variety in the recommendations, which is seen favourably by users.

3.3. Method evaluation

Distance-based approaches do not require a priori knowledge of the data distribution and share some common assumptions with probabilistic approaches. Nearest neighbour-based techniques, however, rely on the existence of suitable distance metrics to establish the similarity between two data points, even in high-dimensional data spaces. Furthermore, most of them only identify novel data points globally and are not flexible enough to detect local novelty in data sets that have diverse densities and arbitrary shapes. Generally, in high-dimensional data sets it is computationally expensive to calculate the distance between data points, and as a result these techniques lack scalability. Clustering-based approaches are capable of being used in incremental models, i.e., new data points can be fed into the system and tested to identify novelty. New techniques have been developed to optimise the novelty detection process and reduce the time complexity with respect to the size of data. However, these techniques suffer from having to choose an appropriate value of cluster width and are also susceptible to the curse of dimensionality.

Probabilistic and distance-based approaches rely on similar assumptions. They attempt to characterise the area of the data space occupied by normal data, with test data being assigned a novelty score based on some sort of distance metric. Nearest-neighbour and clustering-based techniques require a distance measure computation between a pair of data points. These techniques, when applied to novelty detection, assume that the distance measure can discriminate between novel and normal data points. Probabilistic techniques typically fit a probability model to the given data and determine whether or not a test data point comes from the same model by assuming that normal data points occur in the so-called “high density regions” of the model. One of the main differences between these two types of approach is the computational complexity and scalability of the proposed techniques.

4. Reconstruction-based novelty detection

Reconstruction-based methods are often used in safety-critical applications for regression or classification purposes. They can autonomously model the underlying data, and when test data are presented to the system, the reconstruction error, defined to be the distance between the test vector and the output of the system, can be related to the novelty score. Neural networks and subspace-based methods can be trained in this way (Table 3).

Table 3
Examples of novelty detection methods using reconstruction-based approaches.

Neural networks (Section 4.1)
  Multi-layer perceptron: Augusteijn and Folkert [215] and Singh and Markou [216]
  Hopfield networks: Crook et al. [217]
  Autoassociative networks: Diaz and Hollmen [218], Hawkins et al. [219], Japkowicz [220], Manevitz and Yousef [221], Thompson et al. [222] and Williams et al. [223]
  Radial basis function: Bishop [21], Jakubek and Strasser [224], Li et al. [225] and Nairac et al. [110]
  Self-organising networks: Albertini and de Mello [226], Barreto and Aguayo [227], Deng and Kasabov [228], García-Rodríguez et al. [229], Hristozov et al. [230], Kit et al. [231], Marsland et al. [232,233], Ramadas et al. [234] and Wu et al. [235]
Subspace methods (Section 4.2): Chen and Malin [236,237], Günter et al. [238], Hoffmann [239], Lakhina et al. [240], Kassab and Alexandre [241], McBain and Timusk [242], Perera et al. [243], Ide and Kashima [244], Shyu et al. [245], Thottan and Ji [246], Toivola et al. [247] and Xiao et al. [248]

4.1. Neural network-based approaches

Several types of neural networks have been proposed for novelty detection, a review of which can be found in [27]. In this section, we will concentrate on more recent methods that use neural networks.

Augusteijn and Folkert [215] investigate the ability of the back-propagation neural network architecture (a multi-layer perceptron, or MLP) to detect novel points. One novelty detection approach uses a threshold on the highest output value and declares a point to be novel if this value remains below the threshold; a second approach calculates the distance between the output and all target points and classifies the test point as novel if the minimum distance is found to exceed a predefined
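The reconstruction-error principle shared by the methods in Table 3 can be sketched with a minimal linear autoencoder. This is a toy stand-in for the replicator and autoassociative networks of Section 4.1, not any author's exact model; the data, dimensions, and training details are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# "Normal" data lie near a one-dimensional subspace of 2-D space.
t = rng.normal(size=(200, 1))
v = np.array([[1.0, 1.0]]) / np.sqrt(2.0)
X = t @ v + 0.05 * rng.normal(size=(200, 2))

# Minimal linear autoencoder (2 -> 1 -> 2), trained by gradient
# descent to minimise the mean squared reconstruction error.
W1 = 0.1 * rng.normal(size=(1, 2))   # encoder weights
W2 = 0.1 * rng.normal(size=(2, 1))   # decoder weights
lr = 0.05
for _ in range(2000):
    H = X @ W1.T                     # hidden (compressed) representation
    E = H @ W2.T - X                 # reconstruction residual
    gW2 = 2.0 / len(X) * E.T @ H
    gW1 = 2.0 / len(X) * (E @ W2).T @ X
    W2 -= lr * gW2
    W1 -= lr * gW1

def reconstruction_error(x):
    """Novelty score: distance between an input and its reconstruction."""
    return float(np.linalg.norm((x @ W1.T) @ W2.T - x))

# Training data are reconstructed well; an off-subspace point is not.
assert np.mean([reconstruction_error(x) for x in X]) < 0.3
assert reconstruction_error(np.array([2.0, -2.0])) > 1.0
```

Points that the compressed model cannot reproduce receive large errors, which is exactly the quantity thresholded (or averaged into an outlier factor) by the networks discussed below.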
threshold. Both approaches were found to lead to a poor ability to identify novel data. The authors have also explored the applicability of the probabilistic neural network, which contains as many nodes as there are points in the training set, where the connections to these nodes are weighted with the feature values of the training data. Points belonging to the same category may first be clustered, and the cluster centres may then be used as initial connection weights. When presented with test data, each output unit, which can incorporate a prior probability and a cost of misclassification associated with the category, calculates a quasi-probability of the data belonging to that category. If the highest output value lies below a predefined threshold, then the pattern can be assumed to belong to a class not represented by the network. This method showed superior performance as an overall classifier when compared to the MLP and was able to identify novel patterns.

Hawkins et al. [219] and Williams et al. [223] present an outlier detection approach for large multivariate datasets based on the construction of a Replicator Neural Network (RNN). The RNN is an MLP which has the same number of input and output neurons (corresponding to the features in the dataset), and three hidden layers. The aim of the RNN is to reproduce the input points at the output layer with the minimum reconstruction error, after compression through the hidden layers (which contain a smaller number of nodes than the input and output layers). If a small number of input points are not reconstructed well (they have large reconstruction errors), these points can be considered as outliers. An outlier factor based on the average reconstruction error is used as a novelty score. These techniques have been used in several investigations [218–223]. The auto-associative network described in [222], also termed an autoencoder, computes the bitwise difference between input and output to highlight novel components of the input. Diaz and Hollmen [218] also use an auto-associative neural network, where the residual mean-square error is used to quantify novelty. The method was used to detect outliers in vibration data for fault detection in synchronous mechanical units.

Haggett et al. [249] present a dynamic predictive coding mechanism using a neural network model of circuits in the retina proposed in [250]. This latter model is a feed-forward network in which the connections may be modifiable synapses (weights). These modifiable weights are modulated according to an anti-Hebbian learning rule, which causes the modifiable synapses to weaken when the activity at the pre-synaptic and post-synaptic neurons is correlated, and to strengthen when the activity is anti-correlated. Hosoya et al. [250] demonstrate the operation of this network with a number of artificially generated visual environments. This network is used as the basis for the novelty detector proposed in [249], which uses dynamic predictive coding. Three evolutionary algorithms, including a genetic algorithm and the Neuro-evolution of Augmenting Topologies (NEAT), are used to optimise the structure of the network to improve its performance using stimuli from a number of artificially generated visual environments. The authors have demonstrated that the optimised network evolved by NEAT outperforms other evolutionary algorithms and genetic algorithm approaches.

A novelty detection method applied to region-segmented outdoor scenes in video sequences is proposed in [216,9]. Their approach uses a feature-selection mechanism to encode image regions, and an MLP acting as a classifier. The MLP is used to reject any input not similar to the training data. A rejection filter is used to classify test data as either known or unknown. The known data points are classified by the neural network into one of the known classes on the basis of a “winner takes all” strategy. The rejected data points are collected in a “bin” for further processing. Post-processing of this filtered output has the goal of identifying clusters. Clusters that represent novel data are manually labelled and a new network is trained. This method reduces multi-class classification into a number of binary classifications; i.e., a classification problem with 10 classes is decomposed into a set of 10 binary problems. One neural network per class is trained, where data are labelled as either belonging or not belonging to the class. In this approach, the innovation is that random rejects (data that were labelled as not belonging to the class) are artificially generated. Although the performance of this framework was demonstrated using video data, the number of random rejects to be generated requires further investigation.

Generalised radial basis function (RBF) neural networks have also been proposed in several different applications for novelty detection [21,110,225]. In this case, reverse connections from the output layer to the central layer are added, similar to a self-organising Bayesian classifier, which is capable of novelty detection. Each neuron in the central layer has an associated Normal distribution which is learned from the training data. Novel points have low likelihood with respect to these distributions and hence result in low values at each output node. In [110], a kernel is used to represent the distribution at each neuron, such that the distance of a test point from the nearest kernel centre can be determined and used to detect novelty. Jakubek and Strasser [224] propose a technique which uses neural networks with ellipsoid basis functions for fault detection. The advantage of these functions is that they can be fitted to the data with more accuracy than radially symmetric RBFs. The distribution of each cluster is represented by a kernel function. Results from experiments showed that the proposed network uses fewer basis functions than an RBF network of equivalent accuracy.

Crook et al. [217] applied Hopfield networks to detect novelty in a mobile robot's environment. A Hopfield network uses binary threshold nodes with recurrent connections between them. A global energy function for the network with symmetric connections is defined and used to determine if a test point presented to the network is novel or not. The value of the energy function is lower for “normal” points and higher for novel points. It was demonstrated that the method can be used to learn a model of an environment during exploration by a robot and then detect novel features in subsequently encountered environments.

Kohonen maps, also called Self-Organising Maps (SOMs), can be used for novelty detection [251]. The SOM is a neural network with a grid-like architecture that is primarily used
as an unsupervised technique for identifying clusters in a dataset and which, in effect, moves the position of the nodes in the feature space to represent these identified clusters. When normal data are used to train a SOM, it creates a kernel-based representation of normality which can be used for novelty detection. The Euclidean distance between a test point and nodes in the SOM is evaluated, and used to determine novelty. SOMs have been used to detect network intrusions [252,234]. One important characteristic of this type of neural network is that they are topology-preserving; i.e., the network preserves neighbourhood relationships between the data by mapping neighbouring inputs onto neighbouring nodes in the map. The “Kohonen SOM” is a static SOM with a fixed structure: the grid size and the number of nodes have to be determined a priori. This results in a significant limitation on the final mapping, as it is unlikely that the most appropriate structure is known beforehand. Several SOM variations, known as dynamic SOMs, and other neural network-based approaches, known as growing networks, have been introduced in the past to overcome these shortcomings. The latter include the “growing cell structures” model, the “incremental grid-growing” model, the “growing neural gas”, the “growing SOM” and the “evolving SOM” [253–256,228,232].

Marsland et al. [232] propose a self-organising network that Grows When Required (GWR). In this method, each node is associated with a subset of the input space, and the network is initialised with a small number of nodes randomly located in the input space, and which are unconnected. At each training iteration, a new input is presented to the network, and existing nodes are either (i) moved or removed to better represent the distribution of the training data (the adaptation process), or (ii) new nodes and connections are added to the network (the growing process). A new node is added when the “activity” of the best matching node (the node that best matches the input) is not sufficiently high. The activity of nodes is calculated using the Euclidean distance between the weights for the node and the input. Novelty detection can be performed by describing how often a node has fired before (i.e., how often it has been the “winning” node when presented with input data). This network was applied to different tasks, including robot sonar scans, medical diagnosis, and machine fault detection. This method was also successfully applied in real-time automated visual inspection using mobile robots [10], in which colour statistics are used to encode visual features. In [233], the GWR network is combined with habituation networks. This method is based on learning to ignore

image data and any available spatial data. Growing neural gas is an unsupervised incremental clustering algorithm. Given some input distribution in the feature space, the method creates a network of nodes, where each node has a position in the feature space. It is an adaptive algorithm in the sense that if the input distribution slowly changes over time, the network is able to adapt by moving the nodes to cover the new distribution. The closest node to a test point is found, an error distance between the two can be determined, and the error compared to a threshold to determine novelty.

García-Rodríguez et al. [229] address the ability of self-organising neural network models to manage real-time applications, using a modified learning algorithm for a growing neural gas network. The modification proposed aims to satisfy real-time temporal constraints in the adaptation of the network. The proposed learning algorithm can add multiple neurons per iteration, the number of which is controlled dynamically. The authors concluded that the use of a large number of neurons made it difficult to obtain a representation of the distribution of training data with good accuracy in real-time.

Albertini and de Mello [226] propose a network that integrates features from SOM, GWR, and adaptive resonance theory networks. When an input is presented, the network searches through stored categories for a match. If no match is found, then the input is considered to be novel. A ligand-based virtual screening method based on using SOMs for novelty detection is described in [230]. Wu et al. [235] present a case study in which an online fault learning method based on SOM techniques was adopted for use with mechanical maintenance systems.

Barreto and Aguayo [227] evaluate the performance of different static and temporal SOM-based algorithms for identifying anomalous patterns in time series. The methodology consists of computing decision thresholds from the distribution of quantisation errors produced by normal training data, which are then used for classifying incoming data samples. Results from the experiments conducted show that temporal variants of the SOM are more suitable for dealing with time-series data than static competitive neural networks.

4.2. Subspace-based approaches

A different type of reconstruction-based novelty detection (termed spectral methods in [24]) uses a combination of attributes to best describe the variability in the training
previously encountered “normal” data, so that novel inputs data. These methods assume that data can be projected or
are given more weight in the analysis, and become easier embedded into a lower dimensional subspace in which
to detect. This is achieved using the GWR network with “normal” data can be better distinguished from “abnor-
the addition of synapses connecting each node in the mal” data. Principal Components Analysis (PCA) is a
network to an output node. The algorithm was demon- technique for performing an orthogonal basis transforma-
strated with the task of mobile robot inspection of corridor tion of the data into a lower-dimensional subspace. The
environments, using inputs from sonar sensors and images number of features needed for effective data representa-
from a camera mounted on the robot. tion can thus be reduced. This technique can be used for
A different approach applied to data from a camera novelty detection by constructing a model of the distribu-
carried by a robot is taken by Kit et al. [231], who use tion of training data in the transformed space [257, ch. 10].
growing neural gas [256] to detect changes. The model The first few principal components of a dataset correspond
uses a growing neural gas network constructed using to vectors in the data space that account for most of the
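The subspace idea just introduced can be made concrete with a minimal PCA novelty detector: training data are projected onto the first few principal components, and the reconstruction error of a test point with respect to that subspace serves as its novelty score. A sketch in Python with NumPy (the percentile-based threshold and the synthetic data are illustrative assumptions, not part of any specific method reviewed here):

```python
import numpy as np

def fit_pca_detector(X_train, q=2, quantile=99.0):
    """Fit the mean, a top-q principal subspace, and a novelty threshold."""
    mu = X_train.mean(axis=0)
    Xc = X_train - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = principal directions
    V = Vt[:q].T                                       # (d, q) basis of the subspace
    resid = Xc - (Xc @ V) @ V.T                        # training reconstruction residuals
    thresh = np.percentile(np.linalg.norm(resid, axis=1), quantile)
    return mu, V, thresh

def novelty_score(x, mu, V):
    """Distance from x to its reconstruction in the principal subspace."""
    xc = x - mu
    return np.linalg.norm(xc - V @ (V.T @ xc))

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 5))                  # "normal" data lie near a 2-D plane in 5-D
X_train = rng.normal(size=(500, 2)) @ A + 0.05 * rng.normal(size=(500, 5))

mu, V, thresh = fit_pca_detector(X_train, q=2)

x_normal = np.array([1.0, -0.5]) @ A          # a point on the "normal" plane
Q, _ = np.linalg.qr(A.T, mode="complete")     # columns 2..4 of Q are orthogonal to the plane
x_novel = x_normal + 3.0 * Q[:, 2]            # displaced off the plane

s_norm = novelty_score(x_normal, mu, V)
s_nov = novelty_score(x_novel, mu, V)
print(f"normal: {s_norm:.3f}  novel: {s_nov:.3f}  threshold: {thresh:.3f}")
```

Equivalently, the score computed here is the norm of the projection onto the minor components, which is the quantity exploited in [13].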
Dutta et al. [13] describe an outlier detection algorithm that uses approximate principal components. The last principal component enables identification of points which deviate significantly from the “correlation structure” of the data. Thus, a normal point that exhibits a similar correlation structure to that of the training data will have a low value for such projections, and an outlier that deviates from the correlation structure will have a large value. This approach was applied to novelty detection in astronomy catalogues.

Shyu et al. [245] propose a novelty detection approach based on PCA, which can be seen as a robust estimator of the correlation matrix of normal data. Two functions of principal components to identify outliers are sequentially executed. The first function uses the major principal components to detect extreme points with large variances and covariances depending on the subset of original attributes. The second function, as in [13], uses the minor principal components to further identify the remainder of the outliers, which have different correlation structures from normal data. Experiments with network intrusion data indicated that the proposed scheme performed better than techniques based on clustering approaches, and that it can work in an unsupervised manner. Other authors have applied this PCA-based technique to network intrusion detection [240,246] and to the detection of anomalies in spacecraft components [258].

Hoffmann [239] explores kernel PCA in the context of the novelty detection task. Kernel PCA [259] extends the standard PCA to non-linear data distributions by mapping points into a higher-dimensional feature space before performing PCA. In this feature space, using a kernel, the originally linear operations of PCA are performed with a non-linear mapping (the “kernel trick”). This method has been applied to astronomical data for the prediction of stellar populations in space [14]. In [239], a principal-component subspace in an infinite-dimensional feature space describes the distribution of training data, and the reconstruction error of a test point with respect to this subspace is used as a measure of novelty. A strategy to improve the convergence of the kernel algorithm for iterative kernel PCA is described in [238]. Novelty detection with kernel PCA achieved better classification of novelty when applied to hand-written-digit and breast cancer databases, compared with a one-class support vector machine and a Parzen window density estimator.

Kernel PCA is not robust to outliers in the “normal” training set due to the properties of the L2 norm used in the optimisation part of the training procedure [248]. Kwak [260] proposes a PCA method based on the L1 norm. Xiao et al. [248] extend this work to L1 norm-based kernel PCA. The proposed method was applied to novelty detection, and benefited from the robustness of the L1 norm to outliers.

Perera et al. [243] have developed a novelty detection method based on a recursive dynamic PCA approach for gas sensor array applications. Their method is based on a sliding window-based variance analysis algorithm. Under normal conditions, a certain variance distribution characterises sensor signals; however, in the presence of a new source of variance, the associated PCA decomposition changes. Sensor drift and other effects may be taken into account because the model is adaptive, and is updated in a recursive manner with minimal computational effort. The technique was applied to signals from oil vapour leakages in air compressors.

Further examples of reconstruction-based novelty detection with graph-based data have been proposed in recent years [261–263,244]. Ide and Kashima [244] approach the problem of Web-based systems as a weighted graph, using eigenvectors of adjacency matrices (that represent the activities of all of the services) from a time-series of graphs to detect novelty. At each time point, the principal component of the matrix is chosen as the activity vector for the given graph (which is represented as an adjacency matrix for a given time). The “normal” time-dependencies of the data are captured using the principal left-singular vector of the matrix which contains these time-series activity vectors as column vectors. For a new test graph, the angle between its activity vector (the principal component vector for the test adjacency matrix) and the principal left-singular vector obtained from previous graphs is computed and used to calculate a novelty score for the test graph. If the angle changes by more than some threshold, an abnormality is declared to be present. In a similar approach, Sun et al. [263] perform Compact Matrix Decomposition on the adjacency matrix of each graph in a sequence of graphs. An approximate version of the original matrix is constructed from the decompositions, and a time-series of approximation (or residual) errors between the original matrix and the approximation matrix is then determined and used to detect abnormalities.

Kassab and Alexandre [241] introduce an Incremental data-driven Learning of Novelty Detector Filter (ILoNDF) for one-class classification with application to high-dimensional noisy data. The novelty detection filter is implemented with a recurrent network of neuron-like adaptive elements. It consists of n fully connected neurons, where n is the dimensionality of the input vectors, such that all neurons are both input and output neurons. The weights associated with the feedback connections provide the variable internal state of the network, and are updated after presentation of an input vector using an anti-Hebbian learning rule. The network continuously integrates information relating to the distribution of the training data and their co-occurrence dependencies. Because it operates online without repeated training, the proposed method does not require extensive computational resources. Experiments conducted involving text categorisation tasks showed that ILoNDF tends to be more robust, is less affected by initial settings, and outperforms methods such as auto-associative neural networks and PCA-based models.

Chatzigiannakis et al. [264] present an approach that fuses data gathered from different nodes in a distributed wireless sensor network. An offline analysis step creates a model of normality; testing is then performed in real-time. PCA is used to identify test data that are considered “normal”, where anomalies tend to result in large variations in the PCA residual.
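The kernel-PCA detector of Hoffmann [239], discussed earlier in this subsection, applies the same residual idea to non-linear structure. A simplified sketch of the feature-space reconstruction error in Python with NumPy (the RBF kernel width, the number of components, the percentile threshold, and the ring-shaped synthetic data are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between row-vector sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

class KernelPCANovelty:
    """Feature-space reconstruction error after kernel PCA (a Hoffmann-style sketch)."""

    def __init__(self, gamma=2.0, n_components=8):
        self.gamma, self.q = gamma, n_components

    def fit(self, X):
        self.X = X
        n = len(X)
        K = rbf_kernel(X, X, self.gamma)
        self.col_mean = K.mean(axis=0)            # (1/n) * column sums of K
        self.all_mean = K.mean()
        J = np.full((n, n), 1.0 / n)
        Kc = K - J @ K - K @ J + J @ K @ J        # centred Gram matrix
        w, V = np.linalg.eigh(Kc)
        idx = np.argsort(w)[::-1][:self.q]        # top-q eigenpairs
        self.alpha = V[:, idx] / np.sqrt(w[idx])  # scale so that lambda * ||a||^2 = 1
        return self

    def score(self, Z):
        kz = rbf_kernel(Z, self.X, self.gamma)    # (m, n)
        # squared feature-space distance to the mean; k(z, z) = 1 for the RBF kernel
        d2 = 1.0 - 2.0 * kz.mean(axis=1) + self.all_mean
        # centred kernel rows, then projections onto the q principal directions
        kz_c = kz - kz.mean(axis=1, keepdims=True) - self.col_mean + self.all_mean
        f = kz_c @ self.alpha                     # (m, q)
        return d2 - (f ** 2).sum(axis=1)          # reconstruction error

rng = np.random.default_rng(1)
theta = rng.uniform(0.0, 2.0 * np.pi, size=400)
X_train = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(400, 2))

det = KernelPCANovelty(gamma=2.0, n_components=8).fit(X_train)
thresh = np.percentile(det.score(X_train), 99)
scores = det.score(np.array([[1.0, 0.0], [0.0, 0.0]]))  # on the ring vs. its centre
print(f"on-ring: {scores[0]:.3f}  centre: {scores[1]:.3f}  threshold: {thresh:.3f}")
```

Here a point at the centre of the ring-shaped “normal” class receives a much larger score than points on the ring, which is the kind of non-linear structure that linear PCA cannot capture.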
The procedure of [264] can be computationally intense, and so a sliding window is used, with the principal components being re-estimated only when the deviation in one or more correlation coefficients (of all the monitored metrics) exceeds a threshold. The approach was demonstrated using meteorological data collected from a distributed set of sensor nodes.

Chen et al. [236,237] introduce the community anomaly detection system (CADS), an unsupervised algorithm for detecting computer access threats based on the access logs of collaborative environments. Collaborative system users tend to form community structures based on the subjects accessed (e.g., patients' records are typically viewed by healthcare providers). CADS is a “hybrid” method that consists of two schemes: it uses singular value decomposition to infer communities from relational networks of users, and k-NN to establish sets of nearest neighbours. The latter is used as a model to determine if users have deviated from the behaviour of existing communities. In [237], CADS is extended to MetaCADS to account for the semantics of subjects (e.g., patient diagnoses). The framework was empirically evaluated using three months of access logs from the electronic health record system of a large medical centre. When the number of “illicit” users is small, MetaCADS was shown to be the best-performing model of those considered, but as the number of illicit users grows, the original CADS algorithm was most effective.

Lämsä and Raiko [265] demonstrate how non-linear factor analysis (NFA), as a neural network-based method, can be used for separating structural changes from environmental and operational variation, and thereafter also for damage detection. Here, the relationships between the observations are described in terms of a few underlying but unobservable factors. The goal is to eliminate the adverse effects of these underlying factors from the observations, resulting in new variables that can be used in damage detection. NFA is based on the assumption that the underlying factors are independent and normally distributed, which is typically not the case. A non-linear mapping from hidden factors to observations is modelled by a two-layer MLP. This method was applied to vibration data, and it was shown that damage detection via elimination of environmental and operational effects from damage features is feasible.

McBain and Timusk [242] propose a feature reduction technique for novelty detection. The method is similar to multiple discriminant analysis in that it attempts to find a subspace that maximises the difference between the average distance of the “normal” class and the average distance of the “abnormal” class. The effect of the reduced subspace on classification was shown to be better than that obtained from other dimensionality-reduction methods (such as PCA and kernel PCA), for machinery monitoring data.

Timusk et al. [266] describe a strategy for vibration-based online detection of faults in machinery. A selection of seven different types of novelty detection algorithms was implemented and compared, including a PCA-based approach similar to those described previously, probabilistic methods (such as a single Gaussian distribution density estimator), and clustering methods (such as k-NN and k-means). Results showed that the PCA-based approach was the best-performing fault detector.

In [247], three dimensionality-reduction approaches are assessed for novelty detection: the non-linear Curvilinear Component Analysis (CCA), classical PCA, and a computationally inexpensive Random Projections algorithm. As with Kohonen's SOM, the CCA aims to reproduce the topology of the original data in a projection subspace, but without fixing the configuration of the topology. It is an adaptive algorithm for non-linear dimensionality reduction, which minimises a cost function based on inter-point distances in both input and output space. Since the topology cannot be entirely reproduced in the projection subspace, which has a lower dimension than the original subspace, the local topology is favoured to the detriment of the global topology. Random Projections is a computationally inexpensive method of linear dimensionality reduction. It embeds a set of points from the original space in a randomly selected subspace whose dimensionality is logarithmic with respect to the dimension of the original space, such that pairwise distances between points before and after projection change only by a small factor. Four classification approaches were evaluated: three based on probabilistic models, such as the GMM, and one based on a nearest-neighbour method. Experiments showed that CCA was the best projection method for the novelty detectors assessed in this work. The authors mention that this agrees with expected results, since the CCA method should be more powerful due to its non-linear nature. Nevertheless, PCA was able to compete for datasets of lower dimensionality, when using a nearest-neighbour classifier.

4.3. Method evaluation

Reconstruction-based approaches belong to a very flexible class of methods that are trained to model the underlying data distribution without a priori assumptions on the properties of the data. Neural networks require the optimisation of a pre-defined number of parameters that define the structure of the model, and their performance may be very sensitive to these model parameters. They may therefore be very difficult to train in high-dimensional spaces. Moreover, networks that use constructive algorithms, in which the structure of the model is allowed to grow, suffer from the additional problem of having to select the most effective training method to enable the integration of new units into the existing model structure, and an appropriate stopping criterion (for when to stop adding new units). With subspace-based approaches, appropriate values must be selected for the parameters which control the mapping to a lower-dimensional space. It is difficult to determine which are the key attributes, and it is computationally expensive to estimate the correlation matrix of normal patterns accurately.

5. Domain-based novelty detection

Domain-based methods require a boundary to be created based on the structure of the training dataset. These methods are typically insensitive to the specific sampling and density
of the target class, because they describe the target class boundary, or the domain, and not the class density. Class membership of unknown data is then determined by their location with respect to the boundary. As with two-class SVMs, novelty detection SVMs (most commonly termed “one-class SVMs” in the literature) determine the location of the novelty boundary using only those data that lie closest to it (in the transformed space); i.e., the support vectors. All other data from the training set (those that are not support vectors) are not considered when setting the novelty boundary. Hence, the distribution of data in the training set is not considered, which is seen as “not solving a more general problem than is necessary” [209,267].

SVMs are a popular technique for forming decision boundaries that separate data into different classes. The original SVM is a network that is ideally suited for binary pattern classification of data that are linearly separable. The SVM uses a hyperplane that maximises the separating margin between two classes. The training points that lie near the boundary defining this separating margin are called support vectors. Since the introduction of the original idea, several modifications and improvements have been made. The Robust Support Vector Machines (RSVMs) algorithm [268,269] addresses the over-fitting problem introduced by noise in the training dataset. In this approach, an averaging technique (in the form of a class centre) is incorporated into the standard SVM, which makes the decision surface smoother, controlling the amount of regularisation automatically. The number of support vectors of RSVMs is often fewer than that of standard SVMs, which leads to a faster execution speed. Experiments using system intrusion detection data showed that RSVMs can provide a reasonable ability to detect intrusions in the presence of noise. Nevertheless, the approach does not address the fundamental issue of the unbalanced nature of “normal” and “abnormal” training examples that typically exists in novelty detection applications.

Li et al. [270] propose an algorithm for online novelty detection with kernels in a Reproducing Kernel Hilbert Space. The technique differs from the original SVM formulation in that training data are processed sequentially, and the Lagrangian dual equation is solved by maximising a quadratic equation in a given interval. The proposed approach has a simple update procedure with a much lower computational cost than the equivalent procedure for conventional SVMs.

Diehl et al. [8] have implemented a real-time novelty detection mechanism for video surveillance. Their method is based on the extraction of monochromatic spatial features in image sequences to represent moving objects. A classifier based on a generalised SVM was trained offline with models of people and cars, and was later used to reject previously unseen moving objects, such as bicycles, vans, and trucks. The sequences of training images are processed to derive class labels for training the sequence classifier. Each training image sequence has a corresponding class label distribution, which is simply the relative frequencies of each class label in the class label sequence. Using these class label distributions, a logistic linear classifier is constructed to partition the class label distribution space. Image sequences assigned to the same class are ranked based on class likelihood. This likelihood monotonically increases with increasing novelty in the sequence, which allows the user to focus on those examples that cause the greatest uncertainty of classification for the classifier.

SVMs have been used for novelty detection in two related approaches [271,209,267]. The idea of the one-class SVM approach proposed by Schölkopf et al. [209] is to define the novelty boundary in the feature space corresponding to a kernel, by separating the transformed training data from the origin in the feature space, with maximum margin. This approach requires fixing a priori the percentage of positive data allowed to fall outside the description of the “normal” class. This makes the one-class SVM more tolerant to outliers in the “normal” training data. However, setting this parameter strongly influences the performance of this approach (as discussed in [271]). Another approach, the support vector data description (SVDD) method, proposed by Tax and Duin [267], defines the novelty boundary as being the hypersphere with minimum volume that encloses all (or most) of the “normal” training data. The SVDD automatically optimises the model parameters by using artificially generated unlabelled data uniformly distributed in a hypersphere around the “normal” class. This causes the method to struggle with applications involving high-dimensional spaces. Novelty is assessed by determining if a test point lies within the hypersphere. In order to address the problem that the transformed data are not spherically distributed, Campbell and Bennett [272] use different kernels with linear programming optimisation methods, rather than the quadratic programming approaches typically used with SVMs. Other approaches to novelty detection have recently been proposed which are based on the use of either the SVDD or the one-class SVM (Table 4).

Table 4
Examples of novelty detection methods using domain-based approaches.

Support vector data description (Section 5.1): Campbell and Bennett [272], Le et al. [273,274], Liu et al. [275,276], Peng and Xu [277], Tax and Duin [267], Wu and Ye [278] and Xiao et al. [279].

One-class support vector machine (Section 5.2): Clifton et al. [280,281,3], Evangelista et al. [282], Gardner et al. [283], Hardoon et al. [284], Hayton et al. [285], Heller et al. [286], Lazarevic et al. [287], Lee and Cho [288], Ma and Perkins [289,290], Manevitz and Yousef [271], Rabaoui et al. [291], Schölkopf et al. [209] and Zhuang and Dai [292].

5.1. Support vector data description approaches

Some extensions to the SVDD approach have recently been proposed to improve the margins of the hyperspherically shaped novelty boundary.
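As context for these extensions, the baseline behaviour of such domain-based detectors is easy to demonstrate in library form. A brief sketch using scikit-learn's OneClassSVM (assumed available), which implements the formulation of Schölkopf et al. [209] and, with an RBF kernel, yields a boundary closely related to the SVDD; the parameter nu fixes a priori the fraction of training data permitted to fall outside the boundary, as discussed above (the synthetic data are an illustrative assumption):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Train on the "normal" class only: a tight cluster around (2, 2)
X_train = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(300, 2))

# nu upper-bounds the fraction of training points treated as outliers
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

X_test = np.array([[2.0, 2.1],   # near the training cloud
                   [0.0, 0.0]])  # far from it
labels = clf.predict(X_test)            # +1 = inside the boundary, -1 = novel
scores = clf.decision_function(X_test)  # continuous, signed novelty score
print(labels, scores)
```

Where a ranking of test points is needed rather than a hard in/out decision, decision_function provides the continuous score directly.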
The first extension is proposed in [278], where the authors present a “small sphere and large margin” approach that surrounds the normal data with a hypersphere such that the margin from any outliers to the hypersphere is maximised. A further extension of the method is proposed by Le et al. [273], whose method aims to maximise (i) the margin between the surface of the hypersphere and abnormal data, and (ii) the margin between that surface and the normal data, while the volume of the hypersphere is minimised.

Xiao et al. [279] propose to use a number of hyperspheres to describe the normal data set. However, the method is heuristic, and no demonstration that the multi-hypersphere approach can provide a better description of the data is provided. In [274], a more detailed multi-hypersphere approach to SVDD is described, in which a set of hyperspheres with different centres and radii is considered. The optimisation problem for this approach is solved by introducing slack variables and applying an iterative algorithm that consists of two alternating steps: one that calculates the radii, centres of hyperspheres, and slack variables, and another that determines the hypersphere membership of each data point. Experimental results over 28 data sets showed that the multi-hypersphere SVDD performed better than the original SVDD in all cases. In [275], a fast SVDD method to improve the speed of the algorithm is proposed. The original SVDD centre is spanned by the images of support vectors in the feature space. Unlike traditional methods which try to compress a kernel expansion into one with fewer terms, the proposed fast SVDD directly finds the pre-image of a feature vector, and then uses a simple relationship between this feature vector and the SVDD centre to update the position of the centre. The decision function in this case contains only one kernel term, and thus the decision boundary of the fast SVDD is only spherical in the original space. Hence, the run-time complexity of the fast SVDD decision function is no longer linear in the support vectors, but is a constant, independent of the size of the training set. Results obtained from experiments using several real-world data sets (including a critical fabrication process for thin-film transistor liquid crystal display manufacturing [276]) are encouraging. Peng and Xu [277] also address the speed of the algorithm by proposing an efficient SVDD. The authors argue that in the fast SVDD, using a Gaussian kernel, the decision hypersphere being a sphere in the input space is not suitable for many real applications. The efficient SVDD first finds critical points using the kernel fuzzy c-means cluster technique [293] and then uses the images of these points to re-express the centre of the SVDD. The resulting decision function is linear in the number of clusters. Computational comparisons with the one-class SVM, SVDD and fast SVDD in terms of prediction performance and learning time have shown that the proposed efficient SVDD achieves a faster test speed.

5.2. One-class support vector machine approaches

Gardner et al. [283] apply a one-class SVM to the detection of seizures in patients. The intracranial EEG time-series is mapped into corresponding sequences of novelty scores by classifying short-time, energy-based statistics computed from one-second windows of data. The model is trained using epochs of normal EEG. Epochs containing seizure activity exhibit changes in the distribution in feature space that increase the empirical outlier fraction, allowing seizure events to be detected.

Ma and Perkins [290] extend the one-class SVM approach for temporal sequences. The method unfolds time-series into a phase space using a time-delay embedding process, and projects all the vectors from a phase space to a subspace, thereby avoiding the bias created by extremely large or extremely small values. The one-class SVM is then applied to the projected data in the subspace. A different approach for online novelty detection in temporal sequences is presented by the same authors [289]. The latter algorithm is based on support vector regression (SVR), in which a linear regression function is constructed in the high-dimensional feature space of the kernel. Although it has good generalisation properties and can handle high-dimensional data efficiently, the SVR training algorithm requires retraining whenever a new sample is observed, which is not efficient for an online algorithm. An incremental SVR training algorithm is proposed to perform efficient updating of the model whenever a sample is added to or removed from the training set. To perform novelty detection, the authors define a matching function to determine how well a test sequence matches the normal model.

Roth [294,295] proposes the one-class kernel Fisher discriminant classifier to overcome the “main conceptual shortcoming” of one-class SVM classifiers, which is that the expected fraction of outliers has to be specified in advance. The method relates kernelised one-class classification and Gaussian density estimation in the induced feature space. With respect to classification, the proposed model inherits the simple complexity control mechanism obtained by using regularisation techniques. The relation to Gaussian density estimation makes it possible to formalise the notion of novel objects by quantifying deviations from the Gaussian model. The parameters of the model are selected using a likelihood-based cross-validation procedure.

Hayton et al. [285] describe static and dynamic novelty detection methods for jet engine health monitoring. The one-class SVM for static novelty detection proposed in [209] is used to construct a model of normality to characterise the distribution of energy across the vibration spectra acquired from a three-shaft engine. A Kalman filter is then used as a linear dynamic model, with changes in test data identified using the normalised squared innovations of the Kalman filter. Both static and dynamic models were shown to perform well in detecting anomalous behaviour in jet engine vibration spectra.

Sotiris et al. [296] link SVM classifiers to Bayesian linear models to model the posterior class probability of test data. A PCA decomposition of the multivariate training data is performed, which defines a number of orthonormal subspaces that can be used to estimate the joint class probability. A kernel density estimator is then computed for the projected data in two of the subspaces to estimate the likelihood of the “normal” class, from which the “abnormal” class is estimated.
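The early stages of such a pipeline, an orthonormal projection followed by a density estimate of the “normal” class, can be sketched as follows (using scikit-learn, assumed available; the two-component projection, bandwidth, percentile threshold, and synthetic data are illustrative choices rather than those of [296]):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Correlated "normal" training data in 5-D
X_train = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))

pca = PCA(n_components=2).fit(X_train)           # orthonormal subspace
Z_train = pca.transform(X_train)                 # projected training data

kde = KernelDensity(bandwidth=1.0).fit(Z_train)  # likelihood model of the "normal" class
thresh = np.percentile(kde.score_samples(Z_train), 1)  # flag the low-likelihood tail

def is_novel(x):
    """Low projected log-likelihood => treated as 'abnormal'."""
    return kde.score_samples(pca.transform(np.atleast_2d(x)))[0] < thresh

x_typical = X_train.mean(axis=0)                 # centre of the training cloud
x_far = pca.inverse_transform([[50.0, 0.0]])[0]  # far out along the first component
print(is_novel(x_typical), is_novel(x_far))
```

A full treatment along the lines of [296] would additionally train a classifier on the resulting likelihoods; the sketch stops at the density-estimation stage.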
In [296], an SVM classifier is then constructed, and a logistic distribution is finally used to map SVM outputs onto posterior probabilities.

Clifton et al. [280,281] investigate the use of a one-class SVM using multivariate combustion data for the prediction of combustion instability. Wavelet analysis is used for feature extraction, from which detail coefficients are used as two-dimensional features. Novelty scores computed using the one-class SVM approach are obtained from each of the input time-series, and different classifier combination strategies are studied.

Heller et al. [286] consider an intrusion detection system which monitors accesses to the Microsoft Windows Registry using Registry Anomaly Detection (RAD). During normal computer activity, a certain set of registry keys is typically accessed by Windows programs. A one-class SVM is compared with a probabilistic algorithm which uses a Dirichlet-based hierarchical prior to smooth the distribution and account for the likelihoods of unobserved elements in sparse datasets by adjusting their probability mass based on the number of examples seen during training. The probabilistic algorithm was able to discriminate accurately between “normal” and “abnormal” examples. The authors suggest that a more effective selection of the SVM kernel should be used.

Lee and Cho [288] compare the performance of a one-class SVM with that of an auto-associative neural network, and results obtained from the analysis of six benchmark datasets show that the former performs consistently better than the latter. The one-class SVM has been used for novelty detection in: functional magnetic resonance imaging data [284]; audio recordings [291]; text data [292]; medical data to identify patient deterioration in vital signs [3]; and network intrusion detection [287].

Evangelista et al. [282] illustrate the impact of high dimensionality on kernel methods and, specifically, on the one-class SVM. It is shown that variance contributed by meaningless noisy variables confounds learning methods. The authors propose a framework to overcome this problem, which involves exploring subspaces of the data, training a separate model for each subspace, and then fusing the decision variables produced by the test data for […]

[…] identification of extreme data. Surface normal vectors are used to determine whether a point is extreme or not; i.e., a novel point is detected if it is located outside the region formed by the closed data surface. Experimental results demonstrated that the proposed method performs with high accuracy in detecting the novel class as well as identifying known classes. The performance of the proposed method, however, was not compared with that of other current approaches.

Sofman et al. [11] present an anytime novelty detection algorithm that deals with noisy and redundant high-dimensional feature spaces. A supervised dimensionality-reduction technique, multiple discriminant analysis, is used, which causes the projected data to form clusters that are as compact as possible for within-class data, while being as far away as possible from cluster centres corresponding to other classes. The algorithm determines the influence of all previously seen novel data on a test point; if the accumulated influences exceed a novelty threshold, then the test point is identified as being novel, and is used for novelty prediction with subsequently observed test points. “Normal” data are not stored, as they are assumed to have minimal impact on future novelty detection. This method was validated using data acquired by two mobile robots, and it was shown that the algorithm was able to identify all major unique objects (vegetation, barrels, fence, etc.) with a relatively small number of false positives.

5.3. Method evaluation

Domain-based approaches determine the location of the novelty boundary using only those data that lie closest to it, and do not rely on the properties of the distribution of data in the training set. A drawback of these methods is the complexity associated with the computation of the kernel functions. Although some extensions have been proposed to overcome this problem, the choice of the appropriate kernel function may also be problematic. Additionally, it is not easy to select values for the parameters which control the size of the boundary region.

6. Information-theoretic novelty detection
each subspace using fuzzy logic aggregators. Experiments
conducted using synthetic data sets showed that learning in Information theoretic methods compute the informa-
the subspaces of high-dimensional data typically outperforms tion content of a dataset using measures such as entropy,
learning in the high-dimensional data space as a whole. relative entropy, etc. These methods assume that novelty
Munoz and Moguerza [297] apply the one-class neigh- significantly alters the information content of the other-
bour machine to the problem of estimating high-density wise “normal” dataset. Typically, metrics are calculated
regions for the distribution of “normal” training data in the using the whole dataset and then that subset of points
data space. This method is a block-based procedure that whose elimination from the dataset induces the biggest
provides a binary decision function indicating whether or difference in the metric is found. This subset is then
not a point is a member of a minimum volume set defined assumed to consist of novel data.
around the “normal” data. The algorithm replaces the task He et al. [299,300] present a local-search heuristic
of estimating the density at each point by using a simpler approach, which involves entropy analysis, to identify
measure that asymptotically preserves the order induced outliers in categorical data. Entropy in information theory
by the pdf. Numerical experiments showed that the is a measure of the uncertainty associated with a random
proposed method performed consistently better than the variable, and the authors use an entropy function to
one-class SVM in estimating minimum volume sets. measure the degree of disorder of the remaining dataset
Li [298] performs novelty detection by constructing after removal of high-entropy points. A point is considered
a closed decision surface around the “normal” training to be an outlier if the entropy of the dataset decreases after
data through the derivation of surface normal vectors and its removal, compared with the entropy of the dataset
240 M.A.F. Pimentel et al. / Signal Processing 99 (2014) 215–249
after removal of all previous outlier candidates. This procedure is repeated until k outliers are identified. Experimental results showed that this approach scales well with the size of the training sets.

Ando [301] considers the task of identifying clusters of “atypical” objects which strongly contrast with the bulk of the data in terms of their distribution, using an information bottleneck measure. The latter is derived from rate distortion theory and is used for unsupervised problems using two criteria associated with lossy data compression: rate and distortion. The former refers to the level of compression measured by the mutual information, while the latter refers to the disruption caused by compression. The algorithm was evaluated using a text classification task, and was shown to outperform a related method called Bregman Bubble Clustering, which is an extension of one-class clustering for multiple clusters.

Keogh et al. [302] propose parameter-free methods for data mining tasks (including clustering, anomaly detection and classification) based on compression theory. Sub-sequences of a given sequence of continuous observations are defined using a sliding window. Each sub-sequence is then compared with the entire sequence using a compression-based dissimilarity method, the Kolmogorov complexity, which is a measure of the computational resources needed to specify an object (i.e., it is a measure of randomness of strings based on their information content). This metric cannot be computed in the general case, and so the size of the compressed file that contains the string is used to approximate the measure. Empirical tests with time-series datasets showed that the approach is competitive with respect to others, such as the SVM, for anomaly detection tasks. Keogh et al. [303] propose a related technique (called HOT SAX) to solve the same problem for continuous time-series. Sub-sequences are extracted as before and then the Euclidean distance of each sub-sequence to its closest non-overlapping sub-sequences is calculated. This distance is then used as a novelty score for the sub-sequences. A similar approach is also applied to the domain of medical data in [304], in which the problem of finding the sequence that is least similar to all other sequences is addressed. The authors propose the use of the Haar wavelet transformation [305].

Gamon [306] investigates feature sets derived from a graph representation of sentences and sets of sentences using the information-theoretic metric of Kullback–Leibler divergence. The author shows that a highly connected graph produced using sentence-level term distances and pointwise mutual information can serve as a source to extract features for novelty detection.

Filippone and Sanguinetti [307] propose a new method to control the false-positive rate in novelty detection. Their method estimates the information content of the training set including and excluding the test point; the two resulting distributions are compared by computing the Kullback–Leibler divergence. The rationale for doing this is that this method explicitly takes into account the size of the training set in establishing a threshold for novelty. The method was shown to perform well in univariate and multivariate Gaussian cases, as well as in the mixture of Gaussians case, but it is not clear whether it can be extended to non-parametric methods. More recently, the same authors [308] have considered the online identification of event-based novelty in stationary linear autoregressive models with Gaussian noise. The authors use a perturbative approximation to the information-theoretic measure introduced previously in [307,309] for independent and identically distributed data. Again, they consider the Kullback–Leibler divergence between the estimates of the distributions of the stochastic term obtained before and after the test point arrives. This procedure yields a modified F-test, which more accurately incorporates the variability introduced by finite sample size effects.

Itti and Baldi [310] propose a Bayesian definition of surprise to capture subjective aspects of sensory information. “Surprise” measures how data affect an observer, in terms of differences between posterior and prior beliefs about the world, i.e., only data observations which substantially affect the observer's beliefs yield surprise. This difference or mismatch between expectations of the observer and the consequent perception of reality is calculated using the Kullback–Leibler divergence, which evaluates the statistics of visual attributes in specific scene types, or descriptions and layout of the scene.

6.1. Method evaluation

Information-theoretic approaches to novelty detection typically do not make any assumptions about the underlying distribution of the data. They require a measure that is sensitive enough to detect the effects of novel points in the dataset. The main drawback with this type of technique is the selection of the information-theoretic measure. Typically, these measures can detect the presence of novel data points only if there is a significantly large number of novel data points present in the dataset. Thus, the performance of such techniques is very dependent on the choice of the information-theoretic measure. These techniques are also computationally expensive, although approximations have been proposed to deal with this problem. Finally, it may be difficult to associate a novelty score with a test point using an information theoretic-based method.

7. Application domains

Novelty detection has many practical real-life applications in different domains, and it is of crucial importance in applications that involve large datasets acquired from critical systems. These include the detection of faults in complex industrial systems, of structural damage, and of failure in electronic security systems. The main areas of application can be broadly categorised into six main domains, as shown in Table 5. Other domains of application not mentioned in the table include speech recognition, mobile robotics, astronomical data analysis and environmental monitoring. In the next sections we briefly discuss several applications of novelty detection, including the concept of novel data and the importance of novelty detection in each domain.
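The entropy-based local search of He et al. [299,300], discussed in Section 6, is simple enough to sketch in a few lines. The following is an illustrative reconstruction rather than the authors' implementation (the function names and toy data are ours): per-attribute entropies of a categorical dataset are summed, and the k records whose removal most reduces that total are flagged as outliers.

```python
from collections import Counter
from math import log2

def dataset_entropy(rows):
    """Sum of per-attribute Shannon entropies of a categorical dataset."""
    if not rows:
        return 0.0
    n = len(rows)
    total = 0.0
    for col in range(len(rows[0])):
        counts = Counter(row[col] for row in rows)
        total += -sum((c / n) * log2(c / n) for c in counts.values())
    return total

def entropy_outliers(rows, k):
    """Greedily remove the k records whose removal most lowers entropy."""
    rows = list(rows)
    outliers = []
    for _ in range(k):
        # candidate whose removal yields the lowest remaining entropy
        best_i = min(range(len(rows)),
                     key=lambda i: dataset_entropy(rows[:i] + rows[i + 1:]))
        outliers.append(rows.pop(best_i))
    return outliers

# Toy categorical data: one record disagrees with the rest.
data = [("a", "x")] * 9 + [("b", "y")]
print(entropy_outliers(data, 1))  # the atypical record is flagged first
```

This brute-force variant recomputes the entropy for every candidate removal, so it scales quadratically; the local-search heuristic of [299,300] exists precisely to avoid that cost on large datasets.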
Table 5
Examples of novelty detection methods in the six main application domains covered in this review. Ticks indicate the method categories represented (1: probabilistic; 2: distance-based; 3: reconstruction-based; 4: domain-based; 5: information-theoretic).

Application domain | Method (category 1–5) | References
Electronic IT security | ✓ ✓ ✓ ✓ ✓ | Helali [42], Heller et al. [286], Jyothsna et al. [7], Lakhina et al. [240], Peng et al. [151] and Yeung and Ding [83]
Healthcare informatics, medical diagnostics | ✓ ✓ ✓ ✓ | Clifton et al. [3], Lin et al. [304], Quinn and Williams [2], Solberg and Lahti [45] and Tarassenko et al. [94,95]
Industrial monitoring and damage detection | ✓ ✓ ✓ ✓ | Clifton et al. [175,176], Lämsä and Raiko [265], Surace and Worden [5] and Tarassenko et al. [4]
Image processing, video surveillance | ✓ ✓ ✓ ✓ | Pokrajac et al. [169], Ramezani et al. [92], Singh and Markou [216,9] and Yong et al. [186,187]
Text mining | ✓ ✓ ✓ ✓ ✓ | Ando [301], Basu et al. [15], Ertöz et al. [208], Manevitz and Yousef [221], Zhang et al. [124] and Zhuang and Dai [292]
Sensor networks | ✓ ✓ ✓ | Chatzigiannakis et al. [264], Hassan et al. [212], Janakiram et al. [71], Phuong et al. [152], Subramaniam et al. [93] and Zhang et al. [12]
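Several of the information-theoretic methods reviewed in Section 6 score a test point by comparing distributions estimated with and without it under the Kullback–Leibler divergence (e.g., Filippone and Sanguinetti [307]). A minimal univariate-Gaussian sketch of that idea follows; it is illustrative only (the function names and data are ours, and the method of [307] additionally corrects the novelty threshold for training-set size).

```python
from math import log

def gaussian_kl(mu1, var1, mu2, var2):
    """Closed-form KL(N(mu1, var1) || N(mu2, var2)) for univariate Gaussians."""
    return 0.5 * (log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def fit_gaussian(xs):
    """Maximum-likelihood mean and variance of a sample."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

def kl_novelty_score(train, x):
    """Novelty of x: divergence between the Gaussian fitted on the
    training set alone and the one fitted with the test point included."""
    mu1, var1 = fit_gaussian(train)
    mu2, var2 = fit_gaussian(train + [x])
    return gaussian_kl(mu1, var1, mu2, var2)

train = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05]
print(kl_novelty_score(train, 1.0) < kl_novelty_score(train, 5.0))  # True
```

An unsurprising test point barely perturbs the fitted distribution, so its score stays near zero, whereas an outlier shifts the mean and inflates the variance, producing a large divergence.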
7.1. Electronic IT security

Novelty detection has generated much research in the field of electronic IT security systems, in which the goals include network intrusion detection and fraud detection. The former refers to the detection of malicious activity in a computer-related system. Frequent attacks on computer systems may result in systems being disabled, or even completely collapsing. The identification of such intrusions could help the discovery of malicious programs in the operating system and also detect unauthorised access to computer network systems. Intrusions may be associated with behaviours different from those of normal users.

Fraud detection involves the identification of criminal activities that occur in insurance claims, credit card purchases, mobile phone usage, and financial transactions, among others. The purchasing behaviour of people who steal credit cards (for example) is usually different from that of the owners of the cards. The identification of such changes in credit card use may prevent a subsequent period of fraud activity. The data used for constructing novelty detection models typically comprise several features, such as user ID, amount of money spent, time between consecutive card transactions, purchase records, and geographic location.

7.2. Healthcare informatics/medical diagnostics and monitoring

Healthcare informatics and medical diagnostics are an important domain of application of novelty detection approaches. Patient records with unusual symptoms or test results may indicate potential health problems for that patient. Ideally, the identification of unusual data should be able to discriminate between instrumentation or recording errors and clinically relevant changes in the condition of the patient, so that timely medical intervention may occur in the latter case. The data typically consist of records that have several types of features: patient age, weight, vital signs (such as heart rate), physiological signals (such as the electrocardiogram), blood test results, and medical image data.

7.3. Industrial monitoring and damage detection

Industrial assets deteriorate over time due to usage and normal wear. Such deterioration has to be identified early to prevent further escalation and losses, and for optimising machine performance while reducing maintenance and repair costs. High-value machinery is usually instrumented with sensors that are dedicated to monitoring their operation (including structural defects). The data collected by these sensors typically include temperature, pressure, and vibration amplitude.

7.4. Image processing/video surveillance

Novelty detection has been extensively applied to recognising novel objects in images and video streams. Specifically, extracting novel data from video streams is gaining attention because of the availability of large amounts of video data, and because of the lack of automated methods for extracting important details from such media. Detecting novel events in security or surveillance applications, in which there are particularly large streams of seemingly unimportant video data, is a very important task.

7.5. Text mining

The novelty detection problem applied to text data seeks an automatic means of detecting novel topics, new interesting events, or new stories in a given collection of documents or news articles. The data are typically very high-dimensional and very sparse, and comprise simple bag-of-word features and features derived from sophisticated linguistic representations.

7.6. Sensor networks

Sensor networks usually consist of a large number of small, low-cost sensor nodes distributed over a large area with one or more “sink” nodes gathering readings from sensor nodes. The sensor nodes are integrated with sensing, processing, and wireless communication capabilities.
Each node is equipped with a wireless radio transceiver, a small microcontroller, a power source, and multi-type sensors for recording temperature, pressure, sound and vibration. Novelty detection is applied to sensor networks to capture sensor faults and/or malicious attacks on the network.

8. Conclusion

The main goal of novelty detection is to construct classifiers when only one class is well-sampled and well-characterised by the training data. Novelty detection is an important learning paradigm and has drawn significant attention within the research community, as shown by the increasing number of publications in this field.

We have presented a review of the current state-of-the-art in novelty detection. We observe that a precise definition of novelty detection is difficult to achieve, nor is it possible to suggest what an “optimal” method of novelty detection would be. The variety of methods employed is a consequence of the wide variety of practical and theoretical considerations that arise from novelty detection in real-world datasets, such as the availability of training data, the type of data (including its dimension, continuity, and format), and the application domain investigated. It is perhaps because of this great variety of considerations that there is no single universally applicable novelty detection algorithm.

There are several promising directions for further research in novelty detection, which are mainly associated with the open challenges that need to be tackled for the effective operation of the approach in growing domains of applicability. We observe that novelty detection algorithms can broadly be divided into five different categories, depending mainly on the assumptions made about the nature of the training data. Each category of methods discussed in this paper has its own strengths and weaknesses, and faces different challenges for complex datasets. Distance-based methods, which include nearest-neighbour and clustering-based approaches, require the definition of an appropriate distance measure for the given data, but are unable to deal with high-dimensional data efficiently, because distance measures in high dimensions are not able to differentiate between normal and abnormal data points. Also, such methods are typically heuristic and require manual selection of parameters, removing the possibility of using them for automatic construction of a model of normality (though approaches have been suggested for parameter selection using semi-automated techniques). Reconstruction-based methods are very flexible and typically address high-dimensionality problems, with no a priori assumptions about the properties of the data distribution. However, they require the optimisation of a pre-defined number of parameters that define the structure of the model, and may also be very sensitive to these model parameters. Furthermore, when constructive algorithms are used (in which the structure of the model is allowed to grow), two important problems arise: the selection of the most effective training method to enable the integration of new units into the existing structure, and a stopping criterion for when to stop adding new units.

Domain-based methods determine the location of the novelty boundary using only those data that lie closest to it, and do not make any assumption about the data distribution. By focusing on the decision boundary, these methods are often influenced by outliers in the training set and they also depend on the choice of a suitable scaling for the features. In contrast, probabilistic methods make use of the distribution of the training data to determine the location of the novelty boundary. They are transparent methods, meaning that their outputs can be analysed using standard numerical techniques. However, the performance of such methods is limited when the size of the training set is very small. In general, the problem encountered when applying density methods to sparsely populated training sets is that there is little control over the inherent variability introduced by the sparsity of the training data; i.e., the estimated quantiles can differ substantially from the true quantiles of the distribution. Information-theoretic methods require a measure that is sensitive enough to detect the effects of novel points in the dataset. Although they do not make any assumptions about the underlying distribution of the data, the performance of such methods is highly dependent on the choice of the information-theoretic measure, and it may be difficult to associate a novelty score with a test point using an information theoretic-based method.

The computational complexity of these methods is also an important aspect. Generally, probabilistic, reconstruction-based, and domain-based methods have lengthy training phases, but with rapid testing. For many applications, this is not a problem as models can be trained offline, while testing is required to be in real time. On the other hand, distance-based and information theoretic-based methods, in general, are computationally expensive in the test phase, which may be an important limitation in real-world settings.

Acknowledgements

MAFP was supported by the RCUK Digital Economy Programme grant number EP/G036861/1 (Oxford Centre for Doctoral Training in Healthcare Innovation) and FCT – Fundação para a Ciência e Tecnologia under the grant SFRH/BD/79799/2011. DAC was supported by a Royal Academy of Engineering Research Fellowship, the Balliol Interdisciplinary Institute, the Maurice Lubbock Memorial Fund, and the Wellcome Trust and the EPSRC under grant number WT 088877/Z/09/Z. LC was supported by the NIHR Biomedical Research Centre Programme, Oxford.

References

[1] L. Tarassenko, P. Hayton, N. Cerneaz, M. Brady, Novelty detection for the identification of masses in mammograms, in: Proceedings of the 4th International Conference on Artificial Neural Networks, IET, 1995, pp. 442–447.
[2] J. Quinn, C. Williams, Known unknowns: novelty detection in condition monitoring, Pattern Recognit. Image Anal. 4477 (2007) 1–6.
[3] L. Clifton, D. Clifton, P. Watkinson, L. Tarassenko, Identification of patient deterioration in vital-sign data using one-class support vector machines, in: Proceedings of the Federated Conference on
Computer Science and Information Systems (FedCSIS), IEEE, 2011, pp. 125–131.
[4] L. Tarassenko, D. Clifton, P. Bannister, S. King, D. King, Novelty Detection, John Wiley & Sons, Ltd, 2009, pp. 1–22 (Chapter 35).
[5] C. Surace, K. Worden, Novelty detection in a changing environment: a negative selection approach, Mech. Syst. Signal Process. 24 (4) (2010) 1114–1128.
[6] A. Patcha, J. Park, An overview of anomaly detection techniques: existing solutions and latest technological trends, Comput. Netw. 51 (12) (2007) 3448–3470.
[7] V. Jyothsna, V.V.R. Prasad, K.M. Prasad, A review of anomaly based intrusion detection systems, Int. J. Comput. Appl. 28 (7) (2011) 26–35.
[8] C. Diehl, J. Hampshire, Real-time object classification and novelty detection for collaborative video surveillance, in: Proceedings of the International Joint Conference on Neural Networks, IJCNN'02, 2002, vol. 3, pp. 2620–2625.
[9] M. Markou, S. Singh, A neural network-based novelty detector for image sequence analysis, IEEE Trans. Pattern Anal. Mach. Intell. 28 (10) (2006) 1664–1677.
[10] H. Vieira Neto, U. Nehmzow, Real-time automated visual inspection using mobile robots, J. Intell. Robotic Syst. 49 (3) (2007) 293–307.
[11] B. Sofman, B. Neuman, A. Stentz, J. Bagnell, Anytime online novelty and change detection for mobile robots, J. Field Robot. 28 (4) (2011) 589–618.
[12] Y. Zhang, N. Meratnia, P. Havinga, Outlier detection techniques for wireless sensor networks: a survey, IEEE Commun. Surv. Tutor. 12 (2) (2010) 159–170.
[13] H. Dutta, C. Giannella, K. Borne, H. Kargupta, Distributed top-k outlier detection from astronomy catalogs using the DEMAC system, in: Proceedings of the 7th SIAM International Conference on Data Mining, IEEE, 2007.
[14] H. Escalante, A comparison of outlier detection algorithms for machine learning, in: Proceedings of the International Conference on Communications in Computing, Citeseer, 2005.
[15] S. Basu, M. Bilenko, R. Mooney, A probabilistic framework for semi-supervised clustering, in: Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), ACM, 2004, pp. 59–68.
[16] D. Barber, Bayesian Reasoning and Machine Learning, Cambridge University Press, 2012.
[17] C. Sammut, G. Webb, Encyclopedia of Machine Learning, Springer Reference, Springer, 2011.
[18] M. Moya, M. Koch, L. Hostetler, One-class classifier networks for target recognition applications, in: Proceedings of the World Congress on Neural Networks, International Neural Network Society, 1993, pp. 797–801.
[19] H. He, E. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1263–1284.
[20] H.-J. Lee, S. Cho, The novelty detection approach for different degrees of class imbalance, in: I. King, J. Wang, L.-W. Chan, D. Wang (Eds.), Neural Information Processing, Lecture Notes in Computer Science, vol. 4233, Springer, Berlin/Heidelberg, 2006, pp. 21–30.
[21] C. Bishop, Novelty detection and neural network validation, in: Proceedings of the IEEE Conference on Vision, Image and Signal Processing, vol. 141, IET, 1994, pp. 217–222.
[22] G. Ritter, M. Gallegos, Outliers in statistical pattern recognition and an application to automatic chromosome classification, Pattern Recognit. Lett. 18 (6) (1997) 525–539.
[23] I. Merriam-Webster, Merriam-Webster – an Encyclopedia Britannica company, May 2012. URL 〈http://www.merriam-webster.com/dictionary/novel/〉.
[24] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: a survey, ACM Comput. Surv. (CSUR) 41 (3) (2009) 1–58.
[25] V. Barnett, T. Lewis, Outliers in Statistical Data, Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics, Wiley and Sons, 1994.
[26] M. Markou, S. Singh, Novelty detection: a review – part 1: statistical approaches, Signal Process. 83 (12) (2003) 2481–2497.
[27] M. Markou, S. Singh, Novelty detection: a review – part 2: neural network based approaches, Signal Process. 83 (12) (2003) 2499–2521.
[28] S. Marsland, Novelty detection in learning systems, Neural Comput. Surv. 3 (2003) 157–195.
[29] V. Hodge, J. Austin, A survey of outlier detection methodologies, Artif. Intell. Rev. 22 (2) (2004) 85–126.
[30] M. Agyemang, K. Barker, R. Alhajj, A comprehensive survey of numeric and symbolic outlier mining techniques, Intell. Data Anal. 10 (6) (2006) 521–538.
[31] Z. Bakar, R. Mohemad, A. Ahmad, M. Deris, A comparative study for outlier detection techniques in data mining, in: Proceedings of the IEEE Conference on Cybernetics and Intelligent Systems, IEEE, 2006, pp. 1–6.
[32] S. Khan, M. Madden, A survey of recent trends in one class classification, in: L. Coyle, J. Freyne (Eds.), Artificial Intelligence and Cognitive Science, Lecture Notes in Computer Science, vol. 6206, Springer, Berlin/Heidelberg, 2010, pp. 188–197.
[33] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley, NY, USA, 2001.
[34] C. Bishop, Pattern Recognition and Machine Learning, vol. 4, Springer, New York, 2006.
[35] A. Modenesi, A. Braga, Analysis of time series novelty detection strategies for synthetic and real data, Neural Process. Lett. 30 (1) (2009) 1–17.
[36] V. Chandola, A. Banerjee, V. Kumar, Outlier Detection: A Survey, Technical Report 07-017, University of Minnesota, 2007.
[37] J. Kittler, W. Christmas, T. de Campos, D. Windridge, F. Yan, J. Illingworth, M. Osman, Domain anomaly detection in machine perception: a system architecture and taxonomy, IEEE Trans. Pattern Anal. Mach. Intell. 99 (2013) 1.
[38] A.M. Bartkowiak, Anomaly, novelty, one-class classification: a comprehensive introduction, Int. J. Comput. Inf. Syst. Ind. Manage. Appl. 3 (2011) 61–71.
[39] Y. Gatsoulis, E. Kerr, J. Condell, N. Siddique, T. McGinnity, Novelty detection for cumulative learning, in: Proceedings of the Conference on Towards Autonomous Robotic Systems, 2010, pp. 62–67.
[40] E. Kerr, Y. Gatsoulis, N.H. Siddique, J.V. Condell, T.M. McGinnity, Brief overview of novelty detection methods for robotic cumulative learning, in: Proceedings of the 21st National Conference on Artificial Intelligence and Cognitive Science, 2010, pp. 171–180.
[41] D. Miljkovic, Review of novelty detection methods, in: Proceedings of the 33rd International Convention (MIPRO), IEEE, 2010, pp. 593–598.
[42] R. Helali, Data mining based network intrusion detection system: a survey, Novel Algoritm. Tech. Telecommun. Netw. (2010) 501–505.
[43] F.E. Grubbs, Procedures for detecting outlying observations in samples, Technometrics 11 (1) (1969) 1–21.
[44] C. Aggarwal, P. Yu, Outlier detection with uncertain data, in: Proceedings of the SIAM International Conference on Data Mining, 2008, pp. 483–493.
[45] H. Solberg, A. Lahti, Detection of outliers in reference distributions: performance of Horn's algorithm, Clin. Chem. 51 (12) (2005) 2326–2332.
[46] C. Chow, On optimum recognition error and reject tradeoff, IEEE Trans. Inf. Theory 16 (1) (1970) 41–46.
[47] D.W. Scott, Frontmatter, John Wiley & Sons, Inc., 2008.
[48] D. Filev, F. Tseng, Real time novelty detection modeling for machine health prognostics, in: Proceedings of the Annual Meeting of the North American Fuzzy Information Processing Society (NAFIPS), IEEE, 2006, pp. 529–534.
[49] D. Filev, F. Tseng, Novelty detection based machine health prognostics, in: International Symposium on Evolving Fuzzy Systems, 2006, pp. 193–199.
[50] A. Flexer, E. Pampalk, G. Widmer, Novelty detection based on spectral similarity of songs, in: Proceedings of 6th International Conference on Music Information Retrieval, 2005, pp. 260–263.
[51] J. Ilonen, P. Paalanen, J. Kamarainen, H. Kalviainen, Gaussian mixture pdf in one-class classification: computing and utilizing confidence values, in: Proceedings of the 18th International Conference on Pattern Recognition (ICPR), vol. 2, IEEE, 2006, pp. 577–580.
[52] J. Larsen, Distribution of the Density of a Gaussian Mixture, Technical Report, Informatics and Mathematical Modelling, DTU, 2003.
[53] P. Paalanen, J. Kamarainen, J. Ilonen, H. Kälviäinen, Feature representation and discrimination based on Gaussian mixture model probability densities – practices and algorithms, Pattern Recognit. 39 (7) (2006) 1346–1358.
[54] N. Pontoppidan, J. Larsen, Unsupervised condition change detection in large diesel engines, in: Proceedings of the IEEE 13th Workshop on Neural Networks for Signal Processing, NNSP'03, IEEE, 2003, pp. 565–574.
[55] X. Song, M. Wu, C. Jermaine, S. Ranka, Conditional anomaly detection, IEEE Trans. Knowl. Data Eng. 19 (5) (2007) 631–645.
[56] F. Zorriassatine, A. Al-Habaibeh, R. Parkin, M. Jackson, J. Coy, Novelty detection for practical pattern recognition in condition monitoring of multivariate processes: a case study, Int. J. Adv. Manuf. Technol. 25 (9) (2005) 954–963.
[57] D. Clifton, L. Clifton, P. Bannister, L. Tarassenko, Automated novelty detection in industrial systems, Adv. Comput. Intell. Ind. Syst. 116 (2008) 269–296.
[58] D. Clifton, S. Hugueny, L. Tarassenko, A comparison of approaches to multivariate extreme value theory for novelty detection, in: Proceedings of the IEEE/SP 15th Workshop on Statistical Signal Processing, IEEE, 2009, pp. 13–16.
[59] D. Clifton, S. Hugueny, L. Tarassenko, Novelty detection with multivariate extreme value statistics, J. Signal Process. Syst. 65 (3) (2011) 371–389.
[60] D. Clifton, S. Hugueny, L. Tarassenko, Pinning the tail on the distribution: a multivariate extension to the generalised Pareto distribution, in: IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2011, pp. 1–6.
[61] D. Clifton, L. Clifton, S. Hugueny, D. Wong, L. Tarassenko, An extreme function theory for novelty detection, IEEE J. Sel. Top. Signal Process. 7 (1) (2013) 28–37.
[62] A. Hazan, J. Lacaille, K. Madani, Extreme value statistics for vibration spectra outlier detection, in: Proceedings of the 9th International Conference on Condition Monitoring and Machinery Failure Prevention Technologies, 2012.
[63] S. Hugueny, D. Clifton, L. Tarassenko, Novelty detection with multivariate extreme value theory, part II: an analytical approach to unimodal estimation, in: Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing, IEEE, 2009, pp. 1–6.
[64] S. Roberts, Novelty detection using extreme value statistics, in: Proceedings of the IEEE Conference on Vision, Image and Signal Processing 146 (3) (1999) 124–129.
[65] S. Roberts, Extreme value statistics for novelty detection in biomedical data processing, in: Proceedings of the IEEE Conference
[80] C. Williams, J. Quinn, N. McIntosh, Factorial switching Kalman filters for condition monitoring in neonatal intensive care, Neural Inf. Process. (2006) 1513–1520.
[81] W. Wong, A. Moore, G. Cooper, M. Wagner, Rule-based anomaly pattern detection for detecting disease outbreaks, in: Proceedings of the National Conference on Artificial Intelligence, Menlo Park, CA; Cambridge, MA; London, AAAI Press; MIT Press; 1999, 2002, pp. 217–223.
[82] W. Wong, A. Moore, G. Cooper, M. Wagner, Bayesian network anomaly pattern detection for disease outbreaks, in: Proceedings of the 20th International Conference on Machine Learning, vol. 20, AAAI Press, 2003, pp. 808–815.
[83] D.-Y. Yeung, Y. Ding, Host-based intrusion detection using dynamic and static behavioral models, Pattern Recognit. 36 (1) (2003) 229–243.
[84] X. Zhang, P. Fan, Z. Zhu, A new anomaly detection method based on hierarchical HMM, in: Proceedings of the 4th International Conference on Parallel and Distributed Computing, Applications and Technologies, IEEE, 2003, pp. 249–252.
[85] P. Angelov, An approach for fuzzy rule-base adaptation using on-line clustering, Int. J. Approx. Reason. 35 (3) (2004) 275–289.
[86] Y. Bengio, M. Monperrus, Non-local manifold tangent learning, Adv. Neural Inf. Process. Syst. 17 (2005) 129–136.
[87] Y. Bengio, H. Larochelle, P. Vincent, Non-local manifold parzen windows, Adv. Neural Inf. Process. Syst. 18 (2006) 115–123.
[88] D. Erdogmus, R. Jenssen, Y. Rao, J. Principe, Multivariate density estimation with optimal marginal parzen density estimation and gaussianization, in: Proceedings of the 14th IEEE Signal Processing
on Science, Measurement and Technology, vol. 147, IET, 2000, Society Workshop on Machine Learning for Signal Processing, IEEE,
pp. 363–367. 2004, pp. 73–82.
[66] H. Sohn, D.W. Allen, K. Worden, C.R. Farrar, Structural damage [89] A. Kapoor, K. Grauman, R. Urtasun, T. Darrell, Gaussian processes
classification using extreme value statistics, J. Dyn. Syst. Meas. for object categorization, Int. J. Comput. Vis. 88 (2) (2010) 169–188.
Control 127 (1) (2005) 125–132. [90] M. Kemmler, E. Rodner, J. Denzler, One-class classification with
[67] S. Sundaram, D. Clifton, I. Strachan, L. Tarassenko, S. King, Aircraft Gaussian processes, in: Asian Conference on Computer Vision
engine health monitoring using density modelling and extreme (ACCV), vol. 6493, 2011, pp. 489–500.
value statistics, in: Proceedings of the 6th International Conference [91] H. Kim, J. Lee, Pseudo-density estimation for clustering with Gaussian
on Condition Monitoring and Machine Failure Prevention Technol- processes, Adv. Neural Netw. (ISNN) 3971 (2006) 1238–1243.
ogies, 2009. [92] R. Ramezani, P. Angelov, X. Zhou, A fast approach to novelty
[68] R. Gwadera, M. Atallah, W. Szpankowski, Markov models for detection in video streams using recursive density estimation, in:
identification of significant episodes, in: Proceedings of 5th Proceedings of the 4th International IEEE Conference Intelligent
SIAM International Conference on Data Mining, 2005, pp. Systems, IS0 08, IEEE, vol. 2, 2008, pp. 14–22.
404–414. [93] S. Subramaniam, T. Palpanas, D. Papadopoulos, V. Kalogeraki,
[69] R. Gwadera, M. Atallah, W. Szpankowski, Reliable detection of D. Gunopulos, Online outlier detection in sensor data using non-
episodes in event sequences, Knowl. Inf. Syst. 7 (4) (2005) parametric models, in: Proceedings of the 32nd International
415–437. Conference on Very Large Databases, VLDB Endowment, 2006,
[70] A. Ihler, J. Hutchins, P. Smyth, Adaptive event detection with time- pp. 187–198.
varying poisson processes, in: Proceedings of the 12th ACM [94] L. Tarassenko, A. Hann, A. Patterson, E. Braithwaite, K. Davidson,
International Conference on Knowledge Discovery and Data Mining V. Barber, D. Young, Biosign™: multi-parameter monitoring for
(SIGKDD), ACM, 2006, pp. 207–216. early warning of patient deterioration, in: Proceedings of the 3rd
[71] D. Janakiram, V. Adi Mallikarjuna Reddy, A. Phani Kumar, Outlier IEE International Seminar on Medical Applications of Signal Pro-
detection in wireless sensor networks using Bayesian belief net- cessing, IET, 2005, pp. 71–76.
works, in: Proceedings of the 1st International Conference on [95] L. Tarassenko, A. Hann, D. Young, Integrated monitoring and
Communication System Software and Middleware (Comsware), analysis for early warning of patient deterioration, Br. J. Anaesth.
IEEE, 2006, pp. 1–6. 97 (1) (2006) 64–68.
[72] H.-J. Lee, S. Roberts, On-line novelty detection using the Kalman [96] P. Vincent, Y. Bengio, Manifold parzen windows, Adv. Neural Inf.
filter and extreme value theory, in: Proceedings of the 19th Process. Syst. 15 (2002) 825–832.
International Conference on Pattern Recognition (ICPR), 2008, [97] D. Yeung, C. Chow, Parzen-window network intrusion detectors,
pp. 1–4. in: Proceedings of the 16th International Conference on Pattern
[73] P. McSharry, T. He, L. Smith, L. Tarassenko, Linear and non-linear Recognition, vol. 4, IEEE, 2002, pp. 385–388.
methods for automatic seizure detection in scalp electro- [98] D. Dasgupta, N. Majumdar, Anomaly detection in multidimensional
encephalogram recordings, Med. Biol. Eng. Comput. 40 (4) (2002) data using negative selection algorithm, in: Proceedings of the
447–461. Congress on Evolutionary Computation (CEC), vol. 2, IEEE, 2002,
[74] P. McSharry, Detection of dynamical transitions in biomedical pp. 1039–1044.
signals using nonlinear methods, in: Knowledge-Based Intelligent [99] F. Esponda, S. Forrest, P. Helman, A formal framework for positive
Information and Engineering Systems, Springer, 2004, pp. 483–490. and negative detection schemes, IEEE Trans. Syst. Man Cybern.
[75] S. Ntalampiras, I. Potamitis, N. Fakotakis, Probabilistic novelty Part B: Cybern. 34 (1) (2004) 357–373.
detection for acoustic surveillance under real-world conditions, [100] J. Gómez, F. González, D. Dasgupta, An immuno-fuzzy approach to
IEEE Trans. Multimed. 13 (4) (2011) 713–719. anomaly detection, in: 12th IEEE International Conference on Fuzzy
[76] A. Pinto, A. Pronobis, L. Reis, Novelty detection using graphical Systems (FUZZ 0 03), vol. 2, 2003, pp. 1219–1224.
models for semantic room classification, Prog. Artif. Intell. 7026 [101] F. González, D. Dasgupta, Anomaly detection using real-valued
(2011) 326–339. negative selection, Genet. Program. Evolvable Mach. 4 (4) (2003)
[77] Y. Qiao, X. Xin, Y. Bin, S. Ge, Anomaly intrusion detection method 383–403.
based on HMM, Electron. Lett. 38 (13) (2002) 663–664. [102] D. Taylor, D. Corne, An investigation of the negative selection
[78] J. Quinn, C. Williams, N. McIntosh, Factorial switching linear algorithm for fault detection in refrigeration systems, Artif.
dynamical systems applied to physiological condition monitoring, Immune Syst. 2787 (2003) 34–45.
IEEE Trans. Pattern Anal. Mach. Intell. 31 (9) (2009) 1537–1551. [103] G.J. McLachlan, K.E. Basford, Mixture Models: Inference and Appli-
[79] C. Siaterlis, B. Maglaris, Towards multisensor data fusion for dos cations to Clustering, vol. 1, Marcel Dekker, New York, 1988.
detection, in: Proceedings of the ACM Symposium on Applied [104] Y. Agusta, D. Dowe, Unsupervised learning of gamma mixture
Computing, SAC ’04, ACM, New York, NY, USA, 2004, pp. 439–446. models using minimum message length, in: M.H. Hamza (Ed.),
M.A.F. Pimentel et al. / Signal Processing 99 (2014) 215–249 245

Proceedings of the 3rd IASTED Conference on Artificial Intelligence [129] S. Hido, Y. Tsuboi, H. Kashima, M. Sugiyama, T. Kanamori, Statistical
and Applications, 2003, pp. 457–462. outlier detection using direct density ratio estimation, Knowl. Inf.
[105] I. Mayrose, N. Friedman, T. Pupko, A gamma mixture model better Syst. 26 (2) (2011) 309–336.
accounts for among site rate heterogeneity, Bioinformatics 21 (2) [130] M. Sugiyama, T. Suzuki, T. Kanamori, Density ratio estimation: a
(2005) 151–158. comprehensive review, RIMS Kokyuroku (2010) 10–31.
[106] A. Carvalho, M. Tanner, Modelling nonlinear count time series with [131] S. Hoare, D. Asbridge, P. Beatty, On-line novelty detection for
local mixtures of poisson autoregressions, Comput. Stat. Data Anal. artefact identification in automatic anaesthesia record keeping,
51 (11) (2007) 5266–5294. Med. Eng. Phys. 24 (10) (2002) 673–681.
[107] M. Svensén, C. Bishop, Robust Bayesian mixture modelling, Neuro- [132] S. Roberts, L. Tarassenko, A probabilistic resource allocating net-
computing 64 (2005) 235–252. work for novelty detection, Neural Comput. 6 (2) (1994) 270–284.
[108] A. Stranjak, P. Dutta, M. Ebden, A. Rogers, P. Vytelingum, A multi- [133] P. Galeano, D. Peña, R. Tsay, Outlier detection in multivariate time
agent simulation system for prediction and scheduling of aero series by projection pursuit, J. Am. Stat. Assoc. 101 (474) (2006)
engine overhaul, in: Proceedings of the 7th International Joint 654–669.
Conference on Autonomous Agents and Multiagent Systems: [134] D. Chen, X. Shao, B. Hu, Q. Su, Simultaneous wavelength selection
Industrial Track, International Foundation for Autonomous Agents and outlier detection in multivariate regression of near-infrared
and Multiagent Systems, 2008, pp. 81–88. spectra, Anal. Sci. 21 (2) (2005) 161–166.
[109] L. Parra, G. Deco, S. Miesbach, Statistical independence and novelty [135] K. Kadota, D. Tominaga, Y. Akiyama, K. Takahashi, Detecting out-
detection with information preserving nonlinear maps, Neural lying samples in microarray data: a critical assessment of the effect
Comput. 8 (2) (1996) 260–269. of outliers on sample classification, Chem-Bio Informat. 3 (1)
[110] A. Nairac, T. Corbett-Clark, R. Ripley, N. Townsend, L. Tarassenko, (2003) 30–45.
Choosing an appropriate model for novelty detection, in: Proceed- [136] P. Smyth, Markov monitoring with unknown states, IEEE J. Sel.
ings of the 5th International Conference on Artificial Neural Net- Areas Commun. 12 (9) (1994) 1600–1612.
works, IET, 1997, pp. 117–122. [137] Z. Ghahramani, G. Hinton, Variational learning for switching state-
[111] R. Hyndman, Computing and graphing highest density regions, space models, Neural Comput. 12 (4) (2000) 831–864.
Am. Stat. 50 (2) (1996) 120–126. [138] M. Atallah, W. Szpankowski, R. Gwadera, Detection of significant
[112] J. Pickands, Statistical inference using extreme order statistics, Ann. sets of episodes in event sequences, in: Proceedings of the 4th IEEE
Stat. 3 (1) (1975) 119–131. International Conference on Data Mining, ICDM’04, IEEE, 2004,
[113] P. Embrechts, C. Klüppelberg, T. Mikosch, Modelling Extremal pp. 3–10.
Events for Insurance and Finance, vol. 33, Springer Verlag, 1997. [139] A. Sebyala, T. Olukemi, L. Sacks, Active platform security through
[114] R. Fisher, L. Tippett, Limiting forms of the frequency distribution of intrusion detection using naive Bayesian network for anomaly
the largest or smallest member of a sample, in: Proceedings of the detection, in: London Communications Symposium, Citeseer, 2002.
Cambridge Philosophical Society, vol. 24, Cambridge University [140] C. Kruegel, G. Vigna, Anomaly detection of web-based attacks, in:
Press, 1928, pp. 180–190. Proceedings of the 10th ACM Conference on Computer and Com-
[115] D. Clifton, L. Tarassenko, N. McGrogan, D. King, S. King, P. Anuzis, munications Security, ACM, 2003, pp. 251–261.
Bayesian extreme value statistics for novelty detection in gas- [141] C. Kruegel, D. Mutz, W. Robertson, F. Valeur, Bayesian event
turbine engines, in: Proceedings of the IEEE Aerospace Conference, classification for intrusion detection, in: Proceedings of the 19th
IEEE, 2008, pp. 1–11. Annual Computer Security Applications Conference, IEEE, 2003,
[116] K. Worden, G. Manson, D. Allman, Experimental validation of a pp. 14–23.
structural health monitoring methodology: part i. Novelty detection [142] M. Mahoney, P. Chan, Learning nonstationary models of normal
on a laboratory structure, J. Sound Vib. 259 (2) (2003) 323–343. network traffic for detecting novel attacks, in: Proceedings of the
[117] K. Yamanishi, J. Takeuchi, G. Williams, P. Milne, On-line unsupervised 8th ACM International Conference on Knowledge Discovery and
outlier detection using finite mixtures with discounting learning Data Mining (SIGKDD), ACM, 2002, pp. 376–385.
algorithms, Data Min. Knowl. Discov. 8 (3) (2004) 275–300. [143] E. Parzen, On estimation of a probability density function and
[118] K. Yamanishi, J. Takeuchi, G. Williams, P. Milne, On-line unsuper- mode, Ann. Math. Stat. 33 (3) (1962) 1065–1076.
vised outlier detection using finite mixtures with discounting [144] A. Frank, A. Asuncion, UCI machine learning repository, 2010.
learning algorithms, in: Proceedings of the 6th ACM International [145] M. Hravnak, L. Edwards, A. Clontz, C. Valenta, M. DeVita, M. Pinsky,
Conference on Knowledge Discovery and Data Mining (SIGKDD), Defining the incidence of cardiorespiratory instability in patients in
ACM, 2000, pp. 320–324. step-down units using an electronic integrated monitoring system,
[119] D. Agarwal, An empirical Bayes approach to detect anomalies in Arch. Internal Med. 168 (12) (2008) 1300–1308.
dynamic multidimensional arrays, in: Proceedings of the 5th IEEE [146] P. Angelov, D. Filev, An approach to online identification of Takagi-
International Conference on Data Mining, IEEE, 2005, pp. 26–33. Sugeno fuzzy models, IEEE Trans. Syst. Man Cybern. Part B: Cybern.
[120] D. Agarwal, Detecting anomalies in cross-classified streams: a 34 (1) (2004) 484–498.
Bayesian approach, Knowl. Inf. Syst. 11 (1) (2007) 29–44. [147] R. Adams, I. Murray, D. MacKay, The Gaussian process density
[121] F. Zorriassatine, J. Tannock, C. O0 Brien, Using novelty detection to sampler, in: Advances in Neural Information Processing Systems
identify abnormalities caused by mean shifts in bivariate processes, (NIPS) 21, 2009, pp. 9–16.
Comput. Ind. Eng. 44 (3) (2003) 385–408. [148] G. Lorden, Procedures for reacting to a change in distribution, Ann.
[122] P. Højen-Sørensen, O. Winther, L. Hansen, Mean-field approaches Math. Stat. 42 (6) (1971) 1897–1908.
to independent component analysis, Neural Comput. 14 (4) (2002) [149] M. Basseville, I.V. Nikiforov, Detection of Abrupt Changes: Theory
889–918. and Application, vol. 104, Prentice Hall, Englewood Cliffs, 1993.
[123] J. Verbeek, N. Vlassis, B. Kröse, Efficient greedy learning of Gaussian [150] J. Reeves, J. Chen, X.L. Wang, R. Lund, Q.Q. Lu, A review and
mixture models, Neural Comput. 15 (2) (2003) 469–485. comparison of changepoint detection techniques for climate data,
[124] J. Zhang, Z. Ghahramani, Y. Yang, A probabilistic model for online J. Appl. Meteorol. Climatol. 46 (6) (2007) 900–915.
document clustering with application to novelty detection, in: [151] T. Peng, C. Leckie, K. Ramamohanarao, Information sharing for
NIPS, 2005. distributed intrusion detection systems, J. Netw. Comput. Appl. 30
[125] P. Perner, Concepts for novelty detection and handling based on a (3) (2007) 877–899.
case-based reasoning process scheme, Eng. Appl. Artif. Intell. 22 (1) [152] T. Van Phuong, L. Hung, S. Cho, Y. Lee, S. Lee, An anomaly detection
(2009) 86–91. algorithm for detecting attacks in wireless sensor networks, Intell.
[126] K. Hempstalk, E. Frank, I. Witten, One-class classification by combin- Secur. Informat. 3975 (2006) 735–736.
ing density and class probability estimation, in: W. Daelemans, [153] A.G. Tartakovsky, G.V. Moustakides, State-of-the-art in Bayesian
B. Goethals, K. Morik (Eds.), Machine Learning and Knowledge changepoint detection, Seq. Anal. 29 (2) (2010) 125–145.
Discovery in Databases, Lecture Notes in Computer Science, vol. [154] J. Chen, A.K. Gupta, Parametric Statistical Change Point Analysis:
5211, Springer, Berlin/Heidelberg, 2008, pp. 505–519. with Applications to Genetics, Birkhäuser Boston, Medicine, and
[127] D. Chen, M. Meng, Health status detection for patients in physio- Finance, Boston, 2012.
logical monitoring, in: Proceedings of the Annual International [155] S. Forrest, A. Perelson, L. Allen, R. Cherukuri, Self-nonself discrimi-
Conference of the IEEE Engineering in Medicine and Biology nation in a computer, in: Proceedings of the IEEE Computer Society
Society, 2011, pp. 4921–4924. Symposium on Research in Security and Privacy, IEEE, 1994,
[128] T. Kanamori, S. Hido, M. Sugiyama, A least-squares approach to pp. 202–212.
direct importance estimation, J. Mach. Learn. Res. 10 (2009) [156] F. Angiulli, C. Pizzuti, Fast outlier detection in high dimensional
1391–1445. spaces, in: Proceedings of the 6th European Conference on
246 M.A.F. Pimentel et al. / Signal Processing 99 (2014) 215–249

Principles of Data Mining and Knowledge Discovery, PKDD 0 02, techniques, in: Proceedings of the IEEE Aerospace Conference,
Springer-Verlag, London, UK, 2002, pp. 15–26. IEEE, 2006, pp. 1–17.
[157] S. Bay, M. Schwabacher, Mining distance-based outliers in near [182] H. Sun, Y. Bao, F. Zhao, G. Yu, D. Wang, CD-trees: an efficient index
linear time with randomization and a simple pruning rule, in: structure for outlier detection, Adv. Web-Age Inf. Manage. 3129
Proceedings of the 9th ACM International Conference on Knowl- (2004) 600–609.
edge Discovery and Data Mining (SIGKDD), ACM, 2003, pp. 29–38. [183] Z. Syed, M. Saeed, I. Rubinfeld, Identifying high-risk patients
[158] S. Boriah, V. Chandola, V. Kumar, Similarity measures for catego- without labeled training data: anomaly detection methodologies
rical data: a comparative evaluation, in: Proceedings of the 8th to predict adverse outcomes, in: AMIA Annual Symposium Pro-
SIAM International Conference on Data Mining, 2008, pp. 243–254. ceedings, vol. 2010, American Medical Informatics Association,
[159] M. Breunig, H. Kriegel, R. Ng, J. Sander, LOF: identifying density- 2010, pp. 772–776.
based local outliers, in: Proceedings of the ACM SIGMOD Interna- [184] C.-H. Wang, Outlier identification and market segmentation using
tional Conference on Management of Data, vol. 29, ACM, 2000, kernel-based clustering techniques, Exp. Syst. Appl. 36 (2) (2009)
pp. 93–104. 3744–3750.
[160] V. Chandola, S. Boriah, V. Kumar, Understanding Categorical Simi- [185] J. Yang, W. Wang, CLUSEQ: efficient and effective sequence cluster-
larity Measures for Outlier Detection, Technical Report 08-008, ing, in: Proceedings of the 19th International Conference on Data
University of Minnesota, 2008. Engineering, IEEE, 2003, pp. 101–112.
[161] S. Chawla, P. Sun, SLOM: a new measure for local spatial outliers, [186] S.-P. Yong, J.D. Deng, M.K. Purvis, Novelty detection in wildlife
Knowl. Inf. Syst. 9 (4) (2006) 412–429. scenes through semantic context modelling, Pattern Recognit.
[162] A. Ghoting, M. Otey, S. Parthasarathy, Loaded: link-based outlier 45 (9) (2012) 3439–3450.
and anomaly detection in evolving data sets, in: Proceedings of the [187] S. Yong, J. Deng, M. Purvis, Wildlife video key-frame extraction
4th IEEE International Conference on Data Mining, ICDM’04, IEEE, based on novelty detection in semantic context, Multimed. Tools
2004, pp. 387–390. Appl. 62 (2) (2013) 359–376.
[163] A. Ghoting, S. Parthasarathy, M. Otey, Fast mining of distance-based [188] D. Yu, G. Sheikholeslami, A. Zhang, Findout: finding outliers in very
outliers in high-dimensional datasets, Data Min. Knowl. Discov. large datasets, Knowl. Inf. Syst. 4 (4) (2002) 387–412.
16 (3) (2008) 349–364. [189] K. Zhang, S. Shi, H. Gao, J. Li, Unsupervised outlier detection in
[164] V. Hautamaki, I. Karkkainen, P. Franti, Outlier detection using sensor networks using aggregation tree, Adv. Data Min. Appl. 4632
k-nearest neighbour graph, in: Proceedings of the 17th Interna- (2007) 158–169.
tional Conference on Pattern Recognition, ICPR 2004, vol. 3, IEEE, [190] E. Knorr, R. Ng, Algorithms for mining distance-based outliers in
2004, pp. 430–433. large datasets, in: Proceedings of the International Conference on
[165] F. Jiang, Y. Sui, C. Cao, Outlier detection using rough set theory, in: Very Large Data Bases, Citeseer, 1998, pp. 392–403.
D. Slezak, J. Yao, J. Peters, W. Ziarko, X. Hu (Eds.), Rough Sets, Fuzzy [191] Y. Tao, X. Xiao, S. Zhou, Mining distance-based outliers from large
Sets, Data Mining, and Granular Computing, Lecture Notes in Com- databases in any metric space, in: Proceedings of the 12th ACM
puter Science, vol. 3642, Springer, Berlin Heidelberg, 2005, pp. 79–87. International Conference on Knowledge Discovery and Data Mining
[166] Y. Kou, C. Lu, D. Chen, Spatial weighted outlier detection, in: (SIGKDD), ACM, 2006, pp. 394–403.
Proceedings of the SIAM Conference on Data Mining, 2006. [192] L. Wei, W. Qian, A. Zhou, W. Jin, J. Yu, Hot: hypergraph-based
[167] M. Otey, A. Ghoting, S. Parthasarathy, Fast distributed outlier outlier test for categorical data, Adv. Knowl. Discov. Data Min. 2637
detection in mixed-attribute data sets, Data Min. Knowl. Discov. (2003) 562.
12 (2) (2006) 203–228. [193] R. Agrawal, R. Srikant, Fast algorithms for mining association rules,
[168] G. Palshikar, Distance-based outliers in sequences, Distrib. Comput. in: Proceedings of the 20th International Conference on Very Large
Internet Technol. 3816 (2005) 547–552. Data Bases, VLDB, vol. 1215, 1994, pp. 487–499.
[169] D. Pokrajac, A. Lazarevic, L. Latecki, Incremental local outlier [194] S. Papadimitriou, H. Kitagawa, P. Gibbons, C. Faloutsos, LOCI: fast
detection for data streams, in: Proceedings of the IEEE Symposium outlier detection using the local correlation integral, in: Proceed-
on Computational Intelligence and Data Mining (CIDM), IEEE, 2007, ings of the 19th International Conference on Data Engineering,
pp. 504–515. IEEE, 2003, pp. 315–326.
[170] M. Wu, C. Jermaine, Outlier detection by sampling with accuracy [195] A. Chiu, A. Fu, Enhancements on local outlier detection, in:
guarantees, in: Proceedings of the 12th ACM International Con- Proceedings of the 7th International Database Engineering and
ference on Knowledge Discovery and Data Mining (SIGKDD), ACM, Applications Symposium, IEEE, 2003, pp. 298–307.
2006, pp. 767–772. [196] J. Tang, Z. Chen, A. Fu, D. Cheung, Enhancing effectiveness of outlier
[171] J. Zhang, H. Wang, Detecting outlying subspaces for high- detections for low density patterns, Adv. Knowl. Discov. Data Min.
dimensional data: the new task, and performance, Knowl. Inf. Syst. 2336 (2002) 535–548.
10 (3) (2006) 333–355. [197] D. Ren, B. Wang, W. Perrizo, RDF: a density-based outlier detection
[172] D. Barbará, Y. Li, J. Couto, COOLCAT: an entropy-based algorithm for method using vertical data representation, in: Proceedings of the
categorical clustering, in: Proceedings of the 11th International 4th IEEE International Conference on Data Mining, ICDM’04, IEEE,
Conference on Information and Knowledge Management, ACM, 2004, pp. 503–506.
2002, pp. 582–589. [198] J. Yu, W. Qian, H. Lu, A. Zhou, Finding centric local outliers in
[173] D. Barbará, Y. Li, J. Couto, J. Lin, S. Jajodia, Bootstrapping a data categorical/numerical spaces, Knowl. Inf. Syst. 9 (3) (2006) 309–338.
mining intrusion detection system, in: Proceedings of the ACM [199] J. Tang, Z. Chen, A. Fu, D. Cheung, Capabilities of outlier detection
Symposium on Applied Computing, ACM, 2003, pp. 421–425. in large datasets, framework and methodologies, Knowl. Inf. Syst.
[174] S. Budalakoti, A. Srivastava, R. Akella, E. Turkov, Anomaly Detection 11 (1) (2007) 45–84.
in Large Sets of High-Dimensional Symbol Sequences, Technical [200] P. Sun, S. Chawla, On local spatial outliers, in: Proceedings of the
Report NASA TM-2006-214553, NASA Ames Research Center, 2006. 4th IEEE International Conference on Data Mining, IEEE, 2004,
[175] D. Clifton, P. Bannister, L. Tarassenko, Learning shape for jet engine pp. 209–216.
novelty detection, Adv. Neural Netw. (ISNN) 3973 (2006) 828–835. [201] P. Sun, S. Chawla, B. Arunasalam, Mining for outliers in sequential
[176] D. Clifton, P. Bannister, L. Tarassenko, A framework for novelty databases, in: Proceedings of the 6th SIAM International Confer-
detection in jet engine vibration data, Key Eng. Mater. 347 (2007) ence on Data Mining, vol. 124, Society for Industrial Mathematics,
305–310. 2006.
[177] M. Filippone, F. Masulli, S. Rovetta, Applying the possibilistic [202] P. Chan, M. Mahoney, M. Arshad, A Machine Learning Approach to
c-means algorithm in kernel-induced spaces, IEEE Trans. Fuzzy Anomaly Detection, Technical Report, Department of Computer
Syst. 18 (3) (2010) 572–584. Science, Florida Institute Technology Melbourne, 2003.
[178] Z. He, X. Xu, S. Deng, Discovering cluster-based local outliers, [203] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function
Pattern Recognit. Lett. 24 (9) (2003) 1641–1650. Algorithms, Kluwer Academic Publishers, Norwell, MA, USA, 1981.
[179] D. Kim, P. Kang, S. Cho, H. Lee, S. Doh, Machine learning-based [204] R. Krishnapuram, J. Keller, A possibilistic approach to clustering,
novelty detection for faulty wafer detection in semiconductor IEEE Trans. Fuzzy Syst. 1 (2) (1993) 98–110.
manufacturing, Expert Syst. Appl. 39 (4) (2011) 4075–4083. [205] A. Pires, C. Santos-Pereira, Using clustering and robust estimators
[180] A. Srivastava, B. Zane-Ulman, Discovering recurring anomalies in to detect outliers in multivariate data, in: Proceedings of the
text reports regarding complex space systems, in: Proceedings of International Conference on Robust Statistics, 2005.
the IEEE Aerospace Conference, IEEE, 2005, pp. 3853–3862. [206] A. Vinueza, G. Grudic, Unsupervised Outlier Detection and Semi-
[181] A. Srivastava, Enabling the discovery of recurring anomalies in Supervised Learning, Technical Report CU-CS-976-04, University of
aerospace problem reports using high-dimensional clustering Colorado at Boulder, 2004.
M.A.F. Pimentel et al. / Signal Processing 99 (2014) 215–249 247

[207] N. Wu, J. Zhang, Factor analysis based anomaly detection, in: [231] D. Kit, B. Sullivan, D. Ballard, Novelty detection using growing
Proceedings of the Information Assurance Workshop, IEEE Sys- neural gas for visuo-spatial memory, in: Proceedings of the IEEE/
tems, Man and Cybernetics Society, IEEE, 2003, pp. 108–115. RSJ International Conference on Intelligent Robots and Systems
[208] L. Ertöz, M. Steinbach, V. Kumar, Finding topics in collections of (IROS), IEEE, 2011, pp. 1194–1200.
documents: a shared nearest neighbor approach, Clust. Inf. Retr. 11 [232] S. Marsland, J. Shapiro, U. Nehmzow, A self-organising network that
(2003) 83–103. grows when required, Neural Netw. 15 (8–9) (2002) 1041–1058.
[209] B. Schölkopf, R. Williamson, A. Smola, J. Shawe-Taylor, J. Platt, [233] S. Marsland, U. Nehmzow, J. Shapiro, On-line novelty detection for
Support vector method for novelty detection, Adv. Neural Inf. autonomous mobile robots, Robot. Auton. Syst. 51 (2) (2005)
Process. Syst. 12 (3) (2000) 582–588. 191–206.
[210] J. Zhou, Y. Fu, C. Sun, Y. Fang, Unsupervised distributed novelty [234] M. Ramadas, S. Ostermann, B. Tjaden, Detecting anomalous net-
detection on scientific simulation data, J. Comput. Inf. Syst. 7 (5) work traffic with self-organizing maps, in: Recent Advances in
(2011) 1533–1540. Intrusion Detection, Springer, 2003, pp. 36–54.
[211] E. Spinosa, A. deLeon, F. de Carvalho, J. Gama, Novelty detection [235] F. Wu, T. Wang, J. Lee, An online adaptive condition-based main-
with application to data streams, Intell. Data Anal. 13 (3) (2009) tenance method for mechanical systems, Mech. Syst. Signal Pro-
405–422. cess. 24 (8) (2010) 2985–2995.
[212] A.F. Hassan, H.M.O. Mokhtar, O. Hegazy, A heuristic approach for [236] Y. Chen, B. Malin, Detection of anomalous insiders in collaborative
sensor network outlier detection, Int. J. Res. Rev. Wirel. Sensor environments via relational analysis of access logs, in: Proceedings
Netw. (IJRRWSN) 1 (4) (2012) 66–72. of the 1st ACM Conference on Data and Application Security and
[213] T. Idé, S. Papadimitriou, M. Vlachos, Computing correlation anom- Privacy, ACM, 2011, pp. 63–74.
aly scores using stochastic nearest neighbors, in: Proceedings of the [237] Y. Chen, S. Nyemba, B. Malin, Detecting anomalous insiders in
7th IEEE International Conference on Data Mining (ICDM), IEEE, collaborative information systems, IEEE Trans. Dependable Secur.
2007, pp. 523–528. Comput. 9 (3) (2012) 332–344.
[214] K. Onuma, H. Tong, C. Faloutsos, Tangent: a novel,‘surprise me’, [238] S. Günter, N. Schraudolph, S. Vishwanathan, Fast iterative kernel
recommendation algorithm, in: Proceedings of the 15th ACM principal component analysis, J. Mach. Learn. Res. 8 (2007)
International Conference on Knowledge Discovery and Data Mining 1893–1918.
(SIGKDD), ACM, 2009, pp. 657–666. [239] H. Hoffmann, Kernel PCA for novelty detection, Pattern Recognit.
[215] M. Augusteijn, B. Folkert, Neural network classification and novelty 40 (3) (2007) 863–874.
detection, Int. J. Remote Sens. 23 (14) (2002) 2891–2902. [240] A. Lakhina, M. Crovella, C. Diot, Mining anomalies using traffic
[216] S. Singh, M. Markou, An approach to novelty detection applied to feature distributions, ACM SIGCOMM Comput. Commun. Rev. 35
the classification of image regions, IEEE Trans. Knowl. Data Eng. 16 (4) (2005) 217–228.
(4) (2004) 396–407. [241] R. Kassab, F. Alexandre, Incremental data-driven learning of a
[217] P. Crook, S. Marsland, G. Hayes, U. Nehmzow, A tale of two filters- novelty detection model for one-class classification with applica-
on-line novelty detection, in: Proceedings of the IEEE International tion to high-dimensional noisy data, Mach. Learn. 74 (2) (2009)
Conference on Robotics and Automation, ICRA’02, vol. 4, IEEE, 2002, 191–234.
pp. 3894–3899.
[218] I. Diaz, J. Hollmen, Residual generation and visualization for understanding novel process conditions, in: Proceedings of the International Joint Conference on Neural Networks, IJCNN'02, vol. 3, IEEE, 2002, pp. 2070–2075.
[219] S. Hawkins, H. He, G. Williams, R. Baxter, Outlier detection using replicator neural networks, Data Wareh. Knowl. Discov. 2454 (2002) 113–123.
[220] N. Japkowicz, Supervised versus unsupervised binary-learning by feedforward neural networks, Mach. Learn. 42 (1) (2001) 97–122.
[221] L. Manevitz, M. Yousef, One-class document classification via neural networks, Neurocomputing 70 (7) (2007) 1466–1481.
[222] B. Thompson, R. Marks, J. Choi, M. El-Sharkawi, M. Huang, C. Bunje, Implicit learning in autoencoder novelty assessment, in: Proceedings of the International Joint Conference on Neural Networks, IJCNN'02, vol. 3, IEEE, 2002, pp. 2878–2883.
[223] G. Williams, R. Baxter, H. He, S. Hawkins, L. Gu, A comparative study of RNN for outlier detection in data mining, in: Proceedings of the IEEE International Conference on Data Mining, IEEE, 2002, pp. 709–712.
[224] S. Jakubek, T. Strasser, Fault-diagnosis using neural networks with ellipsoidal basis functions, in: Proceedings of the American Control Conference, vol. 5, IEEE, 2002, pp. 3846–3851.
[225] Y. Li, M. Pont, N. Barrie Jones, Improving the performance of radial basis function classifiers in condition monitoring and fault diagnosis applications where unknown faults may occur, Pattern Recognit. Lett. 23 (5) (2002) 569–577.
[226] M.K. Albertini, R.F. de Mello, A self-organizing neural network for detecting novelties, in: Proceedings of the 2007 ACM Symposium on Applied Computing, SAC'07, ACM, New York, NY, USA, 2007, pp. 462–466.
[227] G. Barreto, L. Aguayo, Time series clustering for anomaly detection using competitive neural networks, in: J. Príncipe, R. Miikkulainen (Eds.), Advances in Self-Organizing Maps, Lecture Notes in Computer Science, vol. 5629, Springer, Berlin Heidelberg, 2009, pp. 28–36.
[228] D. Deng, N. Kasabov, On-line pattern analysis by evolving self-organizing maps, Neurocomputing 51 (2003) 87–103.
[229] J. García-Rodríguez, A. Angelopoulou, J. García-Chamizo, A. Psarrou, S. Orts Escolano, V. Morell Giménez, Autonomous growing neural gas for applications with time constraint: optimal parameter estimation, Neural Netw. 32 (2012) 196–208.
[230] D. Hristozov, T. Oprea, J. Gasteiger, Ligand-based virtual screening by novelty detection with self-organizing maps, J. Chem. Inf. Model. 47 (6) (2007) 2044–2062.
[242] J. McBain, M. Timusk, Feature extraction for novelty detection as applied to fault detection in machinery, Pattern Recognit. Lett. 32 (7) (2011) 1054–1061.
[243] A. Perera, N. Papamichail, N. Bârsan, U. Weimar, S. Marco, On-line novelty detection by recursive dynamic principal component analysis and gas sensor arrays under drift conditions, IEEE Sens. J. 6 (3) (2006) 770–783.
[244] T. Ide, H. Kashima, Eigenspace-based anomaly detection in computer systems, in: Proceedings of the 11th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), ACM, 2004, pp. 440–449.
[245] M. Shyu, S. Chen, K. Sarinnapakorn, L. Chang, A Novel Anomaly Detection Scheme Based on Principal Component Classifier, Technical Report, DTIC Document, 2003.
[246] M. Thottan, C. Ji, Anomaly detection in IP networks, IEEE Trans. Signal Process. 51 (8) (2003) 2191–2204.
[247] J. Toivola, M. Prada, J. Hollmén, Novelty detection in projected spaces for structural health monitoring, Adv. Intell. Data Anal. IX 6065 (2010) 208–219.
[248] Y. Xiao, H. Wang, W. Xu, J. Zhou, L1 norm based KPCA for novelty detection, Pattern Recognit. 46 (1) (2013) 389–396.
[249] S. Haggett, D. Chu, I. Marshall, Evolving a dynamic predictive coding mechanism for novelty detection, Knowl. Based Syst. 21 (3) (2008) 217–224.
[250] T. Hosoya, S. Baccus, M. Meister, Dynamic predictive coding by the retina, Nature 436 (7047) (2005) 71–77.
[251] T. Kohonen, The self-organizing map, Proc. IEEE 78 (9) (1990) 1464–1480.
[252] K. Labib, R. Vemuri, NSOM: a real-time network-based intrusion detection system using self-organizing maps, Netw. Secur. (2002) 1–6.
[253] D. Alahakoon, S. Halgamuge, B. Srinivasan, Dynamic self-organizing maps with controlled growth for knowledge discovery, IEEE Trans. Neural Netw. 11 (3) (2000) 601–614.
[254] J. Blackmore, R. Miikkulainen, Incremental grid growing: encoding high-dimensional structure into a two-dimensional feature map, in: Proceedings of the IEEE International Conference on Neural Networks, vol. 1, 1993, pp. 450–455.
[255] B. Fritzke, Growing cell structures – a self-organizing network for unsupervised and supervised learning, Neural Netw. 7 (9) (1994) 1441–1460.
[256] B. Fritzke, A growing neural gas network learns topologies, Adv. Neural Inf. Process. Syst. 7 (1995) 625–632.
[257] I. Jolliffe, Principal Component Analysis, vol. 2, Wiley Online Library, 2002.
248 M.A.F. Pimentel et al. / Signal Processing 99 (2014) 215–249
[258] R. Fujimaki, T. Yairi, K. Machida, An approach to spacecraft anomaly detection problem using kernel feature space, in: Proceedings of the 11th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), ACM, 2005, pp. 401–410.
[259] B. Schölkopf, A. Smola, K. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput. 10 (5) (1998) 1299–1319.
[260] N. Kwak, Principal component analysis based on l1-norm maximization, IEEE Trans. Pattern Anal. Mach. Intell. 30 (9) (2008) 1672–1680.
[261] C. Noble, D. Cook, Graph-based anomaly detection, in: Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), ACM, 2003, pp. 631–636.
[262] J. Sun, H. Qu, D. Chakrabarti, C. Faloutsos, Neighborhood formation and anomaly detection in bipartite graphs, in: Proceedings of the 5th IEEE International Conference on Data Mining, IEEE, 2005, pp. 418–425.
[263] J. Sun, Y. Xie, H. Zhang, C. Faloutsos, Less is more: compact matrix decomposition for large sparse graphs, in: Proceedings of the 7th SIAM International Conference on Data Mining, 2007.
[264] V. Chatzigiannakis, S. Papavassiliou, M. Grammatikou, B. Maglaris, Hierarchical anomaly detection in distributed large-scale sensor networks, in: Proceedings of the 11th IEEE Symposium on Computers and Communications, ISCC'06, IEEE, 2006, pp. 761–767.
[265] V. Lämsä, T. Raiko, Novelty detection by nonlinear factor analysis for structural health monitoring, in: Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP), IEEE, 2010, pp. 468–473.
[266] M. Timusk, M. Lipsett, C. Mechefske, Fault detection using transient machine signals, Mech. Syst. Signal Process. 22 (7) (2008) 1724–1749.
[267] D. Tax, R. Duin, Support vector domain description, Pattern Recognit. Lett. 20 (11) (1999) 1191–1199.
[268] Q. Song, W. Hu, W. Xie, Robust support vector machine with bullet hole image classification, IEEE Trans. Syst. Man Cybern. Part C: Appl. Rev. 32 (4) (2002) 440–448.
[269] W. Hu, Y. Liao, V. Vemuri, Robust anomaly detection using support vector machines, in: Proceedings of the International Conference on Machine Learning, 2003, pp. 282–289.
[270] G. Li, C. Wen, Z. Li, A new online learning with kernels method in novelty detection, in: Proceedings of the 37th Annual Conference of the IEEE Industrial Electronics Society (IECON), IEEE, 2011, pp. 2311–2316.
[271] L. Manevitz, M. Yousef, One-class SVMs for document classification, J. Mach. Learn. Res. 2 (2002) 139–154.
[272] C. Campbell, K. Bennett, A linear programming approach to novelty detection, in: Proceedings of the Conference on Advances in Neural Information Processing Systems, vol. 13, The MIT Press, 2001, pp. 395–401.
[273] T. Le, D. Tran, W. Ma, D. Sharma, An optimal sphere and two large margins approach for novelty detection, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN), IEEE, 2010, pp. 1–6.
[274] T. Le, D. Tran, W. Ma, D. Sharma, Multiple distribution data description learning algorithm for novelty detection, Adv. Knowl. Discov. Data Min. 6635 (2011) 246–257.
[275] Y.-H. Liu, Y.-C. Liu, Y.-J. Chen, Fast support vector data descriptions for novelty detection, IEEE Trans. Neural Netw. 21 (8) (2010) 1296–1313.
[276] Y.-H. Liu, Y.-C. Liu, Y.-Z. Chen, High-speed inline defect detection for TFT-LCD array process using a novel support vector data description, Exp. Syst. Appl. 38 (5) (2011) 6222–6231.
[277] X. Peng, D. Xu, Efficient support vector data descriptions for novelty detection, Neural Comput. Appl. 21 (8) (2012) 2023–2032.
[278] M. Wu, J. Ye, A small sphere and large margin approach for novelty detection using training data with outliers, IEEE Trans. Pattern Anal. Mach. Intell. 31 (11) (2009) 2088–2092.
[279] Y. Xiao, B. Liu, L. Cao, X. Wu, C. Zhang, Z. Hao, F. Yang, J. Cao, Multi-sphere support vector data description for outliers detection on multi-distribution data, in: Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, 2009, pp. 82–87.
[280] L. Clifton, H. Yin, Y. Zhang, Support vector machine in novelty detection for multi-channel combustion data, Adv. Neural Netw. (ISNN) 3973 (2006) 836–843.
[281] L. Clifton, H. Yin, D. Clifton, Y. Zhang, Combined support vector novelty detection for multi-channel combustion data, in: Proceedings of the IEEE International Conference on Networking, Sensing and Control, IEEE, 2007, pp. 495–500.
[282] P.F. Evangelista, M.J. Embrechts, B.K. Szymanski, Taming the curse of dimensionality in kernels and novelty detection, in: Applied Soft Computing Technologies: The Challenge of Complexity, Springer Verlag, 2006, pp. 431–444.
[283] A. Gardner, A. Krieger, G. Vachtsevanos, B. Litt, One-class novelty detection for seizure analysis from intracranial EEG, J. Mach. Learn. Res. 7 (2006) 1025–1044.
[284] D.R. Hardoon, L.M. Manevitz, fMRI analysis via one-class machine learning techniques, in: Proceedings of the 19th International Joint Conference on Artificial Intelligence, IJCAI'05, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005, pp. 1604–1605.
[285] P. Hayton, S. Utete, D. King, S. King, P. Anuzis, L. Tarassenko, Static and dynamic novelty detection methods for jet engine health monitoring, Philos. Trans. R. Soc. A: Math. Phys. Eng. Sci. 365 (1851) (2007) 493–514.
[286] K. Heller, K. Svore, A. Keromytis, S. Stolfo, One class support vector machines for detecting anomalous Windows registry accesses, in: Proceedings of the Workshop on Data Mining for Computer Security, 2003.
[287] A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, J. Srivastava, A comparative study of anomaly detection schemes in network intrusion detection, in: Proceedings of the 3rd SIAM International Conference on Data Mining, vol. 3, SIAM, 2003, pp. 25–36.
[288] H. Lee, S. Cho, Application of LVQ to novelty detection using outlier training data, Pattern Recognit. Lett. 27 (13) (2006) 1572–1579.
[289] J. Ma, S. Perkins, Online novelty detection on temporal sequences, in: Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), ACM, 2003, pp. 613–618.
[290] J. Ma, S. Perkins, Time-series novelty detection using one-class support vector machines, in: Proceedings of the International Joint Conference on Neural Networks, vol. 3, IEEE, 2003, pp. 1741–1745.
[291] A. Rabaoui, H. Kadri, N. Ellouze, New approaches based on one-class SVMs for impulsive sounds recognition tasks, in: Proceedings of the IEEE Workshop on Machine Learning for Signal Processing, IEEE, 2008, pp. 285–290.
[292] L. Zhuang, H. Dai, Parameter optimization of kernel-based one-class classifier on imbalance learning, J. Comput. 1 (7) (2006) 32–40.
[293] Z. Wu, W. Xie, J. Yu, Fuzzy c-means clustering algorithm based on kernel method, in: Proceedings of the 5th International Conference on Computational Intelligence and Multimedia Applications (ICCIMA), IEEE, 2003, pp. 49–54.
[294] V. Roth, Outlier detection with one-class kernel Fisher discriminants, in: Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS), 2004.
[295] V. Roth, Kernel Fisher discriminants for outlier detection, Neural Comput. 18 (4) (2006) 942–960.
[296] V. Sotiris, P. Tse, M. Pecht, Anomaly detection through a Bayesian support vector machine, IEEE Trans. Reliab. 59 (2) (2010) 277–286.
[297] A. Munoz, J. Moguerza, Estimation of high-density regions using one-class neighbor machines, IEEE Trans. Pattern Anal. Mach. Intell. 28 (3) (2006) 476–480.
[298] Y. Li, A surface representation approach for novelty detection, in: Proceedings of the International Conference on Information and Automation (ICIA), IEEE, 2008, pp. 1464–1468.
[299] Z. He, S. Deng, X. Xu, An optimization model for outlier detection in categorical data, Adv. Intell. Comput. 3644 (2005) 400–409.
[300] Z. He, S. Deng, X. Xu, J. Huang, A fast greedy algorithm for outlier mining, Adv. Knowl. Discov. Data Min. 3918 (2006) 567–576.
[301] S. Ando, Clustering needles in a haystack: an information theoretic analysis of minority and outlier detection, in: Proceedings of the 7th IEEE International Conference on Data Mining, ICDM'07, IEEE, 2007, pp. 13–22.
[302] E. Keogh, S. Lonardi, C. Ratanamahatana, Towards parameter-free data mining, in: Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), ACM, 2004, pp. 206–215.
[303] E. Keogh, J. Lin, S. Lee, H. Herle, Finding the most unusual time series subsequence: algorithms and applications, Knowl. Inf. Syst. 11 (1) (2007) 1–27.
[304] J. Lin, E. Keogh, A. Fu, H. Van Herle, Approximations to magic: finding unusual medical time series, in: Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems, IEEE, 2005, pp. 329–334.
[305] A. Fu, O. Leung, E. Keogh, J. Lin, Finding time series discords based on Haar transform, Adv. Data Min. Appl. 4093 (2006) 31–41.
[306] M. Gamon, Graph-based text representation for novelty detection, in: Proceedings of the First Workshop on Graph Based Methods
for Natural Language Processing, Association for Computational Linguistics, Stroudsburg, PA, USA, 2006, pp. 17–24.
[307] M. Filippone, G. Sanguinetti, Information theoretic novelty detection, Pattern Recognit. 43 (3) (2010) 805–814.
[308] M. Filippone, G. Sanguinetti, A perturbative approach to novelty detection in autoregressive models, IEEE Trans. Signal Process. 59 (3) (2011) 1027–1036.
[309] M. Filippone, G. Sanguinetti, Novelty Detection in Autoregressive Models Using Information Theoretic Measures, Technical Report CS-09-06, Department of Computer Science, University of Sheffield, 2009.
[310] L. Itti, P. Baldi, Bayesian surprise attracts human attention, Vis. Res. 49 (10) (2009) 1295–1306.