Feature Selection via Joint Likelihood
2012
By
Adam C Pocock
School of Computer Science
Contents
Abstract 11
Declaration 13
Copyright 14
Acknowledgements 15
Notation 16
1 Introduction 17
1.1 Prediction and data collection . . . . . . . . . . . . . . . . . . . . 17
1.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3 Contributions of this thesis . . . . . . . . . . . . . . . . . . . . . . 21
1.4 Structure of this thesis . . . . . . . . . . . . . . . . . . . . . . . . 22
1.5 Publications and Software . . . . . . . . . . . . . . . . . . . . . . 23
2 Background 26
2.1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.1.1 Probability, Likelihood and Bayes Theorem . . . . . . . . 31
2.1.2 Classification algorithms . . . . . . . . . . . . . . . . . . . 34
2.2 Cost-Sensitive Classification . . . . . . . . . . . . . . . . . . . . . 36
2.2.1 Bayesian Decision Theory . . . . . . . . . . . . . . . . . . 37
2.2.2 Constructing a cost-sensitive classifier . . . . . . . . . . . . 37
2.3 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3.1 Shannon’s Information Theory . . . . . . . . . . . . . . . . 39
2.3.2 Bayes error and conditional entropy . . . . . . . . . . . . . 42
2.3.3 Estimating the mutual information . . . . . . . . . . . . . 44
2.4 Weighted Information Theory . . . . . . . . . . . . . . . . . . . . 45
2.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.4 An Information Theoretic View of Strong and Weak Relevance . . 115
5.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 119
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Bibliography 163
List of Tables
5.1 Datasets used in experiments. The final column indicates the dif-
ficulty of the data in feature selection, a smaller value indicating a
more challenging problem. . . . . . . . . . . . . . . . . . . . . . . 100
5.2 Datasets from Peng et al. [86], used in small-sample experiments. 106
5.3 Column 1: Non-dominated Rank of different criteria for the
trade-off of accuracy/stability. Criteria with a higher rank (closer
to 1.0) provide a better trade-off than those with a lower rank.
Column 2: As column 1 but using Kuncheva’s Stability Index.
Column 3: Average ranks for accuracy alone. . . . . . . . . . . . 110
5.4 Datasets from the NIPS challenge, used in experiments. . . . . . . 112
5.5 NIPS FS Challenge Results: GISETTE. . . . . . . . . . . . . . . 114
5.6 NIPS FS Challenge Results: MADELON. . . . . . . . . . . . . . 115
6.4 Win/Draw/Loss results on Hailfinder . . . . . . . . . . . . . . . . 133
6.5 Win/Draw/Loss results for Insurance. . . . . . . . . . . . . . . . . 134
List of Figures
3.1 A Bayesian network, with the target node shaded in red, and the
Markov Blanket of that node shaded in cyan. . . . . . . . . . . . . 63
4.1 The graphical model for the likelihood specified in Equation (4.1). 76
5.1 Figure 5.1a shows the scatter plot between X1 and X2 . Figure 5.1b
shows the scatter plot between X1 and X2 when Y = 1. Figure
5.1c shows the scatter plot between X1 and X2 when Y = 2. . . . 90
5.2 The full space of linear filter criteria, describing several examples
from Table 3.1. Note that all criteria in this space adopt Assump-
tion 1. Additionally, the γ and β axes represent the criteria belief
in Assumptions 2 and 3, respectively. The left hand axis is where
the mRMR and MIFS algorithms sit. The bottom left corner,
MIM, is the assumption of completely independent features, us-
ing just marginal mutual information. Note that some criteria are
equivalent at particular sizes of the current feature set |S|. . . . . 95
5.3 Kuncheva’s Stability Index [67] across 15 datasets. The box in-
dicates the upper/lower quartiles, the horizontal line within each
shows the median value, while the dotted crossbars indicate the
maximum/minimum values. For convenience of interpretation, cri-
teria on the x-axis are ordered by their median value. . . . . . . 102
5.4 Yu et al.’s Information Stability Index [111] across 15 datasets. For
comparison, criteria on the x-axis are ordered identically to Figure
5.3. A similar general picture emerges to that using Kuncheva’s
measure, though the information stability index is able to take
feature redundancy into account, showing that some criteria are
slightly more stable than expected. . . . . . . . . . . . . . . . . . 103
5.5 Relations between feature sets generated by different criteria, on
average over 15 datasets. 2D visualisation generated by classical
multi-dimensional scaling. . . . . . . . . . . . . . . . . . . . . . . 104
5.6 Average ranks of criteria in terms of test error, selecting 10 fea-
tures, across 11 datasets. Note the clear dominance of criteria
which do not allow the redundancy term to overwhelm the rele-
vancy term (stars) over those that allow redundancy to grow with
the size of the feature set (circles). . . . . . . . . . . . . . . . . . 105
5.7 LOO results on Peng’s datasets: Colon, Lymphoma, Leukemia,
Lung, NCI9. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.8 Stability (y-axes) versus Accuracy (x-axes) over 50 bootstraps for
the final quarter of the UCI datasets. The pareto-optimal rankings
are summarised in Table 5.3. . . . . . . . . . . . . . . . . . . . . . 109
5.9 Significant dominance partial-order diagram. Criteria are placed
top to bottom in the diagram by their rank taken from column
3 of Table 5.3. A link joining two criteria means a statistically
significant difference is observed with a Nemenyi post-hoc test at
the specified confidence level. For example JMI is significantly su-
perior to MIFS (β = 1) at the 99% confidence level. Note that the
absence of a link does not signify the lack of a statistically signifi-
cant difference, but that the Nemenyi test does not have sufficient
power (in terms of number of datasets) to determine the outcome
[29]. It is interesting to note that the four bottom ranked criteria
correspond to the corners of the unit square in Figure 5.2; while
the top three (JMI/mRMR/DISR) are all very similar, scaling the
redundancy terms by the size of the feature set. The middle ranks
belong to CMIM/ICAP, which are similar in that they use the
min/max strategy instead of a linear combination of terms. . . . 111
5.10 Validation Error curve using GISETTE. . . . . . . . . . . . . . . 113
5.11 Validation Error curve using MADELON. . . . . . . . . . . . . . 113
6.1 Toy problem, 5 feature nodes (X1 . . . X5 ) and their estimated mu-
tual information with the target node Y on a particular data sam-
ple. X1 , X2 , X5 form the Markov Blanket of Y . . . . . . . . . . . . 129
6.2 Average results: (a) Small sample, correct prior; (b) Large sam-
ple, correct prior; (c) Small sample, misspecified prior; (d) Large
sample, misspecified prior. . . . . . . . . . . . . . . . . . . . . . . 131
7.1 The average pixel values across all 4s in our sample of MNIST. . . 148
7.2 MNIST Results, with “4” as the costly digit. . . . . . . . . . . . . 148
7.3 MNIST Results, with “4” as the costly digit, using w(y = 4) =
{1, 5, 10, 15, 20, 25, 50, 100} and (w)JMI. LEFT: Note that as costs
for mis-classifying “4” increase, the weighted FS method increases
F-measure, while the weighted SVM suffers a decrease. RIGHT:
The black dot is the cost-insensitive methodology. Note that the
weighted SVM can increase recall above the 90% mark, but it does
so by sacrificing precision. In contrast, the weighted FS method
pushes the cluster of points up and to the right, increasing both
recall and precision. . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.4 MNIST Results, with both “4” and “9” as the costly digits, us-
ing w(y = (4 ∨ 9)) = {1, 5, 10, 15, 20, 25, 50, 100} and (w)JMI, F-
Measure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.5 MNIST Results, with both “4” and “9” as the costly digits, using
w(y = (4 ∨ 9)) = {1, 5, 10, 15, 20, 25, 50, 100} and (w)JMI, Preci-
sion/Recall plot. . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.6 MNIST Results, using wMIM comparing against SpreadFX, with
“4” as the costly digit. . . . . . . . . . . . . . . . . . . . . . . . . 152
7.7 Ohscal results. Cost of mis-predicting class 9 is set to ten times
more than other classes. The weighted SVM and oversampling
approaches clearly focus on producing high recall, far higher than
our method, however, this is only achievable by sacrificing preci-
sion. Our approach improves precision and recall, giving higher
F-measure overall. . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Abstract
We study the nature of filter methods for feature selection. In particular, we ex-
amine information theoretic approaches to this problem, looking at the literature
over the past 20 years. We consider this literature from a different perspective, by
viewing feature selection as a process which minimises a loss function. We choose
to use the model likelihood as the loss function, and thus we seek to maximise
the likelihood. The first contribution of this thesis is to show that the problem of
information theoretic filter feature selection can be rephrased as maximising the
likelihood of a discriminative model.
From this novel result we can unify the literature revealing that many of
these selection criteria are approximate maximisers of the joint likelihood. Many
of these heuristic criteria were hand-designed to optimise various definitions of
feature “relevancy” and “redundancy”, but with our probabilistic interpretation
we naturally include these concepts, plus the “conditional redundancy”, which is
a measure of positive interactions between features. This perspective allows us
to derive the different criteria from the joint likelihood by making different inde-
pendence assumptions on the underlying probability distributions. We provide
an empirical study which reinforces our theoretical conclusions, whilst revealing
implementation considerations due to the varying magnitudes of the relevancy
and redundancy terms.
We then investigate the benefits our probabilistic perspective provides for the
application of these feature selection criteria in new areas. The joint likelihood
automatically includes a prior distribution over the selected feature sets and so
we investigate how including prior knowledge affects the feature selection process.
We can now incorporate domain knowledge into feature selection, allowing the
imposition of sparsity on the selected feature set without using heuristic stopping
criteria. We investigate the use of priors mainly in the context of Markov Blanket
discovery algorithms, in the process showing that a family of algorithms based
upon IAMB are iterative maximisers of our joint likelihood with respect to a
particular sparsity prior. We thus extend the IAMB family to include a prior for
domain knowledge in addition to the sparsity prior.
Next we investigate what the choice of likelihood function implies about the
resulting filter criterion. We do this by applying our derivation to a cost-weighted
likelihood, showing that this likelihood implies a particular cost-sensitive filter
criterion. This criterion is based on a weighted branch of information theory and
we prove several novel results justifying its use as a feature selection criterion,
namely the positivity of the measure, and the chain rule of mutual information.
We show that the feature set produced by this cost-sensitive filter criterion can be
used to convert a cost-insensitive classifier into a cost-sensitive one by adjusting
the features the classifier sees. This can be seen as an analogous process to
that of adjusting the data via over or undersampling to create a cost-sensitive
classifier, but with the crucial difference that it does not artificially alter the data
distribution.
Finally we conclude with a summary of the benefits this loss function view
of feature selection has provided. This perspective can be used to analyse
feature selection techniques other than those based upon information theory, and
new groups of selection criteria can be derived by considering novel loss functions.
Declaration
No portion of the work referred to in this thesis has been
submitted in support of an application for another degree
or qualification of this or any other university or other
institute of learning.
Copyright
Acknowledgements
First I would like to thank my supervisors Gavin Brown and Mikel Luján, for
their guidance and help throughout my PhD. Even when they both insist on
sneaking up on me.
Thanks to everyone in the MLO lab, though you will have to fix your own
computers when I am gone. Special thanks to Richard S, Richard A, Joe, Peter,
Freddie, Nicolò, Mauricio, and Kevin for ensuring I retained a small measure of
sanity. Thanks also to the iTLS team, even if I never got chance to play with all
that data.
Thanks are due to my flatmates Alex and Jon, even if none of us were partic-
ularly acquainted with the vacuum cleaner. Thanks to the rest of my friends in
Manchester for sticking around as my life was slowly sacrificed to the PhD gods,
being ready for some gaming or a trip to Sandbar as and when I escaped the lab.
And finally thanks to my parents, my brother Matt and Sarah, without whom
I would not be capable of anything.
Notation
Chapter 1
Introduction
If we wish to make a decision about an important subject, the first step is of-
ten to gather data about the problem. For example, if we need to decide which
University to study at for a degree, it would be sensible to gather relevant in-
formation like the courses offered by different Universities, what the surrounding
environment is like, and what the job prospects are for graduates of the insti-
tutions. If we instead chose to base our decision on irrelevant information like
the colour of the cars parked outside each University, then our choice would not
be well informed and the course might be entirely unsuitable. The problem of
choosing what information to base our decisions on must be solved before we can
even attempt the larger problem of making a decision. The task of automatically
making decisions falls under the remit of Artificial Intelligence, and specifically
the field of Machine Learning. Machine Learning is about constructing models
which make predictions based upon data, and in this thesis we focus on that first
step in producing a prediction, namely selecting the relevant data.
In this chapter we motivate the problem of feature selection, giving a brief
introduction to the relevant topics and explaining the importance of the problem.
We state the specific questions that this thesis answers, explaining the relevancy
of those questions to the field. We then provide a brief outline of the thesis
structure, and detail the publications which have resulted from this thesis.
If we consider the task of feature selection, we can separate out features into
three different classes: relevant, redundant and irrelevant (we will give precise
definitions of each of these concepts in Chapter 3). Relevant features are those
which contain information which will help solve the prediction problem. Without
these features it is likely we will have incomplete information about the problem
we are trying to solve, therefore our predictions will be poor. As mentioned
previously, the proportion of graduates who have jobs or gone on to further study
would be a very relevant feature when choosing a degree course. Redundant
features are those which contain useful information that we already know from
another source. If we have two features, one which tells us a university has 256
Computer Science (CS) undergraduates, and the other tells us the same university
has more than 200 CS undergraduates, then the second is clearly redundant as the
first feature tells us exactly how many students there are (assuming both features
give true information). If we didn’t know the first feature, then the second would
give useful information about the popularity of the CS course, but it doesn’t add
any extra information to the first feature. Irrelevant features are those which
contain no useful information about the problem. It is easy to think of irrelevant
features for any given problem, for example the equatorial circumference of the
Moon is unlikely to be relevant to the problem of choosing a degree course. If
we could go through our data, and label each feature relevant, redundant or
irrelevant then the feature selection task would be easy. We would simply select
all the relevant features, and discard the rest. We could then move on to the
complex problem of learning how to predict based upon those features.
Unfortunately, as with many things in machine learning, the high level view
sounds simple but the details of implementing such a process can be very hard.
In the extreme examples described above it was simple to determine if a feature
was relevant, irrelevant, or redundant in the presence of other features. In real
problems it is usually much more difficult. For example, if one of our features
is the number of bricks in the Computer Science building at the university, will
that feature help us decide if the Computer Science undergraduate course will
be good? Intuitively we might expect this feature to be irrelevant, but this is
not necessarily the case. We might imagine that a large number means there
is a large CS building, (hopefully) full of fascinating lectures and the latest in
computer hardware. Equally the number could be very small, if the building is
modern and built from glass and steel, and yet still contain the truly relevant
(but hard to quantify) things which will make a good undergraduate course. The
number could also be somewhere in the middle, indicating a smaller building
with fewer students. This feature could interact with other features we have
collected, changing the meaning of that feature. Our prospective undergraduate
could prefer going to a large department with many students, where there is more
potential for socialising and working with different people. Or they could prefer
a small department which is less busy and where it is possible to know all the
students and staff. These two features (the number of bricks, and our prospective
student’s preference) interact, so a student who wants a large department might
prefer a course with a large number of bricks (or a very small number) denoting
a large CS building. The student who favours a small department might prefer a
course with a medium number of bricks, denoting a smaller CS building. However
without considering both of these features together our feature selection process
might believe the number of bricks to be irrelevant, as small, medium and large
values all lead to good and bad choices of undergraduate course, because the
choice also depends on the student’s preference.
information, and so does not make the number of bricks completely redundant.
If analysing features to determine these properties is so difficult for humans,
how can we construct algorithms to perform the task automatically? The stan-
dard approach is to use supervised learning where we take a dataset of features,
label them with the appropriate class, and then try to learn the relationship be-
tween the features and the class. In our example we would measure all of the
features for each different university course, gather a sample of prospective under-
graduates, survey their preferences for courses and finally record their choice of
course and whether they were happy with that choice. For each student we would
have a list of features pertaining to their chosen university course, their individ-
ual preferences, and a label which stated if they were happy with the course. We
then test each feature to see how relevant it is to the happiness of the student.
The choice of the relevancy test (or selection criterion) is one of the principal
areas of research in feature selection, and forms the topic of this thesis.
One further question we might ask is why analyse the features at all. Surely
if we find a sufficiently sophisticated machine learning algorithm to make our
prediction about the choice of course then we do not need to separately anal-
yse the features. While many modern machine learning methods are capable of
learning in the presence of irrelevant features, this comes at the expense of extra
run time and requires additional computer power. In addition to these strictly
computational benefits, there are also reductions in cost by not collecting the
irrelevant features. If collecting each feature has a material or time cost (such as
interviewing lecturers at a university, or surveying the surrounding towns), then
if we do not need those features we can avoid that collection cost. We can see
that even if we assume our learning algorithm can cope with irrelevant features
there are benefits to removing them through feature selection.
The literature surrounding feature selection contains many different kinds of se-
lection criteria [50]. Most of these criteria have been constructed on an ad-hoc
basis, attempting to find a trade off between the relevancy and redundancy of
the selected feature set. The heuristic nature of selection criteria is particularly
apparent in the field of information theoretic feature selection, upon which this
thesis focuses. There has been little work which aims to derive the optimal fea-
ture selection criterion based upon a particular evaluation measure that we wish
to minimise/maximise (e.g. we might wish to minimise the number of mistakes
our prediction algorithm makes when given a particular feature set). Therefore
the first question is “Can we derive a feature selection criterion which minimises
the error rate (or a suitable proxy for the error rate)?”. Having achieved this
we wish to understand the confusing literature of information theoretic feature
selection criteria, by relating them to our optimal criterion. The next question is
therefore “What implicit assumptions are made by the literature on information
theoretic criteria, and how do they relate to the optimal criterion?”. Once we
have understood the literature we should look at what other benefits a princi-
pled framework for feature selection might provide, and how we could extend the
framework to other interesting areas, such as cost-sensitivity. We therefore ask
one final question, “How should we construct a cost-sensitive feature selection
algorithm?”.
• Empirical study of the same criteria, showing how they behave in terms of
stability and accuracy across a range of datasets (15 UCI, 5 gene expression,
and 2 from the NIPS-FS Challenge). This study shows how the different
criteria respond to differing amounts of data, and how the theoretical points
influence empirical performance (Chapter 5).
In collaboration with other members of the iTLS project, several other papers
were published which contain work which is not relevant to this thesis.
[89] — A. Pocock, P. Yiapanis, J. Singer, M. Luján and G. Brown. Online Non-
Stationary Boosting. In J. Kittler, N. El-Gayar, F. Roli, editors, 9th International
Workshop on Multiple Classifier Systems, (MCS 2010).
[57] — N. Ioannou, J. Singer, S. Khan, P. Xekalakis, P. Yiapanis, A. Pocock,
G. Brown, M. Luján, I. Watson, and M. Cintra. Toward a more accurate under-
standing of the limits of the TLS execution paradigm. 2010 IEEE Symposium
on Workload Characterisation, (IISWC 2010).
[98] — J. Singer, P. Yiapanis, A. Pocock, M. Luján, G. Brown, N. Ioannou,
and M. Cintra. Static Java program features for intelligent squash prediction.
Statistical and Machine learning approaches to ARchitecture and compilaTion
(SMART10).
Software
To support the experimental studies in this thesis several libraries were developed:
Chapter 2
Background
In the previous chapter we briefly investigated the notions of feature selection and
prediction. We now provide a fuller treatment of those areas, and the
background material necessary to understand the ideas presented in this thesis.
We also introduce much of the common notation used throughout the thesis.
As this material is common to many branches of machine learning it forms the
basis of many textbooks; the particular references used for this chapter (unless
otherwise stated) are Bishop [10], Duda et al. [35] and Cover & Thomas [24].
We begin by revisiting the classification problem in Section 2.1, giving a for-
mal definition of the problem before detailing some of the common methods for
evaluating classification problems. We also explore the notions of likelihood and
probabilistic modelling, which are central to the results in the later chapters.
We review the literature around cost-sensitive classification algorithms in Section
2.2. Finally we introduce Information Theory in Section 2.3, and explore the
two variants we use in this thesis, Shannon’s original formulation of Entropy and
Information [97], and Guiaşu’s formulation of the Weighted Entropy [46]. We
then review the links between information theory and the classification problem
in the current literature.
2.1 Classification
The most common task in machine learning is predicting an unknown property
of an object. We base the predictions on a set of inputs or features, and the
predictions themselves come in two main kinds. Classification is the process
of predicting an integer or categorical label from the features. Regression is
the process of predicting a real-numbered value from the features. Whilst these
processes are similar, as regression can be seen as classification in an ordered
space, the two processes are usually treated separately. For the remainder of this
thesis we will focus on classification problems.
We can formally express a classification problem as an unknown mapping
φ : X → Y from a d-dimensional vector of the real numbers, X ∈ Rd , to a member
of the set of class labels, Y ∈ {y1 , y2 , · · · }. The problem for any given classification
algorithm is to learn the general form of this mapping. In supervised learning
tasks we learn this mapping from a set of data examples which have been labelled
beforehand. This is a difficult task as labelled data is more expensive to acquire
than unlabelled data, as it requires more processing (and usually some human
oversight). Each data example forms a tuple {x, y} of a feature vector x and
the associated class label y. In general we will assume our data is independently
and identically distributed (i.i.d. ) which means that each training example tuple
is drawn independently from the same underlying distribution. We can think
of this mapping as producing a decision boundary which separates the feature
space into subspaces based upon what class label is mapped to a subspace. A
classification model is the estimated mapping function f , which takes in x and
some parameters τ and returns a predicted class label ŷ,
ŷ = f (x, τ ). (2.1)
The task is then to find the model parameters τ which give the best predictions
ŷ. We will look at how to measure the quality of the predictions in more detail
later.
In general we wish to minimise the number of parameters which are fitted by
the classification algorithm. This is an application of Occam’s Razor: we wish
to find the simplest rule which explains all the data, as we expect this will lead
to the best performance on unseen data. As the number of parameters increases
there are more different ways to fit the available training data, and any given
classification rule (which is a function of the parameters) becomes more complex.
In Figure 2.1 we can see two example classification boundaries which separate the
circles and stars. The solid line is a linear boundary, and thus has few parameters
(a linear boundary in d dimensions is a hyperplane with d + 1 parameters). This line does not
perfectly separate the two classes; it incorrectly classifies 6 training examples.
Figure 2.1: Two example classification boundaries. The solid line is from a linear
classifier, and the dashed line is from a non-linear classifier.

The dashed line is a non-linear boundary, and thus has comparatively many
parameters to control the different line segments. This line perfectly separates
the two classes on the training data, but we would expect it to perform poorly
on testing data as it has overfit to the training data. Each class in the figure
is drawn from a 2-dimensional Gaussian distribution with unit variance; if we
drew some testing data from those Gaussians the non-linear boundary would
perform very poorly compared to the linear one. This phenomenon of overfitting
is something which feature selection can help reduce, as it reduces the dimension
of the problem which in turn reduces the potential for overfitting.
We can measure the performance of classification algorithms in multiple dif-
ferent ways, with the most common being the error rate, i.e. the number of mis-
classified examples divided by the total number of examples tested. Using ŷ^i to
denote the predicted label for the ith example and y^i to denote the corresponding
true label we express the error rate for a dataset D and model f as,

\mathrm{err}(D, f) = 1 - \frac{1}{N} \sum_{i=1}^{N} \delta_{y^i, \hat{y}^i}    (2.2)
where δ_{y^i, ŷ^i} is the Kronecker delta function, which returns one if the arguments
are identical (i.e. if y^i = ŷ^i), and zero otherwise (i.e. y^i ≠ ŷ^i).
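As a concrete illustration (a sketch, not taken from the thesis; the labels are hypothetical), the error rate of Equation (2.2) can be computed directly from paired lists of true and predicted labels:

```python
def error_rate(y_true, y_pred):
    """Error rate of Equation (2.2): one minus the average Kronecker delta agreement."""
    n = len(y_true)
    correct = sum(1 for y, y_hat in zip(y_true, y_pred) if y == y_hat)
    return 1.0 - correct / n

# Hypothetical predictions from some classifier f.
print(error_rate([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))   # 0.4
```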
Table 2.1: The different kinds of classification outcome.

                  True Label
                   +      -
Prediction   +    TP     FP
             -    FN     TN

The Bayes Error (or Bayes Rate) of a dataset or problem is the error achieved by the optimal
classifier and represents the theoretical minimum error for that dataset. It is
usually described as a function of the noise in the data, and thus noisy datasets
where the unknown mapping φ contains an additional random element in general
have higher Bayes error. Alternatively the features x may not have sufficient
discriminative power to determine the class label, which also results in a high
Bayes error. The Bayes error of a problem is difficult to determine but there
exists a bound on it in terms of estimable values (see Section 2.3.2).
The error rate masks several important properties of the classification perfor-
mance that we may wish to examine separately. To explain this we introduce
the notions of true positives, true negatives, false positives and false negatives.
The definitions below strictly apply in two class problems, but in Chapter 7 we
will use multi-class versions by defining one class to be the positive class, and
the remaining classes as the negative classes. In two class problems we
usually refer to one class as the positive class, and the other as the negative class,
with the positive class denoting the one we are interested in (e.g. in a medical
situation the positive class is presence of disease, and the negative class is the
absence of disease).
The different kinds of classifications are neatly summarised in Table 2.1. From
these definitions we can define several new functions which we will use to measure
different aspects of classification performance. Many of these functions treat the
positive class differently to the negative class (or classes in the case of multi-class
problems), so throughout the thesis we will take care to define the positive class
when using these measures. These functions come in pairs, and we first detail
the precision and recall.
Definition 2. Precision and Recall.
Precision: the fraction of predicted positives which are actually positive.
\mathrm{Precision} = \frac{TP}{TP + FP}    (2.3)

Recall: the fraction of actual positives which are correctly predicted positive.

\mathrm{Recall} = \frac{TP}{TP + FN}    (2.4)
The precision and recall can be used in multi-class problems to measure the
predictive performance of the classifier for a particular class, as they do not require
the calculation of the number of true negatives, which is not well defined for multi-
class problems. Another common pair of error functions are the sensitivity and
specificity.
Definition 3. Sensitivity and Specificity.
Sensitivity: the fraction of actual positives which are correctly predicted positive.
Also known as the true positive rate.
\mathrm{Sensitivity} = \frac{TP}{TP + FN}    (2.5)

Specificity: the fraction of actual negatives which are correctly predicted negative.
Also known as the true negative rate.

\mathrm{Specificity} = \frac{TN}{TN + FP}    (2.6)
We note that the sensitivity and recall are different names for the same func-
tion, but in this thesis we will use the appropriate name depending on the other
measure in use. There are two further functions that we will use to evaluate clas-
sification performance, the balanced error rate and the F-Measure (or F-score).
Definition 4. Balanced Error Rate and F-Measure.
Balanced Error Rate (BER): the mean of the per-class error rates, i.e. one minus
the mean of the sensitivity and specificity.

\mathrm{BER} = 1 - \frac{\mathrm{Sensitivity} + \mathrm{Specificity}}{2}    (2.7)

F-Measure: the harmonic mean of the precision and recall.

\text{F-Measure} = \frac{2 \times TP}{2 \times TP + FN + FP}    (2.8)
The balanced error rate is useful to determine the classification performance when
the data has a class-imbalance, e.g. there are many more negative examples than
positive examples. The F-Measure is useful to summarise the predictive perfor-
mance of a particular class, as it takes into account the true positives, the false
positives, and the false negatives.
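All of the measures in Definitions 2-4 are simple functions of the four counts in Table 2.1. The following sketch (illustrative only; the labels are hypothetical) computes them for a designated positive class:

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, FN, TN, treating `positive` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t != positive)
    return tp, fp, fn, tn

def precision(tp, fp): return tp / (tp + fp)
def recall(tp, fn): return tp / (tp + fn)            # identical to the sensitivity
def specificity(tn, fp): return tn / (tn + fp)
def balanced_error(tp, fp, fn, tn): return 1.0 - 0.5 * (recall(tp, fn) + specificity(tn, fp))
def f_measure(tp, fp, fn): return 2 * tp / (2 * tp + fn + fp)

tp, fp, fn, tn = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1], positive=1)
print(precision(tp, fp), recall(tp, fn), f_measure(tp, fp, fn))
```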
any given classifier. If our classifier performs well on the majority of data, but
poorly on rare examples (i.e. ones with a small p(x)) then it will perform well in
expectation. Similarly, good performance across most of the states of X does not
guarantee good performance in expectation if the classifier is poor at predicting
in the most common states. To incorporate this information we need a joint
probability p(y, x), which is the probability of both events y and x happening
together. The joint probability distribution over Y and X can be constructed
from p(Y|X) by multiplying by p(X), so

p(y, x) = p(y|x)\,p(x).    (2.9)
We can now construct one of the most important formulae in Machine Learn-
ing, Bayes’ Theorem (or Bayes’ Rule), which we can use to convert probabilities
we can estimate from the data into the probability of a particular class label given
that data. Bayes’ Theorem follows from the commutativity of probability (i.e.
p(x, y) = p(y, x)) and the definition of joint probability given in Equation (2.9),
resulting in,
p(Y|X) = \frac{p(X|Y)\,p(Y)}{p(X)}.    (2.10)
We can estimate the probability of the data, p(X), and the probability of the
class label, p(Y ), from our training dataset. We can also estimate p(X|Y ) by
partitioning the dataset by class label and separately estimating p(X|Y = y) for
each y. Then we can use Bayes’ Theorem to produce the probability of each of
the possible labels for a particular datapoint x. In the context of Bayes’ Theorem
we call p(Y ) the prior, which reflects our belief in the probability of Y a priori
(i.e. before seeing any data), and we call p(Y |X) the posterior, which reflects our
updated belief in the probability of Y once we have seen the data X. If we use
Bayes’ Theorem and return the most likely class label as our prediction, we have
chosen the mode of the distribution p(Y |X) which is the maximum a posteriori
(MAP) solution. This is an alternative to the Maximum Likelihood solution,
which is simply the largest p(X|Y ), otherwise known as the model likelihood.
We return to the concept of the MAP solution in Chapter 4 where we explore the
MAP solution to the feature selection problem.
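The following sketch makes this use of Bayes’ Theorem concrete for a single discrete feature; the data and labels are hypothetical, and the probabilities are histogram estimates:

```python
from collections import Counter

# Hypothetical (x, y) pairs: a single binary feature and a class label.
data = [(0, 'spam'), (1, 'spam'), (1, 'spam'), (0, 'ham'), (0, 'ham'), (1, 'ham')]

def map_label(x_query, data):
    """Return the maximum a posteriori class label for x_query via Bayes' Theorem (2.10)."""
    n = len(data)
    class_counts = Counter(y for _, y in data)                # used for the prior p(Y)
    p_x = sum(1 for x, _ in data if x == x_query) / n         # evidence p(X = x_query)
    posteriors = {}
    for y, n_y in class_counts.items():
        likelihood = sum(1 for x, yy in data if yy == y and x == x_query) / n_y  # p(X|Y=y)
        posteriors[y] = likelihood * (n_y / n) / p_x          # Bayes' Theorem
    return max(posteriors, key=posteriors.get), posteriors

print(map_label(1, data))   # 'spam' is the MAP label for x = 1
```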
In general we do not have access to the true probability distribution p, and
so this needs to be estimated from our training data. There are many different
approaches to this estimation process but in this thesis we will focus on discrete
probability distributions rather than probability densities, and thus we use sim-
ple histogram estimators. These count the number of occurrences of a state in a
dataset, and return the normalised count as a probability. This approach returns
the correct distribution in the limit of infinite data, and is a reasonable approxi-
mation with finite data. As the distribution becomes higher dimensional (e.g. as
we model more and more features) our estimate of the distribution takes longer to
converge to the true distribution. This problem is commonly known as the “curse
of dimensionality” [35], and is the reason many machine learning algorithms seek
low dimensional approximations to the true joint distribution p(x, y). Other ap-
proaches commonly used involve assuming each distribution follows a particular
functional form, such as a Gaussian, and then estimating the parameters which
control that distribution by fitting it to the data (e.g. the mean and the variance
for the Gaussian distribution). This approach is more popular when taking the
fully Bayesian approach to modelling a system [10], where everything is expressed
in terms of a probability distribution or density, and there are no hard values re-
turned. We will look at the Bayesian approach to modelling systems more in
Chapter 3 when we look at Bayesian Networks.
When dealing with classification algorithms we often have parameters we can
tune to alter the behaviour of the algorithm. These are referred to as “hyper-
parameters” of the model, in contrast to the model parameters which are fit
by the training process. We would like to express the probability of our model
parameters given the data, and to optimise those parameters to maximise the
probability, which in turn minimises our error rate. When used in this way we re-
fer to the likelihood, L, of the data given the parameters, where parameters with
higher likelihood fit the data better. We then construct our prior distribution
over the parameters and use Bayes’ Theorem to calculate the probability of our
parameters given the data.
The likelihood of a model is a central concept in this thesis, as it represents
how well a model fits a given dataset. The likelihood of a set of model parameters
is
L(\tau) = \prod_{i=1}^{N} q(y^i, x^i | \tau).    (2.11)

Here q is our predictive model which returns a probability for a given {y^i, x^i} pair,
based upon the parameters τ . The likelihood takes the form of a product of these
probabilities over the datapoints due to the i.i.d. assumption made about the
data. Including a prior distribution p(τ) over the parameters gives the joint
likelihood,

L(y, x, \tau) = p(\tau) \prod_{i=1}^{N} q(y^i, x^i | \tau).    (2.12)
Maximising this value means we find the parameters τ that both model the data,
and are a priori most likely. We use a variant of this joint likelihood in Chapter
4 to investigate feature selection.
In classification problems we care about maximising the discriminative per-
formance, i.e. how well our model predicts y i from a given xi . This is represented
by the conditional likelihood of the labels with respect to the parameters,
L(\tau | D) = \prod_{i=1}^{N} q(y^i | x^i, \tau).    (2.13)
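In practice these products are computed as sums of logarithms. A minimal sketch of Equations (2.11)-(2.13) as log-likelihoods (illustrative only; the model q, the prior, and the toy Bernoulli example are hypothetical placeholders):

```python
import math

def log_likelihood(data, q, tau):
    """log L(tau), Equation (2.11): sum of log q(y, x | tau) over the datapoints."""
    return sum(math.log(q(y, x, tau)) for x, y in data)

def joint_log_likelihood(data, q, tau, log_prior):
    """log of Equation (2.12): the log prior over the parameters plus the log-likelihood."""
    return log_prior(tau) + log_likelihood(data, q, tau)

def conditional_log_likelihood(data, q_cond, tau):
    """log L(tau | D), Equation (2.13): uses the discriminative model q(y | x, tau)."""
    return sum(math.log(q_cond(y, x, tau)) for x, y in data)

# Toy usage: a Bernoulli model over y that ignores x, with parameter tau, and a flat prior.
toy_q = lambda y, x, tau: tau if y == 1 else 1.0 - tau
toy_data = [(None, 1), (None, 1), (None, 0)]
print(log_likelihood(toy_data, toy_q, 0.6))
print(joint_log_likelihood(toy_data, toy_q, 0.6, lambda tau: 0.0))
```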
The Naïve Bayes classifier assumes the features are independent of each other
given the class label, and predicts the label

\hat{y} = \arg\max_{y \in Y} \Big\{ p(Y = y) \prod_{i=1}^{d} p(X_i | Y = y) \Big\}.    (2.14)
The conditional independence factorises the joint distribution into the product
of marginal distributions for each feature. As we shall see when we consider fea-
ture selection, the assumption of class-conditional independence is not generally
a valid one, and thus the Naïve Bayes classifier is suboptimal in many cases.
However it provides surprisingly good classification performance even when the
naïve assumption is provably untrue [33]. In addition, it is fast to train on a given
dataset, and is equally fast at classifying a test dataset. In the next chapter we
will look at Bayesian Networks, and see how the Naïve Bayes classifier can be
interpreted as a simple Bayesian Network.
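A minimal sketch of a discrete Naïve Bayes classifier implementing the prediction rule of Equation (2.14), using histogram (maximum likelihood) probability estimates and no smoothing; the toy data is hypothetical:

```python
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Estimate p(Y) and each p(X_i | Y) by counting (maximum likelihood, no smoothing)."""
    n = len(y)
    class_counts = Counter(y)
    priors = {c: class_counts[c] / n for c in class_counts}
    cond = defaultdict(Counter)                    # (feature index, class) -> value counts
    for xi, yi in zip(X, y):
        for i, v in enumerate(xi):
            cond[(i, yi)][v] += 1
    return priors, cond, class_counts

def predict_naive_bayes(x, priors, cond, class_counts):
    """Equation (2.14): argmax_y p(Y=y) * prod_i p(X_i = x_i | Y=y)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for i, v in enumerate(x):
            score *= cond[(i, c)][v] / class_counts[c]
        scores[c] = score
    return max(scores, key=scores.get)

X = [(1, 0), (1, 1), (0, 1), (0, 0)]     # hypothetical binary feature vectors
y = ['a', 'a', 'b', 'b']
model = train_naive_bayes(X, y)
print(predict_naive_bayes((1, 1), *model))   # 'a'
```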
Support Vector Machines (SVMs) [22] are an optimal way of drawing a linear
classification boundary which maximises the margin (the distance between the
classification boundary and the closest datapoints of each class). The problem
of finding the maximum margin boundary is solved by identifying the support
vectors which control the position of the boundary (usually the examples on the
convex hull of each class). SVMs have become ubiquitous in Machine Learning,
as the problem formulation has a convex solution (so there is a unique minimum)
which can be found using quadratic programming solvers so the optimal solution
is always returned. While the SVM only finds linear boundaries and thus has in-
sufficient complexity to model non-linear functions, it is possible to transform the
feature space into a higher dimension which might permit a linear solution. This
is called the kernel trick as the mapping is done through a kernel function, and
it is a powerful property of the SVM algorithm. When the (high-dimensional)
linear boundary is mapped back into the original (low-dimensional) space it pro-
duces a non-linear boundary, though one which is still optimal in terms of the
margin. The SVM is a two-class classifier, in contrast to the k-NN and Naive
Bayes methods described above which can deal with multi-class problems, though
there are extensions to the SVM which give multi-class classifiers.
These are the three classifiers we will use throughout the remainder of the
thesis, though of course there exist many other classifiers tailored to different
problems. One important class of algorithms are ensemble techniques which com-
bine multiple classification models into one classification system [66]. We refer
the reader to Kuncheva [66] for more detail on ensemble algorithms, but note
that these algorithms are popular when dealing with complex cost-sensitive clas-
sification problems and we now review the literature surrounding cost-sensitive
problems.
In many classification problems one kind of error can be more costly than other
kinds, e.g. false negatives are usually much more costly than false positives in
medical situations, as the cost of not treating the disease is generally higher than
the cost of unnecessary treatment. If we have asymmetric costs (where one class
is more important than another) we would like to train a classifier which could
focus on correctly classifying examples of that class. A closely related problem is
classifying in unbalanced datasets, where there are vastly more examples of one
class than another. In these datasets it is simple to achieve a low error rate by
continually predicting the majority class, though the classifier has learned little
about the structure of the problem beyond the asymmetry in the class priors.
In the framework of Bayesian decision theory, the conditional risk of predicting
class ŷ for a datapoint x is R(\hat{y}|x) = \sum_{y \in Y} c(\hat{y}, y)\, p(y|x).
Here c(ŷ, y) is the entry in the cost matrix associated with predicting class ŷ
given that the true class is y. The Bayes risk is achieved by predicting the class
label which minimises the conditional risk R(ŷ|x). This is the optimal prediction
in terms of reducing the misprediction costs. Elkan [38] shows that in two-class
problems the cost matrix is over-specified as there are only 2 degrees of freedom,
each of which controls the cost for mispredicting one label. While the cost matrix
approach is simple to understand it has some restrictions, as it does not allow
the costs to be example dependent. Elkan proposed a more general framework
[38, 114] which gives each example a weight based upon how important it is to
classify correctly. This allows the weight to be a function of both y and x, allowing
it to vary with the value of the features as well as the label. This approach can
be extended to the multi-class case by giving each example a vector of |Y | − 1
weights, where |Y | is the number of classes.
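Given a cost matrix c(ŷ, y) and a class posterior p(y|x), the minimum-risk prediction described above can be computed directly. A sketch (the cost matrix and posterior values are hypothetical):

```python
def conditional_risk(y_hat, posterior, cost):
    """R(y_hat | x) = sum over y of c(y_hat, y) * p(y | x)."""
    return sum(cost[(y_hat, y)] * p for y, p in posterior.items())

def min_risk_prediction(posterior, cost, labels):
    """Predict the label that minimises the conditional risk (the Bayes risk decision)."""
    return min(labels, key=lambda y_hat: conditional_risk(y_hat, posterior, cost))

# Hypothetical two-class medical example: a false negative costs 10, a false positive costs 1.
cost = {('+', '+'): 0, ('+', '-'): 1, ('-', '+'): 10, ('-', '-'): 0}
posterior = {'+': 0.2, '-': 0.8}            # p(y | x) from some classifier
print(min_risk_prediction(posterior, cost, ['+', '-']))   # '+': chosen despite p('+') < 0.5
```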
classifier trained on the new data behaves like a cost-sensitive classifier trained
on the original data. Examples of this are the widely used SMOTE technique
[19], and the Costing ensemble algorithm [114]. These approaches resample the
data according to the cost of each example, before training a standard (cost-
insensitive) classifier on the newly resampled data. However, this strategy has the
consequence of distorting the natural data distribution p(x, y), and so the supplied
training data will not be i.i.d. with respect to the testing data, potentially causing
problems with overfitting. The MetaCost algorithm [32] relabels the data based
upon an ensemble prediction and the risk, before training a standard classifier on
the relabelled data. We can view these approaches as distorting how the classifier
“sees” the world – encouraging it to focus on particular types of problems in the
data. The popular LibSVM [17] implementation of the SVM classifier uses an
analogous system where the internal cost function of the classifier is changed so
some examples are more costly to classify incorrectly.
Dmochowski et al. [31] investigate using a weighted likelihood function to
integrate misclassification costs into the (binary) classification process. Each
example is assigned a weight based upon the tuple {x,y}, and the likelihood of
that example is raised to the power of the assigned weight. They prove that
the negative weighted log likelihood forms a tight, convex upper bound on the
empirical loss, which is the expected conditional risk across a dataset. This
property is used to argue that maximising the weighted likelihood is the preferred
approach in the case where the classifier cannot perfectly fit the true model. We
will look at this weighted likelihood in more detail in Chapter 7.
The logarithm base defines the units of entropy, with log2 using bits, and loge
using nats. In general the choice of base does not matter provided it is consistent
throughout all calculations. In this thesis we will calculate entropy using log2
unless otherwise stated. High values of entropy mean the state of x is very uncer-
tain (and thus highly random), and low values mean the state of x is more certain
(and thus less random). Entropy also increases with the number of possible states
of X, as if X has 4 states there are more possibilities for the state of x than if
X had 2 states. Entropy is maximised when all states of X are equally likely,
as then the state of x is most uncertain/hardest to predict, and H(X) = log |X|
where |X| denotes the number of states of X.
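A plug-in (histogram) estimate of the entropy from a discrete sample, of the kind used throughout this thesis, can be sketched as follows (illustrative only):

```python
import math
from collections import Counter

def entropy(samples, base=2):
    """Plug-in Shannon entropy: -sum_x p(x) log p(x), with p(x) estimated by frequencies."""
    n = len(samples)
    return -sum((c / n) * math.log(c / n, base) for c in Counter(samples).values())

print(entropy([0, 0, 1, 1]))                   # 1.0 bit: two equally likely states
print(entropy(['a'] * 8))                      # 0.0: the state is certain
print(entropy([0, 1, 2, 3]), math.log(4, 2))   # maximal: H(X) = log|X| for a uniform variable
```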
The Conditional Entropy of X conditioned on Y measures the expected un-
certainty of the state of a sample of x when Y is known. It is averaged over
the possible states of Y so it gives a useful measure in the abstract when Y is
unknown. This has two equivalent definitions, in terms of the joint probability
distribution p(x, y),
H(X|Y) = \sum_{y \in Y} p(y)\, H(X|Y = y)    (2.17)

= -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(x|y).    (2.18)
The conditional entropy can also be defined as the entropy of the joint variable
XY minus the entropy of Y,

H(X|Y) = H(XY) - H(Y).    (2.19)
The conditional entropy is upper bounded by the marginal entropy, and lower
bounded by 0.
0 ≤ H(X|Y ) ≤ H(X) (2.20)
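The conditional entropy can be estimated in the same way, here via the chain rule form of Equation (2.19); a self-contained sketch (illustrative only):

```python
import math
from collections import Counter

def _plugin_entropy(counts, n, base=2):
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

def conditional_entropy(xs, ys, base=2):
    """H(X|Y) = H(XY) - H(Y), Equation (2.19), using plug-in estimates."""
    n = len(xs)
    return (_plugin_entropy(Counter(zip(xs, ys)), n, base)
            - _plugin_entropy(Counter(ys), n, base))

xs, ys = [0, 0, 1, 1], [0, 0, 0, 1]
h_x = _plugin_entropy(Counter(xs), len(xs))
print(0 <= conditional_entropy(xs, ys) <= h_x)   # the bounds of Equation (2.20) hold
```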
The mutual information can also be expressed as the KL-Divergence (also referred to as the relative entropy) between the
joint distribution p(x, y) and the product of both marginal distributions p(x)p(y),
defined as follows
I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}.    (2.24)
With this formulation we can see how the mutual information reaches its max-
imal and minimal values more easily. The maximal value is the minimum of
the two entropies H(X) and H(Y ), and occurs when knowledge of one variable
allows perfect prediction of the state of the other. In the above equation this
is due to p(x, y) becoming equal to either p(x) or p(y) for all values, and then
the KL-Divergence cancels down to Equation (2.16), the standard entropy. The
minimal value is 0, which occurs when X and Y are independent. In the above
equation this causes p(x, y) to factorise into p(x)p(y), thus the fraction is unity,
and the logarithm becomes zero. In the field of feature selection it is common to
use a normalised function of the mutual information, termed the Symmetric Un-
certainty. This is the mutual information between the two variables normalised
by the sum of their marginal entropies, therefore
SU(X; Y) = \frac{I(X; Y)}{H(X) + H(Y)}    (2.25)
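The mutual information of Equation (2.24) and the symmetric uncertainty of Equation (2.25) follow the same plug-in pattern; a self-contained sketch (illustrative only):

```python
import math
from collections import Counter

def mutual_information(xs, ys, base=2):
    """I(X;Y) = sum_{x,y} p(x,y) log [ p(x,y) / (p(x) p(y)) ], Equation (2.24)."""
    n = len(xs)
    p_xy, p_x, p_y = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((p_x[x] / n) * (p_y[y] / n)), base)
               for (x, y), c in p_xy.items())

def symmetric_uncertainty(xs, ys, base=2):
    """SU(X;Y) = I(X;Y) / (H(X) + H(Y)), Equation (2.25)."""
    n = len(xs)
    h = lambda counts: -sum((c / n) * math.log(c / n, base) for c in counts.values())
    return mutual_information(xs, ys, base) / (h(Counter(xs)) + h(Counter(ys)))

xs = [0, 0, 1, 1]
print(mutual_information(xs, xs))              # equals H(X): perfect dependence
print(mutual_information(xs, [0, 1, 0, 1]))    # 0.0: independent variables
print(symmetric_uncertainty(xs, xs))           # 0.5 when the two variables are identical
```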
The conditional mutual information measures the dependency between two vari-
ables when the state of a third is known, and is defined in terms of an expected
KL-Divergence as follows,
I(X; Y|Z) = \sum_{z \in Z} p(z)\, I(X; Y|Z = z)    (2.29)

= \sum_{z \in Z} p(z) \sum_{x \in X} \sum_{y \in Y} p(x, y|z) \log \frac{p(x, y|z)}{p(x|z)\,p(y|z)}.    (2.30)
Unlike the conditional entropy, the conditional mutual information can be larger
than the unconditioned mutual information. This is due to a positive interaction
between the variables, where the dependency between two variables is increased
by the knowledge of the state of a third. Again like the mutual information, the
conditional mutual information is zero when the two variables are independent,
conditioned on the presence of the third variable. The maximal value of the
conditional mutual information is the minimum of the two conditional entropies
H(X|Z) and H(Y |Z), again achieved when knowledge of one variable (and the
conditioning variable) allows perfect prediction of the state of the other. We will
see an important use of the conditional mutual information when we investigate
structure learning in Bayesian Networks in the next chapter. In Chapters 4 and
5 we will see how the conditional mutual information is particularly important in
the context of filter feature selection.
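The conditional mutual information can be estimated by partitioning the data on the conditioning variable, as in Equation (2.29); the sketch below (illustrative only) also shows the positive interaction discussed above, where I(X;Y|Z) exceeds I(X;Y):

```python
import math
from collections import Counter

def _mi(xs, ys, base=2):
    n = len(xs)
    p_xy, p_x, p_y = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((p_x[x] / n) * (p_y[y] / n)), base)
               for (x, y), c in p_xy.items())

def conditional_mutual_information(xs, ys, zs, base=2):
    """I(X;Y|Z) = sum_z p(z) I(X;Y | Z=z), Equation (2.29), with plug-in estimates."""
    n = len(zs)
    total = 0.0
    for z, count in Counter(zs).items():
        idx = [i for i in range(n) if zs[i] == z]
        total += (count / n) * _mi([xs[i] for i in idx], [ys[i] for i in idx], base)
    return total

# XOR-style positive interaction: X and Y appear independent, but Y = X xor Z.
xs = [0, 0, 1, 1, 0, 0, 1, 1]
zs = [0, 1, 0, 1, 0, 1, 0, 1]
ys = [x ^ z for x, z in zip(xs, zs)]
print(_mi(xs, ys))                                  # 0.0
print(conditional_mutual_information(xs, ys, zs))   # 1.0 bit, larger than I(X;Y)
```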
One of the reasons for the popularity of Information Theory in Machine Learning
is the existence of bounds on the Bayes Error of a classification problem, defined
in terms of the Conditional Entropy. There exists both an upper and lower
bound in terms of the conditional entropy and the range between these bounds
shrinks as the conditional entropy decreases. The lower bound was proved by
Fano in 1961 [39], and the upper bound was proved by Hellman and Raviv in
1970 [55]. They form the informal basis for the use of mutual informations as
feature selection criteria, which we explore in the next chapter. We will derive an
alternate justification for the use of mutual information based criteria in Chapter
4.
For a two class problem the Bayes error is bounded by the conditional entropy
as follows:
\frac{H(Y|X) - 1}{\log |Y|} \leq e_{\mathrm{bayes}} \leq \frac{1}{2} H(Y|X).    (2.31)
With these bounds we can see that the conditional entropy represents the con-
straint placed upon the class label by the data. If the data does not constrain
the choice of class label then for any given feature vector, there may be multiple
different classes which could be assigned, so there is still a large amount of un-
certainty in the choice of class label, and thus the Bayes error will be high. If the
data tightly constrains the choice of class label then on average a given feature
vector will only have one possible class, and thus the Bayes error will be low.
Therefore a large value of H(Y |X) implies the features alone do not have enough
information to create a good classification boundary, whereas a small value of
H(Y |X) implies the features contain sufficient information to produce a good
boundary.
We can see the link between mutual information and feature selection by con-
sidering how the mutual information decomposes into sums of entropies, as in
Equation (2.21). As the entropy of the class label H(Y ) is constant, maximising
the mutual information I(X; Y ) between the features and the class label is equiv-
alent to minimising the conditional entropy H(Y |X), which in turn minimises the
Bayes Error. Therefore finding a feature set Xθ which maximises I(Xθ ; Y ) will
minimise the bound on the Bayes rate, and thus provide an informative feature
set for any future classification process. We will look at the literature in informa-
tion theoretic feature selection more closely in the next chapter, before deriving a
more concrete link between the mutual information and the classification process
in Chapter 4.
One further important point about the conditional entropy is that it is the
limit of the scaled negative conditional log-likelihood of the labels given the data,

\lim_{N \to \infty} -\frac{1}{N} \sum_{i=1}^{N} \log p(y^i | x^i) = H(Y|X).    (2.32)
We can estimate this from data, as the Strong Law of Large Numbers assures us
that the sample estimate using p̂ converges almost surely to the expected value
— for an i.i.d. dataset of N observations (x^i, y^i),

I(X; Y) \approx \hat{I}(X; Y) = \frac{1}{N} \sum_{i=1}^{N} \log \frac{\hat{p}(x^i, y^i)}{\hat{p}(x^i)\,\hat{p}(y^i)}.    (2.34)
In order to calculate this we need the estimated distributions p̂(x, y), p̂(x), and
p̂(y). The computation of entropies for continuous or ordinal data is highly
non-trivial, and requires an assumed model of the underlying distributions —
to simplify experiments throughout this thesis, we use discrete data and estimate
distributions with histogram estimators using fixed-width bins. The maximum
likelihood estimate of the probability of an event p(X = x) is given by the fre-
quency of occurrence of the event X = x divided by the total number of events
(i.e. datapoints). There exist many other methods for estimating entropy, though
they fall into two main approaches: plug-in estimators which first estimate the
probability distributions p̂, and direct entropy estimators which calculate the en-
tropy from data without constructing probability distributions (e.g. Póczos and
Schneider [90]). For more information on alternative entropy estimation proce-
dures, we refer the reader to Paninski [83].
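A sketch of the plug-in estimator of Equation (2.34), written as an average over datapoints (illustrative only; the random data is hypothetical). As discussed next, the estimate is only reliable when N is large relative to the number of states, which the example demonstrates by estimating the mutual information between two independent variables:

```python
import math
import random
from collections import Counter

def mi_hat(xs, ys, base=2):
    """Equation (2.34): the average over datapoints of log p̂(x,y) / (p̂(x) p̂(y))."""
    n = len(xs)
    p_xy, p_x, p_y = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(math.log((p_xy[(x, y)] / n) / ((p_x[x] / n) * (p_y[y] / n)), base)
               for x, y in zip(xs, ys)) / n

# Two independent variables, each with 8 states: the true I(X;Y) is zero, but the
# plug-in estimate is biased upwards when N is small relative to the state space.
random.seed(0)
for n in (20, 200, 20000):
    xs = [random.randrange(8) for _ in range(n)]
    ys = [random.randrange(8) for _ in range(n)]
    print(n, round(mi_hat(xs, ys), 3))
```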
At this point we must note that the approximation above holds only if N is
large relative to the dimension of the distributions over x and y. For example
if x, y are binary, N ≈ 100 should be more than sufficient to get reliable esti-
mates; however if x, y are multinomial, this will likely be insufficient. If we wish
2.4 Weighted Information Theory
Due to the presence of the weights, it is no longer bounded with respect to the
number of states in X, but by wmax log |X|, where wmax denotes the largest weight
in w. This measure is non-negative like the Shannon Entropy, and reduces to the
Shannon Entropy when all the weights are equal to one. The weights w do not
have to be normalised into the range 0 ≤ w ≤ 1, though any such normalisa-
tion will not change the relative values of the entropy, due to the linearity of
expectation.
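A sketch of a plug-in weighted entropy estimate, assuming Guiaşu's definition H_w(X) = -Σ_x w(x) p(x) log p(x) (the exact form is not reproduced above, so this expression and the weights below are stated as an assumption):

```python
import math
from collections import Counter

def weighted_entropy(samples, weights, base=2):
    """Assumed Guiaşu weighted entropy: -sum_x w(x) p(x) log p(x), with p estimated by frequencies."""
    n = len(samples)
    return -sum(weights.get(x, 1.0) * (c / n) * math.log(c / n, base)
                for x, c in Counter(samples).items())

samples = [0, 0, 1, 1]
print(weighted_entropy(samples, {0: 1.0, 1: 1.0}))   # all weights one: the Shannon entropy, 1.0
print(weighted_entropy(samples, {0: 1.0, 1: 3.0}))   # 2.0: up-weighting state 1 raises its contribution
# Both values respect the bound w_max * log|X| (here 3 * 1 = 3 bits).
```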
In the same book [46] Guiaşu briefly defines a “weighted entropic measure
of cohesion”, which is the sum of two marginal weighted entropies, minus the
joint weighted entropy. Luan et al. [75] also defined a similar quantity as the
“Quantitative-Qualitative measure of mutual information”, and Schaffernicht and
Gross [96] define this quantity as the Weighted Mutual Information. This lat-
ter term is the one we will use throughout this thesis. The weighted mutual
information can be defined in several equivalent forms.
The final definition is in the form of a weighted relative entropy [102], between
p(x, y) and p(x)p(y). However there is a flaw with this weighted information
measure, which was shown for the weighted relative entropy by Kvålseth in 1991
[68], namely that the measure can take negative values, i.e. for some X and Y ,
wI(X; Y ) < 0. This is a problem with the weighted mutual information which
has restricted its use in the literature, as there is no intuitive understanding of
what a negative information might mean. We provide examples of situations
with negative weighted mutual informations in Chapter 7 and there define a
new weighted information measure which is provably non-negative.
In the next chapter we provide a detailed review of the literature in feature selec-
tion and structure learning which gives the landscape in which the contributions
exist, detailing the state of the art in feature selection and pinpointing areas
where this thesis advances the state of the art.
Chapter 3
What is Feature Selection?
process.
Feature selection algorithms of all kinds rely upon a single assumption about
the data, that the feature set contains irrelevant and/or redundant features. Ir-
relevant features contain no useful information about the classification problem,
and redundant features contain information which is already present in more in-
formative features. If we can select a feature set which does not include these
irrelevant or redundant features then we can reduce the collection cost of the
feature set, and we can investigate the remaining relevant features to determine
how they relate to the class label. We may also improve classification perfor-
mance by reducing the potential for overfitting when shrinking the feature set.
These heuristic notions of relevancy and redundancy were formalised by Kohavi
& John [62] into three classes: strongly relevant, weakly relevant, and irrelevant.
In the definitions below, Xi indicates the ith feature in the overall set X, and X\i
indicates the set {X\Xi }, i.e. all features except the ith.
The strongly relevant features are not redundant as each one contains use-
ful information that is not present in any other combination of features. The
weakly relevant features contain information which is either already present in
the strongly relevant features, or exists in other weakly relevant features. The
irrelevant features contain no useful information about the problem. We would
like a feature selection algorithm to return the set of strongly relevant features,
plus a (potentially empty) subset of the weakly relevant features, excluding the
irrelevant features. Unfortunately these definitions do not lend themselves to
the construction of a feature selection algorithm, as they require checking expo-
nentially many combinations of features to ascertain weak relevancy. They also
are quite abstract quantities, and we will relate their notions of relevancy and
irrelevancy to concrete information theoretic quantities in Chapter 5.
We now introduce a small amount of notation with which to formally define the feature selection problem. Throughout this thesis we will use θ as a binary vector of length d, where d is the number of features; θ_i = 1 means the ith feature is selected and θ_i = 0 means it is ignored. We shall define ¬θ as the negation of the vector θ, thus denoting the unselected features. We formally define feature selection by stating that the mapping φ : X → Y uses a subset of the features, θ_rel, with the remaining features ¬θ_rel being redundant or irrelevant. Therefore φ : X → Y is equivalent to φ : X_{θ_rel} → Y, where X_{θ_rel} is the subset of X containing the features selected by θ_rel. The concept of sparsity in the statistics literature [10]
is related to the process of feature selection. A sparse solution to a prediction
problem means that few of the features are involved in the prediction, hence we
can think of feature selection as enforcing sparsity on the solution. The term
arises from matrices where a sparse matrix is one with few non-zero elements. In
our case we say θ is sparse if it has few set bits, thus the feature set selected by
θ is also sparse. The choice of search method is important in feature selection problems, as the space of possible feature subsets is the powerset of the feature set, and thus the number of possible feature sets is 2^d. Therefore examining all possible feature sets is intractable for any reasonably sized problem.
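To make the notation concrete, here is a short sketch (variable names our own) representing θ as a boolean mask over the columns of a data matrix:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                        # N = 100 samples, d = 6 features
theta = np.array([1, 0, 1, 0, 0, 1], dtype=bool)     # theta_i = 1 means feature i is selected

X_selected = X[:, theta]        # X_theta: the selected features
X_ignored  = X[:, ~theta]       # X_notheta: the unselected features

d = theta.size
print(X_selected.shape, X_ignored.shape, 2 ** d)     # 2^d possible feature subsets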
Feature selection algorithms are a subset of the more general feature extraction
algorithms. Feature extraction techniques produce a new feature set by processing
the original set. This can be by combining multiple features, or projecting the
feature set into a higher or lower dimensional space. Such techniques are useful
when the classification algorithm cannot understand the current representation,
e.g. we can see the kernel trick in SVM classifiers as a form of feature extraction
(see Section 2.1.2). In feature selection we extract features by returning a relevant
subset of the input feature set. For the remainder of this thesis we will focus on
feature selection algorithms, of which there are three main kinds: filters, wrappers
and embedded methods. Filters and wrappers are defined by two things, the
evaluation function or criterion which scores the utility of a feature or a feature
set (which we term J), and the search algorithm which generates new candidate
features or feature sets for evaluation. We show the general form of a filter
or wrapper algorithm in Algorithm 1. Embedded methods are feature selection
techniques which are integrated into a particular classification algorithm, and therefore separating out the evaluation function from the search procedure is more difficult.
We explore the main differences between the three approaches below.
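Algorithm 1 is not reproduced here, but its general shape is a search loop driven by a criterion J; the following sketch (hypothetical names, greedy forward search only) illustrates that form. A filter would supply a statistical score for J, while a wrapper would supply the validation accuracy of a classifier trained on the candidate set.

def greedy_forward_selection(score, features, k):
    """Generic filter/wrapper skeleton: `score(candidate, selected)` plays the role of J."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(f, selected))   # evaluate each candidate
        selected.append(best)                                      # keep the highest scoring one
        remaining.remove(best)
    return selected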
3.1.1 Filters
Filter approaches use a measure of relevancy or separation between the candidate
feature or feature set and the class label as the scoring function for that feature
or feature set. These measures range from simple correlation measures such as
Pearson’s Correlation Coefficient [85], through complex correlation measures such
as the mutual information (discussed in Section 2.3), to complex measures of inter
and intra class distances, used in the Relief algorithm [60]. All these measures
return a number which represents the strength of the relationship between the
candidate feature set and the class label. This relationship might be a measure of
information, or a measure of the separability of the classes based on that feature
set. Other common measures are based upon probabilistic independence, which is
is a topic we discuss in more detail in Section 3.5. Much of the early work in filter
feature selection focused on different kinds of distance or divergence metrics, both
probabilistic and based directly upon the data [30]. We call the scoring function
a feature selection criterion, and the study of the criteria based on information
theoretic measures is the topic of this thesis.
Filter algorithms can be further divided into those which measure univariate
relationships or multivariate relationships. Univariate measures only consider the
relationship between the candidate feature and the class label, ignoring any inter-
actions with the previously selected features. In contrast, multivariate methods
measure the relationship in the context of previously selected features. This can
either increase or decrease the strength of the relationship between the candidate
feature and the class label, but (in general) better represents how it would work
in a classifier. A more detailed review of scoring criteria is found in Duch [34]. In
this thesis we focus on a set of filter algorithms which use variations on informa-
tion theoretic measures (see Section 2.3 for a description of Information Theory)
to measure the interaction between features, and to measure the correlation with
the class label. We review the information theoretic filter literature in more detail
in Section 3.2.
The scoring criterion is coupled with a search method which describes how the candidate feature sets are selected (see Reunanen [93] for a review of search strategies for both filter and wrapper approaches). The complexity of the scoring criterion usually dictates the complexity of the search method, as univariate criteria will not benefit from complex search methods, due to their inability to consider
feature interactions. Many common filters (e.g. Relief [60] or mRMR [86]) use
greedy forward or backward searches, testing each feature in turn for inclusion
(exclusion) and adding (removing) the feature which scores the highest (lowest).
More complex searches include “floating” methods [91] which dynamically adjust the size of the selected feature set, adding features which improve the score and removing features whose removal does not decrease it. There
exist optimal search strategies based upon Branch & Bound methods [99] which
can exclude groups of features from consideration if they can never improve in
performance. However such complex search algorithms are unnecessary in certain
situations, notably in the case of Bayesian Networks where the Markov Blanket
can be recovered using a greedy search (see Section 3.6). For the remainder of
this thesis we will use greedy searches to test our scoring criteria, and leave the
investigation of more complex search methods to future work.
The choice of stopping criterion is also an important question in filter methods,
as there is no error measure to monitor the performance. In general either a fixed
number of features is selected or the search continues until the score measure
for all remaining unselected features drops below a threshold. Again as we are
interested in the scoring criterion itself we will leave the investigation of stopping
criteria to future work, and simply select a fixed number of features.
One benefit of filter methods is that, because they use abstract measures of correlation between variables, they return a feature set which should perform well across a wide range of classification algorithms, assuming the filter does not have drastically different assumptions to the classifier (a topic we investigate in Chapter 5). Filter methods are usually the fastest of the three kinds of feature selection technique mentioned here, as they do not require the training of a classification algorithm to score the features.
3.1.2 Wrappers
In contrast to filters, wrapper approaches (e.g. Kohavi & John [62]) use the
performance of a classification algorithm on a particular testing dataset as the
evaluation function. This means the feature set is closely tailored to the classifi-
cation algorithm used, and may not perform well if used with a different classifier.
The name comes from the feature selection process being “wrapped” around a
particular classification algorithm.
To evaluate the utility of a particular feature, it is included in the current
selected feature set and the performance of the feature set as a whole is tested
on the training data. This is a time intensive process as it involves training the
classifier, and then testing on all the datapoints. However it does fully capture
all the interactions between the features (that the classifier can use), and is thus
unlikely to select redundant features.
In general the classification algorithm used in the wrapper is the same as the classification algorithm which will form the final system, and so the main
issue with wrapper methods is the choice of search method and stopping criterion.
Much of what was mentioned in the previous section on filter methods is valid
for the choice of search method and stopping criterion with wrappers.
the LASSO regression algorithm [104] which drives the weights of irrelevant fea-
tures to zero in the process of constructing a linear model. We will look at Lasso
in more detail when we discuss feature selection algorithms which include prior
knowledge in Section 3.3. Like wrapper algorithms, embedded methods produce
feature sets which are closely tied to the classification algorithm used. Again this
may cause the feature sets to perform poorly when used with other classification
algorithms, or if the feature set is used for further analysis of the underlying
problem.
We have now reviewed the three main approaches to feature selection. Now
we will explore the literature surrounding information theoretic filter algorithms,
which are the topic of this thesis.
J_mim(X_k) = I(X_k; Y). (3.1)
We refer to this feature scoring criterion as ‘MIM’, standing for Mutual Infor-
mation Maximisation. This heuristic, which considers a score for each feature
independently of others, has been used many times in the literature; the first mention of such a scoring procedure is in Lewis (1962) [73], though it is not explicitly referred to as a mutual information. It reappears in more recent work
under different guises, e.g. Lewis (1992) [72]. This criterion is very simple, and
thus the choice of stopping condition in the search is more important than the
search algorithm itself. This is because it is a univariate measure, and so each
feature’s score is independent of the other selected features. If we wish to select
k features using MIM we will pick the top k features, ranked according to their
mutual information with the class. We could also select features until we had
reached a predefined threshold of mutual information or another condition, but
the choice of search would make little difference. We saw in Chapter 2 how the
mutual information can be used to construct both an upper and lower bound on
the Bayes error rate [39, 55], and these bounds form the justification for the use
of this simple criterion. Unfortunately the independence assumed by measuring
each feature’s score without considering the currently selected features is a se-
rious limitation of this approach. If all the features are independent then the
assumption holds, and MIM will select a useful feature set. However the assump-
tion of independent features is untrue in the vast majority of cases, and thus this
approach is usually suboptimal.
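A minimal sketch of MIM for discrete data, using a simple plug-in estimate of the mutual information (both helper names are of our own choosing):

import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in nats, for discrete-valued arrays x and y."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi

def mim(X, y, k):
    """MIM: rank features by I(X_i;Y) and return the indices of the top k."""
    scores = [mutual_information(X[:, i], y) for i in range(X.shape[1])]
    return list(np.argsort(scores)[::-1][:k])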
We saw in the previous section that a useful and parsimonious set of features
should not only be individually relevant, but also should not be redundant with
respect to each other — selected features should not be highly correlated to other
selected features. We note that while this statement is appealingly intuitive, it
is not strictly correct, as we shall see in the next chapter. In spite of this, several
criteria have been proposed that attempt to pursue this ‘relevancy-redundancy’
goal. For example, Battiti [6] presents the Mutual Information Feature Selection
(MIFS) criterion:
J_mifs(X_k) = I(X_k; Y) − β Σ_{X_j∈S} I(X_k; X_j), (3.2)
where S is the set of currently selected features. This includes the I(Xk ; Y )
term to ensure feature relevance, but introduces a penalty to enforce low corre-
lations with features already selected in S. MIFS was constructed (like many of
the criteria in this chapter) to use a simple sequential forward search, greedily
selecting the best feature in each iteration. The β in the MIFS criterion is a
configurable parameter, which must be set experimentally. Using β = 0 is equiv-
alent to Jmim (Xk ), selecting features independently, while a larger value will place
more emphasis on reducing inter-feature dependencies. In experiments, Battiti
found that β = 1 is often optimal, though with no strong theory to explain why.
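A single greedy forward step of MIFS might look as follows; this is a sketch reusing the mutual_information estimator from the MIM example above, and the function name and signature are ours.

def mifs_step(X, y, selected, candidates, beta=1.0):
    """Score each candidate by I(X_k;Y) - beta * sum_{j in S} I(X_k;X_j); return the best."""
    def j_mifs(k):
        relevance = mutual_information(X[:, k], y)
        redundancy = sum(mutual_information(X[:, k], X[:, j]) for j in selected)
        return relevance - beta * redundancy
    return max(candidates, key=j_mifs)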
The MIFS criterion focuses on reducing redundancy; an alternative approach was
proposed by Yang and Moody [110], and also later by Meyer et al. [80] using the Joint Mutual Information (JMI) criterion,
J_jmi(X_k) = Σ_{X_j∈S} I(X_k X_j; Y). (3.3)
This is the information between the target and a joint random variable Xk Xj ,
defined by pairing the candidate Xk with each feature previously selected. The
idea is if the candidate feature is ‘complementary’ with existing features, we
should include it.
The MIFS and JMI schemes were the first of many criteria that attempted to manage the relevance-redundancy trade-off with various heuristic terms; however, it is clear they have very different motivations. The criteria identified in the literature 1992-2011 are listed in Table 3.1. One of the more popular criteria which
trades off relevancy and redundancy is the Fast Correlation Based Filter (FCBF)
of Yu & Liu [112]. This criterion selects a feature provided its individual rele-
vancy I(Xi ; Y ) is greater than the redundancy between Xi and any of the selected
features Xj ∈ S, I(Xi ; Xj ). The standard approach to creating filter criteria is to
hand-design them, constructing a criterion from different terms where each term
deals with a different part of the selection problem. It is common to see a rele-
vancy term for an individual feature combined with some number of redundancy
terms between that feature and the currently selected feature set. This has led to
many different criteria, each of which aims to manage the relevancy-redundancy
trade-off in a different way. Several questions arise here: Which criterion should
we believe? What do they assume about the data? Are there other useful criteria,
as yet undiscovered? We explore these questions in Chapter 5 and the answers
provide the basis for much of the work in this thesis.
There has been some previous work aiming to answer some of these questions;
one of the first attempts to unify these varying criteria is due to Brown [12],
which viewed them as approximations to the full mutual information I(S; Y )
between the set of selected features S and the label Y . This can be expanded
into a series of terms based upon McGill’s Interaction Information [78], and many
of the criteria were found to follow a similar functional form which curtailed
the expansion at a particular point. Balagani and Phoha [5] investigate the
links between the mRMR, MIFS and CIFE criteria showing how each trades off
different information theoretic terms to produce a scoring criterion and how these
terms affect the selected feature set. We present a different view of all these
criteria in Chapter 5 where we investigate the links between them at a deeper
level, namely what assumptions are they making about the underlying probability
distribution p.
We now investigate an area with little relation to the information theoretic
algorithms we have discussed, namely that of feature selection incorporating prior
(domain) knowledge. In Chapter 6 we will see how to construct information
theoretic algorithms which naturally incorporate such knowledge.
E(D, w) = Σ_{i=1}^{N} (y^i − w^T x^i)^2 + λ Σ_{j=1}^{d} |w_j|. (3.4)
This is the error for a given training dataset D and weight vector w, where λ
is a parameter which controls the strength of the regularisation, and thus the
number of non-zero weights in w. There have been many extensions to the basic
framework, incorporating different kinds of knowledge in addition to a general
sparsity prior. Helleputte and Dupont [54] present a good example of this recent
work in informative priors for regularized linear models. They develop a prior
for an approximate zero-norm minimization, constraining some of the dimensions
according to knowledge gained from the biological literature.
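As an illustration (our choice of tooling, not one made in this thesis), scikit-learn's Lasso shows the embedded-selection effect directly: increasing the regularisation strength drives most weights to exactly zero, and the surviving non-zero weights define the selected feature set.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only features 0 and 3 actually influence y; the remaining eight are irrelevant.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

model = Lasso(alpha=0.1).fit(X, y)           # alpha plays the role of lambda in Eq (3.4)
selected = np.flatnonzero(model.coef_ != 0)  # indices with non-zero weight
print(model.coef_.round(2))
print(selected)                              # expected to pick out features 0 and 3 here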
Krupka et al. [65] define meta-features, where each feature is qualified by addi-
tional information, and a mapping is learnt from meta-features to some measure
of feature ‘quality’. Knowledge can then be transferred between tasks by learning
the feature-quality mapping for similar problems; however the challenge remains
to define a good quality measure and reliably learn the mapping.
issue with this technique is that it is unclear how to adapt it to multivariate rank-
ing algorithms like MIFS or JMI, where the score given to a particular feature is
dependent upon the previously selected features.
The FCCF algorithm proposed by Chidlovskii and Lecerf [21] approaches the
multi-class problem from a different perspective. It is based upon the FCBF
algorithm discussed in Section 3.2, but modified so the exclusion heuristic takes
into account the presence of multiple classes. In addition to the standard FCBF
exclusion heuristic, that the candidate feature Xi shares more information with a selected feature than with the class label, it must also have less class-specific information than that selected feature. Therefore the feature Xi is excluded iff
∃ y ∈ Y, Xj ∈ S s.t. SU (Y = y; Xj ) ≥ SU (Y = y; Xi ) and SU (Xi ; Xj ) ≥
SU (Y ; Xi ), where SU (X; Y ) is the symmetric uncertainty between X and Y (see
Section 2.3).
Chapelle and Keerthi [18] present a modification of L1 -SVM to extend it to
the multi-class case. L1 -SVM is a regularised version of the standard binary
SVM algorithm which like LASSO mentioned in the previous section performs
an L1 regularisation of the feature weight vector driving most of those weights to
zero, resulting in a sparse selected feature set. As SVMs are binary classification
algorithms the standard RFE and L1 methods select independent feature subsets
for each possible class. The proposed method shares the regularisation penalty
across all the individual SVM optimisation functions, thus ensuring that each
binary problem is properly penalised by the number of features in use.
The final algorithm we review is an explicit cost-sensitive feature selection al-
gorithm, which aims to select features which minimise the misclassification cost.
This approach is due to Robnik-Šikonja [95], and works with many univariate fea-
ture ranking procedures used in decision trees as splitting criteria. The proposed
technique adjusts the probabilities used by each split criterion, altering the probability
for each class based upon the expected risk of misclassification as follows,
ε_y = (1 / (1 − p(y))) Σ_{ŷ≠y}^{|Y|} p(ŷ) C(y, ŷ), (3.5)
p′(y) = p(y) ε_y / Σ_{ŷ}^{|Y|} p(ŷ) ε_ŷ. (3.6)
In the equations above C(y, ŷ) is the cost of misclassifying class y as ŷ, p(y) is the marginal probability of the label y, and ε_y is the normalised expected cost of misclassifying an example of class y. This approach can be seen as extending
the work of Elkan [38] to feature selection, as it effectively resamples the data
according to its expected misclassification cost. The new probability values can
then be incorporated into a function such as the mutual information to score the
features.
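A small sketch of this re-weighting, following Equations (3.5) and (3.6) above; the function and variable names, and the example cost matrix, are our own.

import numpy as np

def cost_adjusted_priors(p, C):
    """Re-weight class priors by expected misclassification cost (Eqs 3.5-3.6).

    p: marginal class probabilities p(y).
    C: cost matrix, C[y, y_hat] = cost of predicting y_hat when the truth is y.
    """
    p, C = np.asarray(p, float), np.asarray(C, float)
    k = len(p)
    # eps[y] = (1/(1-p(y))) * sum over y_hat != y of p(y_hat) C(y, y_hat)
    eps = np.array([(p * C[y]).sum() - p[y] * C[y, y] for y in range(k)]) / (1.0 - p)
    p_adj = p * eps
    return p_adj / p_adj.sum()

p = np.array([0.9, 0.1])                   # imbalanced priors
C = np.array([[0.0, 1.0],                  # misclassifying class 0 costs 1
              [5.0, 0.0]])                 # misclassifying class 1 costs 5
print(cost_adjusted_priors(p, C))          # class 1 is up-weighted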
We now move to a different but related section of the literature, that relating
to Bayesian Networks and specifically the learning of network structure.
Figure 3.1: A Bayesian network, with the target node shaded in red, and the
Markov Blanket of that node shaded in cyan.
• Symmetry: X ⊥⊥ Y |Z ⇔ Y ⊥⊥ X|Z.
The first algorithm which attempted to learn the Markov Blanket of a node was
strictly a feature selection approach, presented in Koller & Sahami [63]. It pro-
ceeded via backwards elimination from the full feature set, removing the feature
which had the smallest interaction with the class label, given its (approximate)
Markov Blanket. This approximate algorithm is closest to the CMIM algorithm
we discuss as part of Chapter 5, in that each feature is scored by its minimal
interaction when conditioned on the selected feature set. This algorithm does
not return the Markov Blanket of the target node (usually the class label), but it
was used as a baseline in the development of other structure learning algorithms.
The first true structure learning algorithm is the Grow-Shrink (GS) algorithm
by Margaritis and Thrun [77]. This adopted the two stage algorithm which is
common to many local learning algorithms, first growing the candidate Markov
Blanket by adding new features until all remaining unselected features are condi-
tionally independent of the target, then shrinking the candidate Markov Blanket
to remove any false positives which may have been selected. This local algorithm
for constructing Markov Blankets is then run over each node of the network, and
the resulting subgraphs are combined to construct a DAG for the entire network.
The choice of conditional independence test used in the algorithm is left unspec-
ified, as there are multiple ones which are suitable. The IAMB algorithm by
Tsamardinos and Aliferis [107] is a refinement of the GS algorithm. It uses a
different ordering heuristic to choose in what order to select and remove features,
compared to the GS algorithm. This heuristic simply selects the feature which
has the largest conditional mutual information, and removes the feature which
has the smallest conditional mutual information. They state that this heuristic
improves the runtime of the algorithm by prioritising features which most improve
the quality of the returned Markov Blanket. The algorithm for IAMB is given in
Algorithm 2, though the general structure is true for most of the local structure
learning algorithms we explore in this section. The f (X; Y |CMB) in Algorithm
2 represents the conditional independence test used, and is a parameter of the
algorithm.
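A minimal sketch of this grow/shrink pattern follows; the assumptions are ours: a conditional mutual information estimator cmi(data, X, Y, Z) stands in for the unspecified test f, with a simple threshold deciding independence.

def iamb(data, target, features, cmi, threshold=0.01):
    """IAMB-style Markov Blanket recovery: grow by largest conditional MI, then shrink."""
    cmb = []                                           # candidate Markov Blanket
    # Grow phase: repeatedly add the feature with the largest I(X; target | CMB).
    changed = True
    while changed:
        changed = False
        remaining = [f for f in features if f not in cmb]
        if remaining:
            best = max(remaining, key=lambda f: cmi(data, f, target, cmb))
            if cmi(data, best, target, cmb) > threshold:
                cmb.append(best)
                changed = True
    # Shrink phase: remove any feature now conditionally independent of the target.
    for f in list(cmb):
        rest = [g for g in cmb if g != f]
        if cmi(data, f, target, rest) <= threshold:
            cmb.remove(f)
    return cmb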
The authors of IAMB proceeded to publish several more data efficient variants
of the IAMB algorithm, which require fewer samples to accurately estimate the
necessary tests. The most common variants are the MMMB [106] and HITON [4]
algorithms. MMMB first finds the set of parents and children of a target node, by
finding the minimal set of nodes which makes each node most independent of the
target. This is similar to the heuristic proposed by Koller & Sahami for Markov
Blanket recovery in classification problems. Once the set of parents and children
have been found these are added to the candidate Markov Blanket. Then the
parents and children are found for each node in the candidate Markov Blanket,
and these are tested to see if they are spouse nodes of the original target. This
approach reduces the computation by only conditioning on the minimal set of
nodes (i.e. the set of parents and children of the target). The HITON algorithm is
essentially similar to the MMMB algorithm, but is tailored for prediction tasks. It
uses a final wrapper to prune the candidate Markov Blanket by removing features
which do not affect classification performance.
One further local structure learning algorithm is the new variant of GS devel-
oped by Margaritis [76]. This extends the GS framework beyond the learning of
Markov Blankets to the general case of learning Markov Boundaries. This is still
a set of variables which make a particular node probabilistically independent from
the rest, but it includes cases which cannot be expressed as Bayesian Networks,
such as in the TIED dataset [101], where there are multiple different Markov
Blankets (which is not possible in a true Bayesian Network). The algorithm is
extended to test sets of variables for dependence in the inclusion phase. This
makes the algorithm exponential rather than polynomial in the number of nodes,
which precludes its usage in practice. Margaritis thus proposes a randomised ver-
sion, which randomly samples sets of a fixed size, tests each set for dependence,
and adds the set with the largest measured dependence (similar to the heuristic
ordering added in the IAMB algorithm).
One of the first global structure learning algorithms was based upon the condi-
tional independence testing methodology, namely the PC algorithm by Spirtes
et al. [100]. This first constructs an undirected graph where the absence of a
link between two nodes indicates they are conditionally independent. After this
construction process each link is oriented (if possible) by analysing the graph
structure, as a Bayesian Network is an acyclic graph, so there are no directed
paths starting at one node and returning to that node. The PC algorithm is not
usually able to orient all the edges, and thus returns a partially directed graph
where some edges are left undirected. These partially directed graphs are usually
called “patterns” or “essential graphs”.
The PC algorithm is unusually reliant upon the quality of the independence
test used, as it uses the current graph state to construct the set of independence
tests needed for the next node. This causes errors to cascade through the graph
structure as one mistake produces an incorrect structure to base the next test
on, which in turn causes more failures [27]. This problem is partially solved by
the LGL family of algorithms developed by Aliferis et al. [2, 3] which learn the
local structure for each node independently before combining each local structure
to learn the full graph. A local learning algorithm such as an IAMB variant is
used to learn the set of parents and children for each node in turn. Those local
structures are then used to produce an undirected graph for the whole structure
(as it cannot differentiate between the parents and the children) before using
another algorithm to orient the edges. In the paper they explore the use of the
BDeu score to orient the edges, which is an example of the score and search
methods we review in the next section.
2. That the data generating parameters are both globally independent across
the network, and locally independent with respect to the state of the parent
set of a node.
3. That each node’s parameters only depend upon the parent set of that node.
The most popular member of this family is the BDe (Bayesian Dirichlet equiva-
lence) metric [53], which imposes an additional assumption, that equivalent net-
work structures should have the same score. This effectively replaces the Dirichlet
assumption, as the assumption of equivalence implies that all node parameters
are Dirichlet distributed. The BDe score for any given graph G given a training
dataset D is
p(D, G) = p(G) ∏_{i=1}^{d} ∏_{j=1}^{Π_i} [ Γ(N′_ij) / Γ(N′_ij + N_ij) ] ∏_{k=1}^{r_i} [ Γ(N′_ijk + N_ijk) / Γ(N′_ijk) ], (3.7)
where Π_i is the number of configurations of the parent set of node i, r_i is the number of states of node i, N_ijk is the number of datapoints in D in which node i takes its kth state with its parents in their jth configuration, N_ij = Σ_k N_ijk, and N′_ijk, N′_ij are the corresponding prior (pseudo-count) parameters.
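For illustration, the contribution of a single node (one "family") to the logarithm of Equation (3.7) can be computed as below; the counts layout and function name are our own, and SciPy is assumed to be available.

import numpy as np
from scipy.special import gammaln

def log_bd_family_score(counts, prior_counts):
    """Log of prod_j [Gamma(N'_ij)/Gamma(N'_ij+N_ij)] prod_k [Gamma(N'_ijk+N_ijk)/Gamma(N'_ijk)]
    for one node; `counts` and `prior_counts` are (parent configurations x node states) arrays."""
    N, Np = np.asarray(counts, float), np.asarray(prior_counts, float)
    N_ij, Np_ij = N.sum(axis=1), Np.sum(axis=1)
    return (np.sum(gammaln(Np_ij) - gammaln(Np_ij + N_ij))
            + np.sum(gammaln(Np + N) - gammaln(Np)))

# Example: a binary node with one binary parent and a uniform pseudo-count prior.
counts = np.array([[12.0, 3.0],
                   [ 2.0, 8.0]])
prior = np.ones_like(counts)
print(log_bd_family_score(counts, prior))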
Mukherjee and Speed [82] construct different kinds of structure priors for use
with the BDe score. This allows the prior knowledge to constrain abstract graph
features such as the number of parents of a node, or the connectivity between
two groups of nodes. They couple the BDe score and the structure priors with an
MCMC based search. The search aims to sample from the distribution p(G|D),
so the probability of various graph features can be calculated, along with the
most probable graph (i.e. the MAP solution). The proposal distribution for the
Metropolis-Hastings sampler used is based on the neighbourhood of the graph,
only allowing proposals which are within the neighbourhood of the current graph.
This proposal distribution can optionally be modified to incorporate the prior
distributions over graph features. This in turn means that graphs which are a priori more likely are proposed more often.
Grossman and Domingos [45] present a different take on the score and search
paradigm, as they wish to learn the structure of a Bayesian Network to construct a
classifier for a particular node. This can be seen as generalising the construction of
a Naı̈ve Bayes classifier (see Chapter 2) to arbitrary network structures around the
class node Y . They do this by aiming to maximise the conditional log-likelihood
of the structure, inferring the individual node parameters from their maximum
likelihood estimates. This takes a discriminative approach to the learning of
structure which is a perspective we will examine more closely in the next chapter.
Each network structure is scored by the likelihood, L, of Y given the inputs X
as follows,
L(G|D) = Σ_{i=1}^{N} log p(y^i | x^i_1, . . . , x^i_d). (3.8)
This scoring function is coupled with the simple hillclimbing search used in BDe,
except it starts from the empty network which has no arcs. Unfortunately this
scoring function requires the estimation of the complex distribution p(y|x), which
requires a relatively large amount of data to calculate when there are many nodes
connected to Y . Carvalho et al. [16] present a new scoring function approximat-
ing the conditional log-likelihood which decomposes over the network structure,
allowing it to be calculated on a per node basis (thus reducing the complexity of
each estimated distribution). This function is an approximate factorised condi-
tional log-likelihood, and can be expressed as the sum of information theoretic
components plus a fixed constant which does not depend upon the network,
f̂L(G|D) = Σ_{i=1}^{d} [ (α + β) N I(X_i; Π(X_i)|Y) + βλN I({Y; X_i; Π(X_i)}) ] + const. (3.9)
4 & 6, though our results provide a more general framework, and are applicable
to the general feature selection problem.
One final relevant score and search approach is that of de Campos [28]. This
paper describes a scoring function which is based upon measures of conditional
independence as calculated by the mutual information. Each node is scored by
the mutual information with its current parent set, penalised by a function of the χ² value expected if the parents are independent of the node. This leads to
a scoring function which is not a probability or log-probability unlike the other
measures we have discussed and is expressed as,
f(G, D) = Σ_{i=1}^{d} [ 2N I(X_i; Π(X_i)) − max_{σ_i} Σ_{j=1}^{|Π(X_i)|} χ_{α, l_{i,σ_i}} ]. (3.10)
In the above equation α is the confidence level of the χ² test, |Π(Xi)| is the
number of parents of Xi , σi are the possible permutations of the parent set, and
li,σi is the degrees of freedom, based upon the number of possible states for each
permutation. Again this scoring function can be decomposed over the nodes, and
can be interpreted under the Minimum Description Length (MDL) framework
[94] as the decrease in description length between the candidate graph G and the
empty graph.
The twin approaches of conditional independence testing and score & search
initially appear to be very different, as the CB methods add links based upon
the value of an independence test, and S&S measures the change in global score
as the links are added, removed or inverted. In fact in the global case instances
of these two approaches are equivalent. Cowell [25] showed that for a given node
ordering with complete data, independence testing based upon the KL-Divergence
is identical to scoring networks by their log-likelihood. In Chapter 6 we present a
similar result for local structure learning with mutual information, showing that
this is exactly equivalent to hill-climbing the joint likelihood of the local structure
(which is a greedy search on the neighbourhood of the local structure graph).
• We looked at feature selection, and the three main paradigms: filters, which
select features based upon a statistical measure of correlation; wrappers,
which select features based upon the performance of a classification algo-
rithm; and embedded methods, a wide group of algorithms which select
features as part of the classification process.
In the next chapter we present the central result of this thesis, a deriva-
tion of mutual information based feature selection criteria from a clearly defined
loss function, namely the joint likelihood of a discriminative model. We derive
update rules for this loss function which allow us to iteratively maximise the
joint likelihood by selecting the feature which maximises the conditional mutual
information. In Chapter 5 we use this probabilistic framework to understand
the different approximations and assumptions made by the information theoretic
heuristics we have reviewed in this chapter. In Chapter 6 we link our probabilistic
framework to the problem of Markov Blanket discovery, showing in the process
that some common local structure learning algorithms are actually maximising
the joint likelihood under a flat prior over network structure.
Chapter 4
Deriving a Selection Criterion
In Chapter 3 we briefly outlined the three main classes of feature selection algo-
rithms: filters, wrappers, and embedded methods. We saw how many of the filter
techniques are heuristic, combining different kinds of correlation terms without
any understanding of the objective function they were optimising. We now recast
the feature selection problem as a constrained optimisation problem in a high
dimensional space, so our task is to find a d-dimensional bit vector θ with at
most k bits set, representing our selected features. We should then choose θ to
optimise some evaluation criterion. There has been much attention given to this
problem of search in high dimensional spaces (see Reunanen [93] for a review of
search strategies in feature selection), and a good search technique is an integral
part of any feature selection algorithm. There has been a similarly large amount
of attention given to the construction of evaluation criteria, but many of these
criteria are based upon heuristics, with little investigation into the underlying
metrics which they are attempting to optimise. We will focus on the derivation
of these evaluation criteria, and leave the question of constructing a search al-
gorithm which best optimises our criteria for future work. In particular we will
derive the optimal evaluation criterion for a particular loss function which we
wish to optimise.
this thesis. In Chapter 5 we will investigate the links between our formal frame-
work based upon the likelihood and the literature of feature selection heuristics
based upon mutual information. In Chapter 6 we will look at what benefits our
likelihood framework gives when we wish to extend these criteria to incorporate
prior information. In Chapter 7 we will use a similar derivation to construct
cost-sensitive feature selection criteria.
Figure 4.1: The graphical model for the likelihood specified in Equation (4.1).
L(D, θ, τ, λ) = p(θ, τ) p(λ) ∏_{i=1}^{N} q(y^i | x^i, θ, τ) q(x^i | λ). (4.1)
This is the joint likelihood for the graphical model specified in Figure 4.1. In dis-
criminative models we wish to maximise our classification performance, therefore
we maximize L with respect to θ (our feature selection parameters) and τ (our
model parameters). We therefore ignore the generative parameters λ as they do
not directly influence the classification performance. Excluding the generative
terms gives
L(D, θ, τ, λ) ∝ p(θ, τ) ∏_{i=1}^{N} q(y^i | x^i, θ, τ). (4.2)
We wish to find the Maximum a Posteriori (MAP) solution, the mode of the
distribution L, with respect to the parameters {θ, τ }. This means we will find
a single feature set which maximises our likelihood. We leave a fully Bayesian
treatment of the feature selection problem, where we calculate a probability dis-
tribution over possible feature sets, to future work.
We choose to work with the scaled negative log-likelihood, −ℓ, converting our maximization problem into a minimization problem, without changing the position of the optima. This gives
−ℓ = −(1/N) { Σ_{i=1}^{N} log q(y^i | x^i, θ, τ) + log p(θ, τ) } (4.3)
which is the function we will minimize with respect to {θ, τ }; the scaling term is
to simplify exposition later.
These terms are: the log-likelihood ratio between the true model and our predic-
tive model, the log-likelihood of the true model, and the prior term. The first
term is small when our predictive model is a good approximation to the true
distribution. We can see that the middle term is a finite sample approximation
to the conditional entropy H(Y |X) as follows,
H(Y|X) = − Σ_{x∈X, y∈Y} p(x, y) log p(y|x) ≈ −(1/N) Σ_{i=1}^{N} log p(y^i | x^i). (4.5)
This represents the total amount of uncertainty there is about the class label
given the data. The conditional entropy is large when the features we have do
not constrain the class label well, i.e. it is hard to accurately predict the label
from the features. The conditional entropy is the (negative, scaled) log-likelihood of the true model in the limit of infinite data points, and thus provides a lower bound on our
performance, as our model cannot perform better than the data allows.
We are concerned with separating out the influence of feature selection and classification in our model, and thus introduce an extra ratio p(y^i | x^i, θ) / p(y^i | x^i, θ) (a ratio equal to one) into the
first term. This is the probability of the correct class given the features we have
selected with θ, and thus represents how useful a set of features we have selected.
We then bring the minus sign inside the brackets, and flip the ratios inside the
logarithms, which gives
−ℓ = (1/N) { Σ_{i=1}^{N} log [ p(y^i | x^i, θ) / q(y^i | x^i, θ, τ) ] + Σ_{i=1}^{N} log [ p(y^i | x^i) / p(y^i | x^i, θ) ]
− Σ_{i=1}^{N} log p(y^i | x^i) − log p(θ, τ) } . (4.7)
As before the last two terms are the log likelihood of the true model and the
prior term. We have now separated out the first log likelihood ratio into two
terms. The first term is the log ratio between our predictive model and the true
distribution of the labels given our selected subset of features. This represents
how well our model fits the data given the current set of features. When the ratio
is 1, the log ratio is 0 and our model has the best possible fit given the features
selected. The second term is the log ratio between the true distribution given
the selected features, and the true distribution of the labels given all the data.
This measures the quality of the selected feature set θ, based on how close the
conditional distribution of y is to the one conditioned on all the data. We can see
that this term is a finite sample approximation to the expected KL-Divergence
between p(y|x) and p(y|x, θ) as follows
E_{x,y}{ p(y|x) || p(y|x, θ) } = Σ_{x∈X, y∈Y} p(x, y) log [ p(y|x) / p(y|x, θ) ] ≈ (1/N) Σ_{i=1}^{N} log [ p(y^i | x^i) / p(y^i | x^i, θ) ]. (4.8)
This KL-Divergence has appeared before in the feature selection literature. Koller
and Sahami [63] introduce this divergence as a sensible objective for a feature se-
lection algorithm to minimise. With our expansion we show it to be a direct
consequence of optimising a discriminative model likelihood, though their ap-
proach ignores the prior over the features. As x = {xθ , x¬θ }, we can further
This is the conditional mutual information between the class label and the re-
maining features, given the selected features. We can now write the negative
log-likelihood as the sum of information theoretic quantities plus the prior over
{θ, τ },
−ℓ ≈ E_{x,y}{ log [ p(y|x, θ) / q(y|x, θ, τ) ] } + I(X_¬θ; Y | X_θ) + H(Y|X) − (1/N) log p(θ, τ). (4.10)
Assuming for the moment that we have the optimal feature set or a superset
thereof (i.e. X ∗ ⊆ Xθ ) then p(y|x, θ) = p(y|x). Then as the expectation in the
first term is over p(y, x), the first term can be seen as a finite sample approxima-
tion to the expected KL-Divergence over p(x) representing how well the predictive
model fits the true distribution, given a superset of the optimal feature set.
The first term is a measure of the difference between the predictive model q,
and the true distribution p given the selected features. When a superset of the
optimal feature set has been found, it becomes the KL-Divergence between p and
q. The second term is I(X¬θ ; Y |Xθ ), the conditional mutual information between
the class label and the unselected features, given the selected features. The size of
this term depends solely on the choice of features, and will decrease as the selected
feature set Xθ explains more about Y , until eventually becoming zero when the
remaining features X¬θ contain no additional information about Y in the context
of Xθ . Note that due to the chain rule, I(AB; Y ) = I(A; Y ) + I(B; Y |A), and
X = Xθ ∪ X¬θ ,
I(X; Y ) = I(Xθ ; Y ) + I(X¬θ ; Y |Xθ ). (4.11)
Since I(X; Y) is fixed by the data, minimising I(X_¬θ; Y | X_θ) is equivalent to maximising I(X_θ; Y), which is the goal of many mutual information based filter algorithms.
The third term in Eq (4.10) is H(Y |X), the conditional entropy of the labels
given all features; this is an irreducible constant, dependent only on the dataset
D. As mentioned previously this term represents the quality of the data, and
how predictive X is of Y . When this value is small the data constrains the choice
of Y, and thus it has a low Bayes error rate, whereas when this value is large the data does not constrain the choice of Y, and the Bayes error rate is higher.
This expansion makes explicit the effect of the feature selection parameters
θ, separating them from the effect of the parameters τ in the model that uses
those features. If we somehow had the optimal feature subset θ ∗ , which perfectly
captured the underlying process p, then I(X¬θ ; Y |Xθ ) would be zero. The re-
maining (reducible) error is then down to the KL divergence p||q, expressing how
well the predictive model q can make use of the provided features. Of course,
different models q will have different predictive ability: a good feature subset will
not necessarily be put to good use if the model is too simple to express the under-
lying function. This perspective was also considered by Tsamardinos and Aliferis
[107], and earlier by Kohavi and John [62] — the above results place these in
the context of a precise objective function, the joint likelihood of a discriminative
model.
We now make an assumption made implicitly by all filter methods, that model
fitting can be separated from the feature selection process. For completeness we
formalise this assumption as:
This optimisation problem relies upon the prior to ensure there is a unique global
optimum, as with a flat prior any superset of an optimal feature set is also optimal.
In the next section we will tighten our definition of the feature selection task to
remove this problem.
We note that in contrast to filter algorithms, wrapper and embedded ap-
proaches optimise jointly over the first two terms in Equation (4.10), by fitting
the classification model together with the feature selection.
We now turn to the problem of how to optimise the parameter θ, i.e. how do
we choose a feature. We show how the commonly used simple greedy searches are
iterative maximisers of the likelihood, and under what conditions such a search
returns the optimal solution.
Under the filter assumption in Definition 8, Equation (4.12) specifies the optimal
feature set, in the sense of maximising the joint likelihood. However there may of
course be multiple global optima giving multiple optimal feature sets, in addition to the trivial optimum of selecting all features. In fact, due to the nature of the mutual information, any feature set which contains the optimal feature set will be a global optimum of the likelihood. As we wish to perform feature selection,
we express a preference for the smallest such feature set which maximises the
likelihood.
With this in mind, we can introduce a minimality constraint on the size of
This is the smallest feature set Xθ , such that the mutual information I(X¬θ ; Y |Xθ )
plus the prior is minimal, and thus the joint likelihood is maximal. It should be
remembered that the likelihood is only our proxy for classification error, and the
minimal feature set in terms of classification could be smaller than that which
optimises likelihood.
A common heuristic approach is a sequential search considering features one-
by-one for addition/removal; this is used for example in Markov Blanket learning
algorithms such as IAMB [108] (we will return to the IAMB algorithm in Chap-
ter 6). We will now demonstrate that this sequential search heuristic is in fact
equivalent to a greedy iterative optimisation of Equation (4.13). First we derive
the appropriate update rules for an iterative optimisation of Equation (4.13) and
then in Section 5.1 we show how different assumptions coupled with these greedy
update rules generates many of the different criteria in the literature.
We first introduce extra notation, θ^t and θ^{t+1}, denoting the selected feature set at timesteps t and t+1. We use J_t to denote our objective function (Equation 4.12) at timestep t. We define a simple greedy sequential search, in which only one feature is added/removed at each timestep, so there is exactly one bit different between θ^t and θ^{t+1}. The flipped bit we denote as θ_k.
We define a greedy forward step as the selection of the feature, Xk , which
most improves our selected feature set θ t . Therefore:
We now derive the update for a forward search which optimises the likelihood.
J_{t+1} = I(X_{¬θ^{t+1}}; Y | X_{θ^{t+1}}) − (1/N) log p(θ^{t+1}) (4.16)
We wish to add the feature Xk that minimizes Jt+1 , and thus maximizes the
difference Jt − Jt+1 ,
J_t − J_{t+1} = I(X_{¬θ^t}; Y | X_{θ^t}) − (1/N) log p(θ^t) − I(X_{¬θ^{t+1}}; Y | X_{θ^{t+1}}) + (1/N) log p(θ^{t+1}). (4.17)
A subtle (but important) implementation point for this selection heuristic is that
it should not add another feature if
∀X_k, I(X_k; Y | X_θ) + (1/N) log [ p(θ^{t+1}) / p(θ^t) ] ≤ 0. (4.19)
This ensures we will not unnecessarily increase the size of the feature set.
We define a greedy backwards step along similar lines. We remove the feature
We now derive the update for a backward search which optimises the likelihood.
J_{t+1} = I(X_{¬θ^{t+1}}; Y | X_{θ^{t+1}}) − (1/N) log p(θ^{t+1}) (4.22)
We wish to remove the feature Xk that minimises Jt+1, and thus maximises the difference Jt − Jt+1,
X_k^* = arg max_{X_k ∈ X_{θ^t}} ( J_t − J_{t+1} )
= arg min_{X_k ∈ X_{θ^t}} [ I(X_k; Y | X_{θ^t} \ X_k) + (1/N) log ( p(θ^t) / p(θ^{t+1}) ) ].
To strictly achieve our optimization goal, a backward step should only remove a
feature if
I(X_k; Y | X_{θ^t} \ X_k) + (1/N) log [ p(θ^t) / p(θ^{t+1}) ] ≤ 0. (4.24)
With an uninformative prior p(θ) ∝ 1, the prior ratio in both of the updates cancels, and we recover the maximum likelihood estimate of the optimal feature set, with the forward iterative minimization update becoming
X_k^* = arg max_{X_k ∈ X_{¬θ^t}} I(X_k; Y | X_{θ^t}). (4.25)
The prior ratio similarly disappears from the backward iterative update, resulting
in
X_k^* = arg min_{X_k ∈ X_{θ^t}} I(X_k; Y | X_{θ^t} \ X_k). (4.26)
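In code, the two maximum likelihood updates are a single line each; as before, cmi(data, X, Y, Z) is an assumed conditional mutual information estimator and the function names are ours.

def forward_step(data, target, selected, unselected, cmi):
    """Eq (4.25): add the unselected feature with the largest I(X_k; Y | X_theta)."""
    return max(unselected, key=lambda k: cmi(data, k, target, selected))

def backward_step(data, target, selected, cmi):
    """Eq (4.26): remove the selected feature with the smallest I(X_k; Y | X_theta with X_k removed)."""
    return min(selected, key=lambda k: cmi(data, k, target, [j for j in selected if j != k]))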
These two updates look very familiar in the context of the different criteria we
reviewed in the previous chapter. The next chapter explores the links between
optimising the model likelihood and the information theoretic feature selection
literature, showing many common techniques to be approximate maximisers of
the model likelihood. We return to the full model including priors over the feature
selection parameters in Chapter 6.
or removing features. We now take the insights from our probabilistic model
and apply them to the literature surrounding information theoretic feature selec-
tion, linking our iterative maximisers of the likelihood to feature selection criteria
found in the literature.
Chapter 5
Unifying Information Theoretic Filters
that model by assessing features based on a simple scoring criterion on the utility
of including a feature Xk ∈ X¬θ . We now consider how we can relate various
criteria which have appeared in the literature to our framework. None of these
criteria include the prior term, so we assume a flat prior with p(θ) ∝ 1, which
gives the maximum likelihood updates in Equations (4.25) and (4.26). We will
consider the use of priors with these criteria in the next chapter. From the previous chapter we note that the ML scoring criterion for a feature Xk is,
J_cmi(X_k) = I(X_k; Y | S), (5.1)
where cmi stands for conditional mutual information, and for notational brevity
we now use S = Xθ for the currently selected set. We wish to investigate how
Equation (5.1) relates to existing heuristics in the literature, such as the MIFS cri-
terion we discussed in Chapter 3. Repeating the definition of the MIFS criterion
for clarity,
J_mifs(X_k) = I(X_k; Y) − β Σ_{X_j∈S} I(X_k; X_j). (5.2)
We can see that we first need to rearrange Equation (5.1) into the form of a simple
relevancy term between Xk and Y , plus some additional terms, before we can
compare it to MIFS. Using the identity I(A; B|C)−I(A; B) = I(A; C|B)−I(A; C)
(a variant of the chain rule of mutual information), we can re-express Equation
(5.1) as,
J_cmi(X_k) = I(X_k; Y) − I(X_k; S) + I(X_k; S|Y). (5.3)
The sign of the final term is important: the class-conditional redundancy I(X_k; S|Y) works against the redundancy I(X_k; S), so correlated features can be useful, provided the correlation within classes is stronger than the
overall correlation. We note that this is a similar observation to that of Guyon
[50], that “correlation does not imply redundancy” — Equation (5.3) embodies
this statement in information theoretic terms.
The sum of the last two terms in Equation (5.3) represents the three-way
interaction between the existing feature set S, the target Y , and the candidate
feature Xk being considered for inclusion in S. This is known as the Interac-
tion Information [78], I({X, Y, Z}), which measures dependencies within a set of
variables. To further understand this, we can note the following property:
I(X_k S; Y) = I(X_k; Y) + I(S; Y) + { I(X_k; S|Y) − I(X_k; S) }.
We see that if I(Xk ; S) > I(Xk ; S|Y ), then the total utility when including Xk ,
that is I(Xk S; Y ), is less than the sum of the individual relevancies I(S; Y ) +
I(Xk ; Y ). This can be interpreted as Xk having unnecessary duplicated informa-
tion, which means the total information is less than the sum of the parts, hence we
call this a negative interaction. In the opposite case, when I(Xk ; S) < I(Xk ; S|Y ),
then Xk and S combine well and provide more information together than by the
Figure 5.1: Figure 5.1a shows the scatter plot between X1 and X2 . Figure 5.1b
shows the scatter plot between X1 and X2 when Y = 1. Figure 5.1c shows the
scatter plot between X1 and X2 when Y = 2.
sum of their parts, I(S; Y ), and I(Xk ; Y ), hence we call this a positive interaction.
The important point to take away from this expression is that the terms are
in a trade-off — we do not require a feature with low redundancy for its own
sake, but instead require a feature that best trades off the three terms so as to
maximise the score overall. Much like the bias-variance dilemma [35], attempting
to decrease one term is likely to increase another.
We will investigate what assumptions about the underlying distribution p(x, y)
are sufficient to derive MIFS (Equation 5.2) from the optimal maximum likelihood
criterion (Equation 5.1). We begin by writing the latter two terms of Equation
(5.3) as entropies:
This states that the selected features Xθ are independent and class-conditionally
independent given the unselected feature Xk under consideration.
J_cmi(X_k) = I(X_k; Y) − H(S) + Σ_{j∈S} H(X_j|X_k) + H(S|Y) − Σ_{j∈S} H(X_j|X_k Y). (5.9)
Expanding each conditional entropy, using H(X_j|X_k) = H(X_j) − I(X_j; X_k) and H(X_j|X_k Y) = H(X_j|Y) − I(X_j; X_k|Y), gives
J_cmi(X_k) = I(X_k; Y) − Σ_{j∈S} I(X_j; X_k) + Σ_{j∈S} H(X_j) − H(S) + Σ_{j∈S} I(X_j; X_k|Y) − Σ_{j∈S} H(X_j|Y) + H(S|Y). (5.10)
Several of the terms in (5.10) are constant with respect to Xk — therefore re-
moving them will have no effect on the choice of feature as our goal is to find the
highest scoring feature not the score itself. Removing these terms, we have an
equivalent criterion,
J_cmi(X_k) = I(X_k; Y) − Σ_{j∈S} I(X_j; X_k) + Σ_{j∈S} I(X_j; X_k|Y). (5.11)
This has in fact already appeared in the literature as a filter criterion, orig-
inally proposed by Lin et al. [74], as Conditional Infomax Feature Extraction
(CIFE), though it has been repeatedly rediscovered by other authors [36, 48]. It
is particularly interesting as it represents a sort of ‘root’ criterion, from which
several others can be derived. For example, the link to MIFS can be seen with
one further assumption, that the features are pairwise class-conditionally inde-
pendent.
With this assumption, the term I(Xj ; Xk |Y ) will be zero, and Equation (5.11)
becomes Equation (5.2), the MIFS criterion, with β = 1. The β parameter in
MIFS can be interpreted as encoding a strength of belief in another assumption,
that of unconditional independence.
A β close to zero implies very strong belief in the independence statement, indi-
cating that any measured association I(Xj ; Xk ) is in fact spurious, possibly due
to noise in the data. A β value closer to 1 implies a lesser belief, that any mea-
sured dependency I(Xj ; Xk ) should be incorporated into the feature score exactly
as observed. Since the MIM criterion is produced by setting β = 0, we can see
that MIM also adopts Assumption 3. The same line of reasoning can be applied
to a very similar (and very popular) criterion proposed by Peng et al. [86], the
Minimum-Redundancy Maximum-Relevance criterion,
J_mrmr(X_k) = I(X_k; Y) − (1/|S|) Σ_{j∈S} I(X_k; X_j). (5.14)
Since mRMR omits the conditional redundancy term entirely, it is implicitly using
Assumption 2. The β coefficient has been set inversely proportional to the size of
the current feature set. If we have a large set S, then β will be extremely small.
The interpretation is then that as the set S grows, mRMR adopts a stronger
belief in Assumption 3. In the original paper, [86, section 2.3], it is claimed that
mRMR is a first order approximation to Equation (5.1). By making explicit the
intrinsic assumptions of the criterion we have clearly illustrated that this claim
is incorrect as it does not include a conditional redundancy term, and in fact the
mRMR criterion does not allow for any positive interactions between features.
Using some relatively simple manipulations (see Brown [12]) this can be re-written
as,
J_jmi(X_k) = I(X_k; Y) − (1/|S|) Σ_{j∈S} [ I(X_k; X_j) − I(X_k; X_j|Y) ].
The criterion (5.16) returns exactly the same set of features as the JMI criterion
(5.15); however in this form, we can see the relation to our proposed framework.
The JMI criterion, like mRMR, has a stronger belief in the pairwise independence
assumptions as the feature set S grows. Similarities can of course be observed
between JMI, MIFS and mRMR — the differences being the scaling factor and
the conditional term — and their subsequent relation to Equation (5.11). It
is in fact possible to identify numerous criteria from the literature that can all
be re-written into a common form, corresponding to different assumptions made
upon Equation (5.11). A space of potential criteria can be imagined, where we
parametrise criterion (5.11) as:
J_cmi = I(X_k; Y) − β Σ_{j∈S} I(X_j; X_k) + γ Σ_{j∈S} I(X_j; X_k|Y). (5.16)
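The whole space can be explored with one function; the sketch below reuses the mutual_information estimator from the earlier MIM example and adds a simple stratified estimate of the conditional term (the names and estimator choices are ours).

import numpy as np

def conditional_mi(x, z, cond):
    """I(X;Z|C) for discrete variables, as sum over values c of p(c) I(X;Z | C=c)."""
    cond = np.asarray(cond)
    return sum(np.mean(cond == v) * mutual_information(x[cond == v], z[cond == v])
               for v in np.unique(cond))

def unified_score(X, y, k, S, beta, gamma):
    """Eq (5.16). beta=gamma=0 gives MIM; gamma=0 with beta in (0,1] gives MIFS;
    beta=1/|S|, gamma=0 gives mRMR; beta=gamma=1/|S| gives the JMI form; beta=gamma=1 gives CIFE."""
    relevance = mutual_information(X[:, k], y)
    redundancy = sum(mutual_information(X[:, j], X[:, k]) for j in S)
    cond_red = sum(conditional_mi(X[:, j], X[:, k], y) for j in S)
    return relevance - beta * redundancy + gamma * cond_red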
Figure 5.2 shows how the criteria we have discussed so far can all be fitted
inside this unit square corresponding to β/γ parameters. MIFS sits on the left
hand axis of the square — with γ = 0 and β ∈ [0, 1]. The MIM criterion, Equa-
tion (3.1), which simply assesses each feature individually without any regard of
others, sits at the bottom left, with γ = 0, β = 0. The top right of the square
corresponds to γ = 1, β = 1, which is the CIFE criterion [74], also suggested by
El Akadi et al. [36] and Guo and Nixon [48]. A very similar criterion, using an
assumption to approximate the terms, was proposed by Cheng et al. [20].
The JMI and mRMR criteria are different from CIFE and MIFS in that they
move linearly within the space as the feature set S grows. As the size of the
set S increases they move closer towards the origin and the MIM criterion. The
particularly interesting point about this property is that the relative magnitude
of the relevancy term to the redundancy terms stays approximately constant as
S grows, whereas with MIFS, the redundancy term will in general be |S| times
bigger than the relevancy term. We explore the practical consequences of this in
Section 5.2 where we see it plays an important role in explaining the experimental
results. Any criterion expressible in the unit square has made independence
Assumption 1. In addition, any criteria that sit at points other than β = 1, γ = 1
have adopted varying degrees of belief in Assumptions 2 and 3.
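To make this parametrisation concrete, the sketch below scores a candidate feature with Equation (5.16) from discrete data using plug-in estimates. It is a minimal illustration rather than the thesis software: the function names (mi, cmi, j_cmi) and the array layout are assumptions introduced here.

```python
import numpy as np

def mi(x, y):
    """Plug-in estimate of I(X;Y) in nats for two discrete 1-D arrays."""
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1)
    joint /= joint.sum()
    px, py = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

def cmi(x, y, z):
    """Plug-in estimate of I(X;Y|Z) = sum_z p(z) I(X;Y | Z=z)."""
    x, y, z = map(np.asarray, (x, y, z))
    return sum((z == v).mean() * mi(x[z == v], y[z == v]) for v in np.unique(z))

def j_cmi(k, selected, X, y, beta, gamma):
    """Score feature k against the already selected set, as in Equation (5.16):
    J(X_k) = I(X_k;Y) - beta*sum_j I(X_j;X_k) + gamma*sum_j I(X_j;X_k|Y).

    Special cases: MIM (beta=gamma=0), MIFS (gamma=0), CIFE (beta=gamma=1),
    mRMR (beta=1/|S|, gamma=0) and JMI (beta=gamma=1/|S|)."""
    score = mi(X[:, k], y)
    for j in selected:
        score -= beta * mi(X[:, j], X[:, k])
        score += gamma * cmi(X[:, j], X[:, k], y)
    return score
```

A greedy forward search would simply call j_cmi for every unselected feature at each step and add the highest scorer.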
A further interesting point about this square is that it is sparsely populated; an obvious unexplored region is the bottom right corner, corresponding to β = 0, γ = 1. Though there is no clear intuitive justification for this point, for completeness we will evaluate it in the experimental section as the conditional redundancy or ‘condred’ criterion.
A paper by Brown [12] explores this unit square, though from a different
perspective gained by expanding the multivariate mutual information. In fact
much of this thesis arises from extensive consideration of the results in that
paper, though the approach presented here is quite different. Our probabilistic
perspective, deriving our selection criteria from the joint likelihood of a specific
model allows the precise specification of the underlying assumptions required to
produce different criteria from Equation (5.16). Further (unpublished) work by
Brown [13] has shown that the space of criteria with fixed β and γ is not in general
competitive with the criteria which move through the space, such as mRMR or
JMI.
All the criteria in Figure 5.2 are linear, as they take linear combinations of
the relevance/redundancy terms. One interesting perspective is to look at the
criteria not as points but as paths through this space.
Figure 5.2: The full space of linear filter criteria, describing several examples from
Table 3.1. Note that all criteria in this space adopt Assumption 1. Additionally,
the γ and β axes represent the criteria belief in Assumptions 2 and 3, respectively.
The left hand axis is where the mRMR and MIFS algorithms sit. The bottom left
corner, MIM, is the assumption of completely independent features, using just
marginal mutual information. Note that some criteria are equivalent at particular
sizes of the current feature set |S|.
mRMR is a line along the β axis, and JMI is a line along β = γ. Once we have understood this property it is natural to ask whether there are other paths through this space. Indeed there
exist other criteria which follow a similar form to these linear criteria, except
they include other operations like minimizations, making them take a non-linear
path.
Fleuret [40] proposed the Conditional Mutual Information Maximization cri-
terion,
J_{cmim}(X_k) = \min_{X_j \in S} \big[ I(X_k; Y | X_j) \big].    (5.17)
Using the identity I(X_k; Y | X_j) = I(X_k; Y) - I(X_k; X_j) + I(X_k; X_j | Y), this can equivalently be written as J_{cmim}(X_k) = I(X_k; Y) - \max_{X_j \in S} \big[ I(X_k; X_j) - I(X_k; X_j | Y) \big]; again these manipulations are found in Brown [12]. Due to the max operator, the probabilistic interpretation is less straightforward. It is clear however that CMIM
adopts Assumption 1, since it evaluates only pairwise feature statistics. Vidal-
Naquet and Ullman [109] propose another criterion used in Computer Vision,

J_{if}(X_k) = \min_{X_j \in S} \big[ I(X_k X_j; Y) - I(X_j; Y) \big].

The authors motivate this criterion by noting that it measures the gain of com-
bining a new feature Xk with each existing feature Xj , over simply using Xj by
itself. The Xj with the least ‘gain’ from being paired with Xk is taken as the score
for Xk . Interestingly, using the chain rule I(Xk Xj ; Y ) = I(Xj ; Y ) + I(Xk ; Y |Xj ),
therefore IF is equivalent to CMIM, i.e. Jif (Xk ) = Jcmim (Xk ), making the same
assumptions. In a similar vein, Jakulin [58] proposed the ICAP criterion,
J_{icap}(X_k) = I(X_k; Y) - \sum_{X_j \in S} \max\big[ 0, \; I(X_k; X_j) - I(X_k; X_j | Y) \big].    (5.20)
Again, this adopts Assumption 1, using the same redundancy and conditional
redundancy terms, yet the exact probabilistic interpretation is unclear. This
criterion is designed to penalise features which interact negatively, but does not reward those which interact positively. It is designed for use with classifiers which
make similar assumptions of class conditional independence, such as the Naı̈ve
Bayes classifier.
An interesting class of criteria use a normalisation term on the mutual infor-
mation to offset the inherent bias toward high arity features [34]. An example
of this is Double Input Symmetrical Relevance [79], a modification of the JMI
criterion:
J_{disr}(X_k) = \sum_{X_j \in S} \frac{I(X_k X_j; Y)}{H(X_k X_j Y)}.    (5.21)
The inclusion of this normalisation term breaks the strong theoretical link to a
likelihood function, but again for completeness we will include this in our empiri-
cal investigations. While the criteria in the unit square can have their probabilis-
tic assumptions made explicit, the non-linearity in the CMIM, ICAP and DISR
criteria makes such an interpretation far more difficult.
We can upper bound each mutual information term by various entropies (all Shannon mutual information terms are lower bounded by zero). In particular we can bound J∗ as follows,
I(X_k; Y) \le H(X_k)
\sum_{X_j \in S} I(X_k; X_j) \le |S| \cdot H(X_k)
\sum_{X_j \in S} I(X_k; X_j | Y) \le |S| \cdot H(X_k | Y) \le |S| \cdot H(X_k).    (5.24)
However the entire CIFE criterion is not so well behaved, as it is neither bounded
above by H(Xk ) nor bounded below by 0. This is because the redundancy and
conditional redundancy terms are bounded only by a function of |S|, and so grow in size relative to the relevancy term. This issue is also apparent in the MIFS
and ICAP criteria, and we will see how it explains many of the differences in
experimental performance.
If we consider the JMI criterion, we can see that the averaging of the redun-
dancy and conditional redundancy terms removes this problem, stopping those
terms from growing as a function of |S|. However the JMI criterion poses another
problem, as it has multiple variants. We presented two versions of this criterion
in Equation (5.15) and its re-written form above, and stated that they rank features identically.
However we might expect that these different versions would interact in different
ways with the prior term which exists in the derivation. We will return to this
issue when we address informative priors in Chapter 6.
In summary, the criteria surveyed in this section are approximations to Equation (5.1), each making different assumptions on the underlying distributions. Since in the previous section we saw
that accepting the top ranked feature according to Equation (5.1) provides the
maximum possible increase in the likelihood, we see now that the criteria are
approximate maximisers of the likelihood under an uninformative prior over the
choice of features. Whether or not they indeed provide the maximum increase
at each step will depend on how well the implicit assumptions on the data dis-
tribution match the true distribution. Also, it should be remembered that even
if we used the optimal stepwise criterion, it is not guaranteed to find the global
optimum of the likelihood, since (a) it is a greedy search, and (b) finite data will
mean distributions cannot be accurately modelled. There is a subset of problems where this greedy search is optimal, namely in Markov Blanket discovery
algorithms, which we return to in Chapter 6. In this case, we have reached the
limit of what a theoretical analysis can tell us about these criteria, and we must
close the remaining ‘gaps’ in our understanding with an experimental study.
The set of features selected by any procedure will of course depend on the data
provided. It is a plausible complaint if the set of returned features varies wildly
with only slight variations in the supplied data. In general we do not wish the
feature set to have high variance, i.e. small changes in the data should cause correspondingly small changes in the selected feature set. This is an issue reminiscent
of the bias-variance dilemma, where the sensitivity of a classifier to its initial
Table 5.1: Datasets used in experiments. The final column indicates the difficulty
of the data in feature selection, a smaller value indicating a more challenging
problem.
conditions causes high variance responses. However, while the bias-variance de-
composition is well-defined and understood, the corresponding issue for feature
selection, the “stability”, has only recently been studied. The stability of a fea-
ture selection criterion requires a measure to quantify the similarity between two
selected feature sets. This was first discussed by Kalousis et al. [59], who in-
vestigated several measures, with the final recommendation being the Tanimoto
distance between sets. Such set-intersection measures seem appropriate, but have
limitations; for example, if two criteria selected identical feature sets of size 10,
we might be less surprised if we knew the overall pool of features was of size 12,
than if it was size 12,000. Kuncheva [67] presents a consistency index which takes
this into account, based on the hypergeometric distribution with a correction for
the probability of selecting the same feature set at random.
Definition 9. The consistency for two subsets A, B ⊂ X, such that |A| = |B| =
k, and r = |A ∩ B|, where 0 < k < |X| = d, is
C(A, B) = \frac{rd - k^2}{k(d - k)}.    (5.25)
The consistency takes values in the range [−1, +1], with a positive value in-
dicating similar sets, a zero value indicating a purely random relation, and a
negative value indicating a strong anti-correlation between the feature sets.
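Definition 9 translates directly into a few lines of Python; the sketch below assumes the two subsets are given as collections of feature indices drawn from a pool of d features, and is purely illustrative.

```python
def kuncheva_consistency(A, B, d):
    """Consistency index of Equation (5.25): C(A,B) = (r*d - k^2) / (k*(d - k)),
    for two feature subsets with |A| = |B| = k, r = |A intersect B| and 0 < k < d."""
    A, B = set(A), set(B)
    k = len(A)
    assert len(B) == k and 0 < k < d, "requires |A| = |B| = k and 0 < k < d"
    r = len(A & B)
    return (r * d - k * k) / (k * (d - k))

# An overlap of 9 out of 10 selected features is scored very differently when the
# pool is tiny (agreement is nearly forced) versus very large (agreement is unlikely
# by chance), which is exactly the correction discussed above.
print(kuncheva_consistency(range(10), list(range(9)) + [10], d=12))      # 0.4
print(kuncheva_consistency(range(10), list(range(9)) + [10], d=12000))   # ~0.9
```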
One problem with the consistency index is that it does not take feature re-
dundancy into account. That is, two procedures could select features which have
different indices, so are identified as “different”, but in fact are so highly corre-
lated that they are effectively identical. A method to deal with this situation
was proposed by Yu et al. [111]. This method constructs a weighted complete
bipartite graph, where the two node sets correspond to two different feature sets,
and the weights assigned to the arcs are the normalized mutual information be-
tween the features at the nodes, also sometimes referred to as the symmetrical
uncertainty. The weight between node i in set A, and node j in set B, is
w(A(i), B(j)) = \frac{I(X_{A(i)}; X_{B(j)})}{H(X_{A(i)}) + H(X_{B(j)})}.    (5.26)
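A minimal sketch of these pairwise weights for discrete data is given below; it only builds the weight matrix of Equation (5.26) and does not reproduce the matching step of Yu et al.'s full information stability measure. The helper names and the assumption of integer-coded features are illustrative.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(x):
    """Plug-in entropy (nats) of a discrete 1-D array."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def information_weights(X, set_a, set_b):
    """Weight matrix of Equation (5.26):
    w(A(i), B(j)) = I(X_A(i); X_B(j)) / (H(X_A(i)) + H(X_B(j)))."""
    W = np.zeros((len(set_a), len(set_b)))
    for i, a in enumerate(set_a):
        for j, b in enumerate(set_b):
            W[i, j] = mutual_info_score(X[:, a], X[:, b]) / (
                entropy(X[:, a]) + entropy(X[:, b]))
    return W
```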
Figure 5.3: Kuncheva’s Stability Index [67] across 15 datasets. The box indi-
cates the upper/lower quartiles, the horizontal line within each shows the median
value, while the dotted crossbars indicate the maximum/minimum values. For
convenience of interpretation, criteria on the x-axis are ordered by their median
value.
probability distributions. It can be noted that the far right of Figure 5.3 consists
of the MIFS, ICAP and CIFE criteria, all of which do not attempt to average the
redundancy terms.
Figure 5.4 shows the same datasets, but instead the information stability
is computed; as mentioned, this should take into account the fact that some
features are highly correlated. Interestingly, the two box-plots show broadly
similar results. MIM is the most stable, and CIFE is the least stable, though here
we see that JMI, DISR, and mRMR are actually more stable than Kuncheva’s
stability index can reflect.
Two criteria can be directly compared with the same methodology: by measuring
the consistency and information consistency between selected feature subsets on
a common set of data. We calculate the mean consistencies between two feature
sets of size 10, repeatedly selected over 50 bootstraps from the original data. This
is then arranged in a similarity matrix, and we use classical multi-dimensional
scaling [26] to visualise this as a 2D map, shown in Figures 5.5a and 5.5b. Note
5.2. EXPERIMENTAL STUDY 103
[Box plot: Information Stability (y-axis) for each criterion; x-axis order as in Figure 5.3: mim, jmi, disr, condred, mrmr, cmim, mifs, icap, cife.]
Figure 5.4: Yu et al.’s Information Stability Index [111] across 15 datasets. For
comparison, criteria on the x-axis are ordered identically to Figure 5.3. A similar
general picture emerges to that using Kuncheva’s measure, though the informa-
tion stability index is able to take feature redundancy into account, showing that
some criteria are slightly more stable than expected.
again that while the indices may return different absolute values (one is a nor-
malized mean of a hypergeometric distribution and the other is a pairwise sum
of mutual information terms) they show very similar relative ‘distances’ between
criteria.
Both diagrams show a cluster of several criteria, and 4 clear outliers: MIFS,
CIFE, ICAP and CondRed. The 5 criteria clustering in the upper left of the
space appear to return relatively similar feature sets. The 4 outliers appear
to return quite significantly different feature sets, both from the clustered set,
and from each other. A common characteristic of these 4 outliers is that they
do not scale the redundancy or conditional redundancy information terms. In
these criteria, the upper bound on the redundancy term \sum_{j \in S} I(X_k; X_j) grows
linearly with the number of selected features, whilst the upper bound on the
relevancy term I(Xk ; Y ) remains constant. When this happens the relevancy term
is overwhelmed by the redundancy term and thus the criterion selects features
with minimal redundancy, rather than trading off between the two terms. This
leads to strongly divergent feature sets being selected, which is reflected in the
stability of the criteria. Each of the outliers differs from the others, as they
have different combinations of redundancy and conditional redundancy. We will
Figure 5.5: Relations between feature sets generated by different criteria, on av-
erage over 15 datasets. 2D visualisation generated by classical multi-dimensional
scaling.
Figure 5.6: Average ranks of criteria in terms of test error, selecting 10 features,
across 11 datasets. Note the clear dominance of criteria which do not allow the
redundancy term to overwhelm the relevancy term (stars) over those that allow
redundancy to grow with the size of the feature set (circles).
Table 5.2: Datasets from Peng et al. [86], used in small-sample experiments.
We now consider whether the same property may hold in extreme small-sample situations,
when the number of examples is so low that reliable estimation of distributions
becomes extremely difficult. We use data sourced from Peng et al. [86], detailed
in Table 5.2.
Results are shown in Figure 5.7, selecting 50 features from each dataset and
plotting leave-one-out classification error. It should of course be remembered that
on such small datasets, misclassifying just one additional datapoint can result in
seemingly large changes in accuracy. For example, the difference between the best
and worst criteria on Leukemia was just 3 datapoints. In contrast to the UCI
results, the picture is less clear. On Colon, the criteria all perform similarly; this
is the least complex of all the datasets, having the smallest number of classes with
a (relatively) small number of features. As we move through the datasets with
increasing numbers of features/classes, we see that MIFS, CondRed, CIFE and
ICAP start to break away, performing poorly compared to the others. Again, we
note that these do not attempt to balance relevancy/redundancy. This difference
is clearest on the NCI9 data, the most complex with 9 classes and 9712 features.
However, as we may expect with such high dimensional and challenging problems,
there are some exceptions — the Colon data as mentioned, and also the Lung
data where ICAP/MIFS perform well.
Figure 5.7: LOO results on Peng’s datasets: Colon, Lymphoma, Leukemia, Lung,
NCI9.
The experimental protocol was to take 50 bootstraps from the dataset, each time calculating the out-of-bag error using the 3-NN classifier. The stability measure was Kuncheva’s stability index
calculated from the 50 feature sets, and the accuracy was the mean out-of-bag
accuracy across the 50 bootstraps. The experiments were also repeated using the
Information Stability measure, revealing almost identical results. Results using
Kuncheva’s stability index are shown in Figure 5.8.
The Pareto-optimal set is defined as the set of criteria for which no other
criterion has both a higher accuracy and a higher stability, hence the members
of the Pareto-optimal set are said to be non-dominated [41]. Thus in each of the
graphs in Figure 5.8, criteria that appear further to the top-right of the space
dominate those toward the bottom left — in such a situation there is no reason
to choose those at the bottom left, since they are dominated on both objectives
by other criteria.
A summary (for both stability and information stability) is provided in the
first two columns of Table 5.3, showing the non-dominated rank of the different
criteria. This is computed per dataset as the number of other criteria which
dominate a given criterion, in the Pareto-optimal sense, then averaged over the
15 datasets. We can see that these rankings are similar to the results earlier, with
MIFS, ICAP, CIFE and CondRed performing poorly. We note that JMI (which
both balances the relevancy and redundancy terms and includes the conditional
redundancy) outperforms all other criteria.
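One plausible reading of this ranking scheme is sketched below: a criterion's rank on a dataset is one plus the number of criteria that dominate it, so a non-dominated criterion receives rank 1.0, and per-dataset ranks are then averaged. The tie-handling used in the thesis is not specified here, so treat this as an illustration only.

```python
import numpy as np

def non_dominated_ranks(accuracy, stability):
    """Rank each criterion as 1 + (number of criteria with strictly higher
    accuracy AND strictly higher stability).  Non-dominated criteria get rank 1."""
    acc, stab = np.asarray(accuracy), np.asarray(stability)
    return [1 + int(np.sum((acc > acc[i]) & (stab > stab[i]))) for i in range(len(acc))]

# Criterion 2 is dominated by criterion 0 on both objectives; 0 and 1 are not dominated.
print(non_dominated_ranks([0.90, 0.85, 0.88], [0.7, 0.9, 0.6]))   # [1, 1, 2]
```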
We present the average accuracy ranks across the 50 bootstraps in the third
column of Table 5.3. These are similar to the results from Figure 5.6 but use
a bootstrap of the full dataset, rather than a small sample from it. Following
Demšar [29] we analysed these ranks using a Friedman test to determine which
criteria are statistically significantly different from each other. We then used
a Nemenyi post-hoc test to determine which criteria differed, with statistical
significances at 90%, 95%, and 99% confidences. These give a partial ordering
for the criteria which we present in Figure 5.9, showing a Significant Dominance
Partial Order diagram. Note that this style of diagram encapsulates the same
information as a Critical Difference diagram [29], but allows us to display multiple
levels of statistical significance. A bold line connecting two criteria signifies a
difference at the 99% confidence level, a dashed line at the 95% level, and a
dotted line at the 90% level. Absence of a link signifies that we do not have the
statistical power to determine the difference one way or another.
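The rank analysis can be reproduced with standard tools; the sketch below applies SciPy's Friedman test to a dataset-by-criterion error matrix and states the Nemenyi critical difference from Demšar [29]. The 15 by 9 random matrix is only a placeholder for real results, and q_alpha must be looked up from the Studentized range tables for the chosen confidence level.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# errors[d, c]: test error of criterion c on dataset d (placeholder values only).
errors = np.random.default_rng(1).random((15, 9))

# Friedman test over the per-dataset rankings of the criteria.
stat, p_value = friedmanchisquare(*[errors[:, c] for c in range(errors.shape[1])])

# Average rank per criterion (rank 1 = lowest error on that dataset).
avg_ranks = rankdata(errors, axis=1).mean(axis=0)

def nemenyi_cd(q_alpha, n_criteria, n_datasets):
    """Two criteria differ significantly if their average ranks differ by more
    than CD = q_alpha * sqrt(k*(k+1) / (6*N))  (Demsar, 2006)."""
    return q_alpha * np.sqrt(n_criteria * (n_criteria + 1) / (6.0 * n_datasets))
```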
[Scatter plots of Kuncheva Stability (y-axis) against Mean Accuracy (x-axis) for each criterion, one panel per dataset: congress, breast, heart, ionosphere, krvskp, landsat, lungcancer, parkinsons, semeion, sonar, soybeansmall, spect, splice, waveform, wine.]
Figure 5.8: Stability (y-axes) versus Accuracy (x-axes) over 50 bootstraps for the
final quarter of the UCI datasets. The Pareto-optimal rankings are summarised
in Table 5.3.
Table 5.3: Column 1: Non-dominated Rank of different criteria for the trade-off
of accuracy/stability. Criteria with a higher rank (closer to 1.0) provide a better
trade-off than those with a lower rank. Column 2: As column 1 but using
Kuncheva’s Stability Index. Column 3: Average ranks for accuracy alone.
Reading Figure 5.9, we see that with 99% confidence JMI is significantly superior to CondRed,
and MIFS, but not statistically significantly different from the other criteria. As
we lower our confidence level, more differences appear, for example mRMR and
MIFS are only significantly different at the 90% confidence level.
Figure 5.9: Significant dominance partial-order diagram. Criteria are placed top
to bottom in the diagram by their rank taken from column 3 of Table 5.3. A link
joining two criteria means a statistically significant difference is observed with
a Nemenyi post-hoc test at the specified confidence level. For example JMI is
significantly superior to MIFS (β = 1) at the 99% confidence level. Note that the
absence of a link does not signify the lack of a statistically significant difference,
but that the Nemenyi test does not have sufficient power (in terms of number of
datasets) to determine the outcome [29]. It is interesting to note that the four
bottom ranked criteria correspond to the corners of the unit square in Figure
5.2; while the top three (JMI/mRMR/DISR) are all very similar, scaling the
redundancy terms by the size of the feature set. The middle ranks belong to
CMIM/ICAP, which are similar in that they use the min/max strategy instead
of a linear combination of terms.
[Plots: validation error (y-axis) against number of selected features (x-axis) for each criterion.]
criteria (note the poor performance of MIFS/mRMR). This is perhaps not sur-
prising given the nature of the MADELON data, constructed precisely to require
features to be evaluated jointly.
It is interesting to note that the challenge organisers benchmarked a 3-NN
using the optimal feature set, achieving a 10% test error [49]. Many of the
criteria managed to select feature sets which achieved a similar error rate using
a 3-NN, and it is likely that a more sophisticated classifier is required to further
improve performance.
Our experimental results have shown another difference between the criteria,
which was less apparent from our theoretical study. The criteria which implement scaling, or some other method of balancing the sizes of the relevancy and redundancy terms, have outperformed all others. The criteria which do not, like CIFE and MIFS, perform poorly across many datasets. We now integrate our theoretical view of feature relevancy and redundancy with Kohavi and John’s notions of Strong and Weak Relevance from their landmark 1997 paper
[62].
In this section we re-state the definitions of Kohavi and John (hereafter KJ) in terms of mutual information, and see how they fit into our likelihood maximisation framework. In the notation below, Xi indicates the ith feature in the overall set X, and X\i
indicates the set {X\Xi }, all features except the ith. The definitions of strong
and weak relevance are taken from KJ’s paper, but the corollaries are novel
restatements in information theoretic terms. We also extend the notion of weak
relevance by separating it into two disjoint definitions.
Given the framework we have presented, we can note that this strong relevance comes from a combination of three terms, since

I(Xi; Y | X\i) = I(Xi; Y) − I(Xi; X\i) + I(Xi; X\i | Y).
This view of strong relevance demonstrates explicitly that a feature may be indi-
vidually irrelevant (i.e. p(y|xi ) = p(y) and thus I(Xi ; Y ) = 0), but still strongly
relevant if I(Xi ; X\i |Y ) − I(Xi ; X\i ) > 0.
Proof. This follows immediately from the proof for the strong relevance above.
It could easily be the case that I(Xi ; Y ) = 0, that is a feature is entirely irrel-
evant when considered on its own — but the sum of the two redundancy terms
results in a positive value for I(Xi ; Y |S). We see that if a criterion does not at-
tempt to model both of the redundancy terms, even if only using low dimensional
approximations, it runs the risk of evaluating the relevance of Xi incorrectly.
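A hypothetical two-feature example makes the point concrete: with a class defined as the XOR of two binary features, each feature has zero marginal mutual information with the target but is strongly relevant once the other is known. The sketch below only illustrates the estimates and is not an example taken from the thesis.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 100_000)
x2 = rng.integers(0, 2, 100_000)
y = x1 ^ x2                                     # the class is the XOR of the two features

# Individually X1 carries (essentially) no information about Y ...
print(mutual_info_score(x1, y))                 # ~0.0 nats

# ... but I(X1; Y | X2) is maximal, so X1 is strongly relevant in the sense above.
cond_mi = sum((x2 == v).mean() * mutual_info_score(x1[x2 == v], y[x2 == v])
              for v in np.unique(x2))
print(cond_mi)                                  # ~log 2 = 0.693 nats
```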
L(D, \theta, \tau, \lambda) = p(\theta, \tau) \, p(\lambda) \prod_{i=1}^{N} q(y^i | x^i, \theta, \tau) \, q(x^i | \lambda).    (6.1)
We saw how to expand this likelihood into the sum of several terms,
-\ell \approx \mathbb{E}_{x,y}\left\{ \log \frac{p(y^i | x^i, \theta)}{q(y^i | x^i, \theta, \tau)} \right\} + I(X_{\neg\theta}; Y | X_\theta) + H(Y|X) - \frac{1}{N} \log p(\theta, \tau).    (6.2)
Finally we saw that we can greedily maximise the likelihood by selecting features
one by one according to this optimal criterion,
X_k^* = \arg\max_{X_k \in X_{\neg\theta^t}} \left[ I(X_k; Y | X_{\theta^t}) + \frac{1}{N} \log \frac{p(\theta^{t+1})}{p(\theta^t)} \right].    (6.4)
In the previous chapter we took this framework and looked at how the process
of optimising Equation (6.4) under a flat prior relates to the literature. This
framework explains many common information theoretic feature selection criteria
by showing they derive from making different assumptions about the various
probability distributions involved.
is non-zero. We will see how this threshold value can be interpreted as a spar-
sity prior controlling the number of selected features. We begin by constructing
factorised priors to encode sparsity and domain knowledge.
As mentioned above we treat each feature independently, and assume each θ_i is a Bernoulli random variable. Therefore the prior p(θ) is

p(\theta) = \prod_i^d p(\theta_i) = \prod_i^d \beta_i^{\theta_i} (1 - \beta_i)^{1 - \theta_i}.    (6.5)
Each success probability β_i is parametrised by a per-feature weight w_i and a global strength α through the logistic function,

\beta_i = \frac{e^{\alpha w_i}}{1 + e^{\alpha w_i}} = \frac{1}{1 + e^{-\alpha w_i}}.    (6.6)
Substituting Equation (6.6) into (6.5) gives

p(\theta) = \prod_i^d \left( \frac{e^{\alpha w_i}}{1 + e^{\alpha w_i}} \right)^{\theta_i} \left( 1 - \frac{e^{\alpha w_i}}{1 + e^{\alpha w_i}} \right)^{(1 - \theta_i)}.    (6.7)
We can rewrite this prior into a more common form using the fact that θi is
binary, as follows,
p(\theta) = \prod_i^d \left( \frac{e^{\alpha w_i}}{1 + e^{\alpha w_i}} \right)^{\theta_i} \left( 1 - \frac{e^{\alpha w_i}}{1 + e^{\alpha w_i}} \right)^{(1 - \theta_i)}
         = \prod_i^d \left( \frac{e^{\alpha w_i}}{1 + e^{\alpha w_i}} \right)^{\theta_i} \left( \frac{1}{1 + e^{\alpha w_i}} \right)^{(1 - \theta_i)}
         = \prod_i^d \frac{1}{1 + e^{\alpha w_i}} \, (e^{\alpha w_i})^{\theta_i} \, 1^{(1 - \theta_i)}
         = \prod_i^d \frac{e^{\alpha w_i \theta_i}}{1 + e^{\alpha w_i}}
         = \frac{\prod_i^d e^{\alpha w_i \theta_i}}{\prod_i^d (1 + e^{\alpha w_i})}
         = \frac{e^{\alpha \sum_i w_i \theta_i}}{\prod_i^d (1 + e^{\alpha w_i})}
         = \frac{e^{\alpha w^T \theta}}{\prod_i^d (1 + e^{\alpha w_i})}    (6.8)
As \prod_i^d (1 + e^{\alpha w_i}) is constant with respect to θ, this is equivalent to specifying p(θ) as

p(\theta) \propto e^{\alpha w^T \theta}.    (6.9)
As the prior terms in our greedy maximisation of the likelihood appear only as ratios,
we do not need the normalisation constant or any other constant factors for this
prior. We note that this formulation is of a similar exponential form to the priors
specified by Mukherjee and Speed [82], and we could extend our framework to
incorporate many of their graph structure priors.
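A short numerical check of this algebra, under the same factorised Bernoulli assumptions, is given below; the helper name log_prior is introduced here for illustration only.

```python
import numpy as np

def log_prior(theta, w, alpha):
    """log p(theta) under the factorised Bernoulli prior of Equation (6.5),
    with success probabilities beta_i = sigmoid(alpha * w_i) as in Equation (6.6)."""
    theta, w = np.asarray(theta), np.asarray(w)
    beta = 1.0 / (1.0 + np.exp(-alpha * w))
    return float(np.sum(theta * np.log(beta) + (1 - theta) * np.log(1 - beta)))

rng = np.random.default_rng(0)
w = rng.normal(size=6)
theta_a, theta_b = rng.integers(0, 2, 6), rng.integers(0, 2, 6)
alpha = 2.0

# Equation (6.9): log p(theta) = alpha * w^T theta + constant, so differences of
# log-priors depend only on alpha * w^T (theta_a - theta_b).
lhs = log_prior(theta_a, w, alpha) - log_prior(theta_b, w, alpha)
rhs = alpha * w @ (theta_a - theta_b)
print(np.isclose(lhs, rhs))    # True
```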
When using this factored prior we can rewrite the update rules in Equations (4.14)
and (4.20). The ratio term simplifies as each update only includes or excludes
a single feature, and most of the terms in the prior are constant and cancel.
Therefore the prior ratio when selecting an additional feature is,
\frac{p(\theta^{t+1})}{p(\theta^t)} = e^{\alpha w_k}    (6.10)
where wk denotes the weight of the candidate feature. The prior ratio when
removing a feature is
\frac{p(\theta^{t+1})}{p(\theta^t)} = e^{-\alpha w_k}.    (6.11)
The full update rule when selecting a new feature (i.e. a forward step) is:
X_k^* = \arg\max_{X_k \in X_{\neg\theta^t}} \left[ I(X_k; Y | X_{\theta^t}) + \frac{\alpha w_k}{N} \right]    (6.12)
with a similar update for the backward step. These updates form a greedy max-
imisation of the joint likelihood given in Equation (6.1), where the α and wk
control the strength and class of knowledge respectively.
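A sketch of this forward step is given below, conditioning on the joint configuration of the currently selected (discrete) features; with more than a handful of selected features this plug-in estimate becomes unreliable, which is precisely why the criteria of Chapter 5 use low-dimensional approximations. The function names and encoding scheme are assumptions of this sketch.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def joint_code(X, cols):
    """Encode the selected discrete columns as a single discrete variable."""
    if not cols:
        return np.zeros(len(X), dtype=int)
    _, code = np.unique(X[:, cols], axis=0, return_inverse=True)
    return code.reshape(-1)

def cond_mi(a, b, c):
    """Plug-in I(A;B|C) = sum_c p(c) I(A;B | C=c) for discrete 1-D arrays."""
    return sum((c == v).mean() * mutual_info_score(a[c == v], b[c == v])
               for v in np.unique(c))

def forward_step(X, y, selected, w, alpha):
    """One greedy forward step of Equation (6.12):
    pick argmax_k [ I(X_k; Y | X_selected) + alpha * w_k / N ]."""
    n, d = X.shape
    z = joint_code(X, selected)
    scores = {k: cond_mi(X[:, k], y, z) + alpha * w[k] / n
              for k in range(d) if k not in selected}
    best = max(scores, key=scores.get)
    return best, scores[best]
```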
We use |θ| to represent the number of selected features in θ, and encode sparsity with the prior

p_s(\theta) \propto e^{-\alpha_s |\theta|}.    (6.13)

This derives directly from setting all the w_i = −1 in Equation (6.9).
By allowing the wi values to range freely we can encode varying levels of
information into the prior, as these again change the success probability of the
Bernoulli, thus encoding how useful a priori we think a given feature is. A
positive wi denotes that the feature is useful and the domain knowledge suggests
it should be selected, and a negative wi denotes the feature has no value and
should not be selected. When wi = 0 we have no extra information to include
about that particular feature and thus give it an equal probability of being selected or remaining unselected. We will denote such knowledge priors with p_d(θ) and α_d, leading to a knowledge prior of the form

p_d(\theta) \propto e^{\alpha_d w^T \theta}.    (6.14)
We have now described two kinds of priors which we can integrate into any crite-
rion derived from our discriminative model assumption. To combine both spar-
sity and domain knowledge into the same prior we will define p(θ) ∝ ps (θ)pd (θ).
When using our greedy updates the normalisation constants again disappear as
the prior is only considered in a ratio. If we use this prior then the sparsity and
domain knowledge terms separate out in the forward update as follows
X_k^* = \arg\max_{X_k \in X_{\neg\theta^t}} \left[ I(X_k; Y | X_\theta) + \frac{1}{N} \log \frac{p_s(\theta^{t+1})}{p_s(\theta^t)} + \frac{1}{N} \log \frac{p_d(\theta^{t+1})}{p_d(\theta^t)} \right].    (6.15)
We now turn to the issue of integrating these priors into a feature selection
algorithm. We choose to look at the IAMB algorithm [107], and show how it
maximises the joint likelihood with a sparsity prior.
We can then see that the threshold is a special case of the sparsity prior in Eq
(6.13) with αs = tN, where the strength of the prior is dependent on the number of data samples.
6.3. INCORPORATING A PRIOR INTO IAMB 127
Theorem 3. Tsamardinos and Aliferis [107] proved that IAMB returns the true
Markov Blanket under the condition of a perfect independence test f (X; Y |CM B).
Given this condition is satisfied, then IAMB is an iterative maximization of the
discriminative model in Equation (6.1), under a specific sparsity prior.
We can now extend IAMB by introducing informative priors into the Markov
Blanket discovery process. First we define p(θ) ∝ ps (θ)pd (θ) where ps (θ) is
the sparsity prior (or threshold), and pd (θ) is our knowledge prior specified in
Equation (6.14). We can ignore the normalisation constant as we only consider
the ratio of the prior terms. We then use
I(X_k; Y | X_\theta) + \frac{1}{N} \log \frac{p_s(\theta^{t+1})}{p_s(\theta^t)} + \frac{1}{N} \log \frac{p_d(\theta^{t+1})}{p_d(\theta^t)} > 0    (6.18)
as the independence test having expanded out the prior p(θ). Incorporating
pd (θ) into IAMB lowers the “threshold” for features we believe are in the Markov
Blanket and increases it for those we believe are not. We call this modified version
IAMB-IP (IAMB-Informative Prior).
In some cases the knowledge prior, pd , may be larger than the sparsity prior,
ps , causing the algorithm to unconditionally include feature Xk without reference
to the data. We wish to blend the domain knowledge into the statistical evidence
from the data, and so a prior which is strong enough to include features without
reference to the data is undesirable. We therefore recommend a bound on the
strength of the domain knowledge prior, by fixing αd ≤ αs . This bounds the
domain knowledge prior from above and below to ensure it is not strong enough
to blindly include a feature without some evidence from the data.
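The resulting test is easy to state in code. The sketch below evaluates Equation (6.18) for a single candidate, using the fact that under the factorised priors the sparsity term contributes -alpha_s/N and the knowledge term +alpha_d*w_k/N when a feature is added, and it clamps alpha_d to alpha_s as recommended above. The numbers are purely illustrative.

```python
import numpy as np

def iamb_ip_admits(cmi_value, w_k, n, alpha_s, alpha_d):
    """Inclusion test of Equation (6.18) for one candidate feature.

    Adding a feature changes the sparsity prior by a factor exp(-alpha_s), which
    contributes -alpha_s/N, while the knowledge prior of Equation (6.14) contributes
    +alpha_d*w_k/N.  Clamping alpha_d <= alpha_s stops the domain knowledge from
    admitting a feature that has no support at all in the data."""
    alpha_d = min(alpha_d, alpha_s)
    return cmi_value - alpha_s / n + alpha_d * w_k / n > 0

n = 500
alpha_s = 0.02 * n   # in this sketch the sparsity contribution alpha_s/N acts as the threshold
print(iamb_ip_admits(cmi_value=0.015, w_k=0.0, n=n, alpha_s=alpha_s, alpha_d=np.log(99)))  # False
print(iamb_ip_admits(cmi_value=0.015, w_k=1.0, n=n, alpha_s=alpha_s, alpha_d=np.log(99)))  # True
```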
Figure 6.1: Toy problem, 5 feature nodes (X1 . . . X5 ) and their estimated mutual
information with the target node Y on a particular data sample. X1 , X2 , X5 form
the Markov Blanket of Y .
more accurately reflect the state of domain knowledge. In all experiments we only
consider nodes with a Markov Blanket containing two or more features and we
assess performance using the F-Measure (harmonic mean of precision & recall),
comparing against the ground truth.
In Figure 6.1 we show a toy problem to illustrate the different effects prior
knowledge can have on the Markov Blanket discovery process. Features X1 , X2 , X5
are in the Markov Blanket of Y and features X3 and X4 are not. IAMB (with the
default threshold) would select only X1 as the MB, based upon the estimated mu-
tual informations given. The performance of IAMB-IP will depend upon what
knowledge is put into the prior. If we upweight X1 it is a true positive, as it
actually lies in the MB, similarly if we downweight X3 it is a true negative. If
we upweight X4 it is a false positive, as it does not lie in the MB of Y , and
similarly downweighting X2 is a false negative as it does lie in the MB of Y . If
we upweighted only X1 IAMB-IP would perform similarly to IAMB, as X1 has a
strong measured association with Y , however upweighting X2 would include that
variable and then X5 , as X2 only has a weak measured association with Y and so
the prior will increase it. If X4 is upweighted, (introducing a false positive into
the prior) then it is unlikely to be included, as it has no measured association
with Y , however X3 would be included if it was upweighted. If we downweight
X2 , (introducing a false negative) we can see this would remove both X2 and X5 ,
as X5 only becomes relevant when X2 is included. We can see that false nega-
tives in the prior are more problematic for IAMB-IP, as they can cause multiple
variables to be incorrectly removed from the candidate MB.
We use the protocol in Algorithm 4 to test the relative performance for two
groups of sample sizes: 10 to 100 samples in steps of 10 (small sample), and 200 to
1000 samples in steps of 100 (large sample). For the large sample we perform 10
trials over independent data samples, and for the smaller sizes we expect a greater
variance and thus use 30 trials. The wins/draws/losses were assessed using a 95%
confidence interval over the IAMB-IP results, compared to the IAMB result. The
variance in IAMB-IP is due to the random selection of features which are included
in the prior, which was repeated 40 times. We set αd = log 99, except when this
was above the bound αd ≤ αs where we set αd = αs . This is equivalent to setting
individual priors p(θi = 1) = 0.99 for upweighted features and p(θi = 1) = 0.01
for downweighted features. We set αs so t = 0.02 for both IAMB and IAMB-IP.
We average these wins/draws/losses over all valid features in a dataset, where a
valid feature is one with a Markov Blanket containing two or more features.
We first investigate the performance of IAMB-IP when using a correct prior.
We tested priors that included 2 true positives, and 1 true positive and 1 true
negative. The average results over the 4 datasets are in the first two columns of
Figure 6.2. We can see that when incorporating correct priors IAMB-IP performs
better than IAMB or equivalently to it in the vast majority of cases. The draws
between IAMB and IAMB-IP are due to the overlap between the statistical infor-
mation in the data and the information in the prior. When the prior upweights a
feature with a strong signal from the data, then the behaviour of IAMB-IP is the same as IAMB.
[Stacked bar chart: fraction of wins, draws and losses (y-axis) for each experiment (x-axis).]
Figure 6.2: Average results: (a) Small sample, correct prior; (b) Large sample,
correct prior; (c) Small sample, misspecified prior; (d) Large sample, misspecified
prior.
It is when the prior upweights a feature with a weak signal that the behaviour of the two algorithms diverges, and similarly for features that are
downweighted.
We now investigate the more interesting case of misspecified priors, where the
prior contains some incorrect information. We tested priors using 1 true positive
& 1 false negative, and 1 true positive & 1 false positive. These are presented in
the last two columns of Figure 6.2. We can see that IAMB-IP performs equiva-
lently or better than IAMB in four-fifths of the repeats, on average. We present
results for Alarm in Table 6.2, for Barley in Table 6.3, for Hailfinder in Table 6.4
and for Insurance in Table 6.5. We can see that the algorithm is more sensitive
to false negatives than to false positives, especially when there are small amounts of data: the prior knowledge carries more weight in those situations, so any flaws impact performance more. This is because false negatives may remove chil-
dren from the MB, which in turn means no spouse nodes (the other parents of
the common child node) will be included, which magnifies the effect of the false
information.
In summary we can see that the addition of informative priors into IAMB
to give IAMB-IP improves performance in many cases, even when half of the prior information is incorrect.
We have focused on adding true positives to the prior, and how they interact
with the false information. In our datasets true positives are rarer than true neg-
atives and thus more important, because the Markov Blankets are much smaller
than the number of features. Therefore when we construct the prior at random,
we are more likely to select true positives where the prior information is useful
(i.e. there is not enough statistical information in the data to include the true
positive) as there are fewer true positives to select from. When including true
negatives the prior only improves performance if the true negative appears to be
statistically dependent on the target (and then it is penalised by the prior and
not included); if it does not appear dependent, then the prior information has no
effect on its inclusion. Therefore when only including true negatives IAMB-IP
performs similarly to IAMB.
In the previous three chapters we looked at how feature selection using infor-
mation theory maximises the joint likelihood of a discriminative model. This
allows us to combine information theoretic measurements from the data with prior knowledge from domain experts, and therefore select features which best combine these two sources of information. In this chapter we apply the machin-
ery from Chapter 4 to a different loss function, namely the weighted conditional
likelihood [31]. In the binary classification problem this loss function is a bound
on the empirical risk of the dataset, which measures how costly misclassification
is for each example. We thus derive a cost-sensitive filter feature selection criterion, which coincides with Guiaşu’s definition of weighted information theory [46]. We prove several novel results with respect to this variant of information theory, to allow its use as a feature selection criterion. We present experimental results showing that the cost-sensitive selection criterion can be combined with a
cost-insensitive classifier to create an overall cost-sensitive system.
Dmochowski et al. [31] show that the negative weighted log likelihood forms a tight, convex upper bound on the
empirical loss, which is the expected conditional risk across a dataset. This
property is used to argue that maximising the weighted likelihood is the preferred
approach in the case where the classifier cannot perfectly fit the true model.
Unfortunately the weighted likelihood has only been shown to bound the
risk in binary cost-sensitive problems. In multi-class cost-sensitive problems the
likelihood does not accurately represent the misclassification probabilities of the
other classes, as the sum of these values is equal to 1 − p(y i |xi ) and they may
be unevenly weighted by the per example weight vector. Any weighted feature
selection process is likely to work best in the multi-class case as then it is more
probable that each label will have a different set of predictive features. Therefore
we avoid this problem by adjusting the weight of each datapoint in the likelihood.
We propose that if we use the sum of the per example weight vector,
w_s(y^i, x^i) = \sum_{y \in Y, \, y \neq y^i} w(y^i, x^i, y)    (7.1)
we will ensure the weighted likelihood still forms an upper bound on the em-
pirical risk. This takes a pessimistic view of the classification problem, as a
mis-prediction which has a high weight may have a very low probability, but the
weighted likelihood will still be small. We note that this forms a looser upper
bound on the empirical risk than in the binary case, but reduces to the same
likelihood in binary problems. Cost-sensitive feature selection is an interesting
case of the general cost-sensitive problem as when using a filter technique it is
difficult to generate predicted labels with which to calculate the misclassifica-
tion error. This generalisation of the weighted likelihood provides an objective
function which allows the construction of cost-sensitive feature selection.
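For illustration, the per-example weight of Equation (7.1) can be computed from a class-by-class cost matrix (a simplification in which the costs depend on the labels but not on x); the function below is a sketch under that assumption.

```python
import numpy as np

def per_example_weight(cost, y):
    """Equation (7.1) with class-dependent costs: w_s(y^i) is the sum of the costs
    of predicting every wrong label when the true label is y^i.
    cost[i, j] is the cost of predicting class j when the true class is i."""
    cost = np.asarray(cost, dtype=float)
    off_diagonal = cost.sum(axis=1) - np.diag(cost)
    return off_diagonal[np.asarray(y)]

# In a binary problem this reduces to the usual single misclassification cost per class.
cost = [[0.0, 1.0],
        [5.0, 0.0]]                                # mistakes on class 1 cost five times more
print(per_example_weight(cost, [0, 1, 1, 0]))      # [1. 5. 5. 1.]
```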
7.1.1 Notation
We use similar notation to Chapter 4 summarised here for clarity, with exten-
sions to include misclassification costs. We assume an underlying i.i.d. process
p : X → Y , from which we have a sample of N observations. Each observation is
a tuple (x, y, w), consisting of a d-dimensional feature vector x = [x1 , ..., xd ]T , a
target class y, and an associated non-negative |Y | − 1 dimensional weight vector,
w, for that observation, with x and y drawn from the underlying random vari-
ables X = {X1 , ..., Xd } and Y . Furthermore, we assume that p(y|x) is defined
In this section we take the weighted likelihood from Dmochowski et al. [31] and
decompose it into a sum of terms, where each term relates to a different part
of the classification process. This follows a similar process to the derivation in
Chapter 4. If we further make the filter assumption (Definition 8) we derive a
weighted feature selection criterion, which uses the weighted mutual information
(see Section 7.2) to score features. As mentioned previously when working in
the multi-class case, we use the sum of the per-example weight vector, wi as the
weight w(y i ) in the likelihood.
We approximate the true distribution p with our model q, with separate pa-
rameters for the feature selection, θ, and for classification, τ . We define the
L_w(D, \theta, \tau) = \prod_{i=1}^{N} q(y^i | x^i, \theta, \tau)^{w(y^i)}.    (7.2)
We choose to work with the scaled negative log-likelihood, -\ell_w, converting our maximisation problem into a minimisation problem. This gives

-\ell_w = -\frac{1}{N} \sum_{i=1}^{N} w(y^i) \log q(y^i | x^i, \theta, \tau)    (7.3)
which is the function we will minimise with respect to {θ, τ } (this is the initial
position of Dmochowski et al. as they only consider the log-likelihood). We first
introduce the ratio \frac{p(y|x)}{p(y|x)} into the logarithm; this is the probability of the correct class given all the features. We can then expand the logarithm into two terms,

-\ell_w = -\frac{1}{N} \left[ \sum_{i=1}^{N} w(y^i) \log \frac{q(y^i | x^i, \theta, \tau)}{p(y^i | x^i)} + \sum_{i=1}^{N} w(y^i) \log p(y^i | x^i) \right].    (7.4)
The first term is the weighted log-likelihood ratio between the true model and our
predictive model, and the second term is the weighted log-likelihood of the true
model. This latter term is a finite sample approximation to the weighted condi-
tional entropy [46]. This represents both the amount of uncertainty in the data,
and how costly that uncertainty is. It forms a bound on the maximum amount
of performance we can extract from our dataset, in terms of the conditional risk.
We now wish to separate out feature selection from the classification process.
We do this by introducing another ratio into the first logarithm, namely \frac{p(y|x,\theta)}{p(y|x,\theta)}. This is the probability of the correct class given the features selected by θ. We can then further expand the first logarithm as follows,

-\ell_w = -\frac{1}{N} \left[ \sum_{i=1}^{N} w(y^i) \log \frac{q(y^i | x^i, \theta, \tau)}{p(y^i | x^i, \theta)} + \sum_{i=1}^{N} w(y^i) \log \frac{p(y^i | x^i, \theta)}{p(y^i | x^i)} + \sum_{i=1}^{N} w(y^i) \log p(y^i | x^i) \right].    (7.5)
Each of these terms is weighted by a per example cost. As our weights are functions of the class label y,
the per-example weight w(y i ) and the per-state weight w(y) are identical. We can
then treat each of the summations above as approximations to the expectation
over each term, replacing the per-example weights with per-state weights. It is
this step which removes some of the power of our weighted likelihood as compared
to the weighted likelihood used in Dmochowski et al. , though we shall see it is
necessary to ensure the information theoretic values remain non-negative.
Again by similar logic to Chapter 4, we can interpret the second term in
Equation (7.5) as a finite sample approximation to the weighted conditional mu-
tual information Iw (X¬θ ; Y |Xθ ),
I_w(X_{\neg\theta}; Y | X_\theta) = \sum_{x,y \in X,Y} w(y) \, p(x,y) \log \frac{p(x_{\neg\theta}, y | x_\theta)}{p(x_{\neg\theta} | x_\theta) \, p(y | x_\theta)} \approx -\frac{1}{N} \sum_{i=1}^{N} w(y^i) \log \frac{p(y^i | x^i, \theta)}{p(y^i | x^i)}.    (7.6)
and the third term as a finite sample approximation to the weighted conditional
entropy Hw (Y |X),
H_w(Y | X) = -\sum_{x,y \in X,Y} w(y) \, p(x,y) \log p(y|x) \approx -\frac{1}{N} \sum_{i=1}^{N} w(y^i) \log p(y^i | x^i).
We can now write -\ell_w as the sum of weighted information theoretic quantities,

-\ell_w \approx \mathbb{E}_{x,y}\left\{ w(y) \log \frac{p(y^i | x^i, \theta)}{q(y^i | x^i, \theta, \tau)} \right\} + I_w(X_{\neg\theta}; Y | X_\theta) + H_w(Y|X).    (7.7)
We note that the optimal feature set θ ∗ is the set which makes Iw (X¬θ ; Y |Xθ ) =
0. This can only occur when p(x¬θ , y|xθ ) = p(x¬θ |xθ )p(y|xθ ), as the addition of
weights does not change the position of the minima of the mutual information.
If we have discovered the optimal feature set or a superset thereof (i.e. X ∗ ⊆
Xθ ) then p(y|x, θ) = p(y|x). The expectation in the first term can also then be
seen as a finite sample approximation to a weighted KL-Divergence (also known
as the weighted relative entropy [102]). This divergence represents how well the
predictive model fits the true distribution, given a superset of the optimal feature
set and how important each prediction is.
We have now decomposed the weighted conditional likelihood \ell_w into three weighted information theoretic terms.
We now derive iterative update rules which greedily maximise our formulation of
the weighted likelihood. These derivations are similar to the derivations given in
Chapter 4 but use the weighted mutual information. The literature surrounding
the weighted mutual information has no proof for non-negativity or an equivalent
definition for the chain rule of mutual information, which are necessary for the
derivation of iterative update rules. We provide such proofs in the next section,
and in this section assume the existence of such proofs.
By using the chain rule, we can separate out the most informative feature
from X¬θ and add it to Xθ . We add a superscript t to θ to denote which time
step the feature sets are from, and proceed to derive the forward update that
maximises the weighted likelihood. The proof takes a similar form to the proof
of Theorem 1 from Chapter 4.
Proof. The feature which maximises the weighted likelihood is the feature which
minimises Iw (X¬θt+1 ; Y |Xθt+1 ) from Iw (X¬θt ; Y |Xθt ), where Xθt+1 = Xθt ∪ Xj
and X¬θt+1 = X¬θt \ Xj for some Xj ∈ X¬θt . We can express this, using the weighted chain rule, as

I_w(X_{\neg\theta^t}; Y | X_{\theta^t}) = I_w(X_j; Y | X_{\theta^t}) + I_w(X_{\neg\theta^{t+1}}; Y | X_{\theta^{t+1}}).

Then we see that the feature Xj which minimises I_w(X_{\neg\theta^{t+1}}; Y | X_{\theta^{t+1}}) is

X_k^* = \arg\max_{X_j \in X_{\neg\theta^t}} I_w(X_j; Y | X_{\theta^t}).
A similar update can be found for the backwards step, analogous to Theorem 2
from Chapter 4. Both these updates are similar to the updates derived in Chapter
4 (which might be expected due to the machinery involved in generating them)
though they use the weighted mutual information developed by Guiaşu, rather
than Shannon’s formulation. In Section 7.3 we proceed to create a weighted
variant of the JMI filter criterion [110], by substituting the weighted mutual
information for the Shannon mutual information, and supplying appropriate cost
vectors. First we must prove the non-negativity property and the chain rule used
in the above theorem.
The weighted mutual information has appeared several times in the literature
[46, 75, 96] but there has been little investigation of the theoretical properties
of such a measure. As mentioned previously this literature lacks two important
properties necessary for feature selection using the kind of iterative updates used
throughout this thesis. We now review those important properties of the weighted
mutual information, namely, non-negativity and the chain rule.
Note that this weight depends on the value of both x and y; in this case, the
information can be negative. Table 7.1 presents an example distribution for X and
Y which gives a negative weighted mutual information; in this case wI(X; Y ) =
−0.0214.
Therefore, knowledge of the variable X reduces what we know about Y.¹ We
can avoid this problem with our measure if we define the weights so that they
are dependent upon a single variable i.e. ∀x, y w(x, y) = w(y). This still gives
valid weights under the conditions of weighted information theory, i.e. p(y)w(y) = \sum_{x \in X} w(x, y) p(x, y), which is the necessary condition for the definition of a unique
mutual information [46]. We restricted the weights in our weighted likelihood such
that they only depend upon y, therefore this limitation does not affect our earlier
derivation. We therefore define our weighted mutual information using weights that are a function of the class label alone.
¹ An alternative explanation is that knowledge of X may cause us to make more costly predictions for Y.
Grouping the terms of this restricted measure by class gives a weighted sum of KL-divergences,

I_w(X; Y) = \sum_{x \in X} \sum_{y \in Y} w(y) \, p(x, y) \log \frac{p(x, y)}{p(x) p(y)} = \sum_{y \in Y} w(y) \, p(y) \, D_{KL}\big( p(x|y) \,\|\, p(x) \big).

As all the weights w(y) are non-negative, p(y) is non-negative, and the KL-divergence is non-negative, our new weighted mutual information I_w is non-negative.
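A plug-in estimator for this restricted weighted mutual information is sketched below; grouping the terms by class gives a weighted sum of non-negative KL-divergences, matching the argument above. The function name and input conventions are assumptions of the sketch.

```python
import numpy as np

def weighted_mi(x, y, w):
    """Plug-in estimate of I_w(X;Y) = sum_{x,y} w(y) p(x,y) log[ p(x,y) / (p(x)p(y)) ]
    for discrete 1-D arrays, where w maps each class label to a weight w(y) >= 0."""
    _, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((xi.max() + 1, len(ys)))
    np.add.at(joint, (xi, yi), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    wy = np.asarray([w[c] for c in ys], dtype=float)      # one weight per observed class
    terms = np.zeros_like(joint)
    nz = joint > 0
    terms[nz] = joint[nz] * np.log(joint[nz] / (px @ py)[nz])
    return float((terms * wy).sum())
```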
Proof. Due to Theorem 5 we only define the chain rule such that Y is never
conditioned upon, and we define ∀x, z, y w(x, z, y) = w(y). We thus define the
chain rule as follows,
I_w(XZ; Y) = \sum_{x \in X} \sum_{y \in Y} \sum_{z \in Z} w(y) \, p(x,y,z) \log \frac{p(x,y,z)}{p(x,z) \, p(y)}
           = \sum_{x \in X} \sum_{y \in Y} \sum_{z \in Z} w(y) \, p(x,y,z) \log \frac{p(x,y) \, p(z|y,x)}{p(z|x) \, p(x) \, p(y)}
           = \sum_{x \in X} \sum_{y \in Y} w(y) \, p(x,y) \log \frac{p(x,y)}{p(x) \, p(y)} + \sum_{x \in X} \sum_{y \in Y} \sum_{z \in Z} w(y) \, p(x,y,z) \log \frac{p(z|y,x)}{p(z|x)}
           = I_w(X; Y) + \sum_{x \in X} \sum_{y \in Y} \sum_{z \in Z} w(y) \, p(x,y,z) \log \frac{p(z,y|x)}{p(z|x) \, p(y|x)}
           = I_w(X; Y) + I_w(Z; Y | X).
We note from our work in the previous section that we can only construct weighted
mutual informations which include the variable Y which our weights are defined
over. This limits the possible feature selection criteria, but fortunately many of
the common ones can be re-expressed in the right form. We focus on adapting
the JMI criterion [110], as this is the best performing criterion from our earlier
empirical study (see Chapter 5). We also use a weighted variant of MIM, which
ranks features according to their univariate weighted mutual information. We
expect that the other criteria we investigated in that chapter could be adapted to
work in this weighted likelihood framework and with the weighted mutual information,
provided they can be expressed using mutual informations between Y and X.
Unfortunately this restriction excludes popular criteria such as mRMR, as they
cannot be written in the necessary form.
We then combine these criteria with a greedy forward search, to produce a fea-
ture selection algorithm. As mentioned previously the use of weights does not
change the minima of the likelihood with respect to θ, however when using an
approximate optimiser such as wJMI the feature sets returned are different. This
is due to the pairwise independence assumption made in the JMI criterion [14],
which stops the criterion from determining when it has reached the optima. We
shall see this effect in the empirical study, as the feature sets selected by JMI and
wJMI (and by MIM and wMIM) diverge.
One important point to note is that as we only consider the rankings of indi-
vidual features the magnitude of the weight vector is irrelevant. If we normalise
the weight vector w so it sums to 1, this will return exactly the same feature set
(using wJMI or wMIM) as if we multiplied each weight by the same positive constant.
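A sketch of the resulting wJMI selector for discrete data follows, re-using a compact weighted mutual information estimator like the one in Section 7.2 and coding each feature pair as a single joint variable. The choice of the first feature by its univariate weighted mutual information (i.e. wMIM) and the helper names are assumptions of this sketch.

```python
import numpy as np

def weighted_mi(a, y, w):
    """Plug-in I_w(A;Y) with class-dependent weights w[y] (see Section 7.2)."""
    _, ai = np.unique(a, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((ai.max() + 1, len(ys)))
    np.add.at(joint, (ai, yi), 1)
    joint /= joint.sum()
    pa, py = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    wy = np.asarray([w[c] for c in ys], dtype=float)
    terms = np.zeros_like(joint)
    nz = joint > 0
    terms[nz] = joint[nz] * np.log(joint[nz] / (pa @ py)[nz])
    return float((terms * wy).sum())

def pair_code(a, b):
    """Encode the joint variable (X_k, X_j) as a single discrete variable."""
    _, code = np.unique(np.column_stack([a, b]), axis=0, return_inverse=True)
    return code.reshape(-1)

def wjmi_select(X, y, w, n_select):
    """Greedy forward selection with the weighted JMI score
    J_wjmi(X_k) = sum_{j in S} I_w(X_k X_j; Y)."""
    d = X.shape[1]
    selected = [int(np.argmax([weighted_mi(X[:, k], y, w) for k in range(d)]))]
    while len(selected) < n_select:
        remaining = [k for k in range(d) if k not in selected]
        scores = [sum(weighted_mi(pair_code(X[:, k], X[:, j]), y, w) for j in selected)
                  for k in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```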
In Figures 7.2 and 7.3 we present results from this dataset where the costly
class is the digit “4”. The average 4 in our dataset is presented in Figure 7.1.
Figure 7.2a shows the different algorithms tested with w(y =“4”) = 5, so the
digit “4” is five times more important than the other labels, which were given a
weight of one. Each group of bars represents a different approach to the cost-
sensitive problem: the first group is the standard cost-insensitive approach, where
the JMI algorithm is used to select 50 features, and then an SVM is trained on
those features. The second group is our approach, where we include the weight
in the feature selection process with wJMI and use a standard SVM to fit the
data. The third group uses JMI with a weighted SVM, and finally the fourth
group oversamples the costly class according to the weight. We can see that the
introduction of weights does not alter the overall classification performance (the
leftmost bar in each group is approximately the same), but the precision, recall
and F-measure change when the weighted approaches are used. Our method
improves both precision and recall over the baseline, whereas using a weighted
classifier trades off precision for recall (i.e. it predicts many more false positives).
Figure 7.3a shows how the F-Measure on class 4 changes for the three weighted
methods using weights w(y = 4) = {1, 5, 10, 15, 20, 25, 50, 100}. In contrast to
wJMI the weighted classifier degrades performance as it predicts many false pos-
itives. Figure 7.3b shows a precision/recall plot of the various approaches, with
the cost-insensitive approach appearing as a filled black circle. The size of the
marker represents the weight, from the range given above. We can see how the
weighted feature selection improves both precision and recall for all settings of
the weights whereas oversampling only improves recall, and the weighted classifier
improves recall at the cost of precision.
Finally in Figure 7.2b we show the different features selected by the standard
JMI algorithm, and the wJMI algorithm with w(y = 4) = 100. The light gray
features are those which were selected by JMI and not by wJMI, the dark gray
features are those selected by both algorithms, and the black features are those
selected by wJMI alone. This shows how the wJMI algorithm selects features
which are predictive of the presence or absence of a 4. The large black block
at the top of the image represents the area which differentiates a 4 from a 9 in
Figure 7.1: The average pixel values across all 4s in our sample of MNIST.
Figure 7.2: (a) Accuracy across the whole dataset, with 95% confidence across 10-fold CV, using (w)JMI. (b) Difference in selected features between JMI and wJMI, with w(y = 4) = 100.
the dataset, and the dark area in the bottom left differentiates a 4 from a 2, 5,
6, or 8, as those digits require pixels in that area. The other black features are used to positively detect a 4. We can see that this weighted approach selects features
which are predictive of the presence of a 4 and discriminate against false positives.
We present the average improvement in precision and recall across all digits
in our MNIST dataset in Table 7.2, where each digit in turn was given a weight
w(y) = 10. We can see that the weighted feature selection improves both precision
and recall, resulting in an improved F-measure, whereas the other approaches
have high recall, but low precision (i.e. predict many false positives), resulting
in little improvement in the F-measure. We also present the improvement in
precision and recall for the digit 4 in Table 7.3.
(a) F-Measure with 95% confidence, across 10-fold CV. (b) Precision/Recall plot, where the marker size indicates the weight w.
Figure 7.3: MNIST Results, with “4” as the costly digit, using w(y = 4) =
{1, 5, 10, 15, 20, 25, 50, 100} and (w)JMI. LEFT: Note that as costs for mis-
classifying “4” increase, the weighted FS method increases F-measure, while the
weighted SVM suffers a decrease. RIGHT: The black dot is the cost-insensitive
methodology. Note that the weighted SVM can increase recall above the 90%
mark, but it does so by sacrificing precision. In contrast, the weighted FS method
pushes the cluster of points up and to the right, increasing both recall and pre-
cision.
Table 7.2: MNIST results, averaged across all digits. Each value is the difference
(x 100) in Precision, Recall or F-Measure, against the cost-insensitive baseline.
Table 7.3: MNIST results, digit 4. Each value is the difference (×100) in Precision, Recall or F-Measure, against the cost-insensitive baseline.
(a) F-Measure on the digit “4” with 95% Confidence, across 10-fold CV. (b) F-Measure on the digit “9” with 95% Confidence, across 10-fold CV.
Figure 7.4: MNIST Results, with both “4” and “9” as the costly digits, using
w(y = (4 ∨ 9)) = {1, 5, 10, 15, 20, 25, 50, 100} and (w)JMI, F-Measure.
(a) Precision/Recall plot for the digit “4”; the marker size indicates the weight w. (b) Precision/Recall plot for the digit “9”; the marker size indicates the weight w.
Figure 7.5: MNIST Results, with both “4” and “9” as the costly digits, using
w(y = (4 ∨ 9)) = {1, 5, 10, 15, 20, 25, 50, 100} and (w)JMI, Precision/Recall plot.
Figure 7.6: MNIST Results, using wMIM comparing against SpreadFX, with “4”
as the costly digit.
Table 7.4: MNIST results, averaged across all digits. Each value is the difference
(×100) in Precision, Recall or F-Measure, against the cost-insensitive baseline.
We chose 5 document classification datasets from Han and Karypis [52] (the
datasets are listed in Table 7.5). We tested these datasets using the same procedure
as in the MNIST experiments, selecting 50 features with either our wJMI criterion
or the standard JMI criterion, and classifying using either a normal or a weighted
SVM. We ran each experiment using 10-fold cross validation where each class
was upweighted in turn with w(y) = 10, and present the Wins/Draws/Losses
at 95% confidence in Table 7.6. The first three columns compare
against the cost-insensitive method, and we see that the weighted feature
selection improves the F-measure in some cases, but the weighted SVM and over-
sampling approaches degrade performance in many cases. When comparing the
weighted feature selection against the weighted classifier or oversampling (the last
two columns) we see that the weighted feature selection improves upon weighted
classification in many cases, particularly with the ohscal dataset, which was the
largest dataset we tested, with 11162 examples and 11465 features.
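To make this protocol concrete, the following is a minimal sketch of the evaluation loop, not the code used in this thesis: scikit-learn's LinearSVC stands in for the weighted SVM (the bibliography cites LIBSVM [17]), and select_features is a hypothetical stand-in for the (w)JMI selection step, so its class_weights argument is illustrative only.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_recall_fscore_support

def evaluate(X, y, costly_class, select_features, n_features=50, weight=10,
             weighted_classifier=False):
    # 10-fold CV: select features on each training fold, train a plain or
    # class-weighted SVM, and score precision/recall/F-measure on the costly class.
    scores = []
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for train, test in folds.split(X, y):
        # select_features is a hypothetical (w)JMI selector supplied by the caller.
        chosen = select_features(X[train], y[train], n_features,
                                 class_weights={costly_class: weight})
        class_weight = {costly_class: weight} if weighted_classifier else None
        clf = LinearSVC(class_weight=class_weight)
        clf.fit(X[train][:, chosen], y[train])
        pred = clf.predict(X[test][:, chosen])
        p, r, f, _ = precision_recall_fscore_support(
            y[test], pred, labels=[costly_class], zero_division=0)
        scores.append((p[0], r[0], f[0]))
    return np.mean(scores, axis=0)  # mean precision, recall, F-measure

Passing a weight of 1 to the selector and leaving the classifier unweighted recovers the cost-insensitive baseline, while toggling the weight between the selector and the classifier reproduces the weighted-selection and weighted-classifier configurations compared above.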
Table 7.6: Document classification results: F-Measure W/D/L across all labels,
with the costly label given w(y) = 10.
Figure 7.7: Ohscal results. The cost of mis-predicting class 9 is set to ten times
that of the other classes. The weighted SVM and oversampling approaches clearly
focus on producing high recall, far higher than our method; however, this is only
achievable by sacrificing precision. Our approach improves both precision and recall,
giving a higher F-measure overall.
Chapter 8
Conclusions and Future Directions
−ℓ ≈ E_{x,y} log [ p(y^i | x^i, θ) / q(y^i | x^i, θ, τ) ] + I(X_{¬θ}; Y | X_θ) + H(Y | X) − (1/N) log p(θ, τ).
The first term denotes the quality of our predictor compared to the optimal
predictions for the selected feature set; it takes the form of a KL-Divergence
and is zero when our predictor makes the optimal predictions. The second term
denotes the quality of our selected feature set compared to the full feature set;
it takes the form of a conditional mutual information and is zero when our
selected feature set contains all the relevant information. The third term denotes
the quality of the data itself, and the final term is the prior probability of the
model parameters. We can therefore think of Equation 4.10 as expanding the
likelihood into these four terms, where we wish to minimise the first three and
maximise the fourth. In general we can do little to adjust the quality of our
training data, and we should not adjust our priors based upon the data; therefore
the first two terms are the
quantities we can minimise. Making the filter assumption from Definition 8 allows
us to first select a feature set which maximises the likelihood, and then to build
a classifier which maximises the likelihood based on that feature set.
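Schematically, and assuming for illustration that the prior factorises as p(θ, τ) = p(θ)p(τ) (an assumption made here, not stated in the text above), this two-stage procedure can be written as

θ* = arg max_θ [ I(X_θ; Y) + (1/N) log p(θ) ],    followed by    τ* = arg max_τ Σ_{i=1}^{N} log q(y^i | x^i_{θ*}, τ),

where the first stage follows because minimising I(X_{¬θ}; Y | X_θ) is equivalent to maximising I(X_θ; Y), by the chain rule of mutual information.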
One further insight based upon this expansion is that filter methods which
choose to maximise this likelihood are in effect wrapper methods using a perfect
classification model. Even the filters themselves appear to be simple classifiers, as
they construct a probability distribution and measure its predictive performance
with a KL-Divergence. This perspective shows that, when choosing to maximise
the joint likelihood, filters and wrappers differ only in the method by which they
search the feature space and in the assumptions made by the classification model.
Using a classification model with few assumptions and maximising the model
likelihood can be seen as a filter maximising the mutual information of the feature
set.
Returning to the specific problem of filter feature selection we can derive the
optimal selection criterion. If we choose to iteratively maximise the likelihood by
selecting (or removing) features greedily then we should select the feature which
maximises (minimises) the conditional mutual information between that feature
and the class label, conditioned upon the selected feature set plus the prior term
for the current feature set. We note that in the iterative updates the prior term
is a ratio; therefore, if we use a flat uninformative prior, the prior term cancels.
This leads to a maximum likelihood update rule based solely upon the conditional
mutual information. With this derivation and the iterative update rules we are
ready to answer the next question, by linking our derived updates to the selection
criteria from the literature.
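With a flat prior, the forward step therefore reduces to: add the feature X_k maximising I(X_k; Y | X_S), where X_S is the currently selected set. The following is a minimal sketch of this rule, not the implementation used in this thesis; the function names are illustrative, and the plug-in entropy estimator assumes discrete feature values.

import math
from collections import Counter

def joint_entropy(*columns):
    # Plug-in estimate of the joint entropy (in nats) of one or more discrete columns.
    counts = Counter(zip(*columns))
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def conditional_mi(x, y, conditioning):
    # I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z); the unconditioned case is I(X;Y).
    if not conditioning:
        return joint_entropy(x) + joint_entropy(y) - joint_entropy(x, y)
    return (joint_entropy(x, *conditioning) + joint_entropy(y, *conditioning)
            - joint_entropy(x, y, *conditioning) - joint_entropy(*conditioning))

def greedy_cmi_selection(features, labels, k):
    # features: a list of columns, one sequence of discrete values per feature.
    # At each step, add the feature maximising I(X_k; Y | X_selected).
    selected, remaining = [], set(range(len(features)))
    for _ in range(k):
        conditioning = [features[j] for j in selected]
        best = max(remaining,
                   key=lambda j: conditional_mi(features[j], labels, conditioning))
        selected.append(best)
        remaining.remove(best)
    return selected

Note that each accepted feature enlarges the conditioning set, so the joint distributions being estimated grow rapidly with the size of the selected set; this is precisely the estimation burden that the published criteria, discussed next, avoid by approximation.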
We looked at this question in Chapter 5, considering the criteria from the lit-
erature as approximations to our optimal criterion derived in Chapter 4. We
considered the link to our optimal criterion assuming there was a flat prior over
the possible feature sets, as the criteria we investigated did not include prior
knowledge. This link makes explicit the objective function underlying each of the
published criteria, namely the joint likelihood of a discriminative model.
As the optimal criterion is intractable to estimate, each of the published cri-
teria makes implicit assumptions which reduce the complexity of the estimation
problem. These assumptions make the criteria approximate iterative maximisers
of the joint likelihood. The main theoretical difference between the criteria is
whether they assume the features are class-conditionally independent. Without
this assumption there is an additional class-conditional term in the criteria. One
other important theoretical point is whether they provide a mechanism to bal-
ance the relative magnitude of the redundancy terms against the relevancy term.
To ascertain how these differences impact the criteria in practice, we conducted
an empirical study of 9 different heuristic mutual information criteria across 22
datasets. We analysed how the criteria behave in large/small sample situations,
how the stability of returned feature sets varies between criteria, and how sim-
ilar criteria are in the feature sets they return. These were the main empirical
questions we investigated in that chapter.
Summarising the performance of the criteria under the above conditions: including
the class-conditional term is not always necessary. Various criteria, for example
mRMR, are successful without it. However, criteria that omit this term are blind
to certain classes of problems, e.g. the MADELON dataset (see Section 5.3), and
will perform poorly in these cases. Balancing the relevancy and redundancy terms
is, however, extremely important: criteria like MIFS or CIFE, which allow
redundancy to swamp relevancy, are ranked lowest for accuracy in almost
all experiments. In addition, this imbalance makes the criteria unstable, causing
them to become sensitive to small changes in the supplied data.
Several criteria return wildly different feature sets with just small changes in the
data, while others return similar sets each time and hence are ‘stable’ procedures. As
we might expect, the most stable was the univariate mutual information, as it has
the simplest distributions to estimate, followed closely by JMI [110, 80]; among
the least stable were MIFS [6] and ICAP [58].
We might ask how the theoretical differences between the criteria impact the se-
lected feature sets, and so, using the stability measures, we compared the feature
sets returned by each criterion with those returned by every other criterion. If we visualise the
differences using multi-dimensional scaling (see Figure 5.5) we can see a cluster
of criteria which “balance” the relevancy and redundancy terms (or ignore redun-
dancy completely in the case of MIM), and each criterion which does not balance
the terms is very different from all other criteria. This explains why the empirical
performance of many of the criteria is so similar.
Having explained the behaviour and performance of the criteria in the literature,
we looked at the other benefits provided by our probabilistic framework for
feature selection.
There are many other feature selection criteria in the literature which do not have
links to precise objective functions. It may be interesting to investigate such links
for a criterion like the Gini Index [34], and to derive the objective function implied
by the choice of that criterion. It is
interesting to note that the log-likelihood is an instance of a proper scoring func-
tion [44], and other such scoring functions may also have links to feature selection
criteria. One obvious extension would be to develop feature selection from a
likelihood which combines both priors over θ and differing misclassification costs,
unifying the methods proposed in Chapters 6 and 7.
The third area is based upon the literature in information theoretic feature
selection which we examined in Chapter 5. There we saw how the majority of
published criteria make specific assumptions about the underlying distributions,
and how all the criteria made the assumption that there are only pairwise in-
teractions between features. Relaxing this assumption leads to more complex
information theoretic terms, which consequently have higher data requirements
to ensure they accurately estimate the mutual information. An interesting topic
of study would be to combine this insight with work on entropy estimation [83] or
analysis of the variance of the mutual information [56], to determine for any given
dataset how many of these complex terms can be estimated accurately, and then
to adjust the feature selection criterion accordingly. This would allow an
adaptive feature selection criterion which adjusts to the amount of available data,
behaving like Jcmi when there are many thousands or millions of datapoints, and
scaling back through JMI towards MIM as the number of datapoints shrinks.
The fourth area is more theoretical in nature, and relates to the weighted
mutual information used in Chapter 7. In that chapter we defined the weighted
mutual information so that the weights are a function of a single variable, rather
than both variables in the mutual information. We proved how this ensures the
non-negativity of the measure, which is important for feature selection to ensure
we are maximising the likelihood. However, the original weighted conditional
likelihood of Dmochowski et al. [31] does not include this constraint, and the
weights are allowed to depend upon both x and y. In this case it is possible for
the weighted mutual information to take negative values, and a very interesting
subject is the nature of such negative values. This gives rise to some interesting
questions: “What does a negative Iw imply about the nature of the relationship
between X and Y ?”, and “What does it imply about the cost-sensitive problem
that we wish to optimise?”.
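As a concrete rendering of the w(y) form used in Chapter 7, the following sketch computes I_w(X;Y) from a joint probability table; the function name and array layout are ours, not the thesis's.

import numpy as np

def weighted_mutual_information(p_xy, w_y):
    # I_w(X;Y) = sum_{x,y} w(y) p(x,y) log[ p(x,y) / (p(x) p(y)) ].
    # p_xy: joint probability table with rows indexed by x and columns by y.
    # w_y: one non-negative weight per value of y. Each column contributes
    # w(y) p(y) KL( p(x|y) || p(x) ) >= 0, so the measure cannot go negative.
    # Allowing the weights to vary with x as well, as in Dmochowski et al. [31],
    # breaks this argument, and negative values become possible.
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(p_xy > 0, p_xy / (p_x * p_y), 1.0)  # empty cells contribute 0
    return float(np.sum(np.asarray(w_y)[np.newaxis, :] * p_xy * np.log(ratio)))

Setting every weight to 1 recovers the standard mutual information, so this weighted measure is a strict generalisation of the unweighted one.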
The final area involves extending the techniques presented in this thesis to new
problems. The material on priors in Chapter 6 considered very simple priors, and
more performance may be available by utilising more complex and informative
priors. The material on cost-sensitivity is framed in the context of multi-class
problems, as those are more likely to exhibit the property where each label has a
different group of predictive features. Exploring two-class problems where this is
the case, and extending the framework to cope with multi-label datasets, would
provide interesting applications of the weighted feature selection approach.
One topic which we chose not to consider in this thesis is the interaction
between the search method and the feature selection criterion. Preliminary work
suggests that for some of the criteria examined in Chapter 5 the choice of search
method is irrelevant to the feature set returned. Forward searches, backwards
searches and floating searches all returned the same (or very similar) feature sets,
across a number of datasets. Using the sum of J(Xj) over the whole selected
feature set as the objective function also returned the same feature sets as the
greedier search methods mentioned earlier. It would be interesting to do a more
extensive empirical study to see if this strange effect is borne out across many
datasets, as it implies the choice of search method has little relevance compared
to the choice of selection criterion. This is a rather counter-intuitive result given
the performance gains made with other feature selection criteria when changing
the search method (e.g. Pudil et al. [91]), and suggests there may be a deeper
insight into why information theory does not need complex search techniques.
Bibliography
[13] G. Brown. Some thoughts at the interface of ensemble methods and fea-
ture selection (invited talk). In 9th International Workshop on Multiple
Classifier Systems (MCS 2010), volume 5997, page 314. Springer, 2010.
[17] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines.
ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27,
2011.
[21] B. Chidlovskii and L. Lecerf. Scalable feature selection for multi-class prob-
lems. In European Conference on Machine Learning and Knowledge Dis-
covery in Databases (ECML 2008), pages 227–240. Springer, 2008.
[23] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Trans-
actions on Information Theory, 13(1):21–27, 1967.
[35] R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley & Sons,
2001.
[40] F. Fleuret. Fast Binary Feature Selection with Conditional Mutual Infor-
mation. The Journal of Machine Learning Research (JMLR), 5:1531–1555,
2004.
[43] G. Forman. A pitfall and solution in multi-class feature selection for text
classification. In Proceedings of the 21st International Conference on Ma-
chine Learning (ICML 2004). ACM, 2004.
[44] T. Gneiting and A. Raftery. Strictly proper scoring rules, prediction, and
estimation. Journal of the American Statistical Association, 102(477):359–
378, 2007.
[47] G. Gulgezen, Z. Cataltepe, and L. Yu. Stable and accurate feature selection.
Machine Learning and Knowledge Discovery in Databases, pages 455–468,
2009.
[48] B. Guo and M. S. Nixon. Gait Feature Subset Selection by Mutual Informa-
tion. IEEE Transactions on Systems, Man and Cybernetics, 39(1):36–46,
2009.
[49] I. Guyon. Design of experiments for the NIPS 2003 variable selection bench-
mark. http://www.nipsfsc.ecs.soton.ac.uk/papers/NIPS2003-Datasets.pdf,
2003.
[51] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer
classification using support vector machines. Machine Learning, 46(1):389–422, 2002.
[55] M. Hellman and J. Raviv. Probability of error, equivocation, and the Cher-
noff bound. IEEE Transactions on Information Theory, 16(4):368–372,
1970.
[60] K. Kira and L. Rendell. The feature selection problem: Traditional methods
and a new algorithm. In Proceedings of the National Conference on Artificial
Intelligence, pages 129–129. John Wiley & Sons Ltd, 1992.
[62] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial
intelligence, 97(1-2):273–324, 1997.
[65] E. Krupka, A. Navot, and N. Tishby. Learning to select features using their
properties. The Journal of Machine Learning Research (JMLR), 9:2349–
2376, 2008.
[69] N. Kwak and C. H. Choi. Input Feature Selection for Classification Prob-
lems. IEEE Transactions on Neural Networks, 13(1):143–159, 2002.
[71] Y. LeCun and C. Cortes. The MNIST database of handwritten digits, 1998.
[72] D. D. Lewis. Feature selection and feature extraction for text categorization.
In Proceedings of the workshop on Speech and Natural Language, pages 212–
217. Association for Computational Linguistics Morristown, NJ, USA, 1992.
[77] D. Margaritis and S. Thrun. Bayesian network induction via local neighbor-
hoods. In Advances in Neural Information Processing Systems, volume 12,
pages 505–511, 2000.
[88] A. Pocock, M. Luján, and G. Brown. Informative priors for Markov blanket
discovery. In 15th International Conference on Artificial Intelligence and
Statistics, volume 22, pages 905–913, 2012.
[99] P. Somol, P. Pudil, and J. Kittler. Fast branch & bound algorithms for opti-
mal feature selection. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 26(7):900–912, 2004.
[104] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal
of the Royal Statistical Society. Series B (Methodological), pages 267–288,
1996.
[110] H. Yang and J. Moody. Data Visualization and Feature Selection: New
Algorithms for Non-Gaussian Data. Advances in Neural Information Pro-
cessing Systems, 12, 1999.
[111] L. Yu, C. Ding, and S. Loscalzo. Stable feature selection via dense feature
groups. In Proceedings of the 14th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pages 803–811, 2008.