Rakotomamonjy SVR Paper
Abstract
This paper addresses the problem of variable ranking for support vector regression. The ranking criteria that we propose are based on leave-one-out bounds and some of their variants, and for these criteria we have compared different search-space algorithms: recursive feature elimination and scaling-factor optimization based on gradient descent. All these algorithms have been compared on toy problems and real-world QSAR data sets. Results show that the radius-margin criterion is the most efficient criterion for ranking variables. Using this criterion can then lead to support vector regressors with improved error rates while using fewer variables. Our results also support the evidence that the gradient-descent algorithm achieves a better variable ranking than the backward algorithm.
© 2006 Published by Elsevier B.V.
The aim of this paper is to investigate the performance of different bound-based criteria for variable ranking in the context of regression. In that sense, this work is essentially an extension of the work described in [18] to the regression case.

The paper is organized as follows: Section 2 briefly describes the support vector regression algorithm and some leave-one-out error bounds of this algorithm. Section 3 introduces the variable ranking criteria that we used and the different algorithms that have been considered for selecting these variables. Numerical experiments on toy and real-world problems are presented in Section 4, while concluding remarks and perspectives are given in Section 5.

2. SVM regression algorithm and bounds on leave-one-out

This section briefly reviews the SVM regression algorithm that will be used throughout the paper and presents some performance bounds of such an algorithm.

2.1. Support vector regression

Suppose one has a training data set $\{(x_i, y_i)\}_{i=1,\ldots,\ell} \in \mathcal{X} \times \mathbb{R}$, where the $x_i$ are the training vectors and the $y_i$ their target values. The idea of SVM regression (SVR) is to look for a function $f(x) = \langle w, \Phi(x)\rangle_{\mathcal{H}} + b$, with $w$ and $\Phi(x)$ being some vectors of a given reproducing kernel Hilbert space $\mathcal{H}$, that minimizes the following regularized risk [21,19]:

$$R[f] = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{\ell} L(y_i, x_i, f), \qquad (1)$$

where $L(y, x, f)$ is either the linear $\varepsilon$-insensitive loss function defined by

$$L(y, x, f) = |y - f(x)|_{\varepsilon} = \max(0, |y - f(x)| - \varepsilon)$$

or the quadratic $\varepsilon$-insensitive loss function given by

$$L(y, x, f) = |y - f(x)|_{\varepsilon}^2.$$

For the sequel of the paper, we consider only SVR with the quadratic $\varepsilon$-insensitive loss function; in this case, Eq. (1) is equivalent to

$$\min_{w, b, \xi_i, \hat{\xi}_i} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{\ell} (\xi_i^2 + \hat{\xi}_i^2) \qquad (2)$$

with

$$f(x_i) - y_i \leq \varepsilon + \xi_i, \quad i = 1, \ldots, \ell,$$
$$y_i - f(x_i) \leq \varepsilon + \hat{\xi}_i, \quad i = 1, \ldots, \ell.$$

The solution of this problem can be obtained by means of Lagrangian theory [5] and it is easy to show that

$$w = \sum_{i=1}^{\ell} (\hat{\alpha}_i - \alpha_i)\,\Phi(x_i), \qquad (3)$$

where $\{\hat{\alpha}_i, \alpha_i\}_{i=1,\ldots,\ell}$ are the Lagrangian multipliers associated to the constraints given in Eq. (2) and their values are the solution of the following dual problem:

$$\max_{\alpha_i, \hat{\alpha}_i} \; \sum_{i=1}^{\ell} y_i (\hat{\alpha}_i - \alpha_i) - \varepsilon \sum_{i=1}^{\ell} (\hat{\alpha}_i + \alpha_i) - \frac{1}{2} \sum_{i,j=1}^{\ell} (\hat{\alpha}_i - \alpha_i)(\hat{\alpha}_j - \alpha_j)\left[K(x_i, x_j) + \frac{\delta_{i,j}}{C}\right]$$

$$\text{s.t.} \quad \sum_{i=1}^{\ell} (\hat{\alpha}_i - \alpha_i) = 0,$$

with $K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j)\rangle_{\mathcal{H}}$ and $\delta_{i,j}$ being the Kronecker symbol. This last problem can be efficiently solved by using quadratic programming techniques associated with some decomposition methods, so that using SVR for very large data sets becomes possible [8,16].

2.2. Leave-one-out bounds for SVR

Support vector machines for classification are known to be based on a strong theoretical analysis of the generalization performance of learning algorithms [21]. Accordingly, several works have dealt with bounds on the leave-one-out error, which is known to be an almost unbiased estimator of the expected generalization error. For SVM in classification, the most frequently used bounds are the so-called radius-margin bound [21] and the span-estimate bound [22].

Recently, Chang and Lin have extended these results to SVR and have proposed the following bounds for regression [6]. Let $F$ be the set of examples $x_i$ such that the related Lagrangian multipliers $\alpha_i$ and $\hat{\alpha}_i$ satisfy $\alpha_i + \hat{\alpha}_i > 0$.

Theorem 2.2.1. Under the hypothesis that $F$ is non-empty, the leave-one-out error for SVR is bounded by

$$4\tilde{R}^2 \sum_{i \in F} (\alpha_i + \hat{\alpha}_i) + \ell\varepsilon,$$

where $\tilde{R}$ is the radius of the smallest sphere containing the points $\{\tilde{\Phi}(x_i)\}_{i=1,\ldots,\ell}$ with

$$\tilde{\Phi}(x_i) = \begin{bmatrix} \Phi(x_i) \\ e_i/\sqrt{C} \end{bmatrix}$$

and $e_i$ the $i$th vector of the canonical basis of $\mathbb{R}^{\ell}$.

For the span-estimate bound, they proposed the following theorem:

Theorem 2.2.2. Under the hypothesis that $F$ is non-empty, and that the set of support vectors remains the same during the leave-one-out procedure, the leave-one-out error for SVR is bounded by

$$\sum_{i \in F} (\alpha_i + \hat{\alpha}_i)\,S_i^2 + \ell\varepsilon,$$
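As a concrete illustration of the augmented kernel $K(x_i, x_j) + \delta_{i,j}/C$ appearing in the dual above, consider the special case $\varepsilon = 0$ with no bias term $b$: the dual then depends only on $\beta_i = \hat{\alpha}_i - \alpha_i$ and reduces to maximizing $y^\top\beta - \frac{1}{2}\beta^\top(K + I/C)\beta$, whose solution is a single linear system. The following numpy sketch is ours (not the paper's code) and the function names are illustrative:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def fit_l2_svr(X, y, C=10.0, gamma=1.0):
    """Quadratic-loss SVR with eps = 0 and no bias: maximizing
    y^T beta - 0.5 * beta^T (K + I/C) beta gives (K + I/C) beta = y,
    with beta_i = alpha_hat_i - alpha_i as in the dual above."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + np.eye(len(y)) / C, y)

def predict(X_train, beta, X_new, gamma=1.0):
    """f(x) = sum_i beta_i K(x_i, x), i.e. <w, Phi(x)> with w from Eq. (3)."""
    return rbf_kernel(X_new, X_train, gamma) @ beta

# demo: with C large the 1/C regularization vanishes and the fit interpolates
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.sin(X).ravel()
beta = fit_l2_svr(X, y, C=1e6)
pred = predict(X, beta, X)
```

With a moderate $C$, the extra $I/C$ term acts exactly as the ridge-like regularization that the quadratic slack penalty induces in the dual.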
ARTICLE IN PRESS
A. Rakotomamonjy / Neurocomputing 70 (2007) 1489–1501 1491
3. The problem of variable subset ranking

This section introduces a framework for variable ranking which is based on the work of Lal et al. [15] on wrapper methods. It also details the algorithms we have proposed for solving this SVR-based variable ranking problem.

3.1. The big picture

Consider that the objective of feature selection is primarily to optimize the performance of a classifier by selecting a subset of variables. Then, given a learning set $S = \{(x_i, y_i)\}_{i=1,\ldots,\ell} \in \mathcal{X} \times \mathcal{Y}$, drawn i.i.d. according to a distribution $P(x, y)$, the problem of feature selection is equivalent to looking for a vector $\nu \in \{0,1\}^d$ and a function of $\mathcal{H}$, parametrized by a vector $\gamma$ and noted $f_\gamma$, that minimizes the risk:

$$R_\nu[f_\gamma] = \int L(y, f_\gamma(\nu \odot x)) \, dP(x, y), \qquad (5)$$

where the operator $\odot$ denotes the componentwise vector product. However, this optimization problem cannot be solved since $P(x, y)$ is unknown. Hence, given the training set $S$, it is common to minimize an empirical risk based on $S$. For a wrapper-based variable selection problem, this boils down to the following problem [15]:

$$\inf_{\nu \in \{0,1\}^d} G(f^*, \gamma^*, \nu, S) \quad \text{s.t.} \quad \begin{cases} s(\nu) \leq \nu_0, \\ \gamma^* = T(\mathcal{H}, \nu, S), \\ f^*(x) = f_{\gamma^*}(x), \end{cases} \qquad (6)$$

where $T$ is an induction algorithm, $f^*$ and $\gamma^*$ are, respectively, the regression function and its parameters returned by $T$, $G$ is a function that is related to the performance of $f^*$, $s(\cdot)$ is a function that measures the sparsity of the vector $\nu$, and $\nu_0$ is a user-defined parameter that denotes the desired sparsity of $\nu$.

The two criteria $G_R$ and $G_S$ are bounds on the leave-one-out error, so they fit in the algorithmic framework of Eq. (6) since they are directly related to the performance of the SVR. The criterion $G_\alpha$ is not a bound, but it is a term involved in both the radius-margin and span-estimate bounds, and thus minimizing this term may also minimize those bounds. Hence, we propose to use $G_\alpha$ as a criterion for variable ranking because it is cheaper to compute: compared to $G_R$ and $G_S$, it needs neither the solution of a quadratic programming problem (for computing $\tilde{R}$) nor the inversion of a matrix (for computing $S_i^2$). Finally, the criterion $G_w$ is the norm of the regression function in $\mathcal{H}$. Using this criterion for variable subset ranking may be surprising, but we expect that, for a fixed value of $C$ in the primal SVR problem (2), the more the function $f$ is ``regular'' according to a set of variables, the more it is ``stable'' and thus the more this set of variables is relevant with regard to the target values. Hence, in that sense, $G_w$ can be considered as an interesting criterion.

3.3. Search-space algorithms

Once a criterion $G(f^*, \gamma^*, \nu, S)$ has been defined, one needs an algorithm for minimizing Eq. (6) with respect to $\nu$. However, if $s(\nu) = \|\nu\|_0$, this problem is NP-hard [1], so one has to rely on approximation strategies. Furthermore, since we are interested in variable ranking, we are rather interested in a set of vectors $\{\nu_k\}_{k=1,\ldots,d}$ so that, for each $k$, $\nu_k$ is the vector that optimizes Eq. (6) with $s(\nu) = \|\nu\|_0$ and $\nu_0 = k$.

There are several suboptimal approaches for finding the vector $\nu$ that optimizes Eq. (6). In this work, we have used two of these search-space algorithms:

A greedy iterative algorithm based on backward selection [9]. In this case, the variable ranking is obtained by the procedure below.
Algorithm 1. Backward variable ranking:

    Initialization: RANKED = ∅; VAR = [1, ..., d]
    while VAR is not empty do
        for all variables i in VAR do
            Temporarily remove variable i from VAR
            Compute all parameters needed for G^(i)
            Compute the ranking criterion G^(i)
        end for
        RANKVAR = arg min_i G^(i)
        Rank variable: RANKED = [RANKVAR, RANKED]
        Remove variable RANKVAR from the variable list VAR
    end while

Table 1
Notation of the different algorithms

Criterion    Backward       Gradient-descent
$G_w$        $G_w$-B        $G_w$-GD
$G_\alpha$   $G_\alpha$-B   $G_\alpha$-GD
$G_R$        $G_R$-B        $G_R$-GD
$G_S$        $G_S$-B        $G_S$-GD

Eight algorithms have been obtained by combining all the criteria and the search-space methods.

Then, one needs to evaluate the gradient of the bounds with regard to a variable $\nu_i$. Details of the gradient computation for a given criterion can be found in [6],
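The backward procedure above can be sketched in a few lines. The code below is our illustrative reconstruction: the function names are ours, and the least-squares residual stands in for the SVR-bound criteria $G^{(i)}$, which any of the rankers of Table 1 could replace.

```python
import numpy as np

def lstsq_residual(X, y, keep):
    """Illustrative stand-in for the criterion G^(i): squared residual of a
    least-squares fit using only the columns listed in `keep`."""
    if not keep:
        return float(y @ y)
    coef, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
    r = y - X[:, keep] @ coef
    return float(r @ r)

def backward_rank(X, y, criterion=lstsq_residual):
    """Greedy backward elimination: repeatedly drop the variable whose
    removal yields the smallest criterion value; variables removed last
    are ranked first (most relevant), as in RANKED = [RANKVAR, RANKED]."""
    var = list(range(X.shape[1]))
    ranked = []
    while var:
        scores = [criterion(X, y, [v for v in var if v != i]) for i in var]
        worst = var[int(np.argmin(scores))]   # cheapest variable to remove
        ranked.insert(0, worst)
        var.remove(worst)
    return ranked

# demo: three relevant variables out of six (noiseless linear target)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
y = X[:, 0] + 2 * X[:, 1] - X[:, 2]
rank = backward_rank(X, y)
```

Irrelevant columns cost nothing to remove, so they are eliminated first and end up at the bottom of the ranking.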
and they have been obtained using the results of [2,7]. Here, we only give the final results:

$G_w$ gradient:

$$\frac{\partial G_w}{\partial \nu_i} = \sum_{k,j} (\hat{\alpha}_k - \alpha_k)(\hat{\alpha}_j - \alpha_j)\,\frac{\partial K(\nu \odot x_k, \nu \odot x_j)}{\partial \nu_i}.$$

$G_\alpha$ gradient:

$$\frac{\partial G_\alpha}{\partial \nu_i} = \sum_{j \in F} \frac{\partial (\hat{\alpha}_j + \alpha_j)}{\partial \nu_i}.$$

For a gradient-descent-based algorithm, the vector $\nu$ has to be initialized. And since the minimization problem is non-convex, it is very likely that the resulting variable ranking depends on that initial value. One solution to that problem would be to run the algorithm several times with different initializations and keep the solution which minimizes the criterion, but this would be time consuming. Furthermore, starting with a scaling vector $\nu$ equal to $[1, \ldots, 1]$ has a particular meaning since it gives equal weights to all variables. We have run some numerical experiments that aim at illuminating the issues involved in considering a fixed or random initialization of the vector $\nu$.
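To make the $G_w$ gradient concrete, here is a minimal numpy sketch of one descent step on the scaling vector for an anisotropic Gaussian kernel $K(\nu \odot x_k, \nu \odot x_j) = \exp(-\gamma \sum_m \nu_m^2 (x_{km} - x_{jm})^2)$. It is our illustration, not the paper's code, and it simplifies the procedure by holding the dual coefficients $\beta = \hat{\alpha} - \alpha$ fixed and using a fixed small step (the paper re-solves the SVR and follows the normalized gradient with a line search):

```python
import numpy as np

def gw_and_grad(X, beta, nu, gamma=0.5):
    """G_w = beta^T K(nu) beta and its gradient w.r.t. the scaling vector nu,
    holding beta fixed.  For the scaled Gaussian kernel,
    dK_kj/dnu_i = -2 * gamma * nu_i * (x_ki - x_ji)^2 * K_kj."""
    D = (X[:, None, :] - X[None, :, :]) ** 2          # (l, l, d) squared diffs
    K = np.exp(-gamma * (D * nu ** 2).sum(axis=2))
    gw = float(beta @ K @ beta)
    B = np.outer(beta, beta) * K                      # beta_k * beta_j * K_kj
    grad = -2.0 * gamma * nu * np.einsum('kj,kjm->m', B, D)
    return gw, grad

def scaling_step(X, beta, nu, step=1e-3, gamma=0.5):
    """One descent step along the normalized gradient, clipped to nu >= 0."""
    _, g = gw_and_grad(X, beta, nu, gamma)
    return np.clip(nu - step * g / np.linalg.norm(g), 0.0, None)

# demo: one step starting from the equal-weight vector nu = [1, ..., 1]
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 4))
beta = rng.standard_normal(20)
nu0 = np.ones(4)
gw0, grad = gw_and_grad(X, beta, nu0)
nu1 = scaling_step(X, beta, nu0, step=1e-4)
```

After convergence of such iterations, the magnitudes of the entries of $\nu$ can be read as a variable ranking.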
3.6. Relations with other variable selection algorithms

In the previous paragraphs, we have described several novel criteria for variable ranking in the context of support vector regression. These criteria have been associated with two different search-space algorithms, leading to eight novel variable selection methods.

Among all these methods, some can be related to other variable selection algorithms proposed in the literature. For instance, Mukkamala et al. [17] have proposed an approach that is similar to our Algorithm 1. In fact, they rank variables by using one step of a backward variable elimination algorithm, and their criterion is mainly based on the classifier accuracy measured on a validation set. This is a classical wrapper approach applied to SVMs [14]. Our algorithm differs from theirs in two points: (i) we retrain the learning machine after variable elimination and perform the full backward elimination; (ii) we use a bound on the learning machine performance as a criterion. Hence, we do not need to rely on a validation set.

In the context of SVM classifiers, sensitivity-based criteria have also been used for variable ranking. These methods have been originally introduced by Hermes and Buhmann [13] and Guyon et al. [12]. Hermes et al. have suggested to rank each variable according to the angle between the gradient of the decision function $f(x)$ and the unit vectors $e_j$. It is easy to show (by calculating $(\nabla_x K(x_i, x))^T e_j$) that their ranking criterion is strongly related to a criterion based on

$$G^{(i)} = \frac{\partial f(\nu \odot x)}{\partial \nu_i} \qquad (7)$$

evaluated at $\nu = [1, \ldots, 1]$. Hence their idea can be extended to the regression case and to other sensitivity measures. For instance, we can use as a criterion

$$G^{(i)} = \frac{\partial G}{\partial \nu_i}, \qquad (8)$$

where $G$ can be one of the criteria described in Section 3.2. The underlying idea of such an algorithm is to rank variables according to their (positive or negative) influence on the leave-one-out bounds. However, since this approach is strongly related to the work of Hermes et al., and in order to limit the experimental results for the sake of clarity, we have not reported results based on these criteria in this paper.

4. Numerical results

This section reports the results of our empirical analysis of the above-presented variable selection algorithms. Here, we have compared the performance of our methods to the variable ranking achieved by Sparse SVMs [3]¹ and a simple variable selection method based on correlation coefficients. The performance measures that have been used for comparison are the mean number of variables correctly top ranked (if there are $j$ relevant variables in a problem, all $j$ variables should be ranked in the top $j$) and a normalized mean absolute error:

$$\mathrm{nmae}(y^p, y^t) = \frac{\sum_{i=1}^{\ell} |y_i^p - y_i^t|}{\sum_{i=1}^{\ell} |y_i^t|},$$

where $y_i^t$ and $y_i^p$ are, respectively, the true and predicted target values of a given $x_i$.

4.1. Experiment on linear toy data

For the sake of sanity checking, we have first tested our different algorithms on a simple linear regression problem. For this experiment, the data have been created according to the following steps. The input data are composed of 55 normally distributed variables, and only the first three of them are related to the output by the relation

$$y = x^{(1)} + 2x^{(2)} - x^{(3)}.$$

The experiment that we carried out is then the following. All the variable ranking algorithms have been run with training sets of increasing size, and we have counted the number of times the variables $x^{(1)}, x^{(2)}, x^{(3)}$ have been ranked in the top three variables. Naturally, the SVR variable ranking is based on a linear kernel; we have set the hyperparameters to $C = 1000$ and $\varepsilon = 0.01$, and we have used them for both variable ranking and regression function estimation. For obtaining these hyperparameter values, we have finely sampled a large range of hyperparameter values, and for each sample, we have measured the average performance over 20 trials of an SVR trained on 100 examples. In other words, these parameters correspond to the ``best'' parameters if all variables are used. Since this first experiment aims at checking the correctness of the algorithms, we have used a single initialization with $\nu = [1, \ldots, 1]$ for the gradient-descent-based search-space algorithms.

For each training set size, the results have been averaged over 100 trials with random draws of the training set. Fig. 1 shows the results. The first interesting result is that all the algorithms perform correctly and are able to distinguish the relevant variables as soon as the training set size is large enough. Besides, as the training set size increases, all algorithms perform better than the variable ranking method based on correlation coefficients. It seems that our algorithms are more robust to noisy variables than the correlation coefficient, which is theoretically well suited for ranking variables in a linear problem.

According to these results, it seems that for a linear problem, gradient-descent-based search-space methods are the most efficient since, regardless of the criterion, they are able to correctly rank the relevant variables even for small training set sizes.

¹ Note that here we have directly used the output of the Sparse SVM as a variable ranker and we have not performed bagging as described originally in the paper of Bi et al. [3].
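The normalized mean absolute error, the linear toy data of Section 4.1, and the correlation-coefficient baseline can be sketched as follows (our illustrative code; the generating model is the one given above, and the function names are ours):

```python
import numpy as np

def nmae(y_pred, y_true):
    """Normalized mean absolute error: sum_i |y^p_i - y^t_i| / sum_i |y^t_i|."""
    return np.abs(y_pred - y_true).sum() / np.abs(y_true).sum()

def make_linear_toy(n, d=55, rng=None):
    """d standard normal inputs; only the first three drive the output:
    y = x(1) + 2*x(2) - x(3)."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, d))
    y = X[:, 0] + 2 * X[:, 1] - X[:, 2]
    return X, y

def corr_rank(X, y):
    """Filter baseline: rank variables by decreasing |Pearson correlation|."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return list(np.argsort(-np.abs(corr)))

# demo: with enough samples the filter recovers the three relevant variables
X, y = make_linear_toy(2000, rng=0)
order = corr_rank(X, y)
```

Note that a perfect prediction gives nmae = 0 and the trivial all-zero prediction gives nmae = 1, which makes errors comparable across data sets with different target scales.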
[Fig. 1: grouped bar charts (counts from 0 to 3), one row per training set size (20, 30, 40, 50); bars for L1, CC, and criteria $G_w$, $G_\alpha$, $G_R$, $G_S$ under the backward and gradient-descent search strategies.]
Fig. 1. Linear problem. Number of correct variables ranked in the top 3 with respect to the training set size (from top to bottom) and the algorithms (from
left to right). L1 and CC, respectively, denote sparse SVMs variable ranking algorithm and a filter method based on correlation coefficient.
4.2. Experiments on nonlinear toy data

The toy data that we used for these series of experiments are the following: five input variables $x^{(1)}, \ldots, x^{(5)}$ have been drawn independently according to a normal distribution, and the output variable $y$ is obtained through the relation

$$y = x^{(1)} + 2x^{(2)} + \cos(x^{(3)}) + (x^{(4)})^2 + e^{x^{(5)}};$$

then 50 other normally distributed variables, independent of $y$, have been added to the experimental data sets. This yields a data set $\{x_i, y_i\}_{i=1,\ldots,\ell} \in \mathbb{R}^d \times \mathbb{R}$ with $d = 55$.

4.2.1. Performance with respect to training set size

The first experiment deals with the performance of the different algorithms with regard to the training set size. Our algorithms have been used as variable ranking algorithms, and for each method the top five ranked variables have been used as predictive variables in an SVR for $y$. Ten thousand examples of artificial data have been generated, but only $\ell$ of them have been used as the training set, whereas the remaining part is considered as the test set. The training set has been normalized to zero mean and unit variance, and the test set has also been normalized with respect to the training data normalization parameters.

For this experiment, the hyperparameters $C$, $\varepsilon$ and the width of the Gaussian kernel $\sigma$ have been set to $C = 10$, $\varepsilon = 0.01$ and $\sigma = 10$. The same procedure as the one described for the linear problem has been used for selecting these hyperparameter values.

For the gradient-descent-based algorithms, the gradient has been normalized to a unit vector and a simple line search is performed in order to compute the next iterate of the scaling vector. This iterative process is stopped when the relative norm difference between the scaling vectors $\nu$ at two consecutive steps is lower than $0.01$ or when the number of iterations reaches 30 (we have checked that this latter condition is very rarely met). For this experiment, we have used 20 random initializations of the vector $\nu$, and we have used for ranking the optimized vector associated with the lowest criterion value after optimization.

The $\eta$ parameter of the span estimate has been set to $0.01$. The hyperparameters of VS-SSVM are the hyperparameters ($C = 0.1$ and $\varepsilon = 0.06$) that minimize the mean absolute value of the test error of a sparse linear model as described in [3].

The results of this experiment are depicted in Figs. 2–5. Figs. 2–4 plot the normalized mean absolute error of each algorithm with respect to the training set size. When comparing the performances of the criteria $G_w$, $G_\alpha$, $G_R$ and $G_S$ regardless of the search-space method, it seems that all criteria, except $G_w$ with gradient descent, yield equivalent performances. A slight advantage for $G_\alpha$ and $G_R$ using backward or gradient-descent methods can be seen. The poor performance of $G_w$, as we will show in a following subsection, is essentially due to the several random initializations of the vector $\nu$; in fact, it seems that this criterion performs better when $\nu$ is initialized with a vector of ones.

Fig. 4 depicts the performance of the baseline algorithms. As expected, a simple SVR using all variables performs badly, and variable selection by means of Sparse SVMs or correlation coefficients does not compare favorably to our methods based on gradient-descent and backward algorithms, especially when the learning set size becomes large. This is simply explained by the fact that
[Fig. 2: two panels, $G_w$ (left) and $G_\alpha$ (right); normalized mean absolute error (0.3–0.8) versus training set size (0–100); (solid) backward, (dotted) gradient descent.]
Fig. 2. Nonlinear problem. Plots of the normalized mean absolute error of an SVR fed with the top 5 ranked variables with regard to the training set size. The left and right figures, respectively, depict the performance of an SVR associated with variable selection algorithms using criteria $G_w$ and $G_\alpha$. The curves in each plot correspond to a search-space algorithm: (solid) backward selection; (dotted) gradient descent.
[Fig. 3: two panels, $G_R$ (left) and $G_S$ (right); normalized mean absolute error versus training set size.]
Fig. 3. Nonlinear problem. Plots of the normalized mean absolute error of an SVR fed with the top 5 ranked variables with regard to the training set size. The left and right figures, respectively, depict the performance of an SVR associated with variable selection algorithms using criteria $G_R$ and $G_S$. The curves in each plot correspond to a search-space algorithm: (solid) backward selection; (dotted) gradient descent.
[Fig. 4: two benchmark panels; normalized mean absolute error (0.3–0.8) versus training set size (0–100); legend entries include $G_\alpha$ backward, $G_R$ GD, L1, Corr Coef, and SVM.]
Fig. 4. Nonlinear problem. Plots of the normalized mean absolute error of an SVR fed with the top 5 ranked variables with regard to the training set size. The left figure compares the performances of the SVR combined with the following variable selection algorithms: (solid) $G_\alpha$ with backward selection; (dotted) $G_R$ with gradient descent; (dashed) VS-SSVM; (dash-dotted) $G_S$ with sensitivity method and $C = 1$. The right figure plots the performances of an SVR combined with baseline variable selection algorithms: (solid) variable selection with VS-SSVM; (dotted) correlation coefficient filter method; (dashed) ``using all variables''.
[Fig. 5: grouped bar charts (counts from 0 to 5), one row per training set size (20, 30, 40, 50, 75, 100); bars for L1, CC, and criteria $G_w$, $G_\alpha$, $G_R$, $G_S$ under the backward and gradient-descent search strategies.]
Fig. 5. Nonlinear problem. Number of correct variables ranked in the top 5 with respect to the training set size (from top to bottom) and the algorithms
(from left to right). The star shows the best performing algorithm for a given training set size.
4.2.3. How to select the ranked variables

One of the drawbacks of variable ranking comes out when the question of how many of these variables to use for prediction is raised. This problem has already been addressed in the literature and there are many methods to cope with it. Typically, the easiest way to deal with this problem is to select the number of variables to use by means of a validation or a leave-one-out error of the regression function; however, this is computationally expensive. Alternatively, artificial probe random variables, independent of the output, can be added to the training data; then, depending on how these random variables are ranked, one can select the variables to use for prediction [3]. Here, we have used SVR bounds in order to select the number of variables to use. For instance, Fig. 6 plots the normalized bounds and the test error with respect to the number of variables used for prediction, for different training set sizes. We can see that the span-estimate bound is able to correctly estimate the number of variables to use, especially for large data sets. However, for small data sets, it seems that the span-estimate bound behavior differs from the one of the test error, leading to a suboptimal number of variables. On the other hand, the radius-margin bound seems to be too conservative and tends to select too few variables. These findings are in agreement with those of Chang and Lin [6], where they conclude that model selection with the span estimate is more robust.

4.2.4. Influence of cross-terms

In the above-described experiments, the target function does not have any cross-terms. So we have also run some experiments in order to investigate how our algorithms perform in the presence of cross-terms. Hence, we have built the following function:

$$y = x^{(1)} + 2x^{(2)} + \cos(x^{(3)}) + (x^{(4)})^2 + e^{x^{(5)}} + x^{(1)}x^{(2)} + x^{(1)}e^{x^{(2)}}.$$

Again, we have used the same experimental setting as above. At first, we have computed the optimal hyperparameters according to the above-described procedure, which yields $C = 5$, $\sigma = 7$ and $\varepsilon = 0.01$. For VS-SSVM, the optimal hyperparameters are $C = 0.1$ and $\varepsilon = 0.5$.

Fig. 7 depicts the bar charts of the influence of the cross-terms on the ranking performance. In fact, we have plotted the difference (absence minus presence of cross-terms) of the number of variables correctly ranked in the top five variables. We can clearly see that cross-terms indeed negatively influence the performances for all criteria associated with the backward and the gradient-descent algorithms. However, the impact of these cross-terms is rather benign since, on average over the 100 trials, the maximal decrease of the number of correctly ranked variables is of the order of 0.6.

4.3. Real-world data sets

We have tested our algorithms on some real-world problems such as QSAR data and DNA microarray problems. Information on these data sets, including references where they can be found, is given in Table 4. As described, some of these data sets deal with classification problems. In such cases, we have still addressed them as regression problems: we have naturally considered the class label of a given example as its target value.

4.3.1. Results

The performance measure we used is the normalized mean absolute error. All problems have been addressed by a Gaussian SVR except ``colon'' and ``lymphoma'', for which we have used a linear kernel. The reason for using a linear solution for these two data sets is that they have been widely used for illustrating variable selection algorithm performance and in all cases [23,18] a linear kernel has been used.
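The probe-variable heuristic mentioned in Section 4.2.3 [3] can be sketched as follows. This is our illustrative code: for simplicity it plugs in a correlation filter as the ranking function, but any of the SVR-based rankers could be used instead, and all names are ours.

```python
import numpy as np

def corr_filter_rank(X, y):
    """Illustrative ranker: decreasing |Pearson correlation| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    c = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return list(np.argsort(-np.abs(c)))

def select_by_probes(X, y, rank_fn, n_probes=10, rng=None):
    """Append `n_probes` pure-noise probe columns, rank all columns with
    `rank_fn`, and keep only the real variables ranked above the first
    probe: a real variable that a random probe beats is deemed irrelevant."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    probes = rng.standard_normal((X.shape[0], n_probes))
    order = rank_fn(np.hstack([X, probes]), y)
    selected = []
    for idx in order:
        if idx >= d:          # first probe reached: cut the ranking here
            break
        selected.append(idx)
    return selected

# demo: 55 inputs, 3 relevant (y = x1 + 2*x2 - x3), noise probes as cut-off
rng_demo = np.random.default_rng(0)
X = rng_demo.standard_normal((2000, 55))
y = X[:, 0] + 2 * X[:, 1] - X[:, 2]
kept = select_by_probes(X, y, corr_filter_rank, rng=1)
```

This gives an automatic, validation-free cut-off, complementary to the bound-based selection of Fig. 6.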
Fig. 6. Nonlinear problem. Using bounds for evaluating the number of ranked variables to be used for training a SVR. The size of the training set varies
from left to right. (left) ‘ ¼ 30; (middle) ‘ ¼ 50; (right) ‘ ¼ 100.
[Fig. 7: grouped bar charts (differences from 0 to 0.4), one row per training set size (20, 30, 40, 50, 75, 100); bars for L1, CC, and criteria $G_w$, $G_\alpha$, $G_R$, $G_S$ under the backward and gradient-descent search strategies.]
Fig. 7. Nonlinear problem. Difference (absence–presence of cross-terms) of number of correct variables ranked in the top 5 with respect to the training set
size (from top to bottom) and the algorithms (from left to right). Positive values mean that presence of cross-terms reduces the ability of an algorithm to
correctly rank variables.
Table 5
Variable selection performance considering that an oracle has selected the optimal number of variables to use. For each data set, the columns report the mean normalized absolute error (Mean), its variance (Var.) and p-values. In bold, best performances with respect to the mean normalized absolute error are highlighted. [Table body not recovered.]

Table 6
Variable selection performance considering that an oracle has selected the optimal number of variables to use; same column layout as Table 5. [Table body not recovered.]
5. Conclusions

A radius-margin and a span-estimate bound have been used, and in addition to these two criteria, two other ones which are related to these bounds have also been derived. Besides, variable ranking algorithms also need a search-space algorithm, which is used for selecting variables according to a given criterion. In this paper, we have compared the classical backward selection algorithm and a method based on optimizing a scaling vector through gradient descent. Numerical experiments have then been performed on toy data and real QSAR problems in order to analyze the performance and behavior of each couple of criterion and search-space algorithm. The main conclusions that can be derived from this work are:

- According to our experimental results, if we consider the ranking capability of each algorithm, the best variable ranking criterion is the radius-margin criterion. It yields a significant performance improvement using fewer variables in most of the real-world problems we used. However, this criterion is time-demanding since it requires two quadratic programming problems to be solved.

- In order to compute the span estimate or the radius-margin, one needs either to invert a matrix whose size is of the order of the number of support vectors involved in the variable ranking procedure, or to solve a QP problem. If one wants to trade performance against efficiency, then it is preferable to use $G_\alpha$ as a criterion. This criterion is cheaper to compute than the gradient of the span estimate and still gets good results on the real data.

- Although the gradient-based algorithm gives better performance than the backward selection algorithm, the latter can be useful when the number of variables is very large. In fact, it is more flexible since, at a given iteration, it is possible to rank a bunch of variables, while for the gradient-descent algorithm the gradient according to all variables must be computed.

- Since our variable ranking algorithms are based on support vector regression, the problem of model selection has to be addressed. Ideally, the model selection should also consider the variable ranking procedure. For instance, for the backward feature elimination algorithm, one should consider the ranking and the model selection procedures all together. That means that a variable ranking procedure should have been processed for each hyperparameter setting. This can rapidly become intractable if the space of hyperparameters to be sampled is large. For instance, for say 10 values each of the hyperparameters $C$, $\sigma$ (if we use a Gaussian kernel) and $\varepsilon$, this would have yielded 1000 ranking procedures. And for a backward algorithm with 50 variables, each procedure would have involved $50 + 49 + \cdots + 1$ QP problems to solve. Hence, in all our experiments, we have selected these hyperparameters with regard to the performance of an SVR on all variables. Although this is clearly not optimal, variable ranking with these hyperparameters yields improved performances using fewer variables.

The practical experiments that we have carried out indicate that most of the algorithms we proposed (although we have suggested our preferences to the practitioner) are able to select the relevant variables and to improve the test errors. However, this work still lacks some theoretical analysis, and in our future works we plan to address this point through the theoretical setting of Lal et al. [15].

References

[1] E. Amaldi, V. Kann, On the approximability of minimizing non zero variables or unsatisfied relations in linear systems, Theoret. Comput. Sci. 209 (1998) 237–260.
[2] Y. Bengio, Gradient-based optimization of hyperparameters, Neural Comput. 12 (2000) 1889–1900.
[3] J. Bi, K. Bennett, M. Embrechts, C. Breneman, M. Song, Dimensionality reduction via sparse support vector machines, J. Mach. Learning Res. 3 (2003) 1245–1264.
[4] C. Blake, C. Merz, UCI repository of machine learning databases, <http://www.ics.uci.edu/~mlearn/MLRepository.html>, University of California, Irvine, Department of Information and Computer Sciences, 1998.
[5] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, 2004.
[6] M.-W. Chang, C.-J. Lin, Leave-one-out bounds for support vector regression model selection, Neural Comput. 17 (4) (2005) 1–26.
[7] O. Chapelle, V. Vapnik, O. Bousquet, S. Mukerjhee, Choosing multiple parameters for SVM, Machine Learning 46 (1–3) (2002) 131–159.
[8] R. Collobert, S. Bengio, SVMTorch: support vector machines for large-scale regression problems, J. Mach. Learning Res. 1 (2001) 143–160.
[9] C. Couvreur, Y. Bresler, On the optimality of the backward greedy algorithm for the subset selection problem, SIAM J. Matrix Anal. Appl. 21 (3) (2000) 797–808.
[10] Y. Grandvalet, S. Canu, Adaptive scaling for feature selection in SVMs, in: Advances in Neural Information Processing Systems, vol. 15, MIT Press, Cambridge, MA, 2003.
[11] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learning Res. 3 (2003) 1157–1182.
[12] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Mach. Learning (2000).
[13] L. Hermes, J.M. Buhmann, Feature selection for SVMs, in: Proceedings of the 15th International Conference on Pattern Recognition, vol. 2, 2000, pp. 716–719.
[14] R. Kohavi, G. John, Wrappers for feature subset selection, Artificial Intelligence 97 (1–2) (1997) 273–324.
[15] T. Lal, O. Chapelle, J. Weston, A. Elisseeff, Embedded methods, in: I. Guyon, S. Gunn, M. Nikravesh, L. Zadeh (Eds.), Springer, Berlin, 2005, pp. 1–20.
[16] S.-P. Liao, H.-T. Lin, C.-J. Lin, A note on the decomposition methods for support vector regression, Neural Comput. 14 (2002) 1267–1281.
[17] S. Mukkamala, A. Sung, Feature ranking and selection for intrusion detection systems, in: Proceedings of the International Conference on Information and Knowledge Engineering, 2002.
[18] A. Rakotomamonjy, Variable selection using SVM-based criteria, J. Mach. Learning Res. 3 (2003) 1357–1370.
[19] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, 2004.
[20] R. Todeschini, Milano chemometrics and QSAR research group data sets, <http://www.disat.unimib.it/chm/download/data sets.htm>, 2001.
[21] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[22] V. Vapnik, O. Chapelle, Bounds on error expectation for support vector machines, Neural Comput. 12 (9) (2000) 2013–2036.
[23] J. Weston, A. Elisseeff, B. Scholkopf, M. Tipping, Use of the $\ell_0$-norm with linear models and kernel methods, J. Mach. Learning Res. 3 (2003) 1439–1461.
[24] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, V. Vapnik, Feature selection for SVMs, in: Advances in Neural Information Processing Systems, vol. 13, 2001.

Alain Rakotomamonjy received an M.S. degree in Electrical Engineering and Computer Science from the Ecole Supérieure d'Electronique de l'Ouest in 1993 and then received his Ph.D. from the University of Orleans in 1997. Since 1999, he has been an assistant professor at the INSA de Rouen, France. Alain's research interests include machine learning, kernel methods, variable selection and signal discrimination.