
Neurocomputing 70 (2007) 1489–1501
doi:10.1016/j.neucom.2006.03.016
www.elsevier.com/locate/neucom

Analysis of SVM regression bounds for variable ranking


A. Rakotomamonjy
LITIS Perception, Systèmes et Informations, INSA de Rouen, Avenue de l'Université, 76801 Saint Etienne du Rouvray, France
E-mail address: alain.rakotomamonjy@insa-rouen.fr
Received 15 February 2005; received in revised form 14 March 2006; accepted 14 March 2006
Communicated by D.R. Musicant
Available online 10 October 2006

Abstract

This paper addresses the problem of variable ranking for support vector regression. The ranking criteria that we propose are based on
leave-one-out bounds and some variants of them, and for these criteria we have compared different search-space algorithms: recursive feature
elimination and scaling-factor optimization based on gradient descent. All these algorithms have been compared on toy problems and
real-world QSAR data sets. Results show that the radius-margin criterion is the most efficient criterion for ranking variables. Using this
criterion can then lead to a support vector regressor with improved error rate while using fewer variables. Our results also support the
evidence that the gradient-descent algorithm achieves a better variable ranking than the backward algorithm.
© 2006 Published by Elsevier B.V.

Keywords: Support vector regression; Variable selection; Kernel methods

1. Introduction

Given a set of training examples {x_i, y_i} ∈ X × Y, the objective of a supervised pattern recognition algorithm is to learn a mapping from an input set X to an output set Y. Ideally, the function f chosen by a pattern recognition algorithm should minimize a generalization error, that is to say, it should predict, as accurately as possible, the correct y associated to an unseen x.

Subproblems of pattern recognition are model selection and variable selection. Variable selection can be understood as finding a variable subset (a subset of X) that leads to the lowest generalization error. The benefits of variable selection are generally three-fold: (i) reducing the number of useful variables also reduces data measurement and storage requirements, with the side effect of increased recognition speed; (ii) by selecting appropriate variables, the resulting predictor gets rid of spurious variables and consequently can enhance its performance; (iii) the knowledge of the "best" variables that explain the relationship between the input and output increases the understanding of the underlying process. For a recent comprehensive discussion on variable selection, the reader can refer to Guyon and Elisseeff [11].

Variable selection or variable ranking algorithms can essentially be separated into three categories: filter methods, which select variables independently of a classifier; wrapper methods, for which the way of selecting variables is related to the classifier performance; and embedded methods, for which variable selection and the training stage are performed in a single process.

Many works have focused on variable selection algorithms for SVM classifiers. Some of these works dealt with embedded methods which simultaneously optimize the regular objective function of SVMs and a penalization term on the variables [10,23]. Another approach for solving this variable selection problem is to use the sensitivity of SVMs to a given variable as a ranking criterion. This method was originally introduced by Hermes and Buhmann [13] and Guyon et al. [12]. Some variants of this sensitivity method have also been proposed; they mainly consist in using the sensitivity of SVM bounds on leave-one-out errors as a variable selection criterion [18]. Some other works also treat this problem by scaling each variable with a factor and then optimizing an SVM error bound with respect to these factors [18,24].

Recently, leave-one-out bounds for the SVM regression algorithm have been proposed by Chang and Lin [6]; hence our aim in this paper is to investigate the performance of different SVM bound-based criteria for variable ranking in the context of regression. In that sense, this work is essentially an extension of the work described in [18] to the regression case.

The paper is organized as follows: Section 2 briefly describes the support vector regression algorithm and some leave-one-out error bounds of this algorithm. Section 3 introduces the variable ranking criteria that we used and the different algorithms that have been considered for selecting these variables. Numerical experiments on toy and real-world problems are presented in Section 4, while concluding remarks and perspectives are given in Section 5.
2. SVM regression algorithm and bounds on the leave-one-out error

This section briefly reviews the SVM regression algorithm that will be used throughout the paper and presents some performance bounds of such an algorithm.

2.1. Support vector regression

Suppose one has a training data set {(x_i, y_i)}_{i=1,...,ℓ} ∈ X × R where the x_i are the training vectors and the y_i their target values. The idea of SVM regression (SVR) is to look for a function f(x) = ⟨w, Φ(x)⟩_H + b, with w and Φ(x) being vectors of a given reproducing kernel Hilbert space H, that minimizes the following regularized risk [21,19]:

$$R[f] = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{\ell} L(y_i, x_i, f), \qquad (1)$$

where L(y, x, f) is either the linear ε-insensitive loss function defined by

$$L(y, x, f) = |y - f(x)|_\varepsilon = \max(0, |y - f(x)| - \varepsilon)$$

or the quadratic ε-insensitive loss function given by

$$L(y, x, f) = |y - f(x)|_\varepsilon^2.$$

For the sequel of the paper, we consider only SVR with the quadratic ε-insensitive loss function; in this case, Eq. (1) is equivalent to

$$\min_{w, b, \xi_i, \hat{\xi}_i} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{\ell} (\xi_i^2 + \hat{\xi}_i^2) \qquad (2)$$

with

$$f(x_i) - y_i \le \varepsilon + \xi_i, \quad y_i - f(x_i) \le \varepsilon + \hat{\xi}_i, \quad i = 1, \ldots, \ell.$$

The solution of this problem can be obtained by means of Lagrangian theory [5] and it is easy to show that

$$w = \sum_{i=1}^{\ell} (\hat{\alpha}_i - \alpha_i)\,\Phi(x_i), \qquad (3)$$

where {α̂_i, α_i}_{i=1,...,ℓ} are the Lagrangian multipliers associated to the constraints given in Eq. (2) and their values are the solution of the following dual problem:

$$\max_{\alpha_i, \hat{\alpha}_i} \; \sum_{i=1}^{\ell} y_i(\hat{\alpha}_i - \alpha_i) - \varepsilon \sum_{i=1}^{\ell} (\hat{\alpha}_i + \alpha_i) - \frac{1}{2} \sum_{i,j=1}^{\ell} (\hat{\alpha}_i - \alpha_i)(\hat{\alpha}_j - \alpha_j)\left[K(x_i, x_j) + \frac{1}{C}\delta_{i,j}\right]$$

with

$$\sum_{i=1}^{\ell} (\hat{\alpha}_i - \alpha_i) = 0, \qquad 0 \le \alpha_i, \; 0 \le \hat{\alpha}_i, \quad i = 1, \ldots, \ell, \qquad (4)$$

with K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩_H and δ_{i,j} being the Kronecker symbol. This last problem can be efficiently solved by using quadratic programming techniques associated with some decomposition methods, so that using SVR for very large data sets becomes possible [8,16].
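For concreteness, the dual problem (4) can be handed to any off-the-shelf quadratic programming solver, since the quadratic ε-insensitive loss simply adds 1/C to the diagonal of the kernel matrix. The following is a minimal sketch, assuming numpy and the cvxopt QP solver are available (neither is prescribed by the paper); it stacks the variables as z = [α; α̂] and solves the resulting standard-form QP.

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False


def svr_dual_quadratic_loss(K, y, C=10.0, eps=0.01):
    """Solve the dual (4) of SVR with quadratic eps-insensitive loss.

    K : (l, l) kernel matrix, y : (l,) targets.  Returns (alpha, alpha_hat).
    """
    l = len(y)
    Kt = K + np.eye(l) / C                       # K(x_i, x_j) + delta_ij / C
    # With z = [alpha; alpha_hat] and B = [-I, I], alpha_hat - alpha = B z and
    # the dual becomes:  min 1/2 z' (B'KtB) z + q'z  s.t.  z >= 0, 1'Bz = 0.
    P = np.block([[Kt, -Kt], [-Kt, Kt]]) + 1e-8 * np.eye(2 * l)
    q = np.hstack([eps + y, eps - y])
    G = matrix(-np.eye(2 * l))                   # -z <= 0
    h = matrix(np.zeros(2 * l))
    A = matrix(np.hstack([-np.ones(l), np.ones(l)]).reshape(1, -1))
    sol = solvers.qp(matrix(P), matrix(q), G, h, A, matrix(0.0))
    z = np.array(sol['x']).ravel()
    return z[:l], z[l:]                          # alpha, alpha_hat


if __name__ == '__main__':
    rng = np.random.RandomState(0)
    X = rng.randn(40, 3)
    y = X[:, 0] + 2 * X[:, 1] - X[:, 2]
    alpha, alpha_hat = svr_dual_quadratic_loss(X @ X.T, y)   # linear kernel
    print((alpha_hat - alpha) @ X)               # w of Eq. (3) for a linear kernel
```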
2.2. Leave-one-out bounds for SVR

Support vector machines for classification are known to be based on a strong theoretical analysis of the generalization performance of learning algorithms [21]. Accordingly, several works have dealt with bounds on the leave-one-out error, which is known to be an almost unbiased estimator of the expected generalization error. For SVMs in classification, the most frequently used bounds are the so-called radius-margin bound [21] and the span-estimate bound [22].

Recently, Chang and Lin have extended these results to SVR and have proposed the following bounds for regression [6]. Let F be the set of examples x_i such that the related Lagrangian multipliers α_i and α̂_i satisfy α_i + α̂_i > 0; then:

Theorem 2.2.1. Under the hypothesis that F is non-empty, the leave-one-out error for SVR is bounded by

$$4\tilde{R}^2 \sum_{i \in F} (\alpha_i + \hat{\alpha}_i) + \ell\varepsilon,$$

where R̃ is the radius of the smallest sphere containing the points {Φ̃(x_i)}_{i=1,...,ℓ} with

$$\tilde{\Phi}(x_i) = \begin{bmatrix} \Phi(x_i) \\ e_i/\sqrt{C} \end{bmatrix}$$

and e_i the i-th vector of the canonical basis of R^ℓ. For the span-estimate bound, they proposed the following theorem:

Theorem 2.2.2. Under the hypothesis that F is non-empty, and that the set of support vectors remains the same during the leave-one-out procedure, the leave-one-out error for SVR is bounded by

$$\sum_{i \in F} (\alpha_i + \hat{\alpha}_i)\, S_i^2 + \ell\varepsilon,$$

where S_i² is the squared distance of Φ̃(x_i) to the span of all the other support vectors {Φ̃(x_j) : α_j + α̂_j > 0}.

Because S_i² is a discontinuous function, Chang and Lin suggested to replace it by a looser but smoother function, namely S̃_i², which is a regularized version of S_i². For the sake of being self-contained, we recall here the equation for computing S̃_i²; for more details about its derivation, we refer the reader to the original paper [6]:

$$\tilde{S}_i^2 = \frac{1}{(\tilde{M}^{-1})_{ii}} - \frac{\eta}{\alpha_i + \hat{\alpha}_i},$$

with

$$\tilde{M} = \begin{bmatrix} \tilde{K}_F + D_F & 1_F \\ 1_F^t & 0 \end{bmatrix},$$

where 1_F is a vector of |F| ones, D_F and K̃_F are |F| × |F| matrices whose elements are, respectively, D_{i,j} = (η/(α_i + α̂_i)) δ_{i,j} and K̃_{i,j} = K(x_i, x_j) + (1/C) δ_{i,j}, and η is a user-defined constant which denotes the amount of smoothing regularization.

3. The problem of variable subset ranking

This section introduces a framework for variable ranking which is based on the work of Lal et al. [15] on wrapper methods. It also details the algorithms we have proposed for solving this SVR-based variable ranking problem.

3.1. The big picture

Consider that the objective of feature selection is primarily to optimize the performance of a classifier by selecting a subset of variables. Then, given a learning set S = {(x_i, y_i)}_{i=1,...,ℓ} ∈ X × Y, drawn i.i.d. according to a distribution P(x, y), the problem of feature selection is equivalent to looking for a vector ν ∈ {0,1}^d and a function of H parametrized by a vector g, noted f_g, that minimizes the risk

$$R_\nu[f_g] = \int L(y, f_g(\nu \odot x)) \, \mathrm{d}P(x, y), \qquad (5)$$

where the operator ⊙ denotes the componentwise vector product. However, this optimization problem cannot be solved since P(x, y) is unknown. Hence, given the training set S, it is common to minimize an empirical risk based on S. For a wrapper-based variable selection problem, this boils down to the following problem [15]:

$$\inf_{\nu \in \{0,1\}^d} G(f^{*}, g^{*}, \nu, S) \quad \text{s.t.} \quad \begin{cases} s(\nu) \le \nu_0, \\ g^{*} = T(H, \nu, S), \\ f^{*}(x) = f_{g^{*}}(x), \end{cases} \qquad (6)$$

where T is an induction algorithm, f* and g* are, respectively, the regression function and its parameters returned by T, G is a function that is related to the performance of f*, s(·) is a function that measures the sparsity of the vector ν, and ν_0 is a user-defined parameter that denotes the desired sparsity of ν. From this equation, we can see that G is actually the criterion according to which variables are selected and ν the vector that drives the optimization problem.

The following subsections give details on the different criteria G we have proposed and on the search-space algorithms we have used for solving the minimization problem given in Eq. (6).

3.2. Variable ranking criteria

In our case, the induction algorithm T is the SVR that has been described in the previous section and the function parameter g* is the set {α, α̂} which optimizes Eq. (4). For our variable subset ranking purpose, we have chosen the following functions as G:

- G_w(α, α̂) = Σ_{i,j=1}^ℓ (α̂_i − α_i)(α̂_j − α_j) K̃(x_i, x_j),
- G_α(α, α̂) = Σ_{i=1}^ℓ (α_i + α̂_i),
- G_R(α, α̂) = R̃² Σ_{i=1}^ℓ (α_i + α̂_i),
- G_S(α, α̂) = Σ_{i=1}^ℓ (α_i + α̂_i) S̃_i².

The two criteria G_R and G_S are bounds on the leave-one-out error, so they fit in the algorithmic framework of Eq. (6) since they are directly related to the performance of the SVR. The criterion G_α is not a bound, but it is a term involved in both the radius-margin and span-estimate bounds, and thus minimizing this term may also minimize those bounds. Hence, we propose to use G_α as a criterion for variable ranking because it is cheaper to compute: compared to G_R and G_S, it needs neither the solution of a quadratic programming problem (for computing R̃²) nor the inversion of a matrix (for computing S̃_i²). Finally, the criterion G_w is the norm of the regression function in H. Using this criterion for variable subset ranking may be surprising, but we expect that, for a fixed value of C in the primal SVR problem (2), the more the function f is "regular" according to a set of variables, the more it is "stable", and thus the more this set of variables is relevant with regard to the target values. Hence, in that sense, G_w can be considered as an interesting criterion.
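To make these criteria concrete, the sketch below evaluates G_w, G_α and the regularized span criterion G_S from a solved dual (4) using plain numpy; G_R additionally requires R̃², the optimal value of the small QP over β given in Section 3.4, and is then simply R̃² · G_α. The support-set threshold and the smoothing constant η are illustrative choices, not values prescribed by the paper.

```python
import numpy as np


def ranking_criteria(K, alpha, alpha_hat, C, eta=0.01):
    """Evaluate G_w, G_alpha and G_S (Section 3.2) from a solved SVR dual.

    K is the training kernel matrix, (alpha, alpha_hat) the solution of (4).
    """
    l = len(alpha)
    Kt = K + np.eye(l) / C                        # modified kernel K~
    d = alpha_hat - alpha
    s = alpha_hat + alpha

    G_w = d @ Kt @ d                              # squared norm of f in H
    G_a = s.sum()                                 # sum_i (alpha_i + alpha_hat_i)

    # Regularized span S~_i^2 (Section 2.2), computed on the support set F;
    # terms with alpha_i + alpha_hat_i = 0 do not contribute to G_S anyway.
    F = np.where(s > 1e-8)[0]
    KF = Kt[np.ix_(F, F)] + np.diag(eta / s[F])   # K~_F + D_F
    M = np.block([[KF, np.ones((len(F), 1))],
                  [np.ones((1, len(F))), np.zeros((1, 1))]])
    Minv = np.linalg.inv(M)
    span2 = 1.0 / np.diag(Minv)[:len(F)] - eta / s[F]
    G_S = np.dot(s[F], span2)
    return G_w, G_a, G_S
```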
3.3. Search-space algorithms

Once a criterion G(f*, g*, ν, S) has been defined, one needs an algorithm for minimizing Eq. (6) with respect to ν. However, if s(ν) = ||ν||_0, this problem is NP-hard [1], so one has to rely on approximation strategies. Furthermore, since we are interested in variable ranking, we are rather interested in a set of vectors {ν_k}_{k=1,...,d} such that, for each k, ν_k is the vector that optimizes Eq. (6) with s(ν) = ||ν||_0 and ν_0 = k.

There are several suboptimal approaches for looking for the vector ν that optimizes Eq. (6). In this work, we have used two of these search-space algorithms:

- A greedy iterative algorithm based on backward selection [9]. In this case, the variable ranking is performed by first considering all features and then removing one feature at a time until all of them are ranked. At each step, one removes the feature whose removal minimizes the criterion. In other words, that variable is the one that tends to increase the most the leave-one-out error bound when used. Hence, the first removed variable is ranked last and the last removed one is considered the top-ranked variable.
  This search-space algorithm fits in the above-presented framework in the following way. Eq. (6) is solved by starting with ν_0 = d − 1; hence, at optimality, only one component of ν will be equal to 0. Since there are only d vectors ν which satisfy such a condition, at this stage the problem can be solved exactly, since one can compute G(f*, g*, ν, S) for all these vectors. However, as ν_0 decreases, the problem becomes computationally intractable, and we rely on the approximation that for a variable j for which, at step k−1, ν_{k−1}^{(j)} = 0, then ν_k^{(j)} = 0.
- A gradient-descent-based algorithm. Similar to the work of Weston et al. [24], the idea is to relax the constraint of ν being in {0,1}^d by considering that ν ∈ R^d and then to solve the resulting problem through gradient descent with respect to ν. In our context, since we are interested in variable ranking, we initially solve the problem without the constraint s(ν) ≤ ν_0. When this constraint is omitted, the resulting optimization problem can be solved by means of a two-step alternate algorithm. The first step consists in finding f* and g* for a given value of ν (that is to say, we solve the dual SVR problem in Eq. (4) by replacing all x_i by ν ⊙ x_i) and the second step consists in minimizing G(f*, g*, ν, S) by gradient descent with regard to ν, considering that all other parameters are fixed. Note that in our experiments the gradient step length μ has been chosen at each step according to a simple line-search method. These two steps are alternately performed until a minimum is reached or until the maximum number of allowed iterations is reached.

Algorithm 1. Variable ranking with the backward algorithm.

  Initialization: RANKED = []; VAR = [1, ..., d]
  while VAR is not empty do
    for all variables i in VAR do
      Remove temporarily variable i from VAR
      Compute all parameters needed for G^(i)
      Compute ranking criterion G^(i)
    end for
    RANKVAR = arg min_i G^(i)
    Rank variable: RANKED = [RANKVAR, RANKED]
    Remove variable RANKVAR from the variable list VAR
  end while

Algorithm 2. Variable ranking through gradient-descent optimization.

  Initialize scaling factor ν = [1, ..., 1]
  while stopping criterion is not reached do
    Compute all parameters needed for G while keeping the actual ν fixed
    for all variables i in VAR do
      Compute ∂G/∂ν_i
    end for
    ν ← ν − μ ∇_ν G
  end while
  RANKED = descending sort of {|ν_i|}_{i=1,...,d}

The constraint s(ν) ≤ ν_0 is then taken into account in the following way: variables are ranked according to the absolute value of the final scaling vector ν, that is to say, the k variables associated with the k largest absolute values of the components of ν are considered as the k top-ranked variables.

Since these two search-space algorithms can be used in the context of Eq. (6) in conjunction with the four different variable selection criteria, we have eight different variable selection algorithms, whose names are given in Table 1.

Table 1
Notation of the different algorithms

  Criterion | Backward | Gradient-descent
  G_w       | G_w B    | G_w GD
  G_α       | G_α B    | G_α GD
  G_R       | G_R B    | G_R GD
  G_S       | G_S B    | G_S GD

Eight algorithms have been obtained by combining all criteria with the two search-space methods.
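A compact sketch of Algorithm 1 in Python follows. The callables `train_svr` and `criterion` are placeholders for the user's SVR solver and for one of the criteria of Section 3.2; they are not defined by the paper.

```python
import numpy as np


def backward_ranking(X, y, train_svr, criterion):
    """Sketch of Algorithm 1: rank variables by backward elimination.

    train_svr(X, y) is assumed to return (alpha, alpha_hat, K) for the data it
    is given; criterion(alpha, alpha_hat, K) evaluates G_w, G_alpha, G_R or G_S.
    """
    remaining = list(range(X.shape[1]))
    ranked = []                                      # built from worst to best
    while remaining:
        scores = []
        for j in remaining:
            keep = [v for v in remaining if v != j]  # temporarily drop variable j
            alpha, alpha_hat, K = train_svr(X[:, keep], y)
            scores.append(criterion(alpha, alpha_hat, K))
        worst = remaining[int(np.argmin(scores))]    # its removal minimizes G
        ranked.insert(0, worst)                      # first removed -> ranked last
        remaining.remove(worst)
    return ranked                                    # ranked[0] is the top variable
```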
3.4. Derivatives with regard to a scaling factor

For Algorithm 2, we need to compute the gradient of a criterion with respect to the scaling factor ν. This latter acts as a componentwise multiplicative term on the input variables, and thus K(x, x') becomes K(ν ⊙ x, ν ⊙ x'), where ⊙ denotes the componentwise vector product. Consequently, one obtains the following derivative for a Gaussian kernel K(ν ⊙ x, ν ⊙ x') = exp(−||ν ⊙ x − ν ⊙ x'||²/(2σ²)):

$$\frac{\partial K}{\partial \nu_i} = -\frac{1}{\sigma^2}\,\nu_i\,(x_i - x'_i)^2\, K(\nu \odot x, \nu \odot x') = -\frac{1}{\sigma^2}\frac{1}{\nu_i}(\nu_i x_i - \nu_i x'_i)^2\, K(\nu \odot x, \nu \odot x').$$

Then, one needs to evaluate the gradient of the bounds with regard to a variable ν_i. Details of the gradient computation for a given criterion can be found in [6]; they have been obtained using the results of [2,7]. Here, we only give the final results:

- G_w gradient:
$$\frac{\partial G_w}{\partial \nu_i} = \sum_{k,j} (\hat{\alpha}_k - \alpha_k)(\hat{\alpha}_j - \alpha_j)\, \frac{\partial K(\nu \odot x_k, \nu \odot x_j)}{\partial \nu_i}.$$

- G_α gradient:
$$\frac{\partial G_\alpha}{\partial \nu_i} = \sum_{j \in F} \frac{\partial(\hat{\alpha}_j + \alpha_j)}{\partial \nu_i},$$
where
$$\begin{bmatrix} \dfrac{\partial (\hat{\alpha} - \alpha)_F}{\partial \nu_i} \\[6pt] \dfrac{\partial b}{\partial \nu_i} \end{bmatrix} = - \begin{bmatrix} \tilde{K}_F & 1_F \\ 1_F^t & 0 \end{bmatrix}^{-1} \begin{bmatrix} \dfrac{\partial \tilde{K}_F}{\partial \nu_i}\,(\hat{\alpha} - \alpha)_F \\[6pt] 0 \end{bmatrix}.$$

- G_R gradient:
$$\frac{\partial G_R}{\partial \nu_i} = \frac{\partial \tilde{R}^2}{\partial \nu_i} \sum_{j \in F} (\hat{\alpha}_j + \alpha_j) + \tilde{R}^2 \sum_{j \in F} \frac{\partial(\hat{\alpha}_j + \alpha_j)}{\partial \nu_i},$$
where R̃² is the optimal objective value of the following problem:
$$\max_{\beta} \; \sum_k \beta_k \tilde{K}(\nu \odot x_k, \nu \odot x_k) - \sum_{k,j} \beta_k \beta_j \tilde{K}(\nu \odot x_k, \nu \odot x_j) \quad \text{s.t.} \quad \sum_k \beta_k = 1, \;\; \beta_k \ge 0 \;\; \forall k,$$
and
$$\frac{\partial \tilde{R}^2}{\partial \nu_i} = \sum_{j=1}^{\ell} \beta_j \frac{\partial \tilde{K}(\nu \odot x_j, \nu \odot x_j)}{\partial \nu_i} - \sum_{j=1}^{\ell}\sum_{k=1}^{\ell} \beta_j \beta_k \frac{\partial \tilde{K}(\nu \odot x_j, \nu \odot x_k)}{\partial \nu_i}.$$

- Span-estimate gradient:
$$\frac{\partial G_S}{\partial \nu_i} = \sum_{j \in F} \frac{\partial(\hat{\alpha}_j + \alpha_j)}{\partial \nu_i}\, \tilde{S}_j^2 + \sum_{j \in F} (\hat{\alpha}_j + \alpha_j)\, \frac{\partial \tilde{S}_j^2}{\partial \nu_i},$$
where
$$\frac{\partial \tilde{S}_j^2}{\partial \nu_i} = \frac{1}{\big((\tilde{M}^{-1})_{jj}\big)^2} \left( \tilde{M}^{-1} \frac{\partial \tilde{M}}{\partial \nu_i} \tilde{M}^{-1} \right)_{jj} + \frac{\eta}{(\alpha_j + \hat{\alpha}_j)^2}\, \frac{\partial(\hat{\alpha}_j + \alpha_j)}{\partial \nu_i}.$$

Details of the derivation of the S̃_j² gradient, as well as issues concerning the differentiability of these bounds, can be found in [6].

For a gradient-descent-based algorithm, the vector ν has to be initialized. Since the minimization problem is non-convex, it is very likely that the resulting variable ranking depends on that initial value. One solution to that problem would be to run the algorithm several times with different initializations and to keep the solution which minimizes the criterion, but this would be time consuming. Furthermore, starting with a scaling vector ν equal to [1, ..., 1] has a particular meaning, since it gives equal weights to all variables. We have run some numerical experiments that aim at illuminating the issues involved in considering a fixed or random initialization of the vector ν.

3.5. Computational considerations

In addition to their capability of selecting the most relevant variables for a given task, the above-proposed algorithms can also be compared according to their computational times. Since these computational times are mainly problem-dependent, instead of providing results on data, we have preferred to summarize in Table 2 the most time-demanding subproblems needed by each algorithm for performing variable ranking.

As we can see, for the backward algorithm, the inner loop of Algorithm 1 requires a quadratic programming (QP) problem to be solved for G_w and G_α, whereas for G_R and G_S, two QPs and a QP plus a matrix inversion are, respectively, needed. An exact evaluation of the computational complexity of each algorithm is difficult, since these complexities depend on the number of support vectors involved in the problem and are thus problem-dependent.

For the gradient-descent algorithm, we can see that the gradient computation with regard to a given scaling factor ν_i needs the solution of a linear system, a matrix inversion or, in the simplest case, just a matrix multiplication. Hence, when the number of variables is large, this procedure can be time consuming. For this search-space algorithm, the most demanding part is the line search, since it involves at least one QP problem.

Table 2
Main computational burden for each criterion and each search-space algorithm

  (A) Algorithm 1: Backward
  Criterion   | G_w | G_α | G_R    | G_S
  Burden      | QP  | QP  | 2 QP   | QP, MI

  (B) Algorithm 2: Gradient descent
  Criterion   | G_w | G_α | G_R    | G_S
  Gradient    | MM  | LS  | LS, MM | LS, MI
  Line search | QP  | QP  | 2 QP   | QP, MI

For Algorithm 1, the table depicts what needs to be solved for evaluating the ranking criterion for each single variable before removing a single variable. For Algorithm 2, the table gives what needs to be processed for the gradient and for the line search at each iteration of gradient descent. MI, matrix inversion; LS, linear system; MM, matrix multiplication; QP, quadratic programming.
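As an illustration of Section 3.4, the sketch below computes the scaled Gaussian kernel and the gradient of G_w with respect to ν, the dual variables being held fixed as in the alternate scheme of Algorithm 2. It is a plain numpy sketch under our own naming.

```python
import numpy as np


def scaled_gaussian_kernel(X, nu, sigma):
    """K(nu*x, nu*x') for all training pairs, with componentwise scaling nu."""
    Z = X * nu                                        # scale each variable
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))


def grad_Gw_wrt_nu(X, nu, sigma, alpha, alpha_hat):
    """Gradient of G_w = sum_kj (a^_k - a_k)(a^_j - a_j) K(nu*x_k, nu*x_j)
    with respect to nu, using dK/dnu_i = -(nu_i/sigma^2)(x_ki - x_ji)^2 K."""
    d = alpha_hat - alpha
    K = scaled_gaussian_kernel(X, nu, sigma)
    grad = np.empty_like(nu)
    for i in range(len(nu)):
        diff2 = (X[:, None, i] - X[None, :, i]) ** 2  # (x_ki - x_ji)^2
        dK = -(nu[i] / sigma ** 2) * diff2 * K        # dK/dnu_i, elementwise
        grad[i] = d @ dK @ d
    return grad
```

One iteration of Algorithm 2 then updates ν ← ν − μ ∇_ν G_w, with the step μ chosen by a simple line search as described above.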
3.6. Relations with other variable selection algorithms

In the previous paragraphs, we have described several novel criteria for variable ranking in the context of support vector regression. These criteria have been associated with two different search-space algorithms, leading to eight novel variable selection methods.

Among all these methods, some can be related to other variable selection algorithms proposed in the literature. For instance, Mukkamala et al. [17] have proposed an approach that is similar to our Algorithm 1. In fact, they rank variables by using one step of a backward variable elimination algorithm and their criterion is mainly based on the classifier accuracy measured on a validation set. This is a classical wrapper approach applied to SVMs [14]. Our algorithm differs from theirs in two points: (i) we retrain the learning machine after variable elimination and perform the full backward elimination; (ii) we use a bound on the learning machine performance as a criterion, hence we do not need to rely on a validation set.

In the context of SVM classifiers, sensitivity-based criteria have also been used for variable ranking. These methods were originally introduced by Hermes and Buhmann [13] and Guyon et al. [12]. Hermes et al. have suggested to rank each variable according to the angle between the gradient of the decision function f(x) and the unit vectors e_j. It is easy to show (by calculating (∇_x K(x_i, x))^T e_j) that their ranking criterion is strongly related to a criterion based on

$$G^{(i)} = \left| \frac{\partial f(\nu \odot x)}{\partial \nu_i} \right| \qquad (7)$$

evaluated at ν = [1, ..., 1]. Hence their idea can be extended to the regression case and to other sensitivity measures. For instance, we can use as a criterion

$$G^{(i)} = \left| \frac{\partial G}{\partial \nu_i} \right|, \qquad (8)$$

where G can be one of the criteria described in Section 3.2. The underlying idea of such an algorithm is to rank variables according to their (positive or negative) influence on the leave-one-out bounds. However, since this approach is strongly related to the work of Hermes et al., and in order to limit the experimental results for the sake of clarity, we have not reported results based on these criteria in this paper.

4. Numerical results

This section reports the results of our empirical analysis of the above-presented variable selection algorithms. We have compared the performance of our methods to the variable ranking achieved by Sparse SVMs [3] (note that here we have directly used the output of the Sparse SVM as a variable ranker and we have not performed bagging as described originally in the paper of Bi et al. [3]) and to a simple variable selection method based on correlation coefficients. The performance measures that have been used for comparison are the mean number of variables correctly top ranked (if there are j relevant variables in a problem, all j variables should be ranked in the top j) and a normalized mean absolute error:

$$\mathrm{nmae}(y^p, y^t) = \frac{\sum_{i=1}^{n} |y_i^p - y_i^t|}{\sum_{i=1}^{n} |y_i^t|},$$

where y_i^t and y_i^p are, respectively, the true and predicted target values of a given x_i.

4.1. Experiment on linear toy data

As a sanity check, we have first tested our different algorithms on a simple linear regression problem. For this experiment, the data have been created according to the following steps. The input data are composed of 55 normally distributed variables and only the first three of them are related to the output, through the relation

$$y = x^{(1)} + 2x^{(2)} - x^{(3)}.$$

The experiment that we carried out is then the following. All the variable ranking algorithms have been run with training sets of increasing size and we have counted the number of times the variables x^(1), x^(2), x^(3) have been ranked in the top three variables. Naturally, the SVR variable ranking is based on a linear kernel; we have set the hyperparameters to C = 1000 and ε = 0.01 and we have used them for both the variable ranking algorithms and the regression function estimation. For obtaining these hyperparameter values, we have finely sampled a large range of hyperparameter values and, for each sample, we have measured the average performance over 20 trials of an SVR trained on 100 examples. In other words, these parameters correspond to the "best" parameters if all variables are used. Since this first experiment aims at checking the correctness of the algorithms, we have used a single initialization with ν = [1, ..., 1] for the gradient-descent-based search-space algorithms.

For each training set size, the results have been averaged over 100 trials with random draws of the training set. Fig. 1 shows the results. The first interesting result is that all the algorithms perform correctly and are able to distinguish the relevant variables as soon as the training set size is large enough. Besides, as the training set size increases, all algorithms perform better than the variable ranking method based on correlation coefficients. It seems that our algorithms are more robust to noisy variables than the correlation coefficient, which is theoretically well suited for ranking variables in a linear problem.

According to this figure, it seems that for a linear problem, gradient-descent-based search-space methods are the most efficient since, regardless of the criterion, they are able to correctly rank the relevant variables even for small training set sizes.
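The linear toy problem and the correlation-coefficient baseline are easy to reproduce; the following is a minimal numpy sketch (the seed and the helper names are ours, not from the paper).

```python
import numpy as np


def linear_toy_data(n, d=55, seed=0):
    """Linear toy problem of Section 4.1: d Gaussian variables, only the
    first three enter the target y = x1 + 2*x2 - x3."""
    rng = np.random.RandomState(seed)
    X = rng.randn(n, d)
    y = X[:, 0] + 2 * X[:, 1] - X[:, 2]
    return X, y


def correlation_ranking(X, y):
    """Baseline filter: rank variables by |correlation coefficient| with y."""
    cc = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return np.argsort(-cc)


X, y = linear_toy_data(50)
top3 = set(correlation_ranking(X, y)[:3])
print('relevant variables ranked in the top 3:', len(top3 & {0, 1, 2}))
```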
Fig. 1. Linear problem. Number of correct variables ranked in the top 3 with respect to the training set size (from top to bottom) and the algorithms (from left to right). L1 and CC, respectively, denote the sparse SVM variable ranking algorithm and a filter method based on correlation coefficients.

4.2. Experiments on nonlinear toy data

The toy data that we used for this series of experiments are the following: five input variables x^(1), ..., x^(5) have been drawn independently according to a normal distribution and the output variable y is obtained through the relation

$$y = x^{(1)} + 2x^{(2)} + \cos(x^{(3)}) + (x^{(4)})^2 + e^{x^{(5)}};$$

then 50 other normally distributed variables, independent of y, have been added to the experimental data sets. This yields a data set {x_i, y_i}_{i=1,...,ℓ} ∈ R^d × R with d = 55.

4.2.1. Performance with respect to training set size

The first experiment deals with the performance of the different algorithms with regard to the training set size. Our algorithms have been used as variable rankers and, for each method, the top five ranked variables have been used as predictive variables in an SVR for y.

Ten thousand examples of artificial data have been generated, but only ℓ of them have been used as the training set, whereas the remaining part is considered as the test set. The training set has been normalized to zero mean and unit variance and the test set has also been normalized with respect to the training data normalization parameters.

For this experiment, the hyperparameters C, ε and the width σ of the Gaussian kernel have been set to C = 10, ε = 0.01 and σ = 10. The same procedure as the one described for the linear problem has been used for selecting these hyperparameter values.

For the gradient-descent-based algorithms, the gradient has been normalized to a unit vector and a simple line search is performed in order to compute the next iterate of the scaling vector. This iterative process is stopped when the relative norm difference between the scaling vectors ν at two consecutive steps is lower than 0.01 or if the number of iterations reaches 30 (we have checked that this latter condition is very rarely met). For this experiment, we have used 20 random initializations of the vector ν and we have used for ranking the optimized vector associated with the lowest criterion value after optimization.

The η parameter of the span estimate has been set to 0.01. The hyperparameters of VS-SSVM are the hyperparameters (C = 0.1 and ε = 0.06) that minimize the mean absolute value of the test error of a sparse linear model, as described in [3].
Fig. 2. Nonlinear problem. Plots of the normalized mean absolute error of an SVR fed with the top 5 ranked variables with regard to the training set size. The left and right figures, respectively, depict the performance of an SVR associated with variable selection algorithms using the criteria G_w and G_α. The curves in each plot correspond to a search-space algorithm: (solid) backward selection; (dotted) gradient descent.

Fig. 3. Nonlinear problem. Plots of the normalized mean absolute error of an SVR fed with the top 5 ranked variables with regard to the training set size. The left and right figures, respectively, depict the performance of an SVR associated with variable selection algorithms using the criteria G_R and G_S. The curves in each plot correspond to a search-space algorithm: (solid) backward selection; (dotted) gradient descent.

Fig. 4. Nonlinear problem. Plots of the normalized mean absolute error of an SVR fed with the top 5 ranked variables with regard to the training set size. The left figure compares the performances of the SVR combined with the following variable selection algorithms: (solid) G_α with backward selection; (dotted) G_R with gradient descent; (dashed) VS-SSVM; (dashed-dotted) G_S with sensitivity method and C = 1. The right figure plots the performances of the SVR combined with baseline variable selection algorithms: (solid) variable selection with VS-SSVM; (dotted) correlation coefficient filter method; (dashed) using all variables.
The results of this experiment are depicted in Figs. 2–5. Figs. 2–4 plot the normalized mean absolute error of each algorithm with respect to the training set size. When comparing the performances of the criteria G_w, G_α, G_R and G_S regardless of the search-space method, it seems that all criteria, except G_w with gradient descent, yield equivalent performances. A slight advantage for G_α and G_R, using either the backward or the gradient-descent method, can be seen. The poor performance of G_w, as we will show in a following subsection, is essentially due to the several random initializations of the vector ν; in fact, it seems that this criterion performs better when ν is initialized with a vector of ones.

Fig. 4 depicts the performance of the baseline algorithms. As expected, a simple SVR using all variables performs badly, and variable selection by means of Sparse SVMs or correlation coefficients does not compare favorably to our methods based on gradient-descent and backward algorithms, especially when the learning set size becomes large. This is simply explained by the fact that both baseline methods tend to select variables that are linearly correlated to the output y.

Fig. 5 shows the average number of variables among x^(1), ..., x^(5) that have been correctly ranked in the top 5. It corroborates nearly all the findings that have been deduced from the previous figures. G_α and G_R seem to be the best criteria. This figure also shows that the overall performances of the gradient-descent algorithms seem to be better than those of the backward algorithms. Sparse SVMs and the correlation coefficient method are not able to better identify the relevant variables of the problem.

Fig. 5. Nonlinear problem. Number of correct variables ranked in the top 5 with respect to the training set size (from top to bottom) and the algorithms (from left to right). The star * shows the best performing algorithm for a given training set size.

4.2.2. Scaling vector initialization in the gradient-descent algorithm

In order to corroborate the statement that the initialization of the scaling vector ν plays an important role in the performance of the gradient-descent algorithm, we have run the following experiment. Using the same nonlinear data set, we have run all the gradient-descent-based algorithms with an initialization vector chosen randomly according to a uniform distribution over [0, 2]. The random initialization has been done 20 times and, among all these initializations, we have kept the one that has reached the lowest criterion value. Except on that point, the same experimental setting as above has been used (training and test set, hyperparameters).

Table 3 compares the number of correct top-ranked variables obtained with the 20 random initializations and with a single initialization with ν = [1, ..., 1].

Table 3
Nonlinear problem. Comparison of the performances of the gradient-descent-based algorithms with regard to the initialization vector. The table presents the number of correct top-ranked variables obtained with different training set sizes. The columns denoted fix and rand, respectively, stand for results with a fixed initialization and with 20 random initialization values. * indicates that the performance difference between the two initialization methods is significant according to a Wilcoxon sign test with a p-value of 0.1.

  Criterion |   20           |   30           |   40           |   50
            | fix     rand   | fix     rand   | fix     rand   | fix     rand
  G_w GD    | 2.06*   1.42   | 2.62*   1.83   | 2.89*   2.17   | 3.15*   2.28
  G_α GD    | 2.00    2.05   | 2.94    3.00   | 3.38    3.54*  | 3.80    3.90*
  G_R GD    | 1.95    2.10   | 2.91    3.13*  | 3.45    3.64*  | 3.76    3.86*
  G_S GD    | 2.12    1.94   | 2.76    2.74   | 3.15    3.16*  | 3.37    3.55*

The experiments show that, except for the criterion G_w, using several random initializations leads to equivalent or better performances, and that the improvements are most of the time significant (according to a Wilcoxon sign test). On the contrary, for the G_w criterion, using a fixed initialization achieves far better performances. Hence, we can deduce that this criterion is very sensitive to the initialization value. A rationale for this finding is that, unlike the other criteria, this criterion is not related to a generalization bound but to the norm of the decision function f(x); choosing an initialization value that minimizes this criterion does therefore not necessarily lead to better performances.
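The restart protocol used in this subsection can be sketched as follows; `optimize_nu` and `criterion_value` stand for the user's implementation of Algorithm 2 and of the chosen criterion, and are not provided by the paper.

```python
import numpy as np


def rank_with_restarts(X, y, optimize_nu, criterion_value, n_restarts=20, seed=0):
    """Run the gradient-descent ranker from several random scalings nu ~ U[0, 2]^d
    and keep the solution with the lowest criterion value (Section 4.2.2)."""
    rng = np.random.RandomState(seed)
    d = X.shape[1]
    best_nu, best_val = None, np.inf
    for _ in range(n_restarts):
        nu0 = rng.uniform(0.0, 2.0, size=d)
        nu = optimize_nu(X, y, nu0)              # returns the optimized scaling
        val = criterion_value(X, y, nu)
        if val < best_val:
            best_nu, best_val = nu, val
    return np.argsort(-np.abs(best_nu))          # ranking by |nu_i|, descending
```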
4.2.3. How to select the ranked variables

One of the drawbacks of variable ranking comes out when the question of how many of these variables to use for prediction is raised. This problem has already been addressed in the literature and there are many methods to cope with it. Typically, the easiest way to deal with this problem is to select the number of variables to use by means of a validation or a leave-one-out error of the regression function; however, this is computationally expensive. Alternatively, artificial probe random variables, independent of the output, can be added to the training data; then, depending on how these random variables are ranked, one can select the variables to use for prediction [3]. Here, we have used the SVR bounds in order to select the number of variables to use. For instance, Fig. 6 plots the normalized bounds and the test error with respect to the number of variables used for prediction, for different training set sizes. We can see that the span-estimate bound is able to correctly estimate the number of variables to use, especially for large data sets. However, for small data sets, it seems that the span-estimate bound behaves differently from the test error, leading to a suboptimal number of variables. On the other hand, the radius-margin bound seems to be too conservative and tends to select too few variables. These findings are in agreement with those of Chang and Lin [6], who conclude that model selection with the span estimate is more robust.

4.2.4. Influence of cross-terms

In the above-described experiments, the target function does not have any cross-terms. We have therefore also run some experiments in order to investigate how our algorithms perform in the presence of cross-terms. Hence, we have built the following function:

$$y = x^{(1)} + 2x^{(2)} + \cos(x^{(3)}) + (x^{(4)})^2 + e^{x^{(5)}} + x^{(1)}x^{(2)} + x^{(1)} e^{x^{(2)}}.$$

Again, we have used the same experimental setting as above. At first, we have computed the optimal hyperparameters according to the above-described procedure, which yields C = 5, σ = 7 and ε = 0.01. For VS-SSVM, the optimal hyperparameters are C = 0.1 and ε = 0.5.

Fig. 7 depicts bar charts of the influence of the cross-terms on the ranking performance: we have plotted the difference (absence minus presence of cross-terms) of the number of variables correctly ranked in the top five variables. We can clearly see that cross-terms indeed influence performance negatively for all criteria associated with the backward and the gradient-descent algorithms. However, the impact of these cross-terms is rather benign since, on average over the 100 trials, the maximal decrease of the number of correctly ranked variables is of the order of 0.6.

4.3. Real-world data sets

We have tested our algorithms on some real-world problems such as QSAR data and DNA microarray problems. Information on these data sets, including references where they can be found, is given in Table 4. As described there, some of these data sets deal with classification problems. In such cases, we have still addressed the problem by treating it as a regression problem: we have naturally considered the class label of a given example as its target value.

4.3.1. Results

The performance measure we used is the normalized mean absolute error. All problems have been addressed with a Gaussian SVR except "colon" and "lymphoma", for which we have used a linear kernel. The reason for using a linear solution for these two data sets is that they have been widely used for illustrating variable selection algorithm performance and in all cases [23,18] a linear kernel has been used.

Fig. 6. Nonlinear problem. Using bounds for evaluating the number of ranked variables to be used for training an SVR. The size of the training set varies from left to right: (left) ℓ = 30; (middle) ℓ = 50; (right) ℓ = 100.
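The bound-based selection of the number of variables discussed in Section 4.2.3 (and illustrated in Fig. 6) amounts to scanning nested subsets of top-ranked variables. A minimal sketch follows, assuming a user-supplied `bound_value` function that retrains the SVR on the given columns and returns the span-estimate or radius-margin bound.

```python
import numpy as np


def select_n_variables(ranking, X, y, bound_value):
    """Keep the prefix of the ranking that minimizes a leave-one-out bound."""
    scores = []
    for k in range(1, len(ranking) + 1):
        cols = ranking[:k]                      # top-k ranked variables
        scores.append(bound_value(X[:, cols], y))
    best_k = int(np.argmin(scores)) + 1
    return ranking[:best_k]
```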
Fig. 7. Nonlinear problem. Difference (absence minus presence of cross-terms) of the number of correct variables ranked in the top 5 with respect to the training set size (from top to bottom) and the algorithms (from left to right). Positive values mean that the presence of cross-terms reduces the ability of an algorithm to correctly rank variables.

Table 4
Summary of the real-world data sets and related references on their previous uses

  Data sets   | No. var. | No. obser. | Type    | References
  Pyrimidines | 27       | 74         | regres. | [4]
  Triazines   | 60       | 186        | regres. | [4]
  Selwood     | 53       | 31         | regres. | [20]
  Phenetyl    | 628      | 22         | regres. | [20]
  Olitos      | 25       | 120        | class.  | [20]
  Yeast       | 79       | 208        | class.  | [23]
  Colon       | 2000     | 62         | class.  | [23]
  Lymphoma    | 4026     | 96         | class.  | [23]

Here, we have performed a two-stage 5-fold cross-validation for evaluating the performance of our algorithms and for selecting the appropriate hyperparameters. For each cross-validation fold, we have done the following. A cross-validation procedure is run on the resulting training set in order to figure out the best hyperparameters C, ε and σ of the Gaussian kernel that lead to the minimum mean normalized absolute error. Then, with the best hyperparameters of this fold, the different variable ranking algorithms are run on all the training data. Finally, for each algorithm, support vector regressions with increasing numbers of ranked variables are fitted on the training data and tested on the held-out data of the first cross-validation stage. For this experiment, the gradient-descent-based algorithms have been randomly initialized five times.

For each data set and algorithm, we have measured the minimal normalized mean absolute error and the number of variables that are needed for achieving this minimal error. In order to check whether the performances of the SVR with ranked variables are significantly different from those of an SVR using all variables, a Wilcoxon signed test has been performed and the resulting p-value reported.

Results are summarized in Tables 5 and 6. It is evident from these tables that there is no couple of criterion and search-space algorithm that outperforms all the others. However, if we get into the details, we can see that G_R in conjunction with a gradient-descent algorithm yields the best performance on two of the data sets, that G_w with gradient descent achieves the best results twice, whereas G_α and G_S with gradient descent, and G_α and G_S with a backward algorithm, each win on one of the data sets. Interestingly, if we look instead at significant improvements of performance (p-value lower than 0.3), we can see that several couples of criterion and search-space algorithm yield a significant performance improvement on 5 of the 8 data sets.

Note that in this experiment the hyperparameters of the variable ranker and of the SVR have been selected by model selection using all variables, and still we get good performances. Although we believe that these are not the optimal hyperparameters that lead to the best variable ranking performance, this approach is cheaper than using a model selection procedure which includes the variable ranking stage.
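A sketch of this two-stage cross-validation protocol follows, with user-supplied `fit_and_nmae` and `ranker` callables (hypothetical helpers, not defined in the paper); taking the minimum over the number of retained variables reproduces the oracle selection reported in Tables 5 and 6.

```python
import numpy as np


def two_stage_cv(X, y, fit_and_nmae, ranker, param_grid, n_folds=5, seed=0):
    """Outer CV for evaluation, inner CV (all variables) for hyperparameters.

    fit_and_nmae(Xtr, ytr, Xte, yte, params) trains an SVR and returns its
    normalized mean absolute error; ranker(Xtr, ytr, params) returns a ranking.
    """
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    outer_errors = []
    for f in range(n_folds):
        te = folds[f]
        tr = np.hstack([folds[g] for g in range(n_folds) if g != f])

        def inner_error(params):
            inner = np.array_split(tr, n_folds)
            errs = []
            for g in range(n_folds):
                ite = inner[g]
                itr = np.hstack([inner[h] for h in range(n_folds) if h != g])
                errs.append(fit_and_nmae(X[itr], y[itr], X[ite], y[ite], params))
            return np.mean(errs)

        best_params = min(param_grid, key=inner_error)     # inner model selection
        ranking = ranker(X[tr], y[tr], best_params)         # rank on training data
        errs = [fit_and_nmae(X[tr][:, ranking[:k]], y[tr],
                             X[te][:, ranking[:k]], y[te], best_params)
                for k in range(1, len(ranking) + 1)]
        outer_errors.append(min(errs))                      # oracle number of vars
    return float(np.mean(outer_errors))
```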
Table 5
Variable selection performance considering that an oracle has selected the optimal number of variables to use

  Algo   | Pyrim               | Triazines           | Selwood              | Phenetyl
         | Mean  Var. p-value  | Mean  Var. p-value  | Mean   Var. p-value  | Mean  Var. p-value
  SVM    | 8.11  27   –        | 16.14 60   –        | 100.20 53   –        | 3.52  628  –
  CC     | 8.06  25   0.150    | 15.84 29   0.506    | 80.54  11   0.126    | 1.55  30   0.006
  Gw B   | 7.91  21   0.492    | 15.54 11   0.627    | 74.91  5    0.170    | 1.88  60   0.022
  Ga B   | 7.73  23   0.981    | 15.13 11   0.103    | 77.65  9    0.065    | 2.44  95   0.263
  GR B   | 7.59  23   0.815    | 15.18 23   0.109    | 82.13  9    0.085    | 2.18  35   0.131
  GS B   | 8.11  27   0.381    | 16.06 15   0.820    | 71.45  11   0.005    | 2.09  205  0.017
  Gw GD  | 7.98  21   0.499    | 16.13 51   0.534    | 93.80  9    0.724    | 1.38  35   0.006
  Ga GD  | 7.15  15   0.019    | 16.21 19   0.839    | 87.16  9    0.256    | 2.06  60   0.046
  GR GD  | 7.52  23   0.264    | 16.08 27   0.645    | 91.27  2    0.860    | 2.03  115  0.026
  GS GD  | 8.11  27   0.208    | 15.63 37   0.171    | 89.08  17   0.136    | 1.71  30   0.031

In bold, best performances with respect to the mean normalized absolute error are highlighted.

Table 6
Variable selection performance considering that an oracle has selected the optimal number of variables to use

  Algo   | Olitos              | Yeast               | Colon                | Lymphoma
         | Mean  Var. p-value  | Mean  Var. p-value  | Mean  Var.  p-value  | Mean  Var.  p-value
  SVM    | 30.93 25   –        | 16.50 79   –        | 53.75 2000  –        | 58.80 4026  –
  CC     | 30.38 13   0.424    | 14.73 23   0.045    | 51.01 300   0.517    | 57.16 400   0.497
  Gw B   | 30.81 17   0.840    | 12.91 31   0.003    | 52.40 500   0.206    | 53.04 100   0.029
  Ga B   | 30.79 21   0.850    | 13.56 9    0.000    | 51.42 200   0.273    | 57.11 600   0.149
  GR B   | 30.93 25   –        | 15.64 35   0.139    | 50.29 400   0.007    | 53.91 200   0.022
  GS B   | 30.93 25   –        | 11.60 15   0.000    | 53.31 75    0.908    | 55.64 1000  0.045
  Gw GD  | 30.93 25   –        | 11.82 17   0.000    | 51.27 600   0.060    | 49.01 100   0.029
  Ga GD  | 30.72 11   0.724    | 12.79 15   0.000    | 49.93 100   0.129    | 55.65 25    0.682
  GR GD  | 30.27 19   0.210    | 11.95 23   0.000    | 49.26 50    0.179    | 52.12 50    0.329
  GS GD  | 30.93 25   –        | 11.55 27   0.000    | 52.05 100   0.787    | 54.28 500   0.303

In bold, best performances with respect to the mean normalized absolute error are highlighted.

5. Conclusions and perspectives

The work that has been presented here is essentially an extension of the study described in [18]. We have proposed in this paper to use leave-one-out bounds as criteria for variable ranking in support vector regression. A radius-margin bound and a span-estimate bound have been used and, in addition to these two criteria, two other ones which are related to these bounds have also been derived. Besides, variable ranking algorithms also need a search-space algorithm, which is used for selecting variables according to a given criterion. In this paper, we have compared the classical backward selection algorithm and a method based on optimizing a scaling vector through gradient descent. Numerical experiments have then been performed on toy data and real QSAR problems in order to analyze the performance and behavior of each couple of criterion and search-space algorithm. The main conclusions that can be derived from this work are:

- According to our experimental results, if we consider the ranking capability of each algorithm, the best variable ranking criterion is the radius-margin criterion. It yields a significant performance improvement using fewer variables on most of the real-world problems we used. However, this criterion is time-demanding, since it requires two quadratic programming problems to be solved.
- In order to compute the span estimate or the radius-margin, one needs to invert a matrix whose size is of the order of the number of support vectors involved in the variable ranking procedure, or to solve a QP problem. If one wants to trade performance against efficiency, then it is preferable to use G_α as a criterion. This criterion is cheaper to compute than the gradient of the span estimate and still gets good results on the real data.
- Although the gradient-based algorithm gives better performance than the backward selection algorithm, the latter can be useful when the number of variables is very large. In fact, it is more flexible since, at a given iteration, it is possible to rank a bunch of variables at once, while for the gradient-descent algorithm the gradient with respect to all variables must be computed.
- Since our variable ranking algorithms are based on support vector regression, the problem of model selection has to be addressed. Ideally, the model selection should also take the variable ranking procedure into account. For instance, for the backward feature elimination algorithm, one should consider the ranking and the model selection procedure all together. That means that a variable ranking procedure should be run for each hyperparameter setting, which can rapidly become intractable if the space of hyperparameters to be sampled is large. For instance, for, say, 10 values of each of the hyperparameters C, σ (if we use a Gaussian kernel) and ε, this would yield 1000 ranking procedures; and for a backward algorithm with 50 variables, each procedure would involve 50 + 49 + ... + 1 QP problems to solve. Hence, in all our experiments, we have selected these hyperparameters with regard to the performance of an SVR on all variables. Although this is clearly not optimal, variable ranking with these hyperparameters still yields improved performances using fewer variables.

The practical experiments that we have carried out indicate that most of the algorithms we proposed (although we have suggested our preferences to the practitioner) are able to select the relevant variables and to improve the test errors. However, this work still lacks some theoretical analysis, and in our future work we plan to address this point through the theoretical setting of Lal et al. [15].

References

[1] E. Amaldi, V. Kann, On the approximability of minimizing non zero variables or unsatisfied relations in linear systems, Theoret. Comput. Sci. 209 (1998) 237–260.
[2] Y. Bengio, Gradient-based optimization of hyperparameters, Neural Comput. 12 (2000) 1889–1900.
[3] J. Bi, K. Bennett, M. Embrechts, C. Breneman, M. Song, Dimensionality reduction via sparse support vector machines, J. Mach. Learning Res. 3 (2003) 1245–1264.
[4] C. Blake, C. Merz, UCI repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, University of California, Irvine, Department of Information and Computer Sciences, 1998.
[5] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, 2004.
[6] M.-W. Chang, C.-J. Lin, Leave-one-out bounds for support vector regression model selection, Neural Comput. 17 (4) (2005) 1–26.
[7] O. Chapelle, V. Vapnik, O. Bousquet, S. Mukherjee, Choosing multiple parameters for SVM, Machine Learning 46 (1–3) (2002) 131–159.
[8] R. Collobert, S. Bengio, SVMTorch: support vector machines for large-scale regression problems, J. Mach. Learning Res. 1 (2001) 143–160.
[9] C. Couvreur, Y. Bresler, On the optimality of the backward greedy algorithm for the subset selection problem, SIAM J. Matrix Anal. Appl. 21 (3) (2000) 797–808.
[10] Y. Grandvalet, S. Canu, Adaptive scaling for feature selection in SVMs, in: Advances in Neural Information Processing Systems, vol. 15, MIT Press, Cambridge, MA, 2003.
[11] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learning Res. 3 (2003) 1157–1182.
[12] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Mach. Learning (2000).
[13] L. Hermes, J.M. Buhmann, Feature selection for SVMs, in: Proceedings of the 15th International Conference on Pattern Recognition, vol. 2, 2000, pp. 716–719.
[14] R. Kohavi, G. John, Wrappers for feature subset selection, Artificial Intelligence 97 (1–2) (1997) 273–324.
[15] T. Lal, O. Chapelle, J. Weston, A. Elisseeff, Embedded methods, in: I. Guyon, S. Gunn, M. Nikravesh, L. Zadeh (Eds.), Springer, Berlin, 2005, pp. 1–20.
[16] S.-P. Liao, H.-T. Lin, C.-J. Lin, A note on the decomposition methods for support vector regression, Neural Comput. 14 (2002) 1267–1281.
[17] S. Mukkamala, A. Sung, Feature ranking and selection for intrusion detection systems, in: Proceedings of the International Conference on Information and Knowledge Engineering, 2002.
[18] A. Rakotomamonjy, Variable selection using SVM-based criteria, J. Mach. Learning Res. 3 (2003) 1357–1370.
[19] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, 2004.
[20] R. Todeschini, Milano chemometrics and QSAR research group data sets, http://www.disat.unimib.it/chm/download/datasets.htm, 2001.
[21] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[22] V. Vapnik, O. Chapelle, Bounds on error expectation for support vector machines, Neural Comput. 12 (9) (2000) 2013–2036.
[23] J. Weston, A. Elisseeff, B. Schölkopf, M. Tipping, Use of the ℓ0-norm with linear models and kernel methods, J. Mach. Learning Res. 3 (2003) 1439–1461.
[24] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, V. Vapnik, Feature selection for SVMs, in: Advances in Neural Information Processing Systems, vol. 13, 2001.

Alain Rakotomamonjy received an M.S. degree in Electrical Engineering and Computer Science from the Ecole Supérieure d'Electronique de l'Ouest in 1993 and his Ph.D. from the University of Orleans in 1997. Since 1999, he has been an assistant professor at the INSA de Rouen, France. His research interests include machine learning, kernel methods, variable selection and signal discrimination.
