
ESANN 2015 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence

and Machine Learning. Bruges (Belgium), 22-24 April 2015, i6doc.com publ., ISBN 978-287587014-8.
Available from http://www.i6doc.com/en/.

Solving Constrained Lasso and Elastic Net Using ν-SVMs

Carlos M. Alaíz, Alberto Torres and José R. Dorronsoro

Universidad Autónoma de Madrid - Departamento de Ingeniería Informática
Tomás y Valiente 11, 28049 Madrid - Spain

Abstract. Many important linear sparse models have at their core the
Lasso problem, for which the GLMNet algorithm is often considered the
current state of the art. Recently M. Jaggi observed that Constrained
Lasso (CL) can be reduced to an SVM-like problem, which opens the way to
using efficient SVM algorithms to solve CL. We refine Jaggi's arguments
to reduce both CL and the constrained Elastic Net to a Nearest Point Problem,
and show experimentally that the well-known LIBSVM library converges
faster than GLMNet for small problems and, if properly adapted, also for
larger ones.

1 Introduction
Big Data problems are putting a strong emphasis on using simple linear sparse
models to handle samples that are large both in size and in dimension. This has
made Lasso [1] and Elastic Net [2] often the methods of choice in large scale
Machine Learning.
Both are usually stated in their unconstrained version of solving
$$\min_\beta\; U_{\lambda,\mu}(\beta) = \frac{1}{2}\|X\beta - y\|_2^2 + \frac{\mu}{2}\|\beta\|_2^2 + \lambda\|\beta\|_1, \qquad (1)$$

where for an $N$-size sample $S = \{(x^1, y^1), \ldots, (x^N, y^N)\}$, $X$ is an $N \times d$ data
matrix containing as its rows the transposes of the $d$-dimensional features $x^n$,
$y = (y^1, \ldots, y^N)^t$ is the $N \times 1$ target vector, β denotes the Lasso (when µ = 0) or
Elastic Net model coefficients, and the subscripts 1, 2 denote the $\ell_1$ and $\ell_2$ norms
respectively. We will refer to problem (1) as λ-Unconstrained Lasso (or λ-UL),
writing then simply $U_\lambda$, or as (µ, λ)-Unconstrained Elastic Net (or (µ, λ)-UEN).
It has an equivalent constrained formulation
It has an equivalent constrained formulation
$$\min_\beta\; C_{\rho,\mu}(\beta) = \frac{1}{2}\|X\beta - y\|_2^2 + \frac{\mu}{2}\|\beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \le \rho. \qquad (2)$$
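As a small illustration of the two formulations, the following NumPy sketch (the helper names are ours, not from the paper) evaluates the unconstrained objective of (1) and the constrained objective of (2) together with its feasibility condition:

```python
import numpy as np

def unconstrained_objective(X, y, beta, lam, mu=0.0):
    """U_{lam,mu}(beta) = 1/2 ||X beta - y||_2^2 + mu/2 ||beta||_2^2 + lam ||beta||_1."""
    r = X @ beta - y
    return 0.5 * r @ r + 0.5 * mu * beta @ beta + lam * np.abs(beta).sum()

def constrained_objective(X, y, beta, rho, mu=0.0):
    """C_{rho,mu}(beta), plus whether the l1 constraint ||beta||_1 <= rho holds."""
    r = X @ beta - y
    return 0.5 * r @ r + 0.5 * mu * beta @ beta, np.abs(beta).sum() <= rho
```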

Again, we will refer to problem (2) as ρ-CL, writing $C_\rho$, or as (µ, ρ)-CEN. Two
algorithmic approaches stand out when solving (1): the GLMNet algorithm [3],
which uses cyclic coordinate descent, and the FISTA algorithm [4], based on
proximal optimization. Recently M. Jaggi [5] has shown the equivalence between
constrained Lasso and some SVM variants. This is relatively simple when going
from Lasso to SVM (the direction we are interested in) but more involved the
other way around. In [6] Jaggi's approach is refined to obtain a one-way reduction
from constrained Elastic Net to a squared hinge loss SVM problem. As
mentioned in [5], this opens the way for advances in one problem to be translated
into advances in the other and, while the application of SVM solvers to Lasso is
not addressed in [5], prior work on parallelizing SVMs is leveraged in [6] to obtain
a highly optimized and parallel solver for the Elastic Net and Lasso.

* With partial support from Spain's grants TIN2013-42351-P and S2013/ICE-2845 CASI-CAM-CM
and also from the Cátedra UAM-ADIC in Data Science and Machine Learning. The
authors also gratefully acknowledge the use of the facilities of the Centro de Computación
Científica (CCC) at UAM. The second author is also supported by the FPU-MEC grant AP-2012-5163.
In this paper we revisit M. Jaggi's approach, first to simplify and refine the
reductions in [5] and [6] and then to explore the efficiency of the SVM solver
LIBSVM when applied to Lasso. More precisely, our contributions are:
• A simple reduction in Sect. 2 of Elastic Net to Lasso and a proof of the
equivalence between λ-UL and ρ-CL. This is a known result, but proofs are
hard to find, so our simple argument may have value in itself.
• A refinement in Sect. 3 of M. Jaggi's arguments to reduce ρ-CL to Nearest
Point and ν-SVC problems solvable using LIBSVM.
• An experimental comparison in Sect. 4 of NPP-LIBSVM with GLMNet
and FISTA, showing NPP-LIBSVM to be competitive with GLMNet.
The paper will close with a brief discussion and pointers to further work.

2 Unconstrained and Constrained Lasso and Elastic Net


We show that λ-UL and ρ-CL, and also (µ, λ)-UEN and (µ, ρ)-CEN, have the
same β solution for appropriate choices of λ and ρ. We will first reduce our
discussion to the Lasso case, describing how Elastic Net (EN) can be recast as
a pure Lasso problem over an enlarged data matrix and target vector. For
this, let us consider in either problem (1) or (2) the $(N+d) \times d$ matrix $\mathcal{X}$ and the
$(N+d) \times 1$ vector $\mathcal{Y}$ defined as $\mathcal{X}^t = (X^t, \sqrt{\mu}\, I_d)$, $\mathcal{Y}^t = (y^t, 0_d^t)$, with $I_d$ the $d \times d$
identity matrix and $0_d$ the $d$-dimensional zero vector. It is now easy to check that
$\|X\beta - y\|_2^2 + \mu\|\beta\|_2^2 = \|\mathcal{X}\beta - \mathcal{Y}\|_2^2$ and, thus, we can rewrite (1) or (2) as Lasso
problems over an extended sample
$$\mathcal{S} = \{(x^1, y^1), \ldots, (x^N, y^N), (\sqrt{\mu}\, e_1, 0), \ldots, (\sqrt{\mu}\, e_d, 0)\}$$
of size $N + d$, where the $e_i$ denote the canonical vectors of $\mathbb{R}^d$, and for which $\mathcal{X}$ and
$\mathcal{Y}$ are the new data matrix and target vector. Because of this, in what follows
we limit our discussion to UL and CL, pointing out where needed the changes to be
made for UEN and CEN, and reverting to the $(X, y)$ notation instead of $(\mathcal{X}, \mathcal{Y})$.
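As a quick check of this identity, here is a minimal NumPy sketch (variable names are ours) that builds the augmented pair $(\mathcal{X}, \mathcal{Y})$ and verifies $\|X\beta - y\|_2^2 + \mu\|\beta\|_2^2 = \|\mathcal{X}\beta - \mathcal{Y}\|_2^2$ on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, mu = 50, 10, 0.3
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)
beta = rng.standard_normal(d)

X_aug = np.vstack([X, np.sqrt(mu) * np.eye(d)])   # (N + d) x d augmented data matrix
y_aug = np.concatenate([y, np.zeros(d)])          # (N + d)-dimensional augmented target

lhs = np.sum((X @ beta - y) ** 2) + mu * np.sum(beta ** 2)
rhs = np.sum((X_aug @ beta - y_aug) ** 2)
assert np.isclose(lhs, rhs)   # the Elastic Net l2 term is absorbed into a pure Lasso design
```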
Assuming λ fixed, it is obvious that a minimizer $\beta_\lambda$ of λ-UL is also a minimizer
of $\rho_\lambda$-CL with $\rho_\lambda = \|\beta_\lambda\|_1$. Next, if $\beta_{LR}$ is the solution of Linear Regression and
$\rho_{LR} = \|\beta_{LR}\|_1$, the solution of ρ-CL is clearly $\beta_{LR}$ for $\rho \ge \rho_{LR}$, so we assume
$\rho < \rho_{LR}$. Let us denote by $\beta_\rho$ a minimum of the convex problem ρ-CL; we shall
prove next that $\beta_\rho$ solves $\lambda_\rho$-UL with $\lambda_\rho = \beta_\rho^t X^t (y - X\beta_\rho)/\rho$. To do so, let
$e_2(\beta) = \frac{1}{2}\|X\beta - y\|^2$ and $g(\beta) = \|\beta\|_1 - \rho$, and define
$$f(\beta) = \max\{e_2(\beta) - e_2^\rho,\ g(\beta)\},$$
where $e_2^\rho = e_2(\beta_\rho)$ is the optimal square error in ρ-CL. Since $f$ is convex and both
terms in the maximum vanish at $\beta_\rho$, the subdifferential of $f$ at $\beta_\rho$ is
$$\partial f(\beta_\rho) = \{\gamma \nabla e_2(\beta_\rho) + (1 - \gamma) h \,:\, 0 \le \gamma \le 1,\ h \in \partial g(\beta_\rho) = \partial\|\cdot\|_1(\beta_\rho)\}.$$
Thus, as $\beta_\rho$ minimizes $f$ (any $\beta'$ with $\|\beta'\|_1 < \rho$ will have an error greater than
$e_2^\rho$), there are an $h_\rho \in \partial\|\cdot\|_1(\beta_\rho)$ and a $\gamma$, $0 \le \gamma \le 1$, such that $0 = \gamma \nabla e_2(\beta_\rho) + (1 - \gamma) h_\rho$.
But $\gamma > 0$, since otherwise we would have $h_\rho = 0$, i.e., $\beta_\rho = 0$, contradicting
$\|\beta_\rho\|_1 = \rho$. Therefore, taking $\lambda_\rho = (1 - \gamma)/\gamma$, we have
$$0 = \nabla e_2(\beta_\rho) + \lambda_\rho h_\rho \in \partial\left[e_2(\cdot) + \lambda_\rho \|\cdot\|_1\right](\beta_\rho) = \partial U_{\lambda_\rho}(\beta_\rho),$$
i.e., $\beta_\rho$ minimizes $\lambda_\rho$-UL. We finally derive a value for $\lambda_\rho$ by observing that
$\beta_\rho^t h_\rho = \|\beta_\rho\|_1 = \rho$ and $0 = \nabla e_2(\beta_\rho) + \lambda_\rho h_\rho = -X^t r_{\beta_\rho} + \lambda_\rho h_\rho$, where $r_{\beta_\rho}$ is just
the residual $y - X\beta_\rho$. As a result,
$$\lambda_\rho \|\beta_\rho\|_1 = \lambda_\rho\, \beta_\rho^t h_\rho = \beta_\rho^t X^t r_{\beta_\rho} \ \Rightarrow\ \lambda_\rho = \frac{\beta_\rho^t X^t r_{\beta_\rho}}{\|\beta_\rho\|_1} = \frac{\beta_\rho^t X^t r_{\beta_\rho}}{\rho}.$$
For (µ, ρ)-CEN, the corresponding $\lambda_\rho$ would be
$$\lambda_\rho = \frac{(\mathcal{Y} - \mathcal{X}\beta_\rho) \cdot \mathcal{X}\beta_\rho}{\rho} = \frac{(y - X\beta_\rho) \cdot X\beta_\rho - \mu\|\beta_\rho\|_2^2}{\rho}.$$
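The λ-UL / ρ-CL correspondence can be sanity-checked numerically with any off-the-shelf Lasso solver. The sketch below uses scikit-learn as one possibility; note that sklearn's Lasso minimizes $\frac{1}{2N}\|X\beta - y\|_2^2 + \alpha\|\beta\|_1$, so λ = Nα in our notation (this scaling, the synthetic data and the variable names are our assumptions, not part of the paper):

```python
import numpy as np
from sklearn.linear_model import Lasso   # any lambda-UL solver would do

rng = np.random.default_rng(0)
N, d = 200, 50
beta_true = rng.standard_normal(d) * (rng.random(d) < 0.2)    # sparse ground truth
X = rng.standard_normal((N, d))
y = X @ beta_true + 0.1 * rng.standard_normal(N)

alpha = 0.05                              # sklearn penalty; lambda = N * alpha here
beta = Lasso(alpha=alpha, fit_intercept=False, max_iter=100000).fit(X, y).coef_

rho = np.abs(beta).sum()                  # radius of the equivalent rho-CL problem
lam_recovered = beta @ X.T @ (y - X @ beta) / rho
print(lam_recovered, N * alpha)           # should agree up to solver tolerance
```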

3 From Constrained Lasso to NPP


The basic idea in [5] is to re-scale $\tilde\beta = \beta/\rho$ and $\tilde y = y/\rho$ to normalize ρ-CL into
1-CL, and then to replace the $\ell_1$ unit ball $B_1$ with the simplex
$$\Delta_{2d} = \Big\{\alpha = (\alpha_1, \ldots, \alpha_{2d}) \in \mathbb{R}^{2d} \,:\, 0 \le \alpha_i \le 1,\ \sum_{i=1}^{2d} \alpha_i = 1\Big\},$$
by enlarging the initial data matrix $X$ to an $N \times 2d$ dimensional $\widehat X$ with columns
$\widehat X^c = X^c$ and $\widehat X^{d+c} = -X^c$, $c = 1, \ldots, d$. As a consequence, the square error of
the re-scaled Lasso, $\|X\tilde\beta - \tilde y\|^2$, becomes $\|\widehat X \alpha - \tilde y\|^2$. Since $\widehat X \alpha$ now lies in the
convex hull $C$ spanned by $\{\widehat X^c, 1 \le c \le 2d\}$, finding an optimal $\alpha^o$ is equivalent to
finding the point in $C$ closest to $\tilde y$, i.e., solving the nearest point problem (NPP)
between the $N$-dimensional convex hull $C$ and the singleton $\{\tilde y\}$. In turn, NPP
can be scaled [7] into an equivalent linear ν-SVC problem [8] with $\nu = 2/(2d+1)$,
over the sample $S = \{S_+, S_-\}$, where $S_+ = \{X^1, \ldots, X^d, -X^1, \ldots, -X^d\}$ and
$S_- = \{\tilde y\}$. This problem can be solved using the LIBSVM library [9]. The
optimal ρ-CL solution $\beta_\rho$ is recovered as follows: first we obtain the optimal
NPP coefficients $\alpha^o$ by scaling the ν-SVM solution $\gamma^o$ as $\alpha^o = (2d+1)\gamma^o$, then
we compute $(\tilde\beta_\rho)_i = \alpha^o_i - \alpha^o_{i+d}$, and finally we re-scale back, $\beta_\rho = \rho\,\tilde\beta_\rho$.
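To make the reduction concrete, the following sketch solves the NPP form of ρ-CL with a generic simplex-constrained solver (SciPy's SLSQP) instead of LIBSVM's ν-SVC; it is only meant to illustrate the change of variables and the recovery of $\beta_\rho$ on small problems, not the SMO-based solver used in the experiments:

```python
import numpy as np
from scipy.optimize import minimize

def constrained_lasso_via_npp(X, y, rho):
    """min ||X b - y||^2 s.t. ||b||_1 <= rho, via the simplex / NPP reduction."""
    N, d = X.shape
    X_hat = np.hstack([X, -X])            # columns X^c and -X^c
    y_tilde = y / rho                     # re-scaling rho-CL into 1-CL

    def sq_err(alpha):                    # squared error of the re-scaled Lasso
        r = X_hat @ alpha - y_tilde
        return r @ r

    res = minimize(sq_err, np.full(2 * d, 1.0 / (2 * d)), method="SLSQP",
                   bounds=[(0.0, 1.0)] * (2 * d),
                   constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}])
    alpha = res.x                         # point of the simplex closest to y_tilde
    beta_tilde = alpha[:d] - alpha[d:]    # (beta_tilde)_i = alpha_i - alpha_{i+d}
    return rho * beta_tilde               # undo the re-scaling
```

In the paper's pipeline the same NPP is instead handed to LIBSVM as a linear ν-SVC problem with ν = 2/(2d + 1), which is what the timings of Sect. 4 measure.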
Finally, the changes for (µ, ρ)-CEN are straightforward, since we only have to
append the $d$ extra components $\sqrt{\mu}\, e_c$ to the column vectors $X^c$ (and $-\sqrt{\mu}\, e_c$ to
$-X^c$), padding $\tilde y$ with $d$ zeros accordingly.


4 Numerical Experiments
In this section we compare the performance of the LIBSVM approach for
solving ρ-CL with two well-known, state-of-the-art algorithms for λ-UL: FISTA
and GLMNet. We first discuss their expected complexity.
FISTA (Fast ISTA; [4]) combines the basic iterations of the Iterative Shrinkage-Thresholding
Algorithm (ISTA) with a Nesterov acceleration step. Assuming
that the covariance $X^t X$ is precomputed at a fixed initial cost $O(Nd^2)$, the
cost per iteration of FISTA is $O(d^2)$, i.e., that of computing $X^t X\beta$. If only $m$
components of β are non-zero, the iteration cost is $O(dm)$. On the other hand,
GLMNet performs cyclic coordinate subgradient descent on the λ-UL cost function.
GLMNet carefully manages the covariance computations $X^j \cdot X^k$, ensuring
that there is an $O(Nd)$ cost only the first time they are performed at a given
coordinate $k$ and that afterwards their cost is just $O(d)$. Note that an iteration of
GLMNet changes just one coordinate, while a FISTA iteration changes $d$ components.
Finally, the ν-SVC solver in LIBSVM performs SMO-like iterations that change
two coordinates, the pair that most violates the ν-SVC KKT conditions. The
cost per iteration is also $O(d)$ plus the time needed to compute the required
dot products, i.e., linear kernel operations. LIBSVM builds the kernel matrix
making no assumptions on the particular structure of the data. This may
eventually lead it to compute $4d^2$ dot products (without considering the last column)
even though only $d^2$ are actually needed, since the $2d \times 2d$ linear kernel
matrix $\widehat X^t \widehat X$ is made of four $d \times d$ blocks, with the covariance matrix $C = X^t X$
in the two diagonal blocks and $-C$ in the other two. This could certainly be
controlled, but in our experiments we have used LIBSVM directly, without any
attempt to optimize running times by exploiting the structure of the kernel matrix.
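The block structure of the kernel matrix is easy to verify with a few lines of NumPy (illustrative only; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 100, 8
X = rng.standard_normal((N, d))

X_hat = np.hstack([X, -X])     # the 2d columns handed to the SVM solver
K = X_hat.T @ X_hat            # 2d x 2d linear kernel matrix
C = X.T @ X                    # d x d covariance: the only distinct block

assert np.allclose(K[:d, :d], C) and np.allclose(K[d:, d:], C)     # diagonal blocks
assert np.allclose(K[:d, d:], -C) and np.allclose(K[d:, :d], -C)   # off-diagonal blocks
```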
We next compare the number of iterations and running times that FISTA,
GLMNet and LIBSVM need to solve equivalent λ-UL and ρ-CL problems. We
do so over four datasets from the UCI collection, namely the relatively
small prostate and housing, and the larger year and ctscan. As is well
known, training complexity greatly depends on λ. We will consider three possible
λ values for UL: an optimal λ* obtained according to the regularization path
procedure of [3], a smaller λ*/2 value (which should result in longer training) and a
stricter 2λ* penalty. The optimal λ* values for the problems considered are
$2.04 \cdot 10^{-3}$, $8.19 \cdot 10^{-3}$, $6.87 \cdot 10^{-3}$ and $3.75 \cdot 10^{-3}$, respectively. The corresponding ρ
parameters are computed as $\rho = \|\beta_\lambda\|_1$, with $\beta_\lambda$ the optimal solution of λ-UL.
To make a balanced comparison, for each λ and dataset we first make a
long run of each method M so that it converges to a $\beta_M^*$ that we take as the
optimum. We then track for each M the evolution of $f_M(\beta_M^k) - f_M(\beta_M^*)$,
with $\beta_M^k$ the coefficients at the k-th iteration of method M, until it falls below
a threshold ε, which we take as $10^{-12}$ for prostate and housing and $10^{-6}$ for
year and ctscan. Table 1 shows in columns 2 to 4 the number of iterations that
each method requires to reach the ε threshold. As can be seen, LIBSVM
needs the fewest iterations, with GLMNet in second place and FISTA a more
distant third (for a proper comparison, a FISTA iterate is made to correspond to
d iterates of LIBSVM and GLMNet).


                        Iterations                    Time (ms)
Dataset            LIBSVM   GLMNet     FISTA      LIBSVM    GLMNet
prostate (2λ*)         86      181     1,232       0.051     0.053
prostate (λ*)          75      174     1,336       0.049     0.057
prostate (λ*/2)        52      173       904       0.036     0.075
housing  (2λ*)        143    1,010     5,538       0.266     0.832
housing  (λ*)         112      933     5,538       0.205     0.749
housing  (λ*/2)        88      666     3,913       0.190     0.520
year     (2λ*)        760    5,584     8,550     1,528.9     558.2
year     (λ*)         576    6,123     8,460     1,371.8     590.7
year     (λ*/2)       392    6,393     8,640     1,140.6     626.0
ctscan   (2λ*)        972   92,390   190,080    23,935.4   8,823.9
ctscan   (λ*)         670   78,456   159,744    15,868.2   6,935.8
ctscan   (λ*/2)       418   53,630   132,480    14,999.5   4,173.7

Table 1: Iterations and running times.


Columns 5 and 6 give the times required by LIBSVM and GLMNet; we omit
FISTA's times as they are not competitive (at least under the implementation
we used). We use the LIBSVM and GLMNet implementations in the scikit-learn
Python library, which both have a compiled C core, so we may expect the time
comparisons to be broadly homogeneous.
Even without considering LIBSVM's overhead when computing a 2d × 2d
kernel matrix, it is clearly faster on prostate and housing but not so on the
other two datasets. However, it turns out that LIBSVM indeed computes about
4 times more dot products than needed, which suggests that a LIBSVM version
properly adapted for ρ-CL should have running times about a fourth of those
reported here, thus outperforming GLMNet. This behavior is further illustrated
in Fig. 1, which depicts, for housing and ctscan with λ*, the number of iterations
and running times required to reach the ε threshold.

5 Discussion and Conclusions


GLMNet can be considered the current state of the art for solving the Lasso
problem. M. Jaggi's recent observation that ρ-CL can be reduced to ν-SVC opens
the way to the application of SVM methods. In this work we have shown, using
four examples, how the ν-SVC option of LIBSVM is faster than GLMNet
for small problems, and pointed out how properly adapted versions could also beat
it on larger problems. Devising such an adaptation is thus a first line of further work.
Moreover, Lasso is at the core of many other problems in convex regularization,
such as Fused Lasso, wavelet smoothing or trend filtering, currently solved
using specialized algorithms. A ν-SVC approach could provide for them the
same faster convergence that we have illustrated here for standard Lasso. Furthermore,
ν-SVC training could be sped up, as suggested in [5], by using the
screening rules available for Lasso [10] to remove columns of the data
matrix. We are working on these and other related questions.


[Figure 1: objective error (log scale) versus iterations (left panels) and versus time in ms (right panels), for housing (top) and ctscan (bottom), comparing FISTA, GLMNet and SVM (LIBSVM).]

Fig. 1: Results for housing and ctscan.

References
[1] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal
Statistical Society, Series B, 58(1):267–288, 1996.
[2] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal
of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
[3] J. H. Friedman, T. Hastie, and R. Tibshirani. Regularization Paths for Generalized Linear
Models via Coordinate Descent. Journal of Statistical Software, 33(1):1–22, 2010.
[4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear
inverse problems. SIAM J. Img. Sci., 2(1):183–202, 2009.
[5] M. Jaggi. An Equivalence between the Lasso and Support Vector Machines. In Regular-
ization, Optimization, Kernels, and Support Vector Machines. CRC Press, 2014.
[6] Q. Zhou, W. Chen, S. Song, J. R. Gardner, K. Q. Weinberger, and Y. Chen. A Reduction
of the Elastic Net to Support Vector Machines with an Application to GPU Computing.
Technical Report arXiv:1409.1976 [stat.ML], 2014.
[7] J. López, Á. Barbero, and J.R. Dorronsoro. Clipping Algorithms for Solving the Nearest
Point Problem over Reduced Convex Hulls. Pattern Recognition, 44(3):607–614, 2011.
[8] C.-C. Chang and C.-J. Lin. Training ν-Support Vector Classifiers: Theory and Algo-
rithms. Neural Computation, 13(9):2119–2147, 2001.
[9] C.-C. Chang and C.-J. Lin. LIBSVM: a Library for Support Vector Machines.
http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[10] J. Wang, J. Zhou, P. Wonka, and J. Ye. Lasso screening rules via dual polytope projec-
tion. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger,
editors, Advances in Neural Information Processing Systems 26, pages 1070–1078. Curran
Associates, Inc., 2013.
