ESANN 2015 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence
and Machine Learning. Bruges (Belgium), 22-24 April 2015, i6doc.com publ., ISBN 978-287587014-8.
Available from http://www.i6doc.com/en/.
Solving Constrained Lasso and Elastic Net Using ν-SVMs

Carlos M. Alaíz, Alberto Torres and José R. Dorronsoro∗
Universidad Autónoma de Madrid - Departamento de Ingeniería Informática
Tomás y Valiente 11, 28049 Madrid - Spain
Abstract. Many important linear sparse models have at their core the Lasso problem, for which the GLMNet algorithm is often considered the current state of the art. Recently, M. Jaggi observed that Constrained Lasso (CL) can be reduced to an SVM-like problem, which opens the way to using efficient SVM algorithms to solve CL. We refine Jaggi's arguments to reduce CL, as well as constrained Elastic Net, to a Nearest Point Problem and show experimentally that the well-known LIBSVM library converges faster than GLMNet for small problems and also, if properly adapted, for larger ones.
      1    Introduction
Big Data problems put a strong emphasis on using simple linear sparse models to handle samples of large size and dimension. This has made Lasso [1] and Elastic Net [2] often the methods of choice in large scale Machine Learning. Both are usually stated in their unconstrained version, namely solving
$$\min_\beta \; U_{\lambda,\mu}(\beta) = \frac{1}{2}\|X\beta - y\|_2^2 + \frac{\mu}{2}\|\beta\|_2^2 + \lambda\|\beta\|_1\,, \qquad (1)$$
where for an $N$-size sample $S = \{(x^1, y^1), \ldots, (x^N, y^N)\}$, $X$ is an $N \times d$ data matrix containing as its rows the transposes of the $d$-dimensional features $x^n$, $y = (y^1, \ldots, y^N)^t$ is the $N \times 1$ target vector, β denotes the Lasso (when µ = 0) or Elastic Net model coefficients, and the subscripts 1, 2 denote the $\ell_1$ and $\ell_2$ norms, respectively. We will refer to problem (1) as λ-Unconstrained Lasso (or λ-UL), writing then simply $U_\lambda$, or as (µ, λ)-Unconstrained Elastic Net (or (µ, λ)-UEN).
It has an equivalent constrained formulation
$$\min_\beta \; C_{\rho,\mu}(\beta) = \frac{1}{2}\|X\beta - y\|_2^2 + \frac{\mu}{2}\|\beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \le \rho\,. \qquad (2)$$
Again, we will refer to problem (2) as ρ-CL, and write $C_\rho$, or as (µ, ρ)-CEN. Two algorithmic approaches stand out when solving (1): the GLMNet algorithm [3], which uses cyclic coordinate descent, and the FISTA algorithm [4], based on proximal optimization. Recently, M. Jaggi [5] has shown the equivalence between
∗ With partial support from Spain's grants TIN2013-42351-P and S2013/ICE-2845 CASI-CAM-CM, and also from the Cátedra UAM–ADIC in Data Science and Machine Learning. The authors also gratefully acknowledge the use of the facilities of Centro de Computación Científica (CCC) at UAM. The second author is also supported by the FPU–MEC grant AP-2012-5163.
      constrained Lasso and some SVM variants. This is relatively simple when going
      from Lasso to SVM (the direction we are interested in) but more involved the
other way around. In [6], Jaggi's approach is refined to obtain a one-way reduction from constrained Elastic Net to a squared hinge loss SVM problem. As mentioned in [5], this opens the way for advances in one problem to be translated into advances in the other and, while the application of SVM solvers to Lasso is not addressed in [5], prior work on parallelizing SVMs is leveraged in [6] to obtain a highly optimized, parallel solver for the Elastic Net and Lasso.
In this paper we revisit M. Jaggi's approach, first to simplify and refine the reductions in [5] and [6] and, then, to explore the efficiency of the SVM solver LIBSVM when applied to Lasso. More precisely, our contributions are:
• A simple reduction in Sect. 2 of Elastic Net to Lasso and a proof of the equivalence between λ-UL and ρ-CL. This is a known result but proofs are hard to find, so our simple argument may have value in itself.
• A refinement in Sect. 3 of M. Jaggi's arguments to reduce ρ-CL to Nearest Point and ν-SVC problems solvable using LIBSVM.
          • An experimental comparison in Sect. 4 of NPP-LIBSVM with GLMNet
            and FISTA, showing NPP-LIBSVM to be competitive with GLMNet.
      The paper will close with a brief discussion and pointers to further work.
      2    Unconstrained and Constrained Lasso and Elastic Net
We show that λ-UL and ρ-CL, and also (µ, λ)-UEN and (µ, ρ)-CEN, have the same β solution for appropriate choices of λ and ρ. We will first reduce our
      discussion to the Lasso case describing how Elastic Net (EN) can be recast as
      a pure Lasso problem over an enlarged data matrix and target vector. For
this, let us consider in either problem (1) or (2) the $(N+d)\times d$ matrix $\mathcal{X}$ and the $(N+d)\times 1$ vector $\mathcal{Y}$ defined as $\mathcal{X}^t = (X^t, \sqrt{\mu}\, I_d)$, $\mathcal{Y}^t = (y^t, 0_d^t)$, with $I_d$ the $d \times d$ identity matrix and $0_d$ the $d$-dimensional zero vector. It is now easy to check that $\|X\beta - y\|_2^2 + \mu\|\beta\|_2^2 = \|\mathcal{X}\beta - \mathcal{Y}\|_2^2$ and, thus, we can rewrite (1) or (2) as Lasso problems over an extended sample
$$\mathcal{S} = \{(x^1, y^1), \ldots, (x^N, y^N), (\sqrt{\mu}\, e_1, 0), \ldots, (\sqrt{\mu}\, e_d, 0)\}$$
of size $N + d$, where $e_i$ denotes the $i$-th canonical vector of $\mathbb{R}^d$, and for which $\mathcal{X}$ and $\mathcal{Y}$ are the new data matrix and target vector. Because of this, in what follows we limit our discussion to UL and CL, pointing out where needed the changes to be made for UEN and CEN, and reverting to the $(X, y)$ notation instead of $(\mathcal{X}, \mathcal{Y})$.
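As a sanity check of this reduction, the following minimal Python sketch (our own illustration, not code from the paper) builds the extended matrix and target and verifies the identity $\|X\beta - y\|_2^2 + \mu\|\beta\|_2^2 = \|\mathcal{X}\beta - \mathcal{Y}\|_2^2$ numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, mu = 50, 10, 0.3

X = rng.standard_normal((N, d))
y = rng.standard_normal(N)

# Extended data matrix and target of the EN -> Lasso reduction:
# X_ext is (N + d) x d and y_ext is (N + d)-dimensional.
X_ext = np.vstack([X, np.sqrt(mu) * np.eye(d)])
y_ext = np.concatenate([y, np.zeros(d)])

# The EN quadratic terms equal the plain Lasso quadratic term on the
# extended sample, for any coefficient vector beta.
beta = rng.standard_normal(d)
lhs = np.sum((X @ beta - y) ** 2) + mu * np.sum(beta ** 2)
rhs = np.sum((X_ext @ beta - y_ext) ** 2)
assert np.isclose(lhs, rhs)
```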
Assuming λ fixed, it is obvious that a minimizer $\beta_\lambda$ of λ-UL is also a minimizer of $\rho_\lambda$-CL with $\rho_\lambda = \|\beta_\lambda\|_1$. Next, if $\beta_{LR}$ is the solution of Linear Regression and $\rho_{LR} = \|\beta_{LR}\|_1$, the solution of ρ-CL is clearly $\beta_{LR}$ for $\rho \ge \rho_{LR}$, so we assume $\rho < \rho_{LR}$. Let us denote by $\beta_\rho$ a minimizer of the convex problem ρ-CL; we shall
prove next that $\beta_\rho$ solves $\lambda_\rho$-UL with $\lambda_\rho = \beta_\rho^t X^t (y - X\beta_\rho)/\rho$. To do so, let $e_2(\beta) = \frac{1}{2}\|X\beta - y\|_2^2$ and $g(\beta) = \|\beta\|_1 - \rho$, and define
$$f(\beta) = \max\{e_2(\beta) - e_2^\rho, \; g(\beta)\}\,,$$
where $e_2^\rho = e_2(\beta_\rho)$ is the optimal square error in ρ-CL. Since $f$ is convex and both terms of the maximum vanish at $\beta_\rho$, its subdifferential there is
$$\partial f(\beta_\rho) = \{\gamma \nabla e_2(\beta_\rho) + (1 - \gamma)h \,:\, 0 \le \gamma \le 1, \; h \in \partial g(\beta_\rho) = \partial\|\cdot\|_1(\beta_\rho)\}\,.$$
Thus, as $\beta_\rho$ minimizes $f$ (if $\|\beta'\|_1 \ge \rho$ then $g(\beta') \ge 0$, while if $\|\beta'\|_1 < \rho$ then $\beta'$ is feasible for ρ-CL and $e_2(\beta') \ge e_2^\rho$; hence $f \ge 0 = f(\beta_\rho)$ everywhere), there are $h_\rho \in \partial\|\cdot\|_1(\beta_\rho)$ and $\gamma$, $0 \le \gamma \le 1$, such that $0 = \gamma\nabla e_2(\beta_\rho) + (1 - \gamma)h_\rho$. Moreover, $\gamma > 0$, for otherwise we would have $h_\rho = 0$, i.e., $\beta_\rho = 0$, contradicting $\|\beta_\rho\|_1 = \rho$. Therefore, taking $\lambda_\rho = (1 - \gamma)/\gamma$, we have
$$0 = \nabla e_2(\beta_\rho) + \lambda_\rho h_\rho \in \partial\big[e_2(\cdot) + \lambda_\rho \|\cdot\|_1\big](\beta_\rho) = \partial U_{\lambda_\rho}(\beta_\rho)\,,$$
i.e., $\beta_\rho$ minimizes $\lambda_\rho$-UL. We finally derive a value for $\lambda_\rho$ by observing that $\beta_\rho^t h_\rho = \|\beta_\rho\|_1 = \rho$ and $0 = \nabla e_2(\beta_\rho) + \lambda_\rho h_\rho = -X^t r_{\beta_\rho} + \lambda_\rho h_\rho$, where $r_{\beta_\rho}$ is just the residual $y - X\beta_\rho$. As a result,
$$\lambda_\rho \|\beta_\rho\|_1 = \lambda_\rho\, \beta_\rho^t h_\rho = \beta_\rho^t X^t r_{\beta_\rho} \;\Longrightarrow\; \lambda_\rho = \frac{\beta_\rho^t X^t r_{\beta_\rho}}{\|\beta_\rho\|_1} = \frac{\beta_\rho^t X^t r_{\beta_\rho}}{\rho}\,.$$
For (µ, ρ)-CEN, the corresponding $\lambda_\rho$ would be
$$\lambda_\rho = \frac{(\mathcal{Y} - \mathcal{X}\beta_\rho)\cdot \mathcal{X}\beta_\rho}{\rho} = \frac{(y - X\beta_\rho)\cdot X\beta_\rho - \mu\|\beta_\rho\|_2^2}{\rho}\,.$$
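The mapping from a constrained solution back to the equivalent unconstrained penalty is a one-liner; the sketch below (a hypothetical helper of ours) implements the formulas above, with µ = 0 recovering the plain ρ-CL case and assuming $\beta_\rho$ has been obtained from some CL/CEN solver.

```python
import numpy as np

def lambda_from_rho(X, y, beta_rho, mu=0.0):
    """Penalty of the lambda-UL / (mu, lambda)-UEN problem equivalent to the
    rho-CL / (mu, rho)-CEN problem solved by beta_rho."""
    rho = np.sum(np.abs(beta_rho))       # active constraint: ||beta_rho||_1 = rho
    residual = y - X @ beta_rho          # r = y - X beta_rho
    return (beta_rho @ (X.T @ residual) - mu * np.sum(beta_rho ** 2)) / rho
```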
      3    From Constrained Lasso to NPP
The basic idea in [5] is to re-scale $\tilde\beta = \beta/\rho$ and $\tilde y = y/\rho$ to normalize ρ-CL into 1-CL and then to replace the $\ell_1$ unit ball $B_1$ with the simplex
$$\Delta_{2d} = \Big\{\alpha = (\alpha_1, \ldots, \alpha_{2d}) \in \mathbb{R}^{2d} : 0 \le \alpha_i \le 1, \ \sum_{i=1}^{2d} \alpha_i = 1\Big\}\,,$$
by enlarging the initial data matrix $X$ to an $N \times 2d$ dimensional $\hat X$ with columns $\hat X^c = X^c$ and $\hat X^{d+c} = -X^c$, $c = 1, \ldots, d$. As a consequence, the square error of the re-scaled Lasso, $\|X\tilde\beta - \tilde y\|^2$, becomes $\|\hat X\alpha - \tilde y\|^2$. Since $\hat X\alpha$ now lies in the convex hull $C$ spanned by $\{\hat X^c, 1 \le c \le 2d\}$, finding an optimal $\alpha^o$ is equivalent to finding the point in $C$ closest to $\tilde y$, i.e., solving the nearest point problem (NPP) between the convex hull $C \subset \mathbb{R}^N$ and the singleton $\{\tilde y\}$. In turn, NPP can be scaled [7] into an equivalent linear ν-SVC problem [8] with $\nu = 2/(2d+1)$, over the sample $S = \{S_+, S_-\}$, where $S_+ = \{X^1, \ldots, X^d, -X^1, \ldots, -X^d\}$ and $S_- = \{\tilde y\}$. This problem can be solved using the LIBSVM library [9]. The optimal ρ-CL solution $\beta_\rho$ is recovered as follows: first we obtain the optimal NPP coefficients $\alpha^o$ by scaling the ν-SVM solution $\gamma^o$ as $\alpha^o = (2d+1)\gamma^o$, then we compute $(\tilde\beta_\rho)_i = \alpha^o_i - \alpha^o_{i+d}$, and finally we re-scale back to $\beta_\rho = \rho\,\tilde\beta_\rho$.
Finally, the changes for (µ, ρ)-CEN are straightforward, since we only have to append the $d$ extra dimensions $\sqrt{\mu}\, e_c$ and $-\sqrt{\mu}\, e_c$ to the column vectors $X^c$ and $-X^c$, respectively.
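To make the change of variables concrete, here is a minimal sketch (ours, for illustration only) that builds $\hat X$, solves the simplex-constrained least squares problem with plain Frank–Wolfe iterations instead of the NPP/ν-SVC route through LIBSVM used in the paper, and recovers $\beta_\rho$ as described above.

```python
import numpy as np

def constrained_lasso_simplex(X, y, rho, n_iter=5000):
    """Illustrative Frank-Wolfe solver for the simplex reformulation of rho-CL."""
    N, d = X.shape
    X_hat = np.hstack([X, -X])                 # columns X^c and -X^c
    y_tilde = y / rho                          # re-scaled target of 1-CL
    alpha = np.full(2 * d, 1.0 / (2 * d))      # start at the barycenter of Delta_2d
    for k in range(n_iter):
        grad = X_hat.T @ (X_hat @ alpha - y_tilde)
        s = np.zeros(2 * d)
        s[np.argmin(grad)] = 1.0               # linear minimization over the simplex
        alpha += 2.0 / (k + 2) * (s - alpha)   # standard Frank-Wolfe step size
    beta_tilde = alpha[:d] - alpha[d:]         # back to the l1 ball
    return rho * beta_tilde                    # undo the re-scaling
```

The same sketch covers (µ, ρ)-CEN by simply feeding it the extended columns described above.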
      4    Numerical Experiments
      In this section we will compare the performance of the LIBSVM approach for
      solving ρ-CL with two well known, state-of-the-art algorithms for λ-UL, FISTA
      and GLMNet. We first discuss their expected complexity.
FISTA (Fast ISTA; [4]) combines the basic iterations of the Iterative Shrinkage–Thresholding Algorithm (ISTA) with a Nesterov acceleration step. Assuming that the covariance $X^t X$ is precomputed at a fixed initial cost $O(Nd^2)$, the cost per iteration of FISTA is $O(d^2)$, i.e., that of computing $X^t X\beta$. If only $m$ components of β are non-zero, the iteration cost is $O(dm)$. On the other hand, GLMNet performs cyclic coordinate descent on the λ-UL cost function. GLMNet carefully manages the covariance computations $X^j \cdot X^k$, ensuring that their $O(Nd)$ cost is paid only the first time they are needed at a given coordinate $k$ and that afterwards their cost is just $O(d)$. Note that an iteration in
      GLMNet just changes one coordinate while in FISTA it changes d components.
Finally, the ν-SVC solver in LIBSVM performs SMO-like iterations that change two coordinates, namely the pair that most violates the ν-SVC KKT conditions. The cost per iteration is also $O(d)$ plus the time needed to compute the required dot products, i.e., linear kernel operations. LIBSVM builds the kernel matrix making no assumptions on the particular structure of the data. This may lead to computing up to $4d^2$ dot products (not counting those involving $\tilde y$) even though only $d^2$ are actually needed, since the $2d \times 2d$ linear kernel matrix $\hat X^t \hat X$ is made of four $d \times d$ blocks, with the covariance matrix $C = X^t X$ in the two diagonal blocks and $-C$ in the other two. This can certainly be controlled but, in our experiments, we have used LIBSVM directly, without any attempt to optimize running times by exploiting the structure of the kernel matrix.
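As a sketch of the adaptation hinted at here (ours, not part of LIBSVM), the whole $2d \times 2d$ linear kernel can be assembled from the single $d \times d$ matrix $C = X^t X$, avoiding the fourfold redundancy:

```python
import numpy as np

def block_kernel(X):
    """Linear kernel among the columns of X_hat = [X, -X], built from C = X^T X."""
    C = X.T @ X
    return np.block([[C, -C], [-C, C]])

# Sanity check against the naive column-by-column computation.
X = np.random.default_rng(0).standard_normal((20, 5))
X_hat = np.hstack([X, -X])
assert np.allclose(block_kernel(X), X_hat.T @ X_hat)
```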
          We compare next the number of iterations and running times that FISTA,
      GLMNet and LIBSVM need to solve equivalent λ-UL and ρ-CL problems. We
      will do so over four datasets from the UCI collection, namely the relatively
small prostate and housing, and the larger year and ctscan. As is well known, training complexity greatly depends on λ. We will consider three λ values for UL: an optimal λ∗ obtained according to the regularization path procedure of [3], a smaller λ∗/2 value (which should result in longer training) and a stricter 2λ∗ penalty. The optimal λ∗ values for the problems considered are $2.04 \times 10^{-3}$, $8.19 \times 10^{-3}$, $6.87 \times 10^{-3}$ and $3.75 \times 10^{-3}$, respectively. The corresponding ρ parameters are computed as $\rho = \|\beta_\lambda\|_1$, with $\beta_\lambda$ the optimal solution of λ-UL.
          To make a balanced comparison, for each λ and dataset we first make a
long run of each method $M$ so that it converges to a $\beta_M^*$ that we take as the optimum. We then compare for each $M$ the evolution of $f_M(\beta_M^k) - f_M(\beta_M^*)$, with $\beta_M^k$ the coefficients at the $k$-th iteration of method $M$, until it is smaller than a threshold $\epsilon$ that we take as $10^{-12}$ for prostate and housing and $10^{-6}$ for
year and ctscan. Table 1 shows in columns 2 to 4 the number of iterations that each method requires to reach the $\epsilon$ threshold. As can be seen, LIBSVM is the fastest on this account, with GLMNet in second place and FISTA a more distant third (for a proper comparison, a FISTA iterate is made to correspond to $d$ iterates of LIBSVM and GLMNet).
Dataset           |       Iterations          |     Time (ms)
                  | LIBSVM   GLMNet    FISTA  |  LIBSVM    GLMNet
prostate (2λ∗)    |     86      181    1,232  |   0.051     0.053
prostate (λ∗)     |     75      174    1,336  |   0.049     0.057
prostate (λ∗/2)   |     52      173      904  |   0.036     0.075
housing (2λ∗)     |    143    1,010    5,538  |   0.266     0.832
housing (λ∗)      |    112      933    5,538  |   0.205     0.749
housing (λ∗/2)    |     88      666    3,913  |   0.190     0.520
year (2λ∗)        |    760    5,584    8,550  | 1,528.9     558.2
year (λ∗)         |    576    6,123    8,460  | 1,371.8     590.7
year (λ∗/2)       |    392    6,393    8,640  | 1,140.6     626.0
ctscan (2λ∗)      |    972   92,390  190,080  | 23,935.4  8,823.9
ctscan (λ∗)       |    670   78,456  159,744  | 15,868.2  6,935.8
ctscan (λ∗/2)     |    418   53,630  132,480  | 14,999.5  4,173.7

Table 1: Iterations and running times.
Columns 5 and 6 give the times required by LIBSVM and GLMNet; we omit FISTA's times as they are not competitive (at least under the implementation we used). We use the LIBSVM and GLMNet implementations in the scikit-learn Python library, both of which have a compiled C core, so we may expect the time comparisons to be broadly homogeneous.
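For reference, the iteration counts of Table 1 correspond to the following simple measurement, where `objective_values` stands for the recorded sequence $f_M(\beta_M^k)$ of a method and `f_star` for its long-run optimum $f_M(\beta_M^*)$ (hypothetical inputs, sketched here only to fix the criterion).

```python
def iterations_to_threshold(objective_values, f_star, eps):
    """First iteration k at which f(beta_k) - f(beta_*) drops below eps."""
    for k, f_k in enumerate(objective_values):
        if f_k - f_star < eps:
            return k
    return None  # threshold never reached in the recorded run
```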
Even without discounting LIBSVM's overhead in computing a $2d \times 2d$ kernel matrix, it is clearly faster on prostate and housing, although not on the other two datasets. However, it turns out that LIBSVM indeed computes about 4 times more dot products than needed, which suggests that a LIBSVM version properly adapted for ρ-CL should have running times about a fourth of those reported here, thus outperforming GLMNet. This behavior is further illustrated in Fig. 1, which depicts, for housing and ctscan with λ∗, the number of iterations and running times required to reach the $\epsilon$ threshold.
      5    Discussion and Conclusions
      GLMNet can be considered the current state of the art to solve the Lasso prob-
      lem. M. Jaggi’s recent observation that ρ-CL can be reduced to ν-SVC opens
the way to the application of SVM methods. In this work we have shown, on four datasets, that the ν-SVC option of LIBSVM is faster than GLMNet for small problems, and we have pointed out how adapted versions could also beat it on larger problems. Devising such an adaptation is thus a first line of further work.
          Moreover, Lasso is at the core of many other problems in convex regulariza-
      tion, such as Fused Lasso, wavelet smoothing or trend filtering, currently solved
using specialized algorithms. A ν-SVC approach could provide for them the same fast convergence that we have illustrated here for standard Lasso. Furthermore, the ν-SVC training could be sped up, as suggested in [5], by using the screening rules available for Lasso [10] in order to remove columns of the data
      matrix. We are working on these and other related questions.
[Figure: objective value (log scale) versus number of iterations (left panels: FISTA, GLMNet and SVM) and versus running time in ms (right panels: GLMNet and SVM), for housing (top) and ctscan (bottom).]

Fig. 1: Results for housing and ctscan.
      References
[1] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
       [2] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal
           of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
       [3] J. H. Friedman, T. Hastie, and R. Tibshirani. Regularization Paths for Generalized Linear
           Models via Coordinate Descent. Journal of Statistical Software, 33(1):1–22, 2010.
       [4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear
           inverse problems. SIAM J. Img. Sci., 2(1):183–202, 2009.
       [5] M. Jaggi. An Equivalence between the Lasso and Support Vector Machines. In Regular-
           ization, Optimization, Kernels, and Support Vector Machines. CRC Press, 2014.
       [6] Q. Zhou, W. Chen, S. Song, J. R. Gardner, K. Q. Weinberger, and Y. Chen. A Reduction
           of the Elastic Net to Support Vector Machines with an Application to GPU Computing.
           Technical Report arXiv:1409.1976 [stat.ML], 2014.
       [7] J. López, Á. Barbero, and J.R. Dorronsoro. Clipping Algorithms for Solving the Nearest
           Point Problem over Reduced Convex Hulls. Pattern Recognition, 44(3):607–614, 2011.
       [8] C.-C. Chang and C.-J. Lin. Training ν-Support Vector Classifiers: Theory and Algo-
           rithms. Neural Computation, 13(9):2119–2147, 2001.
[9] C.-C. Chang and C.-J. Lin. LIBSVM: a Library for Support Vector Machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm.
      [10] J. Wang, J. Zhou, P. Wonka, and J. Ye. Lasso screening rules via dual polytope projec-
           tion. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger,
           editors, Advances in Neural Information Processing Systems 26, pages 1070–1078. Curran
           Associates, Inc., 2013.