                Lecture 9: Embedded Methods

            Isabelle Guyon   guyoni@inf.ethz.ch
            André Elisseeff  AEL@zurich.ibm.com

                Chapter 5: Embedded methods

         Filters, Wrappers, and Embedded Methods

[Figure: three feature selection strategies.
 Filter:   all features -> filter -> feature subset -> predictor.
 Wrapper:  all features -> multiple feature subsets -> wrapper -> feature subset -> predictor.
 Embedded: all features -> embedded method -> feature subset and predictor obtained jointly.]
                         Filters

Methods:
  • Criterion: Measure feature/feature subset "relevance"
  • Search: Usually order features (individual feature ranking or
    nested subsets of features)
  • Assessment: Use statistical tests

Results:
  • Are (relatively) robust against overfitting
  • May fail to select the most "useful" features

                        Wrappers

Methods:
  • Criterion: Measure feature subset "usefulness"
  • Search: Search the space of all feature subsets
  • Assessment: Use cross-validation

Results:
  • Can in principle find the most "useful" features, but
  • Are prone to overfitting
                   Embedded Methods

Methods:
  • Criterion: Measure feature subset "usefulness"
  • Search: Search guided by the learning process
  • Assessment: Use cross-validation

Results:
  • Similar to wrappers, but
  • Less computationally expensive
  • Less prone to overfitting

                  Three "Ingredients"

[Figure: the three ingredients of a feature selection method.
 Criterion:  single feature relevance, relevance in context,
             feature subset relevance, performance of the learning machine.
 Search:     single feature ranking; nested subsets
             (forward selection / backward elimination);
             heuristic or stochastic search; exhaustive search.
 Assessment: statistical tests, cross-validation, performance bounds.]
                  Forward Selection

[Figure: forward selection builds a sequence of nested feature
 subsets, one feature added at each step (labels: Start, n, n-1,
 n-2, …, 1).]

Guided search: we do not consider alternative paths.

              Forward Selection with GS

Stoppiglia, 2002. Gram-Schmidt orthogonalization.
  • Select a first feature Xν(1) with maximum cosine with the
    target: cos(xi, y) = xi·y / (||xi|| ||y||)
  • For each remaining feature Xi:
     – Project Xi and the target Y on the null space of the
       features already selected
     – Compute the cosine of Xi with the target in the projection
  • Select the feature Xν(k) with maximum cosine with the target
    in the projection.

Embedded method for the linear least square predictor.
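A minimal sketch of this Gram-Schmidt forward selection in Python/NumPy
(my own illustration, not Stoppiglia's original code; the function name
and the fixed number k of features to select are assumptions):

    import numpy as np

    def gram_schmidt_forward_selection(X, y, k):
        # Forward selection by Gram-Schmidt orthogonalization (sketch):
        # at each step, pick the feature with maximum |cosine| with the
        # (projected) target, then project the remaining features and
        # the target onto the null space of the selected feature.
        X = X.astype(float).copy()
        y = y.astype(float).copy()
        remaining = list(range(X.shape[1]))
        selected = []
        for _ in range(k):
            cosines = [abs(X[:, j] @ y) /
                       (np.linalg.norm(X[:, j]) * np.linalg.norm(y) + 1e-12)
                       for j in remaining]
            best = remaining[int(np.argmax(cosines))]
            selected.append(best)
            remaining.remove(best)
            u = X[:, best] / (np.linalg.norm(X[:, best]) + 1e-12)
            y = y - (u @ y) * u                          # project the target
            for j in remaining:
                X[:, j] = X[:, j] - (u @ X[:, j]) * u    # project remaining features
        return selected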
            Forward Selection w. Trees

  • Tree classifiers, like CART (Breiman, 1984) or C4.5 (Quinlan, 1993)

[Figure: all the data split first on f1, then on f2. At each step,
 choose the feature that "reduces entropy" most; work towards
 "node purity".]

                 Backward Elimination

[Figure: backward elimination traverses the nested subsets in the
 reverse order: start with all n features, then n-1, n-2, …]
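As a rough illustration of the tree as an embedded selector (my own
sketch, using scikit-learn's DecisionTreeClassifier rather than the
original CART/C4.5 implementations), one can keep only the features
the fitted tree actually splits on:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def tree_selected_features(X, y, max_depth=5):
        # Fit an entropy-based tree and return the indices of the
        # features used in at least one split, i.e. those chosen for
        # their entropy reduction during induction.
        tree = DecisionTreeClassifier(criterion="entropy", max_depth=max_depth)
        tree.fit(X, y)
        return np.flatnonzero(tree.feature_importances_ > 0)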
           Backward Elimination: RFE

RFE-SVM, Guyon, Weston, et al., 2002

Start with all the features.
  • Train a learning machine f on the current subset of features by
    minimizing a risk functional J[f].
  • For each (remaining) feature Xi, estimate, without retraining f,
    the change in J[f] resulting from the removal of Xi.
  • Remove the feature Xν(k) whose removal improves or least
    degrades J.

Embedded method for SVMs, kernel methods, neural nets.

                 OBD (LeCun et al., 1990)

[Figure: J[f] as a function of wi; moving the weight from its optimum
 wi* to 0 (Dwi = wi*) changes the risk by DJ ≅ ½ ∂²J/∂wi² (wi*)².]

DJ = Σi ∂J/∂wi Dwi + ½ Σi ∂²J/∂wi² (Dwi)² + cross-terms + O(||Dw||³)

Simple case: linear classifier + J a quadratic form of w ⇒ DJ ∝ wi²

RFE for ridge regression and SVM: remove the input with smallest wi².
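A minimal sketch of RFE for a binary problem, using scikit-learn's
LinearSVC as the learning machine and the wi² ranking above (function
and parameter names are my own; step controls how many features are
dropped per iteration):

    import numpy as np
    from sklearn.svm import LinearSVC

    def rfe_linear_svm(X, y, n_features_to_keep, step=1):
        # Recursive Feature Elimination: repeatedly retrain a linear SVM,
        # rank the remaining features by w_i**2, and drop the lowest-ranked
        # ones until only n_features_to_keep are left.
        remaining = list(range(X.shape[1]))
        while len(remaining) > n_features_to_keep:
            clf = LinearSVC(C=1.0, dual=False).fit(X[:, remaining], y)
            scores = clf.coef_.ravel() ** 2                    # DJ proportional to w_i^2
            n_drop = min(step, len(remaining) - n_features_to_keep)
            worst = set(np.argsort(scores)[:n_drop].tolist())  # smallest weights first
            remaining = [f for i, f in enumerate(remaining) if i not in worst]
        return remaining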
              Nested Subset Methods

  • Forward selection
  • Backward elimination
  • Feature ranking (filters)

[Figure: each method produces a sequence of nested feature subsets.]

              Complexity Comparison

Generalization_error ≤ Validation_error + ε(C / m)

  Method                                 Number of subsets tried   Complexity C
  Exhaustive search (wrapper)            2^n                       n
  Nested subsets (greedy wrapper)        n(n+1)/2                  log n
  Feature ranking or embedded methods    n                         log n

m: number of validation examples; n: number of features.
                  Scaling Factors

Idea: Transform a discrete space into a continuous space.

                σ = [σ1, σ2, σ3, σ4]

  • Discrete indicators of feature presence: σi ∈ {0, 1}
  • Continuous scaling factors: σi ∈ IR

Now we can do gradient descent!

       Embedded Methods (alternative definition)

  • Definition: an embedded feature selection method is a machine
    learning algorithm that returns a model using a limited number
    of features.

[Figure: training set -> learning algorithm -> output model.]
                     Examples

  • Forward selection with decision trees
  • Forward selection with Gram-Schmidt
  • Any algorithm producing a model on which a "sensitivity"
    analysis can be done:
     – Linear system: remove feature i if wi is smaller than a
       fixed value.
     – Others, e.g. parallelepipeds: remove a dimension whose width
       is below a fixed value.

Note: embedded methods use the specific structure of the model
returned by the algorithm to get the set of "relevant" features.

                  Design Strategies

  • As previously suggested: use tricks and intuition. Might work,
    but difficult. Can still produce very smart algorithms
    (decision trees).
  • Other means: interpret feature selection as a model selection
    problem. In that context, we are interested in finding the set
    of features such that the model is the "best".
         Feature selection as model selection - 1

  • Consider the following set of functions parameterized by α,
    where σ ∈ {0,1}^n represents the use (σi = 1) or rejection
    (σi = 0) of feature i:
       [equation omitted in the original slide; the illustration
        shows σ1 = 1, σ3 = 0]
  • Example (linear systems, α = w): [equation omitted in the
    original slide; the output is computed from the σ-scaled
    inputs only].

         Feature selection as model selection - 2

  • We are interested in finding α and σ such that the
    generalization error is minimized:
       [equation omitted in the original slide]
  • Sometimes we add a constraint: the number of non-zero σi's ≤ s0.

Problem: the generalization error is not known…
         Feature selection as model selection - 3

  • The generalization error is not known directly, but bounds can
    be used.
  • Most embedded methods minimize those bounds using different
    optimization strategies:
     – Add and remove features
     – Relaxation methods and gradient descent
     – Relaxation methods and regularization

Example of bounds (linear systems):
  [equations omitted in the original slide: one bound for the
   linearly separable case, one for the non-separable case]

         Feature selection as model selection - 4

  • How to minimize the error estimate with respect to α and σ?

Most approaches use the following method:
  [diagram omitted in the original slide]

This optimization is often done by relaxing the constraint
σ ∈ {0,1}^n to σ ∈ [0,1]^n.
              Add/Remove features 1

  • Many learning algorithms are cast into the minimization of some
    regularized functional:
       [equation omitted in the original slide: an empirical error
        term plus a regularization (capacity control) term]
  • What does G(σ) become if one feature is removed?
  • Sometimes, G can only increase… (e.g. SVM)

              Add/Remove features 2

  • It can be shown (under some conditions) that the removal of one
    feature induces a change in G proportional to:
       [equation omitted in the original slide, involving the
        gradient of f with respect to the i-th feature at point xk]
  • Example: SVMs → RFE (Ω(α) = Ω(w) = Σi wi²)
           Add/Remove features - RFE

  • Recursive Feature Elimination

[Diagram: alternate between
   (1) minimizing the estimate of R(α, σ) with respect to α, and
   (2) minimizing the estimate of R(α, σ) with respect to σ, under
       the constraint that only a limited number of features may be
       selected.]

           Add/Remove feature summary

  • Many algorithms can be turned into embedded methods for feature
    selection by using the following approach:
    1. Choose an objective function that measures how well the model
       returned by the algorithm performs.
    2. "Differentiate" this objective function (or do a sensitivity
       analysis) with respect to the σ parameter, i.e. how does its
       value change when one feature is removed and the algorithm is
       rerun?
    3. Select the features whose removal (resp. addition) induces the
       desired change in the objective function (i.e. minimize an
       error estimate, maximize alignment with the target, etc.).

What makes this an "embedded method" is the use of the structure of
the learning algorithm to compute the gradient and to search for /
weight the relevant features.
       Add/Remove features: when to stop

  • When would you stop selecting features?
     – When the objective function has reached a plateau?
        • What happens to the bound r²||w||² when features are
          removed?
     – Using a validation set?
        • What size should you consider?
     – Don't stop, just rank the features?

              Gradient descent - 1

  • How to minimize the error estimate with respect to α and σ?

Most approaches use the following method:
  [diagram omitted in the original slide]
  Would it make sense to perform just a gradient step on σ here too?
  Gradient step in [0,1]^n.
              Gradient descent - 2

Advantages of this approach:
  • Can be applied to non-linear systems (e.g. SVMs with Gaussian
    kernels).
  • Can mix the search for features with the search for an optimal
    regularization parameter and/or other kernel parameters.

Drawbacks:
  • Heavy computations.
  • Back to gradient-based machine learning algorithms (early
    stopping, initialization, etc.).

           Gradient descent summary

  • Many algorithms can be turned into embedded methods for feature
    selection by using the following approach (a sketch follows
    below):
    1. Choose an objective function that measures how well the model
       returned by the algorithm performs.
    2. Differentiate this objective function with respect to the σ
       parameter.
    3. Perform a gradient descent on σ. At each iteration, rerun the
       initial learning algorithm to compute its solution on the new
       scaled feature space.
    4. Stop when there are no more changes (or use early stopping,
       etc.).
    5. Threshold the σ values to get the list of features and retrain
       the algorithm on that subset of features.

The difference from the add/remove approach is the search strategy:
it still uses the inner structure of the learning model, but it
scales features rather than selecting them.
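As a toy illustration of steps 1-5 (my own sketch, not the algorithm
from the slides; the logistic-loss base learner, the l1-style penalty
on σ and the 0.5 threshold are assumptions), one can rerun a base
learner on the σ-scaled features at each iteration and take a
gradient step on σ in [0,1]^n:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def sigma_gradient_descent(X, y, n_steps=50, lr=0.5, lam=0.01):
        # Gradient descent on feature scaling factors sigma in [0,1]^n.
        n, d = X.shape
        yy = np.where(np.asarray(y) > 0, 1.0, -1.0)      # labels in {-1, +1}
        sigma = np.ones(d)
        for _ in range(n_steps):
            clf = LogisticRegression(max_iter=1000).fit(X * sigma, y)  # rerun the learner
            w = clf.coef_.ravel()
            z = clf.decision_function(X * sigma)
            g = -yy / (1.0 + np.exp(yy * z))             # d(logistic loss)/d(output)
            grad_sigma = (X * w).T @ g / n + lam         # +lam pushes sigma toward 0
            sigma = np.clip(sigma - lr * grad_sigma, 0.0, 1.0)  # relaxed constraint [0,1]^n
        selected = np.flatnonzero(sigma > 0.5)           # threshold, then retrain on subset
        return selected, sigma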
          Design strategies (revisited)

  • Directly minimize the number of features that an algorithm uses
    (focus on feature selection directly and forget the
    generalization error).
  • In the case of linear systems, feature selection can be
    expressed as:
       [objective omitted in the original slide]
       Subject to [constraints omitted in the original slide]

   Feature selection for linear systems is NP hard

  • Amaldi and Kann (1998) showed that the minimization problem
    related to feature selection for linear systems is NP hard: the
    minimum cannot be approximated within 2^(log^(1-ε) n) for any
    ε > 0, unless NP ⊆ DTIME(n^polylog(n)).
  • Is feature selection hopeless?
  • How can we approximate this minimization?
     Minimization of a sparsity function

  • Replace the count of non-zero weights by another objective
    function:
     – the l1 norm: Σi |wi|
     – a differentiable function: [formula omitted in the original
       slide]
  • Do the optimization directly!

                 The l1 SVM

  • The version of the SVM where the margin term ||w||² is replaced
    by the l1 norm Σi |wi| can be considered as an embedded method:
     – Only a limited number of weights will be non-zero (tends to
       remove redundant features).
     – This differs from the regular SVM, where redundant features
       are all included (non-zero weights).
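A minimal sketch of the l1 SVM used as an embedded selector, assuming
a binary problem and scikit-learn's LinearSVC with an l1 penalty
(squared hinge loss); the value of C, which controls the
sparsity/accuracy trade-off, and the threshold on |wi| are my own
choices:

    import numpy as np
    from sklearn.svm import LinearSVC

    def l1_svm_feature_selection(X, y, C=0.1):
        # Train an l1-regularized linear SVM; the l1 penalty drives many
        # weights exactly to zero, so the non-zero ones define the subset.
        clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=C)
        clf.fit(X, y)
        w = clf.coef_.ravel()                    # one weight per feature (binary case)
        return np.flatnonzero(np.abs(w) > 1e-8), w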
                 A note on SVM

  • Changing the regularization term has a strong impact on the
    generalization behavior…
  • Let w1 = (1,0), w2 = (0,1) and wλ = (1-λ)w1 + λw2 for λ ∈ [0,1].
    We have:
     – ||wλ||² = (1-λ)² + λ², which is minimized for λ = 1/2
     – |wλ|_1 = (1-λ) + λ = 1 for all λ

[Figure: λ² + (1-λ)² versus λ + (1-λ) = 1, plotted between the
 points w1 and w2.]

              The gradient descent

  • Perform a constrained gradient descent on:
       [objective omitted in the original slide]
    under the constraints:
       [constraints omitted in the original slide]
                A direct approach

  • Replace the count of non-zero weights by Σi log(ε + |wi|).
  • Same idea as the gradient descent, but using another
    approximation.
  • Boils down to the following multiplicative update:
       [update equation omitted in the original slide]

          Embedded method - summary

  • Embedded methods are a good inspiration to design new feature
    selection techniques for your own algorithms:
     – Find a functional that represents your prior knowledge about
       what a good model is.
     – Add the σ weights into the functional and make sure it is
       either differentiable or that you can perform a sensitivity
       analysis efficiently.
     – Optimize alternately with respect to α and σ.
     – Use early stopping (validation set) or your own stopping
       criterion to stop and to select the subset of features.
  • Embedded methods are therefore not too far from wrapper
    techniques and can be extended to multiclass, regression, etc.
                   Exercise Class

              Homework 8: Solution

  • Baseline model: 5% BER (trained on training data only)
  • Best challenge entries: ~3% BER
  • Tips to outperform the challengers:
     – Train on the (training + validation) set => double the
       number of examples
     – Vary the number of features
     – Select the best model by CV

my_classif=svc({'coef0=1', 'degree=1', 'gamma=0', 'shrinkage=0.5'});
my_model=chain({s2n('f_max=??? '), normalize, my_classif})
            Difficulty: Good CV

[Figure: BER (3-8%) versus number of features (0-4000) for different
 cross-validation folds: blue = cv5, red = cv10, green = cv15.]

  • Get 1 point if you make an entry with less than 5% error.
  • Get 2 points if you make an entry with less than 4% error.

            Filters Implemented

  • @s2n
  • @Relief
  • @Ttest
  • @Pearson  (uses Matlab corrcoef; gives the same results as
    Ttest when the classes are balanced)
  • @Ftest    (gives the same results as Ttest; important for the
    p-values: the Fisher criterion needs to be multiplied by
    num_patt_per_class, or use anovan)
  • @aucfs    (ranksum test)
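For reference, a minimal NumPy sketch of a signal-to-noise filter of
the @s2n kind (my own illustration, not the CLOP implementation; it
assumes binary labels and the usual |μ+ − μ−| / (σ+ + σ−) criterion):

    import numpy as np

    def s2n_ranking(X, y):
        # Signal-to-noise filter: rank features by the separation of the
        # two class means relative to the within-class standard deviations.
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        pos, neg = X[y > 0], X[y <= 0]
        s2n = np.abs(pos.mean(axis=0) - neg.mean(axis=0)) / (
            pos.std(axis=0) + neg.std(axis=0) + 1e-12)
        return np.argsort(-s2n)                  # best features first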
                  Exercise - 1

  • Consider the 1-nearest-neighbor algorithm. We define the
    following score:
       [equation omitted in the original slide]
  • where s(k) (resp. d(k)) is the index of the nearest neighbor of
    xk belonging to the same class (resp. a different class) as xk.

              Exercise - 1 (cont.)

  1. Motivate the choice of such a function as an upper bound on the
     generalization error (qualitative answer).
  2. How would you derive an embedded method that performs feature
     selection for the 1-nearest-neighbor classifier using this
     functional?
  3. Motivate your choice (what makes your method an "embedded
     method" and not a "wrapper" method?).
                  Exercise - 2

  • Design an RFE algorithm in a multi-class set-up (hint: choose a
    regular multi-class SVM, add the σ scaling factors into the
    functional, and compute the gradient).
  • Discuss the advantages/drawbacks of this approach compared to
    using many two-class RFE algorithms in a one-against-the-rest
    scheme.