Embedded Methods: Isabelle Guyon, André Elisseeff

Embedded methods combine feature selection with model training in an integrated way: (1) the feature selection process is guided by the training of the learning machine rather than by an independent criterion; (2) features are selected because they are useful for predicting the target, not merely because they are individually correlated with it; (3) embedded methods are generally more computationally efficient than wrapper methods, since they do not require retraining a model from scratch for every candidate feature subset, and they are less prone to overfitting than wrappers.

Lecture 9: Embedded Methods
Isabelle Guyon  guyoni@inf.ethz.ch
André Elisseeff  AEL@zurich.ibm.com
Chapter 5: Embedded methods

Filters, Wrappers, and Embedded Methods
[Diagram comparing the three approaches:]
Filter:   All features → Filter → Feature subset → Predictor
Wrapper:  All features → Multiple feature subsets → Wrapper → Predictor
Embedded: All features → Embedded method → Feature subset → Predictor

Filters
Methods:
• Criterion: Measure feature / feature subset "relevance"
• Search: Usually order features (individual feature ranking or nested subsets of features)
• Assessment: Use statistical tests
Results:
• Are (relatively) robust against overfitting
• May fail to select the most "useful" features

Wrappers
Methods:
• Criterion: Measure feature subset "usefulness"
• Search: Search the space of all feature subsets
• Assessment: Use cross-validation
Results:
• Can in principle find the most "useful" features, but
• Are prone to overfitting
Embedded Methods
Methods:
• Criterion: Measure feature subset "usefulness"
• Search: Search guided by the learning process
• Assessment: Use cross-validation
Results:
• Similar to wrappers, but
• Less computationally expensive
• Less prone to overfitting

Three "Ingredients"
[Diagram summarizing the three ingredients of a feature selection method:]
• Criterion: single feature relevance, relevance in context, feature subset relevance, performance of the learning machine
• Search: exhaustive search, heuristic or stochastic search, nested subsets (forward selection / backward elimination), single feature ranking
• Assessment: statistical tests, cross-validation, performance bounds

Forward Selection
[Diagram: a single path of growing feature subsets; the number of candidate features left to choose from decreases from n to n-1, n-2, ..., 1.]
Guided search: we do not consider alternative paths.

Forward Selection with GS
Stoppiglia, 2002. Gram-Schmidt orthogonalization.
• Select a first feature X_ν(1) with maximum cosine with the target: cos(x_i, y) = x_i·y / (||x_i|| ||y||)
• For each remaining feature X_i:
  – Project X_i and the target Y on the null space of the features already selected
  – Compute the cosine of X_i with the target in the projection
• Select the feature X_ν(k) with maximum cosine with the target in the projection.
Embedded method for the linear least square predictor.
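
As a concrete illustration of the Gram-Schmidt forward selection described above, here is a minimal MATLAB sketch for the linear least-squares setting. It is not the lecture's code: the function name, the use of the absolute cosine, and the eps guards are my own choices, and X (one example per row) and y are assumed to be given.

  % Forward selection by Gram-Schmidt orthogonalization (sketch).
  % X: m x n data matrix (one example per row), y: m x 1 target, k: number of features to select.
  function ranked = gs_forward_selection(X, y, k)
    n = size(X, 2);
    ranked = zeros(1, k);            % selected feature indices, in order of selection
    remaining = 1:n;
    Xr = X; yr = y;                  % residuals after projecting out the selected features
    for step = 1:k
      % cosine of each remaining (projected) feature with the (projected) target
      cosines = zeros(1, numel(remaining));
      for j = 1:numel(remaining)
        xj = Xr(:, remaining(j));
        cosines(j) = abs(xj' * yr) / (norm(xj) * norm(yr) + eps);
      end
      [~, best] = max(cosines);
      sel = remaining(best);
      ranked(step) = sel;
      % Gram-Schmidt step: project the data and the target on the null space
      % of the feature just selected
      u = Xr(:, sel); u = u / (norm(u) + eps);
      Xr = Xr - u * (u' * Xr);
      yr = yr - u * (u' * yr);
      remaining(best) = [];
    end
  end

Calling ranked = gs_forward_selection(X, y, 10) would return the first ten features along the single guided path, matching the "no alternative paths" remark above.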
Forward Selection with Trees
• Tree classifiers, like CART (Breiman, 1984) or C4.5 (Quinlan, 1993).
• At each step, choose the feature that "reduces entropy" most. Work towards "node purity".
[Diagram: all the data split first on feature f1, then each branch split on f2.]

Backward Elimination
[Diagram: a single path of shrinking feature subsets, starting from all n features and going to n-1, n-2, ...]

Backward Elimination: RFE
RFE-SVM, Guyon, Weston, et al., 2002.
Start with all the features.
• Train a learning machine f on the current subset of features by minimizing a risk functional J[f].
• For each (remaining) feature X_i, estimate, without retraining f, the change in J[f] resulting from the removal of X_i.
• Remove the feature X_ν(k) whose removal improves J or degrades it least.
Embedded method for SVM, kernel methods, neural nets.

OBD (LeCun et al., 1990)
Expand J around the trained weights w*:
DJ = Σ_i (∂J/∂w_i) Dw_i + ½ Σ_i (∂²J/∂w_i²) (Dw_i)² + cross-terms + O(||Dw||³)
At the minimum of J the first-order terms vanish; removing feature i sets w_i to zero, i.e. |Dw_i| = |w_i*|, so
DJ ≅ ½ (∂²J/∂w_i²) (w_i*)²
Simple case: linear classifier with J a quadratic form of w ⇒ DJ ∝ w_i².
RFE for ridge regression and SVM: remove the input with smallest w_i².
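
The slide's observation that RFE for ridge regression (or a linear SVM) reduces to repeatedly removing the input with the smallest w_i² can be written in a few lines. The sketch below uses ridge regression as the base learner because it has a closed-form solution; the function name and the lambda parameter are illustrative, and a linear SVM would only change the training line.

  % Recursive Feature Elimination with a ridge-regression base learner (sketch).
  % X: m x n data matrix, y: m x 1 target, lambda: ridge parameter.
  function elimination_order = rfe_ridge(X, y, lambda)
    n = size(X, 2);
    remaining = 1:n;
    elimination_order = zeros(1, n);    % features in the order they are removed
    for step = 1:n
      Xs = X(:, remaining);
      % train the learning machine on the current feature subset
      w = (Xs' * Xs + lambda * eye(numel(remaining))) \ (Xs' * y);
      % estimated change in J[f] when removing feature i is proportional to w_i^2
      [~, worst] = min(w .^ 2);
      elimination_order(step) = remaining(worst);
      remaining(worst) = [];
    end
  end

Reading elimination_order backwards gives a feature ranking (the last feature removed is the most important).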
Nested Subset Methods
• Forward selection
• Backward elimination
• Feature ranking (filters)
[Diagram: each strategy produces a sequence of nested feature subsets.]

Complexity Comparison
Generalization_error ≤ Validation_error + ε(C / m)

Method                               | Number of subsets tried | Complexity C
Exhaustive search (wrapper)          | 2^n                     | n
Nested subsets (greedy wrapper)      | n(n+1)/2                | log n
Feature ranking or embedded methods  | n                       | log n

m: number of validation examples, n: number of features.

Scaling Factors
Idea: transform a discrete space into a continuous space.
σ = [σ1, σ2, σ3, σ4]
• Discrete indicators of feature presence: σi ∈ {0, 1}
• Continuous scaling factors: σi ∈ IR
Now we can do gradient descent!

Embedded methods (alternative definition)
• Definition: an embedded feature selection method is a machine learning algorithm that returns a model using a limited number of features.
[Diagram: Training set → Learning algorithm → output]
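
A tiny MATLAB illustration of the scaling-factor idea, with made-up data and made-up σ values: a discrete σ selects features exactly, while a continuous σ down-weights them smoothly, which is what makes gradient descent possible.

  % Scaling factors: from discrete selection to continuous scaling (sketch).
  X = randn(100, 4);                        % toy data with 4 features
  sigma_discrete   = [1 0 1 0];             % sigma_i in {0,1}: keep features 1 and 3 only
  sigma_continuous = [0.9 0.1 0.7 0.3];     % continuous sigma_i: soft down-weighting

  X_selected = X .* repmat(sigma_discrete,   size(X, 1), 1);  % features 2 and 4 zeroed out
  X_scaled   = X .* repmat(sigma_continuous, size(X, 1), 1);  % all features kept, rescaled
  % With continuous sigma, the training objective becomes a differentiable
  % function of sigma and can be optimized by gradient descent.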
Examples
• Forward selection with decision trees.
• Forward selection with Gram-Schmidt.
• Any algorithm producing a model on which a "sensitivity" analysis can be done:
  – Linear system: remove feature i if w_i is smaller than a fixed value.
  – Others, e.g. parallelepipeds: remove the dimension whose width is below a fixed value.

Design strategies
• As previously suggested: use tricks and intuition. Might work but difficult. Still can produce very smart algorithms (decision trees).
• Other means: interpret feature selection as a model selection problem. In that context, we are interested in finding the set of features such that the model is the "best".
Note: embedded methods use the specific structure of the model returned by the algorithm to get the set of "relevant" features.

Feature selection as model selection - 1
• Consider a set of functions f(α, σ ∘ x) parameterized by α, where σ ∈ {0,1}^n represents the use (σi = 1) or rejection (σi = 0) of feature i, and σ ∘ x scales each input component x_i by σi.
• Example (linear systems, α = w): output = Σ_i σi w_i x_i + b  (e.g. σ1 = 1: feature 1 used; σ3 = 0: feature 3 rejected).

Feature selection as model selection - 2
• We are interested in finding α and σ such that the generalization error, i.e. the expected risk of f(α, σ ∘ x), is minimized.
• Sometimes we add a constraint: the number of non-zero σi's ≤ s0.
Problem: the generalization error is not known...
Feature selection as model selection - 3
• The generalization error is not known directly, but bounds can be used.
• Most embedded methods minimize those bounds using different optimization strategies:
  – Add and remove features
  – Relaxation methods and gradient descent
  – Relaxation methods and regularization
Example of bounds (linear systems): for the linearly separable case, the radius/margin bound r²||w||²; a related bound covers the non-separable case.

Feature selection as model selection - 4
• How to minimize?
Most approaches minimize the bound (or error estimate) jointly over α and σ. This optimization is often done by relaxing the constraint σ ∈ {0,1}^n to σ ∈ [0,1]^n.

Add/Remove features 1
• Many learning algorithms are cast into a minimization of some regularized functional:
  G(σ) = min over α of [ empirical error of f(α, σ ∘ x) + regularization Ω(α) (capacity control) ]
• What does G(σ) become if one feature is removed?
• Sometimes, G can only increase... (e.g. SVM)

Add/Remove features 2
• It can be shown (under some conditions) that the removal of one feature induces a change in G proportional to the gradient of f with respect to the i-th feature at the training points x_k.
• Examples: SVMs → RFE (Ω(α) = Ω(w) = Σ_i w_i²)
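
To make the functional G(σ) concrete, the following MATLAB sketch computes it for a ridge-regression learner (the minimization over α = w has a closed form) and measures, by brute force, the change in G caused by removing each feature in turn. The function names and lambda are illustrative; the slides' point that an analytic sensitivity (a gradient) can replace this brute-force rerun is exactly what RFE exploits.

  % G(sigma): regularized functional after training on sigma-scaled data, and the
  % change in G induced by removing each feature (sketch, ridge-regression learner).
  % X: m x n data, y: m x 1 target, sigma: 1 x n scaling factors, lambda: ridge parameter.
  function [G, dG] = g_sensitivity(X, y, sigma, lambda)
    G = g_value(X, y, sigma, lambda);
    n = numel(sigma);
    dG = zeros(1, n);
    for i = 1:n
      sigma_i = sigma; sigma_i(i) = 0;                 % remove feature i
      dG(i) = g_value(X, y, sigma_i, lambda) - G;      % for this learner, dG(i) >= 0
    end
  end

  function G = g_value(X, y, sigma, lambda)
    Xs = X .* repmat(sigma, size(X, 1), 1);            % scale the features by sigma
    w  = (Xs' * Xs + lambda * eye(size(X, 2))) \ (Xs' * y);   % minimize over alpha = w
    G  = sum((y - Xs * w) .^ 2) + lambda * (w' * w);   % empirical error + capacity control
  end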
Add/Remove features - RFE
• Recursive Feature Elimination alternates two steps:
  – Minimize the estimate of R(α, σ) with respect to α.
  – Minimize the estimate of R(α, σ) with respect to σ, under the constraint that only a limited number of features may be selected.

Add/Remove features - summary
• Many algorithms can be turned into embedded methods for feature selection by using the following approach:
  1. Choose an objective function that measures how well the model returned by the algorithm performs.
  2. "Differentiate" (or do a sensitivity analysis of) this objective function with respect to the σ parameter, i.e. determine how its value changes when one feature is removed and the algorithm is rerun.
  3. Select the features whose removal (resp. addition) induces the desired change in the objective function (i.e. minimize an error estimate, maximize the alignment with the target, etc.).
What makes this an 'embedded method' is the use of the structure of the learning algorithm to compute the gradient and to search for / weight the relevant features.

Add/Remove features - when to stop
• When would you stop selecting features?
  – When the objective function has reached a plateau?
    (What happens to the bound r²||w||² when features are removed?)
  – Using a validation set? (What size should you consider?)
  – Don't stop, just rank the features?

Gradient descent - 1
• How to minimize?
Most approaches use the following method: minimize with respect to α, then take a gradient step on σ in [0,1]^n. (Would it make sense to perform just a gradient step here too?)
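
One concrete answer to "when to stop" is to not stop at all: rank the features (here by w_i² from a single ridge fit, as in RFE) and then let a validation set pick the subset size. This is a sketch with assumed variable names, not code from the lecture.

  % Choose the number of features by validation error over nested subsets (sketch).
  % Xtr, ytr: training data; Xva, yva: validation data; lambda: ridge parameter.
  function [best_k, order] = select_subset_size(Xtr, ytr, Xva, yva, lambda)
    n = size(Xtr, 2);
    w = (Xtr' * Xtr + lambda * eye(n)) \ (Xtr' * ytr);
    [~, order] = sort(w .^ 2, 'descend');          % feature ranking by w_i^2
    val_err = zeros(1, n);
    for k = 1:n
      idx = order(1:k);                            % nested subset of the k top-ranked features
      Xk = Xtr(:, idx);
      wk = (Xk' * Xk + lambda * eye(k)) \ (Xk' * ytr);
      val_err(k) = mean((yva - Xva(:, idx) * wk) .^ 2);   % validation error
    end
    [~, best_k] = min(val_err);                    % stop where the validation error is minimal
  end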
Gradient descent - 2
Advantages of this approach:
• Can be done for non-linear systems (e.g. SVM with Gaussian kernels).
• Can mix the search for features with the search for an optimal regularization parameter and/or other kernel parameters.
Drawbacks:
• Heavy computations.
• Back to gradient-based machine learning algorithms (early stopping, initialization, etc.).

Gradient descent - summary
• Many algorithms can be turned into embedded methods for feature selection by using the following approach:
  1. Choose an objective function that measures how well the model returned by the algorithm performs.
  2. Differentiate this objective function with respect to the σ parameter.
  3. Perform a gradient descent on σ. At each iteration, rerun the initial learning algorithm to compute its solution on the newly scaled feature space.
  4. Stop when nothing changes any more (or use early stopping, etc.).
  5. Threshold the σ values to get the list of features, and retrain the algorithm on this subset of features.
The difference from the add/remove approach is the search strategy. It still uses the inner structure of the learning model, but it scales features rather than selecting them.

Design strategies (revisited)
• Directly minimize the number of features that an algorithm uses (focus on feature selection directly and forget the generalization error).
• In the case of linear systems, feature selection can be expressed as minimizing the number of non-zero weights, ||w||_0, subject to the training examples being correctly predicted (e.g. y_k (w·x_k + b) ≥ 1 for a classifier).

Feature selection for linear systems is NP hard
• Amaldi and Kann (1998) showed that the minimization problem related to feature selection for linear systems is NP hard: the minimum cannot be approximated within 2^(log^(1-ε) n) for any ε > 0, unless NP ⊆ DTIME(n^polylog(n)).
• Is feature selection hopeless?
• How can we approximate this minimization?
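
The five-step recipe above can be sketched in MATLAB with ridge regression as the inner learning algorithm: at each iteration the model is retrained on the σ-scaled data, σ takes a gradient step on the training objective (for the current w), and is projected back onto [0,1]^n. The objective, the step size eta, the fixed iteration count and the 0.5 threshold are illustrative choices; the lecture's approach would rather descend a generalization bound or a validation estimate.

  % Gradient descent on the scaling factors sigma (sketch).
  % X: m x n training data, y: m x 1 target.
  function [features, sigma] = sigma_gradient_descent(X, y, lambda, eta, n_iter)
    [m, n] = size(X);
    sigma = ones(n, 1);                       % start with all features fully "on"
    for t = 1:n_iter
      Xs = X .* repmat(sigma', m, 1);         % newly scaled feature space
      % rerun the learning algorithm on the scaled data (ridge regression)
      w = (Xs' * Xs + lambda * eye(n)) \ (Xs' * y);
      % gradient of the training objective with respect to sigma, at the current w
      r = Xs * w - y;
      grad = 2 * w .* (X' * r);
      sigma = min(max(sigma - eta * grad, 0), 1);   % gradient step, projected onto [0,1]^n
    end
    features = find(sigma > 0.5);             % threshold sigma to get the feature list
  end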
Minimization of a sparsity function
• Replace the number of non-zero weights by another objective function:
  – the l1 norm Σ_i |w_i|, or
  – a differentiable approximation of the zero norm.
• Do the optimization directly!

The l1 SVM
• The version of the SVM where the margin term ||w||² is replaced by the l1 norm Σ_i |w_i| can be considered as an embedded method:
  – Only a limited number of weights will be non-zero (tends to remove redundant features).
  – Difference from the regular SVM, where redundant features are all included (non-zero weights).

A note on SVM
• Changing the regularization term has a strong impact on the generalization behavior...
• Let w1 = (1,0), w2 = (0,1) and wλ = (1-λ) w1 + λ w2 for λ ∈ [0,1]. Then:
  – ||wλ||² = (1-λ)² + λ², minimized for λ = 1/2 (the l2 norm prefers spreading the weight over both features);
  – |wλ|_1 = (1-λ) + λ = 1, constant in λ (the l1 norm is indifferent, so solutions using a single feature are not penalized).
[Diagram: the l2 ball versus the l1 ball in the (w1, w2) plane.]

The gradient descent
• Perform a constrained gradient descent on the resulting objective, under the constraints of the learning problem.
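
A true l1 SVM keeps the hinge loss and needs a linear-programming solver, but the mechanism by which the l1 norm produces exact zeros can be shown with l1-regularized least squares and iterative soft-thresholding (ISTA). The loss swap (squared loss instead of hinge loss) and the parameter choices are mine; the sparsity behavior is the point.

  % l1-regularized least squares via iterative soft-thresholding, ISTA (sketch).
  % Minimizes 0.5*||X*w - y||^2 + lambda*sum(abs(w)); many weights become exactly zero.
  function w = l1_least_squares(X, y, lambda, n_iter)
    n = size(X, 2);
    w = zeros(n, 1);
    t = 1 / (norm(X)^2 + eps);                % step size: 1 / Lipschitz constant of the gradient
    for k = 1:n_iter
      g = X' * (X * w - y);                   % gradient of the quadratic term
      z = w - t * g;
      w = sign(z) .* max(abs(z) - t * lambda, 0);   % soft-thresholding creates exact zeros
    end
  end

The selected features are then find(w ~= 0); increasing lambda shrinks more weights to zero, mirroring the l1 SVM discussion above.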
A direct approach
• Replace ||w||_0 by Σ_i log(ε + |w_i|).
• Same idea as gradient descent, but using another approximation.
• Boils down to a multiplicative update of the scaling factors.

Embedded method - summary
• Embedded methods are a good inspiration to design new feature selection techniques for your own algorithms:
  – Find a functional that represents your prior knowledge about what a good model is.
  – Add the σ weights into the functional and make sure it is either differentiable or that you can perform a sensitivity analysis efficiently.
  – Optimize alternately with respect to α and σ.
  – Use early stopping (validation set) or your own stopping criterion to stop and select the subset of features.
• Embedded methods are therefore not too far from wrapper techniques and can be extended to multiclass, regression, etc.
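
The multiplicative update itself was given as a formula on the original slide and is not reproduced here; the sketch below follows the standard scheme for minimizing Σ_i log(ε + |w_i|) (retrain on scaled data, then multiply each scaling factor by the absolute value of the corresponding weight), with ridge regression standing in for the linear learner. Treat it as an assumed reconstruction, not the lecture's exact update.

  % Multiplicative update approximating the zero norm (sketch).
  % At each iteration: retrain on z-scaled data, then z <- z .* |w|.
  function [features, z] = multiplicative_update(X, y, lambda, n_iter)
    [m, n] = size(X);
    z = ones(n, 1);                           % scaling factors, all features active at the start
    for t = 1:n_iter
      Xs = X .* repmat(z', m, 1);             % scale the data by z
      w = (Xs' * Xs + lambda * eye(n)) \ (Xs' * y);   % retrain the linear model
      z = z .* abs(w);                        % multiplicative update: small weights drive z to zero
    end
    features = find(z > 1e-6 * max(z));       % features whose scaling factor has not vanished
  end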

Exercise Class

Homework 8: Solution
• Baseline model: 5% BER (trained on training data only).
• Best challenge entries: ~3% BER.
• Tips to outperform the challengers:
  – Train on the (training + validation) set => double the number of examples.
  – Vary the number of features:
      my_classif = svc({'coef0=1', 'degree=1', 'gamma=0', 'shrinkage=0.5'});
      my_model = chain({s2n('f_max=???'), normalize, my_classif});
  – Select the best model by CV.
Difficulty: Good CV
[Plot: BER versus number of features (0 to 4000), for 5-, 10- and 15-fold cross-validation; blue = cv5, red = cv10, green = cv15.]
• Get 1 point if you make an entry with less than 5% error.
• Get 2 points if you make an entry with less than 4% error.

Filters Implemented
• @s2n
• @Relief
• @Ttest
• @Pearson (use Matlab corrcoef; gives the same results as Ttest since the classes are balanced)
• @Ftest (gives the same results as Ttest; important for the p-values: the Fisher criterion needs to be multiplied by num_patt_per_class, or use anovan)
• @aucfs (ranksum test)

Exercise - 1
• Consider the 1-nearest-neighbor algorithm. We define a score in terms of, for each training example x_k, its nearest neighbor of the same class (index s(k)) and its nearest neighbor of a different class (index d(k)).

Exercise - 1 (cont.)
• 1. Motivate the choice of such a function to upper bound the generalization error (qualitative answer).
• 2. How would you derive an embedded method to perform feature selection for 1-nearest-neighbor using this functional?
• 3. Motivate your choice: what makes your method an 'embedded method' and not a 'wrapper' method?
Exercise - 2
• Design an RFE algorithm in a multi-class set-up (hint: choose a regular multi-class SVM, add the σ scaling factors into the functional, and compute the gradient).
• Discuss the advantages/drawbacks of this approach compared to using several two-class RFE algorithms in a one-against-the-rest approach.
