                Lecture 9: Embedded Methods

            Isabelle Guyon   guyoni@inf.ethz.ch
            André Elisseeff  AEL@zurich.ibm.com

                Chapter 5: Embedded methods

         Filters, Wrappers, and Embedded Methods

[Figure: three feature selection strategies.
 Filter:   all features -> filter -> feature subset -> predictor.
 Wrapper:  all features -> multiple feature subsets -> wrapper -> feature subset -> predictor.
 Embedded: all features -> embedded method -> feature subset and predictor obtained jointly.]
                         Filters

Methods:
  • Criterion: Measure feature/feature subset "relevance"
  • Search: Usually order features (individual feature ranking or
    nested subsets of features)
  • Assessment: Use statistical tests

Results:
  • Are (relatively) robust against overfitting
  • May fail to select the most "useful" features

                        Wrappers

Methods:
  • Criterion: Measure feature subset "usefulness"
  • Search: Search the space of all feature subsets
  • Assessment: Use cross-validation

Results:
  • Can in principle find the most "useful" features, but
  • Are prone to overfitting
                   Embedded Methods

Methods:
  • Criterion: Measure feature subset "usefulness"
  • Search: Search guided by the learning process
  • Assessment: Use cross-validation

Results:
  • Similar to wrappers, but
  • Less computationally expensive
  • Less prone to overfitting

                  Three "Ingredients"

[Figure: the three ingredients of a feature selection method.
 Criterion:  single feature relevance, relevance in context,
             feature subset relevance, performance of the learning machine.
 Search:     single feature ranking; nested subsets
             (forward selection / backward elimination);
             heuristic or stochastic search; exhaustive search.
 Assessment: statistical tests, cross-validation, performance bounds.]
                  Forward Selection

[Figure: forward selection builds a sequence of nested feature
 subsets, one feature added at each step (labels: Start, n, n-1,
 n-2, …, 1).]

Guided search: we do not consider alternative paths.

              Forward Selection with GS

Stoppiglia, 2002. Gram-Schmidt orthogonalization.
  • Select a first feature Xν(1) with maximum cosine with the
    target: cos(xi, y) = xi·y / (||xi|| ||y||)
  • For each remaining feature Xi:
     – Project Xi and the target Y on the null space of the
       features already selected
     – Compute the cosine of Xi with the target in the projection
  • Select the feature Xν(k) with maximum cosine with the target
    in the projection.

Embedded method for the linear least square predictor.
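A minimal sketch of this Gram-Schmidt forward selection in Python/NumPy
(my own illustration, not Stoppiglia's original code; the function name
and the fixed number k of features to select are assumptions):

    import numpy as np

    def gram_schmidt_forward_selection(X, y, k):
        # Forward selection by Gram-Schmidt orthogonalization (sketch):
        # at each step, pick the feature with maximum |cosine| with the
        # (projected) target, then project the remaining features and
        # the target onto the null space of the selected feature.
        X = X.astype(float).copy()
        y = y.astype(float).copy()
        remaining = list(range(X.shape[1]))
        selected = []
        for _ in range(k):
            cosines = [abs(X[:, j] @ y) /
                       (np.linalg.norm(X[:, j]) * np.linalg.norm(y) + 1e-12)
                       for j in remaining]
            best = remaining[int(np.argmax(cosines))]
            selected.append(best)
            remaining.remove(best)
            u = X[:, best] / (np.linalg.norm(X[:, best]) + 1e-12)
            y = y - (u @ y) * u                          # project the target
            for j in remaining:
                X[:, j] = X[:, j] - (u @ X[:, j]) * u    # project remaining features
        return selected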
            Forward Selection w. Trees

  • Tree classifiers, like CART (Breiman, 1984) or C4.5 (Quinlan, 1993)

[Figure: all the data split first on f1, then on f2. At each step,
 choose the feature that "reduces entropy" most; work towards
 "node purity".]

                 Backward Elimination

[Figure: backward elimination traverses the nested subsets in the
 reverse order: start with all n features, then n-1, n-2, …]
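As a rough illustration of the tree as an embedded selector (my own
sketch, using scikit-learn's DecisionTreeClassifier rather than the
original CART/C4.5 implementations), one can keep only the features
the fitted tree actually splits on:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def tree_selected_features(X, y, max_depth=5):
        # Fit an entropy-based tree and return the indices of the
        # features used in at least one split, i.e. those chosen for
        # their entropy reduction during induction.
        tree = DecisionTreeClassifier(criterion="entropy", max_depth=max_depth)
        tree.fit(X, y)
        return np.flatnonzero(tree.feature_importances_ > 0)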
           Backward Elimination: RFE

RFE-SVM, Guyon, Weston, et al., 2002

Start with all the features.
  • Train a learning machine f on the current subset of features by
    minimizing a risk functional J[f].
  • For each (remaining) feature Xi, estimate, without retraining f,
    the change in J[f] resulting from the removal of Xi.
  • Remove the feature Xν(k) whose removal improves or least
    degrades J.

Embedded method for SVMs, kernel methods, neural nets.

                 OBD (LeCun et al., 1990)

[Figure: J[f] as a function of wi; moving the weight from its optimum
 wi* to 0 (Dwi = wi*) changes the risk by DJ ≅ ½ ∂²J/∂wi² (wi*)².]

DJ = Σi ∂J/∂wi Dwi + ½ Σi ∂²J/∂wi² (Dwi)² + cross-terms + O(||Dw||³)

Simple case: linear classifier + J a quadratic form of w ⇒ DJ ∝ wi²

RFE for ridge regression and SVM: remove the input with smallest wi².
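A minimal sketch of RFE for a binary problem, using scikit-learn's
LinearSVC as the learning machine and the wi² ranking above (function
and parameter names are my own; step controls how many features are
dropped per iteration):

    import numpy as np
    from sklearn.svm import LinearSVC

    def rfe_linear_svm(X, y, n_features_to_keep, step=1):
        # Recursive Feature Elimination: repeatedly retrain a linear SVM,
        # rank the remaining features by w_i**2, and drop the lowest-ranked
        # ones until only n_features_to_keep are left.
        remaining = list(range(X.shape[1]))
        while len(remaining) > n_features_to_keep:
            clf = LinearSVC(C=1.0, dual=False).fit(X[:, remaining], y)
            scores = clf.coef_.ravel() ** 2                    # DJ proportional to w_i^2
            n_drop = min(step, len(remaining) - n_features_to_keep)
            worst = set(np.argsort(scores)[:n_drop].tolist())  # smallest weights first
            remaining = [f for i, f in enumerate(remaining) if i not in worst]
        return remaining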
              Nested Subset Methods

  • Forward selection
  • Backward elimination
  • Feature ranking (filters)

[Figure: each method produces a sequence of nested feature subsets.]

              Complexity Comparison

Generalization_error ≤ Validation_error + ε(C / m)

  Method                                 Number of subsets tried   Complexity C
  Exhaustive search (wrapper)            2^n                       n
  Nested subsets (greedy wrapper)        n(n+1)/2                  log n
  Feature ranking or embedded methods    n                         log n

m: number of validation examples; n: number of features.
                  Scaling Factors

Idea: Transform a discrete space into a continuous space.

                σ = [σ1, σ2, σ3, σ4]

  • Discrete indicators of feature presence: σi ∈ {0, 1}
  • Continuous scaling factors: σi ∈ IR

Now we can do gradient descent!

       Embedded Methods (alternative definition)

  • Definition: an embedded feature selection method is a machine
    learning algorithm that returns a model using a limited number
    of features.

[Figure: training set -> learning algorithm -> output model.]
                     Examples

  • Forward selection with decision trees
  • Forward selection with Gram-Schmidt
  • Any algorithm producing a model on which a "sensitivity"
    analysis can be done:
     – Linear system: remove feature i if wi is smaller than a
       fixed value.
     – Others, e.g. parallelepipeds: remove a dimension whose width
       is below a fixed value.

Note: embedded methods use the specific structure of the model
returned by the algorithm to get the set of "relevant" features.

                  Design Strategies

  • As previously suggested: use tricks and intuition. Might work,
    but difficult. Can still produce very smart algorithms
    (decision trees).
  • Other means: interpret feature selection as a model selection
    problem. In that context, we are interested in finding the set
    of features such that the model is the "best".
         Feature selection as model selection - 1

  • Consider the following set of functions parameterized by α,
    where σ ∈ {0,1}^n represents the use (σi = 1) or rejection
    (σi = 0) of feature i:
       [equation omitted in the original slide; the illustration
        shows σ1 = 1, σ3 = 0]
  • Example (linear systems, α = w): [equation omitted in the
    original slide; the output is computed from the σ-scaled
    inputs only].

         Feature selection as model selection - 2

  • We are interested in finding α and σ such that the
    generalization error is minimized:
       [equation omitted in the original slide]
  • Sometimes we add a constraint: the number of non-zero σi's ≤ s0.

Problem: the generalization error is not known…
         Feature selection as model selection - 3

  • The generalization error is not known directly, but bounds can
    be used.
  • Most embedded methods minimize those bounds using different
    optimization strategies:
     – Add and remove features
     – Relaxation methods and gradient descent
     – Relaxation methods and regularization

Example of bounds (linear systems):
  [equations omitted in the original slide: one bound for the
   linearly separable case, one for the non-separable case]

         Feature selection as model selection - 4

  • How to minimize the error estimate with respect to α and σ?

Most approaches use the following method:
  [diagram omitted in the original slide]

This optimization is often done by relaxing the constraint
σ ∈ {0,1}^n to σ ∈ [0,1]^n.
              Add/Remove features 1

  • Many learning algorithms are cast into the minimization of some
    regularized functional:
       [equation omitted in the original slide: an empirical error
        term plus a regularization (capacity control) term]
  • What does G(σ) become if one feature is removed?
  • Sometimes, G can only increase… (e.g. SVM)

              Add/Remove features 2

  • It can be shown (under some conditions) that the removal of one
    feature induces a change in G proportional to:
       [equation omitted in the original slide, involving the
        gradient of f with respect to the i-th feature at point xk]
  • Example: SVMs → RFE (Ω(α) = Ω(w) = Σi wi²)
           Add/Remove features - RFE

  • Recursive Feature Elimination

[Diagram: alternate between
   (1) minimizing the estimate of R(α, σ) with respect to α, and
   (2) minimizing the estimate of R(α, σ) with respect to σ, under
       the constraint that only a limited number of features may be
       selected.]

           Add/Remove feature summary

  • Many algorithms can be turned into embedded methods for feature
    selection by using the following approach:
    1. Choose an objective function that measures how well the model
       returned by the algorithm performs.
    2. "Differentiate" this objective function (or do a sensitivity
       analysis) with respect to the σ parameter, i.e. how does its
       value change when one feature is removed and the algorithm is
       rerun?
    3. Select the features whose removal (resp. addition) induces the
       desired change in the objective function (i.e. minimize an
       error estimate, maximize alignment with the target, etc.).

What makes this an "embedded method" is the use of the structure of
the learning algorithm to compute the gradient and to search for /
weight the relevant features.
       Add/Remove features: when to stop

  • When would you stop selecting features?
     – When the objective function has reached a plateau?
        • What happens to the bound r²||w||² when features are
          removed?
     – Using a validation set?
        • What size should you consider?
     – Don't stop, just rank the features?

              Gradient descent - 1

  • How to minimize the error estimate with respect to α and σ?

Most approaches use the following method:
  [diagram omitted in the original slide]
  Would it make sense to perform just a gradient step on σ here too?
  Gradient step in [0,1]^n.
              Gradient descent - 2

Advantages of this approach:
  • Can be applied to non-linear systems (e.g. SVMs with Gaussian
    kernels).
  • Can mix the search for features with the search for an optimal
    regularization parameter and/or other kernel parameters.

Drawbacks:
  • Heavy computations.
  • Back to gradient-based machine learning algorithms (early
    stopping, initialization, etc.).

           Gradient descent summary

  • Many algorithms can be turned into embedded methods for feature
    selection by using the following approach (a sketch follows
    below):
    1. Choose an objective function that measures how well the model
       returned by the algorithm performs.
    2. Differentiate this objective function with respect to the σ
       parameter.
    3. Perform a gradient descent on σ. At each iteration, rerun the
       initial learning algorithm to compute its solution on the new
       scaled feature space.
    4. Stop when there are no more changes (or use early stopping,
       etc.).
    5. Threshold the σ values to get the list of features and retrain
       the algorithm on that subset of features.

The difference from the add/remove approach is the search strategy:
it still uses the inner structure of the learning model, but it
scales features rather than selecting them.
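As a toy illustration of steps 1-5 (my own sketch, not the algorithm
from the slides; the logistic-loss base learner, the l1-style penalty
on σ and the 0.5 threshold are assumptions), one can rerun a base
learner on the σ-scaled features at each iteration and take a
gradient step on σ in [0,1]^n:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def sigma_gradient_descent(X, y, n_steps=50, lr=0.5, lam=0.01):
        # Gradient descent on feature scaling factors sigma in [0,1]^n.
        n, d = X.shape
        yy = np.where(np.asarray(y) > 0, 1.0, -1.0)      # labels in {-1, +1}
        sigma = np.ones(d)
        for _ in range(n_steps):
            clf = LogisticRegression(max_iter=1000).fit(X * sigma, y)  # rerun the learner
            w = clf.coef_.ravel()
            z = clf.decision_function(X * sigma)
            g = -yy / (1.0 + np.exp(yy * z))             # d(logistic loss)/d(output)
            grad_sigma = (X * w).T @ g / n + lam         # +lam pushes sigma toward 0
            sigma = np.clip(sigma - lr * grad_sigma, 0.0, 1.0)  # relaxed constraint [0,1]^n
        selected = np.flatnonzero(sigma > 0.5)           # threshold, then retrain on subset
        return selected, sigma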
          Design strategies (revisited)

  • Directly minimize the number of features that an algorithm uses
    (focus on feature selection directly and forget the
    generalization error).
  • In the case of linear systems, feature selection can be
    expressed as:
       [objective omitted in the original slide]
       Subject to [constraints omitted in the original slide]

   Feature selection for linear systems is NP hard

  • Amaldi and Kann (1998) showed that the minimization problem
    related to feature selection for linear systems is NP hard: the
    minimum cannot be approximated within 2^(log^(1-ε) n) for any
    ε > 0, unless NP ⊆ DTIME(n^polylog(n)).
  • Is feature selection hopeless?
  • How can we approximate this minimization?
     Minimization of a sparsity function

  • Replace the count of non-zero weights by another objective
    function:
     – the l1 norm: Σi |wi|
     – a differentiable function: [formula omitted in the original
       slide]
  • Do the optimization directly!

                 The l1 SVM

  • The version of the SVM where the margin term ||w||² is replaced
    by the l1 norm Σi |wi| can be considered as an embedded method:
     – Only a limited number of weights will be non-zero (tends to
       remove redundant features).
     – This differs from the regular SVM, where redundant features
       are all included (non-zero weights).
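A minimal sketch of the l1 SVM used as an embedded selector, assuming
a binary problem and scikit-learn's LinearSVC with an l1 penalty
(squared hinge loss); the value of C, which controls the
sparsity/accuracy trade-off, and the threshold on |wi| are my own
choices:

    import numpy as np
    from sklearn.svm import LinearSVC

    def l1_svm_feature_selection(X, y, C=0.1):
        # Train an l1-regularized linear SVM; the l1 penalty drives many
        # weights exactly to zero, so the non-zero ones define the subset.
        clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=C)
        clf.fit(X, y)
        w = clf.coef_.ravel()                    # one weight per feature (binary case)
        return np.flatnonzero(np.abs(w) > 1e-8), w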
                 A note on SVM

  • Changing the regularization term has a strong impact on the
    generalization behavior…
  • Let w1 = (1,0), w2 = (0,1) and wλ = (1-λ)w1 + λw2 for λ ∈ [0,1].
    We have:
     – ||wλ||² = (1-λ)² + λ², which is minimized for λ = 1/2
     – |wλ|_1 = (1-λ) + λ = 1 for all λ

[Figure: λ² + (1-λ)² versus λ + (1-λ) = 1, plotted between the
 points w1 and w2.]

              The gradient descent

  • Perform a constrained gradient descent on:
       [objective omitted in the original slide]
    under the constraints:
       [constraints omitted in the original slide]
                A direct approach

  • Replace the count of non-zero weights by Σi log(ε + |wi|).
  • Same idea as the gradient descent, but using another
    approximation.
  • Boils down to the following multiplicative update:
       [update equation omitted in the original slide]

          Embedded method - summary

  • Embedded methods are a good inspiration to design new feature
    selection techniques for your own algorithms:
     – Find a functional that represents your prior knowledge about
       what a good model is.
     – Add the σ weights into the functional and make sure it is
       either differentiable or that you can perform a sensitivity
       analysis efficiently.
     – Optimize alternately with respect to α and σ.
     – Use early stopping (validation set) or your own stopping
       criterion to stop and to select the subset of features.
  • Embedded methods are therefore not too far from wrapper
    techniques and can be extended to multiclass, regression, etc.
                   Exercise Class

              Homework 8: Solution

  • Baseline model: 5% BER (trained on training data only)
  • Best challenge entries: ~3% BER
  • Tips to outperform the challengers:
     – Train on the (training + validation) set => double the
       number of examples
     – Vary the number of features
     – Select the best model by CV

my_classif=svc({'coef0=1', 'degree=1', 'gamma=0', 'shrinkage=0.5'});
my_model=chain({s2n('f_max=??? '), normalize, my_classif})
            Difficulty: Good CV

[Figure: BER (3-8%) versus number of features (0-4000) for different
 cross-validation folds: blue = cv5, red = cv10, green = cv15.]

  • Get 1 point if you make an entry with less than 5% error.
  • Get 2 points if you make an entry with less than 4% error.

            Filters Implemented

  • @s2n
  • @Relief
  • @Ttest
  • @Pearson  (uses Matlab corrcoef; gives the same results as
    Ttest when the classes are balanced)
  • @Ftest    (gives the same results as Ttest; important for the
    p-values: the Fisher criterion needs to be multiplied by
    num_patt_per_class, or use anovan)
  • @aucfs    (ranksum test)
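For reference, a minimal NumPy sketch of a signal-to-noise filter of
the @s2n kind (my own illustration, not the CLOP implementation; it
assumes binary labels and the usual |μ+ − μ−| / (σ+ + σ−) criterion):

    import numpy as np

    def s2n_ranking(X, y):
        # Signal-to-noise filter: rank features by the separation of the
        # two class means relative to the within-class standard deviations.
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        pos, neg = X[y > 0], X[y <= 0]
        s2n = np.abs(pos.mean(axis=0) - neg.mean(axis=0)) / (
            pos.std(axis=0) + neg.std(axis=0) + 1e-12)
        return np.argsort(-s2n)                  # best features first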
                  Exercise - 1

  • Consider the 1-nearest-neighbor algorithm. We define the
    following score:
       [equation omitted in the original slide]
  • where s(k) (resp. d(k)) is the index of the nearest neighbor of
    xk belonging to the same class (resp. a different class) as xk.

              Exercise - 1 (cont.)

  1. Motivate the choice of such a function as an upper bound on the
     generalization error (qualitative answer).
  2. How would you derive an embedded method that performs feature
     selection for the 1-nearest-neighbor classifier using this
     functional?
  3. Motivate your choice (what makes your method an "embedded
     method" and not a "wrapper" method?).
                  Exercise - 2

  • Design an RFE algorithm in a multi-class set-up (hint: choose a
    regular multi-class SVM, add the σ scaling factors into the
    functional, and compute the gradient).
  • Discuss the advantages/drawbacks of this approach compared to
    using many two-class RFE algorithms in a one-against-the-rest
    scheme.