Foundations of Machine Learning
Module 3: Instance Based Learning and
Feature Reduction
Part B: Feature Selection
Sudeshna Sarkar
IIT Kharagpur
Feature Reduction in ML
- The information about the target class is inherent
in the variables.
- Naïve view:
More features
=> More information
=> More discrimination power.
- In practice:
many reasons why this is not the case!
Curse of Dimensionality
• If the number of training examples is fixed, the classifier's performance will usually degrade as the number of features grows!
Feature Reduction in ML
- Irrelevant and redundant features can confuse learners.
- Limited training data.
- Limited computational resources.
- Curse of dimensionality.
Feature Selection
Problem of selecting some subset of features, while
ignoring the rest
Feature Extraction
Criteria for selection/extraction:
either improve or maintain the classification accuracy while simplifying the classifier's complexity.
Feature Selection - Definition
Subset selection
Feature Selection Steps
Feature selection is an optimization problem.
o Step 1: Search the space of possible feature subsets.
o Step 2: Pick the subset that is optimal or near-optimal with respect to some objective function.
Feature Selection Steps (cont’d)
Search strategies
– Optimum
– Heuristic
– Randomized
Evaluation strategies
- Filter methods
- Wrapper methods
Evaluating feature subset
• Supervised (wrapper method)
– Train using selected subset
– Estimate error on validation dataset
• Unsupervised (filter method)
– Look at input only
– Select the subset that has the most information
Evaluation Strategies
(figure: filter methods vs. wrapper methods)
Subset selection
• Select uncorrelated features
• Forward search
– Start from an empty set of features
– Try each of the remaining features
– Estimate the classification/regression error for adding each specific feature
– Select the feature that gives the greatest reduction in validation error
– Stop when there is no significant improvement (a sketch follows after this list)
• Backward search
– Start with the original set of size d
– Drop the features with the smallest impact on error
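A minimal sketch of greedy forward selection, assuming a helper `validation_error(subset)` that trains the model on the chosen features and returns its validation error (both the helper and the tolerance `tol` are illustrative, not from the slides):

```python
def forward_selection(all_features, validation_error, tol=1e-4):
    """Greedy forward search: grow the feature subset one feature at a time."""
    selected = []
    remaining = list(all_features)
    best_err = float("inf")
    while remaining:
        # Try adding each remaining feature and keep the best candidate.
        errs = {f: validation_error(selected + [f]) for f in remaining}
        f_best = min(errs, key=errs.get)
        if best_err - errs[f_best] < tol:   # no significant improvement: stop
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best_err = errs[f_best]
    return selected
```

Backward search works the same way in reverse: start from all features and repeatedly drop the one whose removal increases the validation error the least.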
Feature selection
Univariate (looks at each feature independently of others)
– Pearson correlation coefficient
– F-score
– Chi-square
– Signal to noise ratio
– Mutual information
– Etc.
Univariate methods measure some type of correlation between two random variables: the label (y_i) and a fixed feature (x_ij for a fixed j).
• Rank features by importance
• Ranking cut-off is determined by user
Pearson correlation coefficient
(formula and illustrative figure from Wikipedia)
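For reference, the standard definition the slide points to is the sample Pearson correlation between a fixed feature x_ij (fixed j) and the label y:

$$ r_j = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}, $$

with values near ±1 indicating a strong linear relationship and values near 0 indicating little or none.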
Signal to noise ratio
• Difference in class means divided by the sum of the class standard deviations:
S2N(X, Y) = (μ_X − μ_Y) / (σ_X + σ_Y)
• Large absolute values indicate a strong association between the feature and the class label
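A minimal NumPy sketch of ranking features by this score for a binary-labelled dataset (the function name is illustrative; the denominator follows the sum-of-standard-deviations form given above):

```python
import numpy as np

def s2n_scores(X, y):
    """Signal-to-noise score of every feature for binary labels y in {0, 1}."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    sd0, sd1 = X0.std(axis=0), X1.std(axis=0)
    return (mu0 - mu1) / (sd0 + sd1)

# Rank features by |score|, largest first; the ranking cut-off is chosen by the user.
# ranking = np.argsort(-np.abs(s2n_scores(X, y)))
```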
Multivariate feature selection
• Multivariate (considers all features simultaneously)
• Consider the weight vector w of any linear classifier.
• Classification of a point x is given by w^T x + w_0.
• Small entries of w have little effect on the dot product, so the corresponding features are less relevant.
• For example, if w = (10, 0.01, −9) then features 0 and 2 contribute more to the dot product than feature 1.
– The ranking of features given by this w is 0, 2, 1.
Multivariate feature selection
• The weight vector w can be obtained from any linear classifier.
• A variant of this approach is called recursive feature elimination (RFE; a sketch follows after these steps):
1. Compute w on all features
2. Remove the feature with the smallest |w_i|
3. Recompute w on the reduced data
4. If the stopping criterion is not met, go to step 2
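A minimal sketch of recursive feature elimination for binary labels, using scikit-learn's LinearSVC just to supply a weight vector w (any linear classifier exposing its weights would do; the target subset size `n_keep` is illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC  # any linear classifier with a coef_ attribute works

def recursive_feature_elimination(X, y, n_keep):
    """Repeatedly drop the feature whose weight has the smallest magnitude."""
    active = list(range(X.shape[1]))           # indices of surviving features
    while len(active) > n_keep:                # stopping criterion: subset size
        w = LinearSVC().fit(X[:, active], y).coef_.ravel()  # binary case: one weight per feature
        worst = int(np.argmin(np.abs(w)))      # least influential feature
        del active[worst]                      # remove it and refit on the rest
    return active
```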
Feature Extraction
Foundations of Machine Learning
Module 3: Instance Based Learning and
Feature Reduction
Part C: Feature Extraction
Sudeshna Sarkar
IIT Kharagpur
Feature extraction - definition
Feature Extraction
PCA
Geometric picture of principal components (PCs)
(sequence of figures illustrating the geometric picture of the PCs)
Algebraic definition of PCs
Given a sample of p observations on a vector of N variables

$$ \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p \in \mathbb{R}^N, $$

define the first principal component of the sample by the linear transformation

$$ z_1 = \mathbf{a}_1^{\mathsf T} \mathbf{x}_j = \sum_{i=1}^{N} a_{i1} x_{ij}, \qquad j = 1, 2, \ldots, p, $$

where the vector $\mathbf{a}_1 = (a_{11}, a_{21}, \ldots, a_{N1})^{\mathsf T}$ and $\mathbf{x}_j = (x_{1j}, x_{2j}, \ldots, x_{Nj})^{\mathsf T}$, with $\mathbf{a}_1$ chosen such that $\mathrm{var}[z_1]$ is maximum.
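Writing S for the N × N sample covariance matrix (a symbol not used on the slide itself), var[z_1] = a_1^T S a_1; maximizing this subject to the usual normalization a_1^T a_1 = 1 gives the standard eigenvalue problem

$$ \mathbf{S}\,\mathbf{a}_1 = \lambda_1 \mathbf{a}_1, $$

so a_1 is the eigenvector of S with the largest eigenvalue λ_1, and var[z_1] = λ_1.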
PCA
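A minimal NumPy sketch of PCA along these lines (the function name `pca` and the choice to return both the projected data and the top-k directions are illustrative, not from the slides):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the k directions of maximum variance."""
    Xc = X - X.mean(axis=0)                          # centre the data
    S = np.cov(Xc, rowvar=False)                     # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)             # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # k eigenvectors with largest eigenvalues
    return Xc @ top, top                             # scores z and directions a_1..a_k
```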
PCA for image compression
(figure: reconstructions using p = 1, 2, 4, 8, 16, 32, 64, and 100 principal components, alongside the original image)
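A sketch of the compression idea, assuming the hypothetical `pca` helper above and treating the rows of a 2-D image array as samples; keeping only p components gives the kind of approximations shown in the figure:

```python
def reconstruct(X, p):
    """Approximate X from its top-p principal components."""
    mean = X.mean(axis=0)
    scores, directions = pca(X, p)        # (X - mean) projected onto the top-p directions
    return scores @ directions.T + mean   # back-project and add the mean back
```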
Is PCA a good criterion for classification?
• Data variation determines the projection direction
• What's missing?
– Class information
What is a good projection?
• Similarly, what is a good criterion?
– Separating the different classes
(figures: a projection where the two classes overlap vs. one where the two classes are separated)
What class information may be useful?
• Between-class distance
– Distance between the centroids of different classes
(figure: between-class distance)
What class information may be useful?
• Between-class distance
– Distance between the centroids of different classes
• Within-class distance
– Accumulated distance of each instance to the centroid of its class
• Linear discriminant analysis (LDA) finds the most discriminant projection by
– maximizing the between-class distance
– and minimizing the within-class distance
(figure: within-class distance)
Linear Discriminant Analysis
Means and Scatter after projection
Good Projection
• Means are as far apart as possible
• Scatter is as small as possible
• Fisher Linear Discriminant
$$ J(\mathbf{w}) = \frac{|m_1 - m_2|^2}{s_1^2 + s_2^2}, $$

where m_1, m_2 are the projected class means and s_1^2, s_2^2 the projected within-class scatters.
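Maximizing J(w) has the well-known closed form w ∝ S_W^{-1}(m_1 − m_2), with S_W the within-class scatter matrix; a minimal NumPy sketch for two classes (the function name is illustrative, and S_W is assumed invertible):

```python
import numpy as np

def fisher_direction(X, y):
    """Fisher discriminant direction w proportional to S_W^{-1} (m1 - m2) for binary labels y."""
    X1, X2 = X[y == 0], X[y == 1]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the per-class scatter matrices.
    Sw = (np.cov(X1, rowvar=False) * (len(X1) - 1)
          + np.cov(X2, rowvar=False) * (len(X2) - 1))
    return np.linalg.solve(Sw, m1 - m2)   # assumes Sw is non-singular
```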
Multiple Classes