Lecture 5
Spring 24
Feature extraction
[Figure: scatter plot of weight [kg] vs. height [cm]]
Typical dimension: Scatter matrix S
The scatter matrix S of a cluster is defined as:

S = \sum_{k=1}^{N} (x_k - \mu)(x_k - \mu)^T

where x_k is the kth pattern in the cluster, N is the number of patterns in the cluster, and \mu is the mean of all the patterns in the cluster.
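As a minimal MATLAB sketch of this definition (the 2-by-N matrix X below is hypothetical toy data; in practice it would hold the cluster's patterns as columns, as in the synthesis example later in this lecture):

% Scatter matrix of a single cluster; the patterns are the columns of X.
X  = [170 175 168 182; 75 80 70 90];   % toy cluster: height [cm]; weight [kg]
N  = size(X, 2);                       % number of patterns in the cluster
mu = mean(X, 2);                       % cluster mean (column vector)
Xc = X - mu;                           % center every pattern on the mean (R2016b+ implicit expansion)
S  = Xc * Xc';                         % S = sum_k (x_k - mu)(x_k - mu)'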
[Figure: two clusters in the height–weight plane, one with larger scatter and one with smaller scatter]
Data synthesis
• In order to test and debug pattern recognition algorithms, we synthesize patterns with known distributions; in most of the literature it is customary to assume a normal distribution.
[Figure: synthesized patterns in the height–weight plane]
Example: Two clusters synthesis
clear all
N1 = 150; N2 = 150;                          % number of patterns per cluster
E1 = [150 15; 120 20]; E2 = [100 10; 70 30]; % shaping matrices of the two clusters
M1 = [170,75]'; M2 = [160,50]';              % cluster means [height; weight]
[P1,A1] = eig(E1); [P2,A2] = eig(E2);        % eigenvectors / eigenvalues
y1 = randn(2,N1); y2 = randn(2,N2);          % standard normal samples
for i = 1:N1
    x1(:,i) = P1*sqrt(A1)*y1(:,i) + M1;      % first cluster (Males)
end
for i = 1:N2
    x2(:,i) = P2*sqrt(A2)*y2(:,i) + M2;      % second cluster (Females)
end
figure;
plot(x1(1,:),x1(2,:),'.',x2(1,:),x2(2,:),'or');
axis([120 220 30 120]);
xlabel('height [cm]');
ylabel('weight [kg]');

[Figure: the synthesized Males and Females clusters in the height–weight plane]
Separability
Q: How does the feature extraction method affect pattern recognition performance?
Separability
A: More than the classification algorithm does.
If the feature extraction method yields patterns that form non-separable clusters, no classification algorithm can do the job.
[Diagram: Feature extraction method → Classification algorithm]
Separability
[Figure: three scatter plots in the height–weight plane illustrating different degrees of cluster separability]
Separability
If we define the within-class scatter matrix as:

S_W = \sum_{i=1}^{C} P_i \sum_{k=1}^{N_i} (x_k^i - \mu_i)(x_k^i - \mu_i)^T

where x_k^i is the kth pattern in the ith cluster, N_i is the number of patterns in the ith cluster, \mu_i is the mean of all the patterns in the ith cluster, C is the number of clusters, and P_i is the a priori probability of the ith cluster, which may be estimated by

P_i = N_i \Big/ \sum_{p=1}^{C} N_p

and the between-class scatter matrix as:

S_B = \sum_{i=1}^{C} P_i (\mu_i - \mu)(\mu_i - \mu)^T

where \mu is the mean of all the patterns in all the clusters, then the total scatter matrix satisfies

S_T = S_B + S_W
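A minimal MATLAB sketch of these definitions, assuming the patterns of the two clusters are stored as columns of x1 and x2 (e.g. as produced by the synthesis code earlier in this lecture); the formulas are implemented exactly as written above:

% Within-class, between-class and total scatter for C clusters.
clusters = {x1, x2};                            % assumed 2-by-N1 and 2-by-N2
N  = sum(cellfun(@(x) size(x,2), clusters));    % total number of patterns
mu = mean([clusters{:}], 2);                    % mean of all patterns, all clusters
SW = zeros(2); SB = zeros(2);
for i = 1:numel(clusters)
    Xi  = clusters{i};
    Ni  = size(Xi, 2);
    Pi  = Ni / N;                               % a priori probability estimate
    mui = mean(Xi, 2);                          % mean of the ith cluster
    Xc  = Xi - mui;
    SW  = SW + Pi * (Xc * Xc');                 % within-class contribution
    SB  = SB + Pi * (mui - mu) * (mui - mu)';   % between-class contribution
end
ST = SB + SW;                                   % total scatter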
Separability function J
We can find in the literature several
functions which represent the separability
of clusters by a scalar:
J_1 = \frac{\mathrm{tr}(S_B)}{\mathrm{tr}(S_W)}, \qquad
J_2 = \mathrm{tr}(S_W^{-1} S_B), \qquad
J_3 = \det(S_W^{-1} S_B), \qquad
J_4 = \ln \det(S_W^{-1} S_B)
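A short MATLAB sketch evaluating these criteria; the SW and SB below are hypothetical placeholder matrices (in practice they would come from the within/between-class computation sketched above):

SW = [120 40; 40 90];          % placeholder within-class scatter matrix
SB = [ 60 25; 25 35];          % placeholder between-class scatter matrix
J1 = trace(SB) / trace(SW);    % J = tr(SB) / tr(SW)
J2 = trace(SW \ SB);           % J = tr(inv(SW) * SB)
J3 = det(SW \ SB);             % J = det(inv(SW) * SB)
J4 = log(det(SW \ SB));        % J = ln det(inv(SW) * SB)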
• Height
Effective and simple.
• Weight
Effective and simple.
Features extraction guidelines
“When we have two or more classes, feature extraction
consists of choosing those features which are most effective
for preserving class separability” (Fukunaga p. 441)
Because of:
• The complexity of most classification algorithms is O(n²).
• Learning time
• “Curse of dimensionality”
Feature reduction
“Feature selection, also known as subset selection or variable selection,
is a process commonly used in machine learning, wherein a subset of
the features available from the data are selected for application of a
learning algorithm. Feature selection is necessary either because it is
computationally infeasible to use all available features, or because of
problems of estimation when limited data samples (but a large
number of features) are present. The latter problem is related to the
so-called curse of dimensionality.”
(WIKIPEDIA)
[Diagram: Object → Feature extraction → High-dimension pattern → Feature selection → Low-dimension pattern]
Feature reduction
Q: How can one identify the irrelevant and the
redundant features?
A: Several options:
1. Feature subset selection
By grouping features, one can evaluate every (or, more efficiently, only the most promising) set of features and provide a small yet highly representative subset of the feature space.
2. Heuristics
In most applications, the relevant information resides in the low frequency
range (e.g. images), hence it is logical to reduce dimensionality by taking
the first coefficients of the Fourier/DCT transform.
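A hedged MATLAB illustration of this heuristic on a toy 1-D signal (the signal, the number of kept coefficients, and the use of dct/idct from the Signal Processing Toolbox are assumptions for the example, not part of the lecture):

% Reduce dimensionality by keeping only the first k DCT coefficients.
t = linspace(0, 1, 256);
s = sin(2*pi*3*t) + 0.5*sin(2*pi*7*t) + 0.05*randn(1, 256);  % toy signal
c = dct(s);                     % all 256 DCT coefficients
k = 16;                         % keep only the first k as features
features = c(1:k);

c_trunc = zeros(size(c));       % optional: reconstruct to see what is preserved
c_trunc(1:k) = features;
s_approx = idct(c_trunc);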
[Figure: Males and Females clusters in the height–weight plane, together with the histograms of each individual feature]
Feature selection - Example
No! The same separability can be achieved by projecting the patterns onto the blue axis, i.e. only a one-dimensional feature space is needed.
But how can we find this “blue axis”, i.e. the feature subspace onto which we should project the patterns?
Dimensionality Reduction Techniques: DCT, LDA, PCA …
[Figure: Males and Females clusters with a candidate projection axis and the histogram of the projected, one-dimensional patterns]
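One common way to find such a projection axis is PCA. The MATLAB sketch below assumes the patterns of both classes are stacked as columns of a 2-by-N matrix (e.g. x1 and x2 from the synthesis code earlier); note that PCA picks the direction of maximum variance, which is not necessarily the most discriminative one, whereas LDA also uses the class labels.

% Project the 2-D patterns onto the leading principal axis (PCA sketch).
X  = [x1 x2];                        % assumed 2-by-N, patterns as columns
mu = mean(X, 2);
Xc = X - mu;                         % centered patterns
Cv = (Xc * Xc') / (size(X,2) - 1);   % sample covariance matrix
[V, D]   = eig(Cv);
[~, idx] = max(diag(D));             % eigenvector with the largest eigenvalue
w  = V(:, idx);                      % candidate "blue axis"
z  = w' * Xc;                        % one-dimensional projected patterns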
• Introduction/Motivation WHY ?
• Feature selection:
Problem of selecting some subset of a
learning algorithm’s input variables upon
which it should focus attention, while
ignoring the rest
(DIMENSIONALITY REDUCTION)
[1] N. Sigala, N. Logothetis: Visual categorization shapes feature selectivity in the primate temporal cortex. Nature 415 (2002)
Motivational example from Biology
Diagnostic features:
- Eye separation
- Eye height
Non-Diagnostic features:
- Mouth height
- Nose length
Motivational example from Biology
- In practice:
many reasons why this is not the case!
- Also:
Optimization is (usually) good, so why not try to optimize the
input-coding?
Feature Selection in ML ? YES!
- Many explored domains have hundreds to tens of thousands of
variables/features with many irrelevant and redundant ones!
- Curse of dimensionality!
Curse of dimensionality
[Figure: positive and negative examples scattered in a three-dimensional feature space (x1, x2, x3)]
Curse of dimensionality
[8] C. Ambroise, G.J. McLachlan: Selection bias in gene extraction on the basis of microarray gene-expresseion data. PNAS Vol. 99 6562-6566(2002)
Motivation
X = [f_1, …, f_i, …, f_n],   x = (x_1, …, x_n)^T
– i.e. each instance x has n attributes (variables, features, dimensions)
Feature Selection - Definition
• Feature Selection: choose a subset F′ ⊆ F of the original feature set F
• Feature Extraction/Creation: build a new feature set F′ by transforming the original features, F → F′
[1] R. Kohavi, G. John: Wrappers for Feature Subset Selection. Artificial Intelligence 97(1-2):273-324 (1997)
Relevance of features
• Relevance vs. optimality of the feature set
– Classifiers induced from training data are likely to be suboptimal (they have no access to the real distribution of the data)
Ranking Criteria – Correlation
Correlation Criteria: score each feature f_i by its correlation with the target y,

R(i) = \frac{\sum_{k=1}^{m} (f_{k,i} - \bar{f}_i)(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{m} (f_{k,i} - \bar{f}_i)^2 \, \sum_{k=1}^{m} (y_k - \bar{y})^2}}

The higher the correlation between the feature and the target, the higher the score!
• R(X_i, Y) lies in [−1, 1] and captures only linear dependencies
• what if y = XOR(x1,x2) ?
Ranking Criteria – Correlation
Questions:
• Can a useless variable (i.e. one with a small score) be useful together
with others ?
Yes
• diversity of features
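A minimal MATLAB sketch of correlation-based ranking; the data matrix X (one row per instance, one column per feature) and the target y are toy assumptions for illustration:

% Rank features by the absolute correlation of each feature with the target.
m = 200;                                   % toy number of instances
y = double(rand(m,1) > 0.5);               % toy binary target
X = [y + 0.3*randn(m,1), ...               % feature 1: strongly correlated with y
     randn(m,1), ...                       % feature 2: irrelevant
     0.5*y + randn(m,1)];                  % feature 3: weakly correlated with y
R = zeros(1, size(X,2));
for i = 1:size(X,2)
    r    = corrcoef(X(:,i), y);            % 2-by-2 correlation matrix
    R(i) = r(1,2);                         % correlation of feature i with y
end
[~, order] = sort(abs(R), 'descend');      % highest score first
disp(order)                                % ranked feature indices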
Ranking Criteria – Inf. Theory
Information Theoretic Criteria
• Most approaches use (empirical estimates of)
mutual information between features and the
target:

I(x_i, y) = \int_{x_i} \int_{y} p(x_i, y) \, \log \frac{p(x_i, y)}{p(x_i)\, p(y)} \; dx\, dy
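A rough MATLAB sketch of such an empirical estimate for one feature, using simple histogram binning; the toy data, the number of bins, and the use of natural logarithms (nats) are assumptions for illustration:

% Empirical mutual information between a binned feature x and a binary target y.
m = 500;
y = double(rand(m,1) > 0.5);               % toy binary target
x = y + 0.5*randn(m,1);                    % toy feature that carries information on y
[~, ~, xb] = histcounts(x, 10);            % bin index of every sample (10 bins)
Pxy = accumarray([xb, y+1], 1) / m;        % joint probability estimate
Px  = sum(Pxy, 2);                         % marginal of the feature
Py  = sum(Pxy, 1);                         % marginal of the target
PxPy = Px * Py;                            % product of the marginals
nz  = Pxy > 0;                             % skip empty cells (avoid log(0))
I   = sum(Pxy(nz) .* log(Pxy(nz) ./ PxPy(nz)));   % estimate of I(x, y) in nats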
• Goal:
- Find the optimal feature subset.
(or at least a “good one.”)
• Classification of methods:
– Filters
– Wrappers
Feature Subset Selection
• You need:
– a measure for assessing the goodness of a feature
subset (scoring function)
[9] E. Amaldi, V. Kann: The approximability of minimizing nonzero variables and unsatisfied relations in linear systems. (1997)
Feature Subset Selection
Filter Methods
• Select subsets of variables as a pre-processing step, independently of the classifier that will be used!
– Backward elimination (start with the full feature set and discard features at each step): we start with the full model (including all the independent variables) and then remove the least significant feature, i.e. the one with the highest p-value above the significance level.
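A hedged MATLAB sketch of this p-value-based backward elimination, using fitlm from the Statistics and Machine Learning Toolbox; the toy data, the significance level, and the loop are assumptions for illustration rather than a prescribed implementation:

% Backward elimination: drop the feature with the highest p-value (> alpha)
% until every remaining feature is significant. Requires fitlm.
m = 200;
X = randn(m, 5);                              % toy data: 5 candidate features
y = 3*X(:,1) - 2*X(:,3) + 0.5*randn(m,1);     % only features 1 and 3 matter
alpha    = 0.05;
selected = 1:size(X, 2);
while ~isempty(selected)
    mdl  = fitlm(X(:, selected), y);
    pval = mdl.Coefficients.pValue(2:end);    % skip the intercept's p-value
    [pmax, worst] = max(pval);
    if pmax <= alpha, break; end              % all remaining features significant
    selected(worst) = [];                     % remove the least significant one
end
disp(selected)                                % indices of the retained features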
Feature Subset Selection
Wrapper Methods
• Bi-directional elimination (Stepwise Selection)
➢ Similar to forward selection, but while adding a new feature it also re-checks the significance of the features already added; if any previously selected feature becomes insignificant, that feature is removed through backward elimination.
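For reference, MATLAB's stepwiselm (Statistics and Machine Learning Toolbox) performs this kind of bidirectional stepwise selection out of the box; a brief, hedged usage sketch on the same kind of toy data as above:

% Bidirectional (stepwise) selection: terms are added and removed based on
% their significance until no further change is made.
m = 200;
X = randn(m, 5);
y = 3*X(:,1) - 2*X(:,3) + 0.5*randn(m,1);
mdl = stepwiselm(X, y, 'constant', 'Upper', 'linear');  % start from intercept only
disp(mdl.Formula)                                       % terms retained in the model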