
Lecture 5

Feature Extraction & Selection

Dr. Amr El-Wakeel
Lane Department of Computer Science and Electrical Engineering

Spring 24

Feature extraction
Some slides adapted from Alon Slapak.

Geometric structure of a cluster
• Each of the red circles represents an object pattern. The coordinates of the circle are the features (e.g. height and weight).
• The ellipse which encircles the major part of the cluster represents the distribution of the patterns.
• The length of the principal axes of the ellipse is proportional to twice the square root of the eigenvalues of the covariance matrix of the pattern distribution (λ1, λ2).
• The principal axes of the ellipse coincide with the eigenvectors of the covariance matrix of the pattern distribution.
• The center of the ellipse μ is the mean of the pattern distribution.
[Figure: scatter plot of the "Females" cluster, weight [kg] vs. height [cm], with an ellipse fitted to the patterns and the point (152, 51.5) marked.]
Typical dimension: Mean μ
The center of the ellipse μ is the mean of the pattern distribution. It can be estimated by:

\mu = \frac{1}{N} \sum_{k=1}^{N} x_k

where x_k is the kth pattern in the cluster and N is the number of patterns in the cluster.
[Figure: weight [kg] vs. height [cm] scatter plot of the patterns.]
Typical dimension: Scatter matrix S
The scatter matrix S of a cluster is defined as:

S = \sum_{k=1}^{N} (x_k - \mu)(x_k - \mu)^T

where x_k is the kth pattern in the cluster, N is the number of patterns in the cluster, and μ is the mean of all the patterns in the cluster.
[Figure: two clusters in the weight [kg] vs. height [cm] plane, one with a larger scatter and one with a smaller scatter.]
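As a hedged, self-contained MATLAB sketch (not from the lecture), the mean and the scatter matrix of a single cluster can be estimated as follows. It assumes the cluster is stored as a d-by-N matrix with one pattern per column (the convention used in the synthesis example below); the toy values in X are made up for illustration.

% Sketch: estimate the mean and scatter matrix of one cluster.
% X is d-by-N, one pattern per column (toy height/weight values).
X = [170 175 168 182 177; 75 80 68 90 72];

N  = size(X, 2);
mu = (1/N) * sum(X, 2);          % cluster mean (d-by-1)

S = zeros(size(X, 1));
for k = 1:N
    d = X(:,k) - mu;
    S = S + d * d';              % accumulate (x_k - mu)(x_k - mu)'
end
% Note: S equals (N-1)*cov(X') up to the normalization convention.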
Data synthesis
• In order to test and debug pattern recognition algorithms, it is customary to use synthesized data.
• The synthesized data may be drawn from an arbitrary distribution, but in most of the literature it is customary to assume a normal distribution.
• Still, it is not infrequent to come across applications involving other pattern distributions.
[Figure: synthesized patterns in the weight [kg] vs. height [cm] plane.]
Example: Two clusters synthesis

% Synthesize two Gaussian clusters ("Males" and "Females") in the
% height/weight plane and plot them.
clear all
N1 = 150; N2 = 150;                       % number of patterns per cluster
E1 = [150 15; 120 20]; E2 = [100 10; 70 30];
M1 = [170,75]'; M2 = [160,50]';           % cluster means
[P1,A1] = eig(E1); [P2,A2] = eig(E2);     % eigenvectors/eigenvalues of E1, E2
y1 = randn(2,N1); y2 = randn(2,N2);       % standard normal samples
for i = 1:N1
    x1(:,i) = P1*sqrt(A1)*y1(:,i) + M1;   % shape and shift cluster 1
end
for i = 1:N2
    x2(:,i) = P2*sqrt(A2)*y2(:,i) + M2;   % shape and shift cluster 2
end
figure;
plot(x1(1,:),x1(2,:),'.',x2(1,:),x2(2,:),'or');
axis([120 220 30 120]);
xlabel('height [cm]');
ylabel('weight [kg]');

[Figure: the resulting "Males" (dots) and "Females" (circles) clusters in the weight [kg] vs. height [cm] plane.]
Separability
Q: How does the feature extraction method affect the pattern recognition performance?
[Figure: the two clusters in the weight [kg] vs. height [cm] plane.]
Separability
A: More than the classification algorithm does.
If the patterns produced by the feature extraction method form non-separable clusters, no classification algorithm can do the job.

[Diagram: feature extraction method → classification algorithm]

In fact, the separability achieved by the feature extraction method is the upper limit of the pattern recognition performance.
Separability
Q: How can one assess the separability achieved by the feature extraction method?

[Figure: three weight [kg] vs. height [cm] scatter plots showing cluster pairs whose separability is "Hopeless", "Bad", and "Good", respectively.]
Separability
A: Several separability criteria exist, most of them involving scatter matrices:

S = \sum_{k=1}^{N} (x_k - \mu)(x_k - \mu)^T

[Figure: two clusters in the weight [kg] vs. height [cm] plane, one with a larger scatter and one with a smaller scatter.]
Separability
Define the within-class scatter matrix as:

S_W = \sum_{i=1}^{C} P_i \sum_{k=1}^{N_i} (x_k^i - \mu_i)(x_k^i - \mu_i)^T

where x_k^i is the kth pattern in the ith cluster, N_i is the number of patterns in the ith cluster, μ_i is the mean of all the patterns in the ith cluster, C is the number of clusters, and P_i is the a priori probability of the ith cluster, which may be estimated by

P_i = N_i \Big/ \sum_{p=1}^{C} N_p

the between-class scatter matrix as:

S_B = \sum_{i=1}^{C} P_i (\mu_i - \mu)(\mu_i - \mu)^T

where μ is the mean of all the patterns in all the clusters, and the total scatter matrix as:

S_T = S_B + S_W
Separability function J
Several functions that represent the separability of clusters by a single scalar can be found in the literature:

J = \mathrm{tr}(S_B) / \mathrm{tr}(S_W)
J = \ln\left( \det( S_W^{-1} S_B ) \right)
J = \det( S_W^{-1} S_B )
J = \mathrm{tr}( S_W^{-1} S_B )

Note that J is a scalar. This is of utmost importance, because a scalar objective is needed for the optimization that follows (we will, of course, try to maximize J).
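A hedged, self-contained MATLAB sketch (not the lecture's code) of the definitions above: it computes S_W, S_B and two of the J criteria for two clusters stored as d-by-N_i matrices. The toy clusters are made-up stand-ins for the x1, x2 produced by the synthesis example.

% Sketch: within-class scatter SW, between-class scatter SB, and scalar
% separability scores J for two clusters (one pattern per column).
x1 = randn(2,150)*5 + repmat([170; 75], 1, 150);   % toy cluster 1
x2 = randn(2,150)*5 + repmat([160; 50], 1, 150);   % toy cluster 2

clusters = {x1, x2};
Xall = [x1 x2];
N    = size(Xall, 2);                              % total number of patterns
mu   = mean(Xall, 2);                              % overall mean

SW = zeros(2); SB = zeros(2);
for i = 1:numel(clusters)
    Xi  = clusters{i};
    Ni  = size(Xi, 2);
    Pi  = Ni / N;                                  % a priori probability estimate
    mui = mean(Xi, 2);                             % i-th cluster mean
    Di  = Xi - repmat(mui, 1, Ni);
    SW  = SW + Pi * (Di * Di');                    % within-class contribution
    SB  = SB + Pi * ((mui - mu) * (mui - mu)');    % between-class contribution
end

J_trace = trace(SB) / trace(SW)                    % J = tr(SB) / tr(SW)
J_ratio = trace(SW \ SB)                           % J = tr(SW^-1 * SB)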
Feature extraction guidelines
Q: What will be regarded as a reasonable feature?

Example: features of a student (for classifying gender):
• Number of eyes – useless; includes no information on gender.
• Hair color – useless; very poor correlation with gender.
• Wears glasses or not – useless; very poor correlation with gender.
• Hair length – effective, but hard to measure.
• Shoe size – effective and simple.
• Height – effective and simple.
• Weight – effective and simple.
Feature extraction guidelines
"When we have two or more classes, feature extraction consists of choosing those features which are most effective for preserving class separability" (Fukunaga, p. 441)

• Correlation with the classification target
• Easy to measure
• Maximizes the separability function
Feature extraction guidelines
Q: How can one determine which features follow the guidelines?

A1: There is an extensive literature on features for specific applications (e.g. symmetry for face recognition).

A2: A widespread approach is to emulate the human mechanism (e.g. treating all the pixels as features in face recognition).
Example - handwritten digit recognition
The following 20 figures depict examples of handwritten digits. Since almost everyone can recognize each of the digits, it is reasonable to assume that treating all the pixels as features will ensure a successful pattern recognition process.
Feature reduction
Complexity problem!

• For 28×28-pixel handwritten digits, the feature space dimension is 784.
• For face recognition we may have about 200×200 pixels, i.e. a feature space dimension of 40,000.
• For voice recognition we may have about 1 s × 16 kHz of samples, i.e. a feature space dimension of 16,000.
Feature reduction
Omit features! Because of:
• The complexity of most classification algorithms, which is O(n²) in the number of features
• Learning time
• The "curse of dimensionality"
Feature reduction
“Feature selection, also known as subset selection or variable selection,
is a process commonly used in machine learning, wherein a subset of
the features available from the data are selected for application of a
learning algorithm. Feature selection is necessary either because it is
computationally infeasible to use all available features, or because of
problems of estimation when limited data samples (but a large
number of features) are present. The latter problem is related to the
so-called curse of dimensionality.”
(WIKIPEDIA)

[Diagram: Object → feature extraction → high-dimension pattern → feature selection → low-dimension pattern]
Feature reduction
Q: How can one identify the irrelevant and the redundant features?
A: Several options:
1. Feature subset selection
By grouping features one can evaluate every (or, more efficiently, only the promising) set of features and provide a small yet highly representative subset of the feature space.

2. Heuristics
In most applications the relevant information resides in the low-frequency range (e.g. images), hence it is logical to reduce dimensionality by taking only the first coefficients of the Fourier/DCT transform.

3. Optimization approach: Karhunen-Loeve Transform (KLT), Principal Component Analysis (PCA), Fisher Linear Discriminant Analysis (FLDA)
We may select a subset of the features (or linear combinations of the features) which best contributes to the separability of the clusters (a PCA sketch follows below).
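As a hedged illustration of option 3, the following MATLAB sketch performs a PCA-style reduction in plain MATLAB (no toolboxes), keeping the top-k eigenvectors of the sample covariance. The toy data matrix and the choice of k are made up; this is an illustration, not the lecture's implementation.

% Sketch: dimensionality reduction with PCA.
% X is d-by-N, one pattern per column; keep the top-k principal directions.
X = randn(5, 200);                         % toy data: 5 features, 200 patterns
k = 2;                                     % target dimension

mu = mean(X, 2);
Xc = X - repmat(mu, 1, size(X, 2));        % center the patterns
C  = (Xc * Xc') / (size(X, 2) - 1);        % sample covariance (d-by-d)

[V, D]   = eig(C);                         % eigen-decomposition
[~, idx] = sort(diag(D), 'descend');       % order by decreasing eigenvalue
W        = V(:, idx(1:k));                 % top-k eigenvectors

Z = W' * Xc;                               % k-by-N reduced patterns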
Feature selection - Example
The separability of these two classes is pretty good. But do we really need two features?

[Figure: the "Males" and "Females" clusters in the weight [kg] vs. height [cm] plane, together with their one-dimensional projections.]
Feature selection - Example
No! The same separability can be achieved by projecting the patterns onto the blue axis, i.e. only a one-dimensional feature space is needed.

But how can we find this "blue axis", i.e. the feature subspace onto which we should project the patterns? Dimensionality reduction techniques: DCT, LDA, PCA, ... (see the sketch below).

[Figure: the "Males" and "Females" clusters in the weight [kg] vs. height [cm] plane, together with the projection axis and the projected one-dimensional patterns.]
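One standard way to find such a projection axis for two classes is Fisher's linear discriminant, whose direction is w ∝ S_W^{-1}(μ1 − μ2). The following is a hedged, self-contained MATLAB sketch (not the lecture's code); the toy clusters are made-up stand-ins for the synthesized x1, x2.

% Sketch: find a 1-D projection axis with Fisher's linear discriminant.
% Two clusters stored d-by-Ni, one pattern per column.
x1 = randn(2,150)*5 + repmat([170; 75], 1, 150);   % toy "Males"
x2 = randn(2,150)*5 + repmat([160; 50], 1, 150);   % toy "Females"

mu1 = mean(x1, 2);  mu2 = mean(x2, 2);
D1  = x1 - repmat(mu1, 1, size(x1, 2));
D2  = x2 - repmat(mu2, 1, size(x2, 2));
SW  = D1*D1' + D2*D2';                 % within-class scatter (unweighted form)

w = SW \ (mu1 - mu2);                  % Fisher direction, w ~ SW^-1 (mu1 - mu2)
w = w / norm(w);                       % unit-length projection axis

z1 = w' * x1;  z2 = w' * x2;           % 1-D projected patterns per class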


Feature Extraction & Selection: Overview

• Introduction/Motivation (WHY?)
• Basic definitions, terminology (WHAT?)
• Variable ranking methods (HOW?)
• Feature subset selection (HOW?)


Problem: Where to focus attention?

• A universal problem of intelligent (learning) agents is where to focus their attention.
• What aspects of the problem at hand are important/necessary to solve?
• Discriminate between the relevant and irrelevant parts of experience.
What is Feature selection?

• Feature selection: the problem of selecting some subset of a learning algorithm's input variables upon which it should focus attention, while ignoring the rest (DIMENSIONALITY REDUCTION).

• Humans/animals do that constantly!
Motivational example from Biology [1]

Monkeys performing a classification task

[1] N. Sigala, N. Logothetis: Visual categorization shapes feature selectivity in the primate temporal cortex. Nature, Vol. 415 (2002).
Motivational example from Biology

Monkeys performing classification task

Diagnostic features:
- Eye separation
- Eye height

Non-Diagnostic features:
- Mouth height
- Nose length
Motivational example from Biology

Monkeys performing classification task


Results:
– activity of a population of 150 neurons in the
anterior inferior temporal cortex was measured
– 44 neurons responded significantly differently to
at least one feature
– After training: 72% (32/44) were selective to one or both of the diagnostic features (and not to the non-diagnostic features)
Motivational example from Biology

Monkeys performing classification task

Results: [Figure: responses of the recorded neuron population.]
Feature Selection in ML?

Why even think about Feature Selection in ML?


- The information about the target class is inherent in the
variables!

- Naive theoretical view:


More features
=> More information
=> More discrimination power.

- In practice:
many reasons why this is not the case!

- Also:
Optimization is (usually) good, so why not try to optimize the
input-coding?
Feature Selection in ML ? YES!
- Many explored domains have hundreds to tens of thousands of
variables/features with many irrelevant and redundant ones!

- In domains with many features the underlying probability


distribution can be very complex and very hard to estimate (e.g.
dependencies between variables)!

- Irrelevant and redundant features can “confuse” learners!

- Limited training data!

- Limited computational resources!

- Curse of dimensionality!
Curse of dimensionality

[Figure: positive and negative examples plotted against features x1, x2 (2-D) and x1, x2, x3 (3-D).]
Curse of dimensionality

• The required number of samples (to achieve the same accuracy) grows exponentially with the number of variables!
• In practice, the number of training examples is fixed!
  => the classifier's performance usually will degrade for a large number of features!

In many cases the information that is lost by discarding variables is made up for by a more accurate mapping/sampling in the lower-dimensional space!
Example for an ML problem

Gene selection from microarray data

– Variables: gene expression coefficients corresponding to the amount of mRNA in a patient's sample (e.g. tissue biopsy)
– Task: separate healthy patients from cancer patients
– Usually only about 100 examples (patients) are available for training and testing (!!!)
– Number of variables in the raw data: 6,000 – 60,000
– Does this work? ([8])

[8] C. Ambroise, G.J. McLachlan: Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS, Vol. 99, 6562-6566 (2002).
Motivation

• Especially when dealing with a large number of variables there is a need for dimensionality reduction!

• Feature selection can significantly improve a learning algorithm's performance!
Overview

• Introduction/Motivation

• Basic definitions, Terminology

• Variable Ranking methods

• Feature subset selection


Problem setting

• Classification/Regression (Supervised Learning):
  Given empirical data (training data)

  L = \{ (x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_m, y_m) \} \subseteq X \times Y

  a learner has to find a hypothesis h \in H, h: X \to Y, that is used to assign a label y to an unseen x.

  - Classification: y is an integer (e.g. Y = {-1, 1})
  - Regression: y is real-valued (e.g. Y = [-1, 1])
Features/Variables

- Take a closer look at the data:

  X = f_1 \times \ldots \times f_i \times \ldots \times f_n

- i.e. each instance x = (x_1, \ldots, x_n)^T has n attributes, variables, features, dimensions.
Feature Selection - Definition

• Given a set of features F = \{ f_1, \ldots, f_i, \ldots, f_n \}, the Feature Selection problem is to find a subset F' \subseteq F that "maximizes the learner's ability to classify patterns".
Feature Extraction - Definition
• Given a set of features F = \{ f_1, \ldots, f_i, \ldots, f_n \}, the Feature Extraction ("Construction") problem is to map F to some feature set F'' that maximizes the learner's ability to classify patterns:

  F'' = \arg\max_{G \in \Gamma} J(G)

  where Γ is the set of all possible feature sets and J is the chosen scoring function.

• This general definition subsumes feature selection (i.e. a feature selection algorithm also performs a mapping, but can only map to subsets of the input variables).
Feature Selection / Extraction

• Feature Selection: F → F'

  \{ f_1, \ldots, f_i, \ldots, f_n \} \xrightarrow{\text{f. selection}} \{ f_{i_1}, \ldots, f_{i_j}, \ldots, f_{i_m} \}, \quad i_j \in \{1, \ldots, n\},\; j = 1, \ldots, m,\; i_a = i_b \Rightarrow a = b \;\; \forall a, b \in \{1, \ldots, m\}

• Feature Extraction/Creation: F → F'

  \{ f_1, \ldots, f_i, \ldots, f_n \} \xrightarrow{\text{f. extraction}} \{ g_1(f_1, \ldots, f_n), \ldots, g_j(f_1, \ldots, f_n), \ldots, g_m(f_1, \ldots, f_n) \}
Feature Selection – Optimality?

• In theory the goal is to find an optimal feature subset (one that maximizes the scoring function).

• In real-world applications this is usually not possible:
  – For most problems it is computationally intractable to search the whole space of possible feature subsets.
  – One usually has to settle for approximations of the optimal subset.
  – Most of the research in this area is devoted to finding efficient search heuristics.
Relevance of features
• Relevance of a variable/feature:
  – There are several definitions of relevance in the literature.
  – Relevance of one variable, relevance of a variable given other variables, relevance given a certain learning algorithm, ...
  – Most definitions are problematic, because there are problems where all features would be declared irrelevant.
  – One common definition (Kohavi & John): a feature f_i is strongly relevant if removing f_i alone results in a performance deterioration of an optimal Bayes classifier.

[1] R. Kohavi, G. John: Wrappers for Feature Subset Selection. Artificial Intelligence, 97(1-2):273-324, December 1997.
Relevance of features
• Relevance ≠ optimality of the feature set:
  – Classifiers induced from training data are likely to be suboptimal (no access to the real distribution of the data).
  – Relevance does not imply that the feature is in the optimal feature subset.
  – Even "irrelevant" features can improve a classifier's performance.
  – Defining relevance in terms of a given classifier (and therefore a hypothesis space) would be better.
Overview

• Introduction/Motivation

• Basic definitions, Terminology

• Variable Ranking methods

• Feature subset selection


Ranking Criteria – Correlation
Correlation criteria:
• Pearson correlation coefficient

  R(f_i, y) = \frac{ \mathrm{cov}(f_i, y) }{ \sqrt{ \mathrm{var}(f_i)\, \mathrm{var}(y) } }

• Estimate for m samples:

  R(f_i, y) = \frac{ \sum_{k=1}^{m} (f_{k,i} - \bar{f}_i)(y_k - \bar{y}) }{ \sqrt{ \sum_{k=1}^{m} (f_{k,i} - \bar{f}_i)^2 \sum_{k=1}^{m} (y_k - \bar{y})^2 } }

The higher the correlation between the feature and the target, the higher the score!
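A small self-contained MATLAB sketch (not from the lecture) of ranking features by the squared correlation estimate above. The feature matrix F, target y, and their sizes are toy assumptions for illustration.

% Sketch: rank features by (squared) Pearson correlation with the target.
% F is m-by-n (m samples, n features), y is m-by-1.
m = 200;  n = 5;
F = randn(m, n);                          % toy feature matrix
y = F(:,2) + 0.5*randn(m, 1);             % toy target, correlated with feature 2

R = zeros(1, n);
for i = 1:n
    fi   = F(:,i);
    num  = sum((fi - mean(fi)) .* (y - mean(y)));
    den  = sqrt(sum((fi - mean(fi)).^2) * sum((y - mean(y)).^2));
    R(i) = num / den;                     % estimate of R(f_i, y)
end

[~, order] = sort(R.^2, 'descend');       % feature indices ranked by R^2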
Ranking Criteria – Correlation
Correlation criteria:
• R(x_i, y) ∈ [-1, 1]
• Mostly R(x_i, y)² or |R(x_i, y)| is used.
• A measure for the goodness of linear fit of x_i and y (it can only detect linear dependencies between variable and target).
• What if y = XOR(x_1, x_2)?
Ranking Criteria – Correlation
Questions:

• Can variables with a small score be automatically discarded?
  No.

• Can a useless variable (i.e. one with a small score) be useful together with others?
  Yes.

• Can two variables that are useless by themselves be useful together?
  This depends on the correlation between x_1 and x_2.
Ranking Criteria – Correlation
• Can variables with a small score be discarded without further consideration?
  NO!
• Even variables with a small score can improve class separability!
• Here this depends on the correlation between x_1 and x_2. (Here the class-conditional distributions have a high covariance in the direction orthogonal to the line between the two class centers.)
Ranking Criteria – Correlation
• Example with high correlation between x_1 and x_2. (Here the class-conditional distributions have a high covariance in the direction of the two class centers.)
• No gain in separation ability by using two variables instead of just one!
Ranking Criteria – Correlation
• Can a useless variable be
useful together with
others ?
YES!
Ranking Criteria – Correlation
• Correlation between variables and the target is not enough to assess relevance!
• Correlation / covariance between pairs of variables has to be considered too! (potentially difficult)
• Diversity of features
Ranking Criteria – Inf. Theory
Information-theoretic criteria
• Most approaches use (empirical estimates of) the mutual information between features and the target:

  I(x_i, y) = \int_{x_i} \int_{y} p(x_i, y) \log \frac{ p(x_i, y) }{ p(x_i)\, p(y) } \, dx_i \, dy

• Case of discrete variables:

  I(x_i, y) = \sum_{x_i} \sum_{y} P(X = x_i, Y = y) \log \frac{ P(X = x_i, Y = y) }{ P(X = x_i)\, P(Y = y) }

  (probabilities are estimated from frequency counts)
Ranking Criteria – Inf. Theory
• Mutual information can also detect non-linear dependencies
among variables!

• But harder to estimate than correlation!

• It is a measure for “how much information (in terms of


entropy) two random variables share”
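A hedged MATLAB sketch (not from the lecture) of the discrete-variable estimate above, where the joint and marginal probabilities are taken from frequency counts. The toy vectors x and y are made up for illustration.

% Sketch: estimate mutual information I(x,y) for two discrete variables
% from frequency counts. x, y are vectors of integer labels.
x = randi(3, 500, 1);                       % toy feature with 3 discrete values
y = double(x == 2) + 1;                     % toy target, dependent on x

xv = unique(x);  yv = unique(y);
Pxy = zeros(numel(xv), numel(yv));
for a = 1:numel(xv)
    for b = 1:numel(yv)
        Pxy(a,b) = mean(x == xv(a) & y == yv(b));   % joint frequency estimate
    end
end
Px = sum(Pxy, 2);  Py = sum(Pxy, 1);        % marginal estimates

I = 0;                                      % mutual information (in bits)
for a = 1:numel(xv)
    for b = 1:numel(yv)
        if Pxy(a,b) > 0
            I = I + Pxy(a,b) * log2(Pxy(a,b) / (Px(a) * Py(b)));
        end
    end
end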
Overview

• Introduction/Motivation

• Basic definitions, Terminology

• Variable Ranking methods

• Feature subset selection


Feature Subset Selection

• Goal:
- Find the optimal feature subset.
(or at least a “good one.”)

• Classification of methods:
– Filters
– Wrappers
Feature Subset Selection
• You need:
– a measure for assessing the goodness of a feature
subset (scoring function)

– a strategy to search the space of possible feature


subsets

• Finding a minimal optimal feature set for an


arbitrary target concept is NP-hard
=> Good heuristics are needed!

[9] E. Amaldi, V. Kann: The approximability of minimizing nonzero variables and unsatisfied relations in linear systems. (1997)
Feature Subset Selection
Filter Methods
• Select subsets of variables as a pre-processing step,
independently of the used classifier!!

• Note that Variable Ranking-FS is a filter method


Feature Subset Selection
Filter Methods
• usually fast
• provide generic selection of features, not tuned by
given learner (universal)
• this is also often criticised (feature set not optimized
for used classifier)
• sometimes used as a pre-processing step for other
methods
Feature Subset Selection
Wrapper Methods
• Learner is considered a black-box

• Interface of the black-box is used to score subsets of


variables according to the predictive power of the learner
when using the subsets.

• Results vary for different learners


• One needs to define:
– how to search the space of all possible variable subsets ?
– how to assess the prediction performance of a learner ?
Feature Subset Selection
Wrapper Methods
• The problem of finding the optimal subset is NP-hard!

• A wide range of heuristic search strategies can be used. Two different classes:
  – Forward selection (start with an empty feature set and add features at each step): we start with a null model, fit the model with each individual feature one at a time, and select the feature with the minimum p-value.

  – Backward elimination (start with the full feature set and discard features at each step): we start with the full model (including all the independent variables) and then remove the insignificant feature with the highest p-value (above the significance level).
Feature Subset Selection
Wrapper Methods
• Bi-directional elimination (stepwise selection)
  ➢ Similar to forward selection, but while adding a new feature it also checks the significance of the already added features; if any of the already selected features is found insignificant, it is removed through backward elimination.
  ➢ Hence, it is a combination of forward selection and backward elimination.

• Predictive power is usually measured on a validation set or by cross-validation.
• By using the learner as a black box, wrappers are universal and simple!
• Criticism: a large amount of computation is required.
• Still greedy. (A forward-selection sketch follows below.)
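As a hedged illustration of the wrapper idea (not the lecture's procedure), the following self-contained MATLAB sketch performs greedy forward selection. To keep it toolbox-free it scores each candidate subset by the validation accuracy of a simple nearest-centroid classifier rather than by p-values; the data, labels, and holdout split are toy assumptions.

% Sketch: greedy forward selection, treating the learner as a black box.
m = 300;  n = 6;
X = randn(m, n);                                   % toy data: m samples, n features
y = double(X(:,1) + X(:,3) > 0);                   % toy labels depend on features 1 and 3

tr = 1:2:m;  va = 2:2:m;                           % simple holdout split
Xtr = X(tr,:);  ytr = y(tr);
Xva = X(va,:);  yva = y(va);

selected = [];  remaining = 1:n;  best = 0;
while ~isempty(remaining)
    accs = zeros(size(remaining));
    for j = 1:numel(remaining)
        S  = [selected remaining(j)];              % candidate subset
        c0 = mean(Xtr(ytr == 0, S), 1);            % class centroids on training part
        c1 = mean(Xtr(ytr == 1, S), 1);
        d0 = sum((Xva(:,S) - repmat(c0, numel(va), 1)).^2, 2);
        d1 = sum((Xva(:,S) - repmat(c1, numel(va), 1)).^2, 2);
        accs(j) = mean(double(d1 < d0) == yva);    % validation accuracy
    end
    [bestNew, j] = max(accs);
    if bestNew <= best, break; end                 % stop when no feature helps
    best = bestNew;
    selected = [selected remaining(j)];            % add the best new feature
    remaining(j) = [];
end
disp(selected)                                     % indices of the chosen features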
