
Lecture 5

Feature Extraction & Selection

Dr. Amr El-Wakeel
Lane Department of Computer Science and Electrical Engineering

Spring 24

Feature extraction
Some slides adapted from Alon Slapak.

Geometric structure of a cluster
• Each of the red circles represents an object pattern. The coordinates of the circle are the features (e.g. height and weight).
• The ellipse which encircles the major part of the cluster represents the distribution of the patterns.
• The length of the principal axes of the ellipse is proportional to twice the square root of the eigenvalues of the covariance matrix of the pattern distribution (λ1, λ2).
• The principal axes of the ellipse coincide with the eigenvectors of the covariance matrix of the pattern distribution.
• The center of the ellipse μ is the mean of the pattern distribution.
[Figure: scatter plot of the "Females" cluster, weight [kg] vs. height [cm], with an ellipse fitted to the patterns and the point (152, 51.5) marked.]
Typical dimension: Mean μ
The center of the ellipse μ is the mean of the pattern distribution. It can be estimated by:

\mu = \frac{1}{N} \sum_{k=1}^{N} x_k

where x_k is the kth pattern in the cluster and N is the number of patterns in the cluster.
[Figure: weight [kg] vs. height [cm] scatter plot of the patterns.]
Typical dimension: Scatter matrix S
The scatter matrix S of a cluster is defined as:

S = \sum_{k=1}^{N} (x_k - \mu)(x_k - \mu)^T

where x_k is the kth pattern in the cluster, N is the number of patterns in the cluster, and μ is the mean of all the patterns in the cluster.
[Figure: two clusters in the weight [kg] vs. height [cm] plane, one with a larger scatter and one with a smaller scatter.]
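As a hedged, self-contained MATLAB sketch (not from the lecture), the mean and the scatter matrix of a single cluster can be estimated as follows. It assumes the cluster is stored as a d-by-N matrix with one pattern per column (the convention used in the synthesis example below); the toy values in X are made up for illustration.

% Sketch: estimate the mean and scatter matrix of one cluster.
% X is d-by-N, one pattern per column (toy height/weight values).
X = [170 175 168 182 177; 75 80 68 90 72];

N  = size(X, 2);
mu = (1/N) * sum(X, 2);          % cluster mean (d-by-1)

S = zeros(size(X, 1));
for k = 1:N
    d = X(:,k) - mu;
    S = S + d * d';              % accumulate (x_k - mu)(x_k - mu)'
end
% Note: S equals (N-1)*cov(X') up to the normalization convention.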
Data synthesis
• In order to test and debug pattern recognition algorithms, it is customary to use synthesized data.
• The synthesized data may be drawn from an arbitrary distribution, but in most of the literature it is customary to assume a normal distribution.
• Still, it is not infrequent to come across applications involving other pattern distributions.
[Figure: synthesized patterns in the weight [kg] vs. height [cm] plane.]
Example: Two clusters synthesis

% Synthesize two Gaussian clusters ("Males" and "Females") in the
% height/weight plane and plot them.
clear all
N1 = 150; N2 = 150;                       % number of patterns per cluster
E1 = [150 15; 120 20]; E2 = [100 10; 70 30];
M1 = [170,75]'; M2 = [160,50]';           % cluster means
[P1,A1] = eig(E1); [P2,A2] = eig(E2);     % eigenvectors/eigenvalues of E1, E2
y1 = randn(2,N1); y2 = randn(2,N2);       % standard normal samples
for i = 1:N1
    x1(:,i) = P1*sqrt(A1)*y1(:,i) + M1;   % shape and shift cluster 1
end
for i = 1:N2
    x2(:,i) = P2*sqrt(A2)*y2(:,i) + M2;   % shape and shift cluster 2
end
figure;
plot(x1(1,:),x1(2,:),'.',x2(1,:),x2(2,:),'or');
axis([120 220 30 120]);
xlabel('height [cm]');
ylabel('weight [kg]');

[Figure: the resulting "Males" (dots) and "Females" (circles) clusters in the weight [kg] vs. height [cm] plane.]
Separability
Q: How does the feature extraction method affect the pattern recognition performance?
[Figure: the two clusters in the weight [kg] vs. height [cm] plane.]
Separability
A: More than the classification algorithm does.
If the patterns produced by the feature extraction method form non-separable clusters, no classification algorithm can do the job.

[Diagram: feature extraction method → classification algorithm]

In fact, the separability achieved by the feature extraction method is the upper limit of the pattern recognition performance.
Separability
Q: How can one assess the separability achieved by the feature extraction method?

[Figure: three weight [kg] vs. height [cm] scatter plots showing cluster pairs whose separability is "Hopeless", "Bad", and "Good", respectively.]
Separability
A: Several separability criteria exist, most of them involving scatter matrices:

S = \sum_{k=1}^{N} (x_k - \mu)(x_k - \mu)^T

[Figure: two clusters in the weight [kg] vs. height [cm] plane, one with a larger scatter and one with a smaller scatter.]
Separability
Define the within-class scatter matrix as:

S_W = \sum_{i=1}^{C} P_i \sum_{k=1}^{N_i} (x_k^i - \mu_i)(x_k^i - \mu_i)^T

where x_k^i is the kth pattern in the ith cluster, N_i is the number of patterns in the ith cluster, μ_i is the mean of all the patterns in the ith cluster, C is the number of clusters, and P_i is the a priori probability of the ith cluster, which may be estimated by

P_i = N_i \Big/ \sum_{p=1}^{C} N_p

the between-class scatter matrix as:

S_B = \sum_{i=1}^{C} P_i (\mu_i - \mu)(\mu_i - \mu)^T

where μ is the mean of all the patterns in all the clusters, and the total scatter matrix as:

S_T = S_B + S_W
Separability function J
Several functions that represent the separability of clusters by a single scalar can be found in the literature:

J = \mathrm{tr}(S_B) / \mathrm{tr}(S_W)
J = \ln\left( \det( S_W^{-1} S_B ) \right)
J = \det( S_W^{-1} S_B )
J = \mathrm{tr}( S_W^{-1} S_B )

Note that J is a scalar. This is of utmost importance, because a scalar objective is needed for the optimization that follows (we will, of course, try to maximize J).
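A hedged, self-contained MATLAB sketch (not the lecture's code) of the definitions above: it computes S_W, S_B and two of the J criteria for two clusters stored as d-by-N_i matrices. The toy clusters are made-up stand-ins for the x1, x2 produced by the synthesis example.

% Sketch: within-class scatter SW, between-class scatter SB, and scalar
% separability scores J for two clusters (one pattern per column).
x1 = randn(2,150)*5 + repmat([170; 75], 1, 150);   % toy cluster 1
x2 = randn(2,150)*5 + repmat([160; 50], 1, 150);   % toy cluster 2

clusters = {x1, x2};
Xall = [x1 x2];
N    = size(Xall, 2);                              % total number of patterns
mu   = mean(Xall, 2);                              % overall mean

SW = zeros(2); SB = zeros(2);
for i = 1:numel(clusters)
    Xi  = clusters{i};
    Ni  = size(Xi, 2);
    Pi  = Ni / N;                                  % a priori probability estimate
    mui = mean(Xi, 2);                             % i-th cluster mean
    Di  = Xi - repmat(mui, 1, Ni);
    SW  = SW + Pi * (Di * Di');                    % within-class contribution
    SB  = SB + Pi * ((mui - mu) * (mui - mu)');    % between-class contribution
end

J_trace = trace(SB) / trace(SW)                    % J = tr(SB) / tr(SW)
J_ratio = trace(SW \ SB)                           % J = tr(SW^-1 * SB)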
Feature extraction guidelines
Q: What will be regarded as a reasonable feature?

Example: features of a student (for classifying gender):
• Number of eyes – useless; includes no information on gender.
• Hair color – useless; very poor correlation with gender.
• Wears glasses or not – useless; very poor correlation with gender.
• Hair length – effective, but hard to measure.
• Shoe size – effective and simple.
• Height – effective and simple.
• Weight – effective and simple.
Feature extraction guidelines
"When we have two or more classes, feature extraction consists of choosing those features which are most effective for preserving class separability" (Fukunaga, p. 441)

• Correlation with the classification target
• Easy to measure
• Maximizes the separability function
Feature extraction guidelines
Q: How can one determine which features follow the guidelines?

A1: There is an extensive literature on features for specific applications (e.g. symmetry for face recognition).

A2: A widespread approach is to emulate the human mechanism (e.g. treating all the pixels as features in face recognition).
Example - handwritten digit recognition
The following 20 figures depict examples of handwritten digits. Since almost everyone can recognize each of the digits, it is reasonable to assume that treating all the pixels as features will ensure a successful pattern recognition process.
Feature reduction
Complexity problem!

• For 28×28-pixel handwritten digits, the feature space dimension is 784.
• For face recognition we may have about 200×200 pixels, i.e. a feature space dimension of 40,000.
• For voice recognition we may have about 1 s × 16 kHz of samples, i.e. a feature space dimension of 16,000.
Feature reduction
Omit features! Because of:
• The complexity of most classification algorithms, which is O(n²) in the number of features
• Learning time
• The "curse of dimensionality"
Feature reduction
“Feature selection, also known as subset selection or variable selection,
is a process commonly used in machine learning, wherein a subset of
the features available from the data are selected for application of a
learning algorithm. Feature selection is necessary either because it is
computationally infeasible to use all available features, or because of
problems of estimation when limited data samples (but a large
number of features) are present. The latter problem is related to the
so-called curse of dimensionality.”
(WIKIPEDIA)

[Diagram: Object → feature extraction → high-dimension pattern → feature selection → low-dimension pattern]
Feature reduction
Q: How can one identify the irrelevant and the redundant features?
A: Several options:
1. Feature subset selection
By grouping features one can evaluate every (or, more efficiently, only the promising) set of features and provide a small yet highly representative subset of the feature space.

2. Heuristics
In most applications the relevant information resides in the low-frequency range (e.g. images), hence it is logical to reduce dimensionality by taking only the first coefficients of the Fourier/DCT transform.

3. Optimization approach: Karhunen-Loeve Transform (KLT), Principal Component Analysis (PCA), Fisher Linear Discriminant Analysis (FLDA)
We may select a subset of the features (or linear combinations of the features) which best contributes to the separability of the clusters (a PCA sketch follows below).
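As a hedged illustration of option 3, the following MATLAB sketch performs a PCA-style reduction in plain MATLAB (no toolboxes), keeping the top-k eigenvectors of the sample covariance. The toy data matrix and the choice of k are made up; this is an illustration, not the lecture's implementation.

% Sketch: dimensionality reduction with PCA.
% X is d-by-N, one pattern per column; keep the top-k principal directions.
X = randn(5, 200);                         % toy data: 5 features, 200 patterns
k = 2;                                     % target dimension

mu = mean(X, 2);
Xc = X - repmat(mu, 1, size(X, 2));        % center the patterns
C  = (Xc * Xc') / (size(X, 2) - 1);        % sample covariance (d-by-d)

[V, D]   = eig(C);                         % eigen-decomposition
[~, idx] = sort(diag(D), 'descend');       % order by decreasing eigenvalue
W        = V(:, idx(1:k));                 % top-k eigenvectors

Z = W' * Xc;                               % k-by-N reduced patterns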
Feature selection - Example
The separability of these two classes is pretty good. But do we really need two features?

[Figure: the "Males" and "Females" clusters in the weight [kg] vs. height [cm] plane, together with their one-dimensional projections.]
Feature selection - Example
No! The same separability can be achieved by projecting the patterns onto the blue axis, i.e. only a one-dimensional feature space is needed.

But how can we find this "blue axis", i.e. the feature subspace onto which we should project the patterns? Dimensionality reduction techniques: DCT, LDA, PCA, ... (see the sketch below).

[Figure: the "Males" and "Females" clusters in the weight [kg] vs. height [cm] plane, together with the projection axis and the projected one-dimensional patterns.]
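One standard way to find such a projection axis for two classes is Fisher's linear discriminant, whose direction is w ∝ S_W^{-1}(μ1 − μ2). The following is a hedged, self-contained MATLAB sketch (not the lecture's code); the toy clusters are made-up stand-ins for the synthesized x1, x2.

% Sketch: find a 1-D projection axis with Fisher's linear discriminant.
% Two clusters stored d-by-Ni, one pattern per column.
x1 = randn(2,150)*5 + repmat([170; 75], 1, 150);   % toy "Males"
x2 = randn(2,150)*5 + repmat([160; 50], 1, 150);   % toy "Females"

mu1 = mean(x1, 2);  mu2 = mean(x2, 2);
D1  = x1 - repmat(mu1, 1, size(x1, 2));
D2  = x2 - repmat(mu2, 1, size(x2, 2));
SW  = D1*D1' + D2*D2';                 % within-class scatter (unweighted form)

w = SW \ (mu1 - mu2);                  % Fisher direction, w ~ SW^-1 (mu1 - mu2)
w = w / norm(w);                       % unit-length projection axis

z1 = w' * x1;  z2 = w' * x2;           % 1-D projected patterns per class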


Feature Extraction & Selection: Overview

• Introduction/Motivation (WHY?)
• Basic definitions, terminology (WHAT?)
• Variable ranking methods (HOW?)
• Feature subset selection (HOW?)


Problem: Where to focus attention?

• A universal problem of intelligent (learning) agents is where to focus their attention.
• What aspects of the problem at hand are important/necessary to solve?
• Discriminate between the relevant and irrelevant parts of experience.
What is Feature selection?

• Feature selection: the problem of selecting some subset of a learning algorithm's input variables upon which it should focus attention, while ignoring the rest (DIMENSIONALITY REDUCTION).

• Humans/animals do that constantly!
Motivational example from Biology [1]

Monkeys performing a classification task

[1] N. Sigala, N. Logothetis: Visual categorization shapes feature selectivity in the primate temporal cortex. Nature, Vol. 415 (2002).
Motivational example from Biology

Monkeys performing classification task

Diagnostic features:
- Eye separation
- Eye height

Non-Diagnostic features:
- Mouth height
- Nose length
Motivational example from Biology

Monkeys performing classification task


Results:
– activity of a population of 150 neurons in the
anterior inferior temporal cortex was measured
– 44 neurons responded significantly differently to
at least one feature
– After training: 72% (32/44) were selective to one or both of the diagnostic features (and not to the non-diagnostic features)
Motivational example from Biology

Monkeys performing classification task

Results: [Figure: responses of the recorded neuron population.]
Feature Selection in ML?

Why even think about Feature Selection in ML?


- The information about the target class is inherent in the
variables!

- Naive theoretical view:


More features
=> More information
=> More discrimination power.

- In practice:
many reasons why this is not the case!

- Also:
Optimization is (usually) good, so why not try to optimize the
input-coding?
Feature Selection in ML ? YES!
- Many explored domains have hundreds to tens of thousands of
variables/features with many irrelevant and redundant ones!

- In domains with many features the underlying probability


distribution can be very complex and very hard to estimate (e.g.
dependencies between variables)!

- Irrelevant and redundant features can “confuse” learners!

- Limited training data!

- Limited computational resources!

- Curse of dimensionality!
Curse of dimensionality

[Figure: positive and negative examples plotted against features x1, x2 (2-D) and x1, x2, x3 (3-D).]
Curse of dimensionality

• The required number of samples (to achieve the same accuracy) grows exponentially with the number of variables!
• In practice, the number of training examples is fixed!
  => the classifier's performance usually will degrade for a large number of features!

In many cases the information that is lost by discarding variables is made up for by a more accurate mapping/sampling in the lower-dimensional space!
Example for an ML problem

Gene selection from microarray data

– Variables: gene expression coefficients corresponding to the amount of mRNA in a patient's sample (e.g. tissue biopsy)
– Task: separate healthy patients from cancer patients
– Usually only about 100 examples (patients) are available for training and testing (!!!)
– Number of variables in the raw data: 6,000 – 60,000
– Does this work? ([8])

[8] C. Ambroise, G.J. McLachlan: Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS, Vol. 99, 6562-6566 (2002).
Motivation

• Especially when dealing with a large number of variables there is a need for dimensionality reduction!

• Feature selection can significantly improve a learning algorithm's performance!
Overview

• Introduction/Motivation

• Basic definitions, Terminology

• Variable Ranking methods

• Feature subset selection


Problem setting

• Classification/Regression (Supervised Learning):
  Given empirical data (training data)

  L = \{ (x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_m, y_m) \} \subseteq X \times Y

  a learner has to find a hypothesis h \in H, h: X \to Y, that is used to assign a label y to an unseen x.

  - Classification: y is an integer (e.g. Y = {-1, 1})
  - Regression: y is real-valued (e.g. Y = [-1, 1])
Features/Variables

- Take a closer look at the data:

  X = f_1 \times \ldots \times f_i \times \ldots \times f_n

- i.e. each instance x = (x_1, \ldots, x_n)^T has n attributes, variables, features, dimensions.
Feature Selection - Definition

• Given a set of features F = \{ f_1, \ldots, f_i, \ldots, f_n \}, the Feature Selection problem is to find a subset F' \subseteq F that "maximizes the learner's ability to classify patterns".
Feature Extraction - Definition
• Given a set of features F = \{ f_1, \ldots, f_i, \ldots, f_n \}, the Feature Extraction ("Construction") problem is to map F to some feature set F'' that maximizes the learner's ability to classify patterns:

  F'' = \arg\max_{G \in \Gamma} J(G)

  where Γ is the set of all possible feature sets and J is the chosen scoring function.

• This general definition subsumes feature selection (i.e. a feature selection algorithm also performs a mapping, but can only map to subsets of the input variables).
Feature Selection / Extraction

• Feature Selection: F → F'

  \{ f_1, \ldots, f_i, \ldots, f_n \} \xrightarrow{\text{f. selection}} \{ f_{i_1}, \ldots, f_{i_j}, \ldots, f_{i_m} \}, \quad i_j \in \{1, \ldots, n\},\; j = 1, \ldots, m,\; i_a = i_b \Rightarrow a = b \;\; \forall a, b \in \{1, \ldots, m\}

• Feature Extraction/Creation: F → F'

  \{ f_1, \ldots, f_i, \ldots, f_n \} \xrightarrow{\text{f. extraction}} \{ g_1(f_1, \ldots, f_n), \ldots, g_j(f_1, \ldots, f_n), \ldots, g_m(f_1, \ldots, f_n) \}
Feature Selection – Optimality?

• In theory the goal is to find an optimal feature subset (one that maximizes the scoring function).

• In real-world applications this is usually not possible:
  – For most problems it is computationally intractable to search the whole space of possible feature subsets.
  – One usually has to settle for approximations of the optimal subset.
  – Most of the research in this area is devoted to finding efficient search heuristics.
Relevance of features
• Relevance of a variable/feature:
  – There are several definitions of relevance in the literature.
  – Relevance of one variable, relevance of a variable given other variables, relevance given a certain learning algorithm, ...
  – Most definitions are problematic, because there are problems where all features would be declared irrelevant.
  – One common definition (Kohavi & John): a feature f_i is strongly relevant if removing f_i alone results in a performance deterioration of an optimal Bayes classifier.

[1] R. Kohavi, G. John: Wrappers for Feature Subset Selection. Artificial Intelligence, 97(1-2):273-324, December 1997.
Relevance of features
• Relevance ≠ optimality of the feature set:
  – Classifiers induced from training data are likely to be suboptimal (no access to the real distribution of the data).
  – Relevance does not imply that the feature is in the optimal feature subset.
  – Even "irrelevant" features can improve a classifier's performance.
  – Defining relevance in terms of a given classifier (and therefore a hypothesis space) would be better.
Overview

• Introduction/Motivation

• Basic definitions, Terminology

• Variable Ranking methods

• Feature subset selection


Ranking Criteria – Correlation
Correlation criteria:
• Pearson correlation coefficient

  R(f_i, y) = \frac{ \mathrm{cov}(f_i, y) }{ \sqrt{ \mathrm{var}(f_i)\, \mathrm{var}(y) } }

• Estimate for m samples:

  R(f_i, y) = \frac{ \sum_{k=1}^{m} (f_{k,i} - \bar{f}_i)(y_k - \bar{y}) }{ \sqrt{ \sum_{k=1}^{m} (f_{k,i} - \bar{f}_i)^2 \sum_{k=1}^{m} (y_k - \bar{y})^2 } }

The higher the correlation between the feature and the target, the higher the score!
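A small self-contained MATLAB sketch (not from the lecture) of ranking features by the squared correlation estimate above. The feature matrix F, target y, and their sizes are toy assumptions for illustration.

% Sketch: rank features by (squared) Pearson correlation with the target.
% F is m-by-n (m samples, n features), y is m-by-1.
m = 200;  n = 5;
F = randn(m, n);                          % toy feature matrix
y = F(:,2) + 0.5*randn(m, 1);             % toy target, correlated with feature 2

R = zeros(1, n);
for i = 1:n
    fi   = F(:,i);
    num  = sum((fi - mean(fi)) .* (y - mean(y)));
    den  = sqrt(sum((fi - mean(fi)).^2) * sum((y - mean(y)).^2));
    R(i) = num / den;                     % estimate of R(f_i, y)
end

[~, order] = sort(R.^2, 'descend');       % feature indices ranked by R^2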
Ranking Criteria – Correlation
Correlation criteria:
• R(x_i, y) ∈ [-1, 1]
• Mostly R(x_i, y)² or |R(x_i, y)| is used.
• A measure for the goodness of linear fit of x_i and y (it can only detect linear dependencies between variable and target).
• What if y = XOR(x_1, x_2)?
Ranking Criteria – Correlation
Questions:

• Can variables with a small score be automatically discarded?
  No.

• Can a useless variable (i.e. one with a small score) be useful together with others?
  Yes.

• Can two variables that are useless by themselves be useful together?
  This depends on the correlation between x_1 and x_2.
Ranking Criteria – Correlation
• Can variables with a small score be discarded without further consideration?
  NO!
• Even variables with a small score can improve class separability!
• Here this depends on the correlation between x_1 and x_2. (Here the class-conditional distributions have a high covariance in the direction orthogonal to the line between the two class centers.)
Ranking Criteria – Correlation
• Example with high correlation between x_1 and x_2. (Here the class-conditional distributions have a high covariance in the direction of the two class centers.)
• No gain in separation ability by using two variables instead of just one!
Ranking Criteria – Correlation
• Can a useless variable be
useful together with
others ?
YES!
Ranking Criteria – Correlation
• Correlation between variables and the target is not enough to assess relevance!
• Correlation / covariance between pairs of variables has to be considered too! (potentially difficult)
• Diversity of features
Ranking Criteria – Inf. Theory
Information-theoretic criteria
• Most approaches use (empirical estimates of) the mutual information between features and the target:

  I(x_i, y) = \int_{x_i} \int_{y} p(x_i, y) \log \frac{ p(x_i, y) }{ p(x_i)\, p(y) } \, dx_i \, dy

• Case of discrete variables:

  I(x_i, y) = \sum_{x_i} \sum_{y} P(X = x_i, Y = y) \log \frac{ P(X = x_i, Y = y) }{ P(X = x_i)\, P(Y = y) }

  (probabilities are estimated from frequency counts)
Ranking Criteria – Inf. Theory
• Mutual information can also detect non-linear dependencies
among variables!

• But harder to estimate than correlation!

• It is a measure for “how much information (in terms of


entropy) two random variables share”
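A hedged MATLAB sketch (not from the lecture) of the discrete-variable estimate above, where the joint and marginal probabilities are taken from frequency counts. The toy vectors x and y are made up for illustration.

% Sketch: estimate mutual information I(x,y) for two discrete variables
% from frequency counts. x, y are vectors of integer labels.
x = randi(3, 500, 1);                       % toy feature with 3 discrete values
y = double(x == 2) + 1;                     % toy target, dependent on x

xv = unique(x);  yv = unique(y);
Pxy = zeros(numel(xv), numel(yv));
for a = 1:numel(xv)
    for b = 1:numel(yv)
        Pxy(a,b) = mean(x == xv(a) & y == yv(b));   % joint frequency estimate
    end
end
Px = sum(Pxy, 2);  Py = sum(Pxy, 1);        % marginal estimates

I = 0;                                      % mutual information (in bits)
for a = 1:numel(xv)
    for b = 1:numel(yv)
        if Pxy(a,b) > 0
            I = I + Pxy(a,b) * log2(Pxy(a,b) / (Px(a) * Py(b)));
        end
    end
end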
Overview

• Introduction/Motivation

• Basic definitions, Terminology

• Variable Ranking methods

• Feature subset selection


Feature Subset Selection

• Goal:
- Find the optimal feature subset.
(or at least a “good one.”)

• Classification of methods:
– Filters
– Wrappers
Feature Subset Selection
• You need:
– a measure for assessing the goodness of a feature
subset (scoring function)

– a strategy to search the space of possible feature


subsets

• Finding a minimal optimal feature set for an


arbitrary target concept is NP-hard
=> Good heuristics are needed!

[9] E. Amaldi, V. Kann: The approximability of minimizing nonzero variables and unsatisfied relations in linear systems. (1997)
Feature Subset Selection
Filter Methods
• Select subsets of variables as a pre-processing step,
independently of the used classifier!!

• Note that Variable Ranking-FS is a filter method


Feature Subset Selection
Filter Methods
• usually fast
• provide generic selection of features, not tuned by
given learner (universal)
• this is also often criticised (feature set not optimized
for used classifier)
• sometimes used as a pre-processing step for other
methods
Feature Subset Selection
Wrapper Methods
• Learner is considered a black-box

• Interface of the black-box is used to score subsets of


variables according to the predictive power of the learner
when using the subsets.

• Results vary for different learners


• One needs to define:
– how to search the space of all possible variable subsets ?
– how to assess the prediction performance of a learner ?
Feature Subset Selection
Wrapper Methods
• The problem of finding the optimal subset is NP-hard!

• A wide range of heuristic search strategies can be used. Two different classes:
  – Forward selection (start with an empty feature set and add features at each step): we start with a null model, fit the model with each individual feature one at a time, and select the feature with the minimum p-value.

  – Backward elimination (start with the full feature set and discard features at each step): we start with the full model (including all the independent variables) and then remove the insignificant feature with the highest p-value (above the significance level).
Feature Subset Selection
Wrapper Methods
• Bi-directional elimination (stepwise selection)
  ➢ Similar to forward selection, but while adding a new feature it also checks the significance of the already added features; if any of the already selected features is found insignificant, it is removed through backward elimination.
  ➢ Hence, it is a combination of forward selection and backward elimination.

• Predictive power is usually measured on a validation set or by cross-validation.
• By using the learner as a black box, wrappers are universal and simple!
• Criticism: a large amount of computation is required.
• Still greedy. (A forward-selection sketch follows below.)
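As a hedged illustration of the wrapper idea (not the lecture's procedure), the following self-contained MATLAB sketch performs greedy forward selection. To keep it toolbox-free it scores each candidate subset by the validation accuracy of a simple nearest-centroid classifier rather than by p-values; the data, labels, and holdout split are toy assumptions.

% Sketch: greedy forward selection, treating the learner as a black box.
m = 300;  n = 6;
X = randn(m, n);                                   % toy data: m samples, n features
y = double(X(:,1) + X(:,3) > 0);                   % toy labels depend on features 1 and 3

tr = 1:2:m;  va = 2:2:m;                           % simple holdout split
Xtr = X(tr,:);  ytr = y(tr);
Xva = X(va,:);  yva = y(va);

selected = [];  remaining = 1:n;  best = 0;
while ~isempty(remaining)
    accs = zeros(size(remaining));
    for j = 1:numel(remaining)
        S  = [selected remaining(j)];              % candidate subset
        c0 = mean(Xtr(ytr == 0, S), 1);            % class centroids on training part
        c1 = mean(Xtr(ytr == 1, S), 1);
        d0 = sum((Xva(:,S) - repmat(c0, numel(va), 1)).^2, 2);
        d1 = sum((Xva(:,S) - repmat(c1, numel(va), 1)).^2, 2);
        accs(j) = mean(double(d1 < d0) == yva);    % validation accuracy
    end
    [bestNew, j] = max(accs);
    if bestNew <= best, break; end                 % stop when no feature helps
    best = bestNew;
    selected = [selected remaining(j)];            % add the best new feature
    remaining(j) = [];
end
disp(selected)                                     % indices of the chosen features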
