Lecture 10 discusses dimensionality reduction techniques, highlighting the curse of dimensionality and its implications for multivariate data analysis. It covers feature selection and extraction methods, including Principal Components Analysis (PCA) and Linear Discriminant Analysis (LDA), emphasizing their roles in preserving information and enhancing class discrimination. The lecture concludes with a comparison of PCA and LDA, illustrating their different objectives in dimensionality reduction.


Lecture 10: Dimensionality reduction

- The curse of dimensionality
- Feature extraction vs. feature selection
- Principal Components Analysis
- Linear Discriminant Analysis

Ricardo Gutierrez-Osuna, Wright State University (Intelligent Sensor Systems)
Dimensionality reduction
- The “curse of dimensionality”
  - Refers to the problems associated with multivariate data analysis as the dimensionality increases
- Consider a 3-class pattern recognition problem
  - Three types of objects have to be classified based on the value of a single feature x1
  - A simple procedure (sketched in code below) would be to
    - Divide the feature space into uniform bins
    - Compute the ratio of examples for each class at each bin, and
    - For a new example, find its bin and choose the predominant class in that bin
  - We decide to start with one feature and divide the real line into 3 bins
    - Notice that there is a lot of overlap between the classes ⇒ to improve discrimination, we decide to incorporate a second feature
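
As a toy illustration of this bin-based procedure (my own sketch, not the lecture's code), the snippet below fits a uniform-bin majority-vote classifier on a single feature; the data, the three bins, and the class labels are arbitrary choices for the example.

```python
import numpy as np

def fit_bin_classifier(x, y, n_bins=3):
    """Divide the feature range into uniform bins and record the
    predominant class in each bin (toy sketch, not the lecture's code)."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    bins = np.digitize(x, edges[1:-1])            # bin index (0..n_bins-1) of each example
    classes = np.unique(y)
    counts = np.zeros((n_bins, classes.size))     # per-bin class counts (i.e., class ratios)
    for b, c in zip(bins, y):
        counts[b, np.searchsorted(classes, c)] += 1
    majority = classes[counts.argmax(axis=1)]     # predominant class in each bin
    return edges, majority

def predict(x_new, edges, majority):
    """Assign each new example the predominant class of its bin."""
    b = np.digitize(x_new, edges[1:-1])
    return majority[b]

# hypothetical 1-D data: three overlapping classes
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(m, 1.0, 30) for m in (0.0, 1.5, 3.0)])
y = np.repeat([1, 2, 3], 30)
edges, majority = fit_bin_classifier(x, y, n_bins=3)
print(predict(np.array([0.2, 2.9]), edges, majority))
```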

- Moving to two dimensions increases the number of bins from 3 to 3² = 9
  - QUESTION: Which should we maintain constant?
    - The density of examples per bin? This increases the number of examples from 9 to 27
    - The total number of examples? This results in a 2D scatter plot that is very sparse

[Figure: 2D scatter plots over (x1, x2) at constant density and at a constant number of examples]

- Moving to three features …
  - The number of bins grows to 3³ = 27
  - To maintain the initial density of examples, the number of required examples grows to 81
  - For the same number of examples, the 3D scatter plot over (x1, x2, x3) is almost empty
  - In general, the number of bins, and hence the number of examples needed to keep the density constant, grows exponentially with the number of features

- Implications of the curse of dimensionality
  - Exponential growth with dimensionality in the number of examples required to accurately estimate a function
- In practice, the curse of dimensionality means that
  - For a given sample size, there is a maximum number of features above which the performance of our classifier will degrade rather than improve
- In most cases, the information that was lost by discarding some features is compensated by a more accurate mapping in lower-dimensional space

[Figure: classifier performance vs. dimensionality, rising to a peak and then degrading as more features are added]

- How do we beat the curse of dimensionality?
  - By incorporating prior knowledge
  - By providing increasing smoothness of the target function
  - By reducing the dimensionality

- Two approaches to perform dim. reduction ℜ^N → ℜ^M (M < N)
  - Feature selection: choosing a subset of all the features

    $$[x_1\ x_2\ \ldots\ x_N] \xrightarrow{\text{feature selection}} [x_{i_1}\ x_{i_2}\ \ldots\ x_{i_M}]$$

  - Feature extraction: creating new features by combining existing ones

    $$[x_1\ x_2\ \ldots\ x_N] \xrightarrow{\text{feature extraction}} [y_1\ y_2\ \ldots\ y_M] = f([x_1\ x_2\ \ldots\ x_N])$$

- In either case, the goal is to find a low-dimensional representation of the data that preserves (most of) the information or structure in the data
- Feature extraction is covered in more detail in CS790
- Linear feature extraction
  - The “optimal” mapping y = f(x) is, in general, a non-linear function whose form is problem-dependent
    - Hence, feature extraction is commonly limited to linear projections y = Wx (a short numerical sketch follows)

    $$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1N} \\ w_{21} & w_{22} & \cdots & w_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ w_{M1} & w_{M2} & \cdots & w_{MN} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}$$
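
As a minimal numerical illustration of a linear projection y = Wx (my own sketch, not from the slides): the dimensions N = 4 and M = 2 and the random W below are arbitrary; in practice W would come from PCA or LDA as described in the following slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 2                        # original and reduced dimensionality (arbitrary)
W = rng.standard_normal((M, N))    # M x N projection matrix (in practice from PCA, LDA, ...)

x = rng.standard_normal(N)         # a single N-dimensional sample
y = W @ x                          # its M-dimensional feature vector y = Wx
print(y.shape)                     # (2,)

X = rng.standard_normal((100, N))  # 100 samples stored as rows
Y = X @ W.T                        # project all samples at once
print(Y.shape)                     # (100, 2)
```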

Signal representation versus classification
- Two criteria can be used to find the “optimal” feature extraction mapping y = f(x)
  - Signal representation: The goal of feature extraction is to represent the samples accurately in a lower-dimensional space
  - Classification: The goal of feature extraction is to enhance the class-discriminatory information in the lower-dimensional space
- Within the realm of linear feature extraction, two techniques are commonly used
  - Principal Components (PCA)
    - Based on signal representation
  - Fisher’s Linear Discriminant (LDA)
    - Based on classification

[Figure: two-class scatter in the (Feature 1, Feature 2) plane with two projection axes, one labeled “Signal representation (PCA)” and one labeled “Classification (LDA)”]

Principal Components Analysis
- Let us illustrate PCA with a two-dimensional problem
  - Data x follows a Gaussian density as depicted in the figure
  - Vectors can be represented by their 2D coordinates:

    $$x = x_1 u_1 + x_2 u_2 = (x_1, x_2)_{u_1, u_2}$$

  - We seek to find a 1D representation x’ “close” to x

    $$x' = y\,v = (y)_v$$

  - where “closeness” is measured by the mean squared error over all points in the distribution

    $$(y)_v = \arg\min E\left[\,\|x' - x\|^2\,\right]$$

[Figure: a 2D Gaussian cloud with basis vectors u1, u2, a sample x, and its projection x’ = yv onto the direction v]
- RESULT (for proof check CS790 notes)
  - It can be shown that the “optimal” 1D representation consists of projecting the vector x onto the direction of maximum variance in the data (e.g., the longest axis of the ellipse)
- This result can be generalized to more than two dimensions:

  The optimal* approximation of a random vector x ∈ ℜ^N by a linear combination of M (M < N) independent vectors is obtained by projecting the random vector x onto the eigenvectors v_i corresponding to the largest eigenvalues λ_i of the covariance matrix of x (Σ_x).

- Summary

  $$x' = y_1 v_1 + y_2 v_2 + \cdots + y_M v_M$$

  $$x' = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_M^T \end{bmatrix} x = \begin{bmatrix} v_{11} & v_{12} & \cdots & v_{1N} \\ v_{21} & v_{22} & \cdots & v_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ v_{M1} & v_{M2} & \cdots & v_{MN} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}$$

  - where v_k is the eigenvector corresponding to the k-th largest eigenvalue of the covariance matrix (a numerical sketch follows)
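
A minimal numerical sketch of this recipe (my own code, not the lecture's): estimate the covariance matrix, keep the eigenvectors with the largest eigenvalues, and project. The data and the choice M = 2 are arbitrary.

```python
import numpy as np

def pca_project(X, M):
    """Project rows of X onto the M eigenvectors of the sample covariance
    matrix with the largest eigenvalues (sketch of the summary above)."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues, columns = eigenvectors
    order = np.argsort(eigvals)[::-1][:M]   # indices of the M largest eigenvalues
    V = eigvecs[:, order]                   # N x M matrix [v1 | v2 | ... | vM]
    return Xc @ V                           # y = V^T x for every sample (row-wise)

# hypothetical data: 200 samples in 5 dimensions, reduced to M = 2
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))
Y = pca_project(X, M=2)
print(Y.shape)   # (200, 2)
```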

Linear Discriminant Analysis, two-classes
- The objective of LDA is to perform dimensionality reduction while preserving as much of the class-discriminatory information as possible
  - Assume we have a set of N-dimensional samples (x1, x2, …, xN), P1 of which belong to class ω1, and P2 to class ω2. We seek to obtain a scalar y by projecting the samples x onto a line
  - Of all the possible lines we would like to select the one that maximizes the separability of the classes

[Figure: two-class scatter in the (x1, x2) plane with a candidate projection line]
Linear Discriminant Analysis
- In a nutshell, we want
  - Maximum separation between the means of the projection
  - Minimum variance within each projected class
  - That is, we seek the projection that maximizes the criterion below (a two-class code sketch follows the figure)

    $$J = \frac{(\mu_1' - \mu_2')^2}{\sigma_1'^2 + \sigma_2'^2}$$

[Figure: two-class scatter in the (x1, x2) plane; the projection maps the class means µ1, µ2 to µ1’, µ2’ with projected spreads σ1’, σ2’]
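
For illustration (not shown on these slides), the well-known closed-form solution that maximizes this criterion in the two-class case is w ∝ S_W⁻¹(µ1 − µ2); the sketch below computes that direction on made-up Gaussian data.

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """Two-class Fisher LDA direction w ~ Sw^{-1}(mu1 - mu2).
    This closed form is standard but not derived on these slides."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter: sum of the two class scatter matrices
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sw = S1 + S2
    w = np.linalg.solve(Sw, mu1 - mu2)   # direction maximizing J
    return w / np.linalg.norm(w)

# made-up 2D Gaussian classes
rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([0, 0], [[3, 1], [1, 1]], size=100)
X2 = rng.multivariate_normal([3, 2], [[3, 1], [1, 1]], size=100)
w = fisher_lda_direction(X1, X2)
y1, y2 = X1 @ w, X2 @ w          # scalar projections y = w^T x
print(y1.mean(), y2.mean())      # the projected class means are well separated
```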

Linear Discriminant Analysis
- RESULT (for proof check CS790 notes)
  - It can be shown that the optimal projection matrix W* is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigenvalue problem

    $$W^* = [v_1 \mid v_2 \mid \cdots \mid v_{C-1}] = \arg\max_{V} \frac{\left|V^T S_B V\right|}{\left|V^T S_W V\right|} \;\Rightarrow\; (S_B - \lambda_i S_W)\, v_i = 0$$

  - where S_B and S_W are the BETWEEN-CLASS and WITHIN-CLASS covariance matrices (a numerical sketch follows)

    $$S_W = \sum_{i=1}^{C} S_i = \sum_{i=1}^{C} \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T$$

    $$S_B = \sum_{i=1}^{C} P_i\, (\mu_i - \mu)(\mu_i - \mu)^T$$

    $$\text{where } \mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x \quad \text{and} \quad \mu = \frac{1}{N} \sum_{\forall x} x$$
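
A small multi-class sketch of these formulas (my own illustration, assuming S_W is non-singular): build S_W and S_B as above and solve the generalized eigenvalue problem with scipy.linalg.eigh. The data and labels are made up.

```python
import numpy as np
from scipy.linalg import eigh   # generalized symmetric eigensolver

def lda_projection(X, y, n_components=None):
    """Multi-class LDA: build Sw and Sb as on the slide and keep the
    eigenvectors of Sb v = lambda Sw v with the largest eigenvalues."""
    classes = np.unique(y)
    C = classes.size
    N, d = X.shape
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)          # within-class scatter
        Pc = Xc.shape[0] / N                       # class prior P_i
        Sb += Pc * np.outer(mu_c - mu, mu_c - mu)  # between-class scatter
    if n_components is None:
        n_components = C - 1                       # at most C-1 useful directions
    eigvals, eigvecs = eigh(Sb, Sw)                # generalized problem, ascending order
    V = eigvecs[:, ::-1][:, :n_components]         # columns = v1, v2, ... (largest first)
    return (X - mu) @ V                            # projected samples

# made-up data: 3 classes in 4 dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(50, 4)) for m in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 50)
Y = lda_projection(X, y)
print(Y.shape)   # (150, 2)
```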

PCA Versus LDA
[Figure: scatter plots of a 10-class dataset projected onto its first three PCA axes (top) and onto its first three LDA axes (bottom), with axes labeled axis 1, axis 2, axis 3, illustrating the two different projections]
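
One possible way to reproduce this kind of comparison (my own sketch using scikit-learn, which the lecture does not use): project the same labeled dataset with PCA and with LDA and compare the resulting scatter plots. The 10-class, 20-dimensional dataset below is made up.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# made-up 10-class dataset in 20 dimensions (stand-in for the lecture's data)
rng = np.random.default_rng(0)
means = rng.normal(0, 2, size=(10, 20))
X = np.vstack([rng.normal(m, 1.0, size=(30, 20)) for m in means])
y = np.repeat(np.arange(10), 30)

X_pca = PCA(n_components=3).fit_transform(X)                                # directions of maximum variance
X_lda = LinearDiscriminantAnalysis(n_components=3).fit(X, y).transform(X)   # class-discriminant axes

print(X_pca.shape, X_lda.shape)   # (300, 3) (300, 3)
```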

Limitations of LDA
- LDA assumes unimodal Gaussian likelihoods
  - If the densities are significantly non-Gaussian, LDA may not preserve any complex structure of the data needed for classification

[Figure: two-class examples with µ1 = µ2 = µ, where the class means coincide and LDA cannot separate ω1 from ω2]

- LDA will fail when the discriminatory information is not in the mean but rather in the variance of the data

[Figure: two classes in the (x1, x2) plane whose discriminatory information lies in the variance; the PCA and LDA projection directions are shown]
- LDA has a tendency to overfit the training data
  - To illustrate this problem, we generate an artificial dataset
    - Three classes, 50 examples per class, with the exact same likelihood: a multivariate Gaussian with zero mean and identity covariance
    - As we arbitrarily increase the number of dimensions, the classes appear to separate better, even though they come from the same distribution (a sketch reproducing this experiment follows the figure)
[Figure: LDA projections of the artificial dataset for 10, 50, 100 and 150 dimensions; the three classes appear increasingly well separated as the dimensionality grows, even though they are identically distributed]
