This document provides an overview of linear discriminant analysis (LDA) for dimensionality reduction while preserving class discriminatory information. It discusses LDA for two classes, defining measures of separation between class projections to find the optimal projection vector that maximizes discrimination. The within-class and between-class scatter matrices are introduced to express the Fisher criterion, which is maximized to find the projection direction that separates classes while minimizing within-class variance. Variants and limitations of LDA are also mentioned.

LECTURE 10: Linear Discriminant Analysis

- Linear Discriminant Analysis, two classes
- Linear Discriminant Analysis, C classes
- LDA vs. PCA example
- Limitations of LDA
- Variants of LDA
- Other dimensionality reduction methods

Introduction to Pattern Analysis
Ricardo Gutierrez-Osuna, Texas A&M University
Linear Discriminant Analysis, two-classes (1)
- The objective of LDA is to perform dimensionality reduction while preserving as much of the class discriminatory information as possible
  - Assume we have a set of D-dimensional samples {x(1), x(2), ..., x(N)}, N1 of which belong to class ω1 and N2 to class ω2. We seek to obtain a scalar y by projecting the samples x onto a line
    $$y = w^T x$$
  - Of all the possible lines, we would like to select the one that maximizes the separability of the projected scalars
  - This is illustrated for the two-dimensional case in the figure below (a minimal code sketch of the projection step follows the figure)
    [Figure: the same two-class data in (x1, x2) projected onto two different lines, one giving poor and one giving good class separation]
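As a quick illustration of this projection step, a minimal NumPy sketch (the data and the direction w below are made up for illustration) collapses each D-dimensional sample to a single scalar:

```python
import numpy as np

# Hypothetical 2-D samples, one row per sample (made-up data for illustration)
X = np.array([[4.0, 1.0], [2.0, 4.0], [9.0, 10.0], [6.0, 8.0]])

# A candidate projection direction w (any non-zero vector, here normalized)
w = np.array([1.0, 0.5])
w = w / np.linalg.norm(w)

# y = w^T x for every sample: each 2-D point becomes one scalar on the line
y = X @ w
print(y)
```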
Linear Discriminant Analysis, two-classes (2)
- In order to find a good projection vector, we need to define a measure of separation between the projections
- The mean vector of each class in the original space x and in the projected space y is
  $$\mu_i = \frac{1}{N_i}\sum_{x\in\omega_i} x \qquad\text{and}\qquad \tilde{\mu}_i = \frac{1}{N_i}\sum_{y\in\omega_i} y = \frac{1}{N_i}\sum_{x\in\omega_i} w^T x = w^T\mu_i$$

- We could then choose the distance between the projected means as our objective function
  $$J(w) = \left|\tilde{\mu}_1 - \tilde{\mu}_2\right| = \left|w^T(\mu_1 - \mu_2)\right|$$

- However, the distance between the projected means is not a very good measure since it does not take into account the standard deviation within the classes
  [Figure: two-class data in (x1, x2); one axis has a larger distance between the projected means, while the other axis yields better class separability]
Linear Discriminant Analysis, two-classes (3)
- The solution proposed by Fisher is to maximize a function that represents the difference between the means, normalized by a measure of the within-class scatter
  - For each class we define the scatter, an equivalent of the variance, as
    $$\tilde{s}_i^2 = \sum_{y\in\omega_i}(y - \tilde{\mu}_i)^2$$
  - The quantity $(\tilde{s}_1^2 + \tilde{s}_2^2)$ is called the within-class scatter of the projected examples
  - The Fisher linear discriminant is then defined as the linear function $w^T x$ that maximizes the criterion function
    $$J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$

  - Therefore, we will be looking for a projection where examples from the same class are projected very close to each other and, at the same time, the projected means are as far apart as possible (a code sketch that evaluates this criterion follows the figure)
  [Figure: two-class data in (x1, x2) with class means μ1 and μ2 and the Fisher projection direction]
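To make the criterion concrete, here is a minimal NumPy sketch (the function name and layout are my own, not from the lecture) that evaluates J(w) for a candidate direction; the data are those of the worked example later in the lecture:

```python
import numpy as np

def fisher_criterion(w, X1, X2):
    """Evaluate J(w) = (mu1~ - mu2~)^2 / (s1~^2 + s2~^2) for a candidate direction w.

    X1, X2 are (N_i x D) arrays holding the samples of each class, one row per sample.
    """
    w = np.asarray(w, dtype=float)
    y1, y2 = X1 @ w, X2 @ w                                   # project each class onto the line
    m1, m2 = y1.mean(), y2.mean()                             # projected class means
    s1, s2 = ((y1 - m1) ** 2).sum(), ((y2 - m2) ** 2).sum()   # projected scatters
    return (m1 - m2) ** 2 / (s1 + s2)

# Compare two candidate directions on the dataset of the worked example below
X1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)
print(fisher_criterion([1.0, 0.0], X1, X2))    # project onto x1 only
print(fisher_criterion([0.91, 0.39], X1, X2))  # direction close to the Fisher optimum
```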

Linear Discriminant Analysis, two-classes (4)
- In order to find the optimum projection w*, we need to express J(w) as an explicit function of w
- We first define scatter matrices that measure the scatter in the multivariate feature space x:
  $$S_i = \sum_{x\in\omega_i}(x - \mu_i)(x - \mu_i)^T \qquad\text{and}\qquad S_W = S_1 + S_2$$
  - where S_W is called the within-class scatter matrix
- The scatter of the projection y can then be expressed as a function of the scatter matrices in the feature space x:
  $$\tilde{s}_i^2 = \sum_{y\in\omega_i}(y - \tilde{\mu}_i)^2 = \sum_{x\in\omega_i}\left(w^T x - w^T\mu_i\right)^2 = \sum_{x\in\omega_i} w^T(x - \mu_i)(x - \mu_i)^T w = w^T S_i w$$
  $$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w$$
- Similarly, the difference between the projected means can be expressed in terms of the means in the original feature space:
  $$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = \left(w^T\mu_1 - w^T\mu_2\right)^2 = w^T\underbrace{(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T}_{S_B}\, w = w^T S_B w$$
  - The matrix S_B is called the between-class scatter. Note that, since S_B is the outer product of two vectors, its rank is at most one
- We can finally express the Fisher criterion in terms of S_W and S_B as
  $$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
Linear Discriminant Analysis, two-classes (5)
- To find the maximum of J(w) we differentiate with respect to w and equate to zero:
  $$\frac{d}{dw}\big[J(w)\big] = \frac{d}{dw}\left[\frac{w^T S_B w}{w^T S_W w}\right] = 0$$
  $$\Rightarrow\; \big[w^T S_W w\big]\frac{d\big[w^T S_B w\big]}{dw} - \big[w^T S_B w\big]\frac{d\big[w^T S_W w\big]}{dw} = 0$$
  $$\Rightarrow\; \big[w^T S_W w\big]\,2 S_B w - \big[w^T S_B w\big]\,2 S_W w = 0$$
- Dividing by $w^T S_W w$:
  $$\left[\frac{w^T S_W w}{w^T S_W w}\right] S_B w - \left[\frac{w^T S_B w}{w^T S_W w}\right] S_W w = 0 \;\Rightarrow\; S_B w - J\, S_W w = 0 \;\Rightarrow\; S_W^{-1} S_B w - J\, w = 0$$

- Solving the generalized eigenvalue problem $S_W^{-1} S_B\, w = J\, w$ yields
  $$w^* = \arg\max_w \left[\frac{w^T S_B w}{w^T S_W w}\right] = S_W^{-1}(\mu_1 - \mu_2)$$
  - This is known as Fisher's Linear Discriminant (1936), although it is not a discriminant but rather a specific choice of direction for the projection of the data down to one dimension

LDA example
- Compute the linear discriminant projection for the following two-dimensional dataset:
  - X1 = {(4,1), (2,4), (2,3), (3,6), (4,4)}
  - X2 = {(9,10), (6,8), (9,5), (8,7), (10,8)}
- SOLUTION (by hand)
  - The class statistics are
    $$S_1 = \begin{bmatrix} 0.80 & -0.40 \\ -0.40 & 2.64 \end{bmatrix}, \qquad S_2 = \begin{bmatrix} 1.84 & -0.04 \\ -0.04 & 2.64 \end{bmatrix}$$
    $$\mu_1 = [3.00\;\; 3.60]^T, \qquad \mu_2 = [8.40\;\; 7.60]^T$$
  - The within-class and between-class scatter matrices are
    $$S_W = \begin{bmatrix} 2.64 & -0.44 \\ -0.44 & 5.28 \end{bmatrix}, \qquad S_B = \begin{bmatrix} 29.16 & 21.60 \\ 21.60 & 16.00 \end{bmatrix}$$
  - The LDA projection is then obtained as the solution of the generalized eigenvalue problem
    $$S_W^{-1} S_B v = \lambda v \;\Rightarrow\; \left|S_W^{-1} S_B - \lambda I\right| = \begin{vmatrix} 11.89-\lambda & 8.81 \\ 5.08 & 3.76-\lambda \end{vmatrix} = 0 \;\Rightarrow\; \lambda = 15.65$$
    $$\begin{bmatrix} 11.89 & 8.81 \\ 5.08 & 3.76 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = 15.65\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} \;\Rightarrow\; \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} 0.91 \\ 0.39 \end{bmatrix}$$
  - Or directly by
    $$w^* = S_W^{-1}(\mu_1 - \mu_2) = [-0.91\;\; -0.39]^T$$
  [Figure: scatter plot of the two classes in (x1, x2) with the LDA direction w_LDA overlaid]
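The hand calculation above can be checked with a short NumPy sketch (my own illustrative code, not part of the original lecture). To reproduce the numbers exactly, the per-class matrices are normalized by N_i via bias=True, which only rescales S_W and does not change the LDA direction:

```python
import numpy as np

X1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)        # [3.0, 3.6] and [8.4, 7.6]
S1 = np.cov(X1, rowvar=False, bias=True)           # per-class scatter divided by N_i
S2 = np.cov(X2, rowvar=False, bias=True)

SW = S1 + S2                                       # within-class scatter
SB = np.outer(mu1 - mu2, mu1 - mu2)                # between-class scatter (rank 1)

# Generalized eigenvalue problem S_W^{-1} S_B v = lambda v
evals, evecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
w_star = evecs[:, np.argmax(evals.real)].real      # dominant eigenvector (up to sign)

# Closed form: w* = S_W^{-1} (mu1 - mu2), same direction up to sign and scale
w_direct = np.linalg.inv(SW) @ (mu1 - mu2)
print(evals.real.max(), w_star, w_direct / np.linalg.norm(w_direct))
```

Running this reproduces λ ≈ 15.65 and a direction proportional to [0.91, 0.39], matching the hand calculation.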

Linear Discriminant Analysis, C-classes (1)
- Fisher's LDA generalizes very gracefully for C-class problems
  - Instead of one projection y, we will now seek (C-1) projections [y1, y2, ..., y(C-1)] by means of (C-1) projection vectors wi, which can be arranged by columns into a projection matrix W = [w1 | w2 | ... | w(C-1)]:
    $$y_i = w_i^T x \;\Rightarrow\; y = W^T x$$
- Derivation
  - The generalization of the within-class scatter is
    $$S_W = \sum_{i=1}^{C} S_i, \qquad\text{where}\quad S_i = \sum_{x\in\omega_i}(x - \mu_i)(x - \mu_i)^T \quad\text{and}\quad \mu_i = \frac{1}{N_i}\sum_{x\in\omega_i} x$$
  - The generalization for the between-class scatter is
    $$S_B = \sum_{i=1}^{C} N_i(\mu_i - \mu)(\mu_i - \mu)^T, \qquad\text{where}\quad \mu = \frac{1}{N}\sum_{\forall x} x = \frac{1}{N}\sum_{i=1}^{C} N_i\mu_i$$
  - where $S_T = S_B + S_W$ is called the total scatter matrix (a code sketch of these definitions follows the figure)
  [Figure: three-class example in (x1, x2) showing the per-class within-class scatters S_W1, S_W2, S_W3 and the between-class scatters S_B1, S_B2, S_B3 about the overall mean μ]
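For the C-class case, the following minimal NumPy sketch (my own code; the list-of-arrays data layout and function name are assumptions for illustration) computes S_W and S_B exactly as defined above:

```python
import numpy as np

def scatter_matrices(class_samples):
    """Compute the within-class (S_W) and between-class (S_B) scatter matrices.

    class_samples: list of (N_i x D) arrays, one per class, rows are samples.
    """
    all_x = np.vstack(class_samples)
    mu = all_x.mean(axis=0)                          # overall mean of all samples
    D = all_x.shape[1]
    SW = np.zeros((D, D))
    SB = np.zeros((D, D))
    for Xi in class_samples:
        Ni = Xi.shape[0]
        mui = Xi.mean(axis=0)
        diff = Xi - mui
        SW += diff.T @ diff                          # S_i = sum (x - mu_i)(x - mu_i)^T
        SB += Ni * np.outer(mui - mu, mui - mu)      # N_i (mu_i - mu)(mu_i - mu)^T
    return SW, SB
```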

Linear Discriminant Analysis, C-classes (2)
- Similarly, we define the mean vectors and scatter matrices for the projected samples as
  $$\tilde{\mu}_i = \frac{1}{N_i}\sum_{y\in\omega_i} y, \qquad \tilde{\mu} = \frac{1}{N}\sum_{\forall y} y$$
  $$\tilde{S}_W = \sum_{i=1}^{C}\sum_{y\in\omega_i}(y - \tilde{\mu}_i)(y - \tilde{\mu}_i)^T, \qquad \tilde{S}_B = \sum_{i=1}^{C} N_i(\tilde{\mu}_i - \tilde{\mu})(\tilde{\mu}_i - \tilde{\mu})^T$$
- From our derivation for the two-class problem, we can write
  $$\tilde{S}_W = W^T S_W W, \qquad \tilde{S}_B = W^T S_B W$$
- Recall that we are looking for a projection that maximizes the ratio of between-class to within-class scatter. Since the projection is no longer a scalar (it has C-1 dimensions), we use the determinants of the scatter matrices to obtain a scalar objective function:
  $$J(W) = \frac{\left|\tilde{S}_B\right|}{\left|\tilde{S}_W\right|} = \frac{\left|W^T S_B W\right|}{\left|W^T S_W W\right|}$$
- And we will seek the projection matrix W* that maximizes this ratio

Linear Discriminant Analysis, C-classes (3)
- It can be shown that the optimal projection matrix W* is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigenvalue problem:
  $$W^* = \left[w_1^* \,|\, w_2^* \,|\, \cdots \,|\, w_{C-1}^*\right] = \arg\max_W \left[\frac{\left|W^T S_B W\right|}{\left|W^T S_W W\right|}\right] \;\Rightarrow\; (S_B - \lambda_i S_W)\,w_i^* = 0$$

- NOTES
  - S_B is the sum of C matrices of rank one or less, and the mean vectors are constrained by $\mu = \frac{1}{N}\sum_{i=1}^{C} N_i\,\mu_i$
    - Therefore, S_B will be of rank (C-1) or less
    - This means that only (C-1) of the eigenvalues λi will be non-zero
  - The projections with maximum class separability information are the eigenvectors corresponding to the largest eigenvalues of $S_W^{-1} S_B$ (a code sketch follows below)
  - LDA can be derived as the Maximum Likelihood method for the case of normal class-conditional densities with equal covariance matrices
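As a minimal sketch of this step (my own illustrative code, not from the lecture; a numerically more careful implementation would use a generalized symmetric eigensolver), the projection matrix can be taken from the leading eigenvectors of S_W^{-1} S_B, reusing the hypothetical scatter_matrices() helper sketched earlier:

```python
import numpy as np

def lda_projection(SW, SB, n_components):
    """Columns of W are the leading eigenvectors of S_W^{-1} S_B;
    only the first C-1 of them correspond to non-zero eigenvalues."""
    evals, evecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
    order = np.argsort(evals.real)[::-1]          # decreasing eigenvalue order
    return evecs[:, order[:n_components]].real

# Usage with the scatter_matrices() sketch from the earlier slide (hypothetical data):
# SW, SB = scatter_matrices([X1, X2, X3])
# W = lda_projection(SW, SB, n_components=2)      # C = 3 classes -> at most 2 projections
# Y = X @ W                                       # project samples to the LDA subspace
```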

LDA vs. PCA: Coffee discrimination with a gas sensor array
- These figures show the performance of PCA and LDA on an odor recognition problem
  - Five types of coffee beans were presented to an array of chemical gas sensors
  - For each coffee type, 45 "sniffs" were performed and the response of the gas sensor array was processed in order to obtain a 60-dimensional feature vector
- Results
  - From the 3D scatter plots it is clear that LDA outperforms PCA in terms of class discrimination
  - This is one example where the discriminatory information is not aligned with the direction of maximum variance
[Figure: raw sensor responses for the five coffee types (Sulawesi, Kenya, Arabian, Sumatra, Colombia)]
[Figure: 3D scatter plots of the coffee data projected onto the first three PCA axes (left) and the first three LDA axes (right); the classes are clearly separated only in the LDA projection]

Limitations of LDA
- LDA produces at most C-1 feature projections
  - If the classification error estimates establish that more features are needed, some other method must be employed to provide those additional features
- LDA is a parametric method since it assumes unimodal Gaussian likelihoods
  - If the distributions are significantly non-Gaussian, the LDA projections will not be able to preserve any complex structure of the data, which may be needed for classification
  [Figure: examples with μ1 = μ2 = μ, where the class densities overlap or are multimodal, so the class means carry no discriminatory information]
- LDA will fail when the discriminatory information is not in the mean but rather in the variance of the data
  [Figure: two-class data in (x1, x2) where the discriminatory information lies in the variance rather than the mean, with the PCA and LDA directions overlaid]

Variants of LDA
- Non-parametric LDA (Fukunaga)
  - NPLDA removes the unimodal Gaussian assumption by computing the between-class scatter matrix S_B using local information and the K Nearest Neighbors rule. As a result:
    - The matrix S_B is full-rank, allowing us to extract more than (C-1) features
    - The projections are able to preserve the structure of the data more closely
- Orthonormal LDA (Okada and Tomita)
  - OLDA computes projections that maximize the Fisher criterion and, at the same time, are pair-wise orthonormal
    - The method used in OLDA combines the eigenvalue solution of $S_W^{-1} S_B$ and the Gram-Schmidt orthonormalization procedure
    - OLDA sequentially finds axes that maximize the Fisher criterion in the subspace orthogonal to all features already extracted
    - OLDA is also capable of finding more than (C-1) features
- Generalized LDA (Lowe)
  - GLDA generalizes the Fisher criterion by incorporating a cost function similar to the one we used to compute the Bayes Risk
    - The effect of this generalized criterion is an LDA projection with a structure that is biased by the cost function
    - Classes with a higher cost Cij will be placed further apart in the low-dimensional projection
- Multilayer Perceptrons (Webb and Lowe)
  - It has been shown that the hidden layers of multi-layer perceptrons (MLP) perform non-linear discriminant analysis by maximizing $\mathrm{Tr}\!\left[S_B S_T^{\dagger}\right]$, where the scatter matrices are measured at the output of the last hidden layer

Other dimensionality reduction methods (1)
- Exploratory Projection Pursuit (Friedman and Tukey)
  - EPP seeks an M-dimensional (typically M = 2 or 3) linear projection of the data that maximizes a measure of "interestingness"
  - Interestingness is measured as departure from multivariate normality
    - This measure is not the variance and is commonly scale-free. In most proposals it is also affine invariant, so it does not depend on correlations between features [Ripley, 1996]
  - In other words, EPP seeks projections that separate clusters as much as possible and keep these clusters compact, a criterion similar to Fisher's, but EPP does NOT use class labels
  - Once an interesting projection is found, it is important to remove the structure it reveals so that other interesting views can be found more easily (a toy code sketch follows the figure below)

[Figure: an "uninteresting" (roughly Gaussian) projection vs. an "interesting" (clustered) projection of the same data, both plotted in (x1, x2)]
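As a toy illustration only (this is not the actual Friedman-Tukey index), one way to realize the idea is to score candidate 1-D projections by their departure from normality, here measured by excess kurtosis, and keep the best of many random directions; every name and numeric choice below is an assumption for illustration:

```python
import numpy as np

def interestingness(y):
    """Toy projection index: departure from normality measured by |excess kurtosis|.
    (The real Friedman-Tukey index is different; this is only for illustration.)"""
    z = (y - y.mean()) / y.std()
    return abs((z ** 4).mean() - 3.0)

def random_pursuit(X, n_trials=2000, seed=0):
    """Crude projection pursuit: try random unit directions, keep the most interesting one."""
    rng = np.random.default_rng(seed)
    best_w, best_score = None, -np.inf
    for _ in range(n_trials):
        w = rng.normal(size=X.shape[1])
        w /= np.linalg.norm(w)                 # random unit direction
        score = interestingness(X @ w)         # score the 1-D projection
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```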

Other dimensionality reduction methods (2)
- Sammon's Non-linear Mapping (Sammon)
  - This method seeks a mapping onto an M-dimensional space that preserves the inter-point distances of the original N-dimensional space
    - This is accomplished by minimizing the following objective function (a code sketch of this stress appears after the figure below):
      $$E(d,d') = \sum_{i\neq j}\frac{\big[d(P_i,P_j) - d(P_i',P_j')\big]^2}{d(P_i,P_j)}$$

  - The original method did not obtain an explicit mapping but only a lookup table for the elements in the training set
  - Recent implementations using artificial neural networks (MLPs and RBFs) do provide an explicit mapping for test data and also consider cost functions (Neuroscale)
  - Sammon's mapping is closely related to Multi-Dimensional Scaling (MDS), a family of multivariate statistical methods commonly used in the social sciences

[Figure: points Pi, Pj in the original 3-D space (x1, x2, x3) and their images P'i, P'j in the 2-D mapping, with d(Pi, Pj) = d(P'i, P'j) ∀ i, j in the ideal case]
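A minimal sketch of the stress being minimized (my own code, not the original algorithm; real implementations additionally optimize the low-dimensional images Y, e.g. by gradient descent, which is omitted here):

```python
import numpy as np

def pairwise_dists(Z):
    """Euclidean distances for all pairs i < j, returned as a flat array."""
    diff = Z[:, None, :] - Z[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))
    iu = np.triu_indices(len(Z), k=1)
    return D[iu]

def sammon_stress(X, Y):
    """Stress between the original points X (N x D) and their images Y (N x M).
    Each pair is counted once (i < j); the i != j sum in the objective above merely
    doubles this, which does not change the minimizer."""
    d, d_prime = pairwise_dists(X), pairwise_dists(Y)
    return np.sum((d - d_prime) ** 2 / d)
```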

