Bayes Classification for Fish Sorting

This document provides an overview of Bayes classification and how it can be used to classify fish into different species using optical sensing. It discusses key concepts like conditional probability, Bayes' theorem, class-conditional probabilities, posterior probabilities, and the Bayes decision rule. It then presents an example of using length as a feature to classify between salmon and sea bass using training samples to estimate class-conditional probability distributions and applying the Bayes decision rule to assign new samples to a class.


Bayes Classification

Richa Singh

Slides are prepared from several sources, including Duda, Hart, and Stork
Probability
§ Conditional probability of A given B:  P(A | B) = P(A, B) / P(B)

§ Deriving the chain rule from the above:  P(A, B) = P(A | B) P(B)

2
Bayes Theorem

P(A | B) = P(B | A) P(A) / P(B)

4–5
Bayes Classification
An Example

• “Sorting incoming fish on a conveyor according to species using optical sensing”

• Species: sea bass or salmon

• Let us build a machine learning system that distinguishes between sea bass and salmon

7
Fish Classification: Salmon vs. Sea Bass

• Set up a camera and take some sample images
• Preprocessing involves image enhancement and segmentation:
  • separate touching or occluding fish, and
  • extract the fish contour

8
State of Nature/Prior
• Prior probabilities reflect the domain expert’s knowledge of how likely each type of fish is to appear, before we actually see it.

– The state of nature is a random variable, with priors P(w1), P(w2)
– Uniform priors: the catch of salmon and sea bass is equiprobable, P(w1) = P(w2)
– P(w1) + P(w2) = 1 (exclusivity and exhaustivity)

9
Problem Analysis

• Extract features from the images:
  • Length
  • Lightness
  • Width
  • Number and shape of fins
  • Position of the mouth, etc.
• This is the set of all suggested features to explore for use in our classifier

10
Representation: Fish Length as Feature

Training Samples

11
Class-conditional Probabilities

• Use of the class-conditional information
• P(x | w1) and P(x | w2) describe how the feature (length or lightness) is distributed in the sea bass and salmon populations

12
Class-conditional PDF

13
Bayes’ Classification
• Posterior = likelihood × prior / evidence:  P(w_j | x) = p(x | w_j) P(w_j) / p(x)

– Evidence, in the case of two categories:  p(x) = p(x | w1) P(w1) + p(x | w2) P(w2)

16
Posterior Probabilities

17
Bayes’ Decision
• Decision given the posterior probabilities: decide w1 if P(w1 | x) > P(w2 | x), otherwise decide w2

• Therefore, whenever we observe a particular x, the probability of error is:
  P(error | x) = min[ P(w1 | x), P(w2 | x) ]

18
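To make the decision rule concrete, here is a minimal Python sketch of the fish example. The Gaussian class-conditional densities over length (means 30 and 40, standard deviations 4 and 5) are assumed placeholders, not values from the slides; only the structure (Bayes' theorem plus the argmax-posterior rule) follows the lecture.

```python
import numpy as np
from scipy.stats import norm

# Illustrative (assumed) class-conditional densities over fish length in cm.
prior = {"salmon": 0.5, "sea_bass": 0.5}        # uniform priors P(w1) = P(w2)
likelihood = {
    "salmon":   norm(loc=30, scale=4),          # p(x | salmon), assumed
    "sea_bass": norm(loc=40, scale=5),          # p(x | sea bass), assumed
}

def posteriors(x):
    """Return P(w_j | x) for both classes via Bayes' theorem."""
    joint = {c: likelihood[c].pdf(x) * prior[c] for c in prior}
    evidence = sum(joint.values())              # p(x) = sum_j p(x | w_j) P(w_j)
    return {c: joint[c] / evidence for c in joint}

def bayes_decide(x):
    """Bayes decision rule: pick the class with the larger posterior."""
    post = posteriors(x)
    decision = max(post, key=post.get)
    p_error = 1.0 - post[decision]              # P(error | x) = min_j P(w_j | x)
    return decision, p_error

print(bayes_decide(33.0))                       # e.g. ('salmon', ...) for a shortish fish
```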
Review
• Classification based on a single feature
• Two class classification
• Sample is assigned to one of the two classes
• The cost of making a false accept or a false reject is the same

20
Bayesian Decision Theory

• Generalization of the preceding ideas


– Use of more than one feature
– Use more than two states of nature
– Allowing actions other than deciding on the state of nature
• Allowing actions other than classification primarily allows the
possibility of rejection
• Refusing to make a decision in close or bad cases!
– Introduce a loss function which is more general than the
probability of error
• The loss function states how costly each action taken is
21
Multiple Features
• Digit Recognition

[Figure: an input digit image fed to a classifier, which outputs “5”]

• X1,…,Xn ∈ {0,1} (black vs. white pixels)
• Y ∈ {5,6} (predict whether a digit is a 5 or a 6)

22
The Bayes Classifier
• Let’s expand this for our digit recognition task:
  P(Y = 5 | X1,…,Xn) ∝ P(X1,…,Xn | Y = 5) P(Y = 5)
  P(Y = 6 | X1,…,Xn) ∝ P(X1,…,Xn | Y = 6) P(Y = 6)

• To classify, we will simply compute these two probabilities and predict based on which one is greater

23
Model Parameters
• How many parameters are required to specify the likelihood?
  – (Supposing that each image is 30 x 30 pixels, i.e. n = 900)

• # of parameters for modeling P(X1,…,Xn | Y):
  – 2(2^n − 1), i.e. 2(2^900 − 1)

• # of parameters for modeling P(X1 | Y),…,P(Xn | Y):
  – 2n, i.e. 1800

25
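A quick sanity check of these counts for a 30 x 30 binary image; a rough sketch where the point is the size of the gap, not the exact print format:

```python
# Parameter counting for a 30 x 30 binary image (n = 900 pixels, 2 classes).
n = 30 * 30

full_joint = 2 * (2 ** n - 1)   # modelling P(X1,...,Xn | Y) explicitly
naive_bayes = 2 * n             # modelling each P(Xi | Y) separately

print(f"full joint : 2*(2^{n} - 1), roughly 10^{len(str(full_joint)) - 1} parameters")
print(f"naive Bayes: {naive_bayes} parameters")
```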
Model Parameters
• The problem with explicitly modeling P(X1,…,Xn|Y) is that
there are usually way too many parameters:
– We’ll run out of space
– We’ll run out of time
– And we’ll need tons of training data (which is usually not
available)

26
The Naïve Bayes Model
• The Naïve Bayes Assumption: assume that all features are independent given the class label Y

• In equation form:
  P(X1,…,Xn | Y) = ∏_{i=1}^{n} P(Xi | Y)

27
The Bayes Classifier
• A good strategy is to predict the most probable class given the observed features:
  argmax_y P(Y = y | X1,…,Xn)

  – (for example: what is the probability that the image represents a 5, given its pixels?)

• So … how do we compute that?

The Bayes Classifier
• Let’s expand this for our digit recognition task:
  P(Y = y | X1,…,Xn) = P(X1,…,Xn | Y = y) P(Y = y) / P(X1,…,Xn),  for y ∈ {5, 6}

• To classify, we’ll simply compute these two probabilities and predict based on which one is greater
Naïve Bayes Training
• Now that we’ve decided to use a Naïve Bayes classifier, we need to train it with some data:

[Figure: samples from the MNIST training data]
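A minimal training sketch along these lines, assuming binarised MNIST-style images are already loaded as a 0/1 NumPy array X with labels y; the Laplace-smoothing constant alpha and the function names are illustrative, not from the slides.

```python
import numpy as np

def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate P(Y) and P(Xi = 1 | Y) from binarised images.

    X: (N, n) array of 0/1 pixels; y: (N,) array of labels (e.g. 5 or 6).
    alpha is a smoothing count (an assumption, not from the slides).
    """
    classes = np.unique(y)
    priors = {c: float(np.mean(y == c)) for c in classes}
    # theta[c][i] = P(Xi = 1 | Y = c), estimated by smoothed pixel frequencies
    theta = {c: (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
             for c in classes}
    return priors, theta

def predict(x, priors, theta):
    """Pick argmax_y  log P(Y=y) + sum_i log P(Xi = xi | Y=y)."""
    scores = {}
    for c in priors:
        log_lik = np.sum(x * np.log(theta[c]) + (1 - x) * np.log(1 - theta[c]))
        scores[c] = np.log(priors[c]) + log_lik
    return max(scores, key=scores.get)
```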


The Normal Density
• Univariate density
  – Continuous density
  – Many processes are asymptotically Gaussian
  – Handwritten characters and speech sounds can be viewed as ideal prototypes corrupted by random processes (central limit theorem)

N(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]

35
Univariate density

36
The Normal Density
• Multivariate density: the multivariate normal density in d dimensions is

N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left[-\frac{1}{2}(x-\mu)^t \Sigma^{-1} (x-\mu)\right]

x = (x1, x2, …, xd)^t feature vector
\mu = (\mu_1, \mu_2, …, \mu_d)^t mean vector
\Sigma = d × d covariance matrix
|\Sigma| and \Sigma^{-1} are its determinant and inverse, respectively
37
Multivariate density

38
Discriminant Functions for the Normal Density

• Minimum error-rate classification can be achieved by the discriminant functions

• g1(x) = ln p(x | w1) + ln P(w1) – ln p(x)
• g2(x) = ln p(x | w2) + ln P(w2) – ln p(x)

39
Discriminant Functions for the Normal Density

• Minimum error-rate classification can be achieved by the discriminant function
  g_i(x) = \ln p(x | \omega_i) + \ln P(\omega_i)

• Case of the multivariate normal (using the density from slide 37):

g_i(x) = -\frac{1}{2}(x-\mu_i)^t \Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)

40
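A small sketch of this discriminant, assuming NumPy and illustrative toy parameters (the means, covariance, and priors below are made up for the example, not taken from the slides):

```python
import numpy as np

def gaussian_discriminant(x, mu, sigma, prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma^-1 (x-mu) - d/2 ln(2 pi) - 1/2 ln|Sigma| + ln P(w_i)."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    d = x.size
    diff = x - mu
    inv = np.linalg.inv(sigma)
    return (-0.5 * diff @ inv @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

# Toy 2-D example: classify x by the larger g_i(x).
mu1, mu2 = [0, 0], [2, 2]
S = np.array([[1.0, 0.3], [0.3, 1.0]])
x = [1.2, 0.8]
print(gaussian_discriminant(x, mu1, S, 0.5) > gaussian_discriminant(x, mu2, S, 0.5))
```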

Analyzing Covariance Matrix

g_i(x) = -\frac{1}{2}(x-\mu_i)^t \Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)

• Two situations:
  – Features are independent
  – Features are dependent

• Three cases for \Sigma_i:
  – Case \Sigma_i = \sigma^2 I (I stands for the identity matrix)
  – Case \Sigma_i = \Sigma (the covariances of all classes are identical but arbitrary!)
  – Case \Sigma_i = arbitrary (each class has its own covariance)

41
Discriminant Functions for the Normal Density

• Case \Sigma_i = \sigma^2 I (I stands for the identity matrix)
  – \sigma_{ij} = 0: features are statistically independent
  – \sigma_{ii} = \sigma^2 is the same for all features

g_i(x) = -\frac{1}{2}(x-\mu_i)^t \Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)

Here \Sigma_i^{-1} = (1/\sigma^2) I, while the \ln 2\pi and \ln|\Sigma_i| terms are constant for all the classes and can be dropped.

42
Discriminant Functions for the Normal Density…

• Disregarding the class-independent quadratic term x^t x, we get a linear discriminant function

g_i(x) = w_i^t x + w_{i0}, where
w_i = \frac{\mu_i}{\sigma^2}, \qquad w_{i0} = -\frac{1}{2\sigma^2}\mu_i^t \mu_i + \ln P(\omega_i)

(w_{i0} is called the threshold, or bias, for the i-th category!)

44
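A sketch of the resulting linear machine for the Σ_i = σ²I case, with assumed toy means and priors (the numbers are only there to exercise the formulas):

```python
import numpy as np

def linear_machine(mus, priors, sigma2):
    """Weights of the linear discriminants for the case Sigma_i = sigma^2 I.

    w_i = mu_i / sigma^2,   w_i0 = -mu_i^T mu_i / (2 sigma^2) + ln P(w_i).
    """
    mus = np.asarray(mus, float)
    ws = [mu / sigma2 for mu in mus]
    w0s = [-np.dot(mu, mu) / (2 * sigma2) + np.log(p) for mu, p in zip(mus, priors)]
    return ws, w0s

def classify(x, ws, w0s):
    scores = [w @ np.asarray(x, float) + w0 for w, w0 in zip(ws, w0s)]
    return int(np.argmax(scores))            # index of the winning class

ws, w0s = linear_machine([[0, 0], [3, 3]], [0.5, 0.5], sigma2=1.0)
print(classify([1.0, 1.0], ws, w0s))          # 0: the point is closer to the first mean
```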
Discriminant Functions for the Normal Density…

• A classifier that uses linear discriminant functions is called a “linear machine”
• The decision surfaces for a linear machine are hyperplanes defined by g_i(x) = g_j(x)

45
Discriminant Functions for the Normal Density…
– The hyperplane separating R_i and R_j passes through the point

x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2}\,\ln\frac{P(\omega_i)}{P(\omega_j)}\,(\mu_i - \mu_j)

and is always orthogonal to the line linking the means!

If P(\omega_i) = P(\omega_j), then x_0 = \frac{1}{2}(\mu_i + \mu_j), and the classifier reduces to a minimum-distance rule with g_i(x) = -\|x - \mu_i\|^2.

46
Incorporating Risk Function

Bayesian Decision Theory – Risk Function
• Loss function (loss matrix): suppose sea bass sells for 600 per kg and salmon for 300 per kg

  L = [   0   −300 ]   decide sea bass
      [ +300     0 ]   decide salmon
        sea bass  salmon   (true class)

• Zero-one loss function:

  L = [ 0  1 ]
      [ 1  0 ]

55
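A sketch of the conditional-risk computation with this loss matrix, reading the rows as actions (decide sea bass, decide salmon) and the columns as the true class; the posterior values are assumed for illustration:

```python
import numpy as np

# Loss matrix as read from the slide: rows = actions, columns = true class.
L = np.array([[0.0, -300.0],
              [300.0,   0.0]])

def conditional_risk(posterior):
    """R(a_i | x) = sum_j lambda(a_i | w_j) P(w_j | x), for each action a_i."""
    return L @ np.asarray(posterior, float)

post = [0.4, 0.6]                     # assumed P(sea bass | x), P(salmon | x)
risks = conditional_risk(post)
print(risks, "-> take action", int(np.argmin(risks)))   # pick the minimum-risk action
```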
Bayesian Decision Theory – Risk Function

• Let {w1, w2,…, wc} be the set of c states of nature (or “categories”)
• Let {α1, α2,…, αa} be the set of a possible actions
• Let λ(αi | wj) be the loss incurred for taking action αi when the true state of nature is wj

56
Two-category Classification

• Actions:
  • α1: deciding w1
  • α2: deciding w2

• λij = λ(αi | wj): the loss incurred for deciding αi when the true state of nature is wj

57
Two-category Classification

• α1: deciding w1; α2: deciding w2
• λij = λ(αi | wj): loss incurred for deciding αi when the true state of nature is wj
• Conditional risk:
  R(α1 | x) = λ11 P(w1 | x) + λ12 P(w2 | x)
  R(α2 | x) = λ21 P(w1 | x) + λ22 P(w2 | x)

58
At the decision boundary the two conditional risks are equal, R(α1 | x) = R(α2 | x):

\lambda_{11}\,p(x|w_1)P(w_1)/p(x) + \lambda_{12}\,p(x|w_2)P(w_2)/p(x) = \lambda_{21}\,p(x|w_1)P(w_1)/p(x) + \lambda_{22}\,p(x|w_2)P(w_2)/p(x)

Cancelling p(x):

\lambda_{11}\,p(x|w_1)P(w_1) + \lambda_{12}\,p(x|w_2)P(w_2) = \lambda_{21}\,p(x|w_1)P(w_1) + \lambda_{22}\,p(x|w_2)P(w_2)

Collecting terms:

(\lambda_{11} - \lambda_{21})\,p(x|w_1)P(w_1) = (\lambda_{22} - \lambda_{12})\,p(x|w_2)P(w_2)

59–60
Two-category Classification

• Our rule is the following: if R(α1 | x) < R(α2 | x), take action α1 (“decide w1”)
• This results in the equivalent rule: decide w1 if

  (\lambda_{21} - \lambda_{11})\,p(x|w_1)P(w_1) > (\lambda_{12} - \lambda_{22})\,p(x|w_2)P(w_2)

  and decide w2 otherwise

61
Bayesian Decision Theory – Continuous Features…
• Overall risk: R = \int R(\alpha(x) \mid x)\,p(x)\,dx, the expected loss over all decisions
• Conditional risk R(αi | x): the expected loss of taking action αi for a particular x
• Minimizing R ⟺ minimizing R(αi | x) for every x, i = 1,…, a

R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid x), \quad i = 1,\ldots,a

62

Bayesian Decision Theory – Continuous Features…

• Select the action αi for which R(αi | x) is minimum

• The resulting R is minimal and is called the Bayes risk: the best performance that can be achieved!

63
Likelihood Ratio

• The preceding rule is equivalent to the following rule:

• If  \frac{p(x|\omega_1)}{p(x|\omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}

• then take action α1 (decide w1); otherwise take action α2 (decide w2)

64
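A sketch of the likelihood-ratio rule; the threshold expression follows the standard two-category result above, and the numeric inputs are assumed:

```python
def likelihood_ratio_decision(px_w1, px_w2, p_w1, p_w2, lam):
    """Decide w1 iff p(x|w1)/p(x|w2) exceeds a threshold independent of x.

    lam[i][j] = loss for taking action a_{i+1} when the true class is w_{j+1}.
    Threshold: (lambda_12 - lambda_22) P(w2) / ((lambda_21 - lambda_11) P(w1)).
    """
    threshold = ((lam[0][1] - lam[1][1]) * p_w2) / ((lam[1][0] - lam[0][0]) * p_w1)
    return "w1" if px_w1 / px_w2 > threshold else "w2"

# With zero-one loss the threshold reduces to P(w2)/P(w1), i.e. comparing posteriors.
print(likelihood_ratio_decision(0.3, 0.1, 0.5, 0.5, [[0, 1], [1, 0]]))   # -> 'w1'
```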
Likelihood Ratio…
• Optimal decision property:

“If the likelihood ratio exceeds a threshold value independent of the input pattern x, we can take optimal actions.”

65
Example: Checking on a course

• A student needs to decide which courses to take, based only on the impression from the first lecture
• From the student’s previous experience:

  Quality of the course:   good   fair   bad
  Probability (prior):      0.2    0.4   0.4

• These are the prior probabilities.

66
Example: Checking on a course
• The student also knows the class-conditionals:

  Pr(x | wj):             good   fair   bad
  Interesting lecture      0.8    0.5   0.1
  Boring lecture           0.2    0.5   0.9

• The loss function is given by the matrix:

  λ(ai | wj):              good course   fair course   bad course
  Taking the course             0             5            10
  Not taking the course        20             5             0

67
Example: Checking on a course

• We can make decisions according to two rules:

• Posterior probability
• Risk minimization

68
Example: Checking on a course
• Given that the first class was interesting, the
student wants to make an optimal decision.
• Pr(bad|interesting)
• Pr(fair|interesting)
• Pr(good|interesting)

• Pr(interesting|bad)
• Pr(interesting|fair)
• Pr(interesting|good)

• Pr(interesting)
69
Example: Checking on a course
• The probability of getting an “interesting lecture” (x = interesting):
• Pr(interesting)= Pr(interesting|good course)* Pr(good course)
+ Pr(interesting|fair course)* Pr(fair course)
+ Pr(interesting|bad course)* Pr(bad course)
=0.8*0.2+0.5*0.4+0.1*0.4=0.4

• Consequently, Pr(boring)=1-0.4=0.6
• Suppose the lecture was interesting. Then we want to compute the
posterior probabilities of each one of the 3 possible “states of nature”.

70
Example: Checking on a course

Pr(good | interesting) = Pr(interesting | good) Pr(good) / Pr(interesting) = (0.8 * 0.2) / 0.4 = 0.4
Pr(fair | interesting) = Pr(interesting | fair) Pr(fair) / Pr(interesting) = (0.5 * 0.4) / 0.4 = 0.5

• We can get Pr(bad | interesting) = 0.1 either by the same method, or by noting that the three posteriors must sum to 1.
• Now we have all we need to make an intelligent decision about the optimal action.

71
Example: Checking on a course

• The student needs to minimize the conditional risk

R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid x)

R(taking | interesting) =
  Pr(good | interesting) λ(taking | good course) +
  Pr(fair | interesting) λ(taking | fair course) +
  Pr(bad | interesting) λ(taking | bad course)
  = 0.4*0 + 0.5*5 + 0.1*10 = 3.5
72
Example: Checking on a course

R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid x)

R(not taking | interesting) =
  Pr(good | interesting) λ(not taking | good course) +
  Pr(fair | interesting) λ(not taking | fair course) +
  Pr(bad | interesting) λ(not taking | bad course)
  = 0.4*20 + 0.5*5 + 0.1*0 = 10.5
73
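A short script that reproduces the numbers in this example (posteriors 0.4 / 0.5 / 0.1 and conditional risks 3.5 vs. 10.5); the dictionary names are just illustrative:

```python
# Course example: verify the posteriors and conditional risks from the slides.
priors = {"good": 0.2, "fair": 0.4, "bad": 0.4}
p_interesting = {"good": 0.8, "fair": 0.5, "bad": 0.1}   # Pr(interesting | quality)
loss = {"take":     {"good": 0,  "fair": 5, "bad": 10},
        "not_take": {"good": 20, "fair": 5, "bad": 0}}

evidence = sum(p_interesting[q] * priors[q] for q in priors)            # 0.4
posterior = {q: p_interesting[q] * priors[q] / evidence for q in priors}
risk = {a: sum(loss[a][q] * posterior[q] for q in priors) for a in loss}

print(posterior)   # {'good': 0.4, 'fair': 0.5, 'bad': 0.1}
print(risk)        # take: 3.5, not_take: 10.5 -> taking the course is optimal
```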
Constructing an optimal decision function

• So, if the first lecture was interesting, the student will minimize the conditional risk by taking the course.
• In order to construct the full decision function, we also need to define the risk-minimizing action for the case of a boring lecture.

74
Exercise
• Select the optimal decision where Ω = {w1, w2}
• p(x | w1) ~ N(2, 0.5) (normal distribution)
• p(x | w2) ~ N(1.5, 0.2)
• P(w1) = 2/3, P(w2) = 1/3

  λ = [ 1  2 ]
      [ 3  4 ]

75
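A numerical sketch for this exercise, assuming the second parameter of N(·, ·) is the variance (if it is the standard deviation, drop the square roots); it computes the conditional risk of each action at a given x rather than deriving the decision regions analytically:

```python
import numpy as np
from scipy.stats import norm

p1, p2 = 2/3, 1/3
f1 = norm(loc=2.0, scale=np.sqrt(0.5))     # p(x | w1), variance assumed
f2 = norm(loc=1.5, scale=np.sqrt(0.2))     # p(x | w2), variance assumed
lam = np.array([[1.0, 2.0],                # lam[i][j] = loss(a_{i+1} | w_{j+1})
                [3.0, 4.0]])

def optimal_action(x):
    post = np.array([f1.pdf(x) * p1, f2.pdf(x) * p2])
    post /= post.sum()                      # posteriors P(w_j | x)
    risks = lam @ post                      # R(a_i | x) = sum_j lam_ij P(w_j | x)
    return int(np.argmin(risks)) + 1, risks

print(optimal_action(1.8))                  # chosen action and its conditional risks
```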
Questions?

76
Discriminant Functions for the Normal Density…
• Case \Sigma_i = \Sigma (the covariances of all classes are identical but arbitrary!)

g_i(x) = -\frac{1}{2}(x-\mu_i)^t \Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)

– Expanding the quadratic and discarding the terms that do not depend on the class gives a linear discriminant:

g_i(x) = w_i^t x + w_{i0}, where
w_i = \Sigma^{-1}\mu_i, \qquad w_{i0} = -\frac{1}{2}\mu_i^t \Sigma^{-1} \mu_i + \ln P(\omega_i)

77
Discriminant Functions for the Normal Density…

x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\ln\left[P(\omega_i)/P(\omega_j)\right]}{(\mu_i - \mu_j)^t \Sigma^{-1} (\mu_i - \mu_j)}\,(\mu_i - \mu_j)

• Comments about this hyperplane:
  – It passes through x_0
  – It is NOT orthogonal to the line linking the means
  – What happens when P(ωi) = P(ωj)?
  – If P(ωi) ≠ P(ωj), then x_0 shifts away from the more likely mean
78
Discriminant Functions for the Normal Density…

g_i(x) = -\frac{1}{2}(x-\mu_i)^t \Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)

• When P(ωi) is the same for each of the c classes, this reduces to a Mahalanobis distance classifier:

g_i(x) = -\frac{1}{2}(x-\mu_i)^t \Sigma^{-1}(x-\mu_i)

79

Mahalanobis Distance

Euclidean distance:  d(x, y) = \sqrt{(x-y)^t (x-y)}

Mahalanobis distance:  d(x, y) = \sqrt{(x-y)^t \Sigma^{-1} (x-y)}

80
Mahalanobis Distance

Euclidean distance: all points on a circle around the reference point are equidistant from it

81
Mahalanobis Distance

[Figure: scatter plot of Class 1 and Class 2]

82
Mahalanobis Distance
Covariance matrix:

  Σ = [ 0.3  0.2 ]
      [ 0.2  0.3 ]

Points:  A = (0.5, 0.5),  B = (0, 1),  C = (1.5, 1.5)

Compute (squared versions):
  Euclid(A, B), Euclid(A, C), Mahal(A, B), Mahal(A, C)

83
Mahalanobis Distance
Covariance matrix:

  Σ = [ 0.3  0.2 ]
      [ 0.2  0.3 ]

Points:  A = (0.5, 0.5),  B = (0, 1),  C = (1.5, 1.5)

Squared distances:
  Euclid(A, B) = 0.5    Euclid(A, C) = 2
  Mahal(A, B) = 5       Mahal(A, C) = 4
84
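The squared distances above can be checked with a few lines of NumPy:

```python
import numpy as np

S = np.array([[0.3, 0.2],
              [0.2, 0.3]])
S_inv = np.linalg.inv(S)
A, B, C = np.array([0.5, 0.5]), np.array([0.0, 1.0]), np.array([1.5, 1.5])

def euclid_sq(p, q):
    d = p - q
    return float(d @ d)

def mahal_sq(p, q):
    d = p - q
    return float(d @ S_inv @ d)

print(euclid_sq(A, B), euclid_sq(A, C))   # 0.5, 2.0
print(mahal_sq(A, B), mahal_sq(A, C))     # 5.0, 4.0: B is "farther" under Mahalanobis
```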
Discriminant Functions for the Normal Density…

The contour lines are elliptical because the covariance matrix is not diagonal; both densities show the same elliptical shape. The prior probabilities are equal, so the point x_0 lies halfway between the two means. The decision boundary is not orthogonal to the red line linking the means. Instead, it is tilted so that its points are at equal distance from the contour lines of w1 and those of w2.

85
Discriminant Functions for the Normal Density…

• Case \Sigma_i = arbitrary
  – The covariance matrices are different for each category

g_i(x) = x^t W_i x + w_i^t x + w_{i0}, where

W_i = -\frac{1}{2}\Sigma_i^{-1}
w_i = \Sigma_i^{-1}\mu_i
w_{i0} = -\frac{1}{2}\mu_i^t \Sigma_i^{-1}\mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)

88
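A sketch of the quadratic discriminant for this case, with assumed toy means and covariances (the values are illustrative, not from the slides):

```python
import numpy as np

def quadratic_discriminant(mu, sigma, prior):
    """Return (W_i, w_i, w_i0) for the case of an arbitrary class covariance."""
    mu = np.asarray(mu, float)
    inv = np.linalg.inv(sigma)
    W = -0.5 * inv
    w = inv @ mu
    w0 = -0.5 * mu @ inv @ mu - 0.5 * np.log(np.linalg.det(sigma)) + np.log(prior)
    return W, w, w0

def g(x, W, w, w0):
    x = np.asarray(x, float)
    return x @ W @ x + w @ x + w0

# Two classes with different covariances: decision by the larger g_i(x).
W1, w1, w10 = quadratic_discriminant([0, 0], np.array([[1.0, 0.0], [0.0, 0.5]]), 0.5)
W2, w2, w20 = quadratic_discriminant([2, 2], np.array([[0.5, 0.0], [0.0, 1.5]]), 0.5)
x = [1.0, 1.0]
print("class 1" if g(x, W1, w1, w10) > g(x, W2, w2, w20) else "class 2")
```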
Discriminant Functions for the Normal Density…

[Figure: with P(ω1) = P(ω2) and arbitrary covariances, the decision regions can be disconnected]

89
Discriminant Functions for the Normal
Density

92
Decision Regions for Two-Dimensional
Gaussian Data

93
Decision Regions for Two-Dimensional
Gaussian Data

94
Thanks.

95
