CS 677 Pattern Recognition
Week 2-3: Introduction to Pattern
Recognition and Bayesian Decision Theory
                     Dr. Amr El-Wakeel
Lane Department of Computer Science and Electrical Engineering
                          Spring 24
          Machine Perception
• Build a machine that can recognize patterns:
  – Speech recognition
  – Fingerprint identification
  – OCR (Optical Character Recognition)
  – Medical diagnosis and event detection
  – DNA sequence identification
                  Example
• “Sorting incoming Fish on a conveyor
  according to species using optical sensing”
  [Figure: the two species to be sorted: sea bass and salmon]
                             Example
• Sorting Fish: incoming fish are sorted
  according to species using optical
  sensing (sea bass or salmon?)
• Problem Analysis:
  ▪ set up a camera and take some sample
    images to extract features
  ▪ Consider features such as length, lightness, width, number and
    shape of fins, position of mouth, etc.
  ▪ This is the set of some suggested features to explore for use in
    our classifier!
  [Figure: processing stages: Sensing, Segmentation, Feature Extraction]
                   Pre-processing
  – Use a segmentation operation to isolate individual fish from
    one another and from the background
• Information from a single fish is sent to a feature
  extractor whose purpose is to reduce the data by
  measuring certain features
• The features are passed to a classifier
                  Classification
– Select the length of the fish as a possible feature for
  discrimination
• The length is a poor feature alone!
• Select the lightness as a possible feature.
• Threshold decision boundary and cost
  relationship
  – Move our decision boundary toward smaller values of
    lightness in order to minimize the cost (reduce the
    number of sea bass that are classified as salmon!)
                  Task of decision theory
             Multiple Features
• Adopt the lightness and add the width of
  the fish
       Fish  →  $x^{T} = [x_1, x_2]$   ($x_1$ = lightness, $x_2$ = width)
• We might add other features that are not correlated with
  the ones we already have. A precaution should be taken
  not to reduce the performance by adding such “noisy
  features”
• Ideally, the best decision boundary should be the one
  which provides optimal performance, such as in the
  following figure
• However, our satisfaction is premature
  because the central aim of designing a
  classifier is to correctly classify novel input
              Issue of generalization!
      Pattern Recognition Systems
• Sensing
  – Use of a transducer (camera or microphone)
  – The PR system depends on the bandwidth, resolution, sensitivity,
    and distortion of the transducer
• Segmentation and grouping
  – Patterns should be well separated and should not
    overlap
        Pattern Recognition Systems
• Feature extraction
   – Discriminative features
   – Invariant features with respect to translation, rotation and
     scale.
• Classification
   – Use a feature vector provided by a feature extractor to assign
     the object to a category
• Post Processing
   – Exploit context (input-dependent information other than the
     target pattern itself) to improve performance
            The Design Cycle
•   Data collection
•   Feature Choice
•   Model Choice
•   Training
•   Evaluation
•   Computational Complexity
• Computational Complexity
  – What is the trade-off between computational
    ease and performance?
  – How does an algorithm scale as a function of the
    number of features, patterns, or categories?
  – Example: labelling a 20 × 20 binary-pixel image allows
    $2^{20 \times 20} \approx 10^{120}$ possible patterns.
      Introduction
      Bayesian Decision Theory –
        Continuous Features
                    Introduction
• The sea bass/salmon example
  – State of nature, prior
     • State of nature is a random variable
     • The catch of salmon and sea bass is equiprobable
         – $P(\omega_1) = P(\omega_2)$ (uniform priors)
         – $P(\omega_1) + P(\omega_2) = 1$ (exclusivity and exhaustivity)
• Decision rule with only the prior information
  – Decide $\omega_1$ if $P(\omega_1) > P(\omega_2)$; otherwise decide $\omega_2$
• Use of the class-conditional information
• $P(x \mid \omega_1)$ and $P(x \mid \omega_2)$ describe the difference
  in lightness between the populations of sea bass and
  salmon (a particular density function for the
  random variable $X$)
• Posterior, likelihood, evidence
   – Bayes formula:
     $$P(\omega_j \mid x) = \frac{P(x \mid \omega_j)\,P(\omega_j)}{P(x)}$$
   – where, in the case of two categories,
     $$P(x) = \sum_{j=1}^{2} P(x \mid \omega_j)\,P(\omega_j)$$
   – Posterior = (Likelihood × Prior) / Evidence
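To make Bayes formula concrete, here is a minimal NumPy sketch (not from the slides) that computes the two-category posteriors from assumed likelihood values and priors; all numbers are made up for illustration.

```python
import numpy as np

# Hypothetical likelihood values P(x | w_j) evaluated at one observed x,
# and prior probabilities P(w_j) for the two categories (made-up numbers).
likelihoods = np.array([0.60, 0.25])   # [P(x|w1), P(x|w2)]
priors      = np.array([0.50, 0.50])   # [P(w1),   P(w2)]

# Evidence: P(x) = sum_j P(x|w_j) P(w_j)
evidence = np.sum(likelihoods * priors)

# Posterior = Likelihood * Prior / Evidence  (Bayes formula)
posteriors = likelihoods * priors / evidence
print(posteriors)   # [0.7058... 0.2941...] -- sums to 1
```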
• Decision given the posterior probabilities
  $x$ is an observation for which:
  if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$, the true state of nature is $\omega_1$
  if $P(\omega_1 \mid x) < P(\omega_2 \mid x)$, the true state of nature is $\omega_2$
  Therefore, whenever we observe a particular $x$, the probability of
  error is:
             $P(\text{error} \mid x) = P(\omega_1 \mid x)$ if we decide $\omega_2$
             $P(\text{error} \mid x) = P(\omega_2 \mid x)$ if we decide $\omega_1$
• Minimizing the probability of error
• Decide $\omega_1$ if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$;
  otherwise decide $\omega_2$
  Therefore:
             $P(\text{error} \mid x) = \min\,[P(\omega_1 \mid x),\, P(\omega_2 \mid x)]$
                   (Bayes decision)
Bayesian Decision Theory – Continuous Features
• Generalization of the preceding ideas
  – Use of more than one feature
  – Use more than two states of nature
  – Allowing actions other than merely deciding on the state of
    nature
  – Introducing a loss function which is more general than the
    probability of error
• Allowing actions other than classification primarily
  allows the possibility of rejection
   – Rejection in the sense of abstention
   – Don’t make a decision if the alternatives are too close
   – This must be tempered by the cost of indecision
• The loss function states how costly each action taken is
Let $\{\omega_1, \omega_2, \dots, \omega_c\}$ be the set of $c$ states of nature
(or "categories")
Let $\{\alpha_1, \alpha_2, \dots, \alpha_a\}$ be the set of possible actions
Let $\lambda(\alpha_i \mid \omega_j)$ be the loss incurred for taking
  action $\alpha_i$ when the state of nature is $\omega_j$
                       Overall risk
$R$ = the overall risk: the expected value of the conditional risk
$R(\alpha(x) \mid x)$ over all observations $x$
Minimizing $R$ is achieved by minimizing the conditional risk
$R(\alpha_i \mid x)$ for $i = 1, \dots, a$:
$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid x)$$
               for each action $\alpha_i$ ($i = 1, \dots, a$)
Note: this is the risk specifically for observation $x$
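A small sketch of the conditional risk computation (the loss matrix and posterior values are hypothetical, not from the slides): $R(\alpha_i \mid x)$ is just the product of the loss table with the posterior vector, and the Bayes rule picks the minimum-risk row.

```python
import numpy as np

# Hypothetical loss matrix: loss[i, j] = lambda(alpha_i | w_j)  (made-up values)
loss = np.array([[0.0, 2.0],
                 [1.0, 0.0]])
# Hypothetical posteriors P(w_j | x) for one observation x
posteriors = np.array([0.7, 0.3])

# Conditional risk: R(alpha_i | x) = sum_j lambda(alpha_i | w_j) P(w_j | x)
cond_risk = loss @ posteriors
best_action = np.argmin(cond_risk)   # Bayes rule: take the minimum-risk action
print(cond_risk, best_action)        # [0.6 0.7] -> action index 0 (decide w1)
```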
Select the action $\alpha_i$ for which $R(\alpha_i \mid x)$ is minimum.
        Then the overall risk $R$ is minimized, and $R$ in this case is
        called the Bayes risk: the best performance that can be achieved!
• Two-category classification
       $\alpha_1$: deciding $\omega_1$
       $\alpha_2$: deciding $\omega_2$
       $\lambda_{ij} = \lambda(\alpha_i \mid \omega_j)$:
       the loss incurred for deciding $\omega_i$ when the true state of nature is $\omega_j$
Conditional risk:
       $R(\alpha_1 \mid x) = \lambda_{11} P(\omega_1 \mid x) + \lambda_{12} P(\omega_2 \mid x)$
       $R(\alpha_2 \mid x) = \lambda_{21} P(\omega_1 \mid x) + \lambda_{22} P(\omega_2 \mid x)$
Our rule is the following:
                      if $R(\alpha_1 \mid x) < R(\alpha_2 \mid x)$,
               action $\alpha_1$ ("decide $\omega_1$") is taken
Substituting the definition of $R(\alpha_i \mid x)$, we have:
decide $\omega_1$ if:
       $\lambda_{11} P(\omega_1 \mid x) + \lambda_{12} P(\omega_2 \mid x) <
        \lambda_{21} P(\omega_1 \mid x) + \lambda_{22} P(\omega_2 \mid x)$
                    and decide $\omega_2$ otherwise
We can rewrite
       $\lambda_{11} P(\omega_1 \mid x) + \lambda_{12} P(\omega_2 \mid x) <
        \lambda_{21} P(\omega_1 \mid x) + \lambda_{22} P(\omega_2 \mid x)$
as
       $(\lambda_{21} - \lambda_{11})\,P(\omega_1 \mid x) >
        (\lambda_{12} - \lambda_{22})\,P(\omega_2 \mid x)$
Finally, we can rewrite
       $(\lambda_{21} - \lambda_{11})\,P(\omega_1 \mid x) >
        (\lambda_{12} - \lambda_{22})\,P(\omega_2 \mid x)$
using Bayes formula and the posterior probabilities to get:
decide $\omega_1$ if:
       $(\lambda_{21} - \lambda_{11})\,P(x \mid \omega_1)\,P(\omega_1) >
        (\lambda_{12} - \lambda_{22})\,P(x \mid \omega_2)\,P(\omega_2)$
                    and decide $\omega_2$ otherwise
If $\lambda_{21} > \lambda_{11}$, we can express our rule as a
   likelihood ratio.
The preceding rule is equivalent to the following rule:
       if
       $$\frac{P(x \mid \omega_1)}{P(x \mid \omega_2)} >
         \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}}
         \cdot \frac{P(\omega_2)}{P(\omega_1)}$$
                  then take action $\alpha_1$ (decide $\omega_1$);
                otherwise take action $\alpha_2$ (decide $\omega_2$)
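A sketch of the likelihood-ratio test under assumed losses, priors, and likelihood values (every number below is hypothetical and chosen only to exercise the rule):

```python
import numpy as np

# Hypothetical quantities for one observation x
p_x_w1, p_x_w2 = 0.45, 0.15            # class-conditional likelihoods P(x|w1), P(x|w2)
P_w1, P_w2     = 0.4, 0.6              # priors
l11, l12, l21, l22 = 0.0, 3.0, 1.0, 0.0   # losses lambda_ij = lambda(alpha_i | w_j)

# Threshold from the rule: (l12 - l22)/(l21 - l11) * P(w2)/P(w1)
threshold = (l12 - l22) / (l21 - l11) * (P_w2 / P_w1)

# Decide w1 when the likelihood ratio exceeds the threshold
likelihood_ratio = p_x_w1 / p_x_w2
decision = "w1" if likelihood_ratio > threshold else "w2"
print(likelihood_ratio, threshold, decision)   # 3.0, 4.5 -> decide w2
```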
Optimal decision property
“If the likelihood ratio exceeds a threshold
value independent of the input pattern x, we
can take optimal actions”
            Minimum-Error-Rate Classification
• Actions are decisions on classes
  If action $\alpha_i$ is taken and the true state of nature is $\omega_j$, then
  the decision is correct if $i = j$ and in error if $i \neq j$
• Seek a decision rule that minimizes the probability of error,
  which is the error rate
• Introduction of the zero-one loss function:
       $$\lambda(\alpha_i, \omega_j) =
         \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases}
         \qquad i, j = 1, \dots, c$$
  Therefore, the conditional risk is:
       $$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid x)
         = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$$
  "The risk corresponding to this loss function is the average
  probability of error"
• The Bayes decision rule depends on minimizing risk
• Minimizing the risk requires selecting the $\omega_i$ that
  maximizes $P(\omega_i \mid x)$
  (since $R(\alpha_i \mid x) = 1 - P(\omega_i \mid x)$)
• For minimum error rate:
   – Decide $\omega_i$ if $P(\omega_i \mid x) > P(\omega_j \mid x)$ for all $j \neq i$
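With the zero-one loss, minimizing $R(\alpha_i \mid x) = 1 - P(\omega_i \mid x)$ reduces to picking the largest posterior. A tiny sketch with hypothetical posteriors (three classes, made-up values) confirms the equivalence numerically:

```python
import numpy as np

posteriors = np.array([0.2, 0.5, 0.3])   # hypothetical P(w_i | x), c = 3 classes
zero_one_loss = 1.0 - np.eye(3)          # lambda(a_i | w_j) = 0 if i == j else 1

cond_risk = zero_one_loss @ posteriors   # equals 1 - posteriors
assert np.allclose(cond_risk, 1.0 - posteriors)

# Minimum risk under zero-one loss == maximum posterior
print(np.argmin(cond_risk) == np.argmax(posteriors))   # True
```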
  • Regions of decision and the zero-one loss function: using the
    likelihood ratio formula, let
    $$\theta_\lambda = \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}}
      \cdot \frac{P(\omega_2)}{P(\omega_1)};$$
    then decide $\omega_1$ if
    $\dfrac{P(x \mid \omega_1)}{P(x \mid \omega_2)} > \theta_\lambda$
  • If $\lambda$ is the zero-one loss function, which means
    $\lambda = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$,
    then $\theta_\lambda = \dfrac{P(\omega_2)}{P(\omega_1)} = \theta_a$
    If $\lambda = \begin{pmatrix} 0 & 2 \\ 1 & 0 \end{pmatrix}$,
    then $\theta_\lambda = \dfrac{2\,P(\omega_2)}{P(\omega_1)} = \theta_b$
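The two thresholds above follow directly from the loss matrices; a quick numeric check with an assumed prior ratio (the priors and helper function below are illustrative, not from the slides):

```python
# Assumed priors (hypothetical values)
P_w1, P_w2 = 2.0 / 3.0, 1.0 / 3.0

def theta(l11, l12, l21, l22):
    """Likelihood-ratio threshold (l12 - l22)/(l21 - l11) * P(w2)/P(w1)."""
    return (l12 - l22) / (l21 - l11) * (P_w2 / P_w1)

theta_a = theta(0, 1, 1, 0)   # zero-one loss       -> P(w2)/P(w1)
theta_b = theta(0, 2, 1, 0)   # doubled lambda_12   -> 2 P(w2)/P(w1)
print(theta_a, theta_b)       # 0.5 and 1.0
```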
          Classifiers, Discriminant Functions
                 and Decision Surfaces
• The multi-category case
  – Set of discriminant functions $g_i(x)$, $i = 1, \dots, c$
  – The classifier assigns a feature vector $x$ to class $\omega_i$
    if:
                       $g_i(x) > g_j(x)$ for all $j \neq i$
• Let $g_i(x) = -R(\alpha_i \mid x)$
  (maximum discriminant corresponds to minimum risk!)
• For the minimum error rate, we take
                        $g_i(x) = P(\omega_i \mid x)$
  (maximum discriminant corresponds to maximum posterior!)
                   $g_i(x) \propto P(x \mid \omega_i)\,P(\omega_i)$
                $g_i(x) = \ln P(x \mid \omega_i) + \ln P(\omega_i)$
• Feature space divided into $c$ decision regions:
              if $g_i(x) > g_j(x)$ for all $j \neq i$, then $x$ is in $R_i$
  ($R_i$ means: assign $x$ to $\omega_i$)
• The two-category case
  – A classifier is a "dichotomizer" that has two discriminant
    functions $g_1$ and $g_2$
    Let $g(x) \equiv g_1(x) - g_2(x)$
             Decide $\omega_1$ if $g(x) > 0$; otherwise decide $\omega_2$
– The computation of $g(x)$:
    $$g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)
           = \ln\frac{P(x \mid \omega_1)}{P(x \mid \omega_2)}
           + \ln\frac{P(\omega_1)}{P(\omega_2)}$$
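A sketch of the dichotomizer in code: compute $g(x)$ as the log-likelihood ratio plus the log-prior ratio and threshold it at zero. The likelihoods and priors below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical class-conditional likelihoods at one x, and priors
p_x_w1, p_x_w2 = 0.30, 0.10
P_w1, P_w2     = 0.45, 0.55

# g(x) = ln P(x|w1)/P(x|w2) + ln P(w1)/P(w2)
g = np.log(p_x_w1 / p_x_w2) + np.log(P_w1 / P_w2)

print("decide w1" if g > 0 else "decide w2")   # g ~ 0.898 > 0 -> decide w1
```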
               The Normal Density
• Univariate density
      –   Density which is analytically tractable
      –   Continuous density
      –   A lot of processes are asymptotically Gaussian
      –   Handwritten characters and speech sounds can be viewed as ideal or
          prototype patterns corrupted by a random process (central limit theorem)
          $$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}
                   \exp\!\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}\right]$$
   where:
    $\mu$ = mean (or expected value) of $x$
    $\sigma^{2}$ = expected squared deviation, or variance
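A direct transcription of the univariate density into code (the parameters in the example call are placeholders):

```python
import numpy as np

def univariate_normal(x, mu, sigma):
    """p(x) = 1/(sqrt(2*pi)*sigma) * exp(-0.5*((x - mu)/sigma)**2)"""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Example: the standard normal evaluated at its mean gives ~0.3989
print(univariate_normal(0.0, mu=0.0, sigma=1.0))
```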
• Multivariate density
   – The multivariate normal density in $d$ dimensions is:
     $$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}
              \exp\!\left[-\frac{1}{2}(x - \mu)^{t}\,\Sigma^{-1}(x - \mu)\right]$$
     where:
     $x = (x_1, x_2, \dots, x_d)^{t}$ ($t$ stands for the transpose vector form)
     $\mu = (\mu_1, \mu_2, \dots, \mu_d)^{t}$ is the mean vector
     $\Sigma$ is the $d \times d$ covariance matrix
     $|\Sigma|$ and $\Sigma^{-1}$ are its determinant and inverse, respectively
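A corresponding sketch of the multivariate density (the 2-D mean and covariance in the example are made up):

```python
import numpy as np

def multivariate_normal(x, mu, Sigma):
    """p(x) = exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu)) / ((2*pi)^{d/2} |Sigma|^{1/2})"""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

# Example with made-up 2-D parameters
mu    = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(multivariate_normal(np.array([0.5, -0.5]), mu, Sigma))
```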
 Discriminant Functions for the Normal Density
• We saw that the minimum error-rate
  classification can be achieved by the
  discriminant function
  $g_i(x) = \ln P(x \mid \omega_i) + \ln P(\omega_i)$
• Case of multivariate normal densities:
  $$g_i(x) = -\frac{1}{2}(x - \mu_i)^{t}\,\Sigma_i^{-1}(x - \mu_i)
             - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
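A sketch of this Gaussian discriminant for one class, evaluated with assumed means, covariances, and priors (none of the numbers come from the slides):

```python
import numpy as np

def gaussian_discriminant(x, mu_i, Sigma_i, prior_i):
    """g_i(x) = -0.5 (x-mu_i)^T Sigma_i^{-1} (x-mu_i)
                - (d/2) ln 2*pi - 0.5 ln|Sigma_i| + ln P(w_i)"""
    d = len(mu_i)
    diff = x - mu_i
    sign, logdet = np.linalg.slogdet(Sigma_i)   # numerically stable log-determinant
    return (-0.5 * diff @ np.linalg.inv(Sigma_i) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * logdet
            + np.log(prior_i))

# Classify one point with two hypothetical classes: pick the larger g_i(x)
x = np.array([1.0, 0.5])
g1 = gaussian_discriminant(x, np.array([0.0, 0.0]), np.eye(2), 0.5)
g2 = gaussian_discriminant(x, np.array([2.0, 2.0]), np.eye(2), 0.5)
print("w1" if g1 > g2 else "w2")
```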
        Case $\Sigma_i = \sigma^{2} I$ ($I$ stands for the identity matrix)
        • What does "$\Sigma_i = \sigma^{2} I$" say about the dimensions?
        • What about the variance of each dimension?
Note: both $|\Sigma_i|$ and $(d/2)\ln 2\pi$ are independent of $i$ in
  $$g_i(x) = -\frac{1}{2}(x - \mu_i)^{t}\,\Sigma_i^{-1}(x - \mu_i)
             - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
Thus we can simplify to:
  $$g_i(x) = -\frac{\lVert x - \mu_i \rVert^{2}}{2\sigma^{2}} + \ln P(\omega_i)$$
where $\lVert \cdot \rVert$ denotes the Euclidean norm
• We can further simplify by recognizing that the quadratic term
  $x^{t}x$, implicit in the Euclidean norm, is the same for all $i$.
  $$g_i(x) = w_i^{t} x + w_{i0} \quad \text{(linear discriminant function)}$$
  where:
  $$w_i = \frac{\mu_i}{\sigma^{2}}; \qquad
    w_{i0} = -\frac{1}{2\sigma^{2}}\,\mu_i^{t}\mu_i + \ln P(\omega_i)$$
  ($w_{i0}$ is called the threshold for the $i$th category!)
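A sketch of this linear discriminant for the $\Sigma_i = \sigma^2 I$ case, with a hypothetical two-class, two-dimensional problem (all parameter values invented for illustration):

```python
import numpy as np

def linear_discriminant_params(mu_i, sigma2, prior_i):
    """Case Sigma_i = sigma^2 I:  w_i = mu_i / sigma^2,
       w_i0 = -mu_i^T mu_i / (2 sigma^2) + ln P(w_i)."""
    w_i  = mu_i / sigma2
    w_i0 = -mu_i @ mu_i / (2 * sigma2) + np.log(prior_i)
    return w_i, w_i0

# Hypothetical 2-class, 2-D problem with equal spherical covariance
sigma2 = 1.0
w1, w10 = linear_discriminant_params(np.array([0.0, 0.0]), sigma2, 0.5)
w2, w20 = linear_discriminant_params(np.array([3.0, 3.0]), sigma2, 0.5)

x = np.array([1.0, 1.0])
g1, g2 = w1 @ x + w10, w2 @ x + w20
print("w1" if g1 > g2 else "w2")    # x is closer to mu_1 -> decide w1
```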
• A classifier that uses linear discriminant
  functions is called “a linear machine”
• The decision surfaces for a linear machine
  are pieces of hyperplanes defined by:
                    $g_i(x) = g_j(x)$
The equation can be written as:
     $w^{t}(x - x_0) = 0$
– The hyperplane separating $R_i$ and $R_j$,
  $$x_0 = \frac{1}{2}(\mu_i + \mu_j)
          - \frac{\sigma^{2}}{\lVert \mu_i - \mu_j \rVert^{2}}
            \ln\frac{P(\omega_i)}{P(\omega_j)}\,(\mu_i - \mu_j),$$
is always orthogonal to the line linking the means!
  $$\text{If } P(\omega_i) = P(\omega_j), \text{ then }
    x_0 = \frac{1}{2}(\mu_i + \mu_j)$$
• Case $\Sigma_i = \Sigma$ (the covariance matrices of all classes are
  identical but otherwise arbitrary!)
  The hyperplane separating $R_i$ and $R_j$
  has the equation
  $$w^{t}(x - x_0) = 0$$
  where
  $$w = \Sigma^{-1}(\mu_i - \mu_j)$$
  and
  $$x_0 = \frac{1}{2}(\mu_i + \mu_j)
          - \frac{\ln\left[P(\omega_i)/P(\omega_j)\right]}
                 {(\mu_i - \mu_j)^{t}\,\Sigma^{-1}(\mu_i - \mu_j)}\,(\mu_i - \mu_j)$$
    (the hyperplane separating $R_i$ and $R_j$ is generally not
       orthogonal to the line between the means!)
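A sketch of the shared-covariance boundary: compute $w$ and $x_0$ as above and test which side of the hyperplane a point falls on. The covariance, means, and priors are assumed values for illustration only.

```python
import numpy as np

def shared_cov_hyperplane(mu_i, mu_j, Sigma, P_i, P_j):
    """Case Sigma_i = Sigma: boundary is w^T (x - x0) = 0 with
       w = Sigma^{-1} (mu_i - mu_j) and x0 as given on the slide."""
    Sigma_inv = np.linalg.inv(Sigma)
    diff = mu_i - mu_j
    w = Sigma_inv @ diff
    x0 = 0.5 * (mu_i + mu_j) - (np.log(P_i / P_j) / (diff @ Sigma_inv @ diff)) * diff
    return w, x0

# Hypothetical parameters
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
w, x0 = shared_cov_hyperplane(np.array([0.0, 0.0]), np.array([2.0, 1.0]),
                              Sigma, P_i=0.6, P_j=0.4)

x = np.array([0.5, 0.5])
print("R_i" if w @ (x - x0) > 0 else "R_j")   # side of the separating hyperplane
```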
• Case $\Sigma_i$ = arbitrary
   – The covariance matrices are different for each category
     $$g_i(x) = x^{t} W_i\, x + w_i^{t} x + w_{i0}$$
   where:
     $$W_i = -\frac{1}{2}\,\Sigma_i^{-1}$$
     $$w_i = \Sigma_i^{-1}\mu_i$$
     $$w_{i0} = -\frac{1}{2}\,\mu_i^{t}\,\Sigma_i^{-1}\mu_i
                - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
   The decision surfaces are hyperquadrics
     (hyperquadrics are: hyperplanes, pairs of hyperplanes, hyperspheres,
     hyperellipsoids, hyperparaboloids, and hyperhyperboloids)
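A sketch that builds the quadratic-discriminant coefficients $W_i$, $w_i$, $w_{i0}$ from assumed class parameters and evaluates $g_i(x)$ at one point (all values hypothetical):

```python
import numpy as np

def quadratic_discriminant_params(mu_i, Sigma_i, prior_i):
    """Case of arbitrary Sigma_i:
       W_i  = -0.5 Sigma_i^{-1}
       w_i  = Sigma_i^{-1} mu_i
       w_i0 = -0.5 mu_i^T Sigma_i^{-1} mu_i - 0.5 ln|Sigma_i| + ln P(w_i)"""
    Sigma_inv = np.linalg.inv(Sigma_i)
    W_i  = -0.5 * Sigma_inv
    w_i  = Sigma_inv @ mu_i
    _, logdet = np.linalg.slogdet(Sigma_i)
    w_i0 = -0.5 * mu_i @ Sigma_inv @ mu_i - 0.5 * logdet + np.log(prior_i)
    return W_i, w_i, w_i0

# Evaluate g_i(x) = x^T W_i x + w_i^T x + w_i0 for one hypothetical class
W, w, w0 = quadratic_discriminant_params(np.array([1.0, 2.0]),
                                         np.array([[1.5, 0.2],
                                                   [0.2, 0.8]]), 0.3)
x = np.array([0.5, 1.0])
print(x @ W @ x + w @ x + w0)
```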
   Bayes Decision Theory – Discrete Features
• Components of $x$ are binary or integer valued; $x$ can take only
  one of $m$ discrete values
                     $v_1, v_2, \dots, v_m$
➔ We are concerned with probabilities rather than probability densities
  in Bayes formula:
       $$P(\omega_j \mid x) = \frac{P(x \mid \omega_j)\,P(\omega_j)}{P(x)}$$
      where
       $$P(x) = \sum_{j=1}^{c} P(x \mid \omega_j)\,P(\omega_j)$$
  Bayes Decision Theory – Discrete Features
• The conditional risk $R(\alpha \mid x)$ is defined as before
• The approach is still to minimize risk:
       $$\alpha^{*} = \arg\min_{i}\, R(\alpha_i \mid x)$$
   Bayes Decision Theory – Discrete Features
• Case of independent binary features in a two-category problem
  Let $x = [x_1, x_2, \dots, x_d]^{t}$, where each $x_i$ is either 0
  or 1, with probabilities:
             $p_i = P(x_i = 1 \mid \omega_1)$
             $q_i = P(x_i = 1 \mid \omega_2)$
  Bayes Decision Theory – Discrete Features
• Assuming conditional independence, $P(x \mid \omega_i)$ can be written
  as a product of component probabilities:
       $$P(x \mid \omega_1) = \prod_{i=1}^{d} p_i^{x_i}(1 - p_i)^{1 - x_i}$$
       and
       $$P(x \mid \omega_2) = \prod_{i=1}^{d} q_i^{x_i}(1 - q_i)^{1 - x_i}$$
       yielding a likelihood ratio given by:
       $$\frac{P(x \mid \omega_1)}{P(x \mid \omega_2)}
         = \prod_{i=1}^{d} \left(\frac{p_i}{q_i}\right)^{x_i}
           \left(\frac{1 - p_i}{1 - q_i}\right)^{1 - x_i}$$
  Bayes Decision Theory – Discrete Features
• Taking our likelihood ratio
       $$\frac{P(x \mid \omega_1)}{P(x \mid \omega_2)}
         = \prod_{i=1}^{d} \left(\frac{p_i}{q_i}\right)^{x_i}
           \left(\frac{1 - p_i}{1 - q_i}\right)^{1 - x_i}$$
      and plugging it into Eq. 31,
       $$g(x) = \ln\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)}
              + \ln\frac{P(\omega_1)}{P(\omega_2)}$$
       yields:
       $$g(x) = \sum_{i=1}^{d}\left[x_i \ln\frac{p_i}{q_i}
              + (1 - x_i)\ln\frac{1 - p_i}{1 - q_i}\right]
              + \ln\frac{P(\omega_1)}{P(\omega_2)}$$
• The discriminant function in this case is:
       $$g(x) = \sum_{i=1}^{d} w_i x_i + w_0$$
    where:
       $$w_i = \ln\frac{p_i(1 - q_i)}{q_i(1 - p_i)}, \qquad i = 1, \dots, d$$
    and:
       $$w_0 = \sum_{i=1}^{d} \ln\frac{1 - p_i}{1 - q_i}
             + \ln\frac{P(\omega_1)}{P(\omega_2)}$$
    Decide $\omega_1$ if $g(x) > 0$ and $\omega_2$ if $g(x) \leq 0$
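A sketch of this linear discriminant for independent binary features; the per-feature probabilities $p_i$, $q_i$, the priors, and the test vector below are hypothetical values chosen only to exercise the formula:

```python
import numpy as np

def binary_feature_discriminant(x, p, q, P_w1, P_w2):
    """g(x) = sum_i w_i x_i + w_0 with
       w_i = ln[p_i (1 - q_i) / (q_i (1 - p_i))]
       w_0 = sum_i ln[(1 - p_i)/(1 - q_i)] + ln[P(w1)/P(w2)]"""
    w  = np.log(p * (1 - q) / (q * (1 - p)))
    w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P_w1 / P_w2)
    return w @ x + w0

# Hypothetical 3-feature problem: p_i = P(x_i = 1 | w1), q_i = P(x_i = 1 | w2)
p = np.array([0.8, 0.7, 0.6])
q = np.array([0.3, 0.4, 0.5])
x = np.array([1, 0, 1])
g = binary_feature_discriminant(x, p, q, P_w1=0.5, P_w2=0.5)
print("decide w1" if g > 0 else "decide w2")
```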