CS 677 Pattern Recognition
Week 1: Introduction
                     Dr. Amr El-Wakeel
Lane Department of Computer Science and Electrical Engineering
                          Spring 2024
                     Course Instructor
• Dr. Amr El-Wakeel, Assistant Professor, Lane Department of CSEE
• Research Interests: CAVs, ITS, internet of things, and healthcare informatics
e-mail: ase00006@mix.wvu.edu
Office: AERB 253,
Office hours: Mondays 1-2 pm
Or by e-mail appointment
Acknowledgment: Dr. Omid Dehzangi (course design and material)
           Class Overview
• Textbook
• Homeworks, exams, grading
• Course topics
                            Textbook
• Main Book:
   – Pattern Classification by Duda, Hart and Stork, Second Edition, ISBN: 978-0471056690
• Suggested Material that will help you:
   – C. M. Bishop, "Pattern Recognition and Machine Learning", 2006
   – Angel R. Martinez and Wendy L. Martinez, Computational Statistics
     Handbook with MATLAB (3rd Edition or later)
      Class Overview
   Homework                                          15%
   Course project (reports, code, presentations)     50%
   Exams (dates will be discussed in class)          35%
      – Exam 1: 15%
      – Exam 2: 20%
                         Course Topics
•   Course description: The course will introduce graduate students to several topics
    in machine learning and pattern recognition, including the following suggested
    topics:
      -Introduction to pattern recognition
      -Bayesian decision theory
      -Nearest-neighbor
      -Linear discriminant functions
      -Linear regression
      -Logistic regression
      -Gradient descent
      -Support vector machines
      -Clustering
      -Feature Extraction and Reduction
      -Neural Networks
      -Selected topics (if time permits)
                               Terminology
• Pattern Recognition: “the act of taking raw data and taking an action based on the
  category of the pattern.”
• Common Applications: speech recognition, fingerprint identification (biometrics),
  anomaly detection, DNA sequence identification
• Related Terminology:
▪ Data mining: the process of finding anomalies, patterns and correlations within large
data sets to predict outcomes.
▪ Machine Learning: The ability of a machine to improve its performance based on
previous results.
▪ Machine Understanding: acting on the intentions of the user
generating the data.
• Related Fields: artificial intelligence, signal processing and discipline-specific research
  (e.g., target recognition, speech recognition, natural language processing).
    License Plate Recognition
    Biometric Recognition
     Fingerprint Classification
     Face Detection
     Autonomous Systems
           Medical Applications
     Skin Cancer Detection   Breast Cancer Detection
     Land Cover Classification
           (from aerial or satellite images)
           Knowledge Discovery Process
   [Figure: the knowledge discovery pipeline]
   Databases -> Data Integration -> Data Cleaning -> Preprocessed Data
   -> Feature Extraction / Selection -> Data Transformation
   -> Data Mining / Machine Learning / Pattern Recognition
   -> Knowledge Interpretation
▪ “Big” data arises in many forms:
   ▪   Physical Measurements: from science (physics, astronomy)
   ▪   Medical data: biometric sequences, detailed time series
   ▪   Activity data: GPS location, body sensor activities
   ▪   Business data: customer behavior tracking at fine detail
▪ Common themes:
   ▪ Data is large, and growing
   ▪ There are important patterns
     and trends in the data
   ▪ We don’t fully know where to look
     or how to find them
                Reducing the Data
▪ Although “big” data is about more than just the volume…
         …most big data is big!
▪ It is not always possible to store the data in full
   • Many applications (telecoms, ISPs, search engines, Sensor data)
       can’t keep everything
▪ It is inconvenient to work with data in full
   • Just because we can, doesn’t mean we should (Human
       behavior)
▪ It is faster to work with a compact summary
   • Better to explore data on a laptop
   than a cluster
              Sampling the Data
▪ Sampling has an intuitive semantics
   • We obtain a smaller data set with the same structure
▪ Estimating on a sample is often straightforward
   • Run the analysis on the sample that you would on the full
     data
   • Some rescaling/reweighting may be necessary
▪ Sampling is general and agnostic to the analysis to be
  done
   • Though sampling can be tuned to optimize some criteria
▪ Sampling is (usually) easy to understand
   • So prevalent that we have an intuition about sampling
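As a toy illustration (a hypothetical Python sketch, not part of the course materials): estimate a population total from a simple random sample, with the rescaling step the slide mentions.

```python
import random

def estimate_total(population, sample_size, seed=0):
    """Estimate the population total from a simple random sample.

    The sample mean is an unbiased estimate of the population mean,
    so rescaling it by the population size estimates the total.
    """
    random.seed(seed)
    sample = random.sample(population, sample_size)
    return len(population) * sum(sample) / sample_size

population = list(range(1, 1001))   # true total = 500500
print(round(estimate_total(population, sample_size=100)))  # near 500500, not exact
```

Running the same analysis on the sample that you would on the full data, then rescaling, is exactly the pattern described above.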
Sampling as a Mediator of Constraints
   [Diagram: sampling balances three sets of constraints]
   • Data Characteristics (Correlations)
   • Resource Constraints (Bandwidth, Storage, CPU, GPU)
   • Query Requirements (Ad Hoc, Accuracy, Aggregates, Speed)
                Sampling
▪ What is your population of interest?
     • To whom do you want to generalize your
       results?
         –All doctors
         –School children
         –Nationality
         –Women aged 15-45 years
         –Other
▪ Can you sample the entire population?
   [Figure: nested sets: the sample is drawn from the study population, which lies within the target population]
         Population definition
▪ A population can be defined as including all people
  or items with the characteristic one wishes to
  understand.
▪ Because there is very rarely enough time or money
  to gather information from everyone or everything
  in a population, the goal becomes finding a
  representative sample (or subset) of that
  population.
              Data Everywhere!
▪ Lots of data is being collected and warehoused
  – Web data, e-commerce
  – Sensor data
  – Purchases at department/
    grocery stores
  – Bank/Credit Card
    transactions
  – Social Network
                 Type of Data
▪ Relational Data (Structured data:
  Tables/Transaction/Legacy Data)
▪ Matrix (Biometrics)
▪ Text Data (Unstructured data: Web)
▪ XML (Semi-structured Data)
▪ Graph Data
   • Social Network, Semantic Web (RDF), …
▪ Streaming Data
   • You can only scan the data once
      What to do with these data?
▪ Aggregation and Statistics
  – Data warehouse
▪ Indexing, Searching, and Querying
  – Keyword based search
  – Pattern matching (e.g. XML)
▪ Knowledge discovery
  – Data Analytics
  – Statistical Modeling
          Random Sample and Statistics
▪ Population: is used to refer to the set or universe of all entities
  under study.
▪ However, looking at the entire population may not be feasible,
  or may be too expensive.
▪ Instead, we draw a random sample from the population, and
  compute appropriate statistics from the sample, that give
  estimates of the corresponding population parameters of
  interest.
        Ex: Time Series Analysis
•   Example: Stock Market
•   Predict future values
•   Determine similar patterns over time
•   Classify behavior
               Nature of Data
   Many observations on many variables.

Data file (1,500,000 observations, 100 variables):

   OBS No.     Target Var.   Var. 1   Var. 2   ...   Var. 100
   1           0             63       .        ...   .
   2           1             54       .        ...   .
   3           0             44       .        ...   .
   ...         ...           ...      .        ...   .
   1,500,000   1             32       .        ...   .
         Types of Problems
▪ Customer and Student Retention
▪ Detection of patient’s symptoms
▪ Credit Scoring (Auto or Home Loans)
▪ Bond Ratings
▪ Detection of Fraudulent Insurance Claims
▪ Is a Newly Introduced Product Meeting with
  Consumer Acceptance or Rejection?
▪ Who is a likely Donor to your Charity?
▪ Early Detection of a Stolen or Compromised
  Credit Card
    Data Rich, Information Poor
▪ The Amount of Raw Data Stored in Corporate Databases is
  Exploding
▪ Most of this information is recorded instantaneously and with
  minimal cost
▪ Databases are measured in gigabytes and terabytes (one terabyte = one trillion
  bytes, equivalent to about 2 million books!)
▪ Walmart uploads 20 million point-of-sale transactions to 500
  parallel processing storage devices each day.
▪ Raw data by itself, however, does not provide much information. That is where
  data analytics comes in!
Learning/Modeling/Decision making
        Learning Task Examples
• Classification maps data into predefined groups or
  classes
   – Supervised learning
   – Pattern recognition
   – Prediction
• Regression is used to map a data item to a real valued
  prediction variable
• Clustering groups similar data together into clusters
   – Unsupervised learning
   – Segmentation
   – Anomaly detection
• Dimensionality Reduction transforms data from a high-dimensional space into a
  low-dimensional space while retaining meaningful properties of the original data
      Prediction Problem
▪ Early Detection of a Stolen or
  Compromised Credit Card
 Not So Interested in How or Why the
  Credit Card was Stolen but Instead
   Whether Recent Transactions are
Indicative of a Stolen or Compromised
              Credit Card
                      Prediction:
             Classification vs. Regression
▪ Classification:
   – predicts categorical class labels
   – classifies data (constructs a model) based on the training set
     and the values (class labels) in a classifying attribute and
     uses it in classifying new data
▪ Regression:
   – models continuous-valued functions, i.e., predicts unknown
     or missing values
▪ Typical Applications
   –   credit approval
   –   target marketing
   –   medical diagnosis
   –   treatment effectiveness analysis
         Classification: Definition
▪ Given a collection of records (training set)
   – Each record contains a set of attributes, one of the attributes is the
     class.
▪ Find a model for class attribute as a function of the
  values of other attributes.
▪ Goal: previously unseen records should be
  assigned a class as accurately as possible.
   – A test set is used to determine the accuracy of the model. Usually,
     the given data set is divided into training and test sets, with training
     set used to build the model and test set used to validate it.
    Classification—A Two-Step Process
▪ Model construction: describing a set of predetermined classes
   – Each tuple/sample is assumed to belong to a predefined class, as
     determined by the class label attribute
   – The set of tuples used for model construction: training set
   – The model is represented as classification rules, decision trees, or
     mathematical formulae
▪ Model usage: for classifying future or unknown objects
   – Estimate accuracy of the model
      • The known label of test sample is compared with the classified
         result from the model
      • Accuracy rate is the percentage of test set samples that are
         correctly classified by the model
      • Test set is independent of training set, otherwise over-fitting will
         occur
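A minimal sketch of the two-step process in Python, using made-up data and a simple 1-nearest-neighbor rule as a stand-in for whatever model is learned:

```python
def nearest_neighbor(train, query):
    """Classify query by the label of the closest training point (1-NN)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda rec: dist(rec[0], query))[1]

# Step 1: model construction from labeled training tuples (hypothetical data).
train = [((1.0, 1.0), "no"), ((1.2, 0.9), "no"),
         ((5.0, 5.0), "yes"), ((5.2, 4.8), "yes")]

# Step 2: model usage -- estimate accuracy on an independent test set.
test = [((0.9, 1.1), "no"), ((5.1, 5.1), "yes"), ((4.9, 5.3), "yes")]
correct = sum(nearest_neighbor(train, x) == y for x, y in test)
print(f"accuracy = {correct / len(test):.2f}")
```

Keeping the test set independent of the training set, as the slide insists, is what makes the accuracy estimate meaningful.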
Illustrating Classification Task

Training Set (Induction: a learning algorithm learns a model from labeled records):

   Tid   Attrib1   Attrib2   Attrib3   Class
   1     Yes       Large     125K      No
   2     No        Medium    100K      No
   3     No        Small     70K       No
   4     Yes       Medium    120K      No
   5     No        Large     95K       Yes
   6     No        Medium    60K       No
   7     Yes       Large     220K      No
   8     No        Small     85K       Yes
   9     No        Medium    75K       No
   10    No        Small     90K       Yes

Test Set (Deduction: the learned model is applied to records with unknown class labels):

   Tid   Attrib1   Attrib2   Attrib3   Class
   11    No        Small     55K       ?
   12    Yes       Medium    80K       ?
   13    Yes       Large     110K      ?
   14    No        Small     95K       ?
   15    No        Large     67K       ?
              An Example Data Set and Decision Tree

   #    outlook   company   sailboat   sail?
   3    sunny     med       big        yes
   4    sunny     no        small      yes
   5    sunny     big       big        yes
   6    rainy     no        small      no
   7    rainy     med       small      yes
   8    rainy     big       big        yes
   9    rainy     no        big        no
   10   rainy     med       big        no

Decision tree:
   outlook = sunny  -> yes
   outlook = rainy  -> test company:
      company = no  -> no
      company = big -> yes
      company = med -> test sailboat:
         sailboat = small -> yes
         sailboat = big   -> no
  Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions
  as legitimate or fraudulent
• Classifying whether a driver is stressed or not
• Categorizing news stories as finance, weather, entertainment, sports, etc.
             Issues regarding classification
Issues (1): Data Preparation
 • Data cleaning
     – Preprocess data in order to reduce noise and handle missing values
 • Relevance analysis (feature selection)
     – Remove the irrelevant or redundant attributes
 • Data transformation
     – Generalize and/or normalize data
           Issues regarding classification
Issues (2): Evaluating Classification Methods
• Predictive accuracy
• Speed and scalability
   – time to construct the model
   – time to use the model
• Robustness
   – handling noise and missing values
• Generalization
   – how well the model generalizes to unseen data
• Interpretability:
   – understanding and insight provided by the model
• Goodness of rules
   – decision tree size
   – compactness of classification rules
Error Analysis for a Two Class Problem

[Figure: two class-conditional distributions (Negative and Positive) separated by a
decision threshold f_th, dividing the plane into four regions]
   1: True Negative (TN)
   2: False Negative (FN)
   3: True Positive (TP)
   4: False Positive (FP)
These four counts form the confusion matrix.
Evaluation Criteria
   Accuracy = (TP + TN) / (P + N)
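Accuracy is (TP + TN) / (P + N); this Python sketch tallies the confusion-matrix counts for hypothetical label lists:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally the four confusion-matrix cells for a two-class problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)   # (TP + TN) / (P + N)

y_true = [1, 1, 1, 0, 0, 0]   # hypothetical true labels
y_pred = [1, 1, 0, 0, 0, 1]   # hypothetical model predictions
print(accuracy(y_true, y_pred))   # 4 correct out of 6
```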
          Multi-class classification
▪ Multi-class vs. binary
  classification
   –   one vs. all (one vs. many, one
       vs. rest)
   –   N classes ➔ train N
       classifiers
   –   each classifier uses one class
       for the positive examples and
       the rest classes for the
       negative examples
▪ Combine the results: select the
  classifier with the highest
  confidence score
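A sketch of the one-vs-all combination step; the per-class scorers here are hypothetical stand-ins for the N trained classifiers:

```python
def one_vs_rest_predict(scorers, x):
    """Pick the class whose one-vs-rest classifier is most confident about x.

    `scorers` maps each class label to a function returning a confidence
    score that x belongs to that class (versus the rest).
    """
    return max(scorers, key=lambda label: scorers[label](x))

# Hypothetical per-class scorers: confidence falls with distance from a center.
centers = {"cat": 0.0, "dog": 5.0, "bird": 10.0}
scorers = {label: (lambda x, c=c: -abs(x - c)) for label, c in centers.items()}

print(one_vs_rest_predict(scorers, 4.2))   # "dog": its scorer is most confident
```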
 Bias, variance, generalization error
▪ A model underfits the training data, if it does not
  capture all of the structure available from the data.
  (b)
▪ A model overfits if it captures too many of the
  idiosyncrasies of the training data. (d)
▪ What does it mean to overfit or underfit?
▪ Assume we are doing regression.
▪ Suppose we have a training set
      S_train = {(x^(1), y^(1)), ..., (x^(m), y^(m))}
  from some distribution D.
▪ Define the average training error of a hypothesis h:
      ε_train(h) = (1/m) Σ_{i=1}^{m} (h(x^(i)) − y^(i))^2
▪ We are interested in the generalization error:
      ε(h) = E_{(x,y)~D}[ (h(x) − y)^2 ]
▪ Both underfitting and overfitting lead to high generalization error.
  (previous figure)
(a) linear regression fits of linear function to 3 different training sets
     randomly selected over the interval [0,4] ➔ low variance
(b) linear model after parameters are averaged over 50,000 trials ➔ high bias
     (underestimate in the mid-range, overestimate near the ends)
(c) linear regression fits of fourth-order polynomial to 3 random training sets
     ➔ high variance
(d) model after averaged over 50,000 trials ➔ low bias
▪ We can’t directly find out the generalization error.
▪ Instead, we estimate generalization error using the test error.
      ε_test(h) = (1/m) Σ_{i=1}^{m} (h(x_test^(i)) − y_test^(i))^2
▪ Diagram on the right
  shows how training and
  test error vary as a
  function of model
  complexity.
▪ (e.g.) model complexity:
  degree of polynomial;
  size of decision tree (depth);
  number of features
▪ Define training error as the proportion of training
  examples that are misclassified.
      ε_train(h) = (1/m) Σ_{i=1}^{m} I{ h(x_train^(i)) ≠ y_train^(i) }
where I {} is an indicator function such that I{true}=1,
I{false}=0.
▪ Generalization error is defined as the probability of a
  new example being misclassified
      ε(h) = P_{(x,y)~D}( h(x) ≠ y )
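The 0/1 training error (fraction of misclassified examples) translates directly to code; the threshold classifier below is a hypothetical example:

```python
def training_error(h, xs, ys):
    """Fraction of examples misclassified by hypothesis h (0/1 loss)."""
    return sum(h(x) != y for x, y in zip(xs, ys)) / len(xs)

h = lambda x: 1 if x >= 0.5 else 0   # hypothetical threshold classifier
xs = [0.1, 0.4, 0.6, 0.9, 0.45]
ys = [0, 0, 1, 1, 1]                 # the last example falls on the wrong side
print(training_error(h, xs, ys))     # 1 of 5 misclassified -> 0.2
```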
   Bias and variance in practice
▪ How to choose a model with a good tradeoff between bias and
  variance:
    ▪ what if your learned model gives poor generalization error?
    ▪ collect more data? use fewer features? more features? adopt a different learning
      algorithm?
▪ If your model has high bias, it is too simple.
▪ If your model has high variance, it is too complex.
▪ Compare training and test errors
    ▪ if they are very different, your model is likely to be high variance
    ▪ if they are almost the same, your model is likely to be high bias
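One way to turn the train/test comparison into a rule of thumb (the tolerance values below are arbitrary, illustrative choices, not from the course):

```python
def diagnose(train_err, test_err, gap_tol=0.05, good_err=0.05):
    """Heuristic bias/variance diagnosis from training and test error."""
    if test_err - train_err > gap_tol:
        return "high variance"   # very different errors -> likely overfitting
    if test_err > good_err:
        return "high bias"       # similar but high errors -> likely underfitting
    return "good fit"

print(diagnose(0.01, 0.20))   # large gap: high variance
print(diagnose(0.18, 0.19))   # similar, high errors: high bias
```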
        Regression: Least Squares Fitting
▪ Given: data points, functional form,
  find constants in function
▪ Example: given (xi, yi), find line through them;
  i.e., find a and b in y = ax+b
▪ Example: for fitting a line, minimize the sum of squared residuals
      χ² = Σ_i ( y_i − (a·x_i + b) )²
   [Figure: scatter of data points (x1,y1) through (x7,y7) with the fitted line y = ax + b]
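For a line, the minimization has a closed-form solution; a small self-contained Python sketch (standard normal-equation formulas, not taken from the slides):

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a*x + b."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Points generated from y = 2x + 1 are recovered exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
print(fit_line(xs, ys))   # (2.0, 1.0)
```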
            Function Approximation
▪ You might do this because you actually care
  about those numbers…
  – Example: measure position of falling object,
    fit parabola
   [Figure: measured position vs. time of a falling object, with the fitted parabola
   p = −(1/2)·g·t²; g is estimated from the fit]
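A sketch of estimating g by least squares for the one-parameter model p = c·t² (so g = −2c); the measurements here are synthetic, noise-free data, not real observations:

```python
def estimate_g(times, positions):
    """Least-squares fit of p = -(1/2)*g*t^2 to estimate g.

    For the one-parameter model p = c*t^2, the least-squares
    coefficient is c = sum(p * t^2) / sum(t^4); then g = -2c.
    """
    num = sum(p * t * t for t, p in zip(times, positions))
    den = sum(t ** 4 for t in times)
    return -2.0 * num / den

# Synthetic measurements generated with g = 9.8 m/s^2.
ts = [0.1, 0.2, 0.3, 0.4]
ps = [-0.5 * 9.8 * t * t for t in ts]
print(estimate_g(ts, ps))
```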
  Supervised vs. Unsupervised Learning
• Supervised learning (classification)            teacher provides
                                                   a category label
   – Supervision: The training data (observations,
     measurements, etc.) are accompanied by labels indicating
     the class of the observations
   – New data is classified based on the training set
• Unsupervised learning (clustering)           “natural groupings”
   – The class labels of training data is unknown
   – Given a set of measurements, observations, etc. with the
     aim of establishing the existence of classes or clusters in
     the data
          Clustering
   [Figure: 3-D scatter of people by income, education, and age, showing natural clusters]
         Recognition or Understanding?
•Which of these images are most scenic?
•How can we develop a system to automatically determine
 scenic beauty?
                   Features Are Confusable
• Regions of overlap represent the classification error.
• Error rates can be computed with knowledge of the joint probability distributions.
• Context is used to reduce overlap (e.g., more features).
• In real problems, features are confusable and represent actual variation in the data.
• The traditional role of the signal processing engineer has been to develop better features.
                     Correlation
• Degrees of difficulty: real data is often much harder.
                   Feature Selection - Example
No! The same separability can be achieved by projecting the patterns onto the
blue axis, i.e., only a one-dimensional feature space is needed.
   [Figure: scatter of males and females by height (cm) vs. weight (kg); the two
   classes remain separable after projection onto a single axis]
               The Design Cycle
       Start
   Collect Data       Key issues:
                        • “There is no data like more data.”
                        • Perceptually-meaningful features?
Choose Features
                        • How do we find the best model?
                        • How do we estimate parameters?
  Choose Model
                        • How do we evaluate performance?
                      Goal of the course:
 Train Classifier       • Introduce you to mathematically
                          rigorous ways to train and evaluate
                          models.
Evaluate Classifier
       End
                 Common Mistakes
• I got 100% accuracy on...
▪ Almost any algorithm works some of the time, but few real-world
problems have ever been completely solved.
▪ Training on the evaluation data is forbidden.
▪ Once you use evaluation data, you should discard it.
• My algorithm is better because...
▪ Statistical significance and experimental design play a big role in
determining the validity of a result.
▪ There is always some probability a random choice of an algorithm will
produce a better result.
• Hence, in this course, we will also learn how to evaluate algorithms.