Data Mining:
Concepts and Techniques
                   (3rd ed.)
          — Classification —
  Jiawei Han, Micheline Kamber, and Jian Pei
  University of Illinois at Urbana-Champaign &
            Simon Fraser University
 ©2011 Han, Kamber & Pei. All rights reserved.
             Chapter 8. Classification
   Classification: Basic Concepts
       Classification: Formal Definition
   Given a database D = {t1, t2, …, tn} and a set of
    classes C = {C1, …, Cm}, the classification
    problem is to define a mapping f: D → C where
    each ti is assigned to one class.
   The mapping actually divides D into equivalence classes.
   Prediction is similar, but may be viewed as
    having an infinite number of classes.
   Because the class label of each training tuple
    is provided, this step is also known as
    supervised learning
                                             December 9, 2023
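As a concrete sketch of the definition, with made-up tuples and an illustrative rule standing in for f (names and the score threshold are assumptions, not from the text):

```python
# A minimal sketch of a mapping f: D -> C that assigns each tuple t_i
# to exactly one class. Tuples and the rule are illustrative only.

def f(t):
    """Map a tuple (name, score) to a class in C = {'pass', 'fail'}."""
    name, score = t
    return "pass" if score >= 60 else "fail"

D = [("t1", 85), ("t2", 40), ("t3", 72)]

# f divides D into equivalence classes: tuples sharing the same label.
partition = {}
for t in D:
    partition.setdefault(f(t), []).append(t[0])

print(partition)  # {'pass': ['t1', 't3'], 'fail': ['t2']}
```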
         Classification—A Two-Step Process
   Model construction: describing a set of predetermined classes
      Each tuple/sample is assumed to belong to a predefined class, as
       determined by the class label attribute
       The set of tuples used for model construction is the training set
      The model is represented as classification rules, decision trees, or
       mathematical formulae
   Model usage: for classifying future or unknown objects
      Estimate accuracy of the model
            The known label of each test sample is compared with the classified
             result from the model
           Accuracy rate is the percentage of test set samples that are
            correctly classified by the model
            The test set is independent of the training set; otherwise, over-fitting
             will occur
      If the accuracy is acceptable, use the model to classify data tuples
       whose class labels are not known
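The two steps can be sketched end to end with a deliberately trivial model, a majority-class classifier on made-up data (the tuples and labels below are assumptions for illustration):

```python
# Step 1: model construction on a training set; Step 2: model usage,
# estimating the accuracy rate on an independent test set.
from collections import Counter

train = [("a", "yes"), ("b", "yes"), ("c", "no"), ("d", "yes")]
test  = [("e", "yes"), ("f", "no"), ("g", "yes")]

# Model construction: the "model" here is just the majority training label.
majority = Counter(label for _, label in train).most_common(1)[0][0]

# Model usage: accuracy rate = % of test samples correctly classified.
correct = sum(1 for _, label in test if majority == label)
accuracy = correct / len(test)
print(f"accuracy rate = {accuracy:.2%}")  # 66.67%
```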
        Training, Test and Validation Sets
   Training set: A set of examples used for learning,
    that is to fit the parameters of the classifier.
   Test set: A set of examples used only to assess the
    performance of a fully-specified classifier.
   Validation set: A set of examples used to tune the
    parameters of a classifier, for example to choose the
    number of hidden units in a neural network.
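A minimal sketch of producing the three sets, assuming a 60/20/20 split; the function name, proportions, and fixed seed are illustrative choices, not from the text:

```python
# Shuffle the data, then cut it into training / validation / test sets.
import random

def three_way_split(data, train_frac=0.6, val_frac=0.2, seed=42):
    rng = random.Random(seed)   # fixed seed for reproducibility
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],                    # fit parameters
            shuffled[n_train:n_train + n_val],     # tune hyperparameters
            shuffled[n_train + n_val:])            # final assessment only

data = list(range(100))
train, val, test = three_way_split(data)
print(len(train), len(val), len(test))  # 60 20 20
```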
         Process (1): Model Construction
                                                Classification
                                                 Algorithms
                  Training
                    Data
                                                  Classifier
NAME   RANK             YEARS   TENURED            (Model)
Mike   Assistant Prof     3       no
Mary   Assistant Prof     7       yes
Bill   Professor          2       yes
Jim    Associate Prof     7       yes        IF rank = ‘professor’
Dave   Assistant Prof     6       no         OR years > 6
Anne   Associate Prof     3       no         THEN tenured = ‘yes’
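The induced rule can be written as a function and checked against the training tuples from the slide (a sketch; the function name is illustrative):

```python
# The rule from the slide: IF rank = 'professor' OR years > 6
# THEN tenured = 'yes'.

def tenured(rank, years):
    return "yes" if rank == "professor" or years > 6 else "no"

training = [  # (name, rank, years, tenured) from the slide
    ("Mike", "assistant prof", 3, "no"),
    ("Mary", "assistant prof", 7, "yes"),
    ("Bill", "professor", 2, "yes"),
    ("Jim", "associate prof", 7, "yes"),
    ("Dave", "assistant prof", 6, "no"),
    ("Anne", "associate prof", 3, "no"),
]

# The rule reproduces every training label:
print(all(tenured(r, y) == t for _, r, y, t in training))  # True
```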
   Process (2): Using the Model in Prediction
                                       Classifier
                     Testing
                      Data                            Unseen Data
                                                   (Jeff, Professor, 4)
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof     2       no            Tenured?
Merlisa   Associate Prof     7       no
George    Professor          5       yes
Joseph    Assistant Prof     7       yes
Classification Example
              Classification Examples
   Teachers classify students’ grades as A, B, C, D, or F.
   Identify mushrooms as poisonous or edible.
   Predict when a river will flood.
   Identify individuals with credit risks.
   Speech recognition
   Pattern recognition
             Classification Ex: Grading
   If x >= 90 then grade = A.
   If 80 <= x < 90 then grade = B.
   If 70 <= x < 80 then grade = C.
   If 60 <= x < 70 then grade = D.
   If x < 60 then grade = F.

   [Decision tree: split on x at 90, then 80, then 70, then 60;
    the leaves assign A, B, C, D, and F.]

   What is the grade for a new x?
     Classify x by following the splits from the root.
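Walking the decision tree is equivalent to checking the thresholds in order; as a sketch in Python (the function name is illustrative):

```python
# Each `if` corresponds to one split in the grading decision tree.

def grade(x):
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:
        return "D"
    return "F"

print([grade(x) for x in (95, 85, 75, 65, 50)])  # ['A', 'B', 'C', 'D', 'F']
```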
             Issues: Data Preparation
   Data cleaning
       Preprocess data in order to reduce noise and handle
        missing values
   Relevance analysis (feature selection)
       Remove the irrelevant or redundant attributes
   Data transformation
       Generalize and/or normalize data
     Issues: Evaluating Classification Methods
   Accuracy
      classifier accuracy: predicting the class label
      predictor accuracy: estimating the value of predicted attributes
   Speed
      time to construct the model (training time)
      time to use the model (classification/prediction time)
   Robustness: handling noise and missing values
   Scalability: efficiency in disk-resident databases
   Interpretability
      understanding and insight provided by the model
   Other measures, e.g., goodness of rules, such as decision tree
    size or compactness of classification rules
    Accuracy of Classification Models
   In classification problems, the primary source for
    accuracy estimation is the confusion matrix
                              True Class
                         Positive            Negative

   Predicted   Positive  True Positive       False Positive
     Class               Count (TP)          Count (FP)

               Negative  False Negative      True Negative
                         Count (FN)          Count (TN)

    Accuracy = (TP + TN) / (TP + TN + FP + FN)
    True Positive Rate = TP / (TP + FN)
    True Negative Rate = TN / (TN + FP)
    Precision = TP / (TP + FP)
    Recall = TP / (TP + FN)
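The measures above can be written directly as functions of the four counts; a minimal sketch:

```python
# Confusion-matrix measures as functions of TP, TN, FP, FN.

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def true_positive_rate(tp, fn):   # a.k.a. recall / sensitivity
    return tp / (tp + fn)

def true_negative_rate(tn, fp):   # a.k.a. specificity
    return tn / (tn + fp)

def precision(tp, fp):
    return tp / (tp + fp)

print(accuracy(40, 50, 5, 5))  # 0.9
```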
Estimation Methodologies for Classification
      Simple split (or holdout or test sample estimation)
       Split the data into 2 mutually exclusive sets: training (~70%)
        and testing (~30%)

        [Flow: Preprocessed Data → 2/3 → Training Data → Model
         Development → Classifier; 1/3 → Testing Data → Model
         Assessment (scoring) → Prediction Accuracy]
          For ANN, the data is split into three sub-sets (training [~60%],
           validation [~20%], testing [~20%])
             Confusion Matrix Example
   Suppose we have a binary classification problem where we are trying
    to predict whether an email is spam (positive class) or not spam
    (negative class). We have a dataset with 100 emails, and our model
    predicts the following:
   True Positive (TP): 40 emails were correctly predicted as spam.
   True Negative (TN): 50 emails were correctly predicted as not spam.
   False Positive (FP): 5 emails were incorrectly predicted as spam (they
    are actually not spam).
   False Negative (FN): 5 emails were incorrectly predicted as not spam
    (they are actually spam).
             Confusion Matrix Example
   True Positive (TP) = 40: The model correctly predicted 40 emails as
    spam.
   True Negative (TN) = 50: The model correctly predicted 50 emails as
    not spam.
   False Positive (FP) = 5: The model incorrectly predicted 5 non-spam
    emails as spam.
   False Negative (FN) = 5: The model incorrectly predicted 5 spam
    emails as non-spam.
   Accuracy = (40 + 50) / 100 = 0.90
   Precision = 40 / (40 + 5) ≈ 0.89
   Recall = 40 / (40 + 5) ≈ 0.89
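These figures can be checked with a few lines of Python:

```python
# Checking the spam-filter example (TP = 40, TN = 50, FP = 5, FN = 5).
tp, tn, fp, fn = 40, 50, 5, 5
accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 90 / 100 = 0.90
precision = tp / (tp + fp)                    # 40 / 45 ≈ 0.89
recall    = tp / (tp + fn)                    # 40 / 45 ≈ 0.89
print(accuracy, round(precision, 2), round(recall, 2))  # 0.9 0.89 0.89
```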
                            Example 2
   Let's consider a medical diagnosis example for a disease (D) where a
    model predicts whether a patient has the disease or not:
   True Positive (TP): 90 patients were correctly predicted to have the
    disease.
   True Negative (TN): 885 patients were correctly predicted to not have
    the disease.
   False Positive (FP): 10 patients were incorrectly predicted to have the
    disease (they are actually disease-free).
   False Negative (FN): 15 patients were incorrectly predicted to not
    have the disease (they actually have the disease).
Solution
   Accuracy = (90 + 885) / (90 + 885 + 10 + 15) = 975 / 1000 = 0.975
   Precision = 90 / (90 + 10) = 0.90
   Recall = 90 / (90 + 15) ≈ 0.857
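The medical-diagnosis figures above can likewise be verified in a few lines:

```python
# Checking Example 2 (TP = 90, TN = 885, FP = 10, FN = 15).
tp, tn, fp, fn = 90, 885, 10, 15
accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 975 / 1000 = 0.975
precision = tp / (tp + fp)                    # 90 / 100 = 0.90
recall    = tp / (tp + fn)                    # 90 / 105 ≈ 0.857
print(accuracy, precision, round(recall, 3))
```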
                            Example 3
   Given the following confusion matrix (TP = 100, FP = 10, FN = 5, TN = 50; total = 165)
   Accuracy: Overall, how often is the classifier correct?
      (TP+TN)/total = (100+50)/165 = 0.91
   Misclassification Rate: Overall, how often is it wrong?
      (FP+FN)/total = (10+5)/165 = 0.09
      equivalent to 1 minus Accuracy
      also known as "Error Rate"
                                    Sol.
   True Positive Rate: When it's actually yes, how often does it predict yes?
      TP/actual yes = 100/105 = 0.95
      also known as "Sensitivity" or "Recall"
   False Positive Rate: When it's actually no, how often does it predict yes?
      FP/actual no = 10/60 = 0.17
   True Negative Rate: When it's actually no, how often does it predict no?
      TN/actual no = 50/60 = 0.83
      equivalent to 1 minus False Positive Rate
      also known as "Specificity"
   Precision: When it predicts yes, how often is it correct?
      TP/predicted yes = 100/110 = 0.91
   Prevalence: How often does the yes condition actually occur in our
    sample?
      actual yes/total = 105/165 = 0.64
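All the figures in this worked example follow mechanically from the four counts, which a short script can confirm:

```python
# Checking Example 3 (TP = 100, FN = 5, FP = 10, TN = 50; total = 165).
tp, fn, fp, tn = 100, 5, 10, 50
total = tp + fn + fp + tn

accuracy   = (tp + tn) / total      # 0.91
error_rate = (fp + fn) / total      # 0.09, i.e. 1 - accuracy
tpr        = tp / (tp + fn)         # sensitivity / recall, 0.95
fpr        = fp / (fp + tn)         # 0.17
tnr        = tn / (fp + tn)         # specificity, 0.83
precision  = tp / (tp + fp)         # 0.91
prevalence = (tp + fn) / total      # 0.64

print(round(accuracy, 2), round(tpr, 2), round(precision, 2))
```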
      Issues: Underfitting and Overfitting
   Underfitting and Overfitting are two factors that
    contribute to the poor performance of DM (machine
    learning) systems.
   Underfitting occurs when a model has not learned the
    patterns in the training data well and is unable to
    generalize to new data. An underfit model performs
    poorly on the training data and produces incorrect
    predictions.
   Overfitting occurs when a model performs
    exceptionally well on training data but poorly on test
    data (fresh data).
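The contrast can be made concrete with a toy task; everything below (the task of predicting y = 2x, and the deliberately simplistic "models") is an illustrative assumption, not from the text:

```python
# Underfit model: ignores the input (high bias) -> poor on train and test.
# Overfit model: memorizes training pairs (high variance) -> perfect on
# train, poor on unseen data. A good model captures the actual pattern.

train = [(x, 2 * x) for x in range(10)]
test  = [(x, 2 * x) for x in range(10, 15)]

def mse(model, data):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

mean_y = sum(y for _, y in train) / len(train)

def underfit(x):              # always predicts the mean training target
    return mean_y

table = dict(train)

def overfit(x):               # looks up memorized pairs, guesses 0 otherwise
    return table.get(x, 0.0)

def good(x):                  # the true pattern
    return 2 * x

print(mse(underfit, train), mse(underfit, test))  # poor on both
print(mse(overfit, train), mse(overfit, test))    # 0 on train, poor on test
print(mse(good, train), mse(good, test))          # 0 on both
```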
        Issues: Underfitting and Overfitting
   [Figure: underfitting vs. overfitting]
        Issues: Underfitting and Overfitting
   Reasons for underfitting
       Low variance and high bias
       The training dataset used is too small.
       The model is too simple.
       Training data has not been cleaned and contains noise.
   Techniques to reduce underfitting:
       Increase the model’s complexity.
       Expand the number of features through feature
        engineering.
       Remove the noise from the data.
       Increase the number of epochs or the duration of
        training to improve results.
        Issues: Underfitting and Overfitting
   Reasons for overfitting
       Low bias and high variance
       The model is rather complicated.
       The amount of training data is insufficient
   Techniques to reduce overfitting:
       Increase the training data.
       Model complexity should be reduced.
       Early termination during the training phase.
       The goal is low bias and low variance (check the following link)
     https://www.javatpoint.com/bias-and-variance-in-machine-learning#:~:text=High%2DBias%2C%20Low%2DVariance,underfitting%20problems%20in%20the%20model.