Data Mining:
Concepts and Techniques
                   (3rd ed.)
          — Classification —
  Jiawei Han, Micheline Kamber, and Jian Pei
  University of Illinois at Urbana-Champaign &
            Simon Fraser University
 ©2011 Han, Kamber & Pei. All rights reserved.
             Chapter 8. Classification
   Classification: Basic Concepts
       Classification: Formal Definition
   Given a database D = {t1, t2, …, tn} and a set of
    classes C = {C1, …, Cm}, the classification
    problem is to define a mapping f: D → C where
    each ti is assigned to one class.
   The mapping actually divides D into equivalence classes.
   Prediction is similar, but may be viewed as
    having an infinite number of classes.
   Because the class label of each training tuple
    is provided, this step is also known as
    supervised learning
                                             December 9, 2023
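As a concrete sketch of the definition, with made-up tuples and an illustrative rule standing in for f (names and the score threshold are assumptions, not from the text):

```python
# A minimal sketch of a mapping f: D -> C that assigns each tuple t_i
# to exactly one class. Tuples and the rule are illustrative only.

def f(t):
    """Map a tuple (name, score) to a class in C = {'pass', 'fail'}."""
    name, score = t
    return "pass" if score >= 60 else "fail"

D = [("t1", 85), ("t2", 40), ("t3", 72)]

# f divides D into equivalence classes: tuples sharing the same label.
partition = {}
for t in D:
    partition.setdefault(f(t), []).append(t[0])

print(partition)  # {'pass': ['t1', 't3'], 'fail': ['t2']}
```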
         Classification—A Two-Step Process
   Model construction: describing a set of predetermined classes
      Each tuple/sample is assumed to belong to a predefined class, as
       determined by the class label attribute
       The set of tuples used for model construction is the training set
      The model is represented as classification rules, decision trees, or
       mathematical formulae
   Model usage: for classifying future or unknown objects
      Estimate accuracy of the model
            The known label of each test sample is compared with the classified
             result from the model
           Accuracy rate is the percentage of test set samples that are
            correctly classified by the model
            The test set is independent of the training set; otherwise, over-fitting
             will occur
      If the accuracy is acceptable, use the model to classify data tuples
       whose class labels are not known
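The two steps can be sketched end to end with a deliberately trivial model, a majority-class classifier on made-up data (the tuples and labels below are assumptions for illustration):

```python
# Step 1: model construction on a training set; Step 2: model usage,
# estimating the accuracy rate on an independent test set.
from collections import Counter

train = [("a", "yes"), ("b", "yes"), ("c", "no"), ("d", "yes")]
test  = [("e", "yes"), ("f", "no"), ("g", "yes")]

# Model construction: the "model" here is just the majority training label.
majority = Counter(label for _, label in train).most_common(1)[0][0]

# Model usage: accuracy rate = % of test samples correctly classified.
correct = sum(1 for _, label in test if majority == label)
accuracy = correct / len(test)
print(f"accuracy rate = {accuracy:.2%}")  # 66.67%
```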
        Training, Test and Validation Sets
   Training set: A set of examples used for learning,
    that is to fit the parameters of the classifier.
   Test set: A set of examples used only to assess the
    performance of a fully-specified classifier.
   Validation set: A set of examples used to tune the
    parameters of a classifier, for example to choose the
    number of hidden units in a neural network.
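A minimal sketch of producing the three sets, assuming a 60/20/20 split; the function name, proportions, and fixed seed are illustrative choices, not from the text:

```python
# Shuffle the data, then cut it into training / validation / test sets.
import random

def three_way_split(data, train_frac=0.6, val_frac=0.2, seed=42):
    rng = random.Random(seed)   # fixed seed for reproducibility
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],                    # fit parameters
            shuffled[n_train:n_train + n_val],     # tune hyperparameters
            shuffled[n_train + n_val:])            # final assessment only

data = list(range(100))
train, val, test = three_way_split(data)
print(len(train), len(val), len(test))  # 60 20 20
```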
         Process (1): Model Construction
                                                Classification
                                                 Algorithms
                  Training
                    Data
                                                  Classifier
NAME   RANK             YEARS   TENURED            (Model)
Mike   Assistant Prof     3       no
Mary   Assistant Prof     7       yes
Bill   Professor          2       yes
Jim    Associate Prof     7       yes        IF rank = ‘professor’
Dave   Assistant Prof     6       no         OR years > 6
Anne   Associate Prof     3       no         THEN tenured = ‘yes’
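The induced rule can be written as a function and checked against the training tuples from the slide (a sketch; the function name is illustrative):

```python
# The rule from the slide: IF rank = 'professor' OR years > 6
# THEN tenured = 'yes'.

def tenured(rank, years):
    return "yes" if rank == "professor" or years > 6 else "no"

training = [  # (name, rank, years, tenured) from the slide
    ("Mike", "assistant prof", 3, "no"),
    ("Mary", "assistant prof", 7, "yes"),
    ("Bill", "professor", 2, "yes"),
    ("Jim", "associate prof", 7, "yes"),
    ("Dave", "assistant prof", 6, "no"),
    ("Anne", "associate prof", 3, "no"),
]

# The rule reproduces every training label:
print(all(tenured(r, y) == t for _, r, y, t in training))  # True
```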
   Process (2): Using the Model in Prediction
                                       Classifier
                     Testing
                      Data                            Unseen Data
                                                   (Jeff, Professor, 4)
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof     2       no            Tenured?
Merlisa   Associate Prof     7       no
George    Professor          5       yes
Joseph    Assistant Prof     7       yes
Classification Example
              Classification Examples
   Teachers classify students’ grades as A, B, C, D, or F.
   Identify mushrooms as poisonous or edible.
   Predict when a river will flood.
   Identify individuals with credit risks.
   Speech recognition
   Pattern recognition
             Classification Ex: Grading
   If x >= 90 then grade = A.
   If 80 <= x < 90 then grade = B.
   If 70 <= x < 80 then grade = C.
   If 60 <= x < 70 then grade = D.
   If x < 60 then grade = F.

   [Decision tree: split on x at 90, then 80, then 70, then 60;
    the leaves assign A, B, C, D, and F.]

   What is the grade for a new x?
     Classify x by following the splits from the root.
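Walking the decision tree is equivalent to checking the thresholds in order; as a sketch in Python (the function name is illustrative):

```python
# Each `if` corresponds to one split in the grading decision tree.

def grade(x):
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:
        return "D"
    return "F"

print([grade(x) for x in (95, 85, 75, 65, 50)])  # ['A', 'B', 'C', 'D', 'F']
```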
             Issues: Data Preparation
   Data cleaning
       Preprocess data in order to reduce noise and handle
        missing values
   Relevance analysis (feature selection)
       Remove the irrelevant or redundant attributes
   Data transformation
       Generalize and/or normalize data
     Issues: Evaluating Classification Methods
   Accuracy
      classifier accuracy: predicting the class label
      predictor accuracy: estimating the value of predicted attributes
   Speed
      time to construct the model (training time)
      time to use the model (classification/prediction time)
   Robustness: handling noise and missing values
   Scalability: efficiency in disk-resident databases
   Interpretability
      understanding and insight provided by the model
   Other measures, e.g., goodness of rules, such as decision tree
    size or compactness of classification rules
    Accuracy of Classification Models
   In classification problems, the primary source for
    accuracy estimation is the confusion matrix
                              True Class
                         Positive            Negative

   Predicted   Positive  True Positive       False Positive
     Class               Count (TP)          Count (FP)

               Negative  False Negative      True Negative
                         Count (FN)          Count (TN)

    Accuracy = (TP + TN) / (TP + TN + FP + FN)
    True Positive Rate = TP / (TP + FN)
    True Negative Rate = TN / (TN + FP)
    Precision = TP / (TP + FP)
    Recall = TP / (TP + FN)
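The measures above can be written directly as functions of the four counts; a minimal sketch:

```python
# Confusion-matrix measures as functions of TP, TN, FP, FN.

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def true_positive_rate(tp, fn):   # a.k.a. recall / sensitivity
    return tp / (tp + fn)

def true_negative_rate(tn, fp):   # a.k.a. specificity
    return tn / (tn + fp)

def precision(tp, fp):
    return tp / (tp + fp)

print(accuracy(40, 50, 5, 5))  # 0.9
```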
Estimation Methodologies for Classification
      Simple split (or holdout or test sample estimation)
       Split the data into 2 mutually exclusive sets: training (~70%)
        and testing (~30%)

        [Flow: Preprocessed Data → 2/3 → Training Data → Model
         Development → Classifier; 1/3 → Testing Data → Model
         Assessment (scoring) → Prediction Accuracy]
          For ANN, the data is split into three sub-sets (training [~60%],
           validation [~20%], testing [~20%])
             Confusion Matrix Example
   Suppose we have a binary classification problem where we are trying
    to predict whether an email is spam (positive class) or not spam
    (negative class). We have a dataset with 100 emails, and our model
    predicts the following:
   True Positive (TP): 40 emails were correctly predicted as spam.
   True Negative (TN): 50 emails were correctly predicted as not spam.
   False Positive (FP): 5 emails were incorrectly predicted as spam (they
    are actually not spam).
   False Negative (FN): 5 emails were incorrectly predicted as not spam
    (they are actually spam).
             Confusion Matrix Example
   True Positive (TP) = 40: The model correctly predicted 40 emails as
    spam.
   True Negative (TN) = 50: The model correctly predicted 50 emails as
    not spam.
   False Positive (FP) = 5: The model incorrectly predicted 5 non-spam
    emails as spam.
   False Negative (FN) = 5: The model incorrectly predicted 5 spam
    emails as non-spam.
   Accuracy = (40 + 50) / 100 = 0.90
   Precision = 40 / (40 + 5) ≈ 0.89
   Recall = 40 / (40 + 5) ≈ 0.89
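These figures can be checked with a few lines of Python:

```python
# Checking the spam-filter example (TP = 40, TN = 50, FP = 5, FN = 5).
tp, tn, fp, fn = 40, 50, 5, 5
accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 90 / 100 = 0.90
precision = tp / (tp + fp)                    # 40 / 45 ≈ 0.89
recall    = tp / (tp + fn)                    # 40 / 45 ≈ 0.89
print(accuracy, round(precision, 2), round(recall, 2))  # 0.9 0.89 0.89
```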
                            Example 2
   Let's consider a medical diagnosis example for a disease (D) where a
    model predicts whether a patient has the disease or not:
   True Positive (TP): 90 patients were correctly predicted to have the
    disease.
   True Negative (TN): 885 patients were correctly predicted to not have
    the disease.
   False Positive (FP): 10 patients were incorrectly predicted to have the
    disease (they are actually disease-free).
   False Negative (FN): 15 patients were incorrectly predicted to not
    have the disease (they actually have the disease).
Solution
   Accuracy = (90 + 885) / (90 + 885 + 10 + 15) = 975 / 1000 = 0.975
   Precision = 90 / (90 + 10) = 0.90
   Recall = 90 / (90 + 15) ≈ 0.857
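The medical-diagnosis figures above can likewise be verified in a few lines:

```python
# Checking Example 2 (TP = 90, TN = 885, FP = 10, FN = 15).
tp, tn, fp, fn = 90, 885, 10, 15
accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 975 / 1000 = 0.975
precision = tp / (tp + fp)                    # 90 / 100 = 0.90
recall    = tp / (tp + fn)                    # 90 / 105 ≈ 0.857
print(accuracy, precision, round(recall, 3))
```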
                            Example 3
   Given the following confusion matrix (TP = 100, FP = 10, FN = 5, TN = 50; total = 165)
   Accuracy: Overall, how often is the classifier correct?
      (TP+TN)/total = (100+50)/165 = 0.91
   Misclassification Rate: Overall, how often is it wrong?
      (FP+FN)/total = (10+5)/165 = 0.09
      equivalent to 1 minus Accuracy
      also known as "Error Rate"
                                    Sol.
   True Positive Rate: When it's actually yes, how often does it predict yes?
      TP/actual yes = 100/105 = 0.95
      also known as "Sensitivity" or "Recall"
   False Positive Rate: When it's actually no, how often does it predict yes?
      FP/actual no = 10/60 = 0.17
   True Negative Rate: When it's actually no, how often does it predict no?
      TN/actual no = 50/60 = 0.83
      equivalent to 1 minus False Positive Rate
      also known as "Specificity"
   Precision: When it predicts yes, how often is it correct?
      TP/predicted yes = 100/110 = 0.91
   Prevalence: How often does the yes condition actually occur in our
    sample?
      actual yes/total = 105/165 = 0.64
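All the figures in this worked example follow mechanically from the four counts, which a short script can confirm:

```python
# Checking Example 3 (TP = 100, FN = 5, FP = 10, TN = 50; total = 165).
tp, fn, fp, tn = 100, 5, 10, 50
total = tp + fn + fp + tn

accuracy   = (tp + tn) / total      # 0.91
error_rate = (fp + fn) / total      # 0.09, i.e. 1 - accuracy
tpr        = tp / (tp + fn)         # sensitivity / recall, 0.95
fpr        = fp / (fp + tn)         # 0.17
tnr        = tn / (fp + tn)         # specificity, 0.83
precision  = tp / (tp + fp)         # 0.91
prevalence = (tp + fn) / total      # 0.64

print(round(accuracy, 2), round(tpr, 2), round(precision, 2))
```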
      Issues: Underfitting and Overfitting
   Underfitting and Overfitting are two factors that
    contribute to the poor performance of DM (machine
    learning) systems.
   Underfitting occurs when a model has not learned the
    patterns in the training data well and is unable to
    generalize to new data. An underfit model performs
    poorly on the training data and produces incorrect
    predictions.
   Overfitting occurs when a model performs
    exceptionally well on training data but poorly on test
    data (fresh data).
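The contrast can be made concrete with a toy task; everything below (the task of predicting y = 2x, and the deliberately simplistic "models") is an illustrative assumption, not from the text:

```python
# Underfit model: ignores the input (high bias) -> poor on train and test.
# Overfit model: memorizes training pairs (high variance) -> perfect on
# train, poor on unseen data. A good model captures the actual pattern.

train = [(x, 2 * x) for x in range(10)]
test  = [(x, 2 * x) for x in range(10, 15)]

def mse(model, data):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

mean_y = sum(y for _, y in train) / len(train)

def underfit(x):              # always predicts the mean training target
    return mean_y

table = dict(train)

def overfit(x):               # looks up memorized pairs, guesses 0 otherwise
    return table.get(x, 0.0)

def good(x):                  # the true pattern
    return 2 * x

print(mse(underfit, train), mse(underfit, test))  # poor on both
print(mse(overfit, train), mse(overfit, test))    # 0 on train, poor on test
print(mse(good, train), mse(good, test))          # 0 on both
```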
        Issues: Underfitting and Overfitting
   [Figure: underfitting vs. overfitting]
        Issues: Underfitting and Overfitting
   Reasons for underfitting
       Low variance and high bias
       The training dataset used is too small.
       The model is too simple.
       Training data has not been cleaned and contains noise.
   Techniques to reduce underfitting:
       Increase the model’s complexity.
       Expand the number of features through feature
        engineering.
       Remove the noise from the data.
       Increase the number of epochs or the duration of
        training to improve results.
        Issues: Underfitting and Overfitting
   Reasons for overfitting
       Low bias and high variance
       The model is rather complicated.
       The amount of training data is insufficient
   Techniques to reduce overfitting:
       Increase the training data.
       Model complexity should be reduced.
       Early termination during the training phase.
       The goal is low bias and low variance (check the following link)
     https://www.javatpoint.com/bias-and-variance-in-machine-learning#:~:text=High%2DBias%2C%20Low%2DVariance,underfitting%20problems%20in%20the%20model.