Week 4

Data Warehouse

Learning objectives
Upon completion, you will be able to:
• Explain data warehouses and the metadata repository.
• Explain the differences between OLTP and OLAP.
• Identify different application areas of the three data warehouse models: enterprise warehouse, data mart, and
virtual data warehouse.
• Explain star, snowflake, and fact constellation schemas for data warehouses.
• Explain the use of the star-net query model to support roll-up and drill-down operations.
• Explain what a data cube and a cuboid are.
Introduction to Data Warehouse

Agenda
In this session, we will discuss:
● Basic concepts of data warehouse
● OLTP vs OLAP
● Metadata Repository
● Need for a data warehouse

What is a data warehouse?
● Defined in many different ways, but not rigorously.
○ A decision support database maintained separately from the organization’s operational database.
○ Support decision making by providing a solid platform of consolidated, historical data for analysis.
● “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support
of management’s decision-making process.”—W. H. Inmon
○ A physically separate store of data transformed from the operational environment.
○ Operational update of data does not occur in the data warehouse environment.
○ Does not require transaction processing, recovery, or concurrency control mechanisms.
○ Requires only two operations in data accessing: initial loading of data and access to data.
○ “Data warehousing”:
■ The process of constructing and using data warehouses.

Metadata Repository
● Metadata is the data defining warehouse objects. It stores:
● Description of the structure of the data warehouse:
○ Schema, view, dimensions, concept hierarchies, derived data definition, data mart locations, and
contents.
● Operational metadata
○ Data lineage (history of migrated data and transformation path), the currency of data (active,
archived, or purged), monitoring information (warehouse usage statistics, error reports, audit
trails).
● The algorithms used for summarization


● The mapping from the operational environment to the data warehouse
● Data related to system performance
● Business data
○ business terms and definitions, ownership of data, charging policies
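
To make these categories concrete, here is a minimal, hedged sketch of what a single metadata entry might look like; the field names are illustrative only, not a specific product's repository schema.

```python
# Illustrative only: one possible shape for an operational-metadata record.
table_metadata = {
    "object": "sales_fact",                              # warehouse object being described
    "schema": {"time_key": "int", "item_key": "int", "dollars_sold": "decimal(12,2)"},
    "lineage": ["extracted from orders_db.orders", "amounts normalized to USD"],
    "currency_of_data": "active",                        # active / archived / purged
    "usage_stats": {"queries_last_30_days": 1240, "error_reports": 2},
    "business": {"owner": "sales_analytics_team", "charging_policy": "per-query"},
}
```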

Online Transactional Processing vs Online Analytical Processing

             OLTP (operational DBMS)                  OLAP (data warehouse)
Users        Clerk, IT professional                   Knowledge worker
Function     Day-to-day operations                    Decision support
DB design    Application-oriented                     Subject-oriented
Data         Current, up-to-date, detailed,           Historical, summarized, multidimensional,
             flat relational, isolated                integrated, consolidated
Usage        Repetitive                               Ad hoc
Access       Read/write, index/hash on primary key    Lots of scans

OLTP vs OLAP

                      OLTP                          OLAP
Unit of work          Short, simple transaction     Complex query
# Records accessed    Tens                          Millions
# Users               Thousands                     Hundreds
DB size               100 MB to GB                  100 GB to TB
Metric                Transaction throughput        Query throughput, response time

Need for a Separate Data Warehouse
● High performance for both systems
○ DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery
○ Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
● Different functions and different data:
○ Missing data: Decision support requires historical data, which operational DBs do not typically
maintain.
○ Data consolidation: DS requires consolidation (aggregation, summarization) of data from
heterogeneous sources.
○ Data quality: different sources typically use inconsistent data representations, codes, and formats,
which have to be reconciled.

Note: More and more systems perform OLAP analysis directly on relational databases.

Summary
Here is a quick recap:
● We discussed that the data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile data
collection.
● We talked about metadata, which is the data defining warehouse objects.
● We learned the differences between OLTP and OLAP, such as OLTP having an application-oriented DB design,
whereas OLAP has a subject-oriented DB design.
● We discussed the need for a data warehouse separated from day-to-day operational systems.
Data Warehouse Models and ETL

Agenda
In this session, we will discuss:
● Multi-tiered architecture of a data warehouse
● Extraction, Transformation, and Loading (ETL) operations
● Three data warehouse models

Multi-tiered Architecture
[Figure: three-tier data warehouse architecture. Bottom tier (data sources and data storage): operational DBs and other
sources feed the data warehouse and data marts through extract, transform, load, and refresh operations, coordinated by a
monitor & integrator and described in a metadata repository. Middle tier: the OLAP server/engine. Top tier: front-end tools
for query, reports, analysis, and data mining.]
Extraction, Transformation, and Loading (ETL)
● Data extraction
○ get data from multiple, heterogeneous, and external sources.
● Data cleaning
○ detect errors in the data and rectify them when possible.
● Data transformation
○ Convert data from legacy or host format to warehouse format.
● Load
○ Sort, summarize, consolidate, compute views, check integrity, and build indexes and partitions.
● Refresh
○ Propagate the updates from the data sources to the warehouse.
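
A minimal, hedged sketch of these steps in Python follows; the file names, field names, and cleaning rules are hypothetical and only illustrate the flow from extraction through load (refresh would rerun the same pipeline).

```python
import csv

def extract(paths):
    """Extraction: pull rows from multiple heterogeneous CSV sources."""
    for path in paths:
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

def clean_and_transform(rows):
    """Cleaning + transformation: detect/repair errors and convert to the warehouse format."""
    for row in rows:
        try:
            yield {"country": row["country"].strip().upper(),
                   "amount_usd": round(float(row["amount"]), 2)}
        except (KeyError, ValueError):
            continue  # cleaning: drop records that cannot be rectified

def load(rows):
    """Load: consolidate (summarize by country) before writing to the warehouse."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount_usd"]
    return totals

# Refresh would rerun the same pipeline to propagate source updates, e.g.:
# warehouse_totals = load(clean_and_transform(extract(["source_a.csv", "source_b.csv"])))
```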

Three Data Warehouse Models
● Enterprise warehouse
○ collects all of the information about subjects spanning the entire organization.
● Data Mart
○ A subset of corporate-wide data that is of value to specific groups of users. Its scope is confined to
specific, selected groups, such as marketing data mart.
○ Independent vs. dependent (directly from warehouse) data mart
● Virtual warehouse
○ A set of views over operational databases.
○ Only some of the possible summary views may be materialized.

Summary
Here is a quick recap:
● We reviewed the multi-tiered architecture of a data warehouse.
● We discussed three Data Warehouse Models: Enterprise warehouse, Data Mart, and Virtual warehouse.
● We learned the Extraction, Transformation, and Loading (ETL) process, which includes:
o Data extraction: To get data from multiple, heterogeneous, and external sources.
o Data cleaning: To detect errors in the data and rectify them.
o Data transformation: To convert data from legacy or host format to warehouse format.
o Load: To sort, summarize, consolidate, compute views, check integrity, and build indexes and partitions.

Data Cubes

Agenda
In this session, we will discuss:
● Introduction to data cubes
● Three types of measures used in a data cube
● How a cube is related to cuboids
● How data cubes support OLAP operations: roll-up, drill-down, dice, slice, and pivot

Data Cubes
● A data warehouse is based on a multidimensional data model which views data in the form of a data cube.
● Data is represented as data cubes.
● A data cube records some measures along relevant dimensions. A sales data cube may record the sales
amount of items along dimensions such as date, country, and product type.
● Data cubes can be implemented as multi-dimensional arrays or as relational tables.
● In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which
holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
A Cube is a lattice of cuboids

[Figure: the lattice of cuboids that forms a data cube, from the apex (0-D) cuboid down to the base (n-D) cuboid.]
A Sample Data Cube

[Figure: a sample 3-D sales data cube.]
Data Cube Measures: Three categories
● Distributive: if the result derived by applying the function to n aggregate values is the same as that derived
by applying the function on all the data without partitioning.
○ E.g., count(), sum(), min(), max()
● Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer),
each of which is obtained by applying a distributive aggregate function.
○ E.g., avg(), min_N(), standard_deviation()
● Holistic: if there is no constant bound on the storage size needed to describe a sub aggregate.
○ E.g., median(), mode(), rank()
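
The distinction can be checked directly: in the hedged sketch below (with made-up numbers), partition-wise sums recombine into the global sum, an average can be rebuilt from per-partition sums and counts, but a median of partition medians generally differs from the true median.

```python
import statistics

data = [1, 2, 3, 10, 11, 12, 13, 100]           # illustrative measure values
parts = [data[:4], data[4:]]                    # two partitions of the same data

# Distributive: sum() of partial sums equals the global sum.
print(sum(sum(p) for p in parts) == sum(data))                       # True

# Algebraic: avg() is rebuilt from two distributive values per partition (sum and count).
total, count = sum(sum(p) for p in parts), sum(len(p) for p in parts)
print(total / count == statistics.mean(data))                        # True

# Holistic: the median of partition medians is generally NOT the global median.
print(statistics.median([statistics.median(p) for p in parts]),      # 7.5
      statistics.median(data))                                       # 10.5
```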

Data Cubes implemented as relational tables
● Data schemas used to construct data cubes are somewhat different from those used to construct operational
relational databases, although the underlying entity-relation modeling is the same.
● A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
○ Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year).
○ Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables.
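
As a hedged illustration of this fact/dimension layout (with made-up rows, not the course dataset), the pandas sketch below joins a tiny sales fact table to two dimension tables and rolls the measures up to quarter and product type.

```python
import pandas as pd

# Dimension tables (illustrative data).
item = pd.DataFrame({"item_key": [1, 2], "brand": ["A", "B"], "type": ["laptop", "phone"]})
time = pd.DataFrame({"time_key": [10, 11], "quarter": ["Q1", "Q2"], "year": [2024, 2024]})

# Fact table: foreign keys to the dimensions plus the measures.
sales = pd.DataFrame({"time_key": [10, 10, 11], "item_key": [1, 2, 1],
                      "units_sold": [5, 3, 7], "dollars_sold": [5000, 1800, 7000]})

# A star-style query: join the fact table to its dimensions, then roll up to (quarter, type).
rollup = (sales.merge(time, on="time_key").merge(item, on="item_key")
               .groupby(["quarter", "type"])[["units_sold", "dollars_sold"]].sum())
print(rollup)
```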

Conceptual Modeling of Data Warehouses

● Modeling data warehouses using dimension tables and fact tables


○ Star schema: A fact table in the middle connected to a set of dimension tables.
○ Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape similar to snowflake.
○ Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore
called a galaxy schema or a fact constellation.

Example of Star Schema
[Figure: star schema of a sales data warehouse.]

Sales fact table: time_key, item_key, branch_key, location_key (foreign keys to the dimension tables);
measures: units_sold, dollars_sold, avg_sales.

Dimension tables:
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city, state_or_province, country)
Example of Snowflake Schema
[Figure: snowflake schema of a sales data warehouse.]

Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales.

Dimension tables (some hierarchies normalized into smaller tables):
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_key) → supplier (supplier_key, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city_key) → city (city_key, city, state_or_province)
Example of Fact Constellation
[Figure: fact constellation (galaxy) schema with two fact tables sharing dimension tables.]

Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales.
Shipping fact table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped.

Shared dimension tables: time (time_key, day, day_of_the_week, month, quarter, year), item (item_key, item_name,
brand, type, supplier_type), branch (branch_key, branch_name, branch_type), location (location_key, street, city,
state_or_province, country), plus shipper (shipper_key, shipper_name, location_key, shipper_type).
A Star-Net Query Model

[Figure: a star-net query model in which each radial line represents a dimension and each point on a line represents a level
of that dimension's concept hierarchy.]
OLAP and Star-Net (video)

Summary
Here is a quick recap:
● We discussed that data cubes provide a multi-dimensional view of data in a data warehouse.
● We talked about the architecture of a data warehouse using a pictorial representation.
● We discussed the three different categories of data cube measures: distributive, algebraic, and holistic.
● We discussed that a cube is a lattice of cuboids, using a pictorial representation.
● We discussed the star, snowflake, and fact constellation schemas used to store the data.
● We learned that the Star-Net Query Model is used to support OLAP queries.
Introduction to Classification

Learning objectives
Upon completion, you will be able to:

• Explain classification along with model construction, model evaluation, and performance improvement
techniques.
• Explain decision trees and splitting scenarios.
• Illustrate the concepts of entropy, information gain, and Gini impurity.
• Explain the concepts of pruning.
• Illustrate the Naïve Bayes classification method.
• Explain model evaluation and selection.
• List techniques to improve classification effectiveness.

Classification Overview

Agenda
In this session, we will discuss:
● Supervised vs Unsupervised learning
● Classification vs Regression
● Classification - Model construction
● Classification - Model Evaluation

Supervised vs Unsupervised Learning
● Supervised learning (e.g., classification)
○ Supervision: The training data (observations, measurements, etc.) is accompanied by labels indicating
the class of the observations.
○ New data is classified based on the model derived from the training set.
○ Risk of overfitting: the model fits the training set too well and does not perform well on new data.
● Unsupervised learning (e.g., clustering, association rules)
○ Do not need class labels
○ Discover useful patterns in the dataset (clusters, association rules, features).
○ Need to verify the patterns learned are really patterns and not accidental.

Classification vs Regression
● Classification
○ Predicts categorical class labels (nominal).
○ The number of labels/classes predicted can range from 1 to many.
○ 1 class: one-class classification/description
○ 2 classes: binary classification problem
○ N classes: multiple-class classification problem
○ One observation can have one or multiple labels (product rankings).
○ Many problems can be represented as classification problems.
○ Election: win or lose, which candidates will win?
○ Spending habits: conservative, impulsive, compulsive spending
○ Handwriting identification: 26 letters or millions of words
○ Goodness is measured by how well predicted labels match the given labels (“gold standard”).

Classification vs Regression
● Regression (Numeric Prediction)
○ Models continuous-valued functions, i.e., predict unknown values.
○ Goodness is measured by the closeness of the predicted value to the true value: Mean Squared Error
or Mean Absolute Error.
● Many applications
○ Next season’s crop yields
○ Tomorrow’s stock prices

Classification - A Two-Step Process
● Learning step (induction): construct the model
● Classification step (deduction): using the model to classify new data
● Model construction: Describing a set of predetermined classes
○ Each tuple/sample belongs to a predefined class, as determined by the class label attribute.
○ The set of tuples used for model construction is a training set (more general, “development set”).
○ The set of tuples used for tuning the model hyperparameters or preventing overfitting is the validation set (optional).
○ Training and validation sets are from the same labeled set.
○ Iterate between evaluating model effectiveness and tuning the parameters and/or features.
○ The model is represented as classification rules, decision trees, mathematical formulae, etc.

Classification - A Two-Step Process
● Model evaluation: Estimate model performance on new data
○ Evaluate the trained model on a separate test set, in the same way as evaluation during the
construction
■ The known label of the test sample is compared with the classified result from the model.
■ Performance is measured by the percentage of test set samples correctly classified by the model
(many metrics).
■ Test set is independent of training or validation sets (otherwise biased).
● If the performance is acceptable, use the model to classify new data.
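
A hedged scikit-learn sketch of this two-step protocol follows; the Iris data, the 60/20/20 split, and tree depth as the tuned hyperparameter are illustrative choices, not part of the course material.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out an independent test set first; split the rest into training and validation.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.25, random_state=0)

# Learning step: fit candidate models and tune a hyperparameter on the validation set only.
best_depth = max(range(1, 6),
                 key=lambda d: DecisionTreeClassifier(max_depth=d, random_state=0)
                               .fit(X_train, y_train).score(X_val, y_val))

# Classification step: the chosen model is evaluated once on the untouched test set.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_dev, y_dev)
print("chosen depth:", best_depth, "test accuracy:", round(final.score(X_test, y_test), 3))
```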

Classification - Model Construction Process

[Figure: model construction — the training & validation data and a classification algorithm (applied over several iterations of
tuning) produce a classifier or model.]

Training data (attributes/features plus the class label):

Name    Rank             Years   Tenured
Mike    Assistant Prof   3       no
Mary    Assistant Prof   7       yes
Bill    Professor        2       yes
Jim     Associate Prof   7       yes
Dave    Assistant Prof   6       no
Anne    Associate Prof   3       no

Learned classifier (as a rule):
IF rank = ‘Professor’ OR years > 6 THEN tenured = ‘yes’
Classification - Model Test and Deploy

Trained Classifier pr
Test Data Trained Model

deploy
sainathgunda99@gmail.com
DLZNK464L9
Name Rank Years Tenure

Tom Assistant Prof 2 no Unseen Data

Merlisa Professor 7 yes


Jeff, Professor,4
George Professor 2 yes

Joseph Associate Prof 7 yes


Tenured?

This file is meant for personal use by sainathgunda99@gmail.com only.


Yes
Proprietary content. ©University of Arizona. All Rights Reserved. Unauthorized use or distribution prohibited.
Sharing or publishing the contents in part or full is liable for legal action.
Summary
Here is a quick recap:
● We discussed that supervised learning needs a class label to train a model, whereas unsupervised learning
does not need a class label.
● We compared classification and regression: classification predicts nominal class labels, whereas regression
models continuous-valued functions, i.e., predicts unknown numeric values.
● We talked about how model construction uses training and validation datasets.
● We discussed model evaluation on a separate test dataset before deployment to classify future/unseen
objects.
● We talked about the risk of overfitting in supervised learning and the need to control that risk.

Introduction to Decision Tree

Agenda
In this session, we will discuss:
● Introduction to Decision Tree
● Algorithm for Decision Tree induction
● Splitting scenario for Decision Tree

Decision Trees
● Decision trees can be used to solve both classification and regression problems. The name implies that it
employs a flowchart-like tree structure to display the predictions that result from a series of feature-based
splits. It begins with a root node and ends with a decision made by the leaves.
● Decision trees can approximate any classification/regression function arbitrarily closely.
● Risks overfitting
● Many decision tree algorithms differ in two aspects:
○ Attribute selection measures
○ Tree pruning methods

Decision Tree: An Introductory Example

[Figure: an introductory decision tree example.]
Algorithm for Decision Tree Induction
● Basic (Hunt’s) algorithm (a greedy algorithm)
○ Tree is constructed in a top-down, recursive, divide-and-conquer manner.
○ At the start, all the training examples are at the root.
○ Examples are partitioned/split recursively based on selected attributes.
○ Splitting attributes are selected on the basis of a heuristic or statistical measure (e.g., information
gain).
● Condition for stopping partitioning when growing a full tree (one of these)
○ All samples for a given node belong to the same class.


○ There are no remaining attributes for further partitioning: the most frequent class among the node's samples is
used to label this leaf node.
○ There are no samples for the node: use the most frequent class label of the parent node.
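
A minimal, hedged Python sketch of this greedy, recursive procedure is shown below for categorical attributes only; it uses information gain as the selection measure and omits pruning, continuous splits, and the empty-partition case.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def grow_tree(rows, labels, attrs):
    """rows: list of dicts of categorical attribute values; labels: class labels."""
    if len(set(labels)) == 1:                    # stop: all samples in one class
        return labels[0]
    if not attrs:                                # stop: no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]

    def info_after_split(a):                     # conditional entropy after splitting on a
        parts = {}
        for row, y in zip(rows, labels):
            parts.setdefault(row[a], []).append(y)
        return sum(len(p) / len(labels) * entropy(p) for p in parts.values())

    best = min(attrs, key=info_after_split)      # max gain == min conditional entropy
    node = {"split_on": best, "branches": {}}
    for value in {row[best] for row in rows}:    # one branch per observed value
        subset = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        node["branches"][value] = grow_tree([r for r, _ in subset],
                                            [y for _, y in subset],
                                            [a for a in attrs if a != best])
    return node
```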

Decision Tree: Splitting results in partitions that are ‘purer’

Splitting Scenarios

[Figure: three splitting scenarios — a discrete-valued attribute (one branch per value), a continuous-valued attribute (binary
split on a threshold), and a discrete-valued attribute split into a binary tree (one branch per subset of values).]
Summary
Here is a quick recap:
● We talked about decision trees, a flowchart-like tree structure used to solve classification and regression
problems.
● We discussed a greedy algorithm in which a decision tree is constructed in a top-down, recursive, divide-and-
conquer manner.
● We learned splitting results in partitions (child nodes) that are purer in class composition than the parent
node.

● We reviewed the different splitting scenarios for decision tree nodes.

Attribute Selection Measure: Entropy and Information Gain

Agenda
In this session, we’ll learn about:
● Attribute selection measures used in Decision Tree
● Information entropy
● Conditional entropy
● Information Gain index
● Gain Ratio
Information Entropy
● A measure of uncertainty associated with a random variable (information needed to describe this variable).
● For a discrete random variable, Y, that takes m distinct values (y1, …, ym).
○ H(Y) = −Σ_{i=1..m} p_i log2(p_i), where p_i = P(Y = y_i)
○ H(Y) ∈ [0, log2 m]
○ In classification, Y is our class label variable. Values such as y_i are the class labels.
● Interpretation:
○ Higher entropy => higher impurity [many classes with equal chance to appear].
○ Lower entropy => low impurity [e.g., one class dominates].
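
As a quick, hedged illustration of these interpretations, the snippet below computes H(Y) for a mixed node and for a pure node (the 9-yes/5-no counts anticipate the buys_computer example used later).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) in bits for a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(round(entropy(["yes"] * 9 + ["no"] * 5), 3))   # 0.94: mixed classes -> higher impurity
print(entropy(["yes"] * 14) == 0.0)                  # True: one dominating class -> zero impurity
```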

Conditional Entropy
● Conditional entropy (or equivocation) quantifies the amount of information needed to describe the
outcome of Y given that the value of another random variable X is known.
○ It is the weighted sum of entropies over all conditions:

  H(Y|X) = Σ_x P(X = x) · H(Y|X = x)

○ H(Y|X) ≤ H(Y), because splitting on any X results in partitions that are purer (no messier) than
the parent.
○ Equality holds when X is independent of Y.

Attribute Selection Measure: Information Gain (ID3/C4.5)
● Let pi be the probability that an arbitrary tuple in D belongs to class Ci, i∈[1, m], estimated by |Ci, D|/|D|.
● Entropy (expected information) of D w.r.t. C:

  Info(D) = −Σ_{i=1..m} p_i log2(p_i)

● Conditional entropy: the entropy after using A to split D into v partitions:

  Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)

● Information gained by branching on attribute A:

  Gain(A) = Info(D) − Info_A(D)

● Select the attribute with the highest information gain.

Attribute Selection Measure: Information Gain Example: Data

[Table: the 14-tuple buys_computer training data, with attributes age, income, student, and credit_rating and class label
buys_computer.]
Attribute Selection Measure: Information Gain Example
● Class P: buys_computer = “yes”: 9 tuples; Class N: buys_computer = “no”: 5 tuples, so Info(D) = I(9,5) = 0.940.
● Test age, income, student, and credit_rating one by one.

  Age      p_i   n_i   I(p_i, n_i)
  <=30     2     3     0.971
  31…40    4     0     0
  >40      3     2     0.971

  Info_age(D) = 5/14 × I(2,3) + 4/14 × I(4,0) + 5/14 × I(3,2) = 0.694
  (5/14 × I(2,3) means “age <= 30” covers 5 of the 14 samples, with 2 yes’s and 3 no’s.)

  Gain(age) = Info(D) − Info_age(D) = 0.246
  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048

● Gain(age) is the largest, so split on age.
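
The slide's numbers can be reproduced with a few lines of Python (a hedged sketch working directly from the class counts):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

info_D = entropy([9, 5])                                            # I(9,5) ≈ 0.940
age_partitions = [(2, 3), (4, 0), (3, 2)]                           # <=30, 31…40, >40
info_age = sum(sum(p) / 14 * entropy(p) for p in age_partitions)    # ≈ 0.694
print(round(info_D, 3), round(info_age, 3), round(info_D - info_age, 3))
# 0.94 0.694 0.247  (the slide rounds the intermediate values, giving 0.246)
```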

Gain Ratio for Attribute Selection (C4.5)
● Information Gain measure is biased towards attributes with a large number of distinct values, e.g., Student
ID.
● C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalize info gain w.r.t. attribute value
distribution).
● The attribute with the maximum gain ratio is selected as the splitting attribute
■ GainRatio(A) = Gain(A)/SplitInfo(A)
■ SplitInfo is the entropy of the attribute A w.r.t. its possible values. More values, greater SplitInfo.

● E.g., income has three values: low, medium, and high, covering 4, 6, and 4 of the 14 tuples:

  SplitInfo_income(D) = −4/14 × log2(4/14) − 6/14 × log2(6/14) − 4/14 × log2(4/14) = 1.557

  GainRatio(income) = Gain(income) / SplitInfo_income(D) = 0.029 / 1.557 = 0.019
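
The same arithmetic, as a small hedged helper:

```python
import math

def split_info(partition_sizes):
    """SplitInfo: the entropy of the attribute's own value distribution."""
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total) for s in partition_sizes)

si_income = split_info([4, 6, 4])                          # low, medium, high -> ≈ 1.557
print(round(si_income, 3), round(0.029 / si_income, 3))    # 1.557 0.019
```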

Summary
Here is a quick recap:
● We discussed Information Entropy, a measure of uncertainty associated with a random variable.
● We learned Conditional Entropy, which quantifies the amount of information needed to describe the outcome
of Y given X’s values.
● We learned Information Gain and Gain ratio as attribute selection measures.

Attribute Selection Measure: Gini Impurity

Agenda
In this session, we will discuss:
● Gini impurity index and its computation.
● Comparison between attribute selection measures.

Gini Impurity Index (CART, IBM Intelligent Miner)
● If a data set D contains examples from n classes, the Gini impurity index, gini(D), is defined as:

  gini(D) = 1 − Σ_{j=1..n} p_j²

  ○ where p_j is the relative frequency of class j in D; gini(D) ∈ [0, 1).

● If the data set D is split on attribute A into two subsets D1 and D2, the Gini index of D given that partition is the
weighted sum of the subsets’ impurities, and the quality of the split is the reduction in impurity:

  gini_A(D) = (|D1| / |D|) × gini(D1) + (|D2| / |D|) × gini(D2)

  Δgini(A) = gini(D) − gini_A(D)

● The attribute that provides the largest reduction in impurity (or, equivalently, the smallest gini_split(D)) is chosen to split the node
and grow two branches.
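
A hedged Python sketch of these two formulas, used here to reproduce the worked example on the next slide:

```python
def gini(counts):
    """Gini impurity for the class counts in one node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini impurity after a split; partitions is a list of per-subset class-count lists."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

g_D = gini([9, 5])                               # ≈ 0.459
g_income = gini_split([[7, 3], [2, 2]])          # {low, medium} vs. {high} -> ≈ 0.443
print(round(g_D, 3), round(g_income, 3), round(g_D - g_income, 3))
```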

Computation of Gini Index
● Ex.: D has 9 tuples with buys_computer = “yes” and 5 with “no”:

  gini(D) = 1 − (9/14)² − (5/14)² = 0.459

● Try attribute income: first partition D into D1: {low, medium} with 10 tuples (7 yes, 3 no) and D2: {high} with 4 tuples
(2 yes, 2 no):

  gini_income∈{low,medium}(D) = (10/14) × gini(D1) + (4/14) × gini(D2)
                              = 10/14 × (1 − (7/10)² − (3/10)²) + 4/14 × (1 − (2/4)² − (2/4)²)
                              = 0.443 = gini_income∈{high}(D)

  Then gini_income∈{low,high}(D) = 0.458 and gini_income∈{medium,high}(D) = 0.450.
  Thus the best split on income is {low, medium} vs. {high}, since it produces the largest reduction in the Gini index.

● Compared to the other attributes (age, student, and credit_rating), gini_age∈{young,senior}(D) = 0.357 is the best
splitting condition overall.

Comparing Attribute Selection Measures
● Optimal tree: minimizing the number of tests needed to classify an input is NP-complete (1967).
● Shallower trees are quicker to learn, while deeper trees tend to be more accurate but risk overfitting.
● The three measures, in general, result in similar performance.
○ Information gain: biased in favor of multivalued attributes.
○ Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
○ Gini index: biased in favor of multivalued attributes.
■ Faster to compute because it doesn’t use log().

Summary
Here is a quick recap:
● We discussed the Gini impurity index: the attribute providing the largest reduction in impurity is chosen to
split the node and grow two branches.
● We talked about the comparison between attribute selection measures. Information gain is biased in favor of
multivalued attributes, the Gain ratio tends to prefer unbalanced splits in which one partition is much smaller
than the others, and the Gini index is biased in favor of multi-valued attributes.
● But they generally give similar performance.
Decision Tree Pruning

Agenda
In this session, we’ll learn about:
● Overfitting
● Introduction to Pruning
● Classification in large databases
● Importance of decision tree

Overfitting and Tree Pruning
● Overfitting: An induced tree may overfit the training data.
○ Too many branches; some may reflect anomalies due to noise or outliers.
○ Poor accuracy for unseen samples.
● Two approaches to avoid overfitting
○ Pre Pruning: Halt tree construction early. Do not split a node if this would result in the goodness
measure falling below a threshold.
■ Difficult to choose an appropriate threshold.
■ Set thresholds on tree depth or min number of objects in a node
○ Post Pruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned
trees.
● May use a validation dataset different from the training data to decide which is the “best-pruned tree”.

Tree Pruning
● General strategy
○ Compare a goodness measurement (Info Gain, Gain Ratio, Gini) with/without a candidate branch.
○ If the measurement is better without the branch, prune it.
○ Pruning a tree from the bottom up.
● For decision trees, the impact of tree pruning on the performance of the resulting trees is more significant than
that of the splitting measures.
Classification in Large Databases
● Why is decision tree induction popular?
○ Relatively faster learning speed (than other classification methods).
○ Convertible to simple and easy-to-understand classification rules.
○ Comparable classification accuracy with other methods.
● Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed
● To train a decision tree on a large dataset, when we cannot read all training examples into memory:
○ Compute and store only a contingency table for each attribute (an AVC set: Attribute, Value, Class label, and
counts).
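
A hedged pandas sketch of an AVC set for a single attribute (with a few made-up rows): each cell just counts (attribute value, class label) pairs, so the table stays small no matter how many training tuples streamed through.

```python
import pandas as pd

df = pd.DataFrame({"age": ["<=30", "<=30", "31…40", ">40", ">40"],
                   "buys_computer": ["no", "yes", "yes", "yes", "no"]})

# AVC set for attribute "age": rows are attribute values, columns are class labels, cells are counts.
avc_age = pd.crosstab(df["age"], df["buys_computer"])
print(avc_age)
```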

Summary
Here is a quick recap:
● We discussed that an induced tree might overfit the training data
● We discussed pruning and that the impact of tree pruning is more significant than the splitting measures on
the performance of the resulting trees.
● We talked about classification in large databases and scalability.
● We learned the importance of a decision tree that is relatively faster and produces human-readable rules.
Bayesian Classification Methods

Agenda
In this session, we will discuss:
● Need for Bayesian classification methods
● Naive Bayes classifier
● Advantages and disadvantages of Naïve Bayes classifier
● Avoiding the zero-probability problem

Bayesian Classifiers: Why?
● Statistical classifiers grounded in probability theory: they perform probabilistic prediction, i.e., predict class
membership probabilities.
● Foundation: Based on Bayes’ Theorem.
● Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable performance with
decision trees and some shallow neural network classifiers
● Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is
correct — prior knowledge can be combined with observed data.
● Baseline benchmark: They can provide a standard of optimal decision-making against which other methods
can be measured.

Deriving the Maximum Posterior: Bayes’ theorem
● Let D be a training set of tuples and their associated class labels, and each tuple is represented by an n-
attribute vector X = (x1, x2, …, xn).
● Suppose there are m classes C1, C2, …, Cm.
● Classification is equivalent to deriving the maximum a posteriori class, i.e., the class Ci with the maximal P(Ci|X)
(P(Ci) is the prior probability).
● This can be derived from Bayes’ theorem:

  P(Ci|X) = P(X|Ci) × P(Ci) / P(X)

Deriving the Maximum Posterior
● Since P(X) is constant for all classes, only P(X|Ci) × P(Ci) needs to be computed and maximized.

● Predict that X belongs to Ci if the probability P(Ci|X) is the highest among all the P(Ck|X) for all m classes.

Naïve Bayes Classifier
● A simplifying assumption: attributes are conditionally independent (i.e., there is no dependence relation between
attributes given the class):

  P(X|Ci) = Π_{k=1..n} P(x_k|Ci)
● This greatly reduces the computation cost: Only counts attribute value distribution in different classes.
● If Ak is categorical, P(Ak=xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci, D| (# of tuples of Ci in
D)
● If Ak is continuous-valued, P(Ak = xk|Ci) is usually computed based on a Gaussian distribution with mean μ and
standard deviation σ: P(Ak = xk|Ci) = g(xk, μ_Ci, σ_Ci) = (1 / (√(2π) · σ_Ci)) · exp(−(xk − μ_Ci)² / (2 · σ_Ci²))
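
A hedged sketch of both per-attribute likelihoods follows; the numbers in the calls are illustrative, not estimated from the course data.

```python
import math

def categorical_likelihood(value, class_values):
    """P(A_k = x_k | C_i) for a categorical attribute: relative frequency within the class."""
    return class_values.count(value) / len(class_values)

def gaussian_likelihood(x, mu, sigma):
    """P(A_k = x | C_i) for a continuous attribute under the class-conditional Gaussian assumption."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(categorical_likelihood("medium", ["low", "medium", "medium", "high"]))   # 0.5
print(round(gaussian_likelihood(30, mu=38.0, sigma=12.0), 4))  # density of age = 30 within class C_i
```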

Naïve Bayes Classifier: Example Training Dataset
Class:
C1: buys_computer = “yes”
C2: buys_computer = “no”

[Table: the 14-tuple buys_computer training data, as before.]

Data tuple to be classified:

X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Naïve Bayes Classifier: An Example

● X = (age <= 30 , income = medium, student = yes, credit_rating = fair)

● P(Ci):P(buys_computer = “yes”) = 9/14 = 0.643


○ P(buys_computer = “no”) = 5/14= 0.357
● Compute P(X|Ci) for each attribute value

○ P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
○ P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
○ P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
○ P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4

Naïve Bayes Classifier: An Example
● Compute P(X|Ci) for each attribute value
○ P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
○ P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
○ P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
○ P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
● Compute overall P(X|Ci) for this X: for the two classes:
○ P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
○ P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044

● Finally, P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028


○ P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
● Therefore, X belongs to class (“buys_computer = yes”)
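
The whole calculation fits in a few lines; this hedged sketch just replays the slide's numbers:

```python
import math

priors = {"yes": 9 / 14, "no": 5 / 14}
likelihoods = {   # P(age|C), P(income|C), P(student|C), P(credit_rating|C) for this X
    "yes": [2 / 9, 4 / 9, 6 / 9, 6 / 9],
    "no":  [3 / 5, 2 / 5, 1 / 5, 2 / 5],
}
scores = {c: priors[c] * math.prod(likelihoods[c]) for c in priors}
print({c: round(v, 3) for c, v in scores.items()})   # {'yes': 0.028, 'no': 0.007} -> predict "yes"
```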

Avoiding the Zero-Probability Problem
● Naïve Bayesian prediction requires each conditional probability to be non-zero. Otherwise, the predicted
probability will be zero.

● Ex.: Suppose the “buy = yes” class of the training dataset has 1,000 tuples: income = low (0 tuples), income = medium (990),
and income = high (10); now we need to predict the class for a new example with low income.
Avoiding the Zero-Probability Problem
● Use Laplacian correction (or “Laplace estimator”)
○ Adding 1 to each attribute value count:
■ Prob(income = low|buy=yes) = 1/1003
■ Prob(income = medium|buy=yes) = 991/1003
■ Prob(income = high|buy=yes) = 11/1003
○ The “corrected” probability estimates are close to their original counterparts.
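
The correction is one line per value; this hedged sketch reproduces the counts above:

```python
def laplace_corrected(value_counts):
    """Add-one (Laplace) smoothing of an attribute's value counts within one class."""
    total = sum(value_counts.values()) + len(value_counts)   # +1 per possible value
    return {v: (c + 1) / total for v, c in value_counts.items()}

income_counts = {"low": 0, "medium": 990, "high": 10}        # counts within buy = "yes"
print(laplace_corrected(income_counts))   # low: 1/1003, medium: 991/1003, high: 11/1003
```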
Naïve Bayes Classifier: Comments
● Advantages
○ Easy to implement, incremental.
○ Good results have been obtained for many classification tasks, despite the violation of the conditional
independence assumption.
● Disadvantages
○ Assumption: class conditional independence, therefore loss of accuracy.
○ Practically, dependencies could exist among variables.
■ E.g., age, family history, smoking, lung cancer.


■ Dependencies among these cannot be modeled by Naïve Bayes Classifier.
■ Bayesian Belief Network addresses this issue to some extent.

Summary
Here is a quick recap:
● We learned the Bayesian classification methods are based on Bayes Theorem.
● We discussed the Naive Bayes classifier that simplifies the assumption and reduces the computation cost.
● We talked about Naïve Bayesian prediction that requires each conditional probability to be non-zero.
Otherwise, the predicted probability will be zero.
● We learned the advantages and disadvantages of the Naïve Bayes classifier, e.g., it is easy to implement but risks a loss of accuracy.

Model Evaluation and Selection

Agenda
In this session, we’ll learn about:
● Classifier Effectiveness Evaluation Metrics


Model Evaluation and Selection
● Models are trained, but how good are they for new data?
● Evaluation metrics: How can we measure the effectiveness?
● Use validation sets of labeled tuples instead of the training set to refine/tune and optimize model
performance.


Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
● Given m classes, an entry Ci,j in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j.
● Shown here is a matrix for a binary classification problem, but the idea applies to multi-class classification (add C3, …, Cn).
● May have extra rows/columns to provide totals.

                       Predicted
Actual                 C1: Positive            C2: Negative
C1: Positive           True Positives (TP)     False Negatives (FN)
C2: Negative           False Positives (FP)    True Negatives (TN)

Classifier Evaluation Metrics: Confusion Matrix
Example of a confusion matrix:

Actual class \ Predicted class    buys_computer = yes    buys_computer = no    Total
buys_computer = yes               6954                   46                    7000
buys_computer = no                412                    2588                  3000
Total                             7366                   2634                  10000

Classifier Evaluation Metrics
● Classifier Accuracy, or recognition rate: percentage of tuples that are correctly classified.
  Accuracy = (TP + TN) / All
● Error rate: 1 – accuracy, or (FP + FN) / All
• True Positive Rate = TP/P
• False Negative Rate = FN/P
• False Positive Rate = FP/N
• True Negative Rate = TN/N

A\P     C     ¬C
C       TP    FN    P
¬C      FP    TN    N
        P’    N’    All
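A minimal sketch of these rates computed from raw counts; the numbers are borrowed from the buys_computer confusion matrix on the previous slide.

TP, FN, FP, TN = 6954, 46, 412, 2588     # counts from the example matrix above
P, N = TP + FN, FP + TN                  # actual positives and negatives

accuracy   = (TP + TN) / (P + N)         # 0.9542
error_rate = (FP + FN) / (P + N)         # 0.0458
tpr, fnr   = TP / P, FN / P              # true/false rates on the positive class
fpr, tnr   = FP / N, TN / N              # true/false rates on the negative class

print(accuracy, error_rate, tpr, fpr)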

Classifier Evaluation Metrics: Sensitivity and Specificity
● Accuracy and Error Rate are not good for imbalanced-class classification problems:
  ○ One class may be rare, e.g., fraud or HIV-positive.
  ○ The significant majority belongs to the negative class and only a small number of tuples belong to the positive class: accuracy and error rate are dominated by the negative class.
  ○ Sensitivity: True Positive Rate, recall.
  ○ Specificity: True Negative Rate.

Actual \ Predicted    P      N       Total
P                     10     90      100
N                     90     9810    9900
Total                 100    9900    10000

Accuracy = 9820/10000 = 98.2%
Sensitivity = 10/100 = 10%
Specificity = 9810/9900 = 99.1%
Sensitivity + Specificity reflect the true performance.

Classifier Evaluation Metrics: Precision, Recall, and F-measures
● Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive.
  precision = TP / (TP + FP)
● Recall/Sensitivity: completeness – what % of positive tuples did the classifier label as positive?
  recall = TP / (TP + FN)
● Perfect score for precision and recall: 1.0
● Inverse relationship between precision and recall.
● F measure (F1 or F-score): harmonic mean of precision and recall,
  F1 = (2 × precision × recall) / (precision + recall)
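A minimal sketch of precision, recall, F1, and the weighted Fβ computed from raw counts; the counts are taken from the cancer example that appears later in this deck.

TP, FN, FP, TN = 90, 210, 140, 9560      # cancer example counts

precision = TP / (TP + FP)               # 90 / 230 ≈ 0.3913
recall    = TP / (TP + FN)               # 90 / 300 = 0.30
f1 = 2 * precision * recall / (precision + recall)

beta = 2                                 # beta > 1 weights recall more than precision
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4), round(f_beta, 4))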

Classifier Evaluation Metrics: Precision, Recall, and F-measures
Fβ: weighted measure of precision and recall:
● Assigns β times as much weight to recall as to precision; β > 1 weights recall more.

  Fβ = (1 + β²) × precision × recall / (β² × precision + recall)


Classifier Evaluation Metrics: Example

Actual class \ Predicted class    Cancer = yes    Cancer = no    Total
Cancer = yes                      90              210            300
Cancer = no                       140             9560           9700
Total                             230             9770           10000

● Accuracy = (90 + 9560) / 10000 = 96.50%
● Precision = 90/230 = 39.13%, Recall = 90/300 = 30.00%
● Sensitivity = 90/300 = 30%, Specificity = 9560/9700 = 98.56%

Estimate Model Effectiveness
● For both classification and numerical regression:
● Performance metrics obtained from one run are often not a reliable estimate of the classifier’s performance.
○ Different splits of the labeled dataset into training and validation often result in different performance
scores – which score should we trust? Certainly not the best one!
○ To obtain unbiased performance score: need validation datasets different from training.
○ To obtain a reliable estimate: needs to evaluate the model multiple times.

Always hold out the final test dataset!

Review

Labeled data                    Usage
Training / Validation data      Model selection, performance tuning, performance estimate
Test data                       Final evaluation
Summary
Here is a quick recap:
● We discussed the various evaluation metrics using a confusion matrix: Precision, Recall, and F1-score, to
predict the effectiveness of a classifier.
● Regression and classification: unbiased and reliable estimate of model performance requires a separate test
dataset and multiple runs of the model.


Performance Estimate Methods

Agenda
In this session, we will discuss:
● Holdout, cross-validation, and bootstrap
● Model selection: ROC curves
● Issues affecting model selection


Evaluating Classifier Effectiveness: Holdout
● Holdout method (for large data sets)
○ Given data is randomly partitioned into two independent sets.
■ Training set (>=2/3) for model construction
■ Validation set (<= 1/3) for evaluation
○ Random sampling: a variation of holdout
■ Repeat holdout k times, effectiveness = avg. of the k effectiveness
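A minimal holdout sketch using scikit-learn; the dataset and the 2/3–1/3 split are illustrative assumptions, and any labeled X, y arrays would work.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# ~2/3 for training, ~1/3 held out for validation, stratified on the class label
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))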

Evaluating Classifier Effectiveness: Cross-Validation Methods
● Cross-validation (k-fold, k = 3, 5, 10, for medium-sized data)
○ Randomly partition the data into k mutually exclusive subsets, each approximately equal in size.
○ At i-th iteration, use Di as the test/validation set and others as a training set.
○ Variations:
○ Leave-one-out: k folds where k = No. of tuples, for small-sized data.
○ Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data.
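A minimal stratified k-fold cross-validation sketch with scikit-learn; the dataset, the classifier, and k = 10 are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)   # stratified 10-fold

scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=cv)
print(scores.mean(), scores.std())   # average and spread over the 10 folds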

Evaluating Classifier Effectiveness: Bootstrap
● Bootstrap
● Used on small data sets, tends to be optimistic.
● Samples the labeled tuples with replacement
● I.e., each time a tuple is selected, it is put back into the training set and is equally likely to be selected again.
● Several bootstrap methods and a common one is .632 bootstrap
○ A data set with d tuples is sampled d times, with replacement,
○ Resulting in a training set of d samples.
○ The data tuples that did not make it into the training set ended up forming the validation set.
○ About 63.2% of the original data end up in training, and the remaining 36.8% form the validation set (since (1 – 1/d)^d ≈ e^(-1) ≈ 0.368).
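A minimal sketch of one bootstrap round: sample d tuples with replacement for training and use the tuples that were never drawn (the out-of-bag tuples) for validation. The dataset and classifier are illustrative assumptions.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
d = len(X)
rng = np.random.default_rng(42)

train_idx = rng.integers(0, d, size=d)              # d draws with replacement
oob_idx = np.setdiff1d(np.arange(d), train_idx)     # ~36.8% of tuples on average

model = DecisionTreeClassifier(random_state=42).fit(X[train_idx], y[train_idx])
print("out-of-bag accuracy:", model.score(X[oob_idx], y[oob_idx]))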

Evaluating Classifier Effectiveness: Bootstrap
● There are several bootstrap methods, and a common one is the .632 bootstrap.
  ○ Repeat the sampling procedure k times (e.g., 30 to hundreds of times); the overall accuracy of the model is estimated by averaging over the k iterations:

    Acc(M) = (1/k) Σ_{i=1..k} [ 0.632 × Acc(Mi)_validation_set + 0.368 × Acc(Mi)_training_set ]

Compare Performance: Classifier Models M1 vs. M2
● Suppose we have 2 classifiers, M1 and M2, which one is better?
● Use a performance estimate method to obtain K performance scores for each model.
● These performance scores are just estimates of scores on the population, including future data cases.
● What if the differences in scores between the 2 models were simply by chance?

● Use a test of statistical significance:
○ Unpaired two sample T-test: compare two models trained/validated on different data sets [scores are not paired]

○ Paired two sample T-Test: compare two models trained/validated on same data sets [scores are in pairs]

○ ANOVA: compare > two models

○ Obtain a confidence level for the test:


○ “at 95% confidence level, performances were statistically significantly different”

○ or “performance differences were not statistically significant at 95% conf. level”.
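A minimal sketch of a paired two-sample t-test on per-fold scores of two models evaluated on the same cross-validation splits (so the scores are paired). The two classifiers and the dataset are illustrative assumptions.

from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

scores_m1 = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=cv)
scores_m2 = cross_val_score(GaussianNB(), X, y, cv=cv)

t_stat, p_value = stats.ttest_rel(scores_m1, scores_m2)   # paired t-test
print(p_value < 0.05)   # True -> difference significant at the 95% confidence level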

Model Selection: ROC Curves
● ROC (Receiver Operating Characteristic) curves: for visual comparison of
classification models.
● Originated from the signal detection theory.
● Shows the tradeoff between the true positive rate and the false positive rate.
● The Area Under the ROC Curve (AUC) is a measure of the accuracy of the model.
● Vertical axis represents the true positive rate.
● Horizontal axis represents the false positive rate.
● The plot also shows two model performance curves and a diagonal line.
● The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model.
● A model with perfect accuracy will have an area of 1.0.

● To use ROC, the model must return the probability of the predicted class for each observation.

Model Selection: ROC Curves
● Sort observations in decreasing order: the one that is most likely to belong to the positive class appears at the
top of the list.
● From the first row down, use each probability as the threshold to predict class labels.
● Compute TP, FP, TN, and FN.
● Compute TP rate = TP/P, FP rate = FP/N

● Plot TP rate and FP rate pairs.



● ROC is useful for balanced classes.


o Another curve known as Precision-Recall Curve is useful for imbalanced classes.
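A minimal ROC/AUC sketch with scikit-learn; the classifier must provide a probability (or score) for the positive class for each observation. The dataset and the logistic regression model are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
proba = model.predict_proba(X_val)[:, 1]           # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_val, proba)     # one (FPR, TPR) pair per threshold
print("AUC:", roc_auc_score(y_val, proba))         # 0.5 ~ random guessing, 1.0 ~ perfect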

Model Selection: ROC Curves
(Figure: ROC curves of two classification models plotted against the diagonal reference line; the larger the area under the curve, the more accurate the model.)

Issues Affecting Model Selection
● Effectiveness (precision, recall, etc.)
○ Classifier performance: predicting the class label.
● Speed
○ Time to construct the model (training time).

○ Time to use the model (classification/prediction time).
● Robustness: handling noise and missing values
● Scalability: efficiency in disk-resident databases or the cloud.
● Interpretability
○ Understanding and insight provided by the model.
● Portability: maintain performance level on new data
● Other measures, e.g., the goodness of rules, such as decision tree size or compactness
of classification rules.
Summary
Here is a quick recap:
● We discussed the Holdout method (for large data sets) in which data is randomly partitioned into two
independent sets.
● We learned Cross-validation (k-fold, k = 3, 5, 10, for medium-sized data) that randomly partitions the data into
k mutually exclusive subsets, each approximately equal in size.
● We discussed that the bootstrap, used on small data sets, tends to be optimistic.
● We discussed T-tests and ROC curves for comparison of classification models.

● We learned about other issues that affect model selection, such as effectiveness, robustness, scalability, etc.

Techniques to improve classification accuracy: Ensemble methods and Imbalanced classes

Agenda
In this session, we’ll learn about:
● Techniques to improve classification accuracy.
● Ensemble methods
● Imbalanced class problems.


Ensemble Methods: Improve Performance
● Ensemble methods
○ Train k models using the training data.
○ Let all k models make predictions on new data.
○ Combine predictions from k models to make the final prediction.
○ Combine a series of k-learned models, M1, M2, …, Mk, with the aim of creating an improved model M*.
● Popular ensemble methods
○ Bagging: majority vote or averaging the prediction over a collection of classifiers.
○ Boosting: weighted vote with a collection of classifiers.


○ Random Forests: building a large number of ‘random’ trees.

Bagging: Bootstrap Aggregation
● Analogy: Diagnosis based on multiple doctors’ majority vote.
● Training
○ Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement
from D (i.e., bootstrap).
○ A classifier model Mi is trained using a training set Di.
● Classification: classify an unknown sample X
○ Each classifier Mi returns its class prediction.
○ The bagged classifier M* counts the votes and assigns the class with the most votes to X.
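A minimal bagging sketch with scikit-learn: k decision trees, each trained on a bootstrap sample of D and combined by majority vote. The dataset, the base classifier, and k = 25 are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bagged = BaggingClassifier(DecisionTreeClassifier(),   # base classifier Mi
                           n_estimators=25,            # k bootstrap rounds
                           bootstrap=True,
                           random_state=42)

print(cross_val_score(bagged, X, y, cv=5).mean())      # bagged accuracy estimate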

Bagging: Bootstrap Aggregation
● Prediction: can be applied to the prediction of continuous values by taking the average value of each
prediction for a given test tuple.
● Effectiveness:
○ Often significantly better than a single classifier derived from D.
○ For noise data: not considerably worse, more robust.


Boosting
● Analogy: Consult several doctors based on a combination of weighted diagnoses—weight assigned based on
the previous diagnosis accuracy.
● How does boosting work?
○ Equal weights are initially assigned to each training tuple.
○ A series of k classifiers is iteratively learned.
○ After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi.
○ The final M* combines the votes of each individual classifier, where the weight of each classifier's vote
is a function of its accuracy.
● Boosting algorithm can be extended for numeric prediction.
● Compared to bagging, boosting tends to have greater accuracy, but it also risks overfitting the model to
misclassified data.

Adaboost
● Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd), initially, all the weights of tuples are set the same
(1/d).
● Generate k classifiers in k rounds. At round i,
○ Tuples from D are sampled (with replacement) to form a training set Di of the same size.
○ Each tuple’s chance of being selected is based on its weight.
○ A classification model Mi is derived from Di.
○ Its error rate is calculated using Di as a validation set.
○ If a tuple is misclassified, its weight is increased, otherwise, it is decreased.

Adaboost
● Model error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). Classifier Mi’s error rate is the sum of the weights of the misclassified tuples:

  error(Mi) = Σj wj × err(Xj)

● The weight of classifier Mi’s vote is

  log( (1 – error(Mi)) / error(Mi) )

● The sum of the classifiers’ weighted votes for each class is used to select the final class.
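A minimal AdaBoost sketch with scikit-learn, using shallow decision trees ("stumps") as the weak classifiers that get reweighted each round. The dataset, the weak learner, and k = 50 rounds are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),   # weak learner
                             n_estimators=50,                       # k boosting rounds
                             random_state=42)

print(cross_val_score(boosted, X, y, cv=5).mean())                  # boosted accuracy estimate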

Random Forest
● Random Forest:
○ Each classifier in the ensemble is a decision tree classifier and is generated using a random selection of
attributes at each node to determine the split. Grow full trees without pruning.
○ During classification, each tree votes, and the most popular class is returned.
● Two methods to construct Random Forest:
○ Forest-RI (random input selection): at each node, randomly select a few attributes (much smaller than the number of available attributes) as candidates for the split. This is the most commonly used method.
○ Forest-RC (random linear combinations): Used when only a few attributes are available. Creates new
attributes (or features) that are a random linear combination of randomly selected attributes (to reduce the
correlation between individual classifiers).
● Comparable in accuracy to Adaboost but more robust to errors and outliers.
● Insensitive to the number of attributes selected for consideration at each split and faster than bagging or
boosting, with no overfitting.
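A minimal Random Forest (Forest-RI style) sketch with scikit-learn: many unpruned trees, each split chosen from a small random subset of attributes, combined by voting. The dataset and the parameter values are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=200,       # number of random trees
    max_features="sqrt",    # attributes considered at each split (<< total attributes)
    random_state=42)

print(cross_val_score(forest, X, y, cv=5).mean())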

Classification of Class-Imbalanced Data Sets
● Class-imbalance problem: Rare positive examples but numerous negative ones, e.g., outlier detection,
medical diagnosis, fraud, oil spill, fault, etc.
● Traditional methods assume a balanced distribution of classes and equal error costs: not suitable for
class-imbalanced data.
● Evaluation metrics developed
○ Sensitivity + Specificity
○ Precision + Recall
○ F_β score: allow different weights for precision and recall
● These metrics better reflect system performance in case of class imbalance.
● These metrics help to tune systems to improve rare class performance.

Methods for Class-Imbalanced Problem
● Typical methods for imbalance data in 2-class classification:
○ Oversampling: re-sampling of data from the positive class.
○ Generate synthetic samples similar to the original. Works better on numerical attributes.
■ SMOTE: generate a fixed number of examples for each minority sample.
■ ADASYN: generate a different number of examples.
○ Under-sampling: randomly eliminate tuples from the negative class.
○ Threshold-moving: moves the decision threshold, t, so that the rare class tuples are easier to classify, and hence there is less chance of costly false negative errors.
● Class imbalance problem remains difficult on multiclass tasks.
● Convert a multiclass task to a set of binary classification tasks.
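A minimal threshold-moving sketch with scikit-learn: lower the decision threshold so that positive tuples are easier to classify (0.5 is only a default, not a requirement). The dataset, the model, and the threshold value 0.2 are illustrative assumptions; oversampling methods such as SMOTE are typically provided by the separate imbalanced-learn package.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
proba = model.predict_proba(X_val)[:, 1]        # probability of the positive class

default_pred = (proba >= 0.5).astype(int)       # standard threshold
moved_pred   = (proba >= 0.2).astype(int)       # moved threshold favors the positive class

print(recall_score(y_val, default_pred), recall_score(y_val, moved_pred))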

Summary
Here is a quick recap:
● We discussed various ensemble methods to improve the performance of the model, such as Bagging,
Boosting, and Random Forest.
● We discussed Bagging, which can be applied to the prediction of continuous values by taking the average
value of each prediction for a given test tuple.
● We talked about Boosting and one example, Adaboost, which tends to have greater accuracy but also risks overfitting the model to misclassified data.

● We learned the Random forest method in which each classifier is a decision tree classifier and is generated
using a random selection of attributes at each node to determine the split.
● We discussed different solutions to the class-imbalanced classification problem.

Learning Outcomes
You should now be able to:
• Outline Supervised vs. Unsupervised learning, Classification - Model construction, and Model Evaluation.
• Recall and summarize the Decision Tree.
• Summarize the concepts of Entropy, Information Gain, and Gini Impurity.
• Recall and restate the concepts of Pruning.
• Implement Bayesian classification methods.
• Summarize Model evaluation and selection.


• List and explain techniques to improve classification accuracy

Thank You

