                                      Data Warehouse
     Learning objectives
         Upon completion, you will be able to:
         • Explain the concepts of a data warehouse and a metadata repository.
         • Explain the differences between OLTP and OLAP.
         • Identify the application areas of the three data warehouse models: enterprise warehouse, data mart, and
             virtual data warehouse.
         • Explain the star, snowflake, and fact constellation schemas for data warehouses.
         • Explain the use of the star-net query model to support roll-up and drill-down operations.
         • Explain what a data cube and a data cuboid are.
                           Introduction to Data Warehouse
       Agenda
         In this session, we will discuss:
         ● Basic concepts of data warehouse
         ● OLTP vs OLAP
         ● Metadata Repository
         ● Need for a data warehouse
     What is a data warehouse?
           ● Defined in many different ways, but not rigorously.
                ○ A decision support database maintained separately from the organization’s operational database.
                ○ Support decision making by providing a solid platform of consolidated, historical data for analysis.
           ● “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support
               of management’s decision-making process.”—W. H. Inmon
                ○ A physically separate store of data transformed from the operational environment.
                ○ Operational update of data does not occur in the data warehouse environment.
                ○ Does not require transaction processing, recovery, and concurrency control mechanisms
                ○ Requires only two operations in data accessing: Initial loading of data and access to data
                ○ "Data warehousing”:
                        ■ The process of constructing and using data warehouses.
     Metadata Repository
                ● Metadata is the data defining warehouse objects. It stores:
                        ● Description of the structure of the data warehouse:
                           ○ Schema, view, dimensions, concept hierarchies, derived data definition, data mart locations, and
                               contents.
                        ● Operational metadata
                           ○ Data lineage (history of migrated data and transformation path), the currency of data (active,
                               archived, or purged), monitoring information (warehouse usage statistics, error reports, audit
                               trails).
                        ● The algorithms used for summarization
                        ● The mapping from the operational environment to the data warehouse
                        ● Data related to system performance
                        ● Business data
                           ○ business terms and definitions, ownership of data, charging policies
       Online Transactional Processing vs Online Analytical Processing
                       OLTP (operational DBMS)                   OLAP (data warehouse)
          Users        Clerk, IT professional                    Knowledge worker
          Function     Day-to-day operations                     Decision support
          DB design    Application-oriented                      Subject-oriented
          Data         Current, up-to-date, detailed,            Historical, summarized,
                       flat relational, isolated                 multidimensional, integrated,
                                                                 consolidated
          Usage        Repetitive                                Ad hoc
          Access       Read/write, index/hash on primary key     Lots of scans
        OLTP vs OLAP
                              OLTP                               OLAP
          Unit of work        Short, simple transaction          Complex query
          # Records accessed  Tens                               Millions
          # Users             Thousands                          Hundreds
          DB size             100 MB to GB                       100 GB to TB
          Metric              Transaction throughput             Query throughput, response time
     Need for a Separate Data Warehouse
      ●         High performance for both systems
                 ○ DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery
                 ○ Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
          ● Different functions and different data:
                 ○ Missing data: Decision support requires historical data, which operational DBs do not typically
                         maintain.
                 ○ Data consolidation: DS requires consolidation (aggregation, summarization) of data from
                         heterogeneous sources.
                 ○ Data quality: different sources typically use inconsistent data representations, codes, and formats,
                         which have to be reconciled.
     Note: More and more systems now perform OLAP analysis directly on relational databases.
      Summary
          Here is a quick recap:
          ● We discussed that the data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile data
              collection.
          ● We talked about metadata, the data that defines warehouse objects.
          ● We learned the differences between OLTP and OLAP, e.g., OLTP has an application-oriented DB design,
              whereas OLAP has a subject-oriented DB design.
          ● We discussed the need for a data warehouse separated from day-to-day operational systems.
                           Data Warehouse Models and ETL
        Agenda
         In this session, we will discuss:
         ● Multi-tiered architecture of a data warehouse
         ● Extraction, Transformation, and Loading (ETL) operations
         ● Three data warehouse models
        Multi-tiered Architecture
        [Diagram: multi-tiered data warehouse architecture, reproduced as text]
          Tier 1, Data Sources: operational DBs and other sources feed the warehouse through Extract / Transform / Load / Refresh.
          Tier 2, Data Storage: the data warehouse and data marts, with a metadata repository, maintained by a monitor & integrator.
          Tier 3, OLAP Engine: OLAP servers that serve the stored data.
          Tier 4, Front-end Tools: analysis, query/reports, and data mining.
     Extraction, Transformation, and Loading (ETL)
       ●        Data extraction
                  ○ get data from multiple, heterogeneous, and external sources.
           ● Data cleaning
                  ○ detect errors in the data and rectify them when possible.
           ● Data transformation
                  ○ Convert data from legacy or host format to warehouse format.
           ● Load
                  ○ Sort, summarize, consolidate, compute views, check integrity, and build indexes and partitions.
           ● Refresh
                  ○ Propagate the updates from the data sources to the warehouse.
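
A rough Python sketch of these steps, assuming a hypothetical CSV source (daily_sales.csv) and a SQLite warehouse table sales_fact; every file, table, and column name here is invented for illustration:

```python
import csv
import sqlite3

SOURCE_CSV = "daily_sales.csv"  # hypothetical source: sale_date, country, product, amount

conn = sqlite3.connect("warehouse.db")
conn.execute("""CREATE TABLE IF NOT EXISTS sales_fact
                (sale_date TEXT, country TEXT, product TEXT, amount REAL)""")

with open(SOURCE_CSV, newline="") as f:
    for row in csv.DictReader(f):  # extraction: read from the external source
        # Cleaning: skip records whose amount is missing or non-numeric.
        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            continue
        # Transformation: normalize country codes to the warehouse format.
        country = row["country"].strip().upper()
        # Loading: insert the cleaned, transformed record into the warehouse.
        conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                     (row["sale_date"], country, row["product"], amount))

conn.commit()  # re-running this job periodically acts as the refresh step
```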
      Three Data Warehouse Models
        ●        Enterprise warehouse
                     ○ collects all of the information about subjects spanning the entire organization.
            ● Data Mart
                     ○ A subset of corporate-wide data that is of value to specific groups of users. Its scope is confined to
                         specific, selected groups, such as a marketing data mart.
                     ○ Independent vs. dependent (directly from warehouse) data mart
        ●       Virtual warehouse
                     ○ A set of views over operational databases.
                     ○ Only some of the possible summary views may be materialized.
     Summary
         Here is a quick recap:
         ● We reviewed the multi-tiered architecture of a data warehouse.
         ● We discussed three Data Warehouse Models: Enterprise warehouse, Data Mart, and Virtual warehouse.
         ● We learned the Extraction, Transformation, and Loading (ETL) process, which includes:
               o Data extraction: To get data from multiple, heterogeneous, and external sources.
               o Data cleaning: To detect errors in the data and rectify them.
               o Data transformation: To convert data from legacy or host format to warehouse format.
               o Load: To sort, summarize, consolidate, compute views, check integrity, and build indexes and partitions.
                                                    Data Cubes
        Agenda
         In this session, we will discuss:
         ● Introduction to data cubes
         ● Three types of measures used in a data cube
         ● How a cube is related to cuboids
         ● How data cubes support OLAP operations: roll-up, drill-down, dice, slice, and pivot
       Data Cubes
         ●      A data warehouse is based on a multidimensional data model which views data in the form of a data cube.
         ●      Data is represented as data cubes.
         ●      A data cube records some measures along relevant dimensions. A sales data cube may record the sales
                amount of items along dimensions such as date, country, and product type.
         ●      Data cubes can be implemented as multi-dimensional arrays or as relational tables.
         ●      In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which
                holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
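
A small pandas sketch (with invented data) makes the lattice concrete: grouping on every subset of the dimensions materializes all cuboids, from the 3-D base cuboid down to the 0-D apex cuboid:

```python
from itertools import combinations
import pandas as pd

# Hypothetical base data: three dimensions (date, country, product), one measure.
df = pd.DataFrame({
    "date":    ["2024-01", "2024-01", "2024-02", "2024-02"],
    "country": ["US", "CA", "US", "CA"],
    "product": ["phone", "phone", "laptop", "phone"],
    "amount":  [100, 80, 250, 90],
})

dims = ["date", "country", "product"]

# The 3-D grouping is the base cuboid; dropping dimensions one by one yields
# the rest of the lattice of cuboids.
for k in range(len(dims), 0, -1):
    for subset in combinations(dims, k):
        print(subset, df.groupby(list(subset))["amount"].sum(), sep="\n")

# The 0-D apex cuboid holds the highest level of summarization: the grand total.
print("apex cuboid (0-D):", df["amount"].sum())
```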
       A Cube is a lattice of cuboids
        [Figure: the lattice of cuboids forming a data cube, from the 0-D apex cuboid down to the n-D base cuboid]
        A Sample Data Cube
     Data Cube Measures: Three categories
     ●        Distributive: if the result derived by applying the function to n aggregate values is the same as that derived
              by applying the function on all the data without partitioning.
                ○ E.g., count(), sum(), min(), max()
        ● Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer),
              each of which is obtained by applying a distributive aggregate function.
                ○ E.g., avg(), min_N(), standard_deviation()
        ● Holistic: if there is no constant bound on the storage size needed to describe a subaggregate.
                ○ E.g., median(), mode(), rank()
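
The three categories can be checked in a few lines of Python; the data and the partitioning below are invented for illustration:

```python
import statistics

# Two hypothetical partitions of the same data set.
partitions = [[1, 2, 9], [8, 8]]
all_data = [x for p in partitions for x in p]

# Distributive: per-partition sums combine into the global sum.
assert sum(sum(p) for p in partitions) == sum(all_data)

# Algebraic: avg() follows from two distributive values, sum() and count().
total, count = sum(map(sum, partitions)), sum(map(len, partitions))
assert total / count == statistics.mean(all_data)

# Holistic: the global median cannot be rebuilt from per-partition medians.
print(statistics.median([statistics.median(p) for p in partitions]))  # 5.0
print(statistics.median(all_data))                                    # 8
```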
        Data Cubes implemented as relational tables
        ●      Data schemas used to construct data cubes are somewhat different from those used to construct operational
               relational databases, although the underlying entity-relation modeling is the same.
        ●      A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
                 ○ Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year).
                 ○ Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables.
       Conceptual Modeling of Data Warehouses
         ●      Modeling data warehouses using dimension tables and fact tables
                ○ Star schema: A fact table in the middle connected to a set of dimension tables.
                ○ Snowflake schema: A refinement of the star schema where some dimensional hierarchy is normalized into a
                     set of smaller dimension tables, forming a shape similar to a snowflake.
                ○ Fact constellations: Multiple fact tables share dimension tables. Viewed as a collection of stars, this is
                     therefore called a galaxy schema or a fact constellation.
     Example of Star Schema
       [Star schema diagram, reproduced as text:]

          Sales Fact Table: time_key, item_key, branch_key, location_key
          Measures: units_sold, dollars_sold, avg_sales

          Dimension tables:
          time (time_key, day, day_of_the_week, month, quarter, year)
          item (item_key, item_name, brand, type, supplier_type)
          branch (branch_key, branch_name, branch_type)
          location (location_key, street, city, state_or_province, country)
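
As a hedged sketch of how such a schema supports star queries, the following Python/SQLite example builds two of the tables above (branch and location are omitted for brevity, and all rows are invented) and rolls sales up from day to quarter:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT,
                   quarter TEXT, year INTEGER);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                   type TEXT, supplier_type TEXT);
CREATE TABLE sales_fact (time_key INTEGER, item_key INTEGER,
                         units_sold INTEGER, dollars_sold REAL);
INSERT INTO time VALUES (1,'2024-01-05','2024-01','Q1',2024),
                        (2,'2024-04-09','2024-04','Q2',2024);
INSERT INTO item VALUES (10,'phone','Acme','electronics','wholesale');
INSERT INTO sales_fact VALUES (1,10,3,900.0),(2,10,5,1500.0);
""")

# A star query: join the fact table to its dimension tables, then roll up
# from the day level to the quarter level by grouping on a coarser attribute.
for row in conn.execute("""
        SELECT t.quarter, i.type, SUM(f.dollars_sold)
        FROM sales_fact f
        JOIN time t ON f.time_key = t.time_key
        JOIN item i ON f.item_key = i.item_key
        GROUP BY t.quarter, i.type"""):
    print(row)  # drilling down would group by t.month or t.day instead
```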
     Example of Snowflake Schema
       [Snowflake schema diagram, reproduced as text; the item and location hierarchies are normalized into
        supplier and city tables:]

          Sales Fact Table: time_key, item_key, branch_key, location_key
          Measures: units_sold, dollars_sold, avg_sales

          Dimension tables:
          time (time_key, day, day_of_the_week, month, quarter, year)
          item (item_key, item_name, brand, type, supplier_key) -> supplier (supplier_key, supplier_type)
          branch (branch_key, branch_name, branch_type)
          location (location_key, street, city_key) -> city (city_key, city, state_or_province, country)
     Example of Fact Constellation
       [Fact constellation diagram, reproduced as text; two fact tables share the dimension tables:]

          Sales Fact Table: time_key, item_key, branch_key, location_key
          Measures: units_sold, dollars_sold, avg_sales

          Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location
          Measures: dollars_cost, units_shipped

          Shared dimension tables:
          time (time_key, day, day_of_the_week, month, quarter, year)
          item (item_key, item_name, brand, type, supplier_type)
          branch (branch_key, branch_name, branch_type)
          location (location_key, street, city, state_or_province, country)
          shipper (shipper_key, shipper_name, location_key, shipper_type)
      A Star-Net Query Model
                                        OLAP and Star-Net Query Model (video)
     Summary
          Here is a quick recap:
          ● We discussed how data cubes provide a multidimensional view of data in a data warehouse.
          ● We talked about the architecture of a data warehouse using a pictorial representation.
          ● We discussed the 3 different categories of data cube measures: Distributive, Algebraic, and Holistic.
          ● We discussed that a cube is a lattice of cuboids, using a pictorial representation.
          ● We discussed Star schema, Snowflake schema, and Fact constellation schema to store the data.
          ● We learned that the Star-Net Query Model is used to support OLAP queries.
                              Introduction to Classification
     Learning objectives
      Upon completion, you will be able to:
      •      Explain Classification along with Model construction, Model Evaluation, and Performance Improvement
             techniques.
         • Explain Decision Trees and their splitting scenarios.
         • Illustrate the concepts of Entropy, Information Gain, and Gini Impurity.
         • Explain the concepts of Pruning.
         • Illustrate the Naïve Bayes classification method.
         • Explain model evaluation and selection.
         • List techniques to improve classification effectiveness.
                              Classification Overview
        Agenda
       In this session, we will discuss:
       ● Supervised vs Unsupervised learning
       ● Classification vs Regression
       ● Classification - Model construction
       ● Classification - Model Evaluation
       Supervised vs Unsupervised Learning
        ●     Supervised learning (e.g., classification)
                 ○         Supervision: The training data (observations, measurements, etc.) is accompanied by labels indicating
                           the class of the observations.
                 ○         New data is classified based on the model derived from the training set.
                 ○         Risk of overfitting: the model fits the training set too well and does not perform well on new data.
        ●     Unsupervised learning (e.g., clustering, association rules)
                 ○         Does not need class labels.
                 ○         Discovers useful patterns in the dataset (clusters, association rules, features).
                 ○         Must verify that the patterns learned are real patterns and not accidental.
     Classification vs Regression
      ●        Classification
                 ○ Predicts categorical class labels (nominal).
                 ○ The number of labels/classes predicted can range from 1 to many.
                         ○ 1 class: one-class classification/description
                         ○ 2 classes: binary classification problem
                         ○ N classes: multiple-class classification problem
                 ○ One observation can have one or multiple labels (product rankings).
                 ○ Many problems can be represented as classification problems.
                         ○ Election: win or lose, which candidates will win?
                         ○ Spending habits: conservative, impulsive, compulsive spending
                         ○ Handwriting identification: 26 letters or millions of words
                 ○ Goodness is measured by how well predicted labels match the given labels (“gold standard”).
     Classification vs Regression
       ●        Regression (Numeric Prediction)
                   ○ Models continuous-valued functions, i.e., predicts unknown numeric values.
                   ○ Goodness is measured by the closeness of the predicted value to the true value: Mean Squared Error
                         or Mean Absolute Error (see the short sketch after this list).
           ● Many applications
                   ○ Next season’s crop yields
                   ○ Tomorrow’s stock prices
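
A tiny Python sketch of the two error measures mentioned above, on made-up true values and predictions:

```python
# Hypothetical true values and model predictions.
y_true = [3.0, 5.0, 2.5]
y_pred = [2.5, 5.0, 4.0]

n = len(y_true)
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n  # 0.8333...
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n    # 0.6666...
print(mse, mae)
```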
      Classification - A Two-Step Process
       ●       Learning step (induction): construct the model
       ●       Classification step (deduction): using the model to classify new data
       ●       Model construction: Describing a set of predetermined classes
                 ○ Each tuple/sample belongs to a predefined class, as determined by the class label attribute.
                 ○ The set of tuples used for model construction is a training set (more general, “development set”).
                         ○ The set of tuples used for tuning the model hyperparameters or preventing overfitting is the validation set (optional).
                         ○ Training and validation sets are drawn from the same labeled set.
                         ○ Iterate: evaluate model effectiveness, then tune the parameters and/or features.
                 ○ The model is represented as classification rules, decision trees, mathematical formulae, etc.
     Classification - A Two-Step Process
      ●        Model evaluation: Estimate model performance on new data
                 ○ Evaluate the trained model on a separate test set, in the same way as evaluation during the
                         construction
                          ■ The known label of the test sample is compared with the classified result from the model.
                          ■ Performance is measured by the percentage of test set samples correctly classified by the model
                               (many metrics).
                          ■ The test set is independent of the training and validation sets (otherwise the estimate is biased).
         ● If the performance is acceptable, use the model to classify new data.
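
A minimal scikit-learn sketch of the whole two-step process on synthetic data; the split sizes and the choice of a decision tree are illustrative assumptions, not prescribed by the slides:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic labeled data standing in for a real development set.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Hold out an independent test set; the rest is for training/validation.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2,
                                                random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.25,
                                                  random_state=0)

# Learning step: construct the model; tune on the validation set (iterate here).
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))

# Classification step: estimate performance once on the untouched test set.
print("test accuracy:", model.score(X_test, y_test))
```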
      Classification - Model Construction Process
                                    Training &                                                                  Classification
                                    Validation Data                                                             Algorithms
                                                                                     Iterations
                     Name        Rank                       Years               Tenure
                     Mike
sainathgunda99@gmail.com
DLZNK464L9
                                 Assistant Prof             3                   no
                     Mary        Assistant prof             7                   yes                                  Classifier or
                                                                                                                        Model
                     Bill        professor                  2                   yes
                     Jim         Associate Prof             7                   yes
                     Dave        Assistant prof             6                   no
                                                                                                            IF Rank = ‘Professor’
                     Anne        Associate Prof             3                   no
                                                                                                            OR years >6
                                                                                                            Then tenured = ‘yes’
                      Attributes/features                                      Class labels
                                                       This file is meant for personal use by sainathgunda99@gmail.com only.
                                              Proprietary content. ©University of Arizona. All Rights Reserved. Unauthorized use or distribution prohibited.
                                                       Sharing or publishing the contents in part or full is liable for legal action.
       Classification - Model Test and Deploy
       [Diagram: Test Data -> Trained Classifier (Trained Model) -> deploy -> classify Unseen Data]

          Name     Rank            Years   Tenured
          Tom      Assistant Prof  2       no
          Merlisa  Professor       7       yes
          George   Professor       2       yes
          Joseph   Associate Prof  7       yes

          Unseen data: (Jeff, Professor, 4) -> Tenured? Yes
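
The rule learned on the construction slide can be written as a small Python function and checked against the test rows above; applying it to the unseen record (Jeff, Professor, 4) gives 'yes', matching the slide:

```python
def tenured(rank: str, years: int) -> str:
    # Rule learned during model construction:
    # IF rank = 'Professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

test_set = [("Tom", "Assistant Prof", 2, "no"),
            ("Merlisa", "Professor", 7, "yes"),
            ("George", "Professor", 2, "yes"),
            ("Joseph", "Associate Prof", 7, "yes")]

correct = sum(tenured(rank, yrs) == label for _, rank, yrs, label in test_set)
print(f"test accuracy: {correct}/{len(test_set)}")  # 4/4

print(tenured("Professor", 4))  # unseen data: Jeff, Professor, 4 -> 'yes'
```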
     Summary
         Here is a quick recap:
         ● We discussed that supervised learning needs a class label to train a model, whereas unsupervised learning
             does not need a class label.
         ● We compared classification and regression: classification predicts nominal class labels, whereas regression
             models continuous-valued functions, i.e., predicts unknown numeric values.
         ● We talked about how model construction uses training and validation datasets.
         ● We discussed model evaluation on a separate test dataset before deployment to classify future/unseen
             objects.
         ● We talked about the risk of overfitting in supervised learning and the need to control that risk.
                           Introduction to Decision Tree
       Agenda
         In this session, we will discuss:
         ● Introduction to Decision Tree
         ● Algorithm for Decision Tree induction
         ● Splitting scenario for Decision Tree
     Decision Trees
      ●    Decision trees can be used to solve both classification and regression problems. As the name implies, a
           decision tree uses a flowchart-like tree structure to display the predictions that result from a series of
           feature-based splits: it begins at a root node and ends with decisions made at the leaves.
      ●    Decision trees can approximate any classification/regression function arbitrarily closely.
          ● Risks overfitting
          ● Many decision tree algorithms differ in two aspects:
                  ○ Attribute selection measures
                  ○ Tree pruning methods
      Decision Tree: An Introductory Example
      Algorithm for Decision Tree Induction
     ●         Basic (Hunt’s) algorithm (a greedy algorithm)
                ○ Tree is constructed in a top-down, recursive, divide-and-conquer manner.
                ○ At the start, all the training examples are at the root.
                ○ Examples are partitioned/split recursively based on selected attributes.
                ○ Splitting attributes are selected on the basis of a heuristic or statistical measure (e.g., information
                        gain).
        ● Condition for stopping partitioning when growing a full tree (one of these)
                ○ All samples for a given node belong to the same class.
                ○ There are no suitable attributes for further partitioning: the most frequent class among the node's
                        samples is used to label this leaf node.
                ○ There are no samples for the node: use the most frequent label of the parent node.
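
A compact Python sketch of this recursive procedure for discrete attributes; the weighted misclassification rate below is a stand-in heuristic (information gain, covered later, is more common), and pruning and continuous attributes are omitted:

```python
from collections import Counter

def majority(labels):
    # Most frequent class label in a list.
    return Counter(labels).most_common(1)[0][0]

def grow_tree(rows, labels, attrs, parent_majority=None):
    """Hunt-style induction. rows: list of dicts mapping attribute -> value."""
    if not rows:                      # no samples: use the parent's majority label
        return parent_majority
    if len(set(labels)) == 1:         # all samples belong to the same class
        return labels[0]
    if not attrs:                     # no attributes left: majority vote
        return majority(labels)

    # Greedy step: choose the attribute whose split gives the purest partitions.
    def weighted_impurity(attr):
        score = 0.0
        for value in {r[attr] for r in rows}:
            part = [l for r, l in zip(rows, labels) if r[attr] == value]
            score += len(part) / len(rows) * (1 - part.count(majority(part)) / len(part))
        return score

    best = min(attrs, key=weighted_impurity)
    children = {}
    for value in {r[best] for r in rows}:  # partition recursively on the split
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        children[value] = grow_tree([r for r, _ in sub], [l for _, l in sub],
                                    [a for a in attrs if a != best],
                                    majority(labels))
    return {"split_on": best, "children": children}
```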
       Decision Tree: Splitting results in partitions that are ‘purer’
       Splitting Scenarios
       ● Discrete-valued attribute
       ● Continuous-valued attribute
       ● Discrete-valued attribute in a binary tree
     Summary
        Here is a quick recap:
        ● We talked about decision trees, a flowchart-like tree structure used to solve classification and regression
             problems.
        ● We discussed a greedy algorithm in which a decision tree is constructed in a top-down, recursive, divide-and-
             conquer manner.
        ● We learned splitting results in partitions (child nodes) that are purer in class composition than the parent node.
        ● We learned the splitting scenario for the decision tree using a flowchart-like tree structure.
                       Attribute Selection Measure: Entropy and Information Gain
       Agenda
        In this session, we will discuss:
        ● Attribute selection measures used in Decision Tree
        ● Information entropy
        ● Conditional entropy
        ● Information Gain
        ● Gain Ratio
     Information Entropy
      ●         A measure of uncertainty associated with a random variable (information needed to describe this variable).
      ●         For a discrete random variable Y that takes m distinct values (y_1, …, y_m):
                 ○ H(Y) = −Σ_{i=1..m} p_i · log2(p_i), where p_i = P(Y = y_i)
                 ○ H(Y) ∈ [0, log2 m]
                 ○ In classification, Y is the class label variable, and y_1, …, y_m are the class labels.
          ● Interpretation:
                 ○ Higher entropy => higher impurity [many classes with equal chance to appear].
                 ○ Lower entropy => low impurity [e.g., one class dominates].
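
A short numerical check of these interpretations, using a minimal Python helper (not from the slides):

```python
import math

def entropy(probs):
    # H(Y) = -sum(p_i * log2(p_i)); 0 * log2(0) is treated as 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0  -> maximal impurity for m = 2 classes
print(entropy([0.25] * 4))   # 2.0  -> log2(m) for m = 4 equally likely classes
print(entropy([0.9, 0.1]))   # ~0.47 -> lower: one class dominates
print(entropy([1.0]))        # 0.0  -> a pure node
```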
       Conditional Entropy
         ●      Conditional entropy (or equivocation) quantifies the amount of information needed to describe the
                outcome of Y given that the value of another random variable X is known.
                 ○ It is the weighted sum of entropies over all conditions: H(Y|X) = Σx P(X = x) H(Y|X = x).
                 ○ H(Y|X) ≤ H(Y), because splitting on any X results in partitions that are purer (no messier) than
                    the parent.
                 ○ Equality ("=") holds when X is independent of Y.
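       Continuing the sketch above (again my own illustration, reusing the entropy helper), the weighted sum looks
       like this:

          def conditional_entropy(partitions):
              """H(Y|X): size-weighted sum of child entropies, one count list per value of X."""
              total = sum(sum(p) for p in partitions)
              return sum(sum(p) / total * entropy(p) for p in partitions)

          # Splitting the 14-tuple buys_computer example on age:
          # age<=30 -> (2 yes, 3 no), 31..40 -> (4 yes, 0 no), >40 -> (3 yes, 2 no)
          print(conditional_entropy([[2, 3], [4, 0], [3, 2]]))  # ~0.694 <= H(Y) ~0.940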
       Attribute Selection Measure: Information Gain(ID3/C4.5)
         ●  Let pi be the probability that an arbitrary tuple in D belongs to class Ci, i ∈ [1, m], estimated by
            |Ci,D|/|D|.
         ●  Entropy (expected information) of D w.r.t. C: Info(D) = −Σi pi log2(pi)
         ●  Conditional entropy: entropy after using A to split D into v partitions: InfoA(D) = Σj (|Dj|/|D|) × Info(Dj)
         ●  Information gained by branching on attribute A:
                             Gain(A) = Info(D) − InfoA(D)
         ●  Select the attribute with the highest information gain.
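       Putting the two helpers together (still an illustrative sketch, not code from the course), information gain
       is just the drop in entropy:

          def info_gain(parent_counts, partitions):
              """Gain(A) = Info(D) - InfoA(D)."""
              return entropy(parent_counts) - conditional_entropy(partitions)

          # Age example from the next slide: Info(D) = I(9,5), partitions as above
          print(round(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))
          # ~0.247 exactly; the slides' 0.246 comes from rounding Info(D) and Infoage(D) first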
       Attribute Selection Measure: Information Gain Example: Data
       [Table: 14 training tuples with attributes age, income, student, and credit_rating, and class label
       buys_computer]
       Attribute Selection Measure: Information Gain Example
          ●  Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples), so
             Info(D) = I(9,5) = 0.940.
          ●  Test age, income, student, and credit_rating one by one.

               Age        pi    ni    I(pi, ni)
               <=30       2     3     0.971
               31…40      4     0     0
               >40        3     2     0.971

             Infoage(D) = 5/14 × I(2,3) + 4/14 × I(4,0) + 5/14 × I(3,2) = 0.694
             (5/14 × I(2,3) means 'age <= 30' has 5 of the 14 samples, with 2 yes's and 3 no's.)

             Gain(age) = Info(D) − Infoage(D) = 0.246
             Gain(income) = 0.029
             Gain(student) = 0.151
             Gain(credit_rating) = 0.048

             Gain(age) is the largest, so split on age.
        Gain Ratio for Attribute Selection (C4.5)
          ●      Information Gain measure is biased towards attributes with a large number of distinct values, e.g., Student
                 ID.
          ●      C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalize info gain w.r.t. attribute value
                 distribution).
          ●      The attribute with the maximum gain ratio is selected as the splitting attribute
                          ■ GainRatio(A) = Gain(A) / SplitInfo(A)
                          ■ SplitInfo is the entropy of attribute A w.r.t. its possible values:
                             SplitInfoA(D) = −Σj (|Dj|/|D|) log2(|Dj|/|D|). More values, greater SplitInfo.
          ●  E.g., income has three values: low (4 tuples), medium (6), and high (4):
                 SplitInfoincome(D) = −4/14 × log2(4/14) − 6/14 × log2(6/14) − 4/14 × log2(4/14) = 1.557
                 GainRatio(income) = 0.029 / 1.557 = 0.019
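       A short Python sketch of the normalization (my own illustration, in the same style as the earlier helpers):

          import math

          def split_info(partition_sizes):
              """SplitInfo(A): entropy of attribute A w.r.t. its value distribution."""
              total = sum(partition_sizes)
              return -sum(s / total * math.log2(s / total) for s in partition_sizes if s > 0)

          def gain_ratio(gain, partition_sizes):
              """GainRatio(A) = Gain(A) / SplitInfo(A)."""
              return gain / split_info(partition_sizes)

          # income: 4 low, 6 medium, 4 high tuples out of 14
          print(round(split_info([4, 6, 4]), 3))         # 1.557
          print(round(gain_ratio(0.029, [4, 6, 4]), 3))  # 0.019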
       Summary
         Here is a quick recap:
         ● We discussed Information Entropy, a measure of uncertainty associated with a random variable.
         ● We learned Conditional Entropy, which quantifies the amount of information needed to describe the outcome
           of Y given X’s values.
         ● We learned Information Gain and Gain ratio as attribute selection measures.
                       Attribute Selection Measure: Gini Impurity
        Agenda
        In this session, we will discuss:
        ● Gini impurity index and its computation.
        ● Comparison between attribute selection measures.
       Gini Impurity Index (CART, IBM Intelligent Miner)
         ●  If a data set D contains examples from n classes, the gini impurity index, gini(D), is defined as:
                 ○ gini(D) = 1 − Σj pj², where pj is the relative frequency of class j in D; gini(D) ∈ [0, 1).
         ●  If D is split on A into two subsets D1 and D2, the gini index given the partition is the size-weighted sum:
                 giniA(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)
            and the reduction in impurity is Δgini(A) = gini(D) − giniA(D).
         ●  The attribute that provides the largest reduction in impurity (or the smallest giniA(D)) is chosen to split
            the node and grow two branches.
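       As with entropy, the index is easy to compute from counts (an illustrative Python sketch, not part of the
       slides):

          def gini(counts):
              """gini(D) = 1 - sum_j pj^2, computed from per-class counts."""
              total = sum(counts)
              return 1.0 - sum((c / total) ** 2 for c in counts)

          def gini_split(partitions):
              """gini_A(D): size-weighted Gini over the subsets produced by a split."""
              total = sum(sum(p) for p in partitions)
              return sum(sum(p) / total * gini(p) for p in partitions)

          print(round(gini([9, 5]), 3))  # 0.459 for the 14-tuple buys_computer data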
         Computation of Gini Index
          ●  Ex. D has 9 tuples with buys_computer = "yes" and 5 with "no":
                     Gini(D) = 1 − (9/14)² − (5/14)² = 0.459
          ●  Try attribute income: first partition D into 10 tuples in D1: {low, medium} and 4 in D2: {high}
                     Giniincome ∈ {low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
                                                  = 10/14 (1 − (7/10)² − (3/10)²) + 4/14 (1 − (2/4)² − (2/4)²)
                                                  = 0.443
                                                  = Giniincome ∈ {high}(D)
             Then Gini{low,high} = 0.458 and Gini{medium,high} = 0.450.
             Thus, the best split on income is {low, medium} (and {high}), since it produces the highest reduction
             in Gini index.
          ●  Compared to the other attributes age, student, and credit_rating, we found Giniage ∈ {young,senior} = 0.357
             to be the best splitting condition.
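       The income computation above can be checked with the gini_split helper from the previous sketch:

          # D1 = {low, medium}: 10 tuples (7 yes, 3 no); D2 = {high}: 4 tuples (2 yes, 2 no)
          print(round(gini_split([[7, 3], [2, 2]]), 3))                 # 0.443
          print(round(gini([9, 5]) - gini_split([[7, 3], [2, 2]]), 3))  # ~0.016 reduction in impurity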
     Comparing Attribute Selection Measures
       ●  Optimal tree: minimize the number of tests needed to classify an input; finding it is NP-complete
          (Hyafil and Rivest, 1976).
      ●        Shallower trees are quicker to learn, while deeper trees tend to be more accurate but risk overfitting.
      ●        The three measures, in general, result in similar performance.
                 ○ Information gain: biased in favor of multivalued attributes.
                 ○ Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
                 ○ Gini index: biased in favor of multivalued attributes.
                         ■ Faster to compute because it doesn't use log().
      Summary
         Here is a quick recap:
         ● We discussed the Gini impurity index where if an attribute has a large reduction in impurity, it is chosen to
             split the node and grow two branches.
          ● We compared the attribute selection measures: information gain is biased in favor of multivalued
            attributes, gain ratio tends to prefer unbalanced splits in which one partition is much smaller than the
            others, and the Gini index is also biased in favor of multivalued attributes.
          ● Still, they generally give similar performance.
                           Decision Tree Pruning
       Agenda
        In this session, we’ll learn about:
        ● Overfitting
        ● Introduction to Pruning
        ● Classification in large databases
        ● Importance of decision tree
     Overfitting and Tree Pruning
      ●        Overfitting: An induced tree may overfit the training data.
                 ○ Too many branches; some may reflect anomalies due to noise or outliers.
                 ○ Poor accuracy for unseen samples.
          ● Two approaches to avoid overfitting
                 ○ Prepruning: Halt tree construction early. Do not split a node if this would result in the goodness
                         measure falling below a threshold.
                           ■ Difficult to choose an appropriate threshold.
                           ■ Set thresholds on tree depth or the minimum number of objects in a node.
                 ○ Postpruning: Remove branches from a "fully grown" tree to get a sequence of progressively pruned
                         trees.
          ● May use a validation dataset different from the training data to decide which is the “best-pruned tree”.
       Tree Pruning
         ●      General strategy
                  ○ Compare a goodness measurement (Info Gain, Gain Ratio, Gini) with/without a candidate branch.
                  ○ If the measurement is better without the branch, prune it.
                  ○ Prune the tree from the bottom up.
         ●  For decision trees, the impact of tree pruning on the performance of the resulting trees is more
            significant than that of the splitting measures.
      Classification in Large Databases
      ●        Why is decision tree induction popular?
                ○ Relatively faster learning speed (than other classification methods).
                ○ Convertible to simple and easy-to-understand classification rules.
                ○ Comparable classification accuracy with other methods.
         ● Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed
         ●  To train a DT on a large dataset, when we cannot read all training examples into memory:
                ○ Compute and store only a contingency table for each attribute (AVC set: Attribute, Value, Class
                   label counts).
        Summary
        Here is a quick recap:
        ● We discussed that an induced tree might overfit the training data.
        ● We discussed pruning and that the impact of tree pruning is more significant than the splitting measures on
          the performance of the resulting trees.
        ● We talked about classification in large databases and scalability.
        ● We learned why decision trees are important: they are relatively fast to train and produce human-readable
          rules.
                           Bayesian Classification Methods
       Agenda
         In this session, we will discuss:
         ● Need for Bayesian classification methods
         ● Naive Bayes classifier
         ● Advantages and disadvantages of Naïve Bayes classifier
         ● Avoiding the zero-probability problem
      Bayesian Classifiers: Why?
      ●  Probabilistic classifiers grounded in probability theory: they perform probabilistic prediction, i.e.,
         predict class membership probabilities.
          ● Foundation: Based on Bayes’ Theorem.
          ● Performance: A simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with
               decision trees and some shallow neural network classifiers.
          ● Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is
               correct — prior knowledge can be combined with observed data.
          ● Baseline benchmark: They can provide a standard of optimal decision-making against which other methods
               can be measured.
        Deriving the Maximum Posterior: Bayes’ theorem
        ●      Let D be a training set of tuples and their associated class labels, and each tuple is represented by an n-
               attribute vector X = (x1, x2, …, xn).
        ●      Suppose there are m classes C1, C2, …, Cm.
        ●  Classification is equivalent to deriving the maximum a posteriori probability, i.e., the maximal P(Ci|X).
           (P(Ci) is the prior probability.)
        ●  This can be derived from Bayes' theorem:
                   P(Ci|X) = P(X|Ci) P(Ci) / P(X)
       Deriving the Maximum Posterior
          ●  Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be computed.
          ●  Predict that X belongs to Ci if P(Ci|X) is the highest among all P(Ck|X) for the m classes.
      Naïve Bayes Classifier
        ●  A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence
           relation between attributes):
                   P(X|Ci) = Πk P(xk|Ci), k ∈ [1, n]
        ●  This greatly reduces the computation cost: only count the attribute value distributions within each class.
        ●  If Ak is categorical, P(Ak = xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci,D|
           (# of tuples of Ci in D).
        ●  If Ak is continuous-valued, P(Ak = xk|Ci) is usually computed from a Gaussian distribution with mean μ and
           standard deviation σ:
                   P(Ak = xk|Ci) = g(xk, μCi, σCi), where g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)² / (2σ²))
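       For the continuous case, a tiny Python sketch of the Gaussian density (illustrative only; the μ and σ values
       below are made up, and in practice they would be estimated from the tuples of class Ci):

          import math

          def gaussian_likelihood(x, mu, sigma):
              """P(Ak = x | Ci) under a Gaussian model with class-conditional mu and sigma."""
              coeff = 1.0 / (math.sqrt(2 * math.pi) * sigma)
              return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

          print(gaussian_likelihood(35.0, mu=38.0, sigma=12.0))  # density of age=35 given the class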
       Naïve Bayes Classifier: Example Training Dataset
        Class:
        C1: buys_computer = "yes"
        C2: buys_computer = "no"
        [Table: the 14 training tuples shown earlier]
        Tuple to be classified:
        X = (age <= 30, income = medium, student = yes, credit_rating = fair)
     Naïve Bayes Classifier: An Example
       ●    X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
       ●        P(Ci):P(buys_computer = “yes”) = 9/14 = 0.643
                  ○      P(buys_computer = “no”) = 5/14= 0.357
          ● Compute P(X|Ci) for each attribute value:
                 ○ P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
                 ○ P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
                 ○ P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
                 ○ P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
       Naïve Bayes Classifier: An Example
        ●      Compute P(X|Ci) for each attribute value
                ○     P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
                ○     P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
                ○     P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
                ○     P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
        ●     X = (age <= 30, income = medium, student = yes, credit_rating = fair)
        ●  Compute the overall P(X|Ci) for this X for the two classes:
                ○ P(X|buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
                ○ P(X|buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
        ●  Finally, compute P(X|Ci) × P(Ci):
                ○ P(X|buys_computer = "yes") × P(buys_computer = "yes") = 0.044 × 0.643 = 0.028
                ○ P(X|buys_computer = "no") × P(buys_computer = "no") = 0.019 × 0.357 = 0.007
        ●  Therefore, X belongs to class "buys_computer = yes".
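       The whole worked example fits in a few lines of Python (a sketch of my own; the fractions are the counts
       from the slides):

          # Priors and class-conditional probabilities from the 14-tuple training data
          p_yes, p_no = 9/14, 5/14
          like_yes = (2/9) * (4/9) * (6/9) * (6/9)  # age, income, student, credit | yes
          like_no  = (3/5) * (2/5) * (1/5) * (2/5)  # age, income, student, credit | no
          score_yes, score_no = like_yes * p_yes, like_no * p_no
          print(round(score_yes, 3), round(score_no, 3))  # 0.028 0.007
          print("yes" if score_yes > score_no else "no")  # -> yes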
     Avoiding the Zero-Probability Problem
     ●    Naïve Bayesian prediction requires each conditional probability to be non-zero. Otherwise, the predicted
          probability will be zero.
      ●  Ex. Suppose a "buy" training dataset with 1000 tuples: income = low (0 tuples), income = medium (990), and
         income = high (10). Now we need to predict the class for a new example with low income.
       Avoiding the Zero-Probability Problem
        ●      Use Laplacian correction (or “Laplace estimator”)
                ○ Adding 1 to each attribute value count:
                       ■ Prob(income = low|buy=yes) = 1/1003
                       ■ Prob(income = medium|buy=yes) = 991/1003
                       ■ Prob(income = high|buy=yes) = 11/1003
                ○ The “corrected” probability estimates are close to their original counterparts.
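       A minimal sketch of the add-1 correction (my own illustration):

          def laplace_smoothed(counts):
              """Add-1 (Laplacian) correction applied to attribute-value counts."""
              total = sum(counts) + len(counts)   # add 1 to each of the value counts
              return [(c + 1) / total for c in counts]

          # income within buy = yes: low 0, medium 990, high 10 (1000 tuples, 3 values)
          print(laplace_smoothed([0, 990, 10]))   # i.e., 1/1003, 991/1003, 11/1003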
     Naïve Bayes Classifier: Comments
       ●        Advantages
                  ○ Easy to implement, incremental.
                  ○ Good results have been obtained for many classification tasks, despite the violation of the
                         conditional independence assumption.
           ● Disadvantages
                  ○ Assumption: class conditional independence, therefore loss of accuracy.
                  ○ Practically, dependencies could exist among variables.
                           ■ E.g., age, family history, smoking, lung cancer.
                           ■ Dependencies among these cannot be modeled by Naïve Bayes Classifier.
                           ■ Bayesian Belief Network addresses this issue to some extent.
     Summary
          Here is a quick recap:
          ● We learned that Bayesian classification methods are based on Bayes' theorem.
          ● We discussed the Naive Bayes classifier, whose conditional independence assumption greatly reduces the
            computation cost.
          ● We talked about how Naïve Bayesian prediction requires each conditional probability to be non-zero;
            otherwise, the predicted probability will be zero.
          ● We learned the advantages and disadvantages of the Naïve Bayes classifier: it is easy to implement but
            risks a loss of accuracy.
                           Model Evaluation and Selection
        Agenda
        In this session, we’ll learn about:
        ● Classifier Effectiveness Evaluation Metrics
       Model Evaluation and Selection
        ●      Models are trained, but how good are they for new data?
        ●      Evaluation metrics: How can we measure the effectiveness?
        ●      Use validation sets of labeled tuples instead of the training set to refine/tune and optimize model
               performance.
       Classifier Evaluation Metrics: Confusion Matrix
        Confusion Matrix:

                                       Predicted: C1 (Positive)     Predicted: C2 (Negative)
         Actual C1 (Positive)          True Positives (TP)          False Negatives (FN)
         Actual C2 (Negative)          False Positives (FP)         True Negatives (TN)

        ●  Given m classes, an entry Ci,j in a confusion matrix indicates the # of tuples in class i that were
           labeled by the classifier as class j.
        ●  Shown here is a matrix for a binary classification problem, but the idea applies to multi-class
           classification (add C3, …, Cn).
        ●  The matrix may have extra rows/columns to provide totals.
       Classifier Evaluation Metrics: Confusion Matrix
         Example of a confusion matrix:

          Actual \ Predicted        buys_computer = yes    buys_computer = no    Total
          buys_computer = yes       6954                   46                    7000
          buys_computer = no        412                    2588                  3000
          Total                     7366                   2634                  10000
        Classifier Evaluation Metrics
         ●  Classifier accuracy, or recognition rate: percentage of tuples that are correctly classified.
                    Accuracy = (TP + TN) / All
         ●  Error rate: 1 − accuracy, or (FP + FN) / All
         ●  Per-class rates: True Positive Rate = TP/P, False Negative Rate = FN/P, False Positive Rate = FP/N,
            True Negative Rate = TN/N

             A \ P    C     ¬C
             C        TP    FN     P
             ¬C       FP    TN     N
                      P'    N'     All
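       These definitions translate directly into code (an illustrative Python sketch using the buys_computer matrix
       above):

          def rates(tp, fn, fp, tn):
              """Basic evaluation metrics from the four confusion-matrix cells."""
              p, n = tp + fn, fp + tn          # actual positives and negatives
              total = p + n
              return {"accuracy": (tp + tn) / total,
                      "error_rate": (fp + fn) / total,
                      "tpr": tp / p, "fnr": fn / p,
                      "fpr": fp / n, "tnr": tn / n}

          print(rates(tp=6954, fn=46, fp=412, tn=2588))  # accuracy ~0.954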
        Classifier Evaluation Metrics: Sensitivity and Specificity
       ●  Accuracy and error rate are not good for imbalanced-class classification problems:
              ○ One class may be rare, e.g., fraud or HIV-positive.
              ○ A significant majority of the data belongs to the negative class and a small number belongs to the
                 positive class: accuracy and error rate are dominated by the negative class.
              ○ Sensitivity: True Positive Rate (recall).
              ○ Specificity: True Negative Rate.

                      P      N       Total
              P       10     90      100
              N       90     9810    9900
              Total   100    9900    10000

          Accuracy = 9820/10000 = 98.2%
          Sensitivity = 10/100 = 10%
          Specificity = 9810/9900 = 99.1%
          Sensitivity and specificity together reflect the true performance.
      Classifier Evaluation Metrics: Precision, Recall, and F-measures
       ●    Precision: Exactness – what % of tuples that the classifier labeled as positive are actually positive.
                                   precision = TP/ (TP+FP)
       ●    Recall/Sensitivity: Completeness – what % of positive tuples did the classifier label as positive?
                                     recall = TP/(TP+FN)
            ● Perfect score for precision and recall: 1.0.
            ● There is an inverse relationship between precision and recall.
            ● F measure (F1 or F-score): harmonic mean of precision and recall:
                            F1 = (2 × precision × recall) / (precision + recall)
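       A compact Python sketch of the three metrics (illustrative; the numbers come from the cancer example two
       slides below):

          def precision_recall_f1(tp, fp, fn):
              """Precision, recall, and their harmonic mean F1."""
              precision = tp / (tp + fp)
              recall = tp / (tp + fn)
              f1 = 2 * precision * recall / (precision + recall)
              return precision, recall, f1

          print(precision_recall_f1(tp=90, fp=140, fn=210))  # (~0.391, ~0.300, ~0.340)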
         Classifier Evaluation Metrics: Precision, Recall, and F-measures
         Fβ: a weighted measure of precision and recall:
           ●  Assigns β times as much weight to recall as to precision; β > 1 weights recall more.
                 Fβ = (1 + β²) × precision × recall / (β² × precision + recall)
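       And the weighted variant (same sketch style; F1 is the special case β = 1):

          def f_beta(precision, recall, beta):
              """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
              b2 = beta ** 2
              return (1 + b2) * precision * recall / (b2 * precision + recall)

          print(round(f_beta(0.391, 0.300, beta=2), 3))  # F2 weights recall more heavily than F1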
       Classifier Evaluation Metrics: Example
          Actual \ Predicted     Cancer = yes    Cancer = no    Total
          Cancer = yes           90              210            300
          Cancer = no            140             9560           9700
          Total                  230             9770           10000
         ●      Accuracy = (90+9560) / 10000 = 96.50%
         ●      Precision = 90/230 = 39.13%, Recall = 90/300 = 30.00%
         ●      Sensitivity= 90/300 = 30%, Specificity= 9560/9700 = 98.56%
       Estimate Model Effectiveness
          ●  This applies to both classification and numerical regression.
          ●      Performance metrics obtained from one run is often not a reliable estimate of the classifier’s performance.
                  ○ Different splits of the labeled dataset into training and validation often result in different performance
                        scores – which score should we trust? Certainly not the best one!
                  ○ To obtain unbiased performance score: need validation datasets different from training.
                  ○ To obtain a reliable estimate: we need to evaluate the model multiple times.

                    Always hold out the final test dataset!
        Review
                  Labeled data                    Usage
                  Training / Validation data      Model selection, performance tuning, performance estimate
                  Test data                       Final evaluation
       Summary
       Here is a quick recap:
        ● We discussed the various evaluation metrics derived from a confusion matrix, including precision, recall,
          and F1-score, for measuring the effectiveness of a classifier.
       ● Regression and classification: unbiased and reliable estimate of model performance requires a separate test
          dataset and multiple runs of the model.
                           Performance Estimate Methods
        Agenda
         In this session, we will discuss:
          ● Holdout, cross-validation, and bootstrap
         ● Model selection: ROC curves
         ● Issues affecting model selection
        Evaluating Classifier Effectiveness: Holdout
       ●      Holdout method (for large data sets)
               ○ Given data is randomly partitioned into two independent sets.
                     ■ Training set (>=2/3) for model construction
                     ■ Validation set (<= 1/3) for evaluation
                ○ Random subsampling: a variation of holdout.
                      ■ Repeat holdout k times; overall effectiveness = the average of the k effectiveness scores.
       Evaluating Classifier Effectiveness: Cross-Validation Methods
          ●     Cross-validation (k-fold, k = 3, 5, 10, for medium-sized data)
                  ○ Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each approximately equal
                    in size.
                  ○ At the i-th iteration, use Di as the test/validation set and the others as the training set.
                  ○ Variations:
                       ○ Leave-one-out: k folds where k = the number of tuples; for small-sized data.
                       ○ Stratified cross-validation: folds are stratified so that the class distribution in each fold
                          is approximately the same as that in the initial data.
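       A bare-bones k-fold partitioner in Python (an illustrative sketch; real projects would typically use a
       library routine such as scikit-learn's KFold):

          import random

          def k_fold_indices(n, k, seed=0):
              """Randomly partition indices 0..n-1 into k roughly equal folds."""
              idx = list(range(n))
              random.Random(seed).shuffle(idx)
              return [idx[i::k] for i in range(k)]

          folds = k_fold_indices(n=14, k=5)
          for i, val in enumerate(folds):
              train = [j for f in range(len(folds)) if f != i for j in folds[f]]
              print(f"iteration {i}: {len(train)} training tuples, {len(val)} validation tuples")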
      Evaluating Classifier Effectiveness: Bootstrap
        ●  Bootstrap
              ●  Used on small data sets; tends to be optimistic.
              ●  Samples the labeled tuples with replacement.
                    ●  I.e., each time a tuple is selected, it is put back into the training set and is equally likely
                       to be selected again.
              ●  There are several bootstrap methods, and a common one is the .632 bootstrap:
                    ○  A data set with d tuples is sampled d times, with replacement,
                    ○  resulting in a training set of d samples.
                    ○  The data tuples that did not make it into the training set form the validation set.
                    ○  About 63.2% of the original data end up in the training set, and the remaining 36.8% form the
                       validation set (since (1 − 1/d)^d ≈ e^−1 = 0.368).
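       The 63.2/36.8 split is easy to see empirically (a quick Python sketch of one bootstrap sample):

          import random

          def bootstrap_split(n, seed=0):
              """Sample n indices with replacement for training; out-of-bag tuples validate."""
              rng = random.Random(seed)
              train = [rng.randrange(n) for _ in range(n)]
              oob = sorted(set(range(n)) - set(train))   # tuples never drawn
              return train, oob

          train, val = bootstrap_split(n=1000)
          print(len(set(train)) / 1000, len(val) / 1000)  # roughly 0.632 vs 0.368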
       Evaluating Classifier Effectiveness: Bootstrap
         ●  Repeat the sampling procedure k times (e.g., 30 to hundreds of times); the overall accuracy of the model
            combines the test and training scores of each iteration:
                   Acc(M) = (1/k) Σi [0.632 × Acc(Mi)test_set + 0.368 × Acc(Mi)train_set]
       Compare Performance: Classifier Models M1 vs. M2
          ●  Suppose we have two classifiers, M1 and M2; which one is better?
         ●      Use a performance estimate method to obtain K performance scores for each model.
         ●      These performance scores are just estimates of scores on the population, including future data cases.
          ●  What if the differences in scores between the two models were simply due to chance?
                  ○ Use a test of statistical significance:
                       ○ Unpaired two-sample t-test: compare two models trained/validated on different data sets
                          (scores are not paired).
                       ○ Paired two-sample t-test: compare two models trained/validated on the same data sets
                          (scores are in pairs).
                       ○ ANOVA: compare more than two models.
                  ○ Obtain a confidence level for the test:
                       ○ "At the 95% confidence level, the performances were statistically significantly different,"
                       ○ or "the performance differences were not statistically significant at the 95% confidence
                          level."
       Model Selection: ROC Curves
        ●       ROC (Receiver/Relative Operating Characteristics) curves: for visual comparison of
                classification models.
           ●    Originated from the signal detection theory.
           ●    Shows the tradeoff between the true positive rate and the false positive rate.
           ●    The Area Under the ROC Curve (AUC) is a measure of the accuracy of the model.
                ● Vertical axis represents the true positive rate.
                ● Horizontal axis represents the false positive rate.
                 ● The plot also shows two model performance curves and a diagonal line.
                 ● The closer a curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate
                   the model.
                 ● A model with perfect accuracy will have an area of 1.0.
         ●  To use ROC, the model must return the probability of the predicted class for each observation.
        Model Selection: ROC Curves
           ●  Sort observations in decreasing order of predicted probability: the one most likely to belong to the
              positive class appears at the top of the list.
           ●  From the first row down, use each probability as the threshold to predict class labels:
                  ●  Compute TP, FP, TN, and FN.
                  ●  Compute TP rate = TP/P and FP rate = FP/N.
           ●  Plot the (FP rate, TP rate) pairs, as in the sketch below.
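       A minimal Python sketch of this construction (my own illustration with made-up scores):

          def roc_points(scores, labels):
              """(FP rate, TP rate) pairs from thresholding at each sorted score."""
              p = sum(labels)                   # number of actual positives (labels are 0/1)
              n = len(labels) - p               # number of actual negatives
              ranked = sorted(zip(scores, labels), reverse=True)  # most likely positive first
              tp = fp = 0
              points = [(0.0, 0.0)]
              for _, y in ranked:
                  if y == 1:
                      tp += 1
                  else:
                      fp += 1
                  points.append((fp / n, tp / p))
              return points

          print(roc_points([0.9, 0.8, 0.7, 0.6, 0.4], [1, 1, 0, 1, 0]))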
          ●     ROC is useful for balanced classes.
                     o     Another curve known as Precision-Recall Curve is useful for imbalanced classes.
        Model Selection: ROC Curves
        [Figure: ROC curves of two models plotted against the diagonal baseline]
       Issues Affecting Model Selection
        ●     Effectiveness (precision, recall, etc.)
                 ○         Classifier performance: predicting the class label.
        ●     Speed
                 ○         Time to construct the model (training time).
                 ○  Time to use the model (classification/prediction time).
        ●     Robustness: handling noise and missing values
        ●     Scalability: efficiency in disk-resident databases or the cloud.
        ●     Interpretability
                 ○         Understanding and insight provided by the model.
        ●     Portability: maintain performance level on new data
        ●     Other measures, e.g., the goodness of rules, such as decision tree size or compactness
              of classification rules.
       Summary
        Here is a quick recap:
        ● We discussed the Holdout method (for large data sets) in which data is randomly partitioned into two
          independent sets.
        ● We learned Cross-validation (k-fold, k = 3, 5, 10, for medium-sized data) that randomly partitions the data into
          k mutually exclusive subsets, each approximately equal in size.
        ● We discussed that the bootstrap, used on small data sets, tends to be optimistic.
        ● We discussed t-tests and ROC curves for comparing classification models.
        ● We learned about other issues that affect the model selection, such as effectiveness, robustness, scalability,
             etc.
                      Techniques to Improve Classification Accuracy: Ensemble Methods and Imbalanced Classes
       Agenda
        In this session, we'll learn about:
        ● Techniques to improve classification accuracy.
        ● Ensemble methods
        ● Imbalanced class problems.
     Ensemble Methods: Improve Performance
      ●    Ensemble methods
                  ○ Train k models using the training data.
                   ○ Let all k models make predictions on new data.
                   ○ Combine the predictions from the k models to make the final prediction.
                   ○ In other words, combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*.
           ● Popular ensemble methods
                  ○ Bagging: majority vote or averaging the prediction over a collection of classifiers.
                  ○ Boosting: weighted vote with a collection of classifiers.
                  ○ Random Forests: building a large number of ‘random’ trees.
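To make the generic recipe above concrete, here is a minimal sketch, assuming scikit-learn is available; the three base learners are illustrative choices, not a prescribed combination:

```python
# Minimal sketch of the generic ensemble recipe: train k models, let each
# predict, and combine the predictions. Assumes scikit-learn; the three
# base learners (M1, M2, M3) are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1_000, random_state=0)

# Combine M1, M2, M3 into an improved model M* via majority vote.
m_star = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("nb", GaussianNB()),
        ("lr", LogisticRegression(max_iter=1_000)),
    ],
    voting="hard",  # hard voting = majority vote on predicted labels
)
m_star.fit(X, y)
print(m_star.predict(X[:5]))
```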
     Bagging: Bootstrap Aggregation
      ●        Analogy: Diagnosis based on multiple doctors’ majority vote.
      ●        Training
                 ○ Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement
                         from D (i.e., bootstrap).
                 ○ A classifier model Mi is trained using a training set Di.
          ● Classification: classify an unknown sample X
                  ○ Each classifier Mi returns its class prediction.
                  ○ The bagged classifier M* counts the votes and assigns the class with the most votes to X.
       Bagging: Bootstrap Aggregation
        ●      Prediction: bagging can be applied to the prediction of continuous values by averaging the k models' predictions for a given test tuple.
       ●      Effectiveness:
                ○ Often significantly better than a single classifier derived from D.
                 ○ For noisy data: not considerably worse than a single classifier, and more robust.
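A minimal bagging sketch, assuming scikit-learn (the data and parameters are illustrative); the base learner defaults to a decision tree:

```python
# Minimal bagging sketch, assuming scikit-learn. Each of the k base models
# is trained on a bootstrap sample D_i drawn with replacement from D, and
# the bagged classifier M* returns the majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1_000, random_state=0)

bag = BaggingClassifier(
    n_estimators=10,   # k models; the base learner defaults to a decision tree
    bootstrap=True,    # sample each D_i with replacement
    random_state=0,
)
bag.fit(X, y)
print(bag.predict(X[:5]))  # class with the most votes across the 10 trees
```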
     Boosting
       ●        Analogy: Consult several doctors based on a combination of weighted diagnoses—weight assigned based on
                the previous diagnosis accuracy.
          ● How does boosting work?
                  ○ Equal weights are initially assigned to each training tuple.
                  ○ A series of k classifiers is iteratively learned.
                   ○ After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi.
                  ○ The final M* combines the votes of each individual classifier, where the weight of each classifier's vote
                         is a function of its accuracy.
           ● The boosting algorithm can be extended for numeric prediction.
          ● Compared to bagging, boosting tends to have greater accuracy, but it also risks overfitting the model to
                misclassified data.
       AdaBoost
        ●       Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd), all tuple weights are initially set to the same value (1/d).
         ● Generate k classifiers in k rounds. At round i,
                 ○ Tuples from D are sampled (with replacement) to form a training set Di of the same size.
                 ○ Each tuple’s chance of being selected is based on its weight.
                 ○ A classification model Mi is derived from Di.
                 ○ Its error rate is calculated using Di as a validation set.
                 ○ If a tuple is misclassified, its weight is increased, otherwise, it is decreased.
         AdaBoost
          ●      Model error rate: err(Xj) is the misclassification indicator of tuple Xj (1 if Xj is misclassified, 0 otherwise). The error rate of classifier Mi is the sum of the weights of the misclassified tuples:

                        error(Mi) = Σj wj × err(Xj)

          ●      The weight of classifier Mi's vote is

                        log((1 − error(Mi)) / error(Mi))

          ●      The sum of the classifiers' weighted votes for each class is used to select the final class.
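A minimal AdaBoost sketch, assuming scikit-learn; note that scikit-learn re-weights tuples each round rather than resampling them, but the final weighted vote follows the same idea as above:

```python
# Minimal AdaBoost sketch, assuming scikit-learn. The number of rounds and
# the dataset are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1_000, random_state=0)

ada = AdaBoostClassifier(n_estimators=50, random_state=0)  # k = 50 rounds
ada.fit(X, y)
print(ada.estimator_weights_[:5])  # per-classifier vote weights
print(ada.predict(X[:5]))          # class with the largest weighted vote
```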
      Random Forest
     ●        Random Forest:
                ○ Each classifier in the ensemble is a decision tree classifier and is generated using a random selection of
                        attributes at each node to determine the split. Grow full trees without pruning.
                ○ During classification, each tree votes, and the most popular class is returned.
        ● Two methods to construct Random Forest:
                 ○ Forest-RI (random input selection): At each node, randomly select a few attributes (far fewer than the number of available attributes) as candidates for the split. This is the most commonly used method.
                ○ Forest-RC (random linear combinations): Used when only a few attributes are available. Creates new
                        attributes (or features) that are a random linear combination of randomly selected attributes (to reduce the
                        correlation between individual classifiers).
      ●    Comparable in accuracy to AdaBoost, but more robust to errors and outliers.
      ●    Insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting; adding more trees does not cause overfitting.
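A minimal Forest-RI-style sketch, assuming scikit-learn; max_features="sqrt" approximates the "few attributes per split" idea:

```python
# Minimal Forest-RI-style sketch, assuming scikit-learn: each tree considers
# only a random subset of attributes at every split (max_features), and the
# forest returns the most popular class across the trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,      # number of 'random' trees
    max_features="sqrt",   # few attribute candidates at each node
    random_state=0,
)
rf.fit(X, y)
print(rf.predict(X[:5]))
```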
     Classification of Class-Imbalanced Data Sets
        ●      Class-imbalance problem: rare positive examples but numerous negative ones, e.g., outlier detection, medical diagnosis, fraud detection, oil-spill detection, fault detection.
           ● Traditional methods assume a balanced class distribution and equal error costs, so they are not well suited to class-imbalanced data.
           ● Evaluation metrics developed for imbalanced data:
                   ○ Sensitivity + Specificity
                   ○ Precision + Recall
                   ○ F_β score: allows different weights for precision and recall (see the sketch below)
           ● These metrics better reflect system performance in case of class imbalance.
           ● These metrics help to tune systems to improve rare class performance.
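As sketched below, these metrics can be computed directly, assuming scikit-learn; the label vectors are illustrative, with class 1 as the rare positive class:

```python
# Minimal sketch, assuming scikit-learn: computing imbalance-aware metrics.
from sklearn.metrics import confusion_matrix, fbeta_score, precision_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate (= recall on the rare class)
specificity = tn / (tn + fp)   # true negative rate

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
print(f"precision={precision_score(y_true, y_pred):.2f}, "
      f"recall={recall_score(y_true, y_pred):.2f}")
print(f"F_2={fbeta_score(y_true, y_pred, beta=2):.2f}")  # beta > 1 favors recall
```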
     Methods for Class-Imbalanced Problem
        ●    Typical methods for imbalanced data in two-class classification:
              ○ Oversampling: re-sampling of data from the positive class. Synthetic samples similar to the originals can be generated; this works better on numerical attributes.
                       ■    SMOTE: generates a fixed number of synthetic examples for each minority sample.
                       ■    ADASYN: adaptively generates a different number of synthetic examples per minority sample, focusing on samples that are harder to learn.
              ○ Under-sampling: randomly eliminate tuples from the negative class.
              ○ Threshold-moving: moves the decision threshold, t, so that rare-class tuples are easier to classify, reducing the chance of costly false-negative errors (see the sketch below).
        ●    The class-imbalance problem remains difficult on multiclass tasks.
              ○ One approach is to convert a multiclass task into a set of binary classification tasks.
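A minimal threshold-moving sketch, assuming scikit-learn; the threshold value t = 0.2 is purely illustrative:

```python
# Minimal threshold-moving sketch, assuming scikit-learn. Instead of the
# default 0.5 cutoff, the decision threshold t is lowered so rare-class
# tuples are easier to classify as positive.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# ~5% positive class to mimic an imbalanced problem.
X, y = make_classification(n_samples=2_000, weights=[0.95, 0.05], random_state=0)

clf = LogisticRegression(max_iter=1_000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]   # estimated P(positive) per tuple

t = 0.2                              # moved threshold (default would be 0.5)
y_pred = (proba >= t).astype(int)
print(y_pred.sum(), "tuples predicted positive at t =", t)
```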
     Summary
         Here is a quick recap:
         ● We discussed various ensemble methods to improve the performance of the model, such as Bagging,
             Boosting, and Random Forest.
          ● We discussed Bagging, which can be applied to the prediction of continuous values by averaging the predictions for a given test tuple.
          ● We talked about Boosting and one example, AdaBoost, which tends to have greater accuracy but also risks overfitting the model to misclassified data.
         ● We learned the Random forest method in which each classifier is a decision tree classifier and is generated
             using a random selection of attributes at each node to determine the split.
         ● We discussed different solutions to the class-imbalanced classification problem.
        Learning Outcomes
        You should now be able to:
         • Outline Supervised vs. Unsupervised learning, Classification (model construction), and Model Evaluation.
        • Recall and summarize the Decision Tree.
        • Summarize the concepts of Entropy, Information Gain, and Gini Impurity.
        • Recall and restate the concepts of Pruning.
        • Implement Bayesian classification methods.
        • Summarize Model evaluation and selection.
         • List and explain techniques to improve classification accuracy.
                                                      Thank You