DATA MINING FOR BUSINESS
INTELLIGENCE
                 DEVIPRIYA P
                     AP
1.1 Why Data Mining?
• The Explosive Growth of Data: from terabytes (1000^4 bytes) to yottabytes (1000^8 bytes)
   – Data collection and data availability
        • Automated data collection tools, database systems, web
   – Major sources of abundant data
        • Business: Web, e-commerce, transactions, stocks, …
        • Science: bioinformatics, scientific simulation, medical research …
        • Society and everyone: news, digital cameras, …
• Data rich but information poor!
   –   What do those data mean?
   –   How to analyze data?
• Data mining — Automated analysis of massive data sets
Evolution of Database Technology
        1.2 What Is Data Mining?
• Data mining (knowledge discovery from data)
  – Extraction of interesting (non-trivial, implicit, previously unknown and
     potentially useful) patterns or knowledge from huge amounts of data
  – Data mining: a misnomer?
• Alternative names
  – Knowledge discovery (mining) in databases (KDD), knowledge
       extraction, data/pattern analysis, data archeology, data dredging,
     information harvesting, business intelligence, etc.
                          Data Mining: Concepts and Techniques                 4
                    Potential Applications
• Data analysis and decision support
   – Market analysis and management
       • Target marketing, customer relationship management (CRM),
         market basket analysis, cross selling, market segmentation
   – Risk analysis and management
       • Forecasting, customer retention, improved underwriting, quality
         control, competitive analysis
   – Fraud detection and detection of unusual patterns (outliers)
• Other Applications
   – Text mining (news group, email, documents) and Web mining
   – Stream data mining
   – Bioinformatics and bio-data analysis
         Ex.: Market Analysis and Management
•   Where does the data come from?—Credit card transactions, loyalty cards,
     discount coupons, customer complaint calls, surveys …
•   Target marketing
    –   Find clusters of “model” customers who share the same characteristics: interest,
          income level, spending habits, etc.
         •   E.g. Most customers with income level 60k – 80k with food expenses $600 - $800 a month live in that area
    –   Determine customer purchasing patterns over time
         •   E.g. Customers who are between 20 and 29 years old, with income of 20k – 29k usually buy this type of CD player
•   Cross-market analysis: find associations/correlations between product sales, and
    predict based on such associations
    –   E.g. Customers who buy computer A usually buy software B
         Ex.: Market Analysis and Management (2)
•   Customer requirement analysis
    –   Identify the best products for different customers
    –   Predict what factors will attract new customers
•   Provision of summary information
    –   Multidimensional summary reports
         •   E.g. Summarize all transactions of the first quarter from three different branches
                     Summarize all transactions of last year from a particular branch
                     Summarize all transactions of a particular product
    –   Statistical summary information
         •   E.g. What is the average age for customers who buy product A?
• Fraud detection
    –   Find outliers of unusual transactions
• Financial planning
    –   Summarize and compare the resources and spending
Knowledge Discovery (KDD) Process
           KDD Process: Several Key Steps
• Learning the application domain
   –   relevant prior knowledge and goals of application
• Identifying a target data set: data selection
• Data processing
   –   Data cleaning (remove noise and inconsistent data)
   –   Data integration (multiple data sources may be combined)
   –   Data selection (data relevant to the analysis task are retrieved from database)
   –   Data transformation (data transformed or consolidated into forms appropriate for mining)
         (Done with data preprocessing)
   –   Data mining (an essential process where intelligent methods are applied to extract
         data patterns)
   –   Pattern evaluation (identify the truly interesting patterns)
   –   Knowledge presentation (mined knowledge is presented to the user with
         visualization or representation techniques)
• Use of discovered knowledge
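The steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the two record lists, the field names, the age cut-off of 30, and the 200.0 threshold are all hypothetical, and the "mining" step is a toy average rather than a real mining algorithm.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [
    {"customer": "c1", "age": 25, "amount": 120.0},
    {"customer": "c2", "age": None, "amount": 80.0},  # noisy: missing age
]
store_b = [
    {"customer": "c3", "age": 41, "amount": 300.0},
    {"customer": "c4", "age": 29, "amount": 95.0},
]

def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: discretize age into two coarse groups."""
    return [
        {"age_group": "under_30" if r["age"] < 30 else "30_plus",
         "amount": r["amount"]}
        for r in records
    ]

def mine(records):
    """Data mining (toy stand-in): average spending per age group."""
    totals, counts = Counter(), Counter()
    for r in records:
        totals[r["age_group"]] += r["amount"]
        counts[r["age_group"]] += 1
    return {group: totals[group] / counts[group] for group in totals}

def evaluate(patterns, min_average):
    """Pattern evaluation: keep only groups at or above a threshold."""
    return {g: avg for g, avg in patterns.items() if avg >= min_average}

data = integrate(clean(store_a), clean(store_b))
data = transform(select(data, ["age", "amount"]))
patterns = mine(data)
interesting = evaluate(patterns, min_average=200.0)

# Knowledge presentation: here, simply print the results.
print(patterns)
print(interesting)
```

Each function maps to one named step, so a step can be swapped out (for example, a different cleaning policy) without touching the others.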
            Data Mining and Business Intelligence
• Layers, from bottom to top, with increasing potential to support business decisions
  (the role typically working at each layer is in parentheses):
   – Data Sources: paper, files, Web documents, scientific experiments, database systems (DBA)
   – Data Preprocessing/Integration, Data Warehouses (DBA)
   – Data Exploration: statistical summary, querying, and reporting (Data Analyst)
   – Data Mining: information discovery (Data Analyst)
   – Data Presentation: visualization techniques (Business Analyst)
   – Decision Making (End User)
       A typical DM System Architecture
• Database, data warehouse, WWW or other information
   repository (store data)
• Database or data warehouse server (fetch and
   combine data)
• Knowledge base (turn data into meaningful groups
   according to domain knowledge)
• Data mining engine (perform mining tasks)
• Pattern evaluation module (find interesting patterns)
• User interface (interact with the user)
A typical DM System Architecture (2)
            1.3 On What Kinds of Data?
• Database-oriented data sets and applications
  – Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
  – Object-Relational Databases
  – Temporal Databases, Sequence Databases, Time-Series
     databases
  – Spatial Databases and Spatiotemporal Databases
  – Text databases and Multimedia databases
  – Data Streams
  – The World-Wide Web
                Relational Databases
• DBMS – database management system, contains a collection of
  interrelated databases
   e.g. Faculty database, student database, publications database
• Each database contains a collection of tables and functions to
  manage and access the data.
   e.g. student_bio, student_graduation, student_parking
• Each table contains columns and rows, with columns as
  attributes of data and rows as records.
• Tables can be used to represent the relationships between
  or among multiple tables.
Relational Databases (2) – AllElectronics store
               Relational Databases (3)
• With a relational query language, e.g. SQL, we will be able to find
  answers to questions such as:
  – How many items were sold last year?
  – Who has earned commissions higher than 10%?
  – What were the total sales of Dell laptops last month?
• When data mining is applied to relational databases, we
  can search for trends or data patterns.
• Relational databases are one of the most commonly available and
  rich information repositories, and thus are a major data
  form in our study.
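Questions like the ones above translate directly into SQL aggregates. A minimal sketch using Python's standard sqlite3 module follows; the sales table, its columns, and the rows are hypothetical and are not the AllElectronics schema.

```python
import sqlite3

# In-memory database with a hypothetical "sales" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "item TEXT, brand TEXT, year INTEGER, quantity INTEGER, price REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        ("laptop", "Dell", 2023, 3, 900.0),
        ("laptop", "Dell", 2024, 2, 950.0),
        ("mouse", "Acme", 2023, 10, 20.0),
    ],
)

# "How many items were sold in a given year?"
items_sold_2023 = conn.execute(
    "SELECT SUM(quantity) FROM sales WHERE year = ?", (2023,)
).fetchone()[0]

# "What were the total sales of Dell laptops?" (over all years here)
dell_laptop_total = conn.execute(
    "SELECT SUM(quantity * price) FROM sales WHERE brand = ? AND item = ?",
    ("Dell", "laptop"),
).fetchone()[0]

print(items_sold_2023, dell_laptop_total)
conn.close()
```

A query answers a question the user already knows how to ask; data mining on the same tables looks for trends or patterns the user did not specify in advance.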
                     Data Warehouses
• A repository of information collected from multiple sources, stored
   under a unified schema, and that usually resides at a single site.
• Constructed via a process of data cleaning, data integration, data
   transformation, data loading and periodic data refreshing.
Data Warehouses (3)
           OLAP OPERATIONS
• OLAP operations include drill-down and roll-up,
  which allow the user to view the data at
  differing degrees of summarization.
• For instance, we can drill down on sales data
  summarized by quarter to see data
  summarized by month.
• Similarly, we can roll up on sales data
  summarized by city to view data summarized
  by country
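Roll-up and drill-down can be pictured as re-aggregating the same facts at a coarser or finer level of a hierarchy. The sketch below assumes a hypothetical list of sales facts with the hierarchies city -> country and month -> quarter; a real OLAP server would work on a precomputed data cube instead.

```python
from collections import defaultdict

# Hypothetical sales facts: (country, city, quarter, month, amount).
sales = [
    ("Canada", "Vancouver", "Q1", "Jan", 100),
    ("Canada", "Vancouver", "Q1", "Feb", 150),
    ("Canada", "Toronto", "Q1", "Jan", 200),
    ("USA", "Chicago", "Q1", "Mar", 300),
]

# Position of each level inside a fact tuple.
COLUMNS = {"country": 0, "city": 1, "quarter": 2, "month": 3}

def summarize(facts, level):
    """Total the amount at one level of a dimension hierarchy."""
    totals = defaultdict(int)
    for row in facts:
        totals[row[COLUMNS[level]]] += row[4]
    return dict(totals)

by_city = summarize(sales, "city")
by_country = summarize(sales, "country")  # roll up: city -> country
by_quarter = summarize(sales, "quarter")
by_month = summarize(sales, "month")      # drill down: quarter -> month

print(by_city, by_country)
print(by_quarter, by_month)
```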
                 Transactional Databases
• Consists of a file where each record represents a transaction
• A transaction typically includes a unique transaction ID and a list of the
  items making up the transaction.
• Either stored in a flat file or unfolded into relational tables
• Easy to identify items that are frequently sold together
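To illustrate the last point, the sketch below counts how often each pair of items appears in the same transaction. The transaction IDs, the items, and the minimum count of 2 are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactional database: transaction ID -> list of items.
transactions = {
    "T100": ["bread", "milk", "butter"],
    "T200": ["bread", "milk"],
    "T300": ["milk", "eggs"],
    "T400": ["bread", "milk", "eggs"],
}

# Count every unordered pair of distinct items within each transaction.
pair_counts = Counter()
for items in transactions.values():
    for pair in combinations(sorted(set(items)), 2):
        pair_counts[pair] += 1

# Keep pairs that occur in at least min_count transactions.
min_count = 2
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_count}
print(frequent_pairs)
```

Sorting the items before forming pairs ensures that ("bread", "milk") and ("milk", "bread") are counted as the same pair.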
                       1.4 Data Mining Functionalities
                           - What kinds of patterns can be mined?
• Concept/Class Description:
  Characterization and
  Discrimination
• Mining Frequent Patterns,
  Associations and Correlations
  – Frequent patterns, frequent
    subsequences, frequent
    substructures
   •   Association Analysis: find frequent
       patterns
   •   Correlation Analysis: additional analysis
       to find statistical correlations between
       associated pairs
   •   https://towardsdatascience.com/frequent-pattern-mining-association-and-correlations-8fa9f80c22ef
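Association and correlation analysis are commonly quantified with support, confidence, and lift. The slides do not name these measures, so treat the sketch below as one common choice rather than the method intended here; the four transactions are hypothetical.

```python
# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent, db) / support(antecedent, db)

def lift(antecedent, consequent, db):
    """Confidence relative to the baseline frequency of the consequent.
    Values above 1 suggest a positive correlation."""
    return confidence(antecedent, consequent, db) / support(consequent, db)

a, b = {"computer"}, {"software"}
rule_support = support(a | b, transactions)
rule_confidence = confidence(a, b, transactions)
rule_lift = lift(a, b, transactions)
print(rule_support, rule_confidence, rule_lift)
```

Support and confidence describe the association rule "computer => software"; lift is the additional correlation check on the associated pair.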
• Classification and Prediction
  – Classification
      •   The process of finding a model that describes and distinguishes the data classes or
          concepts, for the purpose of being able to use the model to predict the class of
            objects whose class label is unknown.
      •   The derived model is based on the analysis of a set of training data (data objects
          whose class label is known).
      •   The model can be represented in classification (IF-THEN) rules, decision trees,
            neural networks, etc.
  – Prediction
      •   Predict missing or unavailable numerical data values
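As a small illustration of classification, the sketch below learns a single IF-THEN rule from labeled training data and then applies it to objects whose class is unknown. The income values, the labels, and the one-threshold rule form are hypothetical simplifications; real classifiers such as decision trees combine many such tests.

```python
# Hypothetical training data: (income in thousands, class label).
training = [
    (20, "no"), (25, "no"), (35, "no"),
    (40, "yes"), (55, "yes"), (60, "yes"),
]

def classify(income, threshold):
    """Apply the rule: IF income >= threshold THEN "yes" ELSE "no"."""
    return "yes" if income >= threshold else "no"

def learn_threshold_rule(examples):
    """Try each observed income as a threshold and keep the one that
    misclassifies the fewest training examples."""
    best_threshold, best_errors = None, len(examples) + 1
    for threshold in sorted({x for x, _ in examples}):
        errors = sum(classify(x, threshold) != label for x, label in examples)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

threshold = learn_threshold_rule(training)
print(f"IF income >= {threshold} THEN yes ELSE no")

# Predict the class of objects whose label is unknown.
print(classify(50, threshold), classify(30, threshold))
```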
• Cluster Analysis
  – Class label is unknown: group data to form new classes
  – Clusters of objects are formed based on the principle of maximizing
     intraclass similarity and minimizing interclass similarity
      •   E.g. Identify homogeneous subpopulations of customers. These clusters may
             represent individual target groups for marketing.
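One widely used clustering method is k-means, sketched below on one-dimensional data. The slides do not name an algorithm, so k-means is an assumption made for illustration, as are the monthly spending values and the two starting centers.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on 1-D data: assign each value to its nearest
    center, then move each center to the mean of its cluster."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical monthly spending of six customers; no class labels given.
spending = [610, 640, 790, 2100, 2250, 2400]
centers, clusters = kmeans_1d(spending, centers=[600.0, 2500.0])
print(centers)
print(clusters)
```

Values inside a cluster are close to each other (high intraclass similarity), while the two cluster centers are far apart (low interclass similarity).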
                 Data Mining Functionalities (2)
• Outlier Analysis
   – Data that do not comply with the general behavior or model.
   – Outliers are usually discarded as noise or exceptions.
   – Useful for fraud detection.
       •   E.g. Detect purchases of extremely large amounts
• Evolution Analysis
   – Describes and models regularities or trends for objects whose
        behavior changes over time.
       •   E.g. Identify stock evolution regularities for overall stocks and for the stocks of
              particular companies.
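For the outlier example above (purchases of extremely large amounts), one simple approach is to flag values that lie far from the median. The sketch below uses the median absolute deviation (MAD); the purchase amounts and the cut-off k = 5 are hypothetical, and this is only one of many possible outlier tests.

```python
from statistics import median

def find_outliers(amounts, k=5.0):
    """Flag amounts whose distance from the median exceeds k times the
    median absolute deviation (MAD). If the MAD is 0, every amount
    that differs from the median is flagged."""
    center = median(amounts)
    mad = median([abs(a - center) for a in amounts])
    return [a for a in amounts if abs(a - center) > k * mad]

# Hypothetical purchase amounts for one credit card.
purchases = [42.0, 55.0, 38.0, 61.0, 47.0, 5200.0, 50.0]
outliers = find_outliers(purchases)
print(outliers)
```

The median is used instead of the mean because a single extreme purchase would pull the mean (and the standard deviation) toward itself and could hide the outlier.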
 1.5 Which Technologies Are Used?
• Statistics
   – Statistics studies the collection, analysis, interpretation or explanation,
     and presentation of data. Data mining has an inherent connection with statistics.
• Machine Learning
   – Supervised learning
       • Supervised learning is basically a synonym for classification.
   – Unsupervised learning
       • Unsupervised learning is essentially a synonym for clustering.
   – Semi-supervised learning
       • Semi-supervised learning is a class of machine learning techniques that make
         use of both labeled and unlabeled examples when learning a model.
   – Active learning
       • Active learning is a machine learning approach that lets users play an
         active role in the learning process.
• Database Systems and Data Warehouses
• Information Retrieval
   – Information retrieval (IR) is the science of searching for documents or
     information in documents. Documents can be text or multimedia, and
     may reside on the Web.
Data mining adopts techniques from many domains
  1.6 Which Kinds of Applications Are Targeted?
• Business Intelligence
  – It is critical for businesses to acquire a better understanding of the
    commercial context of their organization, such as their customers, the
    market, supply and resources, and competitors.
  – Business intelligence (BI) technologies provide historical, current, and
    predictive views of business operations.
• Web Search Engines
  – A Web search engine is a specialized computer server that searches
    for information on the Web. The search results of a user query are
    often returned as a list (sometimes called hits).
  – The hits may consist of web pages, images, and other types of files.
DATA MINING PROCESS
• Business Understanding
• Data Understanding
• Data Preparation
 –   Data Collection / Selection
 –   Data Cleaning
 –   Data Transformation
 –   Data Reduction
• Model Building
      1.7 Major Issues in Data Mining
• Data mining is a dynamic and fast-expanding field
  with great strengths.
• The major issues in data mining research can be
  partitioned into five groups:
  –   Mining methodology,
  –   User interaction,
  –   Efficiency and scalability,
  –   Diversity of data types, and
  –   Data mining and society
• Mining Methodology
  –   Mining various and new kinds of knowledge
  –   Mining knowledge in multidimensional space
  –   Data mining—an interdisciplinary effort
  –   Boosting the power of discovery in a networked environment
  –   Handling uncertainty, noise, or incompleteness of data
  –   Pattern evaluation and pattern- or constraint-guided mining
• User Interaction
  –   Interactive mining
  –   Incorporation of background knowledge
  –   Ad hoc data mining and data mining query languages
  –   Presentation and visualization of data mining results
• Efficiency and Scalability
  – Efficiency and scalability of data mining algorithms
  – Parallel, distributed, and incremental mining algorithms
• Diversity of Database Types
  – Handling complex types of data
  – Mining dynamic, networked, and global data repositories
• Data Mining and Society
  – Social impacts of data mining
  – Privacy-preserving data mining
  – Invisible data mining