Data Mining: Concepts and Techniques
Chapter 1
            Chapter 1. Introduction
   Why Data Mining?
   What Is Data Mining?
   A Multi-Dimensional View of Data Mining
   What Kind of Data Can Be Mined?
   What Kinds of Patterns Can Be Mined?
   What Technologies Are Used?
   What Kinds of Applications Are Targeted?
   Major Issues in Data Mining
   A Brief History of Data Mining and Data Mining Society
   Summary
                       Why Data Mining?
        Evolution of Database Technology
   1960s:
       Data collection, database creation, IMS and network DBMS
   1970s:
       Relational data model, relational DBMS implementation
   1980s:
       RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
       Application-oriented DBMS (spatial, scientific, engineering, etc.)
   1990s:
       Data mining, data warehousing, multimedia databases, and Web
        databases
   2000s:
       Stream data management and mining
       Data mining and its applications
       Web technology (XML, data integration) and global information systems
                What Is Data Mining?
       Knowledge Discovery (KDD) Process
    This is a view from typical database systems and data
     warehousing communities
    Data mining plays an essential role in the knowledge
     discovery process
    [Figure: KDD process. Databases → Data Cleaning and Data Integration →
     Data Warehouse → Selection → Task-relevant Data → Data Mining →
     Pattern Evaluation]
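The stages named on this slide (cleaning, integration, selection, data mining, pattern evaluation) can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the function names, the record layout, and the sample data are hypothetical and do not come from the slides, and the "mining" step is reduced to simple frequency counting.

```python
from collections import Counter

def integrate(*sources):
    """Data integration: merge several sources into one list of records."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged

def clean(records):
    """Data cleaning: drop records that have a missing field."""
    return [r for r in records if all(v is not None for v in r.values())]

def select(records, city):
    """Selection: keep only the task-relevant records."""
    return [r for r in records if r["city"] == city]

def mine(records):
    """Data mining (toy version): count how often each item occurs."""
    return Counter(r["item"] for r in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep only patterns that occur often enough."""
    return {item: n for item, n in patterns.items() if n >= min_count}

# Hypothetical sample data from two source databases.
store_a = [{"city": "Urbana", "item": "milk"},
           {"city": "Urbana", "item": None}]
store_b = [{"city": "Urbana", "item": "milk"},
           {"city": "Chicago", "item": "beer"},
           {"city": "Urbana", "item": "bread"}]

warehouse = clean(integrate(store_a, store_b))
relevant = select(warehouse, "Urbana")
patterns = evaluate(mine(relevant), min_count=2)
print(patterns)  # {'milk': 2}
```

Each function consumes the output of the previous one, which mirrors the bottom-to-top flow of the figure.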
     Example: A Web Mining Framework
 Data Mining in Business Intelligence
    [Figure: layered diagram; the potential to support business decisions
     increases from bottom to top. Surviving layers, bottom to top:
     Statistical Summary, Querying, and Reporting; Data Exploration;
     Decision Making (End User)]
    Multi-Dimensional View of Data Mining
   Data to be mined
      Database data (extended-relational, object-oriented, heterogeneous,
   Techniques utilized
      Data-intensive, data warehouse (OLAP), machine learning, statistics,
Data Mining Function: (1) Generalization
   Information integration and data warehouse construction
       Data cleaning, transformation, integration, and
        multidimensional data model
   Data cube technology
       Scalable methods for computing (i.e., materializing)
        multidimensional aggregates
       OLAP (online analytical processing)
   Multidimensional concept description: Characterization
    and discrimination
       Generalize, summarize, and contrast data
        characteristics, e.g., dry vs. wet regions
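To make "materializing multidimensional aggregates" concrete, the sketch below computes a SUM for every subset of the dimensions, that is, every cuboid of a small data cube. It is a naive illustration, not one of the scalable methods the slide refers to; the function name `compute_cube` and the sample sales records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def compute_cube(rows, dimensions, measure):
    """Materialize SUM(measure) for every subset of the dimensions."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            cuboid = defaultdict(int)
            for row in rows:
                # Group key: this row's values for the chosen dimensions.
                key = tuple(row[d] for d in dims)
                cuboid[key] += row[measure]
            cube[dims] = dict(cuboid)
    return cube

# Hypothetical fact table with two dimensions and one measure.
sales = [
    {"region": "dry", "quarter": "Q1", "units": 10},
    {"region": "dry", "quarter": "Q2", "units": 5},
    {"region": "wet", "quarter": "Q1", "units": 7},
]

cube = compute_cube(sales, ("region", "quarter"), "units")
print(cube[()])            # {(): 22}
print(cube[("region",)])   # {('dry',): 15, ('wet',): 7}
print(cube[("quarter",)])  # {('Q1',): 17, ('Q2',): 5}
```

With n dimensions this produces 2^n cuboids and rescans the data for each one, which is why scalable cube computation is a research topic in its own right.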
Data Mining Function: (2) Association and Correlation Analysis
   Frequent patterns (or frequent itemsets)
       What items are frequently purchased together in your
        Walmart?
   Association, correlation vs. causality
       A typical association rule
            Diaper → Beer [0.5%, 75%] (support, confidence)
       Are strongly associated items also strongly correlated?
   How to mine such patterns and rules efficiently in large
    datasets?
   How to use such patterns for classification, clustering,
    and other applications?
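The two numbers attached to a rule such as Diaper → Beer can be computed directly from a list of transactions. The sketch below is a minimal illustration: the transactions are made up (so the support differs from the 0.5% on the slide), and `lift` is included as one commonly used correlation measure, which the slide itself does not name.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence divided by the consequent's support.

    A value near 1.0 suggests the two sides are independent;
    a value above 1.0 suggests a positive correlation.
    """
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))

# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "beer", "bread"},
    {"diaper", "milk"},
    {"beer"},
    {"milk", "bread"},
    {"bread"},
    {"milk"},
]

print(support(transactions, {"diaper", "beer"}))       # 0.375
print(confidence(transactions, {"diaper"}, {"beer"}))  # 0.75
print(lift(transactions, {"diaper"}, {"beer"}))        # 1.5
```

Here the rule holds in 3 of the 4 transactions that contain diapers, giving 75% confidence. The lift of 1.5 shows that a rule can be checked for correlation separately from its confidence.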
Data Mining Function: (3) Classification
Data Mining Function: (4) Cluster Analysis
Data Mining Function: (5) Outlier Analysis
   Outlier analysis
        Outlier: A data object that does not comply with the general
         behavior of the data
        Noise or exception? One person’s garbage could be another
         person’s treasure
        Methods: by-product of clustering or regression analysis, …
        Useful in fraud detection, rare events analysis
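As one possible reading of "by-product of regression analysis", the sketch below fits a least-squares line and flags points whose residual is unusually large. The threshold of two standard deviations, the function names, and the sample data are assumptions made for illustration, not a method prescribed by the slides.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def residual_outliers(xs, ys, threshold=2.0):
    """Indices of points whose residual exceeds threshold * residual spread."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * spread]

# Hypothetical data: y = 2x + 1 everywhere except at index 5.
xs = list(range(10))
ys = [1, 3, 5, 7, 9, 40, 13, 15, 17, 19]

print(residual_outliers(xs, ys))  # [5]
```

Whether index 5 is noise to discard or an exception worth investigating (for example, a fraudulent transaction) depends on the application.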
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
   Sequence, trend and evolution analysis
      Trend, time-series, and deviation analysis: e.g., regression
          cards
      Periodicity analysis
      Similarity-based analysis
       Structure and Network Analysis
   Graph mining
      Finding frequent subgraphs (e.g., chemical compounds), trees
            family, classmates, …
      Links carry a lot of semantic information: Link mining
   Web mining
      Web is a big information network: from PageRank to Google
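The slide only names PageRank, so the following is a textbook-style power-iteration sketch rather than a description of any production system. The damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the four-page link graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    links maps each page to the list of pages it links to;
    every link target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Split this page's rank evenly among its link targets.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: page C is linked to by A, B, and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
print(best)  # C
```

The scores always sum to 1, and a page that nobody links to (D here) keeps only the baseline share.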
             Evaluation of Knowledge
   Is all mined knowledge interesting?
       One can mine a tremendous amount of “patterns” and knowledge
       Some may fit only a certain dimension space (time, location, …)
       Some may not be representative, may be transient, …
   Evaluation of mined knowledge → directly mine only
    interesting knowledge?
       Descriptive vs. predictive
       Coverage
       Typicality vs. novelty
       Accuracy
       Timeliness
       …
Data Mining: Confluence of Multiple Disciplines
Why Confluence of Multiple Disciplines?
   Tremendous amount of data
       Algorithms must be highly scalable to handle terabytes of data
   High-dimensionality of data
       Microarray data may have tens of thousands of dimensions
   High complexity of data
       Data streams and sensor data
       Time-series data, temporal data, sequence data
       Structured data, graphs, social networks, and multi-linked data
       Heterogeneous databases and legacy databases
       Spatial, spatiotemporal, multimedia, text and Web data
       Software programs, scientific simulations
   New and sophisticated applications
           Major Issues in Data Mining (1)
   Mining Methodology
        Mining various and new kinds of knowledge
        Mining knowledge in multi-dimensional space
        Data mining: An interdisciplinary effort
        Boosting the power of discovery in a networked environment
        Handling noise, uncertainty, and incompleteness of data
        Pattern evaluation and pattern- or constraint-guided mining
   User Interaction
        Interactive mining
        Incorporation of background knowledge
        Presentation and visualization of data mining results
           Major Issues in Data Mining (2)
         A Brief History of Data Mining Society
Where to Find References? DBLP, CiteSeer, Google
                              Summary
   Data mining: Discovering interesting patterns and knowledge from
    massive amounts of data
   A natural evolution of database technology, in great demand, with
    wide applications
   A KDD process includes data cleaning, data integration, data
    selection, transformation, data mining, pattern evaluation, and
    knowledge presentation
   Mining can be performed on a variety of data
   Data mining functionalities: characterization, discrimination,
    association, classification, clustering, outlier and trend analysis, etc.
   Data mining technologies and applications
   Major issues in data mining
         Recommended Reference Books
   S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan
    Kaufmann, 2002
   R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. 2nd ed., Wiley-Interscience, 2000
   T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
   U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and
    Data Mining. AAAI/MIT Press, 1996
   U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge
    Discovery. Morgan Kaufmann, 2001
   J. Han and M. Kamber. Data Mining: Concepts and Techniques. 3rd ed., Morgan Kaufmann, 2011
   D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001
   T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference,
    and Prediction. 2nd ed., Springer-Verlag, 2009
   B. Liu. Web Data Mining. Springer, 2006
   T. M. Mitchell. Machine Learning. McGraw Hill, 1997
   G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
   P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Wiley, 2005
   S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998
   I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java
    Implementations. 2nd ed., Morgan Kaufmann, 2005