DATA MINING
LECTURE 1
Introduction
                                                                         2
What is data mining?
 Data mining is the use of efficient techniques for
 the analysis of very large collections of data and the
 extraction of useful and possibly unexpected
 patterns in data.
• Data mining (knowledge discovery from data)
  • Extraction of interesting (non-trivial, implicit, previously unknown and
    potentially useful) patterns or knowledge from huge amount of data
                                                                      3
Why do we need data mining?
• Really, huge amounts of raw data!!
  • In the digital age, TB of data is generated by the second
     • Mobile devices, digital photographs, web documents.
     • Facebook updates, Tweets, Blogs
     • Transactions, sensor data, Queries, clicks, browsing
  • Cheap storage has made possible to maintain this data
  • Major sources of abundant data
     • Business: Web, e-commerce, transactions, stocks, …
     • Science: Remote sensing, bioinformatics, scientific simulation, …
     • Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”—Data mining—Automated
 analysis of massive data sets
• Need to analyze the raw data to extract knowledge
                                                 4
The data is also very complex
• Multiple types of data: tables, time series,
 images, graphs, etc
• Spatial and temporal aspects
• spatial data provides the information that
 identifies the location of features and boundaries
 on Earth.
                                                      5
Example: transaction data
• Billions of real-life customers:
  • WALMART: 20M transactions per day
  • AT&T 300 M calls per day
  • Credit card companies: billions of transactions per day.
                                                 6
Example: document data
• Web as a document repository: estimated 50
 billions of web pages
• Wikipedia: 4 million articles (and counting)
• Online news portals: steady stream ))تدفق مستمرof
 100’s of new articles every day
• Twitter: ~300 million tweets every day
                                                7
Example: network data
• Web: 50 billion pages linked via hyperlinks
• Facebook: 500 million users
• Twitter: 300 million users
• Instant messenger: ~1billion users
• Blogs: 250 million blogs worldwide, presidential
 candidates run blogs
                                                        8
Example: environmental data
• Climate data (just an example)
http://www.ncdc.gov/oa/climate/ghcn-monthly/index.php
• “a database of temperature, precipitation and
 pressure records managed by the National Climatic
 Data Center, Arizona State University and the Carbon
 Dioxide Information Analysis Center”
• “6000 temperature stations, 7500 precipitation
 stations, 2000 pressure stations”
  • Spatiotemporal data
                                                            9
Behavioral data
• Mobile phones today record a large amount of information
 about the user behavior
  •   GPS records position
  •   Camera produces images
  •   Communication via phone and SMS
  •   Text via facebook updates
• Amazon collects all the items that you browsed, placed into
 your basket, read reviews about, purchased.
• Google record all your browsing activity via toolbar plugins.
 They also record the queries you asked, the pages you saw
 and the clicks you did.
• Data collected for millions of users on a daily basis
                                                                  10
Types of Attributes
• There are different types of attributes
   • Categorical
     •   Examples: eye color, id number, rankings (e.g, good, fair,
         bad), height in {tall, medium, short}
     •   Nominal (no order) vs Ordinal (order)
     •   Nominal and Ordinal are collectively referred to as Categorical
         or qualitative attributes
   • Numeric(quantitative)
      • Examples: dates, temperature, time, length, value.
      • Interval
      • ratio
                                                                     11
    Types of Attributes
An independent way of distinguishing between attributes is by the
number of values they can take:
• Discrete A discrete attribute has a finite or countably infinite set
  of values.
• Such attributes can be categorical, such as zip codes or ID
  numbers, or numeric, such as counts.
• Discrete attributes are often represented using integer variables.
• Binary attributes are a special case of discrete attributes and assume
  only two values, e.g., true/false, yes/no,male/female, or 0/1.
• Continuous A continuous attribute is one whose values are real numbers.
• Examples include attributes such as temperature, height, or weight.
• Continuous attributes are typically represented as floating-point variables.
12
                                                      13
Record Data
• Much data mining work assumes that the data set is a
 collection of records(data objects), each of which consists
 of a fixed set of data fields (attributes).
                                                           14
Categorical Data
• Data that consists of a collection of records, each
 of which consists of a fixed set of categorical
 attributes
                      Tid Refund Marital    Taxable
                                 Status     Income Cheat
                      1    Yes    Single    High     No
                      2    No     Married   Medium   No
                      3    No     Single    Low      No
                      4    Yes    Married   High     No
                      5    No     Divorced Medium    Yes
                      6    No     Married   Low      No
                      7    Yes    Divorced High      No
                      8    No     Single    Medium   Yes
                      9    No     Married   Medium   No
                      10   No     Single    Medium   Yes
                 10
                                                                                              15
Document Data
• Each document becomes a `term' vector,
  • each term is a component (attribute) of the vector,
  • the value of each component is the number of times the
    corresponding term occurs in the document.
  • Bag-of-words representation – no ordering
                                                                           timeout
                                                                                     season
                           coach
                                                        game
                                                score
                    team
                                         ball
                                                                    lost
                                   pla
                                                               wi
                                                               n
                                    y
       Document 1   3       0       5    0       2      6      0    2       0         2
       Document 2   0       7       0    2       1      0      0    3       0         0
       Document 3   0       1       0    0       1      2      2    0       3         0
                                                 16
Transaction Data
• Each record (transaction) is a set of items.
         TID   Items
         1     Bread, Coke, Milk
         2     Beer, Bread
         3     Beer, Coke, Diaper, Milk
         4     Beer, Bread, Diaper, Milk
         5     Coke, Diaper, Milk
• A document can also be represented as a set of
 words (no counts)
                                   17
Ordered Data
• Genomic sequence data
         GGTTCCGCCTTCAGCCCCGCGCC
         CGCAGGGCCCGCCCCGCGCCGTC
         GAGAAGGGCCCGCCTGGCGGGCG
         GGGGGAGGCGGGGCCGCCCGAGC
         CCAACCGAGTCCGACCAGGTGCC
         CCCTCTGCTCGGCCTAGACCTGA
         GCTCATTAGGCGGCAGCGGACAG
         GCCAAGTAGAACACGCGAAGCGC
         TGGGCTGCCTGCTGCGACCAGGG
                                                    18
Ordered Data
• Time series
  • Sequence of ordered (over “time”) numeric values.
                                                                 19
Graph Data
• Examples: Web graph and HTML Links
                      <a href="papers/papers.html#bbbb">
                      Data Mining </a>
                      <li>
       2              <a href="papers/papers.html#aaaa">
                      Graph Partitioning </a>
                      <li>
 5              1     <a href="papers/papers.html#aaaa">
                      Parallel Solution of Sparse Linear System of Equations </a>
                      <li>
        2             <a href="papers/papers.html#ffff">
                      N-Body Computation and Dense Linear System Solvers
            5
                                                                      Attributes
 So, what is Data?
                                                           Tid Refund Marital     Taxable
• Collection of data objects and                                      Status      Income Cheat
 their attributes                                          1    Yes     Single    125K   No
                                                           2    No      Married   100K   No
• An attribute is a property or                            3    No      Single    70K    No
 characteristic of an object                               4    Yes     Married   120K   No
  • Examples: eye color of a person,                       5    No      Divorced 95K     Yes
    temperature, etc.
                                          Objects
                                                           6    No      Married   60K    No
  • Attribute is also known as                             7    Yes     Divorced 220K    No
    variable, field, characteristic, or                    8    No      Single    85K    Yes
    feature, Dimension                                     9    No      Married   75K    No
• A collection of attributes describe                      10   No      Single    90K    Yes
 an object
                                                      10
  • Object is also known as record,
                                                Size: Number of objects
    point, case, sample, entity, or
                                                Dimensionality: Number of attributes
    instance
                                                                                 21
Data Mining tasks
      • Data Mining Tasks
Data mining tasks are generally divided into two major categories:
• Predictive tasks. The objective of these tasks is to predict the value of a particular
attribute based on the values of other attributes. The attribute
to be predicted is commonly known as the target or dependent variable,
while the attributes used for making the prediction are known as
the explanatory or independent variables.
• Descriptive tasks. Here, the objective is to derive patterns (correlations,
trends, clusters, and anomalies) that summarize the underlying
relationships in data.
                    22
Data Mining tasks
                    23
Data Mining tasks
                                                                         24
Frequent Itemsets and Association Rules
• Given a set of records each of which contain some
    number of items from a given collection;
     • Identify sets of items (itemsets) occurring frequently
       together
     • Produce dependency rules which will predict
       occurrence of an item based on occurrences of other
       items.
                                              Itemsets Discovered:
TID     Items                                    {Milk,Coke}
1       Bread, Coke, Milk                        {Diaper, Milk}
2       Beer, Bread
3       Beer, Coke, Diaper, Milk              Rules Discovered:
4       Beer, Bread, Diaper, Milk                {Milk} --> {Coke}
5       Coke, Diaper, Milk                       {Diaper, Milk} --> {Beer}
       Tan, M. Steinbach and V. Kumar, Introduction to Data Mining
                                                                  25
Clustering Definition
• Given a set of data points, each having a set of
  attributes, and a similarity measure among them,
  find clusters such that
  • Data points in one cluster are more similar to one
    another.
  • Data points in separate clusters are less similar to
    one another.
    Tan, M. Steinbach and V. Kumar, Introduction to Data Mining
                                                                   26
Clustering: Application
• Document Clustering:
  • Goal: To find groups of documents that are similar to
    each other based on the important terms appearing in
    them.
  • Approach: To identify frequently occurring terms in
    each document. Form a similarity measure based on
    the frequencies of different terms. Use it to cluster.
  • Gain: Information Retrieval can utilize the clusters to
    relate a new document or search term to clustered
    documents.
     Tan, M. Steinbach and V. Kumar, Introduction to Data Mining
                                                      27
Classification: Definition
• Given a collection of records (training set )
   • Each record contains a set of attributes, one of the
     attributes is the class.
• Find a model for class attribute as a function
  of the values of other attributes.
• Goal: previously unseen records should be
  assigned a class as accurately as possible.
   • A test set is used to determine the accuracy of the
     model. Usually, the given data set is divided into
     training and test sets, with training set used to build
     the model and test set used to validate it.
                                                                                        28
          Classification Example
     Tid Refund Marital     Taxable                   Refund Marital     Taxable
                Status      Income Cheat                     Status      Income Cheat
     1    Yes     Single    125K   No                 No      Single     75K    ?
     2    No      Married   100K   No                 Yes     Married    50K    ?
     3    No      Single    70K    No                 No      Married    150K   ?
     4    Yes     Married   120K   No                 Yes     Divorced 90K      ?
     5    No      Divorced 95K     Yes                No      Single     40K    ?
     6    No      Married   60K    No                 No      Married    80K    ?            Test
                                                 10
                                                                                             Set
     7    Yes     Divorced 220K    No
     8    No      Single    85K    Yes
     9    No      Married   75K    No                                    Learn
                                              Training
     10   No      Single    90K    Yes                                                   Model
10
                                                Set                     Classifier
                Tan, M. Steinbach and V. Kumar, Introduction to Data Mining
                                                               29
Classification: Application 1
• Ad Click Prediction
   • Goal: Predict if a user that visits a web page will click
     on a displayed ad. Use it to target users with high
     click probability.
   • Approach:
      • Collect data for users over a period of time and record who
        clicks and who does not. The {click, no click} information
        forms the class attribute.
      • Use the history of the user (web pages browsed, queries
        issued) as the features.
      • Learn a classifier model and test on new users.
                                                                             30
Connections of Data Mining with other
areas
• Draws ideas from machine learning/AI, pattern
  recognition, statistics, and database systems
• Traditional Techniques
  may be unsuitable due to
                                                  Statistics/       Machine Learning/
 • Enormity of data                                   AI                  Pattern
 • High dimensionality                                                  Recognition
   of data
                                                          Data Mining
 • Heterogeneous,
   distributed nature
   of data                                                 Database
                                                           systems
      Tan, M. Steinbach and V. Kumar, Introduction to Data Mining
                                                         31
Data Mining: Confluence of Multiple Disciplines
            Database
            Technology                 Statistics
 Machine                                            Visualization
                         Data Mining
 Learning
   Pattern
 Recognition                                      Other
                                                Disciplines
                             32
Knowledge Discovery (KDD) Process
                                            33
Knowledge Discovery (KDD) Process
 • Data cleaning
 • Data integration from multiple sources
 • Data selection for data mining
 • Data transformation
 • Data mining
 • Patterns evaluation
 • Presentation of the mining results
                             34
Knowledge Discovery (KDD) Process
                             35
Knowledge Discovery (KDD) Process
                             36
Knowledge Discovery (KDD) Process
                             37
Knowledge Discovery (KDD) Process
7
                                                                         38
The data analysis pipeline
• Mining is not the only step in the analysis process
       Data                                                      Result
   Preprocessing                   Data Mining               Post-processing
• Preprocessing: real data is noisy, incomplete and inconsistent. Data cleaning
 is required to make sense of the data
  • Techniques: Sampling, Dimensionality Reduction, Feature selection.
  • A dirty work, but it is often the
                                most important step for the analysis.
• Post-Processing: Make the data actionable and useful to the user
  • Statistical analysis of importance
  • Visualization.
• Pre- and Post-processing are often data mining tasks as well
• Because of the many ways data can be collected and stored, data
 preprocessing is perhaps the most laborious and time-
 consuming step in the overall knowledge discovery process.
Post-processing
• Visualization
  • The human eye is a powerful analytical tool
  • If we visualize the data properly, we can discover
    patterns
  • Visualization is the way to present the data so that
    patterns can be seen
    • E.g., histograms and plots are a form of visualization
    • There are multiple techniques (a field on its own)
 Data Quality
 • Examples of data quality problems:
   • Noise and outliers
                               Tid Refund Marital            Taxable
   • Missing values                       Status             Income Cheat
   • Duplicate data            1   Yes    Single             125K    No
                                         2   No    Married   100K    No
                                         3   No    Single    70K     No
                                         4   Yes   Married   120K    No
 A mistake or a millionaire?             5   No    Divorced 10000K   Yes
                                         6   No    NULL      60K     No
     Missing values                      7   Yes   Divorced 220K     NULL
                                         8   No    Single    85K     Yes
                                         9   No    Married   90K     No
Inconsistent duplicate entries
                                         9   No    Single    90K     No
                                    10
                                                                  41
Sampling
• Sampling     is the main technique employed for data
  selection.
 • It is often used for both the preliminary investigation of the data and
   the final data analysis.
• Statisticians sample because obtaining the entire set of
  data of interest is too expensive or time consuming.
• Sampling is used in data mining because processing the
  entire set of data of interest is too expensive or time
  consuming.
                                                       42
Sampling …
• The key principle for effective sampling is the
 following:
 • using a sample will work almost as well as using the
   entire data sets, if the sample is representative
 • A sample is representative if it has approximately the
   same property (of interest) as the original set of data
                                                                     43
Types of Sampling
• Simple Random Sampling
  • There is an equal probability of selecting any particular item
• Sampling without replacement
  • As each item is selected, it is removed from the population
• Sampling with replacement
  • Objects are not removed from the population as they are selected
    for the sample.
     •    In sampling with replacement, the same object can be picked up more
         than once
• Stratified sampling
  • Split the data into several partitions; then draw random samples
    from each partition
                                44
Sample Size
  8000 points   2000 Points   500 Points