CP1444: DATA MINING &
WAREHOUSING
SYLLABUS
Module I:
Introduction: Data, Information, Knowledge, KDD, types of data for mining,
technologies for mining, issues in data mining, data mining functionalities/tasks. Data
preprocessing overview, Data cleaning, Data integration, Data reduction, Data transformation and
discretization. Data Warehouses: basic concepts, Data Mart, Databases vs. Data Warehouses, Data
Warehouses vs. Data Marts, OLTP vs. OLAP, OLAP operations/functions, OLAP Multi-Dimensional
Models: Data cubes, Star, Snowflake, and Fact Constellation data models.
Module II:
Association rules: Market Basket Analysis, Frequent Itemsets, Closed Itemsets, and Association
Rules, Frequent Itemset Mining Methods: Apriori Algorithm: Finding Frequent Itemsets by
Confined Candidate Generation, Generating Association Rules from Frequent Itemsets, Improving
the Efficiency of Apriori.
Module III:
Classification: Basic Concepts, Decision Tree Induction, Bayesian Classification, Rule-Based
Classification, Classification by Backpropagation, Support Vector Machines, Associative Classification,
Lazy Learners
Module IV:
Clustering: Cluster analysis: definition and requirements, Characteristics of clustering techniques, Types
of data in cluster analysis, Overview of Basic Clustering Methods, Partitioning methods: K-Means and
K-Medoid methods, Outlier detection: definition and types of outliers, Outlier Detection Methods:
Supervised, Semi-Supervised, and Unsupervised Methods, Statistical Methods, Proximity-Based
Methods, and Clustering-Based Methods (basic concepts only)
CORE TEXT
1. Jiawei Han, Micheline Kamber & Jian Pei, Data Mining: Concepts and
   Techniques, 3rd Edition, Morgan Kaufmann, 2011.
https://myweb.sabanciuniv.edu/rdehkharghani/files/2016/02/The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Kamber-Jian-Pei-Data-Mining.-Concepts-and-Techniques-3rd-Edition-Morgan-Kaufmann-2011.pdf
ADDITIONAL REFERENCES
2. Sunitha Tiwari & Neha Chaudary, Data Mining and Warehousing, Dhanpat
   Rai & Co.
Reference: Chapter 1
https://hanj.cs.illinois.edu/bk3/bk3_slidesindex.htm
                         WHAT IS DATA MINING?
• To refer to the mining of gold from rocks or sand, we say gold mining
  instead of rock or sand mining.
• Analogously, data mining should have been more appropriately named
  “knowledge mining from data,” which is unfortunately somewhat long.
• Data mining refers to extracting or mining knowledge from large
  amounts of data.
• It is the computational process of discovering patterns in large data sets involving
  methods at the intersection of artificial intelligence, machine learning, statistics, and
  database systems.
• The overall goal of the data mining process is to extract information from a data set
  and transform it into an understandable structure for further use.
• The key properties of data mining are:
 • Automatic discovery of patterns
 • Prediction of likely outcomes
 • Creation of actionable information
 • Focus on large datasets and databases
Data
• Data is a collection of facts, such as numbers, words, measurements,
  observations, or just descriptions of things.
• Data can be qualitative or quantitative.
• Qualitative data is descriptive information (it describes something)
• Quantitative data is numerical information (numbers)
Information
• Raw data is collected and processed; the outcome of that processing is
  information.
• Information is data that has been processed, organized, and presented in
  a specific context so that it serves a purpose.
• Information cannot exist without data, and most information is expressed
  in measurable units such as quantity or time. Data and information are
  therefore not the same thing. For information to be useful, the
  processed data should have the following characteristics:
  • Timeliness – Information should be available whenever it is required.
  • Accuracy – Information should be factual and well organized; only then can
    it serve its purpose.
  • Completeness – Information should be finite and consistent.
• Some examples of information:
  1.   Information about transportation systems, such as train schedules.
  2.   Geographical information, such as directions.
  3.   Payslips.
  4.   Bank passbooks.
  5.   Printed documents.
Knowledge
• Knowledge is information that has been processed, organized, or
  structured in some way, or put into practice in some way.
• Knowledge means the familiarity and awareness of a person, place,
  events, ideas, issues, ways of doing things or anything else, which is
  gathered through learning, perceiving or discovering.
• It is the state of knowing something with cognizance through the
  understanding of concepts, study and experience.
Why do we need Data Mining?
The volume of information we have to handle is increasing every day. It comes from business
transactions, scientific data, sensor data, pictures, videos, etc. We therefore need a system
that can extract the essence of the available information and automatically generate
reports, views, or summaries of the data for better decision-making.
Why is Data Mining used in Business?
Data mining is used in business to make better managerial decisions by:
1. Automatically summarizing data.
2. Extracting the essence of the stored information.
3. Discovering patterns in raw data.
   KNOWLEDGE DISCOVERY FROM DATA, OR KDD
• KDD (Knowledge Discovery in Databases) is a process that involves
  the extraction of useful, previously unknown, and potentially
  valuable information from large datasets.
• The KDD process in data mining typically involves the following
  steps:
   I. Data cleaning (to remove noise and inconsistent data)
   II. Data integration (where multiple data sources may be combined)
   III. Data selection (where data relevant to the analysis task are retrieved from the database)
   IV. Data transformation (where data are transformed and consolidated into forms appropriate
       for mining by performing summary or aggregation operations)
   V. Data mining (an essential process where intelligent methods are applied to extract data
      patterns)
   VI. Pattern evaluation (to identify the truly interesting patterns representing knowledge based
       on interestingness measures)
   VII. Knowledge presentation (where visualization and knowledge representation techniques
        are used to present mined knowledge to users)
   Steps I to IV are different forms of data preprocessing.
   The term data mining is often used to refer to the entire knowledge discovery process.
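The seven steps above can be sketched end to end in a few lines of Python. This is only a minimal illustration on made-up sales records, not a prescribed implementation: the record fields, the function names, and the thresholds (`min_total`, `top_n`) are all assumptions introduced for the example, and the "mining" step is deliberately trivial.

```python
from collections import Counter

# Hypothetical raw records from two sources (all values are made up).
store_a = [{"id": 1, "item": "milk", "qty": 2},
           {"id": 2, "item": "bread", "qty": None}]   # missing value
store_b = [{"id": 3, "item": "milk", "qty": 1},
           {"id": 4, "item": "pen", "qty": 5},
           {"id": 5, "item": "milk", "qty": 3}]

def clean(records):
    """I. Data cleaning: drop records with a missing quantity."""
    return [r for r in records if r["qty"] is not None]

def integrate(*sources):
    """II. Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, items):
    """III. Data selection: keep only records relevant to the task."""
    return [r for r in records if r["item"] in items]

def transform(records):
    """IV. Data transformation: aggregate the quantity sold per item."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["qty"]
    return totals

def mine(totals, min_total):
    """V. Data mining: here, simply keep items with at least min_total units."""
    return {item: total for item, total in totals.items() if total >= min_total}

def evaluate(patterns, top_n):
    """VI. Pattern evaluation: rank patterns and keep the most interesting."""
    return sorted(patterns.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

def present(ranked):
    """VII. Knowledge presentation: format the result for a reader."""
    return [f"{item}: {total} units sold" for item, total in ranked]

data = integrate(clean(store_a), clean(store_b))
data = select(data, items={"milk", "bread", "pen"})
report = present(evaluate(mine(transform(data), min_total=5), top_n=2))
print(report)  # ['milk: 6 units sold', 'pen: 5 units sold']
```

Each function maps to one step, so the nesting on the last lines reads as the pipeline itself.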
       PYQ: What is data mining? Outline the stages in the knowledge discovery process. (1
1. Selection: Select a relevant subset of the data for analysis.
2. Pre-processing: Clean and transform the data to make it ready for analysis. This may include tasks such
   as data normalization, missing value handling, and data integration.
3. Transformation: Transform the data into a format suitable for data mining, such as a matrix or a graph.
4. Data Mining: Apply data mining techniques and algorithms to the data to extract useful information and
   insights. This may include tasks such as clustering, classification, association rule mining, and anomaly
   detection.
5. Interpretation: Interpret the results and extract knowledge from the data. This may include tasks such as
   visualizing the results, evaluating the quality of the discovered patterns and identifying relationships and
   associations among the data.
6. Evaluation: Evaluate the results to ensure that the extracted knowledge is useful, accurate, and
   meaningful.
7. Deployment: Use the discovered knowledge to solve the business problem and make decisions.
                  WHAT KINDS OF DATA CAN BE MINED?
• The most basic forms of data for mining applications are database
  data, data warehouse data and transactional data.
• Data mining can also be applied to other forms of data (e.g., data
  streams, ordered/sequence data, graph or networked data, spatial
  data, text data, multimedia data, and the WWW).
I. Database Data
• A database system, also called a database management system (DBMS), consists of a
  collection of interrelated data, known as a database, and a set of software programs to
  manage and access the data.
• A relational database is a collection of tables, each of which is assigned a unique name.
  Each table consists of a set of attributes (columns or fields) and usually stores a large set
  of tuples (records or rows).
• When mining relational databases, we can go further by searching for trends or data
  patterns. For example, data mining systems can analyze customer data to predict the
  credit risk of new customers based on their income, age, and previous credit
  information.
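As a rough illustration of the points above, the sketch below builds a small relational table with Python's built-in `sqlite3` module and then asks a mining-style question of it. The table name `customer`, its columns, the sample tuples, and the threshold rule in `predict_risk` are all hypothetical; a real credit-risk model would be far more careful than this.

```python
import sqlite3

# An in-memory relational database with one uniquely named table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    " cust_id INTEGER PRIMARY KEY,"   # attributes (columns)
    " age INTEGER,"
    " income REAL,"
    " defaulted INTEGER)"             # 1 = defaulted on earlier credit
)

# Tuples (rows); all values are made up.
rows = [(1, 25, 20000.0, 1), (2, 40, 65000.0, 0),
        (3, 31, 30000.0, 1), (4, 52, 90000.0, 0)]
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", rows)

# Look for a trend: average income of customers who did / did not default.
avg_income = dict(conn.execute(
    "SELECT defaulted, AVG(income) FROM customer GROUP BY defaulted"))

# A deliberately naive rule derived from that trend (assumption for the example).
threshold = (avg_income[0] + avg_income[1]) / 2

def predict_risk(income):
    """Label a new customer by comparing income with the learned threshold."""
    return "high" if income < threshold else "low"

print(avg_income)             # {0: 77500.0, 1: 25000.0}
print(predict_risk(28000.0))  # high
```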
II. Data Warehouses
• A data warehouse is a repository of information collected from multiple sources,
  stored under a unified schema, and usually residing at a single site.
• Data warehouses are constructed via a process of data cleaning, data integration,
  data transformation, data loading, and periodic data refreshing.
• A data warehouse is usually modeled by a multidimensional data structure, called a
  data cube, in which each dimension corresponds to an attribute or a set of attributes
  in the schema, and each cell stores the value of some aggregate measure such as
  count or sum(sales_amount).
• A data cube provides a multidimensional view of data and allows the
  precomputation and fast access of summarized data.
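To make the idea of a data cube concrete, here is a minimal sketch that precomputes the sum of sales amounts for every combination of three dimensions. The dimensions (city, quarter, item), the sample facts, and the `ALL` marker are assumptions chosen for illustration; real warehouses use far more efficient storage than a Python dictionary.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sales facts: (city, quarter, item, sales_amount)
sales = [
    ("Delhi", "Q1", "phone", 100.0),
    ("Delhi", "Q1", "laptop", 250.0),
    ("Delhi", "Q2", "phone", 120.0),
    ("Kochi", "Q1", "phone", 80.0),
    ("Kochi", "Q2", "laptop", 300.0),
]

ALL = "*"  # marker meaning "aggregated over this dimension"

def build_cube(facts):
    """Precompute the summed sales amount for every cell of the cube,
    including cells where one or more dimensions are rolled up to ALL."""
    cube = defaultdict(float)
    for city, quarter, item, amount in facts:
        # Each fact contributes to 2 x 2 x 2 = 8 cells.
        for key in product((city, ALL), (quarter, ALL), (item, ALL)):
            cube[key] += amount
    return cube

cube = build_cube(sales)
print(cube[("Delhi", "Q1", ALL)])  # 350.0  (Delhi, Q1, all items)
print(cube[(ALL, ALL, "phone")])   # 300.0  (phones, everywhere)
print(cube[(ALL, ALL, ALL)])       # 850.0  (grand total)
```

Because every summary is computed once up front, answering a query is a single dictionary lookup, which is the "fast access of summarized data" the slide refers to.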
III.   Transactional Data
• Each record in a transactional database captures a transaction, such as a customer’s purchase, a flight booking,
  or a user’s clicks on a web page.
• A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up
  the transaction, such as the items purchased in the transaction.
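A transactional data set can be pictured as a mapping from transaction IDs to item lists. The sketch below uses made-up IDs and items, and the helper `build_item_index` is a hypothetical name; it shows one simple way to find every transaction that contains a given item.

```python
from collections import defaultdict

# Hypothetical transactional data: trans ID -> items in the transaction
transactions = {
    "T100": ["milk", "bread", "eggs"],
    "T200": ["milk", "bread"],
    "T300": ["bread", "butter"],
}

def build_item_index(db):
    """Map each item to the IDs of the transactions that contain it."""
    index = defaultdict(list)
    for trans_id, items in db.items():
        for item in items:
            index[item].append(trans_id)
    return index

item_index = build_item_index(transactions)
print(item_index["milk"])   # ['T100', 'T200']
print(item_index["bread"])  # ['T100', 'T200', 'T300']
```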
IV. Other Kinds of Data
• Time-related or sequence data (e.g., historical records, stock exchange data, and time-series and biological
  sequence data),
• data streams (e.g., video surveillance and sensor data, which are continuously transmitted),
• spatial data (e.g., maps),
• engineering design data (e.g., the design of buildings, system components, or integrated circuits),
• hypertext and multimedia data (including text, image, video, and audio data),
• graph and networked data (e.g., social and information networks), and
• the Web (a huge, widely distributed information repository made available by the Internet).
         WHAT KINDS OF PATTERNS CAN BE MINED?
          DATA MINING FUNCTIONALITIES (ESSAY)
1. Characterization and discrimination
2. Mining of frequent patterns, associations, and correlations
3. Classification and regression
4. Cluster analysis
5. Outlier analysis
• Data mining functionalities are used to specify the kinds of patterns to be found in data
  mining tasks.
• In general, such tasks can be classified into two categories: descriptive and predictive.
• Descriptive mining tasks characterize properties of the data in a target data set.
• Predictive mining tasks perform induction on the current data in order to make predictions.
                                              DATA MINING FUNCTIONALITIES
I. Class/Concept Description: Characterization and Discrimination
• Data entries can be associated with classes or concepts.
• For example, classes of items for sale include computers and printers, and concepts
  of customers include bigSpenders and budgetSpenders.
• It can be useful to describe individual classes and concepts in summarized, concise,
  and yet precise terms. Such descriptions of a class or a concept are called
  class/concept descriptions.
• These descriptions can be derived using
    • Data Characterization − A summary of the general characteristics or features of the class under study, which is
     called the target class.
    • Data Discrimination − A comparison of the general features of the target class against the general features of
     one or more contrasting classes.
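The two kinds of description can be sketched on the bigSpenders and budgetSpenders concepts mentioned above. The sketch takes discrimination to mean a comparison of the target class against a contrasting class, as in the core text. The customer records, the field names, and the choice of mean age and mean spend as the "general features" are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical customer records (all values are made up).
customers = [
    {"name": "A", "segment": "bigSpenders", "age": 45, "yearly_spend": 9000},
    {"name": "B", "segment": "bigSpenders", "age": 50, "yearly_spend": 11000},
    {"name": "C", "segment": "budgetSpenders", "age": 22, "yearly_spend": 900},
    {"name": "D", "segment": "budgetSpenders", "age": 28, "yearly_spend": 1100},
]

def characterize(records, segment):
    """Data characterization: summarize the general features of one class."""
    rows = [r for r in records if r["segment"] == segment]
    return {
        "count": len(rows),
        "mean_age": mean(r["age"] for r in rows),
        "mean_spend": mean(r["yearly_spend"] for r in rows),
    }

def discriminate(records, target, contrast):
    """Data discrimination: compare the target class with a contrasting class
    by taking the difference of their summarized features."""
    t = characterize(records, target)
    c = characterize(records, contrast)
    return {key: t[key] - c[key] for key in ("mean_age", "mean_spend")}

print(characterize(customers, "bigSpenders"))
print(discriminate(customers, "bigSpenders", "budgetSpenders"))
```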
II. Mining Frequent Patterns, Associations, and Correlations
• Frequent patterns are patterns that occur frequently in data.
• There are many kinds of frequent patterns, including
 •   frequent itemsets
 •   frequent subsequences
 •   frequent substructures.
• A frequent itemset typically refers to a set of items that often appear together in a transactional
  data set, for example, milk and bread, which are frequently bought together in grocery stores by
  many customers.
• A frequently occurring subsequence, such as the pattern that customers tend to purchase first a
  laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.
• A substructure can refer to different structural forms (e.g., graphs, trees, or lattices) that may be
  combined with itemsets or subsequences. If a substructure occurs frequently, it is called a
  (frequent) structured pattern.
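The milk-and-bread example can be checked by counting how often each pair of items appears together. The sketch below is a brute-force count over made-up baskets, not the Apriori algorithm covered in Module II; the function name `frequent_itemsets` and the `min_support` value are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets (one set of items per transaction).
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]

def frequent_itemsets(baskets, size, min_support):
    """Count every itemset of the given size and keep those whose support
    (fraction of baskets containing the itemset) is at least min_support."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives each itemset one canonical tuple form
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    n = len(baskets)
    return {itemset: count / n
            for itemset, count in counts.items()
            if count / n >= min_support}

frequent_pairs = frequent_itemsets(baskets, size=2, min_support=0.5)
print(frequent_pairs)
```

With these baskets, the pair (bread, milk) appears in three of four baskets and (bread, butter) in two of four, so both meet the 50% support threshold while every other pair is dropped.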
III. Classification and Regression for Predictive Analysis
• Classification is the process of finding a model (or function) that describes and distinguishes
  data classes or concepts.
• The model is derived based on the analysis of a set of training data (i.e., data objects for which the
  class labels are known). The model is then used to predict the class label of objects for which the
  class label is unknown.
• Regression is used to predict missing or unavailable numerical data values
  rather than (discrete) class labels.
• Regression analysis is a statistical methodology that is most often used for
  numeric prediction.
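The difference between the two tasks can be seen in a small sketch: a classifier returns a discrete label, while a regression model returns a number. The choice of a 1-nearest-neighbour classifier and a least-squares straight line is an assumption made here for brevity (many other models exist), and all training values are made up.

```python
def predict_label(training, point):
    """Classification: return the class label of the closest training object
    (1-nearest neighbour, using squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(training, key=lambda pair: sq_dist(pair[0], point))
    return label

def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Training data with known class labels: (income in thousands, age) -> label
training = [((20, 25), "risky"), ((25, 30), "risky"),
            ((70, 45), "safe"), ((90, 50), "safe")]
label = predict_label(training, (80, 40))  # object with unknown label
print(label)  # safe

# Numeric prediction: learn y from x, then predict y for an unseen x.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted = slope * 5 + intercept
print(predicted)  # 11.0
```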
IV. Cluster Analysis
• Objects are clustered or grouped based on the principle of maximizing the
  intraclass similarity and minimizing the interclass similarity.
• That is, clusters are formed so that objects within a cluster have high
  similarity to one another but are rather dissimilar to objects in other clusters.
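The principle can be checked numerically by comparing two candidate groupings of the same points. In this sketch similarity is treated as the opposite of distance (a small distance means high similarity); the one-dimensional points, the two groupings, and the function names are assumptions for illustration, and no clustering algorithm is run here.

```python
from itertools import combinations, product

def mean_intra_distance(clusters):
    """Average distance between pairs of points in the SAME cluster
    (small value = high intraclass similarity)."""
    dists = [abs(a - b)
             for cluster in clusters
             for a, b in combinations(cluster, 2)]
    return sum(dists) / len(dists)

def mean_inter_distance(clusters):
    """Average distance between pairs of points in DIFFERENT clusters
    (large value = low interclass similarity)."""
    dists = [abs(a - b)
             for c1, c2 in combinations(clusters, 2)
             for a, b in product(c1, c2)]
    return sum(dists) / len(dists)

# Two ways of grouping the same six made-up points into two clusters.
good = [[1, 2, 3], [10, 11, 12]]
bad = [[1, 2, 10], [3, 11, 12]]

for name, grouping in (("good", good), ("bad", bad)):
    print(name, mean_intra_distance(grouping), mean_inter_distance(grouping))
```

The first grouping has the smaller within-cluster distance and the larger between-cluster distance, which is what a clustering method aims for.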
V. Outlier Analysis
• A data set may contain objects that do not comply with the general behavior
  or model of the data. These data objects are outliers.
• Many data mining methods discard outliers as noise or exceptions.
• In some applications (e.g., fraud detection), the rare events can be more
  interesting than the more regularly occurring ones. The analysis of outlier data
  is referred to as outlier analysis or anomaly mining.
• Example: outlier analysis may uncover fraudulent usage of credit cards by detecting
  purchases of unusually large amounts for a given account number in
  comparison to regular charges incurred by the same account.
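The credit card example can be sketched with a simple statistical test: a new purchase is flagged when it lies many standard deviations away from the account's regular charges. The charge history, the function name `is_outlier`, and the threshold of 3 standard deviations are assumptions made for illustration; this is one basic statistical approach among several.

```python
from statistics import mean, stdev

# Hypothetical regular charges for one account (made-up amounts).
regular_charges = [42.0, 55.0, 38.0, 61.0, 47.0, 52.0]

def is_outlier(amount, history, threshold=3.0):
    """Flag a purchase whose distance from the mean of the regular charges
    exceeds `threshold` standard deviations (a z-score test)."""
    mu = mean(history)
    sigma = stdev(history)
    return abs(amount - mu) / sigma > threshold

print(is_outlier(2500.0, regular_charges))  # True: unusually large purchase
print(is_outlier(58.0, regular_charges))    # False: within normal behavior
```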