Unit 1 DMDW

The document discusses data mining and the knowledge discovery process. It describes how data mining is the core of knowledge discovery and involves tasks like data cleaning, integration, selection, transformation and pattern evaluation. Several applications of data mining are mentioned like banking, customer analytics, fraud detection and more.

Uploaded by

Chandrika Surya
Knowledge Discovery (KDD) Process

• Data mining is the core of the knowledge discovery process.

[Figure: the KDD process. Stages shown, bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Pattern Evaluation.]

Applications
• Banking: loan/credit card approval
  • predict good customers based on old customers
• Customer relationship management:
  • identify those who are likely to leave for a competitor
• Targeted marketing:
  • identify likely responders to promotions
• Fraud detection: telecommunications, financial transactions
  • identify fraudulent events in an online stream of events
• Manufacturing and production:
  • automatically adjust knobs when process parameters change
Applications (continued)
• Medicine: disease outcome, effectiveness of treatments
  • analyze patient disease history: find relationships between diseases
• Molecular/Pharmaceutical: identify new drugs
• Scientific data analysis:
  • identify new galaxies by searching for sub-clusters
• Web site/store design and promotion:
  • find affinity of visitors to pages and modify layout
Application Areas

Industry                  Application
Finance                   Credit card analysis
Insurance                 Claims, fraud analysis
Telecommunication         Call record analysis
Transport                 Logistics management
Consumer goods            Promotion analysis
Data service providers    Value-added data
Utilities                 Power usage analysis
What motivated data mining? Why is it important?

There is a huge amount of data available in the Information Industry. This data is of no use
until it is converted into useful information. It is necessary to analyse this huge amount of data and
extract useful information from it.

The information and knowledge gained can be used for applications ranging from
business management, production control, and market analysis, to engineering design
and science exploration.
What Is Data Mining?

• Data mining refers to extracting or “mining” knowledge from large amounts of data.

• Data mining is often treated as a synonym for another popularly used term, Knowledge Discovery
from Data, or KDD.
Data Mining: A KDD Process

[Figure 1.4: Data mining as a step in the process of knowledge discovery. Stages shown, bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Data Selection (Data Preprocessing); Task-relevant Data; Data Mining; Pattern Evaluation.]

Knowledge discovery as a process is depicted in Figure 1.4 and consists of an iterative
sequence of the following steps:

1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
database)
4. Data transformation (where data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations, for
instance)
5. Data mining (an essential process where intelligent methods are applied in order to
extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing
knowledge, based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present the mined knowledge to the user)
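
The steps above can be sketched as a chain of small functions. This is only an illustrative sketch, not a real mining system: the records, the function names (clean, integrate, select, transform, mine) and the thresholds are all assumptions made up for the example, and the "pattern" found in step 5 is deliberately trivial.

```python
from collections import Counter

# Hypothetical raw records from two sources: (customer, item, amount).
store_a = [("ann", "computer", 900.0), ("bob", "printer", None)]
store_b = [("ann", "software", 120.0), ("cid", "computer", 950.0)]


def clean(records):
    """Step 1: drop records with a missing amount (a crude noise filter)."""
    return [r for r in records if r[2] is not None]


def integrate(*sources):
    """Step 2: combine several data sources into one list."""
    return [r for source in sources for r in source]


def select(records, min_amount):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r[2] >= min_amount]


def transform(records):
    """Step 4: consolidate by aggregating total spend per customer."""
    totals = Counter()
    for customer, _item, amount in records:
        totals[customer] += amount
    return dict(totals)


def mine(totals, threshold):
    """Step 5: a trivial 'pattern': customers whose total exceeds threshold."""
    return sorted(c for c, total in totals.items() if total > threshold)


data = select(integrate(clean(store_a), clean(store_b)), min_amount=100.0)
totals = transform(data)
big_spenders = mine(totals, threshold=1000.0)

# Steps 6 and 7 (evaluation and presentation) are reduced to printing here.
print(totals)
print(big_spenders)
```

In practice the sequence is iterative: a weak result at step 5 or 6 usually sends the analyst back to the cleaning or selection step.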
Architecture of a typical data mining system
Based on this view, the architecture of a typical data mining system may have the following major components.

1) Database, data warehouse, World Wide Web, or other information repository:

This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories.
Data cleaning and data integration techniques may be performed on the data.

2) Database or data warehouse server:

The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining
request.

3) Knowledge base:

This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such
knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.
Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also
be included.

Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g.,
describing data from multiple heterogeneous sources).
4) Data mining engine:

This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as
characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and
evolution analysis.

5) Pattern evaluation module:

This component typically employs interestingness measures and interacts with the data mining modules so as to focus the
search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns.
Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the
implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation
of pattern interestingness as deep as possible into the mining process so as to confine the search to only the interesting
patterns.

6) User interface:

This module communicates between users and the data mining system, allowing the user to interact
with the system by specifying a data mining query or task, providing information to help focus the search, and
performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the
user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the
patterns in different forms.
[Figure: Architecture of a typical data mining system. Components shown, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Data cleaning, data integration, and filtering; Databases and Data Warehouse. A Knowledge base is also shown.]
Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced data and information systems and advanced applications
  a) Object-oriented and object-relational databases
  b) Spatial databases
  c) Time-series data and temporal data
  d) Text databases and multimedia databases
  e) Heterogeneous and legacy databases
  f) WWW

1) Relational Databases

A relational database is a collection of tables, each of which is assigned a unique name.

Each table consists of a set of attributes (columns or fields) and usually stores a large set
of tuples (records or rows).

When data mining is applied to relational databases, we can go further by searching for trends or data patterns.

For example, data mining systems can analyze customer data to predict the credit risk of new customers based on their
income, age, and previous credit information. Data mining systems may also detect deviations, such as items whose sales
are far from those expected in comparison with the previous year.
2) Data Warehouses

Suppose that AllElectronics is a successful international company, with branches around the world. Each
branch has its own set of databases.
The president of AllElectronics has asked you to provide an analysis of the company’s sales per item type
per branch for the third quarter.

• This is a difficult task, particularly since the relevant data are spread out over several databases,
physically located at numerous sites.

• If AllElectronics had a data warehouse, this task would be easy.

• A data warehouse may store a summary of the transactions per item type for each store or,
summarized to a higher level, for each sales region.

• A data warehouse is usually modelled by a multidimensional database structure, where
each dimension corresponds to an attribute or a set of attributes in the schema, and each
cell stores the value of some aggregate measure, such as count or sales amount.

• A data cube provides a multidimensional view of data and allows the precomputation and
fast accessing of summarized data.

• OLAP operations include drill-down and roll-up, which allow the user to view the
data at differing degrees of summarization.
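
Roll-up and drill-down can be illustrated with plain grouping and summing. The sketch below assumes a tiny made-up fact table of (region, store, item type, sales amount) rows; the helper name aggregate and all values are hypothetical, and a real warehouse would precompute such cells rather than scan rows each time.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, store, item_type, sales_amount).
sales = [
    ("north", "store1", "computer", 1000),
    ("north", "store1", "printer", 200),
    ("north", "store2", "computer", 700),
    ("south", "store3", "computer", 400),
    ("south", "store3", "printer", 100),
]


def aggregate(rows, key_indexes):
    """Sum sales_amount grouped by the chosen dimension columns."""
    cells = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        cells[key] += row[3]
    return dict(cells)


# Detailed view: one cell per (store, item_type).
by_store_item = aggregate(sales, key_indexes=(1, 2))

# Roll-up: summarize stores to the higher level of sales regions.
by_region_item = aggregate(sales, key_indexes=(0, 2))

# Roll-up further by dropping the item dimension altogether.
by_region = aggregate(sales, key_indexes=(0,))

print(by_region_item)
print(by_region)
```

Drill-down is the reverse direction: moving from by_region back to by_region_item or by_store_item to see the detail behind a total.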
3) Transactional Databases
A transactional database consists of a file where each record represents a transaction.

A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the
transaction (such as items purchased in a store).

• A simple retrieval query on such a database might be “Show me all the items purchased by Sandy Smith” or
“How many transactions include item number I3?”

• A deeper question is “Which items sold well together?” Data mining systems for transactional data can answer it
by identifying frequent itemsets, that is, sets of items that are frequently sold together.

• For example, given the knowledge that printers are commonly purchased together with computers,
you could offer an expensive model of printers at a discount to customers buying selected computers,
in the hopes of selling more of the expensive printers.
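
The difference between a retrieval query and frequent itemset mining can be sketched as follows. The transaction IDs and item numbers are invented for illustration, and the function frequent_pairs is a hypothetical helper that only counts pairs by brute force; it is not a full frequent itemset algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactional database: trans ID -> list of item numbers.
transactions = {
    "T100": ["I1", "I3", "I8"],
    "T200": ["I1", "I3"],
    "T300": ["I2", "I3"],
    "T400": ["I1", "I3", "I2"],
}


def frequent_pairs(transactions, min_count):
    """Count every pair of items sold together; keep pairs seen at least min_count times."""
    counts = Counter()
    for items in transactions.values():
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Plain retrieval: how many transactions include item number I3?
with_i3 = sum(1 for items in transactions.values() if "I3" in items)

# Mining: which pairs of items are frequently sold together?
pairs = frequent_pairs(transactions, min_count=2)

print(with_i3)
print(pairs)
```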
Object-oriented Databases

Object-oriented databases are based on the object-oriented programming paradigm.

Each entity, such as an employee, a customer, or an item, is considered as an object.

• Each object has associated with it the following: 1) variables (e.g., name, address, DOB), 2) messages, and 3)
methods (e.g., get-photo).

• Inheritance: for example, a sales_person object would inherit all of the variables of its superclass, employee.

Object-Relational Databases

The object-relational data model inherits the essential concepts of object-oriented databases, where, in
general terms, each entity is considered as an object.

For data mining in object-relational systems, techniques need to be developed for handling complex object
structures, complex data types, class and subclass hierarchies, property inheritance, and methods and
procedures.
Spatial Databases
Spatial databases contain spatial-related information.

Examples include geographic (map) databases, very large-scale integration (VLSI) or computer-aided design databases, and
medical and satellite image databases.

Data mining may uncover patterns describing the characteristics of houses located near a specified
kind of location, such as a park, for instance.
Temporal Databases, Sequence Databases, and Time-Series Databases
• A temporal database typically stores relational data that include time-related attributes.

• A sequence database stores sequences of ordered events, with or without a concrete
notion of time.

• Examples include customer shopping sequences, Web click streams, and biological sequences.

• A time-series database stores sequences of values or events obtained over repeated measurements of
time (e.g., hourly, daily, weekly).

• Examples include data collected from the stock exchange, inventory control, and the observation of
natural phenomena (like temperature and wind).

• Data mining techniques can be used to find the characteristics of object evolution, or
the trend of changes for objects in the database. Such information can be useful in decision
making and strategy planning.

• For example: when is the best time to purchase AllElectronics stock, or more generally, when to invest in the
stock market?
Text Databases and Multimedia Databases
• Text databases are databases that contain word descriptions for objects.

• These word descriptions are usually not simple keywords but rather long sentences or paragraphs, such as
product specifications, error or bug reports, warning messages, summary reports, notes, or other
documents.

• By mining text data, one may uncover general and concise descriptions of the text documents, keyword
or content associations, as well as the clustering behavior of text objects.

Multimedia databases store image, audio, and video data. They are used in
applications such as picture content-based retrieval, voice-mail systems, video-on-
demand systems.

For multimedia data mining, storage and search techniques need to be integrated with
standard data mining methods. Promising approaches include the construction of
multimedia data cubes, the extraction of multiple features from multimedia data, and
similarity-based pattern matching.
World Wide Web

• The World Wide Web and its associated distributed information services, such as Yahoo!, Google, America
Online, and AltaVista, provide rich, worldwide, on-line information services, where data objects are linked
together to facilitate interactive access.

• Capturing user access patterns in such distributed information environments is called Web usage
mining (or Weblog mining).

• Data mining can often provide more help here than Web search services alone.

• For example, authoritative Web page analysis based on linkages among Web pages can help rank Web pages
based on their importance, influence, and topics.

• Automated Web page clustering and classification help group and arrange Web pages in a multidimensional
manner based on their contents.
Data Mining Models and Tasks

Data Mining Functionalities: What Kinds of Patterns Can Be Mined?

• Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks.

• In general, data mining tasks can be classified into two categories: descriptive and predictive.

• Descriptive mining tasks characterize the general properties of the data in the database.

• Predictive mining tasks perform inference on the current data in order to make predictions.

1) Concept/Class Description: Characterization and Discrimination
2) Association and correlations
3) Classification and Prediction
4) Cluster analysis
5) Outlier analysis
6) Evolution analysis
1) Concept/Class Description: Characterization and Discrimination

• Data can be associated with classes or concepts.

• For example, in the AllElectronics store, classes of items for sale include computers and
printers, and concepts of customers include bigSpenders and budgetSpenders.

• Descriptions of a class can be derived through data characterization, by summarizing the data of the class
under study.

• Data characterization is a summarization of the general characteristics or features of
a target class of data.

• The output of data characterization can be presented in various forms.
Examples include pie charts, bar charts, curves, multidimensional data cubes, and
multidimensional tables, including crosstabs.
Example 1.4 Data characterization.
A data mining system should be able to produce a description summarizing the characteristics of customers who
spend more than $1,000 a year at AllElectronics.
The result could be a general profile of the customers, for example that they are 40–50 years old, employed, and have
excellent credit ratings.

• Data discrimination is a comparison of the general features of target class data objects
with the general features of objects from one or a set of contrasting classes.

Example: Data discrimination.

A data mining system should be able to compare two groups of AllElectronics customers, such as those
who shop for computer products regularly (more than two times a month) versus those who rarely shop for
such products (i.e., less than three times a year).

The resulting description provides a general comparative profile of the customers, such as 80% of the
customers who frequently purchase computer products are between 20 and 40 years old and have a
university education, whereas 60% of the customers who infrequently buy such products are either seniors or
youths, and have no university degree.
2) Association and correlations
Association analysis: multidimensional vs. single-dimensional association

Suppose, as a marketing manager of AllElectronics, you would like to determine which items are
frequently purchased together within the same transactions.

An example of such a rule, mined from the AllElectronics transactional database, is

buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]

• where X is a variable representing a customer.

• A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance
that she will buy software as well.

• A 1% support means that 1% of all of the transactions under analysis showed that computer and
software were purchased together.

• This association rule involves a single attribute or predicate (i.e., buys) that repeats.

• Association rules that contain a single predicate are referred to as single-dimensional association
rules.

A second example:

age(X, “20..29”) ∧ income(X, “20K..29K”) ⇒ buys(X, “CD player”) [support = 2%, confidence = 60%]

• Of all customers under study, 2% are 20-29 years old with an income of 20K-29K and have purchased a
CD player.

• There is a 60% probability that a customer in this age and income group will purchase a CD player.

• Note that this is an association between more than one attribute, or predicate (i.e., age, income, and buys).

• The above rule can therefore be referred to as a multidimensional association rule.

• Typically, association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold
and a minimum confidence threshold.

• Additional analysis can be performed to uncover interesting statistical correlations
between associated attribute-value pairs.
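
As a worked illustration of the two measures, for a rule A ⇒ B the support is the fraction of all transactions that contain both A and B, and the confidence is support(A and B) divided by support(A). The sketch below computes both on a small invented set of baskets; the function names and data are assumptions for the example, and the figures it produces are unrelated to the 1% and 50% quoted above.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent).

    Assumes the antecedent occurs at least once, otherwise this divides by zero.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical baskets, one set of items per transaction.
baskets = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"software"},
]

s = support(baskets, {"computer", "software"})        # 2 of 5 baskets
c = confidence(baskets, {"computer"}, {"software"})   # 2 of the 3 computer baskets

# A rule is kept only if it meets both minimum thresholds.
keep_rule = s >= 0.3 and c >= 0.6
print(s, c, keep_rule)
```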
3) Classification and Prediction
• Classification: finding models (functions) that describe and distinguish classes or concepts.
• E.g., classify countries based on climate, or classify cars based on gas mileage.
• The derived model may be represented in various forms, such as classification (IF-THEN) rules,
decision trees, mathematical formulae, or neural networks.

• A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch
represents an outcome of the test, and tree leaves represent classes or class distributions.

• A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted
connections between the units.

• There are other methods for constructing classification models, such as naïve Bayesian classification,
support vector machines, and k-nearest neighbor classification.

• Example: suppose, as sales manager of AllElectronics, you would like to classify a large set of items in the store, based
on three kinds of responses to a sales campaign: good response, mild response, and no response.
The model for these classes would be derived from descriptive features of the items, such as price, brand, place made,
type, and category.

• Prediction: predict some unknown or missing numerical values.

• Example: you would like to predict the amount of revenue that each item will generate
during an upcoming sale at AllElectronics, based on previous sales data.
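
One of the methods named above, k-nearest neighbor classification, is simple enough to sketch in a few lines. The training items, the two numeric features (price and a made-up brand tier) and the function name knn_classify are all assumptions for illustration; a real model would use more features and would scale them so that price does not dominate the distance.

```python
import math
from collections import Counter

# Hypothetical training items: (price, brand_tier) -> response to a sales campaign.
training = [
    ((100.0, 1.0), "good response"),
    ((120.0, 1.0), "good response"),
    ((500.0, 3.0), "mild response"),
    ((550.0, 3.0), "mild response"),
    ((900.0, 5.0), "no response"),
    ((950.0, 5.0), "no response"),
]


def knn_classify(training, point, k=3):
    """Label a new item by majority vote among its k closest training items."""
    nearest = sorted(training, key=lambda example: math.dist(example[0], point))[:k]
    votes = Counter(label for _features, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_classify(training, (110.0, 1.0)))
print(knn_classify(training, (920.0, 5.0)))
```

The class labels here are categorical, which is what separates classification from prediction of a numerical value such as revenue.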
4) Cluster analysis
• Clustering analyzes data objects without consulting a known class label.

• Clustering is based on the principle of maximizing the intra-class similarity and minimizing the
interclass similarity.

[Figure: A 2-D plot of customer data with respect to customer locations in a city, showing three data clusters. Each
cluster “center” is marked with a “+”.]
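
A plot like the one described can be produced by many clustering methods. As a hedged sketch, the code below uses k-means, which the slides do not name; the customer coordinates and the starting centers are invented, and the starting centers are fixed by hand so the result is repeatable.

```python
import math


def kmeans(points, centers, iterations=10):
    """Basic k-means: assign each point to its nearest center, then recompute centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # New center = mean of the cluster; an empty cluster keeps its old center.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical customer locations in a city (x, y).
points = [
    (1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
    (8.0, 8.0), (9.0, 8.0), (8.0, 9.0),
    (1.0, 9.0), (2.0, 9.0), (1.0, 8.0),
]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)])

for center, cluster in zip(centers, clusters):
    print(center, cluster)
```

Note that no class labels are supplied anywhere: the groups emerge only from the distances between points.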
5) Outlier analysis
• Outlier: a data object that does not comply with the general behavior of the
data.

• It can be considered as noise or an exception, but it is quite useful in fraud
detection and rare events analysis.

• The analysis of outlier data is referred to as outlier mining.

• Example: outlier analysis may uncover fraudulent usage of credit cards by
detecting purchases of extremely large amounts for a given account number in
comparison to regular charges incurred by the same account.
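
The credit card example can be sketched with a very simple rule: flag any charge that is many times larger than the account's typical (median) charge. The charge amounts, the factor of 5 and the function name flag_outliers are assumptions chosen for illustration, not a recommended fraud detection method.

```python
from statistics import median


def flag_outliers(charges, factor=5.0):
    """Return charges that exceed factor times the account's median charge."""
    typical = median(charges)
    return [c for c in charges if c > factor * typical]


# Hypothetical charges for one account number.
charges = [42.0, 18.5, 60.0, 35.0, 27.0, 4800.0, 51.0]
suspicious = flag_outliers(charges)
print(suspicious)
```

The median is used instead of the mean so that the extreme charge itself does not inflate the notion of a typical charge.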
6) Evolution Analysis
• Data evolution analysis describes and models regularities for objects whose behavior
changes over time.

• Evolution analysis includes time-series data analysis, sequence or periodicity pattern
matching, and similarity-based data analysis.

• Example: stock market share investment.

• A data mining study of stock exchange data may identify stock evolution regularities for
overall stocks and for the stocks of particular companies.

• Such regularities may help predict future trends in stock market prices, contributing to your
decision making regarding stock investments.
Are All of the Patterns Interesting?
• A data mining system may generate thousands of patterns, but not all of them are interesting.

• Only a small fraction of the patterns potentially generated would actually be of interest
to any given user.

• A pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test
data with some degree of certainty, (3) potentially useful, and (4) novel.

• Objective measures of pattern interestingness exist, based on the statistics and structures
of patterns, e.g., support, confidence, etc.

• The question “Can a data mining system generate all of the interesting patterns?” refers
to the completeness of a data mining algorithm.

• Searching for only the interesting patterns is an optimization problem. One approach is to first generate all the
patterns and then filter out the uninteresting ones.
Classification of Data Mining Systems

[Figure: Data mining as a confluence of multiple disciplines: database technology, statistics, machine learning, visualization, information science, and other disciplines.]

• Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database
systems, statistics, machine learning, visualization, and information science.

• Depending on the kinds of data to be mined or on the given data mining application, the data mining
system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition,
image analysis, signal processing, computer graphics, Web technology, economics, business,
bioinformatics, or psychology.

1) Classification according to the kinds of databases mined:

• A data mining system can be classified according to the kinds of databases mined.

• If classifying according to data models, we may have a relational, transactional, object-relational, or data
warehouse mining system.

• If classifying according to the special types of data handled, we may have a spatial, time-series, text,
stream data, or multimedia data mining system, or a World Wide Web mining system.
2) Classification according to the kinds of knowledge mined:

• Data mining systems can be categorized according to the kinds of knowledge they mine, that is, based
on data mining functionalities, such as characterization, discrimination, association and correlation
analysis, classification, prediction, clustering, outlier analysis, and evolution analysis.

3) Classification according to the kinds of techniques utilized:

• Techniques utilized include database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, neural network, etc.

4) Classification according to the applications adapted:

• Data mining systems can also be categorized according to the applications they adapt.

• For example, data mining systems may be tailored specifically for finance,
telecommunications, DNA, stock markets, e-mail, and so on.
Data Mining Task Primitives

• Each user will have a data mining task in mind, that is, some form of data analysis that he or she would
like to have performed.

• A data mining task can be specified in the form of a data mining query, which is input to the data
mining system.

• A data mining query is defined in terms of data mining task primitives.

The data mining primitives are:

1) The set of task-relevant data to be mined: This specifies the portions of the database or the set of data in
which the user is interested. This includes the database attributes or data warehouse dimensions of
interest.

2) The kind of knowledge to be mined:

This specifies the data mining functions to be performed, such as characterization, discrimination,
association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution
analysis.

3) The background knowledge to be used in the discovery process:

This knowledge about the domain to be mined is useful for guiding the knowledge discovery process and for
evaluating the patterns found.

• Concept hierarchies are a popular form of background knowledge, which allow data to be mined at
multiple levels of abstraction.

4) The interestingness measures and thresholds for pattern evaluation:

• They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns.

For example, interestingness measures for association rules include support and confidence.
Rules whose support and confidence values are below user-specified thresholds are
considered uninteresting.

5) The expected representation for visualizing the discovered patterns:

This refers to the form in which discovered patterns are to be displayed, which may include rules, tables,
charts, graphs, decision trees, and cubes.
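
The five primitives can be pictured as the fields of a query object. The sketch below is purely illustrative: the class MiningQuery, its field names and the example values are hypothetical and do not correspond to any real data mining query language.

```python
from dataclasses import dataclass, field


@dataclass
class MiningQuery:
    """Hypothetical container for the five data mining task primitives."""
    relevant_attributes: list                               # 1) task-relevant data
    knowledge_kind: str                                     # 2) kind of knowledge to be mined
    concept_hierarchy: dict = field(default_factory=dict)   # 3) background knowledge
    min_support: float = 0.0                                # 4) interestingness thresholds
    min_confidence: float = 0.0
    presentation: str = "rules"                             # 5) representation of results

    def accepts(self, support, confidence):
        """A rule below either threshold is considered uninteresting."""
        return support >= self.min_support and confidence >= self.min_confidence


query = MiningQuery(
    relevant_attributes=["age", "income", "item"],
    knowledge_kind="association",
    concept_hierarchy={"laptop": "computer", "desktop": "computer"},
    min_support=0.01,
    min_confidence=0.5,
    presentation="table",
)

print(query.accepts(support=0.02, confidence=0.6))
print(query.accepts(support=0.005, confidence=0.9))
```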
Integration of a Data Mining System with a Database or Data Warehouse System

• Possible integration schemes include no coupling, loose coupling, semitight coupling, and
tight coupling.

1) No coupling:

• No coupling means that a DM system will not utilize any function
of a DB or DW system. It may fetch data from a particular source (such as a file
system), process data using some data mining algorithms, and then store the mining results
in another file.

• Without using a DB/DW system, a DM system may spend a substantial amount of time
finding, collecting, cleaning, and transforming data.

• Without any coupling of such systems, a DM system will need to use other tools to
extract data, making it difficult to integrate such a system into an information processing
environment. Thus, no coupling represents a poor design.
2) Loose coupling:

• Loose coupling means that a DM system will use some facilities of a DB or DW system,
fetching data from a data repository, performing data mining, and then storing the
mining results either in a file or in a database or data warehouse.

• Loose coupling is better than no coupling.

• However, it is difficult for loose coupling to achieve high scalability and good performance with large
data sets.
3) Semitight coupling:

• Semitight coupling means that efficient implementations of a few essential data mining primitives can be
provided in the DB/DW system.

• These primitives can include sorting, indexing, aggregation, histogram analysis,
multiway join, and precomputation of some essential statistical measures, such as sum,
count, max, min, standard deviation, and so on.

• Moreover, some frequently used intermediate mining results can be precomputed
and stored in the DB/DW system.

• Semitight coupling will enhance the performance of a DM system.


4) Tight coupling:

• Tight coupling means that a DM system is smoothly integrated into the DB/DW
system.

• Data mining queries and functions are optimized based on mining query analysis, data
structures, indexing schemes, and query processing methods of a DB or DW system.

• With further technology advances, DM, DB, and DW systems will evolve and integrate
together as one information system with multiple functionalities.

• Tight coupling facilitates efficient implementations of data mining functions, high system
performance, and an integrated information processing environment.
Major Issues in Data Mining

The major issues in data mining are:

1) Mining methodology and user interaction issues
2) Performance issues
3) Issues relating to the diversity of data types

• Mining methodology and user interaction
  • Mining different kinds of knowledge in databases
  • Interactive mining of knowledge at multiple levels of abstraction
  • Incorporation of background knowledge
  • Data mining query languages and ad-hoc data mining
  • Expression and visualization of data mining results
  • Handling noise and incomplete data
  • Pattern evaluation: the interestingness problem

• Performance and scalability
  • Efficiency and scalability of data mining algorithms
  • Parallel, distributed and incremental mining methods

• Issues relating to the diversity of data types
  • Handling relational and complex types of data
  • Mining information from heterogeneous databases and global information
systems (WWW)
