Unit 1 DMDW
• Data mining: core of the knowledge discovery process
Figure: The knowledge discovery process (stages shown: Databases, Data Cleaning, Data Integration, Task-relevant Data, Data Mining, Pattern Evaluation)
Applications
• Banking: loan/credit card approval
• predict good customers based on old customers
• Customer relationship management:
• identify those who are likely to leave for a competitor.
• Targeted marketing:
• identify likely responders to promotions
• Fraud detection: telecommunications, financial transactions
• identify fraudulent events in an online stream of events
• Manufacturing and production:
• automatically adjust knobs when process parameters change
Applications (continued)
• Medicine: disease outcome, effectiveness of treatments
• analyze patient disease history: find relationship between diseases
• Molecular/Pharmaceutical: identify new drugs
• Scientific data analysis:
• identify new galaxies by searching for sub clusters
• Web site/store design and promotion:
• find affinity of visitor to pages and modify layout
Application Areas
Industry                  Application
Finance                   Credit card analysis
Insurance                 Claims, fraud analysis
Telecommunication         Call record analysis
Transport                 Logistics management
Consumer goods            Promotion analysis
Data service providers    Value-added data
Utilities                 Power usage analysis
What motivated data mining? Why is it important?
There is a huge amount of data available in the Information Industry. This data is of no use
until it is converted into useful information. It is necessary to analyse this huge amount of data and
extract useful information from it.
The information and knowledge gained can be used for applications ranging from
business management, production control, and market analysis, to engineering design
and science exploration.
What Is Data Mining?
Data mining refers to extracting or “mining” knowledge from large amounts of data.
Figure 1.4 Data mining as a step in the process of knowledge discovery.
(Stages shown: Databases, Data Cleaning, Data Integration, Data Warehouse, Data Selection, Data Preprocessing, Task-relevant Data, Data Mining, Pattern Evaluation)
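Knowledge discovery as a process is depicted in Figure 1.4 and consists of an iterative
sequence of steps: data cleaning, data integration, data selection, data transformation,
data mining, pattern evaluation, and knowledge presentation.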
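A typical data mining system has the following major components:
1) Database, data warehouse, or other information repository:
This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories.
Data cleaning and data integration techniques may be performed on the data.
2) Database or data warehouse server:
The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining
request.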
3) Knowledge base:
This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such
knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.
Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also
be included.
Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g.,
describing data from multiple heterogeneous sources).
4) Data mining engine:
This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as
characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and
evolution analysis.
5) Pattern evaluation module:
This component typically employs interestingness measures and interacts with the data mining modules so as to focus the
search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns.
Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the
implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation
of pattern interestingness as deep as possible into the mining process so as to confine the search to only the interesting
patterns.
6) User interface:
This module communicates between users and the data mining system, allowing the user to interact
with the system by specifying a data mining query or task, providing information to help focus the search, and
performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the
user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the
patterns in different forms.
Figure: Architecture of a typical data mining system (layers shown: graphical user interface, pattern evaluation, databases and data warehouse)
Data Mining: On What Kind of Data?
Relational databases
Data warehouses
Transactional databases
Advanced Data and Information Systems and Advanced Applications
a) Object-oriented and object-relational databases
b) Spatial databases
c) Time-series data and temporal data
d) Text databases and multimedia databases
e) Heterogeneous and legacy databases
f) WWW
1) Relational Databases
When data mining is applied to relational databases, we can go further by searching for trends or data patterns.
For example, data mining systems can analyze customer data to predict the credit risk of new customers based on their
income, age, and previous credit information. Data mining systems may also detect deviations, such as items whose sales
are far from those expected in comparison with the previous year.
2) Data Warehouses
Suppose that AllElectronics is a successful international company, with branches around the world. Each
branch has its own set of databases.
The president of AllElectronics has asked you to provide an analysis of the company’s sales per item type
per branch for the third quarter.
This is a difficult task, particularly since the relevant data are spread out over several databases,
physically located at numerous sites.
3) Transactional Databases
A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the
transaction (such as items purchased in a store).
A regular data retrieval system can answer queries such as “Show me all the items purchased by Sandy Smith” or
“How many transactions include item number I3?”
Data mining systems for transactional data can go further and find which items sell well together by identifying
frequent itemsets, that is, sets of items that are frequently sold together.
For example, given the knowledge that printers are commonly purchased together with computers,
you could offer an expensive model of printer at a discount to customers buying selected computers,
in the hope of selling more of the expensive printers.
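The idea of frequent itemsets can be sketched in a few lines of Python. This is a minimal brute-force illustration, not an efficient algorithm such as Apriori; the function name `frequent_itemsets`, the sample transactions, and the 50% threshold are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, size=2):
    """Return itemsets of the given size whose support is at least min_support."""
    counts = Counter()
    for items in transactions:
        # Count each distinct combination at most once per transaction.
        for itemset in combinations(sorted(set(items)), size):
            counts[itemset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


# Hypothetical transactions (the trans ID is omitted for brevity).
transactions = [
    ["computer", "printer", "software"],
    ["computer", "printer"],
    ["computer", "software"],
    ["printer", "paper"],
]

# Pairs of items that appear together in at least half of the transactions.
frequent_pairs = frequent_itemsets(transactions, min_support=0.5)
```

With this sample data, the pairs (computer, printer) and (computer, software) each appear in two of the four transactions, so both meet the 50% threshold.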
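Object-oriented Databases
Each object has associated with it the following: 1) a set of variables that describe the object (e.g., name, address,
DOB), 2) a set of messages that the object can use to communicate with other objects, and 3) a set of methods, where
each method holds the code to implement a message (e.g., get-photo).
Objects that share a common set of properties can be grouped into a class, and classes support inheritance
(e.g., a sales_person object would inherit all of the variables of its superclass, employee).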
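As a rough analogy, the variables, methods, and inheritance described above map onto classes in an object-oriented programming language. The sketch below uses Python dataclasses; the field values, the `commission` variable, and the string returned by `get_photo` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Employee:
    # Variables that describe the object.
    name: str
    address: str
    dob: str

    def get_photo(self) -> str:
        # Method: the code that runs in response to a "get-photo" message.
        return f"photo-of-{self.name}"


@dataclass
class SalesPerson(Employee):
    # Subclass-specific variable; name, address, and dob are inherited.
    commission: float = 0.0


sp = SalesPerson("A. Example", "12 Main St", "1990-01-01", commission=0.05)
```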
Object-Relational Databases
The object-relational data model inherits the essential concepts of object-oriented databases, where, in
general terms, each entity is considered as an object.
For data mining in object-relational systems, techniques need to be developed for handling complex object
structures, complex data types, class and subclass hierarchies, property inheritance, and methods and
procedures.
Spatial Databases
Spatial databases contain spatial-related information.
Examples include geographic (map) databases, very large-scale integration (VLSI) or computer-aided design databases, and
medical and satellite image databases.
Data mining may uncover patterns describing the characteristics of houses located near a specified
kind of location, such as a park, for instance.
Temporal Databases, Sequence Databases, and Time-Series Databases
A temporal database typically stores relational data that include time-related attributes.
A sequence database stores sequences of ordered events, with or without a concrete notion of time.
Examples include customer shopping sequences, Web click streams, and biological sequences.
A time-series database stores sequences of values or events obtained over repeated measurements of
time (e.g., hourly, daily, weekly).
Examples include data collected from the stock exchange, inventory control, and the observation of
natural phenomena (like temperature and wind).
Data mining techniques can be used to find the characteristics of object evolution, or
the trend of changes for objects in the database. Such information can be useful in decision
making and strategy planning.
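One simple way to expose the trend in a time series is a moving average, which smooths out short-term fluctuations. The sketch below is only an illustration of the idea; the function name `moving_average` and the temperature readings are assumptions, not data from the notes.

```python
def moving_average(values, window):
    """Simple moving average: one output value per full window."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical daily temperature readings.
daily_temperature = [20, 22, 21, 25, 27, 26, 30]
trend = moving_average(daily_temperature, window=3)

# A rising smoothed series suggests an upward trend.
is_rising = all(a < b for a, b in zip(trend, trend[1:]))
```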
Text Databases and Multimedia Databases
Text databases contain word descriptions for objects.
These word descriptions are usually not simple keywords but rather long sentences or paragraphs, such as
product specifications, error or bug reports, warning messages, summary reports, notes, or other
documents.
By mining text data one may uncover general and concise descriptions of the text documents, keyword
or content associations, as well as the clustering behavior of text objects.
Multimedia databases store image, audio, and video data. They are used in
applications such as picture content-based retrieval, voice-mail systems, and
video-on-demand systems.
For multimedia data mining, storage and search techniques need to be integrated with
standard data mining methods. Promising approaches include the construction of
multimedia data cubes, the extraction of multiple features from multimedia data, and
similarity-based pattern matching.
World Wide Web
The World Wide Web and its associated distributed information services, such as Yahoo!, Google, America
Online, and AltaVista, provide rich, worldwide, on-line information services, where data objects are linked
together to facilitate interactive access.
Capturing user access patterns in such distributed information environments is called Web usage
mining (or Weblog mining).
Data mining can often provide more help here than Web search services alone.
For example, authoritative Web page analysis based on linkages among Web pages can help rank Web pages
based on their importance, influence, and topics.
Automated Web page clustering and classification help group and arrange Web pages in a multidimensional
manner based on their contents.
Data Mining Models and Tasks
Data Mining Functionalities—What Kinds of Patterns Can Be Mined?
Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks.
In general, data mining tasks can be classified into two categories: descriptive and predictive.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.
1) Characterization and discrimination
Data characterization derives a description of a class by summarizing the data of the class under study.
Data discrimination compares the general features of a target class with those of one or more contrasting classes.
For example, a data mining system should be able to compare two groups of AllElectronics customers, such as those
who shop for computer products regularly (more than two times a month) versus those who rarely shop for
such products (i.e., fewer than three times a year).
The resulting description provides a general comparative profile of the customers, such as 80% of the
customers who frequently purchase computer products are between 20 and 40 years old and have a
university education, whereas 60% of the customers who infrequently buy such products are either seniors or
youths, and have no university degree.
2) Association and correlations
Association analysis: single-dimensional vs. multidimensional association
Suppose, as a marketing manager of AllElectronics, you would like to determine which items are
frequently purchased together within the same transactions.
An example of such a rule is buys(X, “computer”) → buys(X, “software”) [support = 1%], where X is a
variable representing a customer.
A 1% support means that 1% of all of the transactions under analysis showed that computer and
software were purchased together.
This association rule involves a single attribute or predicate (i.e., buys) that repeats.
Association rules that contain a single predicate are referred to as single-dimensional association
rules.
age(X, “20..29”) ^ income(X, “20..29K”) → buys(X, “CD player”) [support = 2%, confidence = 60%]
Of all customers under study, 2% are 20-29 years old with an income of 20K-29K and have purchased a
CD player.
There is a 60% probability that a customer in this age and income group will purchase a CD player.
Note that this is an association between more than one attribute, or predicate (i.e., age, income, and buys);
such a rule is referred to as a multidimensional association rule.
Typically, association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold
and a minimum confidence threshold.
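Support and confidence, as used above, can be computed directly from a set of records. The following is a minimal sketch under assumed names: `support_and_confidence` and `is_interesting` are hypothetical helpers, and the four sample records and the thresholds are made up for illustration.

```python
def support_and_confidence(records, antecedent, consequent):
    """Compute support and confidence of the rule antecedent -> consequent.

    records is a list of sets; antecedent and consequent are sets of items.
    """
    n = len(records)
    both = sum(1 for r in records if antecedent <= r and consequent <= r)
    ante = sum(1 for r in records if antecedent <= r)
    support = both / n                     # fraction of all records containing both sides
    confidence = both / ante if ante else 0.0  # fraction of antecedent records that also contain the consequent
    return support, confidence


def is_interesting(support, confidence, min_support, min_confidence):
    # A rule is kept only if it satisfies both thresholds.
    return support >= min_support and confidence >= min_confidence


# Hypothetical transactions, each represented as a set of purchased items.
records = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"paper"},
]

support, confidence = support_and_confidence(records, {"computer"}, {"software"})
keep_rule = is_interesting(support, confidence, min_support=0.2, min_confidence=0.6)
```

Here one of four records contains both items (support 25%), and one of the two computer purchases also includes software (confidence 50%), so the rule is discarded under a 60% minimum confidence.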
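3) Classification and prediction
Classification finds a model that describes and distinguishes data classes, so that the model can be used to predict
the class of objects whose class label is unknown.
A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch
represents an outcome of the test, and tree leaves represent classes or class distributions.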
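The structure of a decision tree can be illustrated with a nested dictionary and a short traversal function. The tree below is written by hand purely for illustration and is not learned from data; the attributes (`price`, `brand`), their values, and the class labels attached to each leaf are assumptions.

```python
# Internal nodes test an attribute; each branch is one outcome of the test;
# leaves hold a class label.
tree = {
    "attribute": "price",
    "branches": {
        "low": {"label": "good response"},
        "high": {
            "attribute": "brand",
            "branches": {
                "well-known": {"label": "mild response"},
                "unknown": {"label": "no response"},
            },
        },
    },
}


def classify(node, item):
    """Follow the branch matching the item's attribute value until a leaf is reached."""
    while "label" not in node:
        value = item[node["attribute"]]
        node = node["branches"][value]
    return node["label"]


label = classify(tree, {"price": "high", "brand": "well-known"})
```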
A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted
connections between the units.
There are many other methods for constructing classification models, such as naïve Bayesian classification,
support vector machines, and k-nearest neighbor classification.
Suppose, as sales manager of AllElectronics, you would like to classify a large set of items in the store, based
on three kinds of responses to a sales campaign: good response, mild response, and no response.
You would like to derive a model for each of these three classes based on the descriptive features of the items,
such as price, brand, place made, type, and category.
Prediction: Predict some unknown or missing numerical values.
Example: you would like to predict the amount of revenue that each item will generate
during an upcoming sale at AllElectronics, based on previous sales data.
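Numeric prediction is often done with regression. As one possible illustration (the notes do not name a specific method), the sketch below fits a straight line by ordinary least squares and uses it to predict a new value. The function name `fit_line`, the discount percentages, and the revenue figures are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical previous sales: discount (%) offered vs. revenue generated.
discounts = [0, 10, 20, 30]
revenue = [100.0, 120.0, 140.0, 160.0]

slope, intercept = fit_line(discounts, revenue)

# Predicted revenue for an upcoming sale with a 40% discount.
predicted_revenue = slope * 40 + intercept
```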
4) Cluster analysis
Clustering analyzes data objects without consulting a known class label.
Objects are clustered based on the principle of maximizing the intraclass similarity and minimizing the
interclass similarity.
Figure: A 2-D plot of customer data with respect to customer locations in a city, showing three data clusters. Each
cluster “center” is marked with a “+”.
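A common way to find such clusters and their centers is the k-means method, which the notes do not name but which fits the description of grouping nearby points around a center. The sketch below is a minimal version with fixed starting centers; the customer locations and the initial centers are made-up values.

```python
import math


def kmeans(points, centers, iterations=10):
    """Minimal k-means: assign points to the nearest center, then recompute centers."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # New center = mean of the cluster's points; keep the old center if the cluster is empty.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical customer locations forming three groups.
points = [
    (1, 1), (1, 2), (2, 1),
    (8, 8), (9, 8), (8, 9),
    (1, 8), (2, 9), (1, 9),
]
initial_centers = [(1, 1), (8, 8), (1, 8)]

centers, clusters = kmeans(points, initial_centers)
```

Each returned center plays the role of the “+” marker in the figure: it is the mean location of the customers in its cluster.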
5) Outlier analysis
Outlier: a data object that does not comply with the general behavior of the
data.
An outlier is often discarded as noise or an exception, but it can be quite useful in applications such as
fraud detection and rare-event analysis.
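One simple statistical test for outliers flags values that lie far from the mean, measured in standard deviations (a z-score). This is only one of several possible approaches; the function name `zscore_outliers`, the threshold of 2.0, and the transaction amounts are assumptions for the example.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical credit card charges; one is far larger than the rest.
amounts = [20, 22, 19, 21, 23, 20, 500]
suspicious = zscore_outliers(amounts)
```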
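6) Evolution analysis
Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
A data mining study of stock exchange data may identify stock evolution regularities for
overall stocks and for the stocks of particular companies.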
Such regularities may help predict future trends in stock market prices, contributing to your
decision making regarding stock investments.
Are All of the Patterns Interesting?
A data mining system may generate thousands of patterns, but not all of them are interesting:
only a small fraction of the patterns potentially generated would actually be of interest
to any given user.
A pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test
data with some degree of certainty, (3) potentially useful, and (4) novel.
Searching for only the interesting patterns is an optimization problem. A simple approach is to first generate
all the patterns and then filter out the uninteresting ones.
Classification of Data Mining Systems
Figure: Data mining draws on multiple disciplines: database technology, statistics, machine learning, visualization,
information science, and other disciplines.
Depending on the kinds of data to be mined or on the given data mining application, the data mining
system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition,
image analysis, signal processing, computer graphics, Web technology, economics, business,
bioinformatics, or psychology.
1) Classification according to the kinds of databases mined:
A data mining system can be classified according to the kinds of databases mined.
If classifying according to data models, we may have a relational, transactional, object-relational, or data
warehouse mining system.
If classifying according to the special types of data handled, we may have a spatial, time-series, text,
stream data, or multimedia data mining system, or a World Wide Web mining system.
2) Classification according to the kinds of knowledge mined:
Data mining systems can be categorized according to the kinds of knowledge they mine, that is, based
on Data mining functionalities, such as characterization, discrimination, association and correlation
analysis, classification, prediction, clustering, outlier analysis, and evolution analysis.
Data mining systems can also be categorized according to the applications they adapt.
For example, data mining systems may be tailored specifically for finance,
telecommunications, DNA, stock markets, e-mail, and so on.
Data Mining Task Primitives
Each user will have a data mining task in mind, that is, some form of data analysis that he or she would
like to have performed.
A data mining task can be specified in the form of a data mining query, which is input to the data
mining system.
1) The set of task-relevant data to be mined: This specifies the portions of the database or the set of data in
which the user is interested. This includes the database attributes or data warehouse dimensions of
interest.
2) The kind of knowledge to be mined:
This specifies the data mining functions to be performed, such as characterization, discrimination,
association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution
analysis.
3) The background knowledge to be used in the discovery process:
This knowledge about the domain to be mined is useful for guiding the knowledge discovery process and for
evaluating the patterns found.
Concept hierarchies are a popular form of background knowledge, which allow data to be mined at
multiple levels of abstraction.
4) The interestingness measures and thresholds for pattern evaluation:
They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns.
For example, interestingness measures for association rules include support and confidence.
Rules whose support and confidence values are below user-specified thresholds are
considered uninteresting.
5) The expected representation for visualizing the discovered patterns:
This refers to the form in which discovered patterns are to be displayed, which may include rules, tables,
charts, graphs, decision trees, and cubes.
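The concept hierarchies mentioned under background knowledge can be pictured as a mapping from a lower level of abstraction to a higher one. The sketch below rolls sales figures up from city level to country level; the hierarchy, the function name `roll_up`, and the sales numbers are hypothetical.

```python
from collections import Counter

# Hypothetical concept hierarchy for a location attribute: city -> country.
CITY_TO_COUNTRY = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Chicago": "USA",
    "New York": "USA",
}


def roll_up(sales_by_city, hierarchy):
    """Aggregate city-level sales to the next level of the hierarchy."""
    totals = Counter()
    for city, amount in sales_by_city.items():
        totals[hierarchy[city]] += amount
    return dict(totals)


sales_by_city = {"Vancouver": 100, "Toronto": 150, "Chicago": 200, "New York": 50}
sales_by_country = roll_up(sales_by_city, CITY_TO_COUNTRY)
```

Mining the country-level totals instead of the city-level ones is an example of mining the same data at a higher level of abstraction.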
Integration of a Data Mining System with a Database or Data Warehouse System
Possible schemes for integrating a data mining (DM) system with a database (DB) or data warehouse (DW) system
include no coupling, loose coupling, semitight coupling, and tight coupling.
1) No coupling: No coupling means that a DM system will not utilize any function
of a DB or DW system. It may fetch data from a particular source (such as a file
system), process data using some data mining algorithms, and then store the mining results
in another file.
Without using a DB/DW system, a DM system may spend a substantial amount of time
finding, collecting, cleaning, and transforming data.
Without any coupling of such systems, a DM system will need to use other tools to
extract data, making it difficult to integrate such a system into an information processing
environment. Thus, no coupling represents a poor design.
2) Loose coupling:
Loose coupling means that a DM system will use some facilities of a DB or DW system,
fetching data from a data repository, performing data mining, and then storing the
mining results either in a file or in a database or data warehouse.
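Because mining does not exploit the data structures and query optimization methods provided by the DB or DW
system, it is difficult for loose coupling to achieve high scalability and good performance with large
data sets.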
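3) Semitight coupling:
Semitight coupling means that, besides linking a DM system to a DB/DW system, efficient implementations of a few
essential data mining primitives are provided in the DB/DW system. These primitives can include sorting, indexing,
aggregation, histogram analysis, multiway join, and precomputation of some essential statistical measures.
4) Tight coupling:
Tight coupling means that a DM system is smoothly integrated into the DB/DW
system.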
Data mining queries and functions are optimized based on mining query analysis, data
structures, indexing schemes, and query processing methods of a DB or DW system.
With further technology advances, DM, DB, and DW systems will evolve and integrate
together as one information system with multiple functionalities.
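The difference between loose coupling and a tighter integration can be sketched with Python's built-in sqlite3 module. This is only an analogy under stated assumptions: the `sales` and `item_counts` tables and their contents are hypothetical, and the GROUP BY query merely illustrates pushing an aggregation into the database rather than implementing a real semitight or tightly coupled system.

```python
import sqlite3
from collections import Counter

# Hypothetical in-memory database standing in for a DB/DW system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (trans_id INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, "computer"), (1, "printer"), (2, "computer"), (3, "paper")],
)

# Loose coupling: fetch data through the database, mine it in application memory.
rows = conn.execute("SELECT item FROM sales").fetchall()
loose_counts = dict(Counter(item for (item,) in rows))

# The mining results are then stored back in the database.
conn.execute("CREATE TABLE item_counts (item TEXT, n INTEGER)")
conn.executemany("INSERT INTO item_counts VALUES (?, ?)", list(loose_counts.items()))
stored_counts = dict(conn.execute("SELECT item, n FROM item_counts").fetchall())

# Tighter integration (analogy): let the database perform the aggregation itself.
pushed_counts = dict(
    conn.execute("SELECT item, COUNT(*) FROM sales GROUP BY item").fetchall()
)
```

Both approaches give the same counts here, but in the second the database can apply its own indexing and query processing, which is the scalability argument the notes make against loose coupling.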