Unit-1

Contents

1. Introduction
   Why data mining?
   What is data mining?
2. Integration of a Data Mining System with a Data Warehouse
3. Data Mining Primitives
4. Data & Data Types
   4.1 Relational Databases
   4.2 Data Warehouses
   4.3 Transactional Databases
   4.4 Advanced Databases – Object-Relational Databases
   4.5 Application Databases
5. Data mining functionalities
   5.1 Class/Concept Description: Characterization and Discrimination
   5.2 Mining Frequent Patterns, Associations, and Correlations
       Frequent patterns
       Association analysis
   5.3 Classification and Regression (for Predictive Analysis)
   5.4 Cluster Analysis
   5.5 Outlier Analysis
6. Interestingness of Patterns
7. A Classification of Data Mining Systems
   7.1 Classification according to the kinds of databases mined
   7.2 Classification according to the kinds of knowledge mined
   7.3 Classification based on level of abstraction
   7.4 Classification according to the kinds of techniques utilized
   7.5 Classification based on data regularities
   7.6 Classification according to the application adapted
8. Major Issues in Data Mining
   8.1 Mining Methodology
   8.2 User Interaction
   8.3 Efficiency and Scalability
   8.4 Diversity of Data Types
   8.5 Data Mining and Society
9. Data Preprocessing
   9.1 Data Cleaning
       9.1.1 Missing values
       9.1.2 Noisy data
   9.2 Data Integration
       9.2.1 Entity Identification Problem
       9.2.2 Redundancy and Correlation Analysis
       9.2.3 Tuple Duplication
       9.2.4 Data Value Conflict Detection and Resolution
   9.3 Data Reduction
       9.3.1 Wavelet transforms
       9.3.2 Principal components analysis
       9.3.3 Attribute Subset Selection
       9.3.4 Regression and Log-Linear Models: Parametric Data Reduction
       9.3.5 Histograms
       9.3.6 Clustering
       9.3.7 Sampling
       9.3.8 Data Cube Aggregation
   9.4 Data Transformation and Data Discretization
       9.4.1 Normalization
       9.4.2 Discretization
UNIT I DATA MINING
1. Introduction
Why data mining?
– Data collection and data availability have improved, e.g., automated data collection tools, database systems, the Web, and an increasingly computerized society.
– Major sources of abundant data have evolved.
– Data mining, the automated analysis of massive data sets, has become necessary to survive business competition.

What is data mining?
Data mining (knowledge discovery from data) is the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data. Alternative names include knowledge discovery in databases (KDD), knowledge extraction, and data/pattern analysis.

A view often adopted in statistics and machine learning (and in health care and medical data mining) sees the data mining process as:
– Preprocessing of the data (including feature extraction and dimension reduction)
– Classification and/or clustering
– Post-processing for presentation of the results
[Figure: the knowledge discovery process: data from databases is cleaned and integrated, task-relevant data is selected, and data mining is applied to extract patterns.]
2. Integration of a Data Mining System with a Data Warehouse
A data mining system may or may not be integrated with a database or data warehouse system. The possible integration schemes are as follows:
– No Coupling
– Loose Coupling
– Semi−tight Coupling
– Tight coupling
No Coupling − In this scheme, the data mining system does not utilize any of the database or data
warehouse functions. It fetches the data from a particular source and processes that data using some
data mining algorithms. The data mining result is stored in another file.
Loose Coupling − In this scheme, the data mining system may use some of the functions of a database or data warehouse system. It fetches the data from the data repository managed by these systems and performs data mining on that data. It then stores the mining result either in a file or in a designated place in a database or data warehouse.
Semi−tight Coupling − In this scheme, the data mining system is linked with a database or a data
warehouse system and in addition to that, efficient implementations of a few data mining primitives
can be provided in the database.
Tight coupling − In this coupling scheme, the data mining system is smoothly integrated into the
database or data warehouse system. The data mining subsystem is treated as one functional component
of an information system.
3. Data Mining Primitives
Each user will have a data mining task in mind, that is, some form of data analysis that he or she would
like to have performed. A data mining query is defined in terms of data mining task primitives. These
primitives allow the user to interactively communicate with the data mining system during discovery in
order to direct the mining process, or examine the findings from different angles or depths.
1. The set of task-relevant data to be mined:
This specifies the portions of the database or the set of data in which the user is interested. This includes
the database attributes or data warehouse dimensions of interest (referred to as the relevant attributes or
dimensions).
2. The kind of knowledge to be mined:
This specifies the data mining functions to be performed, such as characterization, discrimination,
association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.
3. The background knowledge to be used in the discovery process:
This knowledge about the domain to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found. Concept hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels of abstraction. User beliefs regarding relationships in the data are another form of background knowledge.
4. The interestingness measures and thresholds for pattern evaluation:
Different kinds of knowledge may have different interestingness measures. For example, interestingness
measures for association rules include support and confidence. Rules whose support and confidence values are
below user-specified thresholds are considered uninteresting.
5. The expected representation for visualizing the discovered patterns:
This refers to the form in which discovered patterns are to be displayed, which may include rules, tables,
charts, graphs, decision trees, and cubes. A data mining query language (DMQL) can be designed to
incorporate these primitives, allowing users to flexibly interact with data mining systems.
4. Data & Data Types
In data mining, data means the information (i.e., the data repositories) that is given to the system.
There are different types of data repositories. They are relational databases, data warehouses,
transactional databases, advanced database systems, flat files, data streams, and the World Wide Web.
Advanced database systems include object-relational databases and specific application-oriented
databases, such as spatial databases, time-series databases, text databases, and multimedia databases.
4.1 Relational Databases:
A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples
(records or rows). Each tuple in a relational table represents an object identified by a unique key and
described by a set of attribute values.
An entity-relationship (ER) data model is used for construction of relational databases. An ER data model
represents the database as a set of entities and their relationships.
4.2 Data Warehouses:
A data warehouse is a repository of information collected from multiple sources, stored under a unified
schema, and that usually resides at a single site.
A data warehouse is usually modeled by a multidimensional database structure (a data cube), where each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure, such as count or sales amount.
The actual physical structure of a data warehouse may be a relational data store or a multidimensional
data cube.
4.3 Transactional Databases
A transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique
transaction identity number (trans ID) and a list of the items making up the transaction (such as items
purchased in a store).
4.4 Advanced Databases – Object-Relational Databases
The object-relational data model inherits the essential concepts of object-oriented databases, where each
entity is considered as an object.
Data and code relating to an object are encapsulated into a single unit. Each object has associated with it
the following:
A set of variables that describe the objects. These correspond to attributes in the entity-relationship
and relational models.
A set of messages that the object can use to communicate with other objects, or with the rest of the
database system.
A set of methods, where each method holds the code to implement a message.
Example
an employee class can contain variables like name, address, and birthdate. The method for the message
get_photo(employee) will retrieve and return a photo of the given employee object.
4.5 Application Databases
A temporal database stores relational data that include time-related attributes. These attributes may
involve several timestamps, each having different semantics.
A sequence database stores sequences of ordered events, with or without a concrete notion of time.
Examples include customer shopping sequences, Web click streams, and biological sequences.
A time-series database stores sequences of values or events obtained over repeated measurements of
time (e.g., hourly, daily, weekly). Examples include data collected from the stock exchange, inventory
control, and the observation of natural phenomena (like temperature and wind).
Spatial databases contain spatial-related information. Examples include geographic (map) databases,
very large-scale integration (VLSI) or computed-aided design databases, and medical and satellite
image databases.
A spatial database that stores spatial objects that change with time is called a spatiotemporal
database, from which interesting information can be mined. For example, we may be able to group
the trends of moving objects and identify some strangely moving vehicles, or distinguish a bioterrorist
attack from a normal outbreak of the flu based on the geographic spread of a disease with time.
Text databases are databases that contain word descriptions for objects. Example Web Pages, E-Mail
Messages etc.,
Multimedia databases store image, audio, and video data. They are used in applications such as
picture content-based retrieval, voice-mail systems, video-on-demand systems, the World Wide Web,
and speech-based user interfaces that recognize spoken commands.
A heterogeneous database consists of a set of interconnected, autonomous component databases. The
components communicate in order to exchange information and answer queries. Objects in one
component database may differ greatly from objects in other component databases, making it difficult
to assimilate their semantics into the overall heterogeneous database. Example: application of different
hardware and software.
A legacy database is a group of heterogeneous databases that combines different kinds of data
systems, such as relational or object-oriented databases, hierarchical databases, network databases,
spreadsheets, multimedia databases, or file systems. The heterogeneous databases in a legacy database
may be connected by intra or inter-computer networks.
Data Stream, where data flow in and out of an observation platform (or window) dynamically. Such
data streams have the following unique features: huge or possibly infinite volume, dynamically
changing, flowing in and out in a fixed order, allowing only one or a small number of scans, and
demanding fast (often real-time) response time. Typical examples of data streams include various
kinds of scientific and engineering data, time-series data, and data produced in other dynamic
environments, such as power supply, network traffic, stock exchange, telecommunications, Web click
streams, video surveillance, and weather or environment monitoring.
The World Wide Web and its associated distributed information services, such as Yahoo!, Google,
America Online, and AltaVista, provide rich, worldwide, on-line information services, where data
objects are linked together to facilitate interactive access.
5. Data mining functionalities
Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. In
general, such tasks can be classified into two categories:
Descriptive
Predictive
Descriptive mining tasks characterize properties of the data in a target data set.
Predictive mining tasks perform induction on the current data in order to make predictions.
Data mining functionalities, and the kinds of patterns they can discover, are described below.
5.1 Class/Concept Description: Characterization and Discrimination
Data can be associated with classes or concepts. It can be useful to describe individual classes and
concepts in summarized, concise, and precise terms. Such descriptions of a class or a concept are called
class/concept descriptions. These descriptions can be derived using:
1. data characterization, by summarizing the data of the class under study (often called the target class),
2. data discrimination, by comparison of the target class with one or a set of comparative classes (often
called the contrasting classes),
3. both data characterization and discrimination.
For example, to study the characteristics of software products whose sales increased by 10% in the
last year, the data related to such products can be collected by executing an SQL query.
The output of data characterization can be presented in various forms. Examples include pie charts,
bar charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs.
The resulting descriptions can also be presented as generalized relations or in rule form (called
characteristic rules).
Data discrimination is a comparison of the general features of target class data objects with the general
features of objects from one or a set of contrasting classes.
For example, the user may like to compare the general features of software products whose sales
increased by 10% in the last year with those whose sales decreased by at least 30% during the same
period.
The output of data discrimination can be presented in various forms. Examples include pie charts,
bar charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs.
Discrimination descriptions expressed in rule form are referred to as discriminant rules.
5.2 Mining Frequent Patterns, Associations, and Correlations
Frequent patterns
Frequent patterns are patterns that occur frequently in data. There are many kinds of frequent patterns,
including frequent itemsets, frequent subsequences, and frequent substructures.
A frequent itemset typically refers to a set of items that often appear together in a transactional
data set — for example, milk and bread, which are frequently bought together in grocery stores by many
customers.
A frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC,
followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.
A substructure can refer to different structural forms, such as graphs, trees, or lattices, which may be
combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent)
structured pattern.
Association analysis
Association analysis is the discovery of association rules showing attribute-value conditions that
occur frequently together in a given set of data. Association analysis is widely used for market basket or
transaction data analysis.
Example:-
Suppose that, as a marketing manager at AllElectronics, you want to know which items are frequently
purchased together (i.e., within the same transaction). An example of such a rule is shown below:
buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]
where X is a variable representing a customer.
A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that
she will buy software as well. A 1% support means that 1% of all the transactions under analysis show that
computer and software are purchased together. This association rule involves a single attribute or predicate
(i.e., buys) that repeats. These are referred to as single-dimensional association rules.
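As an illustration, support and confidence for such a rule can be computed directly from transaction counts. The Python sketch below uses a small made-up transaction list, so the resulting percentages differ from the 1%/50% example above.

```python
# Minimal sketch: computing support and confidence for the rule
# buys(computer) => buys(software) from a hypothetical transaction list.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"software", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
computer = sum(1 for t in transactions if "computer" in t)

support = both / n              # fraction of all transactions containing both items
confidence = both / computer    # fraction of computer buyers who also buy software

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```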
5.3 Classification and Regression (for Predictive Analysis)
Classification is the process of finding a model (or function) that describes and distinguishes data
classes or concepts. The model is derived based on the analysis of a set of training data (i.e., data objects
for which the class labels are known). The model is used to predict the class label of objects for which the
class label is unknown. The derived model may be represented in various forms, such as classification (IF-
THEN) rules, decision trees, mathematical formulae, or neural networks.
A decision tree is a flowchart-like tree structure. Each node denotes a test on an attribute value, each branch
represents an outcome of the test, and tree leaves represent classes or class distributions. Decision trees can
be easily converted to classification rules.
A neural network is a collection of linear threshold units that can be trained to distinguish objects of
different classes.
Prediction - It is used to predict missing or unavailable numerical data values rather than class labels.
Example: Regression analysis
5.4 Cluster Analysis
Clustering analyzes data objects without consulting a known class label. The objects are clustered or
grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class
similarity. That is, clusters of objects are formed so that objects within a cluster have high similarity in
comparison to one another, but are very dissimilar to objects in other clusters. Each cluster that is formed can
be viewed as a class of objects, from which rules can be derived.
5.5 Outlier Analysis
A data set may contain objects that do not comply with the general behavior or model of the data. These data objects are outliers.
Many data mining methods discard outliers as noise or exceptions. However, in some applications (e.g.,
fraud detection) the rare events can be more interesting than the more regularly occurring ones. The analysis
of outlier data is referred to as outlier analysis or anomaly mining.
Outliers may be detected using statistical tests that assume a distribution or probability model for the
data, or using distance measures where objects that are remote from any other cluster are considered outliers.
Structure Analysis
– Graph mining is finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (web fragments)
– Social networks: actors (objects, nodes) and relationships (edges). e.g., author networks in CS,
terrorist networks
– Web community discovery, opinion mining, usage mining
6. Interestingness of Patterns
A data mining system has the potential to generate thousands or even millions of patterns, or rules.
However, a pattern is interesting only if it is
1. potentially useful
2. new (previously unknown)
3. easily understood by humans
4. valid on new or test data with some degree of certainty
A pattern is also interesting if it validates a hypothesis that the user sought to confirm. An interesting
pattern represents knowledge. Several objective measures of pattern interestingness exist. These are based on
the structure of discovered patterns and the statistics underlying them. An objective measure for association
rules of the form X =>Y is rule support, representing the percentage of data samples that the given rule satisfies.
Another objective measure for association rules is confidence, which assesses the degree of certainty of the
detected association. It is defined as the conditional probability that a pattern Y is true given that X is true.
Over 20 interestingness measures have been proposed. Transactions that contain neither of the items A nor B are called null transactions. Null-invariance (insensitivity to the number of null transactions) is crucial for correlation analysis. Support and confidence alone are not good enough to indicate correlations; null-invariant measures are better, for example:
– Kulczynski Measure
– IR (Imbalance Ratio): measures the imbalance of two itemsets A and B in rule implications
7. A Classification of Data Mining Systems
Data mining is an interdisciplinary field, the confluence of a set of disciplines including database systems, statistics, machine learning, visualization, and information science.
7.1 Classification according to the kinds of databases mined
Database systems themselves can be classified according to different criteria (such as data models, or the
types of data or applications involved), each of which may require its own data mining technique. Data mining
systems can therefore be classified accordingly.
For instance, if classifying according to data models, we may have a relational, transactional, object-
oriented, object-relational, or data warehouse mining system. If classifying according to the special types of
data handled, we may have spatial, time-series, text, or multimedia data mining system, or a World-Wide
Web mining system. Other system types include heterogeneous data mining systems, and legacy data mining
systems.
7.2 Classification according to the kinds of knowledge mined
Data mining systems can be categorized according to the kinds of knowledge they mine, i.e., based on data
mining functionalities, such as characterization, discrimination, association, classification, clustering,
trend and evolution analysis, deviation analysis, similarity analysis, etc.
A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities.
7.3 Classification based on level of abstraction
Data mining systems can also be distinguished based on the granularity or levels of abstraction of the knowledge mined, including generalized knowledge (at a high level of abstraction),
primitive-level knowledge (at a raw data level), or knowledge at multiple levels (considering several levels of
abstraction). An advanced data mining system should facilitate the discovery of knowledge at multiple levels of
abstraction.
7.4 Classification according to the kinds of techniques utilized
Data mining systems can also be categorized according to the underlying data mining techniques
employed. These techniques can be described according to the degree of user interaction involved (e.g.,
autonomous systems, interactive exploratory systems, query-driven systems), or the methods of data analysis
employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics,
visualization, pattern recognition, neural networks, and so on). A sophisticated data mining system will often
adopt multiple data mining techniques or work out an effective, integrated technique which combines the
merits of a few individual approaches.
7.5 Classification based on data regularities
Data mining systems can also be classified based on whether they mine data regularities (commonly occurring patterns) or data irregularities (such as exceptions or outliers).
7.6 Classification according to the application adapted
Finally, data mining systems can be categorized according to the application they are adapted to, such as finance, telecommunications, DNA analysis, stock markets, e-mail, and so on.
8. Major Issues in Data Mining
Data mining is a vigorously developing research area. The major issues in data mining can be grouped under the following categories:
1. Mining Methodology
2. User Interaction
3. Efficiency and Scalability
4. Diversity of data types
5. Data mining and society
8.1 Mining Methodology
Mining methodologies should consider issues such as data uncertainty, noise, and incompleteness. Some mining methods explore how user-specified measures can be used to assess the interestingness of discovered patterns as well as to guide the discovery process.
8.2 User Interaction
The user plays an important role in the data mining process. Interesting areas of research include how to interact with a data mining system, how to incorporate a user's background knowledge in mining, and how to visualize and comprehend data mining results. The following are the broad areas that come under this topic:
Interactive mining
It is important to build flexible user interfaces and an exploratory mining environment, facilitating the
user’s interaction with the system. A user may like to first sample a set of data, explore general
characteristics of the data, and estimate potential mining results. Interactive mining should allow users
to dynamically change the focus of a search, to refine mining requests based on returned results, and to
drill, dice, and pivot through the data and knowledge space interactively, dynamically exploring “cube
space” while mining.
Incorporation of background knowledge
Background knowledge, constraints, rules, and other information regarding the domain under study
should be incorporated into the knowledge discovery process. Such knowledge can be used for pattern
evaluation as well as to guide the search toward interesting patterns.
Ad hoc data mining and data mining query languages: Query languages (e.g., SQL) have played an important role in flexible searching; data mining query languages aim to give users similar flexibility in specifying ad hoc mining tasks.
Presentation and visualization of data mining results
This requires the system to adopt expressive knowledge representations, user-friendly interfaces, and
visualization techniques.
8.4 Diversity of Data Types
The following are the broad areas that come under this topic:
Handling complex types of data
Diverse applications generate a wide spectrum of data types - temporal data, biological sequences,
sensor data, spatial data, hypertext data, multimedia data, software program code, web data, social
network data etc. Data can be structured/ unstructured; static or dynamic. Domain or application
specific data mining systems are being constructed for in-depth mining of specific kinds of data.
Mining dynamic, networked, and global data repositories
Multiple sources of data are connected by the Internet and various kinds of networks, forming gigantic,
distributed, and heterogeneous global information systems and networks. The discovery of knowledge
from different data sources with diverse data semantics poses great challenges to data mining.
9. Data Preprocessing
Data have quality if they satisfy the requirements of the intended use. There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability. Data preprocessing helps to achieve data quality. The major steps involved in data preprocessing are:
1. data cleaning
2. data integration
3. data reduction
4. data transformation
For example, in data transformation the values -2, 32, 100, 59, 48 may be normalized to -0.02, 0.32, 1.00, 0.59, 0.48.
9.1 Data Cleaning
9.1.1 Missing values
Missing values can be handled in the following ways:
Ignore the tuple: this is usually done when the tuple contains several attributes with missing values.
Fill in the missing value manually: In general, this approach is time-consuming and may not be
feasible given a large data set with many missing values.
Use a global constant to fill in the missing value: Replace all missing attribute values by the same
constant, such as a label like “Unknown". Although this method is simple, it is not recommended.
Use the attribute mean to fill in the missing value: For example, suppose that the average income of
All Electronics customers is $28,000. Use this value to replace the missing value for income.
Use the attribute mean of all samples belonging to the same class as the given tuple: For example, if
classifying customers according to credit risk, replace the missing value with the average income value
for customers in the same credit risk category as that of the given tuple.
Use the most probable value to fill in the missing value: This may be determined with inference-
based tools using a Bayesian formalism or decision tree induction. For example, using the other
customer attributes in your data set, you may construct a decision tree to predict the missing values for
income.
9.1.2 Noisy data
Noise is a random error or variance in a measured variable. The following techniques can be used to smooth noisy data:
i. Binning methods:
Binning methods attempt to smooth sorted data by consulting the “neighborhood" or values around it.
Because binning methods consult the neighborhood of values, they perform local smoothing. Figure
illustrates some binning techniques. Smoothing is achieved by replacing a set of neighborhood values by
mean, median, boundary values, mathematical function etc.
Example:-
Price data are first sorted and partitioned into equi-depth bins of depth 3. In smoothing by bin means, each
value in a bin is replaced by the mean value of the bin. The mean of the values 4, 8, and 15 in Bin 1 is 9.
Therefore, each original value in this bin is replaced by the value 9. Similarly, each bin value may be
replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a
given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary
value.
Input data: 4, 24, 25, 21, 34, 21, 15, 8, 28
Sorted input: 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into equi-depth bins of depth 3:
  Bin 1: 4, 8, 15    (mean 9,  min 4,  max 15)  -> by bin means: 9, 9, 9     -> by bin boundaries: 4, 4, 15
  Bin 2: 21, 21, 24  (mean 22, min 21, max 24)  -> by bin means: 22, 22, 22  -> by bin boundaries: 21, 21, 24
  Bin 3: 25, 28, 34  (mean 29, min 25, max 34)  -> by bin means: 29, 29, 29  -> by bin boundaries: 25, 25, 34
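The same equi-depth binning and smoothing steps can be sketched in Python; the reshape below works only because the number of values is a multiple of the bin depth.

```python
import numpy as np

# Sketch of equi-depth binning with smoothing by bin means and by bin boundaries,
# reproducing the worked example above (depth = 3).
data = np.sort(np.array([4, 24, 25, 21, 34, 21, 15, 8, 28]))
depth = 3
bins = data.reshape(-1, depth)          # works here because len(data) is a multiple of depth

by_means = np.repeat(bins.mean(axis=1), depth).reshape(-1, depth)
# Smoothing by boundaries: replace each value with the closer of the bin's min/max.
lo, hi = bins.min(axis=1, keepdims=True), bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(bins)       # [[ 4  8 15] [21 21 24] [25 28 34]]
print(by_means)   # [[ 9.  9.  9.] [22. 22. 22.] [29. 29. 29.]]
print(by_bounds)  # [[ 4  4 15] [21 21 24] [25 25 34]]
```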
ii. Clustering
Similar values are organized into groups or “clusters”. Values which fall outside of the set of clusters
may be considered outliers. The following figure shows outliers detected from cluster analysis. Outliers may
be identified through a combination of computer and human inspection.
iii. Regression
Data can be smoothed by fitting the data to a function, such as regression. Linear regression involves
finding the “best" line to fit two variables, so that one variable can be used to predict the other.
Multiple linear regression is an extension of linear regression, where more than two variables are involved
and the data are fit to a multidimensional space.
iv. Combined computer and human inspection: detect suspicious values and have them checked by a human (e.g., to deal with possible outliers).
9.2 Data Integration
Data may arrive from multiple sources in your analysis. This would involve integrating multiple
databases, data cubes, or files.
The data analysis task may involve data integration, which combines data from multiple sources into a
coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat
files.
9.2.1 Entity Identification Problem
Schema integration and object matching pose issues. These include how to match attribute names,
meaning, data type, size, and range of permitted values. For example, customer identification may be referred
to as customer id in one data store and cust id in another. A name could be registered as “Bill” in one
database, “William” in another, and “B.” in a third.
9.2.2 Redundancy and Correlation Analysis
Redundancy is another important issue in data integration. An attribute may be redundant if it can be
“derived” from another attribute or set of attributes.
E.g. annual revenue can be derived from monthly revenue. Inconsistencies in attribute or dimension
naming can also cause redundancies.
Some redundancies can be detected by correlation analysis. Given two attributes or variables, such
analysis can measure how strongly one attribute implies the other, based on the available data. For nominal
data, we use the chi-square test. For numeric attributes, we can use the correlation coefficient (ρ), which varies from -1 to +1. A positive value indicates that the two variables are positively correlated, i.e., they rise and fall together. If ρ is zero, there is no correlation. A negative ρ implies that a rise in one variable is accompanied by a fall in the other.
Simple correlation coefficient (for numeric attributes A and B):
ρ(A, B) = Σ (a_i - Ā)(b_i - B̄) / (n · σ_A · σ_B)
where n is the number of tuples, a_i and b_i are the values of A and B in tuple i, Ā and B̄ are the means of A and B, and σ_A and σ_B are their standard deviations.
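The coefficient can be computed directly, as in the sketch below; the monthly and annual revenue figures are made up purely to illustrate detecting a derivable (redundant) attribute.

```python
import numpy as np

# Sketch: Pearson correlation coefficient between two numeric attributes.
monthly = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
annual = 12 * monthly + np.random.normal(0, 1.0, size=monthly.shape)

rho = np.corrcoef(monthly, annual)[0, 1]
print(f"correlation coefficient rho = {rho:.3f}")   # close to +1 => strongly correlated

# A |rho| near 1 suggests one attribute is largely derivable from the other,
# so keeping both in the integrated data set would be redundant.
```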
9.2.3 Tuple Duplication
Duplication means there are two or more identical tuples in the dataset for a given case. Duplicates
appear due to inaccurate data entry or update, or use of de-normalized tables.
9.2.4 Data Value Conflict Detection and Resolution
Data integration also involves the detection and resolution of data value conflicts. For example, for the
same real-world entity, attribute values from different sources may differ. This may be due to differences in
representation, scaling, or encoding. For instance, a weight attribute may be stored in metric units in one
system and British imperial units in another. Attributes may also differ on the level of abstraction.
9.3 Data Reduction
The data in a data warehouse can be very large. Complex data analysis and mining on huge amounts of data can take a long time, making such analysis impractical or infeasible. Data reduction techniques can be applied to obtain a reduced representation that is much smaller in volume yet maintains the characteristics of the original data. Data reduction strategies include the following:
1. Data cube aggregation, where aggregation operations are applied to the data in the construction of
a data cube.
2. Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions
may be detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data
representations such as parametric models (which need store only the model parameters instead of the
actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.
5. Discretization and concept hierarchy generation, where raw data values for attributes are
replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction
that is very useful for the automatic generation of concept hierarchies.
Dimensionality reduction is the process of reducing the number of variables under consideration.
Dimensionality reduction methods include:-
wavelet transforms
principal components analysis
They transform or project the original data onto a smaller space. Attribute subset selection is a method of
dimensionality reduction in which irrelevant, weakly relevant or redundant attributes or dimensions are
detected and removed.
9.3.1 Wavelet transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to
a data vector D, transforms it to a numerically different vector, D’, of wavelet coefficients. The two vectors
are of the same length. A compressed approximation of the data can be retained by storing only those wavelet coefficients that are larger than some threshold and discarding the others.
The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique involving
sines and cosines. In general, however, the DWT achieves better lossy compression.
The length, L, of the input data vector must be an integer power of two. This condition can be met by
padding the data vector with zeros, as necessary.
Each transform involves applying two functions. The first applies some data smoothing, such as a sum
or weighted average. The second performs a weighted difference.
The two functions are applied to pairs of the input data, resulting in two sets of data of length L/2. In
general, these respectively represent a smoothed version of the input data, and the high-frequency
content of it.
The two functions are recursively applied to the sets of data obtained in the previous loop, until the
resulting data sets obtained are of desired length.
A selection of values from the data sets obtained in the above iterations are designated the wavelet
coefficients of the transformed data.
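The smoothing/difference idea can be illustrated with one level of a Haar-style transform. This is a simplified sketch (pairwise averages and differences), not a full DWT implementation.

```python
import numpy as np

# Minimal sketch of one level of a Haar-style discrete wavelet transform:
# pairwise averages give the smoothed half, pairwise differences the detail half.
# (Assumes the input length is a power of two; pad with zeros otherwise.)
def haar_step(x):
    x = np.asarray(x, dtype=float)
    pairs = x.reshape(-1, 2)
    smooth = pairs.mean(axis=1)                 # low-frequency / smoothed version
    detail = (pairs[:, 0] - pairs[:, 1]) / 2.0  # high-frequency content
    return smooth, detail

data = [2, 2, 0, 2, 3, 5, 4, 4]
smooth, detail = haar_step(data)
print(smooth)   # [2. 1. 4. 4.]
print(detail)   # [ 0. -1. -1.  0.]
# Recursing on `smooth` and keeping only the large coefficients yields a
# compressed, lossy representation of the original vector.
```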
9.3.2 Principal components analysis
Principal components analysis (PCA) searches for c orthogonal vectors that can best be used to represent the data, where c is much smaller than the number of original attributes, N. The original data are thus projected onto a much smaller space, resulting in data compression. PCA can be used as a form of dimensionality reduction: the initial data are projected onto this smaller set of vectors.
PCA can be applied to ordered and unordered attributes, and can handle sparse data and skewed data.
Multidimensional data can be handled by reducing the problem to fewer dimensions. Principal components
may be used as inputs to multiple regression and cluster analysis. In comparison with wavelet transforms,
PCA tends to be better at handling sparse data, whereas wavelet transforms are more suitable for data of
higher dimensionality.
The input data are normalized, so that each attribute falls within the same range. This step helps ensure
that attributes with large domains will not dominate attributes with smaller domains.
PCA computes ‘N’ orthonormal vectors which provide a basis for the normalized input data. These
are unit vectors that each point in a direction perpendicular to the others. These vectors are referred to
as the principal components. The input data are a linear combination of the principal components.
The principal components are sorted in order of decreasing “significance" or strength. The principal
components essentially serve as a new set of axes for the data, providing important information about
variance.
Since the components are sorted according to decreasing order of “significance", the size of the
data can be reduced by eliminating the weaker components, i.e., those with low variance. Using the
strongest principal components, it should be possible to reconstruct a good approximation of the
original data.
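A minimal NumPy sketch of these steps, using randomly generated data in place of a real data set:

```python
import numpy as np

# Sketch of PCA for dimensionality reduction:
# normalize, compute the covariance matrix, keep the strongest components.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                     # 100 tuples, N = 5 attributes (made-up data)

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)     # step 1: normalize each attribute
cov = np.cov(X_norm, rowvar=False)                # covariance between attributes
eigvals, eigvecs = np.linalg.eigh(cov)            # orthonormal principal components

order = np.argsort(eigvals)[::-1]                 # sort by decreasing "significance"
c = 2                                             # keep the c strongest components
components = eigvecs[:, order[:c]]

X_reduced = X_norm @ components                   # project onto the smaller space
print(X_reduced.shape)                            # (100, 2)
```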
9.3.3 Attribute Subset Selection
Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining
task or redundant. For example, if the task is to classify customers based on whether or not they are likely to
purchase a popular new CD, attributes such as the customer’s telephone number are likely to be irrelevant,
unlike attributes such as age or music-taste. Although it may be possible for a domain expert to pick out
some of the useful attributes, this can be a difficult and time consuming task, especially when the data’s
behavior is not well known.
The goal of attribute subset selection is to find a minimum set of attributes such that the resulting
probability distribution of the data classes is as close as possible to the original one using all attributes.
For n attributes, there are 2^n possible subsets. An exhaustive search for the optimal subset of attributes
can be prohibitively expensive, especially when n is large. Greedy algorithms are used in this context.
Their strategy is to make a locally optimal choice in the hope that this will lead to a globally optimal solution.
Stepwise forward selection:
The procedure starts with an empty set of attributes as the reduced set. The best of the original
attributes is determined and added to the reduced set. At each subsequent iteration or step, the best of
the remaining original attributes is added to the set.
Stepwise backward elimination:
The procedure starts with the full set of attributes. At each step, it removes the worst attribute
remaining in the set.
– Decision tree algorithms (e.g., ID3, C4.5, and CART) were originally intended for
classification. Decision tree induction constructs a flowchart-like structure where each internal
(nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test,
and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses
the “best” attribute to partition the data into individual classes.
– When decision tree induction is used for attribute subset selection, a tree is constructed from
the given data. All attributes that do not appear in the tree are assumed to be irrelevant. The set
of attributes appearing in the tree form the reduced subset of attributes.
The stopping criteria for the above methods may vary. The procedure may employ a threshold on the
measure used to determine when to stop the attribute selection process.
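A sketch of stepwise forward selection is shown below; the data set (iris) and the evaluation measure (a cross-validated decision tree score) are illustrative choices, not prescribed by the text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Greedy forward selection: repeatedly add the attribute that most improves
# the score, and stop when no remaining attribute helps (the stopping threshold).
X, y = load_iris(return_X_y=True)
remaining, selected = list(range(X.shape[1])), []
best_score = 0.0

while remaining:
    scores = [(cross_val_score(DecisionTreeClassifier(random_state=0),
                               X[:, selected + [a]], y, cv=5).mean(), a)
              for a in remaining]
    score, attr = max(scores)
    if score <= best_score:          # stopping criterion
        break
    best_score = score
    selected.append(attr)
    remaining.remove(attr)

print("selected attribute indices:", selected, "score:", round(best_score, 3))
```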
9.3.4 Regression and Log-Linear Models: Parametric Data Reduction
Numerosity reduction techniques replace the original data volume by alternative, smaller forms of
data representation. These techniques may be parametric or nonparametric.
For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data. Examples of parametric methods:
– Regression
– Log-linear models
Nonparametric methods store reduced representations of the data. Examples:
– Histograms
– Clustering
– Sampling
– Data cube aggregation
Regression and log-linear models can be used to approximate the given data.
In linear regression, the data are modeled to fit a straight line. For example, a random variable, Y (called a response variable), can be modeled as a linear function of another random variable, X (called a predictor variable), with the equation
Y = wX + b
where the variance of Y is assumed to be constant.
The coefficients w and b (called regression coefficients) specify the slope of the line and the y-intercept, respectively. These coefficients can be solved for by the method of least squares, which minimizes the error between the actual data and the estimate of the line.
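A short sketch of parametric reduction by linear regression: only w and b are kept, and approximate values are reconstructed from them on demand. The data values here are made up.

```python
import numpy as np

# Fit Y = wX + b by least squares, so only the two coefficients
# (rather than the raw data) need to be stored.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])

w, b = np.polyfit(X, Y, deg=1)          # least-squares line (slope, intercept)
print(f"Y ~ {w:.2f} * X + {b:.2f}")     # slope close to 2, intercept close to 0

Y_est = w * X + b                       # reconstruct approximate values on demand
```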
Log-linear models approximate discrete multidimensional probability distributions. The method can be
used to estimate the probability of each cell in a base cuboid for a set of discretized attributes, based on the
smaller cuboids making up the data cube lattice.
9.3.5 Histograms
A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. The
buckets are displayed on a horizontal axis, while the height (and area) of a bucket represents the frequency.
Equi-width: In an equi-width histogram, the width of each bucket range is constant (e.g., a width of $10 per bucket).
Equi-depth (or equi-height): In an equi-depth histogram, the buckets are created so that, roughly, the
frequency of each bucket is constant (that is, each bucket contains roughly the same number of
contiguous data samples).
V-Optimal: If we consider all of the possible histograms for a given number of buckets, the V-optimal
histogram is the one with the least variance. Histogram variance is a weighted sum of the original
values that each bucket represents, where bucket weight is equal to the number of values in the bucket.
MaxDiff: In a MaxDiff histogram, we consider the difference between each pair of adjacent values. A
bucket boundary is established between each pair for pairs having the largest differences, as specified
by the user
Histogram Example: The following data are a list of AllElectronics prices for commonly sold items
(rounded to the nearest dollar).
1, 8, 15, 5, 21, 20, 5, 20, 1, 5, 15, 21, 15, 20, 14, 18, 30, 14, 25, 12, 5, 10, 30, 20, 25, 15, 15, 18, 18, 10, 20,
21, 25, 18, 25, 28, 28, 20, 10, 5, 30, 18, 18, 10, 18, 25, 8, 21, 15, 20, 14, 18
The numbers are sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18,
18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
The corresponding histograms (three buckets each):
Equal-width histogram (bucket width 10): 1–10: 13 values; 11–20: 25 values; 21–30: 14 values.
Equal-depth histogram (buckets hold roughly equal numbers of values): 1–14: 17 values; 15–20: 21 values; 21–30: 14 values.
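The same buckets can be produced programmatically; in the sketch below the equal-depth boundaries are taken from quantiles, so the bucket edges differ slightly from the hand-built example.

```python
import numpy as np

# Equal-width and (approximate) equal-depth histograms for the price list above.
prices = np.array([1, 8, 15, 5, 21, 20, 5, 20, 1, 5, 15, 21, 15, 20, 14, 18, 30,
                   14, 25, 12, 5, 10, 30, 20, 25, 15, 15, 18, 18, 10, 20, 21, 25,
                   18, 25, 28, 28, 20, 10, 5, 30, 18, 18, 10, 18, 25, 8, 21, 15,
                   20, 14, 18])

# Equal-width: three buckets of width 10.
counts, edges = np.histogram(prices, bins=[0.5, 10.5, 20.5, 30.5])
print(counts)                      # [13 25 14]

# Equal-depth: bucket boundaries taken from quantiles, so each bucket holds
# roughly the same number of values.
edges_depth = np.quantile(prices, [0, 1/3, 2/3, 1])
print(edges_depth)
print(np.histogram(prices, bins=edges_depth)[0])
```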
9.3.6 Clustering
Clustering techniques consider data tuples as objects. They partition the objects into groups or
clusters, so that objects within a cluster are “similar" to one another and “dissimilar" to objects in
other clusters. Similarity is commonly defined in terms of how “close" the objects are in space, based on a
distance function. The quality of a cluster may be represented by its diameter, the maximum distance between
any two objects in the cluster. Centroid distance is an alternative measure of cluster quality, and is defined as
the average distance of each cluster object from the cluster centroid.
9.3.7 Sampling
Sampling can be used as a data reduction technique since it allows a large data set to be represented by a
much smaller random sample (or subset) of the data. Suppose that a large data set, D, contains N tuples. Let's
have a look at some possible samples for D.
1. Simple random sample without replacement of size n: This is created by drawing n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely.
2. Simple random sample with replacement of size n: This is similar to simple random sample, except that
each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is
placed back in D so that it may be drawn again.
3. Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters", then a simple random
sample of m clusters can be obtained, where m < M.
4. Stratified sample: If D is divided into mutually disjoint parts called "strata", a stratified sample of D is generated by obtaining a simple random sample from each stratum. This helps to ensure a representative sample, especially when the data are skewed. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group.
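The four schemes can be sketched on a synthetic data set of N = 1000 tuples; the age-group stratum labels below are made up for the illustration.

```python
import numpy as np

# Sketch of the four sampling schemes on a made-up data set D of N tuples.
rng = np.random.default_rng(42)
N, n = 1000, 50
D = np.arange(N)                                   # stand-in for N tuples
age_group = rng.integers(0, 4, size=N)             # hypothetical stratum label per tuple

srswor = rng.choice(D, size=n, replace=False)      # simple random sample w/o replacement
srswr = rng.choice(D, size=n, replace=True)        # simple random sample w/ replacement

clusters = np.array_split(D, 20)                   # M = 20 disjoint "clusters"
picked = rng.choice(len(clusters), size=4, replace=False)
cluster_sample = np.concatenate([clusters[i] for i in picked])

# Stratified sample: a simple random sample from each age-group stratum.
stratified = np.concatenate([
    rng.choice(D[age_group == g], size=n // 4, replace=False)
    for g in np.unique(age_group)
])
print(len(srswor), len(srswr), len(cluster_sample), len(stratified))
```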
9.3.8 Data Cube Aggregation
The lowest level of a data cube (the base cuboid) represents the aggregated data for an individual entity of interest.
[Figure: lattice of cuboids for the dimensions product, date, and country: the 0-D (apex) cuboid "all"; the 1-D cuboids product, date, and country; and the 2-D cuboids (product, date), (product, country), and (date, country).]
Multiple levels of aggregation in data cubes further reduce the size of the data to deal with. Reference the appropriate level and use the smallest representation that is sufficient to solve the task. Queries regarding aggregated information should be answered using the data cube when possible.
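As a rough illustration of cuboid-style aggregation, here is a pandas sketch over made-up sales records (not a real data cube engine):

```python
import pandas as pd

# Aggregating detailed sales records up the dimension lattice, as a data cube would.
sales = pd.DataFrame({
    "product": ["TV", "TV", "PC", "PC", "TV", "PC"],
    "country": ["US", "US", "US", "IN", "IN", "IN"],
    "year":    [2023, 2024, 2023, 2023, 2024, 2024],
    "amount":  [100, 120, 200, 150, 90, 210],
})

# 2-D cuboid (product, year): one aggregate cell per combination
cube_2d = sales.groupby(["product", "year"])["amount"].sum()
# 1-D cuboid (product) and the 0-D apex cuboid ("all")
cube_1d = sales.groupby("product")["amount"].sum()
apex = sales["amount"].sum()
print(cube_2d, cube_1d, apex, sep="\n")
```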
9.4 Data Transformation and Data Discretization
In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Strategies for data transformation include the following:
1. Smoothing - this works to remove noise from the data. Techniques include binning, regression, and
clustering.
2. Attribute construction (or feature construction), where new attributes are constructed and added from the
given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales
data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in
constructing a data cube for data analysis at multiple abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as -1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g.,
0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively
organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute.
6. Concept hierarchy generation for nominal data, where attributes such as street can be generalized to
higher-level concepts, like city or country. Many hierarchies for nominal attributes are implicit within the
database schema and can be automatically defined at the schema definition level.
9.4.1 Normalization
Here the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0 to
1.0. There are three main methods for data normalization:-
– min-max normalization
– z- score normalization
– decimal scaling.
Min-max normalization performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A] by computing
v' = ((v - min_A) / (max_A - min_A)) × (new_max_A - new_min_A) + new_min_A
e.g., (8, 10, 15, 20) transforms to (0, 0.17, 0.58, 1), where new_min_A = 0 and new_max_A = 1.
In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean and standard deviation of A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization. A value v of A is normalized to v' by computing
v' = (v - Ā) / σ_A
where Ā and σ_A are the mean and standard deviation of A.
Example: let the mean μ = 54,000 and the standard deviation σ = 16,000 for the attribute income. A value of 73,600 maps to (73,600 - 54,000) / 16,000 = 1.225.
Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal points moved depends on the maximum absolute value of A. A value v of A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.
Example: suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1000 (i.e., j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
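The three normalization methods applied to the small examples above, as a NumPy sketch:

```python
import numpy as np

# Min-max normalization to [0, 1]
v = np.array([8.0, 10.0, 15.0, 20.0])
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min
print(minmax.round(2))            # [0.   0.17 0.58 1.  ]

# z-score normalization (using the stated mean and standard deviation)
print((73600 - 54000) / 16000)    # 1.225

# Decimal scaling: divide by 10^j so the largest absolute value is below 1
a = np.array([-986.0, 917.0])
j = int(np.ceil(np.log10(np.abs(a).max())))
print(a / 10 ** j)                # [-0.986  0.917]
```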
9.4.2 Discretization
Discretization by Binning
Binning is a top-down splitting technique based on a specified number of bins. This technique is
used in ‘Data Cleaning’. These methods are also used as discretization methods for data reduction and
concept hierarchy generation. Binning does not use class information and is therefore an unsupervised
discretization technique. It is sensitive to the user-specified number of bins, as well as the presence of
outliers. Binning techniques were discussed in Section 9.1.2, Noisy data, under binning methods.
Discretization by Histogram Analysis
Like binning, histogram analysis is an unsupervised discretization technique because it does not use class
information. Histograms were introduced in Section 9.3.5, Histograms.
Discretization by Cluster Analysis
Cluster analysis is a popular data discretization method. A clustering algorithm can be applied to
discretize a numeric attribute, A, by partitioning the values of A into clusters or groups. Clustering
takes the distribution of A into consideration, as well as the closeness of data points, and therefore is able to
produce high-quality discretization results. Clustering can be used to generate a concept hierarchy for A by
following either a top-down splitting strategy or a bottom-up merging strategy, where each cluster forms a
node of the concept hierarchy. In the former, each initial cluster or partition may be further decomposed into
several sub clusters, forming a lower level of the hierarchy. In the latter, clusters are formed by repeatedly
grouping neighboring clusters in order to form higher-level concepts.
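A sketch comparing equal-width, equal-depth, and clustering-based discretization of a made-up age attribute, using scikit-learn's KBinsDiscretizer (one possible tool; the text does not prescribe it):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Discretize a numeric attribute (e.g., age) three ways and compare the bin labels.
age = np.array([13, 15, 16, 19, 20, 21, 25, 30, 33, 35, 40, 45, 52, 70]).reshape(-1, 1)

for strategy in ("uniform", "quantile", "kmeans"):   # equal-width, equal-depth, clustering
    disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    labels = disc.fit_transform(age).ravel().astype(int)
    print(strategy, labels)
```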